How to Read Excel with Pandas Step by Step
pd.read_excel() has over a dozen parameters that interact in non-obvious ways. Skip engine and you hit ImportError; ignore dtype and numeric IDs silently become floats; miss skiprows on a report with a title block and your column headers land in the wrong row. This walkthrough moves through each decision point in sequence so the first call you write is correct.
For a full reference that covers openpyxl direct access, calamine, merged cells, and BytesIO loading, see Reading Excel Files with Python.
Prerequisites
# pip install pandas openpyxl
pip install pandas openpyxl
Verify the install:
# pip install pandas openpyxl
import pandas as pd
import openpyxl
print("pandas", pd.__version__)
print("openpyxl", openpyxl.__version__)
If you see ModuleNotFoundError: No module named 'openpyxl' after the pip install, your terminal is pointing at a different Python than the one running your scripts. Activate the correct virtualenv and reinstall.
Create a test workbook to follow along with each step:
# pip install openpyxl
from pathlib import Path
import openpyxl
TEST_PATH = Path("sample.xlsx")
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Sales"
ws.append(["order_id", "customer", "amount", "order_date"])
ws.append([1001, "Alice", 199.99, "2026-01-15"])
ws.append([1002, "Bob", 54.50, "2026-01-16"])
ws.append([1003, "Carol", 320.00, "2026-01-17"])
# Second sheet with different data
ws2 = wb.create_sheet("Returns")
ws2.append(["order_id", "reason", "refund"])
ws2.append([1001, "Wrong size", 199.99])
wb.save(TEST_PATH)
print("Created:", TEST_PATH.resolve())
Step 1: Discover Sheet Names
Before reading data, confirm what sheets exist. A KeyError caused by a sheet name that has an invisible trailing space is a surprisingly common failure in production:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("sample.xlsx")
try:
xl = pd.ExcelFile(EXCEL_PATH, engine="openpyxl")
print("Sheets:", xl.sheet_names) # ['Sales', 'Returns']
except FileNotFoundError:
raise SystemExit(f"File not found: {EXCEL_PATH}")
except Exception as e:
raise SystemExit(f"Cannot open: {e}")
pd.ExcelFile opens the workbook container without parsing any sheet data. It is cheap even on large files and gives you the exact sheet names as Python strings, including any whitespace that would cause a KeyError if you typed the name manually.
Step 2: Load a Single Sheet
The minimum correct call for a .xlsx file:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("sample.xlsx")
try:
df = pd.read_excel(
EXCEL_PATH,
sheet_name="Sales", # target sheet by exact name
engine="openpyxl", # required for .xlsx — never omit
)
print(df.shape) # (3, 4)
print(df.head())
except FileNotFoundError:
raise SystemExit(f"File not found: {EXCEL_PATH}")
except ImportError as e:
raise SystemExit(f"Engine missing — pip install openpyxl: {e}")
except KeyError as e:
raise SystemExit(f"Sheet not found: {e}")
The engine="openpyxl" argument is not optional in production code. Without it, pandas infers the backend from the file extension; on some system configurations this picks xlrd, which raises XLRDError for .xlsx files. See Fix xlrd Error Reading .xlsx Files if you hit that error.
Step 3: Handle Reports with Non-Zero Header Rows
Reports from accounting software, ERP systems, and generated exports often have a company name, print date, filter summary, or logo block above the actual data table. Use skiprows to remove those rows before pandas identifies the header:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("monthly_report.xlsx")
try:
# Rows 0–2 are a title block; row 3 is the real column header
df = pd.read_excel(
EXCEL_PATH,
skiprows=3, # discard rows 0, 1, 2
header=0, # the next row (original row 3) is now row 0 and becomes the header
skipfooter=2, # ignore the last 2 rows (totals lines)
engine="openpyxl",
)
print(df.columns.tolist())
except Exception as e:
raise SystemExit(f"Read error: {e}")
To skip specific non-contiguous rows — e.g., row 0 contains a logo, rows 3 and 4 are blank sub-headers — pass a list: skiprows=[0, 3, 4]. Use nrows=10 in a preview call first to count the exact rows to skip.
When header=None is passed instead, pandas treats every row as data and assigns integer column indices (0, 1, 2, …). This is useful when you want to handle the header row manually — for example, to forward-fill merged cell labels.
Step 4: Select Only the Columns You Need
Wide workbooks exported from ERP or BI systems often contain 40–80 columns. Loading all of them when you need 5 wastes memory and slows type inference. usecols restricts what is read from disk:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("large_export.xlsx")
try:
# Option A: column names — most readable and maintainable
df = pd.read_excel(
EXCEL_PATH,
usecols=["order_id", "customer", "amount", "order_date"],
engine="openpyxl",
)
# Option B: Excel letter range — useful when column names are unknown
df_range = pd.read_excel(EXCEL_PATH, usecols="A:D", engine="openpyxl")
# Option C: callable — keep columns matching a pattern
df_q = pd.read_excel(
EXCEL_PATH,
usecols=lambda c: str(c).startswith("Revenue"),
engine="openpyxl",
)
print(f"Loaded {df.shape[1]} columns, {df.shape[0]} rows")
print("Memory:", df.memory_usage(deep=True).sum() // 1024, "KB")
except Exception as e:
raise SystemExit(f"Read error: {e}")
When usecols is a list of names, those names must match the header row exactly — case-sensitive, including any whitespace in the original file. Strip column names after load if you suspect padding: df.columns = df.columns.str.strip().
Step 5: Fix Type Inference with dtype and parse_dates
Pandas type inference makes two predictable mistakes on Excel data. First, columns that look numeric to the sampler (but contain IDs like 001234) get cast to float64, adding .0 and dropping leading zeros. Second, date columns stored as strings or formatted text cells stay as object dtype rather than datetime64.
Fix both issues at the read_excel call:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("sample.xlsx")
try:
df = pd.read_excel(
EXCEL_PATH,
sheet_name="Sales",
dtype={
"order_id": str, # keep as "1001", not 1001.0
"customer": str,
"amount": float,
},
parse_dates=["order_date"], # coerce to datetime64[ns]
engine="openpyxl",
)
print(df.dtypes)
# order_id object
# customer object
# amount float64
# order_date datetime64[ns]
except Exception as e:
raise SystemExit(f"Read error: {e}")
If parse_dates fails silently and the column stays object, the cells contain text strings rather than Excel date serials. Use this fallback:
# pip install pandas
import pandas as pd
# After loading df above
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
# errors="coerce" inserts NaT for values that cannot be parsed
print(df["order_date"].isna().sum(), "unparseable dates")
The errors="coerce" argument makes failures visible as NaT rather than silently keeping the original strings.
Step 6: Load All Sheets at Once
When all sheets share the same column structure and you want to concatenate them, pass sheet_name=None. This returns an OrderedDict of {sheet_name: DataFrame} without requiring you to enumerate names in advance:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("annual_report.xlsx")
try:
all_sheets: dict[str, pd.DataFrame] = pd.read_excel(
EXCEL_PATH,
sheet_name=None, # load every sheet
dtype={"order_id": str},
engine="openpyxl",
)
frames = []
for name, df in all_sheets.items():
df = df.dropna(how="all") # drop fully blank rows
df["source_sheet"] = name # track which sheet each row came from
frames.append(df)
if not frames:
raise ValueError("No non-empty sheets found")
combined = pd.concat(frames, ignore_index=True)
print(f"Combined: {combined.shape[0]} rows across {len(frames)} sheets")
except Exception as e:
raise SystemExit(f"Error: {e}")
This sheet_name=None pattern is the standard entry point for consolidation workflows. See Merging Multiple Spreadsheets for how to extend this to workbooks from different sources.
Step 7: Validate the Result
Validate immediately after loading, before any downstream transformation or reporting. A malformed source file should raise an explicit error rather than produce a silent wrong result in a downstream report:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("sample.xlsx")
REQUIRED_COLS = {"order_id", "customer", "amount", "order_date"}
try:
df = pd.read_excel(
EXCEL_PATH,
sheet_name="Sales",
dtype={"order_id": str},
parse_dates=["order_date"],
engine="openpyxl",
)
missing_cols = REQUIRED_COLS - set(df.columns)
assert not missing_cols, f"Missing columns: {missing_cols}"
assert df.shape[0] > 0, "DataFrame is empty"
assert not df["order_id"].isna().any(), "Null order_ids found"
assert pd.api.types.is_datetime64_any_dtype(df["order_date"]), "order_date not datetime"
print(f"Validation passed — {df.shape[0]} rows")
print(df.dtypes)
except AssertionError as e:
raise SystemExit(f"Validation failed: {e}")
except Exception as e:
raise SystemExit(f"Load error: {e}")
These four assertions catch the four most common silent failures: missing columns (schema drift), empty sheet (wrong tab or empty export), null IDs (join will fail), and unparsed dates (time-series operations will break).
Troubleshooting Common Errors
| Error | Root cause | Fix |
|---|---|---|
ImportError: Missing optional dependency 'openpyxl' | openpyxl not installed | pip install openpyxl |
XLRDError: Excel xlsx file; not supported | xlrd ≥ 2.0 selected for .xlsx | Add engine="openpyxl" — see Fix xlrd Error Reading .xlsx Files |
KeyError: 'SheetName' | Sheet name has trailing space or wrong case | Print pd.ExcelFile(path).sheet_names to confirm exact string |
Columns named Unnamed: 0, Unnamed: 1 | Header row is not row 0 | Add skiprows=N after previewing with nrows=5 |
Numeric IDs shown as 1001.0 | Pandas inferred float64 | Add dtype={"order_id": str} |
EmptyDataError | skiprows value exceeded total row count | Preview with nrows=10 first to count intro rows |
Date column stays object dtype | Cells contain strings, not Excel date serials | Call pd.to_datetime(df["col"], errors="coerce") post-load |
Frequently Asked Questions
How do I read only the first N rows?
Pass nrows=100 to pd.read_excel(). Useful for previewing large files before committing to a full load.
How do I read multiple specific sheets by name?
Pass a list: sheet_name=["Q1", "Q2", "Q3"]. Pandas returns a dict of DataFrames keyed by sheet name.
Can I skip the first few rows of a report header?
Yes. skiprows=3 skips rows 0, 1, and 2; the next row becomes the header. For non-contiguous rows, pass a list: skiprows=[0, 2, 5].
Why is my date column still object dtype after parse_dates?
The cell values are text strings rather than Excel date serials (integers). Call pd.to_datetime(df["col"], errors="coerce") after loading to force the conversion.
What is the difference between header=None and header=0?header=0 (the default) uses row 0 as column names. header=None treats every row as data and assigns integer column indices. Use header=None when you need to manipulate the header row yourself before assigning it — for example, to forward-fill merged cell labels.
How do I load a .xls file instead of .xlsx?
Install xlrd < 2.0 and pass engine="xlrd". The openpyxl engine cannot open legacy .xls binary files.
Related
- Reading Excel Files with Python — full reference: engines, openpyxl direct access, calamine, merged cells, BytesIO
- Fix xlrd Error Reading .xlsx Files — resolve
XLRDErrorandImportErrorfor xlsx files - Automating Excel Report Generation — next step after loading: generate formatted output workbooks
- Merging Multiple Spreadsheets — combine DataFrames from multiple workbooks
Part of Reading Excel Files with Python.