Reading Excel Files with Python
Excel workbooks arrive in two distinct binary formats — .xlsx (Office Open XML, used since Excel 2007) and .xls (legacy BIFF binary format from Excel 97–2003). Each format requires a different Python parsing engine. Code that skips engine selection either picks the wrong backend and raises XLRDError, or uses the correct backend but misses key parameters and corrupts types or column alignment.
This guide walks through every decision point in order: which engine to install, how to inspect a workbook before loading it, how to target specific sheets and columns, how to coerce types at load time, how to use openpyxl directly when pandas is not enough, and how to handle the real-world edge cases that appear once your scripts move from test files to production data.
For downstream work, see Automating Excel Report Generation once data is loaded, or Merging Multiple Spreadsheets when combining data from several workbooks.
Prerequisites
Install the required libraries before running any snippet in this guide:
# pip install pandas openpyxl
pip install pandas openpyxl
# for legacy .xls files only:
pip install "xlrd>=1.2,<2.0"
# for the fast calamine engine (optional):
pip install python-calamine
Create a minimal test workbook to validate your setup before pointing scripts at real data:
# pip install openpyxl
from pathlib import Path
import openpyxl
TEST_PATH = Path("test_workbook.xlsx")
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Sales"
ws.append(["order_id", "customer", "amount", "order_date"])
ws.append([1001, "Alice", 199.99, "2026-01-15"])
ws.append([1002, "Bob", 54.50, "2026-01-16"])
ws.append([1003, "Carol", 320.00, "2026-01-17"])
wb.save(TEST_PATH)
print("Test file written:", TEST_PATH.resolve())
Engine Selection
pandas.read_excel delegates file I/O to a backend engine. The choice is determined by file format, not personal preference. Passing the wrong engine raises an immediate error; omitting the engine parameter lets pandas guess, which can pick incorrectly when multiple engines are installed.
| Engine | Formats | Install | When NOT to use |
|---|---|---|---|
openpyxl | .xlsx, .xlsm | pip install openpyxl | Very large files where read speed is critical |
calamine | .xlsx, .xls, .xlsb, .ods | pip install python-calamine | When you need formula strings, cell styles, or comments |
xlrd | .xls only | pip install "xlrd<2.0" | Any .xlsx or .xlsb file — raises XLRDError |
Hitting
XLRDError: Excel xlsx file; not supported? See Fix xlrd Error Reading .xlsx Files for the exact fix.
Step 1: Inspect Before Loading
Check sheet names and preview a few rows before committing to a full workbook load. On a 100 MB file this saves significant time:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("data/sales_q3.xlsx")
try:
xl = pd.ExcelFile(EXCEL_PATH, engine="openpyxl")
print("Sheets:", xl.sheet_names)
# Preview first 5 rows without loading the whole file
preview = xl.parse(xl.sheet_names[0], nrows=5)
print(preview)
print("Columns:", preview.columns.tolist())
except FileNotFoundError:
raise SystemExit(f"File not found: {EXCEL_PATH}")
except Exception as e:
raise SystemExit(f"Cannot open workbook: {e}")
pd.ExcelFile opens the workbook container without parsing any sheet data. xl.parse() reads only the rows you request. Use this diagnostic step to determine correct values for skiprows, header, and usecols before writing the final load call.
Step 2: Basic read_excel Call
The minimum correct call for a modern .xlsx file:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("data/report.xlsx")
try:
df = pd.read_excel(
EXCEL_PATH,
engine="openpyxl", # always pass explicitly
)
print(df.shape)
print(df.dtypes)
except FileNotFoundError:
raise SystemExit(f"File missing: {EXCEL_PATH}")
except ImportError as e:
raise SystemExit(f"Engine not installed — pip install openpyxl: {e}")
The engine argument is always required in production code. Without it, pandas infers the backend from the file extension, and on systems where xlrd is installed alongside openpyxl, the inference can pick xlrd and fail.
Step 3: sheet_name — Targeting Sheets
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("data/workbook.xlsx")
try:
# By name
df_q3 = pd.read_excel(EXCEL_PATH, sheet_name="Q3", engine="openpyxl")
# By zero-based index (first sheet)
df_first = pd.read_excel(EXCEL_PATH, sheet_name=0, engine="openpyxl")
# All sheets → OrderedDict of {name: DataFrame}
all_sheets = pd.read_excel(EXCEL_PATH, sheet_name=None, engine="openpyxl")
for name, df in all_sheets.items():
print(f"{name}: {df.shape}")
# Specific subset of sheets
subset = pd.read_excel(
EXCEL_PATH, sheet_name=["Q1", "Q2", "Q3"], engine="openpyxl"
)
except KeyError as e:
raise SystemExit(f"Sheet not found: {e}")
except Exception as e:
raise SystemExit(f"Load error: {e}")
The sheet_name=None pattern is the correct entry point for the consolidation workflow in Merging Multiple Spreadsheets — it returns all sheets without requiring you to enumerate names in advance.
Step 4: header and skiprows
Business reports rarely have data starting at row 1. A title block, company name, print date, and blank separator row are common before the real header:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("data/monthly_report.xlsx")
try:
df = pd.read_excel(
EXCEL_PATH,
skiprows=3, # skip rows 0, 1, 2 (title block)
header=0, # row 3 (now row 0 after skip) becomes the header
skipfooter=2, # ignore last 2 rows (grand-total lines)
engine="openpyxl",
)
print(df.columns.tolist())
except Exception as e:
raise SystemExit(f"Read error: {e}")
For non-contiguous rows to skip — e.g., row 0 is a logo, rows 1–2 are blank, row 3 is a subtitle — pass a list: skiprows=[0, 1, 2, 3]. The row immediately after the last skipped row becomes the header if header=0.
Step 5: usecols — Limiting Column Scope
Loading only the columns you need can reduce memory use by 70–90% on wide exports:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("data/large_export.xlsx")
try:
# By name — most explicit and readable
df = pd.read_excel(
EXCEL_PATH,
usecols=["order_id", "customer", "amount", "order_date"],
engine="openpyxl",
)
# By Excel column letter range — useful when column names are unknown
df_range = pd.read_excel(EXCEL_PATH, usecols="A:D,F", engine="openpyxl")
# By callable — keep columns whose name starts with "Revenue"
df_rev = pd.read_excel(
EXCEL_PATH,
usecols=lambda c: str(c).startswith("Revenue"),
engine="openpyxl",
)
print("Memory:", df.memory_usage(deep=True).sum() // 1024, "KB")
except Exception as e:
raise SystemExit(f"Read error: {e}")
When usecols is a list of names, those names must match the header row exactly — including case and leading/trailing whitespace. Strip column names after load with df.columns = df.columns.str.strip() if the source file has invisible padding.
Step 6: dtype and parse_dates
Type inference is lossy for two categories of data: identifiers (IDs become floats) and dates (stored as integers or strings, not converted to datetime). Fix both at load time:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("data/orders.xlsx")
try:
df = pd.read_excel(
EXCEL_PATH,
dtype={
"order_id": str, # keeps "1001" not 1001.0
"customer_id": str, # preserves leading zeros on codes like "00123"
"amount": float,
"qty": int,
},
parse_dates=["order_date", "ship_date"],
engine="openpyxl",
)
print(df.dtypes)
assert pd.api.types.is_datetime64_any_dtype(df["order_date"]), "Date parse failed"
except AssertionError as e:
print(f"Type check: {e}")
# Fallback: coerce after load
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
except Exception as e:
raise SystemExit(f"Read error: {e}")
If parse_dates silently fails (the column stays object), the Excel cells contain text strings rather than date serials. The fallback pd.to_datetime(..., errors="coerce") converts what it can and inserts NaT for values that cannot be parsed, making failures visible rather than silent.
Step 7: Loading Multiple Sheets into One DataFrame
Stack identically structured sheets into a single DataFrame before analysis or reporting:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("data/annual_sales.xlsx")
try:
all_sheets = pd.read_excel(EXCEL_PATH, sheet_name=None, engine="openpyxl")
frames = []
for name, df in all_sheets.items():
df = df.dropna(how="all") # drop rows that are entirely blank
df["quarter"] = name # tag source sheet
frames.append(df)
if not frames:
raise ValueError("No sheets with data found")
combined = pd.concat(frames, ignore_index=True)
print(f"Combined: {combined.shape[0]} rows across {len(frames)} sheets")
except Exception as e:
raise SystemExit(f"Concat error: {e}")
Step 8: Reading with openpyxl Directly
Use openpyxl directly — instead of via pandas — when you need data that pandas discards: cell background colors, font styles, comment text, formula strings (not results), or merged cell geometry.
# pip install openpyxl
from pathlib import Path
import openpyxl
EXCEL_PATH = Path("data/styled_report.xlsx")
try:
# data_only=True returns cached formula results, not formula strings
# read_only=True streams rows without loading full workbook into RAM
wb = openpyxl.load_workbook(EXCEL_PATH, read_only=True, data_only=True)
ws = wb["Summary"]
rows = []
for row in ws.iter_rows(min_row=2, values_only=True):
rows.append(row)
wb.close()
print(f"Read {len(rows)} data rows via openpyxl")
except FileNotFoundError:
raise SystemExit(f"File not found: {EXCEL_PATH}")
except KeyError as e:
raise SystemExit(f"Sheet not found: {e}")
read_only=True streams rows without loading the full workbook into RAM — essential for files over 50 MB. Note that read_only=True and data_only=True cannot both access cell styles; for styling metadata, open without read_only.
Edge Cases and Variants
xlsb Binary Format
.xlsb files are not supported by openpyxl. The only engine that handles them is calamine:
# pip install pandas python-calamine
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("data/large_model.xlsb")
try:
df = pd.read_excel(EXCEL_PATH, engine="calamine")
print(df.shape)
except ImportError:
raise SystemExit("Run: pip install python-calamine")
except Exception as e:
raise SystemExit(f"Read error: {e}")
Multi-level Column Headers
Reports exported from pivot tables often have two header rows forming a MultiIndex:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("data/pivot_export.xlsx")
try:
df = pd.read_excel(
EXCEL_PATH,
header=[0, 1], # rows 0 and 1 both contribute to the column index
engine="openpyxl",
)
print(df.columns) # MultiIndex tuples like ('Revenue', 'Q1')
# Flatten to single-level
df.columns = ["_".join(str(c) for c in col).strip() for col in df.columns]
print(df.columns.tolist())
except Exception as e:
raise SystemExit(f"Read error: {e}")
Merged Cell Headers
Merged cells export a value only in the top-left cell; adjacent cells in the merge read as NaN. Forward-fill restores the intended labels:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("data/merged_headers.xlsx")
try:
df = pd.read_excel(EXCEL_PATH, header=None, engine="openpyxl")
df.iloc[0] = df.iloc[0].ffill() # propagate merged labels rightward
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)
print(df.columns.tolist())
except Exception as e:
raise SystemExit(f"Read error: {e}")
Reading from a BytesIO Buffer
When Excel files arrive over HTTP, from an email attachment, or from S3, avoid writing to disk first:
# pip install pandas openpyxl requests
import io
from pathlib import Path
import pandas as pd
import requests
URL = "https://example.com/data/report.xlsx"
try:
response = requests.get(URL, timeout=30)
response.raise_for_status()
buf = io.BytesIO(response.content)
df = pd.read_excel(buf, engine="openpyxl")
print(df.shape)
except requests.RequestException as e:
raise SystemExit(f"Download failed: {e}")
except Exception as e:
raise SystemExit(f"Parse error: {e}")
The same pattern works with boto3 for S3: fetch the object body as bytes, wrap in BytesIO, pass to read_excel.
Validation
Validate immediately after load, before any downstream processing:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("data/orders.xlsx")
REQUIRED_COLS = {"order_id", "customer_id", "amount", "order_date"}
try:
df = pd.read_excel(
EXCEL_PATH,
dtype={"order_id": str, "customer_id": str},
parse_dates=["order_date"],
engine="openpyxl",
)
missing = REQUIRED_COLS - set(df.columns)
assert not missing, f"Missing columns: {missing}"
assert df.shape[0] > 0, "DataFrame is empty"
assert df["amount"].dtype in (float, "float64"), "amount not numeric"
assert pd.api.types.is_datetime64_any_dtype(df["order_date"]), "order_date not parsed"
assert not df["order_id"].isna().any(), "Null order_ids present"
print(f"Validation passed — {df.shape[0]} rows")
except AssertionError as e:
raise SystemExit(f"Validation failed: {e}")
except Exception as e:
raise SystemExit(f"Load error: {e}")
Performance and Scale
| File size | Recommended approach |
|---|---|
| < 10 MB | pd.read_excel(..., engine="openpyxl") |
| 10–100 MB | Add usecols and explicit dtype to cut allocations |
| > 100 MB | openpyxl with read_only=True streaming, or engine="calamine" |
Any .xlsb | engine="calamine" only — openpyxl cannot open this format |
read_excel does not support chunked reading. For very large workbooks, split by sheet (sheet_name=None) and process each sheet separately, or convert to CSV/Parquet first if you control the source.
openpyxl with read_only=True is the memory-efficient option for .xlsx files above 100 MB. It streams rows as an iterator and does not hold the full parsed workbook in RAM.
Troubleshooting
| Error | Root cause | Fix |
|---|---|---|
XLRDError: Excel xlsx file; not supported | xlrd ≥ 2.0 used for .xlsx | Pass engine="openpyxl" — see Fix xlrd Error Reading .xlsx Files |
ImportError: Missing optional dependency 'openpyxl' | Engine not installed | pip install openpyxl |
KeyError: 'SheetName' | Sheet name has spaces or wrong case | Print pd.ExcelFile(path).sheet_names to confirm exact name |
ValueError: File is not a zip file | File is corrupted or mislabeled as .xlsx | Check first 4 bytes to confirm format; pass correct engine |
Columns named Unnamed: 0, Unnamed: 1 | Header row is not row 0 | Add skiprows=N to skip intro rows; preview with nrows=5 first |
Numeric IDs display as 1001.0 | Pandas inferred float64 from int-looking cells | Add dtype={"order_id": str} |
EmptyDataError | skiprows removed all rows including the header | Use nrows=10 to diagnose row layout before setting skiprows |
Complete Working Script
# pip install pandas openpyxl
"""read_excel_all_sheets.py — load, validate, and export every sheet to CSV."""
import argparse
from pathlib import Path
import pandas as pd
def load_workbook(path: Path, engine: str = "openpyxl") -> dict[str, pd.DataFrame]:
try:
return pd.read_excel(path, sheet_name=None, engine=engine)
except FileNotFoundError:
raise SystemExit(f"File not found: {path}")
except ImportError as e:
raise SystemExit(f"Engine not installed — pip install {engine}: {e}")
except Exception as e:
raise SystemExit(f"Cannot open workbook: {e}")
def clean_sheet(df: pd.DataFrame, name: str) -> pd.DataFrame:
df = df.dropna(how="all").reset_index(drop=True)
df.columns = df.columns.astype(str).str.strip()
if df.empty:
print(f" WARN: sheet '{name}' has no data after cleaning")
return df
def main() -> None:
parser = argparse.ArgumentParser(description="Export all Excel sheets to CSV")
parser.add_argument("file", type=Path, help="Path to .xlsx file")
parser.add_argument(
"--engine",
default="openpyxl",
choices=["openpyxl", "calamine", "xlrd"],
help="Parsing engine",
)
parser.add_argument(
"--out-dir",
type=Path,
default=Path("."),
help="Output directory for CSV files",
)
args = parser.parse_args()
args.out_dir.mkdir(parents=True, exist_ok=True)
sheets = load_workbook(args.file, engine=args.engine)
for name, df in sheets.items():
df = clean_sheet(df, name)
out = args.out_dir / f"{name}.csv"
df.to_csv(out, index=False)
print(f" {name}: {df.shape[0]} rows → {out}")
if __name__ == "__main__":
main()
Run it as:
python read_excel_all_sheets.py data/annual_report.xlsx --out-dir output/
FAQ
Can Python read password-protected Excel files?
Not directly. Use msoffcrypto-tool to decrypt the file into a BytesIO buffer, then pass the buffer to pd.read_excel. The decryption step requires the password as a string.
Why does read_excel return NaN for empty cells instead of blank strings?
Pandas uses NaN as its universal missing value indicator. Pass keep_default_na=False to suppress this, or call df.fillna("") after loading if downstream code needs empty strings.
Is openpyxl or calamine faster for large read-only workbooks?calamine is typically 2–5x faster for pure data reads because it skips formula parsing and cell style processing. Use openpyxl when you need formatting metadata, formula strings, or cell comments.
How do I read an Excel file from a URL or S3?
Download the file bytes into a BytesIO buffer and pass the buffer to pd.read_excel. See the "Reading from a BytesIO Buffer" section above for an HTTP example. For S3, use boto3.client("s3").get_object(...)["Body"].read() to get the bytes.
What is the difference between skiprows and header?skiprows removes rows from the file before pandas processes it. header tells pandas which remaining row (by index) contains column names. They work together: skiprows=3, header=0 means "discard the first 3 rows, then treat the very next row as the header."
How do I read only specific rows from a large workbook?pd.read_excel does not support row-range reads like CSV chunking. To read a window of rows, use openpyxl with min_row/max_row in ws.iter_rows(), or load the full sheet and slice the DataFrame afterward.
Related
- Fix xlrd Error Reading .xlsx Files — exact fix for
XLRDErrorandImportErroron xlsx - How to Read Excel with Pandas, Step by Step — beginner walkthrough of each
read_excelparameter - Automating Excel Report Generation — build on loaded data to produce formatted output workbooks
- Merging Multiple Spreadsheets — combine DataFrames from multiple workbooks
- Cleaning Messy CSV Data with pandas — cleaning patterns that apply equally to Excel-sourced DataFrames