Fix xlrd Error Reading .xlsx Files
Running pd.read_excel("file.xlsx") raises one of two errors depending on your environment:
XLRDError: Excel xlsx file; not supported
or:
ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for xls support.
Use pip or conda to install xlrd.
Both mean the same thing: pandas attempted to open a .xlsx file with xlrd, and xlrd cannot do it. The fix is to install openpyxl and pass engine="openpyxl".
Root Cause
xlrd dropped .xlsx support in version 2.0 (released December 2020). Before 2.0, xlrd handled both .xls (legacy binary) and .xlsx (Open XML). From 2.0 onward, it handles only .xls.
When you call pd.read_excel("file.xlsx") without an explicit engine argument, pandas picks a backend. On systems where xlrd is installed and openpyxl is not (or on older pandas versions), the inference picks xlrd. The result is one of the two errors above.
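The version boundary described above can be checked from Python. This is a minimal sketch, not part of pandas or xlrd: the helper names xlrd_reads_xlsx and describe_xlrd are hypothetical, and it assumes a plain numeric major version such as 2.0.1.

```python
# Sketch: report whether the installed xlrd predates the 2.0 cut-off.
# Helper names are hypothetical; only importlib.metadata is a real API here.
from importlib.metadata import PackageNotFoundError, version


def xlrd_reads_xlsx(version_string: str) -> bool:
    """xlrd 1.x could open .xlsx; 2.0 and later handle only .xls."""
    major = int(version_string.split(".")[0])
    return major < 2


def describe_xlrd() -> str:
    """Summarise what the installed xlrd (if any) can read."""
    try:
        installed = version("xlrd")
    except PackageNotFoundError:
        return "xlrd is not installed"
    if xlrd_reads_xlsx(installed):
        return f"xlrd {installed}: legacy release, can still open .xlsx"
    return f"xlrd {installed}: .xls only, use openpyxl for .xlsx"


print(describe_xlrd())
```

Whatever this prints, the recommended fix stays the same. The check only tells you which of the two situations you are in.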
The two error variants map to distinct situations:
| Error | Situation |
|---|---|
| XLRDError: Excel xlsx file; not supported | xlrd ≥ 2.0 is installed; it opens the file but rejects the .xlsx format immediately |
| ImportError: Missing optional dependency 'xlrd' | pandas resolved to xlrd as the required backend, but xlrd is not installed at all |
Both are fixed the same way: install openpyxl and pass engine="openpyxl".
Minimal Diagnostic
Before applying any fix, confirm what is actually installed in your environment:
# no extra install needed
import importlib
for pkg in ("xlrd", "openpyxl", "python_calamine", "pandas"):
    try:
        mod = importlib.import_module(pkg.replace("-", "_"))
        print(f"{pkg}: {getattr(mod, '__version__', 'installed')}")
    except ImportError:
        print(f"{pkg}: NOT INSTALLED")
Expected output after the fix:
xlrd: 2.0.1 ← still present, but only used for .xls files
openpyxl: 3.1.2 ← handles .xlsx
python_calamine: NOT INSTALLED ← optional fast engine
pandas: 2.2.2
Also confirm that you are running in the correct Python environment:
which python
python -c "import sys; print(sys.executable)"
pip show openpyxl
If pip show openpyxl reports "Package(s) not found" even though you ran pip install openpyxl earlier, that install targeted a different Python interpreter. Re-run the install with the exact interpreter shown above: python -m pip install openpyxl.
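The same check can be run from inside the interpreter that executes your script, which rules out a mismatch between pip and python. This is a sketch under assumptions: the function names installed_version and environment_report are hypothetical, and the distribution names passed in are the PyPI names (python-calamine, not python_calamine).

```python
# Sketch: list the interpreter path and installed distribution versions.
# Function names are hypothetical; importlib.metadata is standard library.
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Sequence


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def environment_report(
    dist_names: Sequence[str] = ("pandas", "openpyxl", "xlrd", "python-calamine"),
) -> Dict[str, Optional[str]]:
    """Map each distribution name to its version, plus the interpreter path."""
    report: Dict[str, Optional[str]] = {"interpreter": sys.executable}
    for name in dist_names:
        report[name] = installed_version(name)
    return report


for key, value in environment_report().items():
    print(f"{key}: {value if value else 'NOT INSTALLED'}")
```

Because it reads package metadata instead of importing each module, this also works for packages that do not define a __version__ attribute.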
Fix: Install openpyxl and Pass engine="openpyxl"
# pip install openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("report.xlsx") # the .xlsx file that was failing
try:
    df = pd.read_excel(
        EXCEL_PATH,
        engine="openpyxl",  # replaces the failing xlrd backend
    )
    print(df.shape)
    print(df.head())
except FileNotFoundError:
    raise SystemExit(f"File not found: {EXCEL_PATH}")
except ImportError as e:
    raise SystemExit(f"openpyxl not installed; run: pip install openpyxl\n{e}")
except Exception as e:
    raise SystemExit(f"Read error: {e}")
Two changes from the failing call:
- pip install openpyxl installs the correct backend for .xlsx.
- engine="openpyxl" routes pandas explicitly to openpyxl instead of letting it guess.
Always pass engine explicitly in production code. Even on modern pandas where openpyxl is the documented default for .xlsx, explicit engine selection eliminates the dependency on inference behaviour across pandas versions and package combinations.
Variant 1: You Have a Genuine .xls File
If the file is actually a legacy .xls binary workbook (Excel 97–2003 format), openpyxl cannot read it. You need xlrd and engine="xlrd":
# pip install xlrd
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("legacy_data.xls") # confirmed .xls binary
try:
    df = pd.read_excel(
        EXCEL_PATH,
        engine="xlrd",  # xlrd 2.x works fine for genuine .xls files
    )
    print(df.shape)
except Exception as e:
    raise SystemExit(f"Read error: {e}")
xlrd 2.x supports the .xls format; there is no need to pin xlrd<2.0 for genuine .xls files. The version restriction only matters if you have old code that used xlrd to open .xlsx.
To confirm the actual file format without opening Excel, inspect the first bytes. .xls binary files start with the OLE2 magic number D0 CF 11 E0; .xlsx files start with the ZIP header 50 4B 03 04:
# no extra install needed
from pathlib import Path
def detect_format(path: Path) -> str:
    magic = path.read_bytes()[:4]
    if magic == b"PK\x03\x04":
        return "xlsx (ZIP-based Open XML)"
    if magic == b"\xd0\xcf\x11\xe0":
        return "xls (OLE2 binary, legacy)"
    return f"unknown format, first 4 bytes: {magic.hex()}"
print(detect_format(Path("mystery_file.xls")))
Run this on the file before deciding which engine to use. A file named .xls that returns "ZIP-based Open XML" is actually an .xlsx that was renamed — pass engine="openpyxl".
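The signature check can also drive the engine choice directly. This is a minimal sketch: engine_for is a hypothetical helper, and it assumes that any ZIP signature means .xlsx, which is a simplification because .ods and .xlsb files are ZIP archives too.

```python
# Sketch: choose a pandas engine name from the file signature.
# engine_for is a hypothetical helper, not a pandas API.
from pathlib import Path

ZIP_MAGIC = b"PK\x03\x04"        # .xlsx (and other ZIP-based formats)
OLE2_MAGIC = b"\xd0\xcf\x11\xe0"  # legacy .xls binary workbook


def engine_for(path: Path) -> str:
    """Return "openpyxl" or "xlrd" based on the first four bytes of the file."""
    with path.open("rb") as handle:
        magic = handle.read(4)  # read only the signature, not the whole file
    if magic == ZIP_MAGIC:
        return "openpyxl"
    if magic == OLE2_MAGIC:
        return "xlrd"
    raise ValueError(f"{path}: unrecognised signature {magic.hex()}")


# Usage, assuming pandas is imported as pd and the file exists:
# df = pd.read_excel(path, engine=engine_for(path))
```

Raising on an unknown signature is a deliberate choice here: a corrupted or mislabelled file fails early with the offending bytes in the message, before it reaches an engine.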
Variant 2: You Explicitly Passed engine="xlrd" in Code
If the error appears in code that already specifies an engine, find and replace the argument:
# Before — fails for .xlsx files:
df = pd.read_excel(path, engine="xlrd")
# After — correct for .xlsx:
df = pd.read_excel(path, engine="openpyxl")
Search your codebase for engine="xlrd" and replace each occurrence with engine="openpyxl" for any path that points to .xlsx files. Keep engine="xlrd" only for paths that are confirmed .xls binary workbooks.
A quick codebase search:
grep -rn 'engine="xlrd"' .
Review each result and update as needed.
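The grep pattern above matches only the double-quoted form. As a hedged alternative, the sketch below scans .py files for both quote styles and optional spaces around the equals sign. The function name find_xlrd_engine_calls and the choice to scan only *.py files are assumptions for illustration.

```python
# Sketch: find engine="xlrd" / engine='xlrd' in Python sources under a root.
# find_xlrd_engine_calls is a hypothetical helper.
import re
from pathlib import Path
from typing import List, Tuple

# Matches engine="xlrd", engine='xlrd', and engine = "xlrd"
ENGINE_XLRD = re.compile(r"""engine\s*=\s*["']xlrd["']""")


def find_xlrd_engine_calls(root: Path) -> List[Tuple[Path, int, str]]:
    """Return (file, line number, stripped line) for every match under root."""
    hits = []
    for source in sorted(root.rglob("*.py")):
        try:
            text = source.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError):
            continue  # skip unreadable or non-UTF-8 files
        for number, line in enumerate(text.splitlines(), start=1):
            if ENGINE_XLRD.search(line):
                hits.append((source, number, line.strip()))
    return hits


# Usage:
# for source, number, line in find_xlrd_engine_calls(Path(".")):
#     print(f"{source}:{number}: {line}")
```

The output still needs manual review: a match on a path that points to a confirmed .xls workbook is correct and should stay.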
Variant 3: Use calamine for Faster Reads
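If raw read performance matters (for example, loading a 50 MB .xlsx on every CI run), the calamine engine is typically 2–5x faster than openpyxl because it skips formula and style parsing. The exact speed-up depends on the workbook, so measure on your own files. engine="calamine" requires pandas 2.2 or later: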
# pip install python-calamine
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("large_report.xlsx")
try:
    df = pd.read_excel(
        EXCEL_PATH,
        engine="calamine",  # handles .xlsx, .xls, .xlsb, .ods
    )
    print(df.shape)
except ImportError:
    raise SystemExit("Run: pip install python-calamine")
except Exception as e:
    raise SystemExit(f"Read error: {e}")
calamine is read-only and does not expose formula strings, cell styles, or comments. For anything beyond tabular data extraction, use openpyxl. See Reading Excel Files with Python for a full engine comparison.
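If calamine is optional in your environment, one way to avoid a hard dependency is to fall back to openpyxl when it is missing. This is only a sketch: first_available_engine and the DEFAULT_CANDIDATES table are hypothetical names, and the check confirms only that a module can be found, not that the installed pandas supports the engine.

```python
# Sketch: pick the first engine whose backing module is importable.
# first_available_engine and DEFAULT_CANDIDATES are hypothetical names.
from importlib.util import find_spec
from typing import Sequence, Tuple

# (pandas engine name, importable module name), in order of preference
DEFAULT_CANDIDATES: Tuple[Tuple[str, str], ...] = (
    ("calamine", "python_calamine"),
    ("openpyxl", "openpyxl"),
)


def first_available_engine(
    candidates: Sequence[Tuple[str, str]] = DEFAULT_CANDIDATES,
) -> str:
    """Return the first engine name whose module can be located."""
    for engine, module_name in candidates:
        if find_spec(module_name) is not None:
            return engine
    wanted = ", ".join(module for _, module in candidates)
    raise ImportError(f"None of these modules are installed: {wanted}")


# Usage, assuming pandas >= 2.2 is imported as pd:
# df = pd.read_excel("large_report.xlsx", engine=first_available_engine())
```

The engine is still passed explicitly to read_excel, so the selection is visible in your own code and does not rely on pandas inference.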
Variant 4: The "File is not a zip file" Error
If the error is zipfile.BadZipFile: File is not a zip file rather than XLRDError, openpyxl could not open the file as a ZIP archive: the file is either corrupted or its extension does not match its content. An .xlsx file with corrupted bytes, or a genuine .xls file that was renamed to .xlsx, produces this error.
# no extra install needed
from pathlib import Path
path = Path("suspect_file.xlsx")
magic = path.read_bytes()[:4]
print(magic.hex())
# 504b0304 → valid .xlsx (ZIP, i.e. the bytes "PK\x03\x04")
# d0cf11e0 → actually .xls (OLE2 binary) — use engine="xlrd"
# anything else → file may be corrupted
If the file is actually an .xls workbook masquerading as .xlsx, pass engine="xlrd" (and ideally rename the file to .xls so the extension matches its content). If the file is corrupted, it needs to be re-exported from the source system.
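The four-byte check cannot tell a truncated download from an intact workbook, since both start with the same signature. The sketch below goes one step further using only the standard library. diagnose_xlsx is a hypothetical helper, and treating the presence of [Content_Types].xml as the marker of an Open XML package is a heuristic, not a full validation.

```python
# Sketch: classify a suspect .xlsx using the standard-library zipfile module.
# diagnose_xlsx is a hypothetical helper; the checks are heuristics.
import zipfile
from pathlib import Path


def diagnose_xlsx(path: Path) -> str:
    """Return a short description of what the file appears to be."""
    if not zipfile.is_zipfile(path):
        # Covers renamed .xls files and archives truncated before the
        # central directory at the end of the file.
        return "not a ZIP archive"
    with zipfile.ZipFile(path) as archive:
        if "[Content_Types].xml" not in archive.namelist():
            return "ZIP archive, but not an Open XML package"
        bad_member = archive.testzip()  # name of first corrupt member, or None
    if bad_member is not None:
        return f"damaged member: {bad_member}"
    return "looks like a valid Open XML package"


# Usage:
# print(diagnose_xlsx(Path("suspect_file.xlsx")))
```

A result of "not a ZIP archive" combined with the d0cf11e0 signature points to a renamed .xls file. Any other failure usually means the file should be re-exported.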
Verification
After applying the fix, confirm that the load succeeds and that column types look correct:
# pip install pandas openpyxl
from pathlib import Path
import pandas as pd
EXCEL_PATH = Path("report.xlsx")
try:
    df = pd.read_excel(EXCEL_PATH, engine="openpyxl")
    assert df.shape[0] > 0, "DataFrame is empty; check sheet name and skiprows"
    assert df.shape[1] > 0, "No columns loaded"
    print(f"OK: {df.shape[0]} rows, {df.shape[1]} columns")
    print(df.dtypes)
except AssertionError as e:
    raise SystemExit(f"Validation failed: {e}")
except Exception as e:
    raise SystemExit(f"Load error after fix: {e}")
A non-empty shape and a complete dtypes printout without a traceback confirm that the engine error is resolved. If the error persists despite installing openpyxl, double-check that pip install openpyxl ran in the same Python environment that is executing the script.
Related
- Reading Excel Files with Python: full engine comparison (openpyxl, xlrd, calamine) with decision tree SVG
- How to Read Excel with Pandas, Step by Step: beginner walkthrough of all read_excel parameters
- Automating Excel Report Generation: once files load cleanly, automate report generation
Part of Reading Excel Files with Python.