Reading Excel Files with Python

Excel workbooks arrive in two distinct binary formats — .xlsx (Office Open XML, used since Excel 2007) and .xls (legacy BIFF binary format from Excel 97–2003). Each format requires a different Python parsing engine. Code that skips engine selection either picks the wrong backend and raises XLRDError, or uses the correct backend but misses key parameters and corrupts types or column alignment.

This guide walks through every decision point in order: which engine to install, how to inspect a workbook before loading it, how to target specific sheets and columns, how to coerce types at load time, how to use openpyxl directly when pandas is not enough, and how to handle the real-world edge cases that appear once your scripts move from test files to production data.

For downstream work, see Automating Excel Report Generation once data is loaded, or Merging Multiple Spreadsheets when combining data from several workbooks.

Prerequisites

Install the required libraries before running any snippet in this guide:

# pip install pandas openpyxl
pip install pandas openpyxl
# for legacy .xls files only:
pip install "xlrd>=1.2,<2.0"
# for the fast calamine engine (optional):
pip install python-calamine

Create a minimal test workbook to validate your setup before pointing scripts at real data:

# pip install openpyxl
from pathlib import Path
import openpyxl

TEST_PATH = Path("test_workbook.xlsx")
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Sales"
ws.append(["order_id", "customer", "amount", "order_date"])
ws.append([1001, "Alice", 199.99, "2026-01-15"])
ws.append([1002, "Bob", 54.50, "2026-01-16"])
ws.append([1003, "Carol", 320.00, "2026-01-17"])
wb.save(TEST_PATH)
print("Test file written:", TEST_PATH.resolve())

Engine Selection

pandas.read_excel delegates file I/O to a backend engine. The choice is determined by file format, not personal preference. Passing the wrong engine raises an immediate error; omitting the engine parameter lets pandas guess, which can pick incorrectly when multiple engines are installed.

Excel engine selection decision tree Shows how .xlsx and .xls file formats map to openpyxl, calamine, or xlrd engines, all producing a pandas DataFrame output. Excel File .xlsx / .xls / .xlsb? .xlsx .xlsb .xls engine="openpyxl" full features, formatting engine="calamine" faster, read-only or engine="calamine" only supported engine engine="xlrd" xlrd < 2.0 required pandas DataFrame
EngineFormatsInstallWhen NOT to use
openpyxl.xlsx, .xlsmpip install openpyxlVery large files where read speed is critical
calamine.xlsx, .xls, .xlsb, .odspip install python-calamineWhen you need formula strings, cell styles, or comments
xlrd.xls onlypip install "xlrd<2.0"Any .xlsx or .xlsb file — raises XLRDError

Hitting XLRDError: Excel xlsx file; not supported? See Fix xlrd Error Reading .xlsx Files for the exact fix.

Step 1: Inspect Before Loading

Check sheet names and preview a few rows before committing to a full workbook load. On a 100 MB file this saves significant time:

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("data/sales_q3.xlsx")

try:
    xl = pd.ExcelFile(EXCEL_PATH, engine="openpyxl")
    print("Sheets:", xl.sheet_names)
    # Preview first 5 rows without loading the whole file
    preview = xl.parse(xl.sheet_names[0], nrows=5)
    print(preview)
    print("Columns:", preview.columns.tolist())
except FileNotFoundError:
    raise SystemExit(f"File not found: {EXCEL_PATH}")
except Exception as e:
    raise SystemExit(f"Cannot open workbook: {e}")

pd.ExcelFile opens the workbook container without parsing any sheet data. xl.parse() reads only the rows you request. Use this diagnostic step to determine correct values for skiprows, header, and usecols before writing the final load call.

Step 2: Basic read_excel Call

The minimum correct call for a modern .xlsx file:

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("data/report.xlsx")

try:
    df = pd.read_excel(
        EXCEL_PATH,
        engine="openpyxl",    # always pass explicitly
    )
    print(df.shape)
    print(df.dtypes)
except FileNotFoundError:
    raise SystemExit(f"File missing: {EXCEL_PATH}")
except ImportError as e:
    raise SystemExit(f"Engine not installed — pip install openpyxl: {e}")

The engine argument is always required in production code. Without it, pandas infers the backend from the file extension, and on systems where xlrd is installed alongside openpyxl, the inference can pick xlrd and fail.

Step 3: sheet_name — Targeting Sheets

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("data/workbook.xlsx")

try:
    # By name
    df_q3 = pd.read_excel(EXCEL_PATH, sheet_name="Q3", engine="openpyxl")

    # By zero-based index (first sheet)
    df_first = pd.read_excel(EXCEL_PATH, sheet_name=0, engine="openpyxl")

    # All sheets → OrderedDict of {name: DataFrame}
    all_sheets = pd.read_excel(EXCEL_PATH, sheet_name=None, engine="openpyxl")
    for name, df in all_sheets.items():
        print(f"{name}: {df.shape}")

    # Specific subset of sheets
    subset = pd.read_excel(
        EXCEL_PATH, sheet_name=["Q1", "Q2", "Q3"], engine="openpyxl"
    )
except KeyError as e:
    raise SystemExit(f"Sheet not found: {e}")
except Exception as e:
    raise SystemExit(f"Load error: {e}")

The sheet_name=None pattern is the correct entry point for the consolidation workflow in Merging Multiple Spreadsheets — it returns all sheets without requiring you to enumerate names in advance.

Step 4: header and skiprows

Business reports rarely have data starting at row 1. A title block, company name, print date, and blank separator row are common before the real header:

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("data/monthly_report.xlsx")

try:
    df = pd.read_excel(
        EXCEL_PATH,
        skiprows=3,        # skip rows 0, 1, 2 (title block)
        header=0,          # row 3 (now row 0 after skip) becomes the header
        skipfooter=2,      # ignore last 2 rows (grand-total lines)
        engine="openpyxl",
    )
    print(df.columns.tolist())
except Exception as e:
    raise SystemExit(f"Read error: {e}")

For non-contiguous rows to skip — e.g., row 0 is a logo, rows 1–2 are blank, row 3 is a subtitle — pass a list: skiprows=[0, 1, 2, 3]. The row immediately after the last skipped row becomes the header if header=0.

Step 5: usecols — Limiting Column Scope

Loading only the columns you need can reduce memory use by 70–90% on wide exports:

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("data/large_export.xlsx")

try:
    # By name — most explicit and readable
    df = pd.read_excel(
        EXCEL_PATH,
        usecols=["order_id", "customer", "amount", "order_date"],
        engine="openpyxl",
    )

    # By Excel column letter range — useful when column names are unknown
    df_range = pd.read_excel(EXCEL_PATH, usecols="A:D,F", engine="openpyxl")

    # By callable — keep columns whose name starts with "Revenue"
    df_rev = pd.read_excel(
        EXCEL_PATH,
        usecols=lambda c: str(c).startswith("Revenue"),
        engine="openpyxl",
    )
    print("Memory:", df.memory_usage(deep=True).sum() // 1024, "KB")
except Exception as e:
    raise SystemExit(f"Read error: {e}")

When usecols is a list of names, those names must match the header row exactly — including case and leading/trailing whitespace. Strip column names after load with df.columns = df.columns.str.strip() if the source file has invisible padding.

Step 6: dtype and parse_dates

Type inference is lossy for two categories of data: identifiers (IDs become floats) and dates (stored as integers or strings, not converted to datetime). Fix both at load time:

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("data/orders.xlsx")

try:
    df = pd.read_excel(
        EXCEL_PATH,
        dtype={
            "order_id": str,        # keeps "1001" not 1001.0
            "customer_id": str,     # preserves leading zeros on codes like "00123"
            "amount": float,
            "qty": int,
        },
        parse_dates=["order_date", "ship_date"],
        engine="openpyxl",
    )
    print(df.dtypes)
    assert pd.api.types.is_datetime64_any_dtype(df["order_date"]), "Date parse failed"
except AssertionError as e:
    print(f"Type check: {e}")
    # Fallback: coerce after load
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
except Exception as e:
    raise SystemExit(f"Read error: {e}")

If parse_dates silently fails (the column stays object), the Excel cells contain text strings rather than date serials. The fallback pd.to_datetime(..., errors="coerce") converts what it can and inserts NaT for values that cannot be parsed, making failures visible rather than silent.

Step 7: Loading Multiple Sheets into One DataFrame

Stack identically structured sheets into a single DataFrame before analysis or reporting:

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("data/annual_sales.xlsx")

try:
    all_sheets = pd.read_excel(EXCEL_PATH, sheet_name=None, engine="openpyxl")

    frames = []
    for name, df in all_sheets.items():
        df = df.dropna(how="all")       # drop rows that are entirely blank
        df["quarter"] = name            # tag source sheet
        frames.append(df)

    if not frames:
        raise ValueError("No sheets with data found")

    combined = pd.concat(frames, ignore_index=True)
    print(f"Combined: {combined.shape[0]} rows across {len(frames)} sheets")
except Exception as e:
    raise SystemExit(f"Concat error: {e}")

Step 8: Reading with openpyxl Directly

Use openpyxl directly — instead of via pandas — when you need data that pandas discards: cell background colors, font styles, comment text, formula strings (not results), or merged cell geometry.

# pip install openpyxl
from pathlib import Path
import openpyxl

EXCEL_PATH = Path("data/styled_report.xlsx")

try:
    # data_only=True returns cached formula results, not formula strings
    # read_only=True streams rows without loading full workbook into RAM
    wb = openpyxl.load_workbook(EXCEL_PATH, read_only=True, data_only=True)
    ws = wb["Summary"]

    rows = []
    for row in ws.iter_rows(min_row=2, values_only=True):
        rows.append(row)

    wb.close()
    print(f"Read {len(rows)} data rows via openpyxl")
except FileNotFoundError:
    raise SystemExit(f"File not found: {EXCEL_PATH}")
except KeyError as e:
    raise SystemExit(f"Sheet not found: {e}")

read_only=True streams rows without loading the full workbook into RAM — essential for files over 50 MB. Note that read_only=True and data_only=True cannot both access cell styles; for styling metadata, open without read_only.

Edge Cases and Variants

xlsb Binary Format

.xlsb files are not supported by openpyxl. The only engine that handles them is calamine:

# pip install pandas python-calamine
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("data/large_model.xlsb")

try:
    df = pd.read_excel(EXCEL_PATH, engine="calamine")
    print(df.shape)
except ImportError:
    raise SystemExit("Run: pip install python-calamine")
except Exception as e:
    raise SystemExit(f"Read error: {e}")

Multi-level Column Headers

Reports exported from pivot tables often have two header rows forming a MultiIndex:

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("data/pivot_export.xlsx")

try:
    df = pd.read_excel(
        EXCEL_PATH,
        header=[0, 1],      # rows 0 and 1 both contribute to the column index
        engine="openpyxl",
    )
    print(df.columns)       # MultiIndex tuples like ('Revenue', 'Q1')
    # Flatten to single-level
    df.columns = ["_".join(str(c) for c in col).strip() for col in df.columns]
    print(df.columns.tolist())
except Exception as e:
    raise SystemExit(f"Read error: {e}")

Merged Cell Headers

Merged cells export a value only in the top-left cell; adjacent cells in the merge read as NaN. Forward-fill restores the intended labels:

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("data/merged_headers.xlsx")

try:
    df = pd.read_excel(EXCEL_PATH, header=None, engine="openpyxl")
    df.iloc[0] = df.iloc[0].ffill()    # propagate merged labels rightward
    df.columns = df.iloc[0]
    df = df.iloc[1:].reset_index(drop=True)
    print(df.columns.tolist())
except Exception as e:
    raise SystemExit(f"Read error: {e}")

Reading from a BytesIO Buffer

When Excel files arrive over HTTP, from an email attachment, or from S3, avoid writing to disk first:

# pip install pandas openpyxl requests
import io
from pathlib import Path
import pandas as pd
import requests

URL = "https://example.com/data/report.xlsx"

try:
    response = requests.get(URL, timeout=30)
    response.raise_for_status()
    buf = io.BytesIO(response.content)
    df = pd.read_excel(buf, engine="openpyxl")
    print(df.shape)
except requests.RequestException as e:
    raise SystemExit(f"Download failed: {e}")
except Exception as e:
    raise SystemExit(f"Parse error: {e}")

The same pattern works with boto3 for S3: fetch the object body as bytes, wrap in BytesIO, pass to read_excel.

Validation

Validate immediately after load, before any downstream processing:

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("data/orders.xlsx")
REQUIRED_COLS = {"order_id", "customer_id", "amount", "order_date"}

try:
    df = pd.read_excel(
        EXCEL_PATH,
        dtype={"order_id": str, "customer_id": str},
        parse_dates=["order_date"],
        engine="openpyxl",
    )

    missing = REQUIRED_COLS - set(df.columns)
    assert not missing, f"Missing columns: {missing}"
    assert df.shape[0] > 0, "DataFrame is empty"
    assert df["amount"].dtype in (float, "float64"), "amount not numeric"
    assert pd.api.types.is_datetime64_any_dtype(df["order_date"]), "order_date not parsed"
    assert not df["order_id"].isna().any(), "Null order_ids present"

    print(f"Validation passed — {df.shape[0]} rows")
except AssertionError as e:
    raise SystemExit(f"Validation failed: {e}")
except Exception as e:
    raise SystemExit(f"Load error: {e}")

Performance and Scale

File sizeRecommended approach
< 10 MBpd.read_excel(..., engine="openpyxl")
10–100 MBAdd usecols and explicit dtype to cut allocations
> 100 MBopenpyxl with read_only=True streaming, or engine="calamine"
Any .xlsbengine="calamine" only — openpyxl cannot open this format

read_excel does not support chunked reading. For very large workbooks, split by sheet (sheet_name=None) and process each sheet separately, or convert to CSV/Parquet first if you control the source.

openpyxl with read_only=True is the memory-efficient option for .xlsx files above 100 MB. It streams rows as an iterator and does not hold the full parsed workbook in RAM.

Troubleshooting

ErrorRoot causeFix
XLRDError: Excel xlsx file; not supportedxlrd ≥ 2.0 used for .xlsxPass engine="openpyxl" — see Fix xlrd Error Reading .xlsx Files
ImportError: Missing optional dependency 'openpyxl'Engine not installedpip install openpyxl
KeyError: 'SheetName'Sheet name has spaces or wrong casePrint pd.ExcelFile(path).sheet_names to confirm exact name
ValueError: File is not a zip fileFile is corrupted or mislabeled as .xlsxCheck first 4 bytes to confirm format; pass correct engine
Columns named Unnamed: 0, Unnamed: 1Header row is not row 0Add skiprows=N to skip intro rows; preview with nrows=5 first
Numeric IDs display as 1001.0Pandas inferred float64 from int-looking cellsAdd dtype={"order_id": str}
EmptyDataErrorskiprows removed all rows including the headerUse nrows=10 to diagnose row layout before setting skiprows

Complete Working Script

# pip install pandas openpyxl
"""read_excel_all_sheets.py — load, validate, and export every sheet to CSV."""
import argparse
from pathlib import Path
import pandas as pd


def load_workbook(path: Path, engine: str = "openpyxl") -> dict[str, pd.DataFrame]:
    try:
        return pd.read_excel(path, sheet_name=None, engine=engine)
    except FileNotFoundError:
        raise SystemExit(f"File not found: {path}")
    except ImportError as e:
        raise SystemExit(f"Engine not installed — pip install {engine}: {e}")
    except Exception as e:
        raise SystemExit(f"Cannot open workbook: {e}")


def clean_sheet(df: pd.DataFrame, name: str) -> pd.DataFrame:
    df = df.dropna(how="all").reset_index(drop=True)
    df.columns = df.columns.astype(str).str.strip()
    if df.empty:
        print(f"  WARN: sheet '{name}' has no data after cleaning")
    return df


def main() -> None:
    parser = argparse.ArgumentParser(description="Export all Excel sheets to CSV")
    parser.add_argument("file", type=Path, help="Path to .xlsx file")
    parser.add_argument(
        "--engine",
        default="openpyxl",
        choices=["openpyxl", "calamine", "xlrd"],
        help="Parsing engine",
    )
    parser.add_argument(
        "--out-dir",
        type=Path,
        default=Path("."),
        help="Output directory for CSV files",
    )
    args = parser.parse_args()

    args.out_dir.mkdir(parents=True, exist_ok=True)
    sheets = load_workbook(args.file, engine=args.engine)

    for name, df in sheets.items():
        df = clean_sheet(df, name)
        out = args.out_dir / f"{name}.csv"
        df.to_csv(out, index=False)
        print(f"  {name}: {df.shape[0]} rows → {out}")


if __name__ == "__main__":
    main()

Run it as:

python read_excel_all_sheets.py data/annual_report.xlsx --out-dir output/

FAQ

Can Python read password-protected Excel files? Not directly. Use msoffcrypto-tool to decrypt the file into a BytesIO buffer, then pass the buffer to pd.read_excel. The decryption step requires the password as a string.

Why does read_excel return NaN for empty cells instead of blank strings? Pandas uses NaN as its universal missing value indicator. Pass keep_default_na=False to suppress this, or call df.fillna("") after loading if downstream code needs empty strings.

Is openpyxl or calamine faster for large read-only workbooks?calamine is typically 2–5x faster for pure data reads because it skips formula parsing and cell style processing. Use openpyxl when you need formatting metadata, formula strings, or cell comments.

How do I read an Excel file from a URL or S3? Download the file bytes into a BytesIO buffer and pass the buffer to pd.read_excel. See the "Reading from a BytesIO Buffer" section above for an HTTP example. For S3, use boto3.client("s3").get_object(...)["Body"].read() to get the bytes.

What is the difference between skiprows and header?skiprows removes rows from the file before pandas processes it. header tells pandas which remaining row (by index) contains column names. They work together: skiprows=3, header=0 means "discard the first 3 rows, then treat the very next row as the header."

How do I read only specific rows from a large workbook?pd.read_excel does not support row-range reads like CSV chunking. To read a window of rows, use openpyxl with min_row/max_row in ws.iter_rows(), or load the full sheet and slice the DataFrame afterward.

Part of Python for Excel & CSV Data Processing.

Explore next