How to Read Excel with Pandas Step by Step

pd.read_excel() has over a dozen parameters that interact in non-obvious ways. Skip engine and you hit ImportError; ignore dtype and numeric IDs silently become floats; miss skiprows on a report with a title block and your column headers land in the wrong row. This walkthrough moves through each decision point in sequence so the first call you write is correct.

For a full reference that covers openpyxl direct access, calamine, merged cells, and BytesIO loading, see Reading Excel Files with Python.

Prerequisites

# pip install pandas openpyxl
pip install pandas openpyxl

Verify the install:

# pip install pandas openpyxl
import pandas as pd
import openpyxl
print("pandas", pd.__version__)
print("openpyxl", openpyxl.__version__)

If you see ModuleNotFoundError: No module named 'openpyxl' after the pip install, your terminal is pointing at a different Python than the one running your scripts. Activate the correct virtualenv and reinstall.

Create a test workbook to follow along with each step:

# pip install openpyxl
from pathlib import Path
import openpyxl

TEST_PATH = Path("sample.xlsx")
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Sales"
ws.append(["order_id", "customer", "amount", "order_date"])
ws.append([1001, "Alice", 199.99, "2026-01-15"])
ws.append([1002, "Bob", 54.50, "2026-01-16"])
ws.append([1003, "Carol", 320.00, "2026-01-17"])

# Second sheet with different data
ws2 = wb.create_sheet("Returns")
ws2.append(["order_id", "reason", "refund"])
ws2.append([1001, "Wrong size", 199.99])
wb.save(TEST_PATH)
print("Created:", TEST_PATH.resolve())

Step 1: Discover Sheet Names

Before reading data, confirm what sheets exist. A KeyError caused by a sheet name that has an invisible trailing space is a surprisingly common failure in production:

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("sample.xlsx")

try:
    xl = pd.ExcelFile(EXCEL_PATH, engine="openpyxl")
    print("Sheets:", xl.sheet_names)     # ['Sales', 'Returns']
except FileNotFoundError:
    raise SystemExit(f"File not found: {EXCEL_PATH}")
except Exception as e:
    raise SystemExit(f"Cannot open: {e}")

pd.ExcelFile opens the workbook container without parsing any sheet data. It is cheap even on large files and gives you the exact sheet names as Python strings, including any whitespace that would cause a KeyError if you typed the name manually.

Step 2: Load a Single Sheet

The minimum correct call for a .xlsx file:

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("sample.xlsx")

try:
    df = pd.read_excel(
        EXCEL_PATH,
        sheet_name="Sales",    # target sheet by exact name
        engine="openpyxl",     # required for .xlsx — never omit
    )
    print(df.shape)            # (3, 4)
    print(df.head())
except FileNotFoundError:
    raise SystemExit(f"File not found: {EXCEL_PATH}")
except ImportError as e:
    raise SystemExit(f"Engine missing — pip install openpyxl: {e}")
except KeyError as e:
    raise SystemExit(f"Sheet not found: {e}")

The engine="openpyxl" argument is not optional in production code. Without it, pandas infers the backend from the file extension; on some system configurations this picks xlrd, which raises XLRDError for .xlsx files. See Fix xlrd Error Reading .xlsx Files if you hit that error.

Step 3: Handle Reports with Non-Zero Header Rows

Reports from accounting software, ERP systems, and generated exports often have a company name, print date, filter summary, or logo block above the actual data table. Use skiprows to remove those rows before pandas identifies the header:

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("monthly_report.xlsx")

try:
    # Rows 0–2 are a title block; row 3 is the real column header
    df = pd.read_excel(
        EXCEL_PATH,
        skiprows=3,        # discard rows 0, 1, 2
        header=0,          # the next row (original row 3) is now row 0 and becomes the header
        skipfooter=2,      # ignore the last 2 rows (totals lines)
        engine="openpyxl",
    )
    print(df.columns.tolist())
except Exception as e:
    raise SystemExit(f"Read error: {e}")

To skip specific non-contiguous rows — e.g., row 0 contains a logo, rows 3 and 4 are blank sub-headers — pass a list: skiprows=[0, 3, 4]. Use nrows=10 in a preview call first to count the exact rows to skip.

When header=None is passed instead, pandas treats every row as data and assigns integer column indices (0, 1, 2, …). This is useful when you want to handle the header row manually — for example, to forward-fill merged cell labels.

Step 4: Select Only the Columns You Need

Wide workbooks exported from ERP or BI systems often contain 40–80 columns. Loading all of them when you need 5 wastes memory and slows type inference. usecols restricts what is read from disk:

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("large_export.xlsx")

try:
    # Option A: column names — most readable and maintainable
    df = pd.read_excel(
        EXCEL_PATH,
        usecols=["order_id", "customer", "amount", "order_date"],
        engine="openpyxl",
    )

    # Option B: Excel letter range — useful when column names are unknown
    df_range = pd.read_excel(EXCEL_PATH, usecols="A:D", engine="openpyxl")

    # Option C: callable — keep columns matching a pattern
    df_q = pd.read_excel(
        EXCEL_PATH,
        usecols=lambda c: str(c).startswith("Revenue"),
        engine="openpyxl",
    )

    print(f"Loaded {df.shape[1]} columns, {df.shape[0]} rows")
    print("Memory:", df.memory_usage(deep=True).sum() // 1024, "KB")
except Exception as e:
    raise SystemExit(f"Read error: {e}")

When usecols is a list of names, those names must match the header row exactly — case-sensitive, including any whitespace in the original file. Strip column names after load if you suspect padding: df.columns = df.columns.str.strip().

Step 5: Fix Type Inference with dtype and parse_dates

Pandas type inference makes two predictable mistakes on Excel data. First, columns that look numeric to the sampler (but contain IDs like 001234) get cast to float64, adding .0 and dropping leading zeros. Second, date columns stored as strings or formatted text cells stay as object dtype rather than datetime64.

Fix both issues at the read_excel call:

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("sample.xlsx")

try:
    df = pd.read_excel(
        EXCEL_PATH,
        sheet_name="Sales",
        dtype={
            "order_id": str,       # keep as "1001", not 1001.0
            "customer": str,
            "amount": float,
        },
        parse_dates=["order_date"],    # coerce to datetime64[ns]
        engine="openpyxl",
    )
    print(df.dtypes)
    # order_id       object
    # customer       object
    # amount        float64
    # order_date    datetime64[ns]
except Exception as e:
    raise SystemExit(f"Read error: {e}")

If parse_dates fails silently and the column stays object, the cells contain text strings rather than Excel date serials. Use this fallback:

# pip install pandas
import pandas as pd

# After loading df above
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
# errors="coerce" inserts NaT for values that cannot be parsed
print(df["order_date"].isna().sum(), "unparseable dates")

The errors="coerce" argument makes failures visible as NaT rather than silently keeping the original strings.

Step 6: Load All Sheets at Once

When all sheets share the same column structure and you want to concatenate them, pass sheet_name=None. This returns an OrderedDict of {sheet_name: DataFrame} without requiring you to enumerate names in advance:

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("annual_report.xlsx")

try:
    all_sheets: dict[str, pd.DataFrame] = pd.read_excel(
        EXCEL_PATH,
        sheet_name=None,           # load every sheet
        dtype={"order_id": str},
        engine="openpyxl",
    )

    frames = []
    for name, df in all_sheets.items():
        df = df.dropna(how="all")           # drop fully blank rows
        df["source_sheet"] = name           # track which sheet each row came from
        frames.append(df)

    if not frames:
        raise ValueError("No non-empty sheets found")

    combined = pd.concat(frames, ignore_index=True)
    print(f"Combined: {combined.shape[0]} rows across {len(frames)} sheets")
except Exception as e:
    raise SystemExit(f"Error: {e}")

This sheet_name=None pattern is the standard entry point for consolidation workflows. See Merging Multiple Spreadsheets for how to extend this to workbooks from different sources.

Step 7: Validate the Result

Validate immediately after loading, before any downstream transformation or reporting. A malformed source file should raise an explicit error rather than produce a silent wrong result in a downstream report:

# pip install pandas openpyxl
from pathlib import Path
import pandas as pd

EXCEL_PATH = Path("sample.xlsx")
REQUIRED_COLS = {"order_id", "customer", "amount", "order_date"}

try:
    df = pd.read_excel(
        EXCEL_PATH,
        sheet_name="Sales",
        dtype={"order_id": str},
        parse_dates=["order_date"],
        engine="openpyxl",
    )

    missing_cols = REQUIRED_COLS - set(df.columns)
    assert not missing_cols, f"Missing columns: {missing_cols}"
    assert df.shape[0] > 0, "DataFrame is empty"
    assert not df["order_id"].isna().any(), "Null order_ids found"
    assert pd.api.types.is_datetime64_any_dtype(df["order_date"]), "order_date not datetime"

    print(f"Validation passed — {df.shape[0]} rows")
    print(df.dtypes)
except AssertionError as e:
    raise SystemExit(f"Validation failed: {e}")
except Exception as e:
    raise SystemExit(f"Load error: {e}")

These four assertions catch the four most common silent failures: missing columns (schema drift), empty sheet (wrong tab or empty export), null IDs (join will fail), and unparsed dates (time-series operations will break).

Troubleshooting Common Errors

ErrorRoot causeFix
ImportError: Missing optional dependency 'openpyxl'openpyxl not installedpip install openpyxl
XLRDError: Excel xlsx file; not supportedxlrd ≥ 2.0 selected for .xlsxAdd engine="openpyxl" — see Fix xlrd Error Reading .xlsx Files
KeyError: 'SheetName'Sheet name has trailing space or wrong casePrint pd.ExcelFile(path).sheet_names to confirm exact string
Columns named Unnamed: 0, Unnamed: 1Header row is not row 0Add skiprows=N after previewing with nrows=5
Numeric IDs shown as 1001.0Pandas inferred float64Add dtype={"order_id": str}
EmptyDataErrorskiprows value exceeded total row countPreview with nrows=10 first to count intro rows
Date column stays object dtypeCells contain strings, not Excel date serialsCall pd.to_datetime(df["col"], errors="coerce") post-load

Frequently Asked Questions

How do I read only the first N rows? Pass nrows=100 to pd.read_excel(). Useful for previewing large files before committing to a full load.

How do I read multiple specific sheets by name? Pass a list: sheet_name=["Q1", "Q2", "Q3"]. Pandas returns a dict of DataFrames keyed by sheet name.

Can I skip the first few rows of a report header? Yes. skiprows=3 skips rows 0, 1, and 2; the next row becomes the header. For non-contiguous rows, pass a list: skiprows=[0, 2, 5].

Why is my date column still object dtype after parse_dates? The cell values are text strings rather than Excel date serials (integers). Call pd.to_datetime(df["col"], errors="coerce") after loading to force the conversion.

What is the difference between header=None and header=0?header=0 (the default) uses row 0 as column names. header=None treats every row as data and assigns integer column indices. Use header=None when you need to manipulate the header row yourself before assigning it — for example, to forward-fill merged cell labels.

How do I load a .xls file instead of .xlsx? Install xlrd < 2.0 and pass engine="xlrd". The openpyxl engine cannot open legacy .xls binary files.

Part of Reading Excel Files with Python.

/html>