Fixing UnicodeDecodeError in CSV Files: 'utf-8' codec can't decode byte 0x96

The error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 142: invalid start byte stops your script cold every time you try to load a legacy or Windows-exported CSV with pandas. This page gives you a deterministic workflow: detect the real encoding, apply the right fix, handle edge cases like BOM and mixed encodings, and verify the result.

Root Cause

Pandas defaults to UTF-8. Byte 0x96 is a valid Windows-1252 character (the en-dash), but in UTF-8 it is a continuation byte and can never start a character. The moment the parser hits that byte it raises and halts; there is no partial load. The same pattern applies to the other bytes in the 0x80 to 0x9F range that Windows applications write as cp1252: curly quotes (0x91 to 0x94), bullet (0x95), em-dash (0x97), and ellipsis (0x85). Regional ERP exports, older Excel CSV saves, and legacy accounting software are the common culprits. For a broader look at which parser handles which file quirks, see Best Python Libraries for CSV Parsing.
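The failure can be reproduced without any file. The sketch below uses a made-up byte string (an assumption for illustration, not taken from a real export) that contains 0x96, and decodes it both ways with plain bytes.decode().

```python
# Hypothetical sample: a year range written by a Windows application.
raw = b"2019\x962021"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the byte offset, the same number the traceback calls "position"
    print(f"utf-8 failed at byte {exc.start}: {exc.reason}")

# The same bytes decode cleanly as cp1252; 0x96 becomes U+2013 (en-dash).
text = raw.decode("cp1252")
print(repr(text))
```

The offset reported by the exception is a quick way to locate the offending byte in a hex viewer.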

Step 1 — Detect the Encoding with chardet

Before guessing, measure. chardet reads a binary sample and returns a codec name and a confidence score.

# pip install chardet pandas
from pathlib import Path
import chardet
import pandas as pd

csv_path = Path("data/export.csv")

try:
    with csv_path.open("rb") as fh:
        raw_sample = fh.read(50_000)  # first 50 KB is enough; avoids loading the whole file
    result = chardet.detect(raw_sample)
    print(result)  # {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
except OSError as e:
    raise SystemExit(f"Cannot read file: {e}") from e

As a rule of thumb, treat a confidence score above 0.7 as good enough to try the detected encoding directly. It is a heuristic, not a guarantee, so still verify the loaded data. Below 0.7 the detector is uncertain, so fall back to the manual ladder described in Steps 3 and 6. Always read in binary mode (read_bytes() or 'rb') so that nothing is decoded before chardet has seen the raw bytes.

Step 2 — Detect with charset-normalizer (Alternative)

charset-normalizer is an alternative detector. Like chardet, it installs without requiring a compiler, and its detect() function intentionally mirrors chardet's API, so the two are interchangeable in this workflow.

# pip install charset-normalizer pandas
from pathlib import Path
from charset_normalizer import detect
import pandas as pd

csv_path = Path("data/export.csv")

try:
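    with csv_path.open("rb") as fh:
        raw_sample = fh.read(50_000)  # sample only; avoids loading the whole file
    result = detect(raw_sample)
    detected_encoding = result.get("encoding")
    # confidence can be None when nothing was detected, so coerce it to 0.0
    confidence = result.get("confidence") or 0.0
    print(f"Detected: {detected_encoding}  confidence: {confidence:.2f}")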
except OSError as e:
    raise SystemExit(f"Cannot read file: {e}") from e

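Both libraries return a dictionary with the same keys: "encoding" (a str or None), "confidence" (a float), and "language". When nothing is detected, the encoding is None and the confidence may be None as well, so guard against both. Use either library; the rest of the workflow is identical.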
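Because the result dictionaries share a shape, the threshold check can live in one small helper. The following is a minimal sketch; usable_encoding is a hypothetical name, and the sample dictionaries are hand-written stand-ins rather than real detector output.

```python
from typing import Optional


def usable_encoding(result: dict, threshold: float = 0.7) -> Optional[str]:
    """Return the detected encoding only if the detector was confident enough."""
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0  # None is treated as 0.0
    if encoding and confidence > threshold:
        return encoding
    return None


# Hand-written examples in the shape both detectors return.
high = usable_encoding({"encoding": "Windows-1252", "confidence": 0.73})
low = usable_encoding({"encoding": "ascii", "confidence": 0.4})
empty = usable_encoding({"encoding": None, "confidence": None})
print(high, low, empty)
```

A None return value is the signal to move on to the fallback ladder.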

Step 3 — Encoding Detection Flow

Encoding detection decision flow (flowchart):
1. Open the file in binary mode.
2. Run chardet.detect() on a sample.
3. If confidence > 0.7, use the detected encoding.
4. Otherwise try encoding='utf-8-sig', then encoding='cp1252'.
5. Final fallback: encoding='latin-1' (never raises).

Step 4 — Fix with Explicit Encoding

cp1252 — Windows and Excel Exports

cp1252 is the right first choice for any file that originated on a Western-locale Windows machine or was saved by Excel's "Save As CSV" option. It maps the bytes 0x80 to 0x9F to typographic characters; in UTF-8 those same byte values are continuation bytes, which is why they fail when they appear at the start of a character.

# pip install pandas
from pathlib import Path
import pandas as pd

csv_path = Path("data/export.csv")

try:
    df = pd.read_csv(csv_path, encoding="cp1252", engine="python")
    print(f"Loaded {len(df):,} rows")
except (UnicodeDecodeError, OSError) as e:
    raise SystemExit(f"Load failed: {e}") from e

latin-1 — Universal Single-Byte Fallback

latin-1 (ISO-8859-1) maps every byte 0x00 to 0xFF directly to the first 256 Unicode code points. It never raises a UnicodeDecodeError, making it the final safety net when you have no idea what encoding was used. The trade-off: cp1252 characters in 0x80 to 0x9F come through as control characters rather than typographic symbols, so use it only after cp1252 fails.

# pip install pandas
from pathlib import Path
import pandas as pd

csv_path = Path("data/export.csv")

try:
    df = pd.read_csv(csv_path, encoding="latin-1", engine="python")
    print(df.dtypes)
except OSError as e:
    raise SystemExit(f"Cannot open file: {e}") from e
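The difference between the two single-byte codecs is easiest to see on raw bytes. This sketch decodes a made-up byte string (an illustrative assumption) with both codecs, then shows that a latin-1 decode can be reversed and redone as cp1252 because latin-1 loses no information.

```python
# Hypothetical bytes: 0xE9 is e-acute in both codecs, 0x96 is the cp1252 en-dash.
raw = b"caf\xe9 \x96 bar"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

print(repr(as_cp1252))  # 0x96 shows up as U+2013
print(repr(as_latin1))  # 0x96 shows up as the control character U+0096

# latin-1 is a one-to-one byte mapping, so the original bytes can be recovered
# and decoded again with the codec that was actually intended.
repaired = as_latin1.encode("latin-1").decode("cp1252")

# cp1252 leaves a few byte values undefined (0x81 is one), so it can still raise.
try:
    b"\x81".decode("cp1252")
    cp1252_raises = False
except UnicodeDecodeError:
    cp1252_raises = True
print(cp1252_raises)
```

This is why cp1252 sits before latin-1 in the ladder: cp1252 can fail and hand over, while latin-1 accepts everything.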

utf-8-sig — BOM Stripping

Files exported from Excel's UTF-8 CSV option often carry a three-byte BOM (EF BB BF) at the start. Decoded as plain 'utf-8', those bytes become the invisible character U+FEFF, so the first column name can end up as '\ufefforder_id' instead of 'order_id' and lookups by name fail even though the two look identical when printed. Using encoding='utf-8-sig' strips the BOM automatically before parsing.

# pip install pandas
from pathlib import Path
import pandas as pd

csv_path = Path("data/excel_utf8_export.csv")

try:
    df = pd.read_csv(csv_path, encoding="utf-8-sig")
    print(df.columns.tolist())  # ['order_id', 'amount', ...] — no BOM prefix
except (UnicodeDecodeError, OSError) as e:
    raise SystemExit(f"Load failed: {e}") from e

When exporting data to CSV formats from pandas you can write a BOM-free UTF-8 file with df.to_csv(path, encoding='utf-8', index=False) — this prevents the BOM problem entirely for downstream consumers.
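The BOM behavior can be checked at the codec level, independent of pandas (whose handling may vary by version). A short sketch with an invented two-line payload:

```python
import codecs

# Hypothetical file content, built in memory for illustration.
payload = "order_id,amount\n1,9.99\n"
with_bom = codecs.BOM_UTF8 + payload.encode("utf-8")  # EF BB BF followed by the data

plain = with_bom.decode("utf-8")         # BOM survives as U+FEFF
stripped = with_bom.decode("utf-8-sig")  # BOM is removed

first_plain = plain.split(",")[0]
first_stripped = stripped.split(",")[0]
print(repr(first_plain), repr(first_stripped))

# On the writing side, 'utf-8' emits no BOM and 'utf-8-sig' adds one.
no_bom_bytes = payload.encode("utf-8")
bom_bytes = payload.encode("utf-8-sig")
print(no_bom_bytes[:3], bom_bytes[:3])
```

Printing with repr() matters here, because U+FEFF is invisible in normal output.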

Step 5 — Handle Residual Bad Bytes with encoding_errors

Even with the correct encoding, some files contain isolated corrupted bytes — a copy-paste artifact, a transmission glitch. Use encoding_errors to control what happens when the parser hits them.

encoding_errors='replace'

Substitutes each undecodable byte with U+FFFD, the replacement character (displayed as �). The file loads completely; you can then find affected cells and convert them to pd.NA.

# pip install pandas
from pathlib import Path
import pandas as pd

csv_path = Path("data/mixed.csv")

try:
    df = pd.read_csv(
        csv_path,
        encoding="utf-8",
        encoding_errors="replace",
        engine="python",
    )
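    # Blank out every cell that contains U+FFFD. An exact-match df.replace()
    # would only catch cells that consist of nothing but that one character.
    has_bad = df.apply(lambda col: col.astype(str).str.contains("\ufffd", regex=False))
    df = df.mask(has_bad, pd.NA)
    print(f"Rows with any NA after replacement: {df.isna().any(axis=1).sum()}")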
except OSError as e:
    raise SystemExit(f"Cannot open file: {e}") from e

encoding_errors='backslashreplace'

Converts each bad byte to a Python escape sequence like \x96. This is useful for debugging because you can see exactly which bytes are causing trouble, but treat it as a diagnostic mode only and keep it out of production pipelines.

# pip install pandas
from pathlib import Path
import pandas as pd

csv_path = Path("data/mixed.csv")

try:
    df = pd.read_csv(
        csv_path,
        encoding="utf-8",
        encoding_errors="backslashreplace",
        engine="python",
    )
    # Find cells containing escape sequences to spot bad bytes
    mask = df.apply(lambda col: col.astype(str).str.contains(r"\\x", regex=True))
    print("Cells with escaped bytes:")
    print(df[mask.any(axis=1)])
except OSError as e:
    raise SystemExit(f"Cannot open file: {e}") from e

Never use encoding_errors='ignore'. It silently drops the offending bytes entirely. A value like caf\xe9 becomes caf, a shorter string, so you get truncated values and, in the worst case, misaligned fields, with no visible signal in the DataFrame that anything was lost.
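The three handlers can be compared side by side on a single byte string. This sketch uses Python's built-in codec error handlers directly, on the assumption that pandas forwards encoding_errors to the same handlers; the sample bytes are invented.

```python
# Hypothetical bytes: latin-1 text that is being read as UTF-8.
raw = b"caf\xe9 au lait"

results = {
    handler: raw.decode("utf-8", errors=handler)
    for handler in ("replace", "backslashreplace", "ignore")
}

for handler, text in results.items():
    print(f"{handler:>16}: {text!r} (length {len(text)})")
```

Only 'ignore' produces a string with no trace of the bad byte, which is what makes it unsafe.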

Step 6 — Automated Fallback Ladder

Combine detection with a ranked fallback sequence. This covers the full workflow: detect → high-confidence shortcut → manual ladder → give up gracefully.

# pip install chardet pandas
from pathlib import Path
import chardet
import pandas as pd

FALLBACK_ENCODINGS = ["utf-8-sig", "cp1252", "latin-1"]

def load_csv_robust(csv_path: Path) -> pd.DataFrame:
    """Load a CSV file, auto-detecting encoding with a manual fallback ladder."""
    try:
        with csv_path.open("rb") as fh:
            raw_sample = fh.read(50_000)  # sample only; avoids loading the whole file
    except OSError as e:
        raise SystemExit(f"Cannot read file: {e}") from e

    detected = chardet.detect(raw_sample)
    enc = detected.get("encoding")
    confidence = detected.get("confidence", 0.0)

    if enc and confidence > 0.7:
        try:
            return pd.read_csv(csv_path, encoding=enc, engine="python")
        except (UnicodeDecodeError, LookupError):
            pass  # fall through to manual ladder

    for encoding in FALLBACK_ENCODINGS:
        try:
            df = pd.read_csv(csv_path, encoding=encoding, engine="python")
            print(f"Loaded with fallback encoding: {encoding}")
            return df
        except UnicodeDecodeError:
            continue

    raise ValueError(
        f"Could not decode {csv_path.name} with any known encoding. "
        "Inspect with a hex editor."
    )

if __name__ == "__main__":
    df = load_csv_robust(Path("data/export.csv"))
    print(df.head())

The same pattern applies when reading Excel files with Python: .xlsx files are UTF-8 internally, but legacy .xls and CSV saves from Excel carry cp1252 or regional codecs.
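The order of the fallback ladder in Step 6 matters, and that can be checked on raw bytes without pandas. The sketch below is a simplified stand-in for the loader; first_decodable is a hypothetical helper and the inputs are invented.

```python
FALLBACK_ENCODINGS = ["utf-8-sig", "cp1252", "latin-1"]


def first_decodable(raw: bytes, ladder=FALLBACK_ENCODINGS) -> str:
    """Return the first encoding in the ladder that decodes raw without error."""
    for encoding in ladder:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no encoding in the ladder could decode the input")


utf8_case = first_decodable("caf\u00e9".encode("utf-8"))  # valid UTF-8 input
cp1252_case = first_decodable(b"2019\x962021")            # 0x96 rejects UTF-8
latin1_case = first_decodable(b"\x81")                    # undefined in cp1252
print(utf8_case, cp1252_case, latin1_case)

# With latin-1 first, every input "succeeds" and the stricter codecs never run.
wrong_order = first_decodable(b"2019\x962021", ["latin-1", "cp1252"])
print(wrong_order)
```

Since latin-1 accepts any byte sequence, it only makes sense as the last rung.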

Step 7 — Verify the Fix

A row-count assertion and a spot-check catch silent failures before data reaches downstream code. The assertion catches missing or extra rows; the spot-check surfaces U+FFFD replacement characters, for example when the file was loaded with encoding_errors='replace' but you expected clean data.

# pip install pandas
from pathlib import Path
import pandas as pd

csv_path = Path("data/export.csv")
EXPECTED_ROWS = 10_000  # set from source system record count

try:
    df = pd.read_csv(csv_path, encoding="cp1252", engine="python")
except (UnicodeDecodeError, OSError) as e:
    raise SystemExit(f"Load failed: {e}") from e

# Row count assertion
assert len(df) == EXPECTED_ROWS, (
    f"Expected {EXPECTED_ROWS} rows, got {len(df)}. "
    "Check for header/footer rows or encoding truncation."
)

# Spot-check for U+FFFD replacement characters (indicates residual bad bytes)
replacement_count = int(
    df.apply(lambda c: c.astype(str).str.contains("\ufffd", regex=False)).sum().sum()
)
if replacement_count > 0:
    print(f"WARNING: {replacement_count} cells contain replacement characters.")
else:
    print("Verification passed.")
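A related spot-check targets the latin-1 trade-off from Step 4: if a cp1252 file was loaded as latin-1, typographic characters appear as C1 control characters (U+0080 to U+009F). The sketch below counts such strings; count_c1_controls is a hypothetical helper, and the sample list stands in for the values of a DataFrame column.

```python
import re

# C1 control range; these rarely appear in legitimate text.
C1_CONTROLS = re.compile("[\u0080-\u009f]")


def count_c1_controls(values) -> int:
    """Count strings that contain at least one C1 control character."""
    return sum(
        1 for v in values if isinstance(v, str) and C1_CONTROLS.search(v)
    )


# Hypothetical column values: one correct en-dash, one mis-decoded as latin-1.
sample = ["2019\u20132021", b"2019\x962021".decode("latin-1"), None, "plain"]
suspect = count_c1_controls(sample)
print(suspect)
```

A non-zero count suggests reloading the file with cp1252 instead of latin-1.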

Quick-Reference: Encoding Choices

Situation | Recommended encoding | Notes
chardet confidence > 0.7 | Use detected encoding | Read 50 KB minimum
Excel "Save As CSV" on Windows | cp1252 | Covers smart quotes, en/em-dash
Excel "UTF-8 CSV" with BOM | utf-8-sig | Strips EF BB BF automatically
Unknown legacy file, any platform | latin-1 | Never raises; use as last resort
Isolated corrupted bytes | encoding_errors='replace' | Replace with pd.NA
Debugging bad bytes | encoding_errors='backslashreplace' | Shows \x96 escapes; diagnostic only
Avoid | encoding_errors='ignore' | Drops bytes silently → column shifts

Common Mistakes

Mistake | Impact | Resolution
encoding_errors='ignore' | Silently drops bytes, causes column misalignment | Use 'replace' and convert to pd.NA
Assuming all CSVs are UTF-8 | Immediate crash on legacy exports | Detect with chardet or try cp1252 first
Not reading a binary sample for chardet | Text-mode reading decodes the file before chardet sees the raw bytes | Always use read_bytes() or 'rb' mode
Ignoring low-confidence detections | Wrong encoding applied, garbled strings | Treat confidence < 0.7 as unknown; use fallback ladder
Skipping engine='python' | C engine may have more limited codec fallback support | Add engine='python' if the default engine fails on a non-UTF-8 codec

Encoding issues are one facet of a larger ingestion problem. The Cleaning Messy CSV Data with Pandas guide continues from a successfully loaded DataFrame into type coercion, whitespace normalization, and duplicate handling.


