Fixing UnicodeDecodeError in CSV Files: 'utf-8' codec can't decode byte 0x96
The error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 142: invalid start byte stops your script cold every time you try to load a legacy or Windows-exported CSV with pandas. This page gives you a deterministic workflow: detect the real encoding, apply the right fix, handle edge cases like BOM and mixed encodings, and verify the result.
Root Cause
Pandas defaults to UTF-8. Byte 0x96 is a valid Windows-1252 (cp1252) character, the en dash (–), but in UTF-8 it is a continuation byte and can never start a character. The moment the parser hits that byte it raises and halts; there is no partial load. The same pattern applies to the rest of the 0x80–0x9F range, where cp1252 places the ellipsis (0x85), curly quotes (0x91–0x94), the bullet (0x95), and the em dash (0x97). Regional ERP exports, older Excel CSV saves, and accounting software from any era are the common culprits. For a broader look at which parser handles which file quirks, see Best Python Libraries for CSV Parsing.
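The failure can be reproduced without pandas or a file. The snippet below is a minimal sketch using only the standard library; the byte string is a made-up cell value, not data from a real export.

```python
# Hypothetical cell value containing a cp1252 en dash (byte 0x96)
raw = b"Q1 \x96 Q2"

failure = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the byte offset of the offending byte
    failure = (exc.start, exc.reason)
    print(f"UTF-8 failed at byte {exc.start}: {exc.reason}")

# The same bytes decode cleanly as cp1252
text = raw.decode("cp1252")
print(text)
```

The offset in the exception plays the same role as the "position 142" in the pandas traceback: it tells you where in the byte stream to look.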
Step 1 — Detect the Encoding with chardet
Before guessing, measure. chardet reads a binary sample and returns a codec name and a confidence score.
# pip install chardet pandas
from pathlib import Path

import chardet
import pandas as pd

csv_path = Path("data/export.csv")

try:
    raw_sample = csv_path.read_bytes()[:50_000]  # first 50 KB is enough
    result = chardet.detect(raw_sample)
    print(result)  # {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
except OSError as e:
    raise SystemExit(f"Cannot read file: {e}") from e
As a working rule, treat a confidence score above 0.7 as reliable enough to use directly. Below 0.7 the detector is uncertain, so fall back to the manual ladder in Step 6. Always read in binary mode (read_bytes() or 'rb') so that no decoding happens before chardet has seen the raw bytes.
Step 2 — Detect with charset-normalizer (Alternative)
charset-normalizer is an alternative detector, useful when chardet is not available in your environment. Its detect() function intentionally mirrors chardet's API.
# pip install charset-normalizer pandas
from pathlib import Path

from charset_normalizer import detect
import pandas as pd

csv_path = Path("data/export.csv")

try:
    raw_sample = csv_path.read_bytes()[:50_000]
    result = detect(raw_sample)
    detected_encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0  # guard against a None confidence
    print(f"Detected: {detected_encoding} confidence: {confidence:.2f}")
except OSError as e:
    raise SystemExit(f"Cannot read file: {e}") from e
Both libraries return the same dictionary shape: {"encoding": str | None, "confidence": float}, plus a "language" key. Use either; the rest of the workflow is identical.
Step 3 — Encoding Detection Flow
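The decision flow described in Steps 1 and 6 can be sketched as a small pure-Python function: take the detector's answer if its confidence clears the threshold, then fall back to the fixed ladder. The function name `candidate_encodings` and the constant names are illustrative assumptions, not part of chardet or pandas.

```python
# Ladder and threshold taken from the text; the names are illustrative
FALLBACK_ENCODINGS = ["utf-8-sig", "cp1252", "latin-1"]
CONFIDENCE_THRESHOLD = 0.7


def candidate_encodings(detection, threshold=CONFIDENCE_THRESHOLD):
    """Return encodings to try, in order, for a chardet-style result dict."""
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0  # treat None as 0.0

    candidates = []
    # High-confidence shortcut: try the detected codec first
    if encoding and confidence > threshold:
        candidates.append(encoding)
    # Manual ladder: always appended, skipping exact duplicates
    for fallback in FALLBACK_ENCODINGS:
        if fallback not in candidates:
            candidates.append(fallback)
    return candidates


high = candidate_encodings({"encoding": "Windows-1252", "confidence": 0.73})
low = candidate_encodings({"encoding": "ISO-8859-9", "confidence": 0.41})
none = candidate_encodings({"encoding": None, "confidence": 0.0})
print(high)
print(low)
print(none)
```

A caller would then try each returned name in turn with pd.read_csv and stop at the first one that does not raise.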
Step 4 — Fix with Explicit Encoding
cp1252 — Windows and Excel Exports
cp1252 is the right first choice for any file that originated on a Windows machine or was saved by Excel's "Save As CSV" option. It maps most of 0x80–0x9F to typographic characters such as curly quotes and dashes; in UTF-8 those same byte values are continuation bytes and are never valid on their own.
# pip install pandas
from pathlib import Path

import pandas as pd

csv_path = Path("data/export.csv")

try:
    df = pd.read_csv(csv_path, encoding="cp1252", engine="python")
    print(f"Loaded {len(df):,} rows")
except (UnicodeDecodeError, OSError) as e:
    raise SystemExit(f"Load failed: {e}") from e
latin-1 — Universal Single-Byte Fallback
latin-1 (ISO-8859-1) maps every byte 0x00–0xFF directly to the first 256 Unicode code points. It never raises a UnicodeDecodeError, making it the final safety net when you have no idea what encoding was used. The trade-off: cp1252 characters in 0x80–0x9F come through as control characters rather than typographic symbols, so use it only after cp1252 fails.
# pip install pandas
from pathlib import Path

import pandas as pd

csv_path = Path("data/export.csv")

try:
    df = pd.read_csv(csv_path, encoding="latin-1", engine="python")
    print(df.dtypes)
except OSError as e:
    raise SystemExit(f"Cannot open file: {e}") from e
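To make the cp1252 versus latin-1 trade-off concrete, the sketch below decodes the same made-up bytes both ways using only the standard library. It also shows why cp1252 can fail at all: Python's cp1252 codec leaves five byte values unassigned, while latin-1 accepts every byte.

```python
# Hypothetical bytes: curly quotes (0x93, 0x94) and an en dash (0x96)
sample = b"\x93Hi\x94 \x96 ok"

as_cp1252 = sample.decode("cp1252")   # typographic characters
as_latin1 = sample.decode("latin-1")  # C1 control characters instead
print(repr(as_cp1252))
print(repr(as_latin1))

# Bytes that Python's cp1252 codec treats as undefined
undefined = bytes([0x81, 0x8D, 0x8F, 0x90, 0x9D])
cp1252_failures = []
for value in undefined:
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        cp1252_failures.append(value)

# latin-1 maps every byte to a code point, so this never raises
latin1_ok = undefined.decode("latin-1")
print([hex(v) for v in cp1252_failures])
```

If a file loads under latin-1 but its text shows invisible characters where quotes or dashes should be, it was most likely cp1252 all along.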
utf-8-sig — BOM Stripping
Files exported from Excel's UTF-8 CSV option often carry a three-byte BOM (EF BB BF) at the start. A reader that decodes the file as plain UTF-8 and does not strip the BOM itself leaves an invisible U+FEFF character glued to the first column name: the column is really '\ufefforder_id' instead of 'order_id', so lookups by name fail. Using encoding='utf-8-sig' strips the BOM automatically before parsing.
# pip install pandas
from pathlib import Path

import pandas as pd

csv_path = Path("data/excel_utf8_export.csv")

try:
    df = pd.read_csv(csv_path, encoding="utf-8-sig")
    print(df.columns.tolist())  # ['order_id', 'amount', ...] with no BOM prefix
except (UnicodeDecodeError, OSError) as e:
    raise SystemExit(f"Load failed: {e}") from e
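The effect of the BOM on the header can be seen at the codec level, independent of any CSV parser. This is a minimal standard-library sketch; the header and row are invented for illustration.

```python
import codecs

# Hypothetical file content: UTF-8 BOM followed by a header and one row
raw = codecs.BOM_UTF8 + b"order_id,amount\n1,9.99\n"

# Plain UTF-8 keeps the BOM as U+FEFF on the first field
plain_header = raw.decode("utf-8").splitlines()[0].split(",")
# utf-8-sig removes it during decoding
sig_header = raw.decode("utf-8-sig").splitlines()[0].split(",")

print(plain_header)
print(sig_header)

# utf-8-sig is also safe on files that have no BOM
no_bom = b"order_id,amount\n".decode("utf-8-sig")
```

Because utf-8-sig decodes BOM-free input unchanged, it is a reasonable default for any file expected to be UTF-8.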
When exporting data to CSV formats from pandas you can write a BOM-free UTF-8 file with df.to_csv(path, encoding='utf-8', index=False) — this prevents the BOM problem entirely for downstream consumers.
Step 5 — Handle Residual Bad Bytes with encoding_errors
Even with the correct encoding, some files contain isolated corrupted bytes — a copy-paste artifact, a transmission glitch. Use encoding_errors to control what happens when the parser hits them.
encoding_errors='replace'
Substitutes each undecodable byte with U+FFFD, the replacement character (usually rendered as �). The file loads completely; you can then find affected cells and convert them to pd.NA.
# pip install pandas
from pathlib import Path

import pandas as pd

csv_path = Path("data/mixed.csv")

try:
    df = pd.read_csv(
        csv_path,
        encoding="utf-8",
        encoding_errors="replace",
        engine="python",
    )
except OSError as e:
    raise SystemExit(f"Cannot open file: {e}") from e

# Flag every cell that contains U+FFFD anywhere, not only cells equal to it
has_bad = df.apply(lambda col: col.astype(str).str.contains("\ufffd", regex=False))
# Set the flagged cells to pd.NA, touching only the affected columns
for col in df.columns[has_bad.any().to_numpy()]:
    df.loc[has_bad[col], col] = pd.NA
print(f"Rows with replaced bytes: {has_bad.any(axis=1).sum()}")
encoding_errors='backslashreplace'
Converts each bad byte to a Python escape sequence like \x96. This is useful for debugging because you can see exactly which bytes are causing trouble, but treat it as a diagnostic mode only and keep it out of production pipelines.
# pip install pandas
from pathlib import Path

import pandas as pd

csv_path = Path("data/mixed.csv")

try:
    df = pd.read_csv(
        csv_path,
        encoding="utf-8",
        encoding_errors="backslashreplace",
        engine="python",
    )
    # Find cells containing escape sequences to spot bad bytes
    mask = df.apply(lambda col: col.astype(str).str.contains(r"\\x", regex=True))
    print("Cells with escaped bytes:")
    print(df[mask.any(axis=1)])
except OSError as e:
    raise SystemExit(f"Cannot open file: {e}") from e
Never use encoding_errors='ignore'. It silently drops the offending bytes entirely. A value like caf\xe9 becomes caf, a shorter string with nothing to mark the loss. The result is altered values and data loss that has no visible signal in the DataFrame.
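The three handlers can be compared side by side at the codec level. pandas takes them as the encoding_errors argument, and bytes.decode takes the same names as errors. The sketch below uses an invented byte string and only the standard library.

```python
# Hypothetical value: "café au lait" saved as latin-1/cp1252, read as UTF-8
raw = b"caf\xe9 au lait"

replaced = raw.decode("utf-8", errors="replace")
escaped = raw.decode("utf-8", errors="backslashreplace")
ignored = raw.decode("utf-8", errors="ignore")

print(repr(replaced))  # bad byte becomes U+FFFD, so the damage is searchable
print(repr(escaped))   # bad byte becomes the literal text \xe9
print(repr(ignored))   # bad byte disappears with no marker
```

Only the first two outputs leave a trace that later verification can search for.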
Step 6 — Automated Fallback Ladder
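Combine detection with a ranked fallback sequence. This covers the full workflow: detect, take the high-confidence shortcut, and otherwise walk the manual ladder. Because latin-1 accepts every byte, the last rung always decodes; the final ValueError is a guard that fires only if latin-1 is removed from the list.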
# pip install chardet pandas
from pathlib import Path

import chardet
import pandas as pd

FALLBACK_ENCODINGS = ["utf-8-sig", "cp1252", "latin-1"]


def load_csv_robust(csv_path: Path) -> pd.DataFrame:
    """Load a CSV file, auto-detecting encoding with a manual fallback ladder."""
    try:
        raw_sample = csv_path.read_bytes()[:50_000]
    except OSError as e:
        raise SystemExit(f"Cannot read file: {e}") from e

    detected = chardet.detect(raw_sample)
    enc = detected.get("encoding")
    confidence = detected.get("confidence", 0.0)

    if enc and confidence > 0.7:
        try:
            return pd.read_csv(csv_path, encoding=enc, engine="python")
        except (UnicodeDecodeError, LookupError):
            pass  # fall through to manual ladder

    for encoding in FALLBACK_ENCODINGS:
        try:
            df = pd.read_csv(csv_path, encoding=encoding, engine="python")
            print(f"Loaded with fallback encoding: {encoding}")
            return df
        except UnicodeDecodeError:
            continue

    raise ValueError(
        f"Could not decode {csv_path.name} with any known encoding. "
        "Inspect with a hex editor."
    )


if __name__ == "__main__":
    df = load_csv_robust(Path("data/export.csv"))
    print(df.head())
The same pattern applies when reading Excel files with Python — .xlsx files are always UTF-8 internally, but legacy .xls and CSV saves from Excel carry cp1252 or regional codecs.
Step 7 — Verify the Fix
A row-count assertion and a spot-check catch silent failures before data reaches downstream code. The assertion catches dropped or extra rows; the spot-check catches replacement characters left behind when a file was loaded with encoding_errors='replace' but you expected clean data.
# pip install pandas
from pathlib import Path

import pandas as pd

csv_path = Path("data/export.csv")
EXPECTED_ROWS = 10_000  # set from source system record count

try:
    df = pd.read_csv(csv_path, encoding="cp1252", engine="python")
except (UnicodeDecodeError, OSError) as e:
    raise SystemExit(f"Load failed: {e}") from e

# Row count assertion
assert len(df) == EXPECTED_ROWS, (
    f"Expected {EXPECTED_ROWS} rows, got {len(df)}. "
    "Check for header/footer rows or encoding truncation."
)

# Spot-check for replacement characters (indicates residual bad bytes)
replacement_count = (
    df.apply(lambda c: c.astype(str).str.contains("\ufffd", regex=False)).sum().sum()
)
if replacement_count > 0:
    print(f"WARNING: {replacement_count} cells contain replacement characters.")
else:
    print("Verification passed.")
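A complementary check, sketched here as an assumption rather than as part of the workflow above, looks for C1 control characters (U+0080 to U+009F). Their presence usually means cp1252 data was decoded as latin-1. The helper name `find_c1_controls` is hypothetical, and the sketch uses only the standard library.

```python
import re

# C1 control characters; ordinary text almost never contains them
C1_CONTROLS = re.compile("[\u0080-\u009f]")


def find_c1_controls(values):
    """Return (index, value) pairs for strings containing C1 controls."""
    return [
        (i, v)
        for i, v in enumerate(values)
        if isinstance(v, str) and C1_CONTROLS.search(v)
    ]


# Hypothetical bytes decoded the right way and the wrong way
good = b"Q1 \x96 Q2".decode("cp1252")
bad = b"Q1 \x96 Q2".decode("latin-1")

hits = find_c1_controls(["plain", good, bad])
print(hits)
```

With a DataFrame, the same helper could be applied to one column at a time, for example find_c1_controls(df["name"].tolist()), where "name" stands in for one of your own text columns.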
Quick-Reference: Encoding Choices
| Situation | Recommended encoding | Notes |
|---|---|---|
| chardet confidence > 0.7 | Use detected encoding | Sample the first 50 KB in binary mode |
| Excel "Save As CSV" on Windows | cp1252 | Covers smart quotes, en/em-dash |
| Excel "UTF-8 CSV" with BOM | utf-8-sig | Strips EF BB BF automatically |
| Unknown legacy file, any platform | latin-1 | Never raises; use as last resort |
| Isolated corrupted bytes | encoding_errors='replace' | Replace � with pd.NA |
| Debugging bad bytes | encoding_errors='backslashreplace' | Shows \x96 escapes — diagnostic only |
| Avoid | encoding_errors='ignore' | Drops bytes silently → undetectable data loss |
Common Mistakes
| Mistake | Impact | Resolution |
|---|---|---|
| Using encoding_errors='ignore' | Silently drops bytes and alters values with no visible signal | Use 'replace' and convert � to pd.NA |
| Assuming all CSVs are UTF-8 | Immediate crash on legacy exports | Detect with chardet or try cp1252 first |
| Not reading a binary sample for chardet | Text mode decodes the bytes first, so chardet never sees the raw data and the read can raise the same error | Always use read_bytes() or 'rb' mode |
| Ignoring low-confidence detections | Wrong encoding applied, garbled strings | Treat confidence < 0.7 as unknown; use fallback ladder |
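| Treating engine='python' as an encoding fix | The engine choice does not change how bytes are decoded, and the Python engine is slower | Set encoding (and encoding_errors if needed) explicitly; pick the engine for parsing features |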
Encoding issues are one facet of a larger ingestion problem. The Cleaning Messy CSV Data with Pandas guide continues from a successfully loaded DataFrame into type coercion, whitespace normalization, and duplicate handling.
Related
- Cleaning Messy CSV Data with Pandas — post-ingestion cleaning workflows
- Best Python Libraries for CSV Parsing — chardet, charset-normalizer, and parser comparisons
- Reading Excel Files with Python — encoding considerations for .xls and legacy Excel formats
- Exporting Data to CSV Formats — write BOM-free UTF-8 to prevent downstream encoding errors
- Python for Excel & CSV Data Processing — full pipeline overview