How to Extract Tables from Scanned PDFs
pdfplumber and camelot return empty results on scanned documents because they parse the PDF content stream, which contains no text objects when a document was photocopied or printed and then scanned. The fix is an OCR pipeline: render each page to a high-DPI image, extract text with spatial coordinates via Tesseract, then reconstruct row structure by clustering y-coordinates.
Root Cause
A scanned PDF is a wrapper around one or more raster images. Its content streams contain no text-showing operators, and it has no font dictionaries or vector line objects. Any library that reads those structures (pdfplumber, camelot, tabula-py) returns empty results rather than raising an error. The symptom:
import pdfplumber

with pdfplumber.open("scanned_report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()
    print(tables)  # → []

import camelot

tables = camelot.read_pdf("scanned_report.pdf", flavor="lattice")
print(tables.n)  # → 0
Minimal Diagnostic
Before building a pipeline, confirm you are actually dealing with a scanned PDF:
# pip install pymupdf
from pathlib import Path

import fitz  # PyMuPDF


def is_scanned(path: Path) -> bool:
    """Return True if the PDF has no selectable text on any page."""
    try:
        # The context manager closes the document even if get_text() raises
        with fitz.open(str(path)) as doc:
            total_chars = sum(len(page.get_text("text").strip()) for page in doc)
    except Exception as e:
        raise RuntimeError(f"Could not inspect {path}: {e}") from e
    return total_chars == 0


if __name__ == "__main__":
    pdf = Path("data/scanned_report.pdf")
    if is_scanned(pdf):
        print("No text layer: use OCR pipeline")
    else:
        print("Text layer present: use pdfplumber or camelot")
A result of True confirms the OCR path is required. A small but non-zero character count (under 50) often indicates a partially OCR'd scan where Tesseract was run at low quality. is_scanned() returns False for such files, so check the count yourself and treat them the same as a full scan.
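One way to apply a character cutoff like the one mentioned above is to compare the total against a threshold instead of zero. The sketch below assumes the per-page text has already been collected (for example with page.get_text("text")); the name needs_ocr and the 50-character default are illustrative and not part of PyMuPDF.

```python
def needs_ocr(page_texts: list[str], min_chars: int = 50) -> bool:
    """Return True when the combined text layer is empty or too small to trust.

    page_texts: one string per page, e.g. collected with page.get_text("text").
    min_chars: hypothetical cutoff; tune it for your own documents.
    """
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars < min_chars


if __name__ == "__main__":
    print(needs_ocr(["", ""]))                     # no text layer at all
    print(needs_ocr(["Page 1", ""]))               # only a stray page label
    print(needs_ocr(["Quarterly revenue by region and product line " * 3]))
```

A file that fails this check goes down the OCR path even though it technically contains some text.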
Prerequisites
Install system binaries and Python packages before running the pipeline:
# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils
# Python packages
pip install pdf2image pytesseract pandas pymupdf
Verify Tesseract is on the PATH:
tesseract --version
# Expected: tesseract 4.x or 5.x
If Tesseract is not found, set the path in code: pytesseract.pytesseract.tesseract_cmd = "/usr/bin/tesseract". For a full walkthrough of the Tesseract not-found error on Linux, see Scanning and OCR Processing with Python.
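The same PATH check can be done from Python before any OCR call. This sketch relies only on the standard library's shutil.which; the helper name find_binary and the fallback location are assumptions for illustration.

```python
import shutil
from pathlib import Path
from typing import Optional


def find_binary(name: str = "tesseract",
                fallback: str = "/usr/bin/tesseract") -> Optional[str]:
    """Return a path to the named binary, or None if it cannot be found."""
    found = shutil.which(name)          # searches the directories on PATH
    if found:
        return found
    if Path(fallback).is_file():        # hypothetical fallback location
        return fallback
    return None


if __name__ == "__main__":
    cmd = find_binary()
    if cmd is None:
        print("Tesseract not found: install tesseract-ocr or set tesseract_cmd")
    else:
        print(f"Using Tesseract at {cmd}")
        # With pytesseract installed, the path could then be applied as:
        # pytesseract.pytesseract.tesseract_cmd = cmd
```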
Step 1: Render Pages to Images
pdf2image.convert_from_path calls Poppler's pdftoppm under the hood. Use at least 300 DPI — lower resolutions blur character edges and cause Tesseract to merge adjacent cell text.
# pip install pdf2image
from pathlib import Path

from pdf2image import convert_from_path
from PIL import Image  # installed with pdf2image

PDF_PATH = Path("data/scanned_report.pdf")


def render_pages(path: Path, dpi: int = 300) -> list[Image.Image]:
    """Render all PDF pages to PIL Image objects at the specified DPI."""
    try:
        images = convert_from_path(str(path), dpi=dpi)
    except Exception as e:
        raise RuntimeError(
            f"pdf2image failed on {path}. Ensure poppler-utils is installed: {e}"
        ) from e
    print(f"Rendered {len(images)} page(s) at {dpi} DPI")
    return images


if __name__ == "__main__":
    pages = render_pages(PDF_PATH)
DPI guidance:
| Scan quality | Recommended DPI |
|---|---|
| Clean laser print | 200 |
| Typical office scan | 300 |
| Old or faint document | 400–600 |
| Mixed content with small text | 400 |
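Higher DPI costs memory as well as time, because the pixel count grows with the square of the DPI. As a back-of-the-envelope sketch, assuming US Letter pages (8.5 × 11 inches) held as uncompressed 8-bit RGB at three bytes per pixel (the helper name is illustrative):

```python
def rendered_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int, float]:
    """Return (width_px, height_px, approx_megabytes) for one RGB page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    megabytes = width_px * height_px * 3 / 1_000_000   # 3 bytes per RGB pixel
    return width_px, height_px, megabytes


if __name__ == "__main__":
    for dpi in (200, 300, 400, 600):
        w, h, mb = rendered_size(8.5, 11, dpi)
        print(f"{dpi} DPI: {w} x {h} px, about {mb:.0f} MB per page")
```

For long documents, convert_from_path accepts first_page and last_page arguments, so pages can be rendered in batches rather than all at once.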
Step 2: Extract OCR Data with Spatial Coordinates
pytesseract.image_to_data() returns word-level bounding boxes alongside recognized text. This spatial data is essential for reconstructing rows — without it you only get a flat string.
# pip install pytesseract
from PIL import Image
import pytesseract


def ocr_page(image: Image.Image, min_confidence: int = 60) -> list[tuple[str, int, int]]:
    """
    Run Tesseract on a single page image.
    Returns a list of (text, x_left, y_top) tuples for words above the confidence threshold.
    """
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    tokens = []
    for i in range(len(data["text"])):
        text = data["text"][i].strip()
        # Depending on the Tesseract/pytesseract versions, conf can arrive as an
        # int, a float, or a string such as "96.5"; float() accepts all three.
        conf = float(data["conf"][i])
        if text and conf >= min_confidence:
            tokens.append((text, data["left"][i], data["top"][i]))
    return tokens
min_confidence=60 filters out noise (speckles, bleed-through) that Tesseract assigns low scores. Raise to 70–75 for cleaner scans; lower to 50 for degraded documents.
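To see what the threshold does, the sketch below applies the same filter to a hand-written dictionary shaped like the image_to_data output (parallel lists under "text", "conf", "left" and "top"). The sample values are invented for illustration and are not real OCR output; the entry with a confidence of -1 stands for the non-word layout rows that Tesseract emits for blocks and lines.

```python
def filter_tokens(data: dict, min_confidence: float = 60) -> list[tuple[str, int, int]]:
    """Keep (text, left, top) for non-empty words at or above the threshold."""
    tokens = []
    for text, conf, left, top in zip(data["text"], data["conf"], data["left"], data["top"]):
        text = text.strip()
        # conf may be an int, a float, or a numeric string depending on versions
        if text and float(conf) >= min_confidence:
            tokens.append((text, left, top))
    return tokens


# Invented sample shaped like pytesseract.Output.DICT (not real OCR output)
sample = {
    "text": ["", "Region", "Revenue", "~", "North", "1,204"],
    "conf": [-1, 96, "91.5", 23, 72, 58],
    "left": [0, 120, 480, 300, 120, 480],
    "top": [0, 200, 201, 230, 262, 263],
}

if __name__ == "__main__":
    for threshold in (50, 60, 75):
        kept = [token[0] for token in filter_tokens(sample, threshold)]
        print(threshold, kept)
```

In this sample, lowering the threshold to 50 recovers the numeric cell, but on a real scan it would also admit noise with similar scores.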
Step 3: Reconstruct Table Rows by Y-Clustering
Group tokens into rows by dividing each y-coordinate by a bucket height (row_tolerance) and rounding the result to an integer key, then sort each bucket by x-coordinate. This replaces the missing vector-line metadata.
def tokens_to_rows(
    tokens: list[tuple[str, int, int]],
    row_tolerance: int = 15,
) -> list[list[str]]:
    """
    Cluster OCR tokens into table rows using y-coordinate bucketing.
    row_tolerance: pixel height of one row bucket (tune per document font size).
    """
    row_map: dict[int, list[tuple[int, str]]] = {}
    for text, x, y in tokens:
        bucket = round(y / row_tolerance)  # integer bucket key
        row_map.setdefault(bucket, []).append((x, text))
    rows = []
    for bucket_key in sorted(row_map):
        row_map[bucket_key].sort(key=lambda item: item[0])  # sort left-to-right
        rows.append([cell[1] for cell in row_map[bucket_key]])
    return rows
Tuning row_tolerance: Start with 15 pixels at 300 DPI. If rows merge, lower it to 10. If single rows split across two buckets, raise it to 20. For variable-height rows (bold headers vs body text), post-process by merging adjacent short rows.
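Fixed buckets have a boundary problem: two words on the same line can land in different buckets when their y values straddle a bucket edge (with row_tolerance=15, y=22 falls in bucket 1 and y=23 in bucket 2). An alternative, sketched here as a different technique from the bucketing above, is to sort tokens by y and start a new row only when the vertical gap to the previous token exceeds a threshold. The function name and the max_gap default are illustrative.

```python
def tokens_to_rows_by_gap(
    tokens: list[tuple[str, int, int]],
    max_gap: int = 10,
) -> list[list[str]]:
    """Cluster (text, x, y) tokens into rows by gaps between sorted y values."""
    rows: list[list[tuple[int, str]]] = []
    prev_y = None
    for text, x, y in sorted(tokens, key=lambda t: t[2]):   # top to bottom
        if prev_y is None or y - prev_y > max_gap:
            rows.append([])                                 # gap found: new row
        rows[-1].append((x, text))
        prev_y = y
    # Sort each row left to right and drop the x values
    return [[text for _, text in sorted(row)] for row in rows]


if __name__ == "__main__":
    tokens = [
        ("Region", 120, 22), ("Revenue", 480, 23),   # same line, straddles a bucket edge
        ("North", 120, 61), ("1,204", 480, 60),
    ]
    print(tokens_to_rows_by_gap(tokens))
```

Because each token is compared with the one before it, max_gap should stay well below the line spacing, or neighbouring rows can chain together.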
Step 4: Export to DataFrame
# pip install pandas
from pathlib import Path

import pandas as pd

OUTPUT_PATH = Path("output/scanned_table.csv")


def rows_to_dataframe(rows: list[list[str]], header_row: int = 0) -> pd.DataFrame:
    """Convert row-list to DataFrame, using rows[header_row] as column names."""
    if not rows:
        return pd.DataFrame()
    # Pad all rows to the same width
    max_cols = max(len(r) for r in rows)
    padded = [r + [""] * (max_cols - len(r)) for r in rows]
    header = padded[header_row]
    data = padded[header_row + 1:]
    df = pd.DataFrame(data, columns=header)
    df.replace("", pd.NA, inplace=True)
    return df


if __name__ == "__main__":
    # parents=True also creates missing parent directories
    OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    # Assumes ocr_page (Step 2) and tokens_to_rows (Step 3) are defined in this module
    from pdf2image import convert_from_path

    PDF_PATH = Path("data/scanned_report.pdf")
    images = convert_from_path(str(PDF_PATH), dpi=300)
    all_rows: list[list[str]] = []
    for img in images:
        tokens = ocr_page(img)
        all_rows.extend(tokens_to_rows(tokens))
    df = rows_to_dataframe(all_rows)
    df.to_csv(OUTPUT_PATH, index=False)
    print(f"Exported {len(df)} rows × {df.shape[1]} cols to {OUTPUT_PATH}")
    print(df.head())
Variant Fix: Multi-Page Tables with Consistent Headers
When the scanned document spans many pages and each page repeats the table header, deduplicate before concatenating — the same technique used in Extracting Tables from PDFs for native PDFs applies here:
# pip install pandas
import pandas as pd


def merge_scanned_pages(page_rows: list[list[list[str]]]) -> pd.DataFrame:
    """
    Merge rows from multiple scanned pages, removing repeated header rows.
    page_rows: list of row-lists, one per page.
    """
    if not page_rows or not page_rows[0]:
        return pd.DataFrame()
    header = page_rows[0][0]  # canonical header from first page
    all_data: list[list[str]] = []
    for rows in page_rows:
        for row in rows:
            if row == header:
                continue  # skip the header on every page, including the first
            all_data.append(row)
    max_cols = max((len(r) for r in all_data), default=0)
    if max_cols == 0:
        return pd.DataFrame()
    # Never pad to fewer columns than the header, or the column count will not match
    max_cols = max(max_cols, len(header))
    padded = [r + [""] * (max_cols - len(r)) for r in all_data]
    extra_cols = [f"extra_{i}" for i in range(max_cols - len(header))]
    df = pd.DataFrame(padded, columns=header + extra_cols)
    return df
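The exact comparison row == header only works when OCR reads the header identically on every page. On real scans, a single misread character is enough to let a repeated header through as data. One possible workaround, sketched here with the standard library's difflib, is to compare the joined row text against the header and skip rows above a similarity cutoff. The helper name and the 0.85 cutoff are assumptions to tune, not values taken from the pipeline above.

```python
from difflib import SequenceMatcher


def is_repeated_header(row: list[str], header: list[str], cutoff: float = 0.85) -> bool:
    """Return True if row is close enough to header to be treated as a repeat."""
    row_text = " ".join(row).lower()
    header_text = " ".join(header).lower()
    return SequenceMatcher(None, row_text, header_text).ratio() >= cutoff


if __name__ == "__main__":
    header = ["Region", "Units", "Revenue"]
    print(is_repeated_header(["Region", "Units", "Revenue"], header))   # exact repeat
    print(is_repeated_header(["Reglon", "Units", "Revenue"], header))   # one OCR misread
    print(is_repeated_header(["North", "42", "1,204"], header))         # data row
```

In merge_scanned_pages, this check would take the place of the row == header test.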
Image Pre-Processing for Low-Quality Scans
If Tesseract accuracy is poor (many garbage tokens even at min_confidence=50), pre-process the image before calling image_to_data. Improving contrast and binarising the image gives Tesseract cleaner character boundaries.
# pip install pytesseract pillow opencv-python-headless
from PIL import Image, ImageOps
import cv2
import numpy as np
import pytesseract


def preprocess_for_ocr(image: Image.Image) -> Image.Image:
    """
    Convert to greyscale, binarize, and denoise a scan image before OCR.
    Returns a processed PIL Image.
    """
    # Convert to greyscale
    grey = ImageOps.grayscale(image)
    # Convert to numpy for OpenCV processing
    arr = np.array(grey)
    # Adaptive threshold: handles uneven lighting across the scan
    binary = cv2.adaptiveThreshold(
        arr, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY,
        blockSize=15,  # neighbourhood size (must be odd); raise for larger text
        C=8,           # constant subtracted from the weighted mean; raise to remove light background
    )
    # Morphological closing removes small dark specks on the white background.
    # A 1x1 kernel would be a no-op, and opening would remove white specks instead.
    kernel = np.ones((2, 2), np.uint8)
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    return Image.fromarray(cleaned)


def ocr_page_enhanced(image: Image.Image, min_confidence: int = 60) -> list[tuple[str, int, int]]:
    """OCR pipeline with image pre-processing."""
    processed = preprocess_for_ocr(image)
    data = pytesseract.image_to_data(processed, output_type=pytesseract.Output.DICT)
    tokens = []
    for i in range(len(data["text"])):
        text = data["text"][i].strip()
        conf = float(data["conf"][i])  # conf may be an int, float, or numeric string
        if text and conf >= min_confidence:
            tokens.append((text, data["left"][i], data["top"][i]))
    return tokens
Run ocr_page_enhanced in place of ocr_page for any scan with visible background noise, shadow at page edges, or inconsistent ink coverage. The adaptive threshold is particularly effective for documents that were scanned under uneven lighting.
Verification
Check extraction quality before trusting the output:
# pip install pandas
import pandas as pd
from pathlib import Path

df = pd.read_csv(Path("output/scanned_table.csv"))

# 1. Row count: compare against a manual count from the source PDF
print(f"Rows: {len(df)}")

# 2. Null ratio: high nulls indicate column reconstruction problems
null_rate = df.isnull().mean().mean()
print(f"Null rate: {null_rate:.1%} (>30% → tune row_tolerance or raise DPI)")

# 3. Numeric column check: merged cells show as parse failures
for col in df.columns:
    n = pd.to_numeric(df[col], errors="coerce").notna().mean()
    if n > 0.7:
        print(f"  {col}: numeric ({n:.0%} parseable)")
If the null rate exceeds 30%, try increasing DPI to 400 or lowering min_confidence to 50. If a column that should be numeric fails to parse, adjacent rows were probably merged, which puts text and numbers in the same column; lower row_tolerance to split the rows more finely.
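A cheaper check can run before the CSV is written: in a well-reconstructed table, most rows have the same number of cells. The sketch below reports the most common row width and the share of rows that match it. The function name is illustrative, and what counts as an acceptable share is an assumption to tune per document.

```python
from collections import Counter


def row_width_report(rows: list[list[str]]) -> tuple[int, float]:
    """Return (most common row width, share of rows with that width)."""
    if not rows:
        return 0, 0.0
    widths = Counter(len(row) for row in rows)
    modal_width, count = widths.most_common(1)[0]
    return modal_width, count / len(rows)


if __name__ == "__main__":
    rows = [
        ["Region", "Units", "Revenue"],
        ["North", "42", "1,204"],
        ["South", "17"],                               # one cell missing
        ["East", "9", "310", "West", "12", "455"],     # two rows merged into one
    ]
    width, share = row_width_report(rows)
    print(f"Modal width: {width}, matching rows: {share:.0%}")
```

Rows wider than the modal width can point to merged rows (lower row_tolerance), while narrower rows can point to tokens dropped by the confidence filter or to genuinely empty cells.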
Common Mistakes
| Issue | Root cause | Fix |
|---|---|---|
| Empty CSV output | No text layer; pdfplumber was run directly on a scan | Use is_scanned() first; route to this OCR pipeline |
| Rows merged together | row_tolerance too large for the font size | Lower row_tolerance; start at 10px for small fonts |
| Numeric values misread (l instead of 1) | DPI too low; character edges blur | Use dpi=300; raise to 400 for old documents |
| Garbage tokens in every cell | min_confidence too low | Raise to 65–70; pre-process image contrast first |
| poppler not found on convert_from_path | Poppler system binaries missing | sudo apt-get install poppler-utils |
Related
- Extracting Tables from PDFs — lattice and stream extraction for native PDFs with text layers
- Fix PDF Text Extraction Alignment Issues — coordinate-sorting fix for partially-OCR'd PDFs that still misalign
- Scanning and OCR Processing with Python — full OCR preprocessing guide including image enhancement
- Cleaning Messy CSV Data with pandas — clean and type-coerce the extracted DataFrame
Part of Extracting Tables from PDFs.