How to Extract Tables from Scanned PDFs

pdfplumber and camelot return no tables on scanned documents because they parse the PDF content stream, which contains no text objects when a document was photocopied or printed and then scanned. The fix is an OCR pipeline: render each page to a high-DPI image, extract text with spatial coordinates via Tesseract, then reconstruct row structure by clustering y-coordinates.

Root Cause

A scanned PDF is a wrapper around one or more raster images. Its pages contain no text-showing operators, no font dictionaries, and no vector line objects. Any library that reads those structures (pdfplumber, camelot, tabula-py) returns empty results instead of raising an error. The symptom:

import pdfplumber
with pdfplumber.open("scanned_report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()
print(tables)   # → []

import camelot
tables = camelot.read_pdf("scanned_report.pdf", flavor="lattice")
print(tables.n)  # → 0

Minimal Diagnostic

Before building a pipeline, confirm you are actually dealing with a scanned PDF:

# pip install pymupdf
from pathlib import Path
import fitz  # PyMuPDF

def is_scanned(path: Path) -> bool:
    """Return True if the PDF has no selectable text on any page."""
    try:
        # The context manager closes the document even if text extraction fails
        with fitz.open(str(path)) as doc:
            total_chars = sum(len(page.get_text("text").strip()) for page in doc)
    except Exception as e:
        raise RuntimeError(f"Could not inspect {path}: {e}") from e
    return total_chars == 0

if __name__ == "__main__":
    pdf = Path("data/scanned_report.pdf")
    if is_scanned(pdf):
        print("No text layer — use OCR pipeline")
    else:
        print("Text layer present — use pdfplumber or camelot")

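A result of True means no page has extractable text, so the OCR path is required. Note that is_scanned() returns False for any nonzero character count. A small count (under 50) often indicates a partially OCR'd scan where Tesseract was run at low quality, so treat such a file the same as a full scan even though the check reports a text layer.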
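The under-50 rule can be folded into the check itself. The following is a minimal sketch, not part of the diagnostic above: needs_ocr and its min_chars parameter are hypothetical names, and the default of 50 simply mirrors the rule of thumb. It takes per-page text strings (for example, the result of page.get_text("text") for each page) so it can be tried without a PDF on hand.

```python
def needs_ocr(page_texts: list[str], min_chars: int = 50) -> bool:
    """Return True when the document has fewer than min_chars extractable characters.

    page_texts: one string per page, e.g. [page.get_text("text") for page in doc].
    min_chars: hypothetical cutoff; 50 mirrors the rule of thumb in the text.
    """
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars < min_chars


if __name__ == "__main__":
    print(needs_ocr(["", ""]))                                # no text layer at all
    print(needs_ocr(["l1 .", " ~ "]))                         # stray characters from a weak OCR layer
    print(needs_ocr(["Quarterly revenue by region " * 10]))   # genuine text layer
```

The cutoff is a judgment call: a short but legitimate text PDF (a one-line cover page, say) would also fall under it, so it may be worth checking such files by eye.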

Prerequisites

Install system binaries and Python packages before running the pipeline:

# System dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils

# Python packages
pip install pdf2image pytesseract pandas pymupdf

Verify Tesseract is on the PATH:

tesseract --version
# Expected: tesseract 4.x or 5.x

If Tesseract is not found, set the path in code: pytesseract.pytesseract.tesseract_cmd = "/usr/bin/tesseract". For the full Tesseract not-found error on Linux, see Scanning and OCR Processing with Python.
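One way to avoid hard-coding the path is to look it up first. This is a sketch using only the standard library; find_tesseract and its fallback list are hypothetical, and the fallback locations are common install paths rather than guaranteed ones.

```python
import shutil
from pathlib import Path
from typing import Optional

# Assumed fallback locations; adjust for your system.
FALLBACK_PATHS = ("/usr/bin/tesseract", "/usr/local/bin/tesseract")


def find_tesseract(fallbacks: tuple = FALLBACK_PATHS) -> Optional[str]:
    """Return the path to the Tesseract binary, or None if it cannot be located."""
    found = shutil.which("tesseract")   # search the PATH first
    if found:
        return found
    for candidate in fallbacks:
        if Path(candidate).is_file():
            return candidate
    return None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found; install tesseract-ocr or set the path manually")
    else:
        print(f"Tesseract found at {path}")
```

If a path comes back, assign it to pytesseract.pytesseract.tesseract_cmd before the first OCR call.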

Step 1: Render Pages to Images

pdf2image.convert_from_path calls Poppler's pdftoppm under the hood. Use 300 DPI as the default. Lower resolutions can blur character edges and cause Tesseract to merge adjacent cell text, so drop below 300 only for clean originals.

# pip install pdf2image
from pathlib import Path
from pdf2image import convert_from_path
from PIL import Image  # installed with pdf2image

PDF_PATH = Path("data/scanned_report.pdf")

def render_pages(path: Path, dpi: int = 300) -> list[Image.Image]:
    """Render all PDF pages to PIL Image objects at the specified DPI."""
    try:
        images = convert_from_path(str(path), dpi=dpi)
    except Exception as e:
        raise RuntimeError(
            f"pdf2image failed on {path}. Ensure poppler-utils is installed: {e}"
        ) from e
    print(f"Rendered {len(images)} page(s) at {dpi} DPI")
    return images

if __name__ == "__main__":
    pages = render_pages(PDF_PATH)

DPI guidance:

| Scan quality | Recommended DPI |
| --- | --- |
| Clean laser print | 200 |
| Typical office scan | 300 |
| Old or faint document | 400–600 |
| Mixed content with small text | 400 |
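To see what each DPI setting costs, the pixel dimensions follow directly from the paper size: pixels = inches × DPI. A rough sketch, assuming US Letter paper (8.5 × 11 in) and uncompressed 8-bit RGB images; page_pixels and rgb_megabytes are illustrative helpers, not pdf2image functions.

```python
def page_pixels(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> tuple[int, int]:
    """Pixel width and height of one rendered page (assumes US Letter by default)."""
    return round(width_in * dpi), round(height_in * dpi)


def rgb_megabytes(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> float:
    """Approximate in-memory size of one page as 8-bit RGB, in megabytes (10^6 bytes)."""
    width_px, height_px = page_pixels(dpi, width_in, height_in)
    return width_px * height_px * 3 / 1_000_000


if __name__ == "__main__":
    for dpi in (200, 300, 400, 600):
        w, h = page_pixels(dpi)
        print(f"{dpi} DPI: {w} x {h} px, about {rgb_megabytes(dpi):.0f} MB per page")
```

Memory grows with the square of the DPI, so a 600 DPI page takes four times the memory of a 300 DPI page. For long documents at high DPI, rendering a few pages at a time (convert_from_path accepts first_page and last_page) keeps the footprint bounded.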

Step 2: Extract OCR Data with Spatial Coordinates

pytesseract.image_to_data() returns word-level bounding boxes alongside recognized text. This spatial data is essential for reconstructing rows — without it you only get a flat string.

# pip install pytesseract
from PIL import Image
import pytesseract

def ocr_page(image: Image.Image, min_confidence: int = 60) -> list[tuple[str, int, int]]:
    """
    Run Tesseract on a single page image.
    Returns a list of (text, x_left, y_top) tuples for words above the confidence threshold.
    """
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    tokens = []
    for i in range(len(data["text"])):
        text = data["text"][i].strip()
        # conf can arrive as an int, a float, or a numeric string depending on versions
        conf = int(float(data["conf"][i]))
        if text and conf >= min_confidence:
            tokens.append((text, data["left"][i], data["top"][i]))
    return tokens

min_confidence=60 filters out noise (speckles, bleed-through) that Tesseract assigns low scores. Raise to 70–75 for cleaner scans; lower to 50 for degraded documents.
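Before settling on a threshold, it can help to see how many words each candidate value would keep. A minimal sketch, assuming confs is the "conf" list returned by image_to_data and that its values may be ints, floats, or numeric strings depending on the pytesseract and Tesseract versions. kept_fraction is a hypothetical helper. Entries with a confidence of -1 are structural records (page, block, line) rather than words, so they are ignored.

```python
def kept_fraction(confs: list, thresholds: tuple = (50, 60, 70, 75)) -> dict:
    """Map each threshold to the share of recognized words at or above it."""
    # Normalize to float and drop the -1 entries that mark non-word records
    word_confs = [float(c) for c in confs if float(c) >= 0]
    if not word_confs:
        return {t: 0.0 for t in thresholds}
    return {
        t: sum(c >= t for c in word_confs) / len(word_confs)
        for t in thresholds
    }


if __name__ == "__main__":
    # Made-up confidence values for illustration only
    sample = [-1, -1, 96, 91, "88.5", 72, 64, 58, 41, 12]
    for threshold, share in kept_fraction(sample).items():
        print(f"min_confidence={threshold}: keeps {share:.0%} of words")
```

A sharp drop between two adjacent thresholds suggests many words sit in that confidence band, which is worth inspecting by eye before deciding which side of it to cut.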

Step 3: Reconstruct Table Rows by Y-Clustering

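Group tokens into rows by dividing each y-coordinate by a bucket height (row_tolerance) and rounding to the nearest integer, then sort each bucket by x-coordinate. This stands in for the missing vector-line metadata. Each OCR word becomes its own cell, so a cell that contains several words spreads across several columns.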

def tokens_to_rows(
    tokens: list[tuple[str, int, int]],
    row_tolerance: int = 15,
) -> list[list[str]]:
    """
    Cluster OCR tokens into table rows using y-coordinate bucketing.
    row_tolerance: pixel height of one row bucket (tune per document font size).
    """
    row_map: dict[int, list[tuple[int, str]]] = {}
    for text, x, y in tokens:
        bucket = round(y / row_tolerance)   # integer bucket key
        row_map.setdefault(bucket, []).append((x, text))

    rows = []
    for bucket_key in sorted(row_map):
        row_map[bucket_key].sort(key=lambda item: item[0])   # sort left-to-right
        rows.append([cell[1] for cell in row_map[bucket_key]])

    return rows

Tuning row_tolerance: Start with 15 pixels at 300 DPI. If rows merge, lower it to 10. If single rows split across two buckets, raise it to 20. For variable-height rows (bold headers vs body text), post-process by merging adjacent short rows.
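Fixed buckets can split a row whose words straddle a bucket boundary, whatever the tolerance. An alternative, shown here as a sketch rather than a replacement for tokens_to_rows, sorts tokens by y and starts a new row whenever the vertical gap to the previous token exceeds the tolerance. cluster_rows_by_gap is a hypothetical name; the token format (text, x_left, y_top) matches the output of ocr_page.

```python
def cluster_rows_by_gap(
    tokens: list[tuple[str, int, int]],
    row_tolerance: int = 15,
) -> list[list[str]]:
    """Group tokens into rows by vertical gaps instead of fixed buckets."""
    rows: list[list[tuple[int, str]]] = []
    current: list[tuple[int, str]] = []
    last_y = None
    for text, x, y in sorted(tokens, key=lambda t: t[2]):   # top to bottom
        # A jump larger than the tolerance starts a new row
        if last_y is not None and y - last_y > row_tolerance:
            rows.append(current)
            current = []
        current.append((x, text))
        last_y = y
    if current:
        rows.append(current)
    # Order each row left to right and keep only the text
    return [[text for _, text in sorted(row, key=lambda item: item[0])] for row in rows]


if __name__ == "__main__":
    # y=52 and y=53 fall into different buckets with round(y / 15),
    # yet they are one pixel apart and clearly belong to the same row.
    tokens = [("Region", 100, 52), ("Sales", 400, 53), ("North", 100, 97), ("1200", 400, 98)]
    print(cluster_rows_by_gap(tokens))
```

Because each token is compared with the one before it, this tolerates a slightly skewed scan. The trade-off is that tightly spaced rows can chain together if the line spacing is smaller than row_tolerance.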

Step 4: Export to DataFrame

# pip install pandas
from pathlib import Path
import pandas as pd

OUTPUT_PATH = Path("output/scanned_table.csv")

def rows_to_dataframe(rows: list[list[str]], header_row: int = 0) -> pd.DataFrame:
    """Convert row-list to DataFrame, using rows[header_row] as column names."""
    if not rows:
        return pd.DataFrame()

    # Pad all rows to the same width
    max_cols = max(len(r) for r in rows)
    padded = [r + [""] * (max_cols - len(r)) for r in rows]

    header = padded[header_row]
    data = padded[header_row + 1:]
    df = pd.DataFrame(data, columns=header)
    df.replace("", pd.NA, inplace=True)
    return df

if __name__ == "__main__":
    # parents=True also creates missing intermediate directories
    OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    # Assumes ocr_page (Step 2) and tokens_to_rows (Step 3) are defined in this module
    from pdf2image import convert_from_path

    PDF_PATH = Path("data/scanned_report.pdf")
    images = convert_from_path(str(PDF_PATH), dpi=300)
    all_rows: list[list[str]] = []
    for img in images:
        tokens = ocr_page(img)
        all_rows.extend(tokens_to_rows(tokens))

    df = rows_to_dataframe(all_rows)
    df.to_csv(OUTPUT_PATH, index=False)
    print(f"Exported {len(df)} rows × {df.shape[1]} cols to {OUTPUT_PATH}")
    print(df.head())

Variant Fix: Multi-Page Tables with Consistent Headers

When the scanned document spans many pages and each page repeats the table header, deduplicate before concatenating — the same technique used in Extracting Tables from PDFs for native PDFs applies here:

# pip install pandas
import pandas as pd

def merge_scanned_pages(page_rows: list[list[list[str]]]) -> pd.DataFrame:
    """
    Merge rows from multiple scanned pages, removing repeated header rows.
    page_rows: list of row-lists, one per page.
    """
    if not page_rows or not page_rows[0]:
        return pd.DataFrame()

    header = page_rows[0][0]   # canonical header from first page
    all_data: list[list[str]] = []

    for rows in page_rows:
        for row in rows:
            if row == header:
                continue   # skip repeated headers
            all_data.append(row)

    # Include the header width so the column list never outnumbers the data columns
    max_cols = max([len(header)] + [len(r) for r in all_data])

    padded = [r + [""] * (max_cols - len(r)) for r in all_data]
    columns = header + [f"extra_{i}" for i in range(max_cols - len(header))]
    return pd.DataFrame(padded, columns=columns)
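The exact comparison row == header misses a repeated header whenever OCR misreads a single character on a later page. A tolerant comparison is one option. This sketch uses difflib.SequenceMatcher from the standard library; is_header_like and the 0.85 cutoff are assumptions for illustration, not values taken from the pipeline above.

```python
from difflib import SequenceMatcher


def is_header_like(row: list[str], header: list[str], min_ratio: float = 0.85) -> bool:
    """Return True if row reads like the header, allowing small OCR differences."""
    row_text = " ".join(row).lower()
    header_text = " ".join(header).lower()
    # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge
    return SequenceMatcher(None, row_text, header_text).ratio() >= min_ratio


if __name__ == "__main__":
    header = ["Region", "Units", "Revenue"]
    print(is_header_like(["Region", "Units", "Revenue"], header))   # exact repeat
    print(is_header_like(["Reg1on", "Units", "Revenue"], header))   # one misread character
    print(is_header_like(["North", "120", "4,500"], header))        # ordinary data row
```

Swapping this in for the equality test in merge_scanned_pages would be a one-line change. The cutoff needs tuning per document: too low and short data rows that resemble the header get dropped.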

Image Pre-Processing for Low-Quality Scans

If Tesseract accuracy is poor (many garbage tokens even at min_confidence=50), pre-process the image before calling image_to_data. Improving contrast and binarising the image gives Tesseract cleaner character boundaries.

# pip install pytesseract pillow opencv-python-headless
from PIL import Image, ImageOps
import cv2
import numpy as np
import pytesseract

def preprocess_for_ocr(image: Image.Image) -> Image.Image:
    """
    Convert to greyscale, binarize, and denoise a scan image before OCR.
    Returns a processed PIL Image.
    """
    # Convert to greyscale
    grey = ImageOps.grayscale(image)

    # Convert to numpy for OpenCV processing
    arr = np.array(grey)

    # Adaptive threshold — handles uneven lighting across the scan
    binary = cv2.adaptiveThreshold(
        arr, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY,
        blockSize=15,   # neighbourhood size; raise for larger text
        C=8,            # constant subtracted from mean; raise to remove light background
    )

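    # Morphological closing to remove small dark noise spots. After THRESH_BINARY the
    # text is black on white and OpenCV treats white as foreground, so closing (not
    # opening) erases isolated black dots. A 1x1 kernel would be a no-op, so use 2x2.
    kernel = np.ones((2, 2), np.uint8)
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)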

    return Image.fromarray(cleaned)

def ocr_page_enhanced(image: Image.Image, min_confidence: int = 60) -> list[tuple[str, int, int]]:
    """OCR pipeline with image pre-processing."""
    processed = preprocess_for_ocr(image)
    data = pytesseract.image_to_data(processed, output_type=pytesseract.Output.DICT)
    tokens = []
    for i in range(len(data["text"])):
        text = data["text"][i].strip()
        # conf can arrive as an int, a float, or a numeric string depending on versions
        conf = int(float(data["conf"][i]))
        if text and conf >= min_confidence:
            tokens.append((text, data["left"][i], data["top"][i]))
    return tokens

Run ocr_page_enhanced in place of ocr_page for any scan with visible background noise, shadow at page edges, or inconsistent ink coverage. The adaptive threshold is particularly effective for documents that were scanned under uneven lighting.

Verification

Check extraction quality before trusting the output:

# pip install pandas
import pandas as pd
from pathlib import Path

df = pd.read_csv(Path("output/scanned_table.csv"))

# 1. Row count — compare against a manual count from the source PDF
print(f"Rows: {len(df)}")

# 2. Null ratio — high nulls indicate column reconstruction problems
null_rate = df.isnull().mean().mean()
print(f"Null rate: {null_rate:.1%}  (>30% → tune row_tolerance or raise DPI)")

# 3. Numeric column check — merged cells show as parse failures
for col in df.columns:
    n = pd.to_numeric(df[col], errors="coerce").notna().mean()
    if n > 0.7:
        print(f"  {col}: numeric ({n:.0%} parseable)")

If null rate exceeds 30%, try increasing DPI to 400 or lowering min_confidence to 50. If numerics fail to parse, the column was merged — lower row_tolerance to split the rows more finely.
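The checks above run on the exported CSV, where ragged rows have already been padded. Looking at row widths before padding gives an earlier signal. A sketch, assuming rows is the output of tokens_to_rows; width_report is a hypothetical helper.

```python
from collections import Counter


def width_report(rows: list[list[str]]) -> tuple[int, float]:
    """Return (most common row width, share of rows that have that width)."""
    if not rows:
        return 0, 0.0
    counts = Counter(len(row) for row in rows)
    width, n = counts.most_common(1)[0]
    return width, n / len(rows)


if __name__ == "__main__":
    # Illustrative rows: the last two look like one table row split in half
    rows = [["Region", "Units"], ["North", "120"], ["South", "95"], ["East"], ["40"]]
    width, share = width_report(rows)
    print(f"Most common width: {width} columns ({share:.0%} of rows)")
```

A low share suggests rows were split across buckets, or that multi-word cells were counted as several columns. Either way the padded DataFrame will show it later as a high null rate.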

Common Mistakes

| Issue | Root cause | Fix |
| --- | --- | --- |
| Empty CSV output | No text layer; pdfplumber was run directly on the scan | Use is_scanned() first; route to this OCR pipeline |
| Rows merged together | row_tolerance too large for the font size | Lower row_tolerance; start at 10px for small fonts |
| Numeric values misread (l instead of 1) | DPI too low, so character edges blur | Use dpi=300; raise to 400 for old documents |
| Garbage tokens in every cell | min_confidence too low | Raise to 65–70; pre-process image contrast first |
| poppler not found on convert_from_path | Poppler system binaries missing | sudo apt-get install poppler-utils |

Part of Extracting Tables from PDFs.