How to Extract Tables from Scanned PDFs

Standard parsers fail on scanned documents because the pages have no text layer, so extraction comes back empty (an empty DataFrame or an empty table list) rather than raising a descriptive error. This workflow resolves the issue by implementing an OCR-driven pipeline that converts rasterized pages into structured tabular data, extending core methods from Extracting Tables from PDFs into production-ready automation.

Key Objectives:

Diagnose vector-text vs raster-image PDFs before parsing
Implement Tesseract OCR with spatial coordinate preservation
Reconstruct table boundaries using Y-axis tolerance clustering
Export parsed matrices to CSV/Excel for downstream analysis

Diagnosing the Empty Table Extraction Error

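When running traditional extraction libraries on scanned documents, you will typically see empty results rather than a descriptive exception (exact behavior varies by library version):

pdfplumber: page.extract_tables() returns an empty list, and page.extract_table() returns None
camelot: read_pdf() returns a TableList containing 0 tables, usually with a warning that the page is image-based
tabula-py: read_pdf() returns an empty list of DataFrames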

Root Cause

Scanned PDFs are raster images wrapped in a PDF container. They lack embedded text streams, font dictionaries, and vector line objects. Libraries like pdfplumber and Camelot work by parsing PDF content streams (text-showing operators and path-drawing operators). When those are absent, the parsers return empty results instead of failing outright.

Diagnostic Check

Verify document type before applying extraction logic. Use PyMuPDF (fitz) to check for actual text content:

import fitz

def is_scanned_pdf(filepath):
    doc = fitz.open(filepath)
    # Sum the visible characters PyMuPDF can read from every page
    total_text = sum(len(page.get_text("text").strip()) for page in doc)
    doc.close()
    return total_text == 0

if is_scanned_pdf("scanned_report.pdf"):
    print("⚠️ Raster-only PDF detected. Switch to OCR pipeline.")
else:
    print("✅ Vector text layer present. Use standard extraction.")
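Some files mix born-digital pages with scanned ones, or carry a few stray characters on an otherwise scanned page, so an all-or-nothing check can misroute them. The following is a minimal sketch of a per-page variant. The function name `classify_pages` and the `min_chars` threshold of 20 are illustrative assumptions, not part of PyMuPDF; the input is simply the list of strings you would get from `page.get_text("text")`.

```python
def classify_pages(page_texts, min_chars=20):
    """Label each page 'text' or 'scanned' from its extracted text length.

    page_texts: one string per page (hypothetical input, e.g. collected
    from page.get_text("text") in PyMuPDF).
    min_chars: assumed threshold; tune it for your documents.
    """
    labels = []
    for text in page_texts:
        # Count only non-whitespace characters so blank padding is ignored.
        visible = sum(1 for ch in text if not ch.isspace())
        labels.append("text" if visible >= min_chars else "scanned")
    return labels


# Hand-written sample standing in for three pages of extracted text
sample = ["Invoice 2024\nTotal due: 1,250.00 USD", "", " \n "]
print(classify_pages(sample))
```

Pages labeled `scanned` can then be sent to the OCR pipeline while the rest go through standard extraction.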

Pre-Processing Scanned Pages with OCR

To extract data from raster pages, you must render each page to an image and run optical character recognition. Tesseract's image_to_data output provides bounding box coordinates alongside recognized text, which is essential for reconstructing tabular layouts.

Prerequisites

# System dependencies (Ubuntu/Debian); pdf2image needs the pdftoppm binary from poppler-utils
sudo apt-get install tesseract-ocr poppler-utils

# Python packages
pip install pytesseract pdf2image pandas

Rendering & OCR Execution

Always render at 300 DPI minimum. Lower resolutions degrade character boundaries, causing adjacent cells to merge during recognition.

from pdf2image import convert_from_path
import pytesseract

# Render at 300 DPI to preserve column separators
images = convert_from_path("scanned_report.pdf", dpi=300)

# Extract spatial text data
ocr_data = pytesseract.image_to_data(images[0], output_type=pytesseract.Output.DICT)
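The dictionary returned with `Output.DICT` is column-oriented: parallel lists under keys such as `text`, `conf`, `left`, `top`, `width`, and `height`, with one entry per detected element. As a rough sketch, the helper below (the name `ocr_words` is an assumption for illustration) reshapes that structure into one record per word and drops empty or low-confidence entries. It runs on a hand-written sample dictionary so it can be tried without Tesseract installed.

```python
def ocr_words(ocr_data, min_confidence=60):
    """Turn the column-oriented OCR dict into a list of per-word records."""
    words = []
    for i, raw in enumerate(ocr_data["text"]):
        text = raw.strip()
        # conf may arrive as int, float, or string depending on the version
        conf = float(ocr_data["conf"][i])
        if not text or conf <= min_confidence:
            continue
        words.append({
            "text": text,
            "left": ocr_data["left"][i],
            "top": ocr_data["top"][i],
            "width": ocr_data["width"][i],
            "height": ocr_data["height"][i],
            "conf": conf,
        })
    return words


# Hand-written sample shaped like an Output.DICT result (not real OCR output)
sample = {
    "text": ["", "Item", "Qty", "~"],
    "conf": ["-1", "96", 91.5, "12"],
    "left": [0, 40, 300, 500],
    "top": [0, 100, 102, 98],
    "width": [800, 60, 45, 8],
    "height": [600, 22, 22, 5],
}
for word in ocr_words(sample):
    print(word["text"], word["left"], word["top"])
```

Entries with empty text and a confidence of -1 are layout containers (page, block, line) rather than words, which is why they are skipped.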

Reconstructing Table Structure from OCR Output

Tesseract outputs flat text blocks with pixel coordinates. To rebuild a table, you must cluster blocks into rows using Y-axis tolerance, then sort each cluster by X-axis position. This coordinate mapping replaces the missing vector metadata.
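To make the clustering step concrete, here is a minimal sketch that groups tokens by the vertical gap to the first token of the current row. This is one possible variant, not the only way to do it: quantizing with `round(y / tolerance)` is shorter, but it can split two tokens that are a pixel apart when they straddle a bucket boundary (for example y=22 and y=23 with a tolerance of 15). The function name `cluster_rows` and the sample coordinates are assumptions for illustration.

```python
def cluster_rows(blocks, row_tolerance=15):
    """Group (text, x, y) blocks into rows, each sorted left to right."""
    rows = []
    current = []
    anchor_y = None
    for text, x, y in sorted(blocks, key=lambda b: b[2]):
        # Start a new row when this block sits too far below the row's first block.
        if anchor_y is None or y - anchor_y > row_tolerance:
            if current:
                rows.append(current)
            current = []
            anchor_y = y
        current.append((x, text))
    if current:
        rows.append(current)
    # Within each row, order cells by their x position.
    return [[text for _, text in sorted(row)] for row in rows]


# Hypothetical tokens: (text, left, top) in pixels
blocks = [("Qty", 300, 102), ("Item", 40, 100), ("2", 302, 147), ("Widget", 40, 151)]
print(cluster_rows(blocks))
```

Because each row is anchored to its first token, slightly skewed scans can still drift past the tolerance on very wide tables; deskewing the image before OCR reduces that risk.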

Complete Production Pipeline

The following script handles multi-block clustering, filters low-confidence tokens, and exports a clean pandas.DataFrame. This modular approach integrates seamlessly into broader Automating PDF Extraction & Generation workflows.

import pdf2image
import pytesseract
import pandas as pd

def extract_scanned_table(pdf_path, output_csv="extracted_table.csv", dpi=300, row_tolerance=15, min_confidence=60):
    """
    Converts a scanned PDF to a structured CSV using OCR coordinate clustering.
    """
    # 1. Render pages to high-DPI images
    images = pdf2image.convert_from_path(pdf_path, dpi=dpi)

    all_rows = []

    for img in images:
        # 2. Extract OCR data with bounding boxes
        data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

        # 3. Filter valid text blocks
        # float() accepts the int, float, and string confidences that
        # different pytesseract versions return (-1 marks non-text entries)
        blocks = [
            (data['text'][i], data['left'][i], data['top'][i])
            for i in range(len(data['text']))
            if float(data['conf'][i]) > min_confidence and data['text'][i].strip()
        ]

        if not blocks:
            continue

        # 4. Coordinate-based row grouping
        rows = {}
        for text, x, y in blocks:
            # Quantize Y-coordinate to group lines within tolerance
            row_key = round(y / row_tolerance)
            rows.setdefault(row_key, []).append((x, text))

        # 5. Sort rows by Y, then columns by X
        for y_key in sorted(rows.keys()):
            rows[y_key].sort(key=lambda item: item[0])
            all_rows.append([cell[1] for cell in rows[y_key]])

    # 6. Export to DataFrame & CSV
    df = pd.DataFrame(all_rows)
    # header=False: columns are positional, so skip the 0,1,2,... header line
    df.to_csv(output_csv, index=False, header=False)
    print(f"✅ Extracted {len(df)} rows to {output_csv}")
    return df

# Execute pipeline
extract_scanned_table("scanned_report.pdf")

Common Mistakes & Fixes

Issue	Root Cause	Production Fix
Using `pdfplumber`/`Camelot` directly on scans	These libraries parse embedded text streams and vector lines. Raster-only PDFs contain zero parseable objects.	Run the diagnostic check above. If `is_scanned_pdf()` returns `True`, route to the OCR pipeline.
Low DPI rendering (pdf2image defaults to 200)	Character glyphs blur at low resolution. Tesseract merges adjacent columns or misreads `1` as `l` or `I`.	Always specify `dpi=300` or `dpi=350` in `pdf2image.convert_from_path()`.
Ignoring confidence thresholds	Background noise or scan artifacts generate low-confidence tokens (`conf < 50`), polluting cells with garbage strings.	Filter `data['conf'][i] > 60` before clustering. Adjust the threshold based on scan quality.
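Fixed row tolerance for all documents	Font sizes and line spacing vary across documents. Hardcoded `y/15` may split single rows or merge adjacent ones.	Derive the tolerance from the page itself, for example as a fraction of the median token height in `data['height']`, or expose it as a configurable parameter. (The median of `np.diff(sorted(y_coords))` is dominated by near-zero gaps between tokens on the same line, so it tends to underestimate row spacing.)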
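As an illustration of a document-specific row tolerance, the sketch below scales the median token height, which tracks the font size on the page. This is a swapped-in heuristic, not something pytesseract provides: the function name `estimate_row_tolerance`, the `factor` of 0.6, and the `fallback` of 15 are all assumptions to tune against your own scans.

```python
from statistics import median


def estimate_row_tolerance(heights, factor=0.6, fallback=15):
    """Estimate a row tolerance in pixels from OCR token heights.

    heights: token heights, e.g. the 'height' list for the kept tokens.
    factor: assumed fraction of the median height to use as tolerance.
    fallback: value returned when no usable heights are available.
    """
    usable = [h for h in heights if h > 0]
    if not usable:
        return fallback
    # The median resists outliers such as a large title or a tall scan artifact.
    return max(1, round(median(usable) * factor))


# Hypothetical heights in pixels; the 60 is an outlier such as a heading
heights = [22, 24, 21, 23, 60]
print(estimate_row_tolerance(heights))
```

The result can be passed as `row_tolerance` per page, so a document with small print and one with large print no longer share a hardcoded value.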

FAQ

Why does my table extraction script return an empty CSV for scanned files? Scanned PDFs lack embedded text layers. Standard parsers read vector metadata, not pixels. You must run OCR first to recover the text and its coordinates before table extraction.

Can pytesseract automatically detect table borders? No. pytesseract outputs text and bounding boxes. You must implement coordinate clustering or use a dedicated layout analysis library like LayoutParser to map borders to rows/columns.

How do I handle multi-page scanned tables? Process each page sequentially, maintain a consistent column header schema, and concatenate DataFrames using pandas.concat() while filtering duplicate header rows.
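A minimal sketch of that merge step, written against plain row lists so it runs without pandas: it takes the first row of the first non-empty page as the header and drops any later row that repeats it. The function name `merge_page_tables` and the case-insensitive comparison are assumptions; OCR can misread a repeated header slightly, in which case a fuzzier match would be needed.

```python
def merge_page_tables(page_tables):
    """Concatenate per-page row lists, keeping the header only once.

    page_tables: list of pages, each a list of rows (lists of strings).
    Returns (header, body_rows).
    """
    pages = [rows for rows in page_tables if rows]
    if not pages:
        return [], []

    def normalize(row):
        # Compare headers loosely: ignore case and surrounding whitespace.
        return [cell.strip().lower() for cell in row]

    header = pages[0][0]
    header_key = normalize(header)
    body = []
    for rows in pages:
        for row in rows:
            if normalize(row) == header_key:
                continue  # skip the header and its repeats on later pages
            body.append(row)
    return header, body


# Hypothetical two-page table with the header repeated on page 2
page1 = [["Item", "Qty"], ["Widget", "2"]]
page2 = [["item ", "QTY"], ["Gadget", "5"]]
header, body = merge_page_tables([page1, page2])
print(header, body)
```

With pandas available, `pd.DataFrame(body, columns=header)` turns the result into a single frame; the same filtering can be applied per page before `pandas.concat()`.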

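What if the table has merged cells? Coordinate clustering treats each text block independently, so a merged or empty cell produces fewer tokens than the row has columns. When cells are listed purely by X order, the remaining values shift left and the row ends up padded with missing values on the right. Post-process by aligning tokens to known column positions, then checking for NaN values and forward-filling if domain logic permits.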
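One way to keep values in the right column is to assign each token to the nearest known column position instead of listing tokens in X order. The sketch below assumes you already have the left edge of each column, for instance taken from the header row; the name `align_to_columns` and the nearest-left-edge rule are illustrative assumptions, and right-aligned numeric columns may need the right edge instead.

```python
def align_to_columns(row, column_starts):
    """Place (x, text) tokens into fixed columns; unmatched columns stay None.

    row: list of (x, text) tuples for one table row.
    column_starts: assumed left edge of each column in pixels.
    """
    cells = [None] * len(column_starts)
    for x, text in sorted(row):
        # Pick the column whose left edge is nearest to the token's x position.
        idx = min(range(len(column_starts)), key=lambda i: abs(column_starts[i] - x))
        # Several words landing in one column are joined into a single cell.
        cells[idx] = text if cells[idx] is None else f"{cells[idx]} {text}"
    return cells


# Hypothetical column edges taken from a header row: Item, Qty, Price
column_starts = [40, 300, 520]
row = [(42, "Blue"), (95, "Widget"), (522, "9.99")]
print(align_to_columns(row, column_starts))
```

The empty middle column stays `None`, which pandas treats as a missing value, so a later `ffill()` can propagate merged-cell values where that matches the table's meaning.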