How to Extract Tables from Scanned PDFs
Standard parsers fail on scanned documents because the pages lack a text layer, returning empty DataFrames or raising TableNotFoundError. This workflow resolves the issue with an OCR-driven pipeline that converts rasterized pages into structured tabular data, extending the core methods from Extracting Tables from PDFs into production-ready automation.
Key Objectives:
- Diagnose vector-text vs raster-image PDFs before parsing
- Implement Tesseract OCR with spatial coordinate preservation
- Reconstruct table boundaries using Y-axis tolerance clustering
- Export parsed matrices to CSV/Excel for downstream analysis
Diagnosing the Empty Table Extraction Error
When running traditional extraction libraries on scanned documents, you will typically encounter one of these exact errors:
- pdfplumber: `TableNotFoundError: No tables found on page X`
- camelot: `Empty DataFrame` returned with `0 rows × 0 columns`
- tabula-py: `java.lang.RuntimeException: No tables detected`
Root Cause
Scanned PDFs are raster images wrapped in a PDF container. They lack embedded text streams, font dictionaries, and vector line objects. Libraries like pdfplumber and Camelot work by parsing PDF content streams (text-showing operators and path drawings). When those objects are absent, the parsers return null results instead of failing outright.
Diagnostic Check
Verify document type before applying extraction logic. Use PyMuPDF (fitz) to check for actual text content:
```python
import fitz  # PyMuPDF

def is_scanned_pdf(filepath):
    doc = fitz.open(filepath)
    total_text = sum(len(page.get_text("text").strip()) for page in doc)
    doc.close()
    return total_text == 0

if is_scanned_pdf("scanned_report.pdf"):
    print("⚠️ Raster-only PDF detected. Switch to OCR pipeline.")
else:
    print("✅ Vector text layer present. Use standard extraction.")
```
Pre-Processing Scanned Pages with OCR
To extract data from raster pages, you must render each page to an image and run optical character recognition. Tesseract's image_to_data output provides bounding box coordinates alongside recognized text, which is essential for reconstructing tabular layouts.
Prerequisites
```bash
# System dependency (Ubuntu/Debian) — pdf2image requires the poppler utilities
sudo apt-get install tesseract-ocr poppler-utils

# Python packages
pip install pytesseract pdf2image pandas
```
Rendering & OCR Execution
Always render at 300 DPI minimum. Lower resolutions degrade character boundaries, causing adjacent cells to merge during recognition.
```python
from pdf2image import convert_from_path
import pytesseract

# Render at 300 DPI to preserve column separators
images = convert_from_path("scanned_report.pdf", dpi=300)

# Extract spatial text data with bounding boxes
ocr_data = pytesseract.image_to_data(images[0], output_type=pytesseract.Output.DICT)
```
Reconstructing Table Structure from OCR Output
Tesseract outputs flat text blocks with pixel coordinates. To rebuild a table, you must cluster blocks into rows using Y-axis tolerance, then sort each cluster by X-axis position. This coordinate mapping replaces the missing vector metadata.
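Before the full pipeline below, here is the clustering idea in isolation on a handful of invented token coordinates (the text strings and pixel positions are made up for illustration): quantizing each token's Y position by the tolerance assigns nearby baselines to the same row bucket, and sorting each bucket by X recovers column order.

```python
# Toy demonstration of Y-tolerance clustering. Each token is
# (text, left_x, top_y) in pixels — coordinates here are invented.
tokens = [
    ("Name", 50, 100), ("Qty", 200, 102), ("Price", 350, 99),   # header line
    ("Widget", 50, 140), ("3", 205, 141), ("9.99", 350, 142),   # data line
]

def cluster_rows(tokens, row_tolerance=15):
    rows = {}
    for text, x, y in tokens:
        key = round(y / row_tolerance)   # nearby baselines share a bucket
        rows.setdefault(key, []).append((x, text))
    # Rows top-to-bottom (sorted keys), cells left-to-right (sorted by X)
    return [[t for _, t in sorted(rows[k])] for k in sorted(rows)]

print(cluster_rows(tokens))
# [['Name', 'Qty', 'Price'], ['Widget', '3', '9.99']]
```

Note that quantization is sensitive to boundary effects: two tokens whose Y values straddle a bucket edge can land in different rows even when they differ by less than the tolerance, which is why the tolerance should roughly match the document's line spacing.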
Complete Production Pipeline
The following script handles multi-block clustering, filters low-confidence tokens, and exports a clean pandas.DataFrame. This modular approach integrates seamlessly into broader Automating PDF Extraction & Generation workflows.
```python
import pdf2image
import pytesseract
import pandas as pd

def extract_scanned_table(pdf_path, output_csv="extracted_table.csv",
                          dpi=300, row_tolerance=15, min_confidence=60):
    """Convert a scanned PDF to a structured CSV using OCR coordinate clustering."""
    # 1. Render pages to high-DPI images
    images = pdf2image.convert_from_path(pdf_path, dpi=dpi)
    all_rows = []
    for img in images:
        # 2. Extract OCR data with bounding boxes
        data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
        # 3. Filter valid text blocks (conf is -1 for non-text elements;
        #    parse as float since some Tesseract versions emit "95.0")
        blocks = [
            (data['text'][i], data['left'][i], data['top'][i])
            for i in range(len(data['text']))
            if float(data['conf'][i]) > min_confidence and data['text'][i].strip()
        ]
        if not blocks:
            continue
        # 4. Coordinate-based row grouping
        rows = {}
        for text, x, y in blocks:
            # Quantize Y-coordinate to group lines within tolerance
            row_key = round(y / row_tolerance)
            rows.setdefault(row_key, []).append((x, text))
        # 5. Sort rows by Y, then columns by X
        for y_key in sorted(rows.keys()):
            rows[y_key].sort(key=lambda item: item[0])
            all_rows.append([cell[1] for cell in rows[y_key]])
    # 6. Export to DataFrame & CSV
    df = pd.DataFrame(all_rows)
    df.to_csv(output_csv, index=False)
    print(f"✅ Extracted {len(df)} rows to {output_csv}")
    return df

# Execute pipeline
extract_scanned_table("scanned_report.pdf")
```
Common Mistakes & Fixes
| Issue | Root Cause | Production Fix |
|---|---|---|
| Using pdfplumber/Camelot directly on scans | These libraries parse embedded text streams and vector lines. Raster-only PDFs contain zero parseable objects. | Run the diagnostic check above. If is_scanned_pdf() returns True, route to the OCR pipeline. |
| Low DPI rendering (default 72) | Character glyphs blur at low resolution. Tesseract merges adjacent columns or misreads 1 as l or I. | Always specify dpi=300 or dpi=350 in pdf2image.convert_from_path(). |
| Ignoring confidence thresholds | Background noise or scan artifacts generate low-confidence tokens (conf < 50), polluting cells with garbage strings. | Filter data['conf'][i] > 60 before clustering. Adjust tolerance based on scan quality. |
| Fixed row tolerance for all documents | Font sizes and line spacing vary across documents. Hardcoded y/15 may split single rows or merge adjacent ones. | Dynamically calculate tolerance using np.median(np.diff(sorted(y_coords))) or expose it as a configurable parameter. |
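The dynamic-tolerance fix in the last table row can be sketched as follows. This is a minimal estimator, not a drop-in for the pipeline above: it assumes `y_coords` are the `top` values Tesseract returned for a page, and the jitter threshold of 2 px is an illustrative choice.

```python
import numpy as np

def estimate_row_tolerance(y_coords, fallback=15):
    """Estimate row spacing as the median gap between distinct Y positions."""
    ys = sorted(set(y_coords))
    gaps = np.diff(ys)
    gaps = gaps[gaps > 2]          # ignore sub-pixel jitter within one text line
    if len(gaps) == 0:
        return fallback            # too few lines to estimate; use the default
    return float(np.median(gaps))

# Invented Y positions for three text lines (two tokens each on the first two)
print(estimate_row_tolerance([100, 101, 140, 141, 180]))
# 39.0
```

Passing the result as `row_tolerance` lets the same pipeline handle documents with different font sizes and line spacing.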
FAQ
Why does my table extraction script return an empty CSV for scanned files?
Scanned PDFs lack embedded text layers. Standard parsers read vector metadata, not pixels. You must run OCR first to generate a searchable text layer before table extraction.
Can pytesseract automatically detect table borders?
No. pytesseract outputs text and bounding boxes. You must implement coordinate clustering or use a dedicated layout analysis library like LayoutParser to map borders to rows/columns.
How do I handle multi-page scanned tables?
Process each page sequentially, maintain a consistent column header schema, and concatenate DataFrames using pandas.concat() while filtering duplicate header rows.
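A minimal sketch of that concatenation step, using two invented per-page DataFrames that each repeat the header row (as pages of a scanned table typically do):

```python
import pandas as pd

# Hypothetical per-page extraction results; each page repeats the header row
page1 = pd.DataFrame([["Name", "Qty"], ["Widget", "3"]])
page2 = pd.DataFrame([["Name", "Qty"], ["Gadget", "7"]])

def merge_pages(frames):
    combined = pd.concat(frames, ignore_index=True)
    header = combined.iloc[0]
    # Drop every row identical to the header, then promote it to column names
    body = combined[~combined.eq(header).all(axis=1)]
    body.columns = header.tolist()
    return body.reset_index(drop=True)

print(merge_pages([page1, page2]))
```

This assumes the header text is OCR'd identically on every page; if recognition varies, match header rows fuzzily instead of by exact equality.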
What if the table has merged cells?
Coordinate clustering treats each text block independently. Merged cells will appear as a single cell spanning multiple columns. Post-process by checking for NaN values in adjacent columns and forward-filling if domain logic permits.
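The forward-fill post-processing step might look like this; the column names and values are invented for illustration, and whether filling is valid depends entirely on what the merged cell meant in the source table:

```python
import pandas as pd

# A vertically merged "Region" cell leaves gaps in the rows it spanned
df = pd.DataFrame({
    "Region": ["North", None, None, "South"],
    "Sales": [100, 120, 90, 200],
})

# Forward-fill: each blank inherits the value from the row above
df["Region"] = df["Region"].ffill()
print(df["Region"].tolist())
# ['North', 'North', 'North', 'South']
```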