Scanning and OCR Processing with Python
Scanned PDFs are image containers — no embedded text, no selectable characters. Every standard parser returns empty results because there is nothing to parse. The fix is a three-stage pipeline: rasterize each page to a high-DPI image, preprocess that image to maximise contrast and alignment, then feed it to an OCR engine that produces a text layer you can search, index, or pipe into downstream extraction.
Generic "just call pytesseract" tutorials skip the preprocessing stage. That works on clean scans at ideal DPI; it fails on faded invoices, skewed photographs, and low-contrast forms. This guide covers the full pipeline end-to-end.
Prerequisites
System binary and Python packages must both be present. pytesseract is only a wrapper — without the Tesseract binary the wrapper raises TesseractNotFoundError immediately. If you hit that error, see Fix TesseractNotFoundError in Python.
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Windows — download from https://github.com/UB-Mannheim/tesseract/wiki and add to PATH
# Python packages
pip install pytesseract pymupdf pdf2image Pillow opencv-python numpy
# For extra language packs (e.g. German + French):
sudo apt-get install tesseract-ocr-deu tesseract-ocr-fra
Verify the installation before writing any pipeline code:
tesseract --version
python -c "import pytesseract; print(pytesseract.get_tesseract_version())"
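The same check can be run from inside Python before any OCR call. This is a minimal sketch using only the standard library; the helper name find_tesseract is illustrative and not part of pytesseract.

```python
import shutil
from typing import Optional


def find_tesseract(binary_name: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(binary_name)


path = find_tesseract()
if path is None:
    print("tesseract is not on PATH; install it or fix PATH first")
else:
    print(f"tesseract found at {path}")
```

If the binary lives outside PATH, pytesseract.pytesseract.tesseract_cmd can be set to its full path instead.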
Step 1 — Diagnose the PDF Before Choosing a Path
Not every "scanned" PDF is purely raster. Some contain a hidden text layer added by a previous OCR pass. Running the full preprocessing+OCR pipeline on those wastes time and can degrade quality. Use PyMuPDF to check first.
# pip install pymupdf
from pathlib import Path
import fitz # PyMuPDF
def classify_pdf(pdf_path: Path) -> str:
    """
    Returns 'text' if the PDF has an embedded text layer,
    'raster' if it is a pure scan that needs OCR.
    """
    try:
        doc = fitz.open(pdf_path)
        char_count = sum(len(page.get_text("text").strip()) for page in doc)
        doc.close()
        return "text" if char_count > 20 else "raster"
    except Exception as exc:
        raise RuntimeError(f"Could not open {pdf_path}: {exc}") from exc

pdf = Path("contract_scan.pdf")
kind = classify_pdf(pdf)
print(f"{pdf.name}: {kind}")
# → contract_scan.pdf: raster
If the result is text, use pdfplumber or camelot directly — no OCR needed. If it is raster, continue with the pipeline below.
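classify_pdf gives one verdict per file, but a document can mix scanned and digital pages. As a sketch, the same 20-character threshold could be applied page by page. The function below takes plain character counts (for example len(page.get_text("text").strip()) for each page) so that it runs without PyMuPDF. The name pages_needing_ocr and the per-page use of the threshold are assumptions, not part of the pipeline above.

```python
def pages_needing_ocr(char_counts: list[int], min_chars: int = 20) -> list[int]:
    """Return zero-based indexes of pages whose text layer is too short to trust."""
    return [i for i, count in enumerate(char_counts) if count <= min_chars]


# Illustrative counts for a five-page file: pages 1 and 3 look like scans.
counts = [1843, 0, 2210, 7, 1502]
print(pages_needing_ocr(counts))  # → [1, 3]
```

Only the returned pages would then go through rasterising, preprocessing, and OCR.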
Step 2 — Rasterize PDF Pages
Convert each page to a PIL Image at 300 DPI minimum. Below 300 DPI, character strokes blur and Tesseract misreads adjacent characters.
Two options: pdf2image (wraps Poppler, simple API) or PyMuPDF (no Poppler dependency, faster on large files).
# pip install pymupdf Pillow
from pathlib import Path
from PIL import Image
import fitz
def rasterize_pdf_pymupdf(pdf_path: Path, dpi: int = 300) -> list[Image.Image]:
    """Render each page of a PDF to a PIL Image at the specified DPI."""
    pages: list[Image.Image] = []
    try:
        doc = fitz.open(pdf_path)
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            pages.append(img)
        doc.close()
    except Exception as exc:
        raise RuntimeError(f"Rasterize failed for {pdf_path}: {exc}") from exc
    return pages

# Alternative: pdf2image (requires poppler installed separately)
# from pdf2image import convert_from_path
# pages = convert_from_path("contract_scan.pdf", dpi=300)
PyMuPDF does not require Poppler. pdf2image requires poppler-utils (apt-get install poppler-utils / brew install poppler) but can be more convenient for batch workflows.
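To get a feel for what a DPI setting means in pixels: PDF page sizes are expressed in points (1/72 inch), so the scale factor is dpi / 72. The helper below is a small worked calculation; page_pixels is a hypothetical name, and a real renderer may differ by a pixel depending on how it rounds.

```python
def page_pixels(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    """Convert a page size in PDF points (1/72 inch) to pixel dimensions at a given DPI."""
    zoom = dpi / 72
    return round(width_pt * zoom), round(height_pt * zoom)


print(page_pixels(595, 842, 300))  # A4        → (2479, 3508)
print(page_pixels(612, 792, 300))  # US Letter → (2550, 3300)
print(page_pixels(595, 842, 150))  # A4 at half the resolution → (1240, 1754)
```

Halving the DPI cuts the pixel count to a quarter, which is why text that is sharp at 300 DPI can blur at 150.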
Step 3 — Preprocess Images for OCR
Raw scans have low contrast, background grain, and slight rotation. Preprocessing before OCR is not optional for production accuracy: on typical office scans it can lift recognition confidence from the 40–60 % range to 85–95 %.
# pip install opencv-python numpy Pillow
from pathlib import Path
import cv2
import numpy as np
from PIL import Image
def preprocess_for_ocr(pil_img: Image.Image) -> np.ndarray:
    """
    Convert a PIL Image to a denoised, deskewed, binarised numpy array
    suitable for Tesseract input.
    """
    # Convert PIL → OpenCV grayscale
    img = cv2.cvtColor(np.array(pil_img), cv2.COLOR_RGB2GRAY)

    # Otsu's binarisation: the threshold is calculated automatically
    _, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Fast non-local means denoise (h=10 is good for mild scanner grain)
    denoised = cv2.fastNlMeansDenoising(thresh, h=10)

    # Deskew via minAreaRect on foreground pixels. THRESH_BINARY leaves the
    # text dark on a white background, so the foreground is the dark pixels.
    # minAreaRect expects int32 or float32 points, hence the cast.
    coords = np.column_stack(np.where(denoised < 128)).astype(np.float32)
    if len(coords) >= 4:
        angle = cv2.minAreaRect(coords)[-1]
        # Fold the angle into [-45, 45]. Older and newer OpenCV releases
        # report it in different ranges ([-90, 0) versus (0, 90]).
        if angle > 45:
            angle -= 90
        elif angle < -45:
            angle += 90
        angle = -angle
        h, w = denoised.shape[:2]
        M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
        denoised = cv2.warpAffine(
            denoised, M, (w, h),
            flags=cv2.INTER_CUBIC,
            borderMode=cv2.BORDER_REPLICATE,
        )
    return denoised


def save_debug_image(arr: np.ndarray, output_path: Path) -> None:
    """Optionally save the preprocessed image for visual inspection."""
    cv2.imwrite(str(output_path), arr)

When to Skip Which Steps
- Already-binarised images (black-and-white TIFF scans): skip the threshold step; applying Otsu again may invert regions.
- Digital-born PDFs saved as images (e.g. screenshots): skip deskew; they are already axis-aligned.
- High-resolution colour scans (600+ DPI): apply cv2.resize to scale down to 300 DPI first; larger images slow Tesseract without accuracy benefit.
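For the downscaling case in the last bullet, the target size follows from the ratio of the two resolutions. A minimal sketch, assuming the scan DPI is known; downscale_size is a hypothetical helper, and the resulting (width, height) tuple is what would be passed as the dsize argument of cv2.resize.

```python
def downscale_size(
    width_px: int, height_px: int, src_dpi: int, target_dpi: int = 300
) -> tuple[int, int]:
    """Pixel size after resampling from src_dpi to target_dpi; never upscales."""
    if src_dpi <= target_dpi:
        return width_px, height_px
    scale = target_dpi / src_dpi
    return round(width_px * scale), round(height_px * scale)


# An A4 page scanned at 600 DPI is roughly 4960 x 7016 pixels.
print(downscale_size(4960, 7016, src_dpi=600))  # → (2480, 3508)

# With OpenCV the call would then look like:
#   small = cv2.resize(img, (2480, 3508), interpolation=cv2.INTER_AREA)
```

cv2.INTER_AREA is the interpolation mode generally recommended for shrinking images.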
For adaptive thresholding on low-contrast or shadow-heavy scans, replace the Otsu step with:
# pip install opencv-python
import cv2
# Adaptive Gaussian threshold — better for uneven illumination
thresh = cv2.adaptiveThreshold(
    img, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    blockSize=31,  # must be odd; increase for larger text
    C=10,
)
Step 4 — Run Tesseract OCR
image_to_string returns plain text. image_to_data returns per-word bounding boxes and confidence scores — use it whenever you need spatial layout or confidence filtering.
# pip install pytesseract Pillow
from pathlib import Path
import pytesseract
from PIL import Image
import numpy as np
def ocr_page(arr: np.ndarray, lang: str = "eng", psm: int = 3) -> str:
    """
    Run Tesseract on a preprocessed numpy array and return the full text.
    psm=3: fully automatic page segmentation (default).
    psm=6: assume a single uniform block of text.
    psm=11: sparse text; find as much text as possible in no particular order.
    """
    pil_img = Image.fromarray(arr)
    config = f"--psm {psm}"
    try:
        return pytesseract.image_to_string(pil_img, lang=lang, config=config)
    except pytesseract.pytesseract.TesseractNotFoundError as exc:
        raise RuntimeError(
            "Tesseract binary not found; see "
            "/automating-pdf-extraction-generation/scanning-and-ocr-processing-with-python/fix-tesseract-not-found-error/"
        ) from exc


def ocr_page_with_confidence(
    arr: np.ndarray,
    lang: str = "eng",
    psm: int = 3,
    min_conf: int = 60,
) -> str:
    """
    Run OCR and discard tokens below min_conf. Reduces noise in output
    at the cost of occasionally dropping low-quality but correct characters.
    """
    pil_img = Image.fromarray(arr)
    config = f"--psm {psm}"
    data = pytesseract.image_to_data(
        pil_img, lang=lang, config=config,
        output_type=pytesseract.Output.DICT,
    )
    tokens = [
        data["text"][i]
        for i in range(len(data["text"]))
        if int(data["conf"][i]) >= min_conf and data["text"][i].strip()
    ]
    return " ".join(tokens)

Page Segmentation Mode Reference
| PSM | Use when |
|---|---|
| 3 (default) | Multi-column page with mixed content |
| 4 | Single column, variable text sizes |
| 6 | Single uniform block of text |
| 7 | Single text line |
| 11 | Sparse text — forms with scattered labels |
| 13 | Raw line — no layout analysis |
Wrong PSM on multi-column pages causes Tesseract to concatenate columns horizontally, producing garbled output. When in doubt, try --psm 3 first, then --psm 4 if columns merge incorrectly.
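The "try 3, then 4" advice can be encoded as an ordered list of config strings to run until the output looks right. The sketch below only builds and validates the strings; build_config and psm_fallbacks are hypothetical helper names, and the resulting string is what would be passed as the config argument of pytesseract.image_to_string.

```python
VALID_PSM = range(0, 14)  # Tesseract accepts page segmentation modes 0 to 13


def build_config(psm: int = 3, extra: str = "") -> str:
    """Build a Tesseract config string, rejecting PSM values outside 0-13."""
    if psm not in VALID_PSM:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--psm {psm} {extra}".strip()


def psm_fallbacks(order: tuple[int, ...] = (3, 4)) -> list[str]:
    """Config strings to try in order, e.g. automatic first, then single column."""
    return [build_config(psm) for psm in order]


print(psm_fallbacks())         # → ['--psm 3', '--psm 4']
print(build_config(6, "--oem 1"))  # → --psm 6 --oem 1
```

Validating the mode up front gives a clear Python error instead of a failure inside the Tesseract subprocess.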
Step 5 — Confidence Filtering
image_to_data reports a confidence of -1 for rows that are not words (the page, block, paragraph, and line entries). Filter those out before joining:
# pip install pytesseract Pillow
import pytesseract
from PIL import Image
import numpy as np
def extract_high_confidence_words(
    arr: np.ndarray,
    min_conf: int = 65,
) -> list[dict]:
    """
    Return a list of dicts with text, bounding box, and confidence
    for tokens that pass the confidence threshold.
    """
    pil_img = Image.fromarray(arr)
    data = pytesseract.image_to_data(pil_img, output_type=pytesseract.Output.DICT)
    results = []
    for i, conf in enumerate(data["conf"]):
        if int(conf) < min_conf:
            continue
        text = data["text"][i].strip()
        if not text:
            continue
        results.append({
            "text": text,
            "conf": int(conf),
            "x": data["left"][i],
            "y": data["top"][i],
            "w": data["width"][i],
            "h": data["height"][i],
        })
    return results

The spatial data (x, y, w, h) is useful downstream — the coordinate-mapping approach in How to Extract Tables from Scanned PDFs uses these bounding boxes to reconstruct row and column structure without needing vector lines.
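Joining filtered tokens with single spaces discards line breaks. As a sketch, lines can be rebuilt from the block_num, par_num, and line_num columns that image_to_data returns alongside the coordinates. The function name words_to_lines and the sample dict are illustrative; the sample is hand-written to mimic the shape of the Output.DICT result, not real OCR output.

```python
from collections import defaultdict


def words_to_lines(data: dict, min_conf: int = 65) -> list[str]:
    """Rebuild text lines from an image_to_data-style dict, in reading order."""
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        word = text.strip()
        # Skip structural rows (conf -1), empty tokens, and low-confidence words.
        if not word or float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["left"][i], word))
    # Sort lines by (block, paragraph, line) and words by their left edge.
    return [" ".join(w for _, w in sorted(words)) for _, words in sorted(lines.items())]


sample = {
    "text": ["", "Invoice", "No.", "1042", "", "Total", "due", "~~"],
    "conf": [-1, 96, 91, 88, -1, 93, 90, 12],
    "left": [0, 40, 160, 220, 0, 40, 130, 200],
    "block_num": [1, 1, 1, 1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1, 1, 1, 1],
    "line_num": [1, 1, 1, 1, 2, 2, 2, 2],
}
print(words_to_lines(sample))  # → ['Invoice No. 1042', 'Total due']
```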
Step 6 — Build a Searchable PDF
Embed an invisible OCR text layer over the original scan so the file is searchable without changing how it looks.
# pip install pymupdf pytesseract Pillow
from pathlib import Path
import fitz # PyMuPDF
import pytesseract
from PIL import Image
def make_searchable_pdf(input_pdf: Path, output_pdf: Path, lang: str = "eng") -> None:
    """
    For each page, run Tesseract to generate a hidden-text PDF overlay,
    then merge that overlay onto the original scan page.
    """
    try:
        doc = fitz.open(input_pdf)
        for page_num, page in enumerate(doc):
            pix = page.get_pixmap(dpi=300)
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            # Tesseract's PDF output normally contains the page image plus
            # invisible text. textonly_pdf=1 keeps only the invisible text,
            # so the original scan stays the visible layer.
            ocr_pdf_bytes = pytesseract.image_to_pdf_or_hocr(
                img, extension="pdf", lang=lang, config="-c textonly_pdf=1",
            )
            overlay = fitz.open("pdf", ocr_pdf_bytes)
            # show_pdf_page draws the overlay page onto the original page
            page.show_pdf_page(page.rect, overlay, 0)
            overlay.close()
        doc.save(str(output_pdf), garbage=4, deflate=True)
        doc.close()
        print(f"Searchable PDF written → {output_pdf}")
    except Exception as exc:
        raise RuntimeError(f"Failed to create searchable PDF: {exc}") from exc


make_searchable_pdf(
    Path("scans/invoice_001.pdf"),
    Path("output/invoice_001_searchable.pdf"),
)
The resulting file is visually identical to the original. PDF viewers, grep tools, and indexers can now find text in it. You can batch these files with the pattern in Batch Merge PDFs with a Python Script to produce a single searchable archive.
Edge Cases and Variants
Multi-Language Documents
Pass a +-delimited language string. Language packs must be installed separately:
# pip install pytesseract Pillow
import pytesseract
from PIL import Image
img = Image.open("multilingual_form.png")
# Requires tesseract-ocr-deu and tesseract-ocr-fra installed
text = pytesseract.image_to_string(img, lang="eng+deu+fra")
Check which packs are installed: tesseract --list-langs.
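A missing pack only surfaces as an error once Tesseract runs, so it can help to compare the requested language string with the installed list first. A minimal sketch: missing_languages is a hypothetical helper, and the installed set below is made up for illustration (in practice it would come from the tesseract --list-langs output).

```python
def missing_languages(lang: str, installed: set[str]) -> list[str]:
    """Return the codes in a '+'-delimited lang string that are not installed."""
    return [code for code in lang.split("+") if code and code not in installed]


installed = {"eng", "osd", "deu"}  # illustrative, not a real query result
print(missing_languages("eng+deu+fra", installed))  # → ['fra']
```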
Handwriting
Standard Tesseract (eng) performs poorly on cursive or non-standard print. Alternatives:
- EasyOCR (pip install easyocr): handles handwriting better, runs on CPU but is slower.
- Cloud APIs (AWS Textract, Google Vision): higher accuracy, cost per page, network dependency.
- Tesseract script/HanS for CJK scripts: install via apt-get install tesseract-ocr-script-hans.
Colour-Tinted or Watermarked Scans
Binarisation with Otsu can obliterate faint watermarks alongside background colour. To preserve more signal, use HSV channel separation before thresholding:
# pip install opencv-python numpy
import cv2
import numpy as np
def threshold_on_value_channel(bgr_img: np.ndarray) -> np.ndarray:
    """
    Extract the V channel from HSV, then apply Otsu.
    Reduces colour interference from stamps, highlights, or watermarks.
    """
    hsv = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2HSV)
    v_channel = hsv[:, :, 2]
    _, thresh = cv2.threshold(v_channel, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return thresh
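Why the V channel helps can be seen with a little arithmetic. Standard grayscale conversion weights the channels (0.299 R + 0.587 G + 0.114 B), whereas the HSV value channel is simply the largest of the three. The sketch below compares the two for a few colours; the RGB triples are illustrative guesses at typical scan colours, not measurements.

```python
def gray_value(r: int, g: int, b: int) -> int:
    """Luma weighting used by the standard RGB-to-grayscale conversion."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def v_value(r: int, g: int, b: int) -> int:
    """HSV value channel: the largest of the three colour components."""
    return max(r, g, b)


samples = {
    "black ink": (20, 20, 20),
    "red stamp": (220, 30, 30),
    "yellow highlight": (250, 240, 90),
    "paper": (245, 245, 240),
}
for name, rgb in samples.items():
    print(f"{name:17s} gray={gray_value(*rgb):3d}  V={v_value(*rgb):3d}")
```

With these values the red stamp sits at gray 87 but V 220: in grayscale it lands close to the ink, in the V channel it lands close to the paper, so a threshold separates it from the text. The trade-off is that saturated coloured text (a blue-ink signature, for instance) fades in the same way.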
Validation
After running the pipeline, verify output quality before committing to a batch run:
# pip install pytesseract Pillow
import numpy as np
import pytesseract
from PIL import Image


def check_ocr_quality(arr: np.ndarray, min_mean_conf: float = 70.0) -> dict:
    """
    Returns mean confidence and a pass/fail flag.
    Fail means the scan quality or preprocessing needs adjustment.
    """
    pil_img = Image.fromarray(arr)
    data = pytesseract.image_to_data(pil_img, output_type=pytesseract.Output.DICT)
    # Rows with conf == -1 are structural (page, block, line), not words.
    confs = [int(c) for c in data["conf"] if int(c) != -1]
    if not confs:
        return {"mean_conf": 0.0, "ok": False, "word_count": 0}
    mean_conf = sum(confs) / len(confs)
    return {
        "mean_conf": round(mean_conf, 1),
        "ok": mean_conf >= min_mean_conf,
        "word_count": len([t for t in data["text"] if t.strip()]),
    }
A mean confidence below 60 % usually means the DPI is too low, the binarisation is misconfigured, or the scan has severe physical damage. Inspect the preprocessed image with save_debug_image() before investigating further.
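For a multi-page trial run, the per-page reports can be rolled up into a single summary before deciding whether to start the batch. This is a sketch under the assumption that each report has the shape returned by check_ocr_quality(); summarise_reports is a hypothetical helper, and the word-count weighting is only approximate.

```python
def summarise_reports(reports: dict[int, dict]) -> dict:
    """Aggregate per-page reports shaped like check_ocr_quality() output."""
    failed = sorted(page for page, r in reports.items() if not r["ok"])
    total_words = sum(r["word_count"] for r in reports.values())
    # Weight each page's mean by its word count so near-empty pages do not dominate.
    weighted = sum(r["mean_conf"] * r["word_count"] for r in reports.values())
    overall = round(weighted / total_words, 1) if total_words else 0.0
    return {"failed_pages": failed, "overall_conf": overall, "word_count": total_words}


# Illustrative reports for a three-page sample.
reports = {
    0: {"mean_conf": 91.2, "ok": True, "word_count": 400},
    1: {"mean_conf": 52.0, "ok": False, "word_count": 100},
    2: {"mean_conf": 0.0, "ok": False, "word_count": 0},
}
print(summarise_reports(reports))
# → {'failed_pages': [1, 2], 'overall_conf': 83.4, 'word_count': 500}
```

The failed_pages list tells you which preprocessed images to inspect first.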
Performance and Scale
- Memory: Each 300-DPI A4 page rasterises to roughly 25–35 MB in RAM as a PIL Image. For batch jobs over 100+ pages, process one page at a time and del/gc between pages.
- Speed: Each pytesseract call runs the Tesseract binary as a subprocess on one image. Use concurrent.futures.ProcessPoolExecutor to parallelise across pages or files; avoid ThreadPoolExecutor for the rasterising step, because PyMuPDF does not support multithreaded use.
- Chunking large PDFs: Open with fitz.open(), iterate page by page, and set pix = None after rasterising each page to release the pixmap.
# pip install pymupdf pytesseract Pillow
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor
import fitz
import pytesseract
from PIL import Image
def _ocr_page_index(args):
    pdf_path, page_num = args
    doc = fitz.open(pdf_path)
    page = doc[page_num]
    pix = page.get_pixmap(dpi=300)
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    doc.close()
    return page_num, pytesseract.image_to_string(img)


def batch_ocr(pdf_path: Path, max_workers: int = 4) -> dict[int, str]:
    """Returns {page_num: text} for all pages, parallelised."""
    doc = fitz.open(pdf_path)
    page_count = len(doc)
    doc.close()
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(_ocr_page_index, [(str(pdf_path), i) for i in range(page_count)])
    return dict(results)
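Parallel workers each hold a rasterised page in memory, so the worker count is bounded by RAM as much as by CPU cores. The arithmetic below is a rough sketch: raster_bytes and workers_for_budget are hypothetical helpers, and copies_per_page=3 is an assumed allowance for the pixmap, the PIL image, and one working copy.

```python
def raster_bytes(width_px: int, height_px: int, channels: int = 3) -> int:
    """Uncompressed size in bytes of an image with 8 bits per channel."""
    return width_px * height_px * channels


def workers_for_budget(
    page_bytes: int, budget_bytes: int, copies_per_page: int = 3, cap: int = 4
) -> int:
    """Largest worker count whose in-flight pages fit the budget (at least 1)."""
    per_worker = page_bytes * copies_per_page
    return max(1, min(cap, budget_bytes // per_worker))


a4_300dpi = raster_bytes(2480, 3508)
print(a4_300dpi)                                      # → 26099520 (about 26 MB)
print(workers_for_budget(a4_300dpi, 256 * 1024**2))   # 256 MiB budget → 3
print(workers_for_budget(a4_300dpi, 1024**3))         # 1 GiB budget → 4 (capped)
```

On platforms that start worker processes with spawn (Windows, and macOS by default), the call to batch_ocr also needs to sit under an if __name__ == "__main__": guard.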
Troubleshooting
| Error / symptom | Root cause | Fix |
|---|---|---|
| TesseractNotFoundError | Tesseract binary not installed or not on PATH | Install the binary; see Fix TesseractNotFoundError |
| Empty string from image_to_string | PDF has a text layer (not a scan), or image too small / too low DPI | Run classify_pdf() first; enforce dpi=300 |
| Garbled multi-column text | Wrong PSM: Tesseract reads columns left-to-right as a single line | Set --psm 3 or --psm 4 |
| Error, could not initialize tesseract API with language "xxx" | Language pack not installed | apt-get install tesseract-ocr-xxx or set TESSDATA_PREFIX |
| Output truncated mid-page | Tesseract timeout on very large images | Resize to 300 DPI before passing; split oversized scans into quadrants |
| Deskew rotates page 90° | minAreaRect angle ambiguity on near-vertical text blocks | Clamp angle: if abs(angle) > 45: angle = 0 and skip rotation |
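The clamp suggested in the last row could be sketched as a small guard applied to the angle just before it is passed to cv2.getRotationMatrix2D. The name plausible_skew is hypothetical; the optional tighter limit is an assumption for scans that are never expected to be more than a few degrees off.

```python
def plausible_skew(angle: float, limit: float = 45.0) -> float:
    """Return the angle if it is a believable skew correction, else 0.0 (no rotation)."""
    if abs(angle) > limit:
        return 0.0
    return angle


print(plausible_skew(-2.5))            # → -2.5 (small skew, rotate as usual)
print(plausible_skew(88.0))            # → 0.0  (would flip the page, so skip)
print(plausible_skew(10.0, limit=5))   # → 0.0  (stricter limit for flatbed scans)
```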
Complete Pipeline Script
#!/usr/bin/env python3
# pip install pymupdf pytesseract opencv-python Pillow numpy
"""
ocr_pipeline.py — Rasterize a scanned PDF, preprocess pages, run OCR,
and write a searchable PDF to the output path.
Usage:
python ocr_pipeline.py input.pdf output_searchable.pdf --lang eng --psm 3
"""
import argparse
import gc
from pathlib import Path
import cv2
import fitz
import numpy as np
import pytesseract
from PIL import Image
def preprocess(pil_img: Image.Image) -> np.ndarray:
    img = cv2.cvtColor(np.array(pil_img), cv2.COLOR_RGB2GRAY)
    _, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    denoised = cv2.fastNlMeansDenoising(thresh, h=10)
    # Foreground is the dark text; minAreaRect expects int32 or float32 points.
    coords = np.column_stack(np.where(denoised < 128)).astype(np.float32)
    if len(coords) >= 4:
        angle = cv2.minAreaRect(coords)[-1]
        # Fold into [-45, 45]; OpenCV releases differ in the angle range reported.
        if angle > 45:
            angle -= 90
        elif angle < -45:
            angle += 90
        angle = -angle
        h, w = denoised.shape[:2]
        M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
        denoised = cv2.warpAffine(denoised, M, (w, h),
                                  flags=cv2.INTER_CUBIC,
                                  borderMode=cv2.BORDER_REPLICATE)
    return denoised

def run(input_pdf: Path, output_pdf: Path, lang: str, psm: int) -> None:
    doc = fitz.open(input_pdf)
    for i, page in enumerate(doc):
        pix = page.get_pixmap(dpi=300)
        raw_img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        pix = None  # release pixmap memory
        preprocessed = preprocess(raw_img)
        pil_preprocessed = Image.fromarray(preprocessed)
        # textonly_pdf=1 makes Tesseract emit only the invisible text layer;
        # without it the binarised image would be drawn over the original scan.
        ocr_pdf_bytes = pytesseract.image_to_pdf_or_hocr(
            pil_preprocessed, extension="pdf",
            lang=lang, config=f"--psm {psm} -c textonly_pdf=1",
        )
        overlay = fitz.open("pdf", ocr_pdf_bytes)
        page.show_pdf_page(page.rect, overlay, 0)
        overlay.close()
        gc.collect()
        print(f"  processed page {i + 1}/{len(doc)}")
    doc.save(str(output_pdf), garbage=4, deflate=True)
    doc.close()
    print(f"Done → {output_pdf}")

def main() -> None:
    ap = argparse.ArgumentParser(description="OCR pipeline for scanned PDFs")
    ap.add_argument("input", type=Path, help="Input scanned PDF")
    ap.add_argument("output", type=Path, help="Output searchable PDF")
    ap.add_argument("--lang", default="eng", help="Tesseract language (default: eng)")
    ap.add_argument("--psm", type=int, default=3, help="Page segmentation mode (default: 3)")
    args = ap.parse_args()
    if not args.input.exists():
        raise FileNotFoundError(f"Input not found: {args.input}")
    args.output.parent.mkdir(parents=True, exist_ok=True)
    try:
        run(args.input, args.output, args.lang, args.psm)
    except pytesseract.pytesseract.TesseractNotFoundError as exc:
        raise SystemExit(
            "Tesseract not found. See: "
            "/automating-pdf-extraction-generation/scanning-and-ocr-processing-with-python/fix-tesseract-not-found-error/"
        ) from exc


if __name__ == "__main__":
    main()
Related
- Fix TesseractNotFoundError in Python — binary install and PATH config for all platforms
- How to Extract Tables from Scanned PDFs — coordinate-clustering to reconstruct tabular structure from OCR bounding boxes
- Extracting Tables from PDFs — pdfplumber, camelot, and tabula for vector-text PDFs
- Merging and Splitting PDF Documents — batch-process and archive the searchable PDFs this pipeline produces