Scanning and OCR Processing with Python

Scanned PDFs are image containers — no embedded text, no selectable characters. Every standard parser returns empty results because there is nothing to parse. The fix is a three-stage pipeline: rasterize each page to a high-DPI image, preprocess that image to maximise contrast and alignment, then feed it to an OCR engine that produces a text layer you can search, index, or pipe into downstream extraction.

Generic "just call pytesseract" tutorials skip the preprocessing stage. That works on clean scans at ideal DPI; it fails on faded invoices, skewed photographs, and low-contrast forms. This guide covers the full pipeline end-to-end.

Prerequisites

System binary and Python packages must both be present. pytesseract is only a wrapper — without the Tesseract binary the wrapper raises TesseractNotFoundError immediately. If you hit that error, see Fix TesseractNotFoundError in Python.

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows — download from https://github.com/UB-Mannheim/tesseract/wiki and add to PATH

# Python packages
pip install pytesseract pymupdf pdf2image Pillow opencv-python numpy

# For extra language packs (e.g. German + French):
sudo apt-get install tesseract-ocr-deu tesseract-ocr-fra

Verify the installation before writing any pipeline code:

tesseract --version
python -c "import pytesseract; print(pytesseract.get_tesseract_version())"
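
The same check can run from Python before a batch job starts. The sketch below uses only the standard library; the helper name find_tesseract_binary is hypothetical and not part of pytesseract.

```python
import shutil
from typing import Optional


def find_tesseract_binary(name: str = "tesseract") -> Optional[str]:
    """Return the full path of the binary if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tesseract_binary()
    if path is None:
        print("tesseract is not on PATH; install it before running the pipeline")
    else:
        print(f"tesseract found at {path}")
```

Failing early here is cheaper than discovering the missing binary on the first OCR call of a long job.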

Step 1 — Diagnose the PDF Before Choosing a Path

Not every "scanned" PDF is purely raster. Some contain a hidden text layer added by a previous OCR pass. Running the full preprocessing+OCR pipeline on those wastes time and can degrade quality. Use PyMuPDF to check first.

# pip install pymupdf
from pathlib import Path
import fitz  # PyMuPDF

def classify_pdf(pdf_path: Path) -> str:
    """
    Returns 'text' if the PDF has an embedded text layer,
    'raster' if it is a pure scan that needs OCR.
    """
    try:
        # The context manager closes the document even if text extraction fails
        with fitz.open(pdf_path) as doc:
            char_count = sum(len(page.get_text("text").strip()) for page in doc)
        return "text" if char_count > 20 else "raster"
    except Exception as exc:
        raise RuntimeError(f"Could not open {pdf_path}: {exc}") from exc

pdf = Path("contract_scan.pdf")
kind = classify_pdf(pdf)
print(f"{pdf.name}: {kind}")
# → contract_scan.pdf: raster

If the result is text, use pdfplumber or camelot directly — no OCR needed. If it is raster, continue with the pipeline below.
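
Some files mix both kinds of page, for example a typed cover letter followed by scanned attachments. A per-page variant of the same check might look like the sketch below. The function names and the per-page threshold of 20 characters are assumptions for illustration, and PyMuPDF is imported lazily so the decision logic can be exercised on its own.

```python
def classify_pages(char_counts: list[int], min_chars: int = 20) -> list[str]:
    """Label each page 'text' or 'raster' from its extracted character count."""
    return ["text" if count > min_chars else "raster" for count in char_counts]


def page_char_counts(pdf_path) -> list[int]:
    """Count stripped text characters on every page of a PDF."""
    import fitz  # PyMuPDF, imported here so classify_pages has no dependency

    with fitz.open(pdf_path) as doc:
        return [len(page.get_text("text").strip()) for page in doc]


if __name__ == "__main__":
    # Hypothetical counts for a three-page file: scan, typed page, scan
    print(classify_pages([0, 350, 5]))
```

With per-page labels you can send only the raster pages through OCR and read the rest directly.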

Step 2 — Rasterize PDF Pages

Convert each page to a PIL Image at 300 DPI minimum. Below 300 DPI, character strokes blur and Tesseract misreads adjacent characters.

Two options: pdf2image (wraps Poppler, simple API) or PyMuPDF (no Poppler dependency, faster on large files).

# pip install pymupdf Pillow
from pathlib import Path
from PIL import Image
import fitz

def rasterize_pdf_pymupdf(pdf_path: Path, dpi: int = 300) -> list[Image.Image]:
    """Render each page of a PDF to a PIL Image at the specified DPI."""
    pages: list[Image.Image] = []
    try:
        # The context manager closes the document even if rendering fails
        with fitz.open(pdf_path) as doc:
            for page in doc:
                pix = page.get_pixmap(dpi=dpi)
                img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
                pages.append(img)
    except Exception as exc:
        raise RuntimeError(f"Rasterize failed for {pdf_path}: {exc}") from exc
    return pages

# Alternative: pdf2image (requires poppler installed separately)
# from pdf2image import convert_from_path
# pages = convert_from_path("contract_scan.pdf", dpi=300)

PyMuPDF does not require Poppler. pdf2image requires poppler-utils (apt-get install poppler-utils / brew install poppler) but can be more convenient for batch workflows.
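
To get a feel for what a DPI setting costs, the arithmetic can be worked through directly: PDF page sizes are measured in points of 1/72 inch, so the pixel size is the point size times dpi / 72. The helper names below are hypothetical, and the result is approximate because a renderer may round a pixel differently.

```python
def raster_size(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / 72  # PDF points are 1/72 inch
    return round(width_pt * scale), round(height_pt * scale)


def rgb_bytes(width_px: int, height_px: int) -> int:
    """Uncompressed size of an 8-bit RGB image (3 bytes per pixel)."""
    return width_px * height_px * 3


if __name__ == "__main__":
    w, h = raster_size(595, 842)  # A4 is roughly 595 x 842 points
    print(f"{w} x {h} px, about {rgb_bytes(w, h) / 1e6:.1f} MB")
    # prints: 2479 x 3508 px, about 26.1 MB
```

Doubling the DPI quadruples the pixel count, which is why very high resolutions slow the later stages so noticeably.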

Step 3 — Preprocess Images for OCR

[Figure: Preprocess → OCR → Postprocess pipeline. A scanned PDF is rasterized to 300 DPI PIL Images (pdf2image or PyMuPDF), preprocessed with OpenCV / Pillow (grayscale, Otsu threshold, deskew, denoise), read by Tesseract through pytesseract (image_to_string, image_to_data, PSM flags, language packs), and postprocessed (confidence filter, text normalisation, embedded text layer) into a searchable PDF with PyMuPDF.]

Raw scans have low contrast, background grain, and slight rotation. Preprocessing before OCR is not optional for production accuracy: on typical office scans it can lift character recognition confidence from the 40–60 % range to 85–95 %.

# pip install opencv-python numpy Pillow
from pathlib import Path
import cv2
import numpy as np
from PIL import Image

def preprocess_for_ocr(pil_img: Image.Image) -> np.ndarray:
    """
    Convert a PIL Image to a denoised, deskewed, binarised numpy array
    suitable for Tesseract input.
    """
    # Convert PIL → OpenCV grayscale
    img = cv2.cvtColor(np.array(pil_img), cv2.COLOR_RGB2GRAY)

    # Otsu's binarisation — optimal threshold calculated automatically
    _, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Fast non-local means denoise (h=10 is good for mild scanner grain)
    denoised = cv2.fastNlMeansDenoising(thresh, h=10)

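    # Deskew via minAreaRect on foreground (text) pixels. After THRESH_BINARY the
    # text is dark on a white page, so select the dark pixels rather than `> 0`.
    coords = np.column_stack(np.where(denoised < 128)).astype(np.float32)
    if len(coords) >= 4:
        rect = cv2.minAreaRect(coords)
        angle = rect[-1]
        # Older OpenCV reports this angle in [-90, 0), newer releases in (0, 90];
        # fold either range into a small correction around zero.
        if angle < -45:
            angle = -(90 + angle)
        elif angle > 45:
            angle = 90 - angle
        else:
            angle = -angle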
        h, w = denoised.shape[:2]
        M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
        denoised = cv2.warpAffine(
            denoised, M, (w, h),
            flags=cv2.INTER_CUBIC,
            borderMode=cv2.BORDER_REPLICATE,
        )

    return denoised


def save_debug_image(arr: np.ndarray, output_path: Path) -> None:
    """Optionally save the preprocessed image for visual inspection."""
    cv2.imwrite(str(output_path), arr)
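
The deskew step depends on how the angle from cv2.minAreaRect is interpreted. Depending on the installed OpenCV version, that angle may fall in [-90, 0) or in (0, 90], and an unfolded value near 90 turns a straight page on its side. The sketch below isolates the folding logic so it can be checked without an image. The function name is hypothetical, and the sign convention assumes the (row, column) point order used in the deskew step above.

```python
def normalise_skew_angle(angle: float) -> float:
    """
    Fold a minAreaRect angle from either convention into a rotation
    in [-45, 45] degrees suitable for cv2.getRotationMatrix2D.
    """
    if angle < -45:
        return -(90 + angle)
    if angle > 45:
        return 90 - angle
    return -angle


if __name__ == "__main__":
    # The same physical skew expressed in both conventions gives one answer
    print(normalise_skew_angle(-88), normalise_skew_angle(2))
    # An axis-aligned page needs no rotation in either convention
    print(normalise_skew_angle(-90), normalise_skew_angle(90))
```

Keeping the result within 45 degrees of zero means a page is only ever straightened, never flipped to landscape.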

When to Skip Which Steps

  • Already-binarised images (black-and-white TIFF scans): skip the threshold step; applying Otsu again may invert regions.
  • Digital-born PDFs saved as images (e.g. screenshots): skip deskew; they are already axis-aligned.
  • High-resolution colour scans (600+ DPI): apply cv2.resize to scale down to 300 DPI first — larger images slow Tesseract without accuracy benefit.
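
For the downscaling case, the target size follows from the ratio of the two resolutions. A minimal sketch, assuming you know the DPI the scan was made at; both function names are hypothetical, and OpenCV is imported lazily so the size calculation stands alone.

```python
def downscale_size(
    width_px: int, height_px: int, source_dpi: int, target_dpi: int = 300
) -> tuple[int, int]:
    """Pixel size after reducing source_dpi to target_dpi; never upscales."""
    if source_dpi <= target_dpi:
        return width_px, height_px
    scale = target_dpi / source_dpi
    return max(1, round(width_px * scale)), max(1, round(height_px * scale))


def downscale_to_dpi(arr, source_dpi: int, target_dpi: int = 300):
    """Resize a numpy image array down to the target DPI using area averaging."""
    import cv2  # imported here so downscale_size has no dependency

    height, width = arr.shape[:2]
    new_w, new_h = downscale_size(width, height, source_dpi, target_dpi)
    if (new_w, new_h) == (width, height):
        return arr
    # cv2.resize takes the size as (width, height); INTER_AREA suits shrinking
    return cv2.resize(arr, (new_w, new_h), interpolation=cv2.INTER_AREA)


if __name__ == "__main__":
    print(downscale_size(4960, 7016, 600))  # an A4 page scanned at 600 DPI
```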

For adaptive thresholding on low-contrast or shadow-heavy scans, replace the Otsu step with:

# pip install opencv-python
import cv2

# Adaptive Gaussian threshold — better for uneven illumination
thresh = cv2.adaptiveThreshold(
    img, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    blockSize=31,   # must be odd; increase for larger text
    C=10,
)

Step 4 — Run Tesseract OCR

image_to_string returns plain text. image_to_data returns per-word bounding boxes and confidence scores — use it whenever you need spatial layout or confidence filtering.

# pip install pytesseract Pillow
from pathlib import Path
import pytesseract
from PIL import Image
import numpy as np

def ocr_page(arr: np.ndarray, lang: str = "eng", psm: int = 3) -> str:
    """
    Run Tesseract on a preprocessed numpy array and return the full text.
    psm=3: fully automatic page segmentation (default).
    psm=6: assume a single uniform block of text.
    psm=11: sparse text — find as much text as possible in no particular order.
    """
    pil_img = Image.fromarray(arr)
    config = f"--psm {psm}"
    try:
        return pytesseract.image_to_string(pil_img, lang=lang, config=config)
    except pytesseract.pytesseract.TesseractNotFoundError as exc:
        # Chain the original exception so the traceback keeps the root cause
        raise RuntimeError(
            "Tesseract binary not found. See "
            "/automating-pdf-extraction-generation/scanning-and-ocr-processing-with-python/fix-tesseract-not-found-error/"
        ) from exc


def ocr_page_with_confidence(
    arr: np.ndarray,
    lang: str = "eng",
    psm: int = 3,
    min_conf: int = 60,
) -> str:
    """
    Run OCR and discard tokens below min_conf. Reduces noise in output
    at the cost of occasionally dropping low-quality but correct characters.
    """
    pil_img = Image.fromarray(arr)
    config = f"--psm {psm}"
    data = pytesseract.image_to_data(
        pil_img, lang=lang, config=config,
        output_type=pytesseract.Output.DICT,
    )
    tokens = [
        data["text"][i]
        for i in range(len(data["text"]))
        # conf can arrive as an int, a float, or a numeric string
        if float(data["conf"][i]) >= min_conf and data["text"][i].strip()
    ]
    return " ".join(tokens)

Page Segmentation Mode Reference

| PSM | Use when |
|---|---|
| 3 (default) | Multi-column page with mixed content |
| 4 | Single column, variable text sizes |
| 6 | Single uniform block of text |
| 7 | Single text line |
| 11 | Sparse text: forms with scattered labels |
| 13 | Raw line: no layout analysis |

Wrong PSM on multi-column pages causes Tesseract to concatenate columns horizontally, producing garbled output. When in doubt, try --psm 3 first, then --psm 4 if columns merge incorrectly.
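
Because the PSM value ends up in a plain config string, a typo is easy to miss until Tesseract rejects it. One way to guard against that is a small builder that validates the mode first. This is a sketch: the helper name is hypothetical, and it assumes the documented PSM range of 0 to 13.

```python
VALID_PSM = range(0, 14)  # Tesseract documents page segmentation modes 0-13


def build_tesseract_config(psm: int = 3, extra: str = "") -> str:
    """Return a config string for pytesseract, rejecting unknown PSM values."""
    if psm not in VALID_PSM:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--psm {psm} {extra}".strip()


if __name__ == "__main__":
    for mode in (3, 4):
        print(build_tesseract_config(mode))
```

The returned string can be passed as the config argument of image_to_string or image_to_data, which makes it simple to loop over a few candidate modes on one sample page and compare the output.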

Step 5 — Confidence Filtering

image_to_data returns a confidence of -1 for rows that describe layout structure (page, block, paragraph, line) rather than a recognised word. Filter those out before joining:

# pip install pytesseract Pillow
import pytesseract
from PIL import Image
import numpy as np

def extract_high_confidence_words(
    arr: np.ndarray,
    min_conf: int = 65,
) -> list[dict]:
    """
    Return a list of dicts with text, bounding box, and confidence
    for tokens that pass the confidence threshold.
    """
    pil_img = Image.fromarray(arr)
    data = pytesseract.image_to_data(pil_img, output_type=pytesseract.Output.DICT)
    results = []
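    for i, conf in enumerate(data["conf"]):
        conf_value = float(conf)  # tolerate int, float, or numeric-string values
        if conf_value < min_conf:  # also drops the -1 structural rows
            continue
        text = data["text"][i].strip()
        if not text:
            continue
        results.append({
            "text": text,
            "conf": int(conf_value),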
            "x": data["left"][i],
            "y": data["top"][i],
            "w": data["width"][i],
            "h": data["height"][i],
        })
    return results

The spatial data (x, y, w, h) is useful downstream — the coordinate-mapping approach in How to Extract Tables from Scanned PDFs uses these bounding boxes to reconstruct row and column structure without needing vector lines.
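
As a smaller illustration of using that layout data, the dictionary from image_to_data also carries page_num, block_num, par_num, and line_num for every row, which is enough to rebuild text lines from filtered words. The sketch below works on the dictionary alone; the function name and the sample values are made up for illustration.

```python
from collections import defaultdict


def group_words_into_lines(data: dict, min_conf: float = 65) -> list[str]:
    """Rebuild text lines from image_to_data output, dropping weak tokens."""
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        word = text.strip()
        if not word or float(data["conf"][i]) < min_conf:
            continue
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append((data["left"][i], word))
    # Order lines by their layout key and words by their left edge
    return [" ".join(w for _, w in sorted(words)) for _, words in sorted(lines.items())]


if __name__ == "__main__":
    sample = {  # hand-written stand-in for a real image_to_data result
        "text": ["", "Invoice", "No.", "Total", "4l2"],
        "conf": [-1, 96, 91, 88, 30],
        "page_num": [1, 1, 1, 1, 1],
        "block_num": [1, 1, 1, 1, 1],
        "par_num": [1, 1, 1, 1, 1],
        "line_num": [0, 1, 1, 2, 2],
        "left": [0, 40, 160, 40, 160],
    }
    print(group_words_into_lines(sample))
    # prints ['Invoice No.', 'Total']
```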

Step 6 — Build a Searchable PDF

Embed an invisible OCR text layer over the original scan so the file is searchable without changing how it looks.

# pip install pymupdf pytesseract Pillow
from pathlib import Path
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def make_searchable_pdf(input_pdf: Path, output_pdf: Path, lang: str = "eng") -> None:
    """
    For each page, run Tesseract to generate a hidden-text PDF overlay,
    then merge that overlay onto the original scan page.
    """
    try:
        doc = fitz.open(input_pdf)
        for page_num, page in enumerate(doc):
            pix = page.get_pixmap(dpi=300)
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

            # textonly_pdf=1 makes Tesseract emit only the invisible text layer,
            # without a second copy of the page image
            ocr_pdf_bytes = pytesseract.image_to_pdf_or_hocr(
                img, extension="pdf", lang=lang, config="-c textonly_pdf=1",
            )

            overlay = fitz.open("pdf", ocr_pdf_bytes)
            # show_pdf_page draws the overlay page onto the original page
            page.show_pdf_page(page.rect, overlay, 0)
            overlay.close()

        doc.save(str(output_pdf), garbage=4, deflate=True)
        doc.close()
        print(f"Searchable PDF written → {output_pdf}")
    except Exception as exc:
        raise RuntimeError(f"Failed to create searchable PDF: {exc}") from exc


make_searchable_pdf(
    Path("scans/invoice_001.pdf"),
    Path("output/invoice_001_searchable.pdf"),
)

The resulting file is visually identical to the original. PDF viewers, grep tools, and indexers can now find text in it. You can batch these files with the pattern in Batch Merge PDFs with a Python Script to produce a single searchable archive.

Edge Cases and Variants

Multi-Language Documents

Pass a +-delimited language string. Language packs must be installed separately:

# pip install pytesseract Pillow
import pytesseract
from PIL import Image

img = Image.open("multilingual_form.png")
# Requires tesseract-ocr-deu and tesseract-ocr-fra installed
text = pytesseract.image_to_string(img, lang="eng+deu+fra")

Check which packs are installed: tesseract --list-langs.
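
A missing pack only surfaces as an error at OCR time, so it can help to compare the requested language string against the installed list up front. A minimal sketch: the function name is hypothetical, and the installed list is assumed to come from the output of tesseract --list-langs.

```python
def missing_languages(lang: str, installed: list[str]) -> list[str]:
    """Return the codes in a '+'-delimited lang string that are not installed."""
    requested = [code for code in lang.split("+") if code]
    available = set(installed)
    return [code for code in requested if code not in available]


if __name__ == "__main__":
    # Hypothetical installed list, as reported by `tesseract --list-langs`
    installed = ["eng", "osd", "fra"]
    print(missing_languages("eng+deu+fra", installed))
    # prints ['deu']
```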

Handwriting

Standard Tesseract (eng) performs poorly on cursive or non-standard print. Alternatives:

  • EasyOCR (pip install easyocr) — handles handwriting better, runs on CPU but is slower.
  • Cloud APIs (AWS Textract, Google Vision) — higher accuracy, cost per page, network dependency.
  • Tesseract script/HanS for CJK scripts — install via apt-get install tesseract-ocr-script-hans.

Colour-Tinted or Watermarked Scans

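Coloured stamps, highlights, and watermarks can end up on the text side of an Otsu threshold when it runs on a plain grayscale conversion, because saturated colours convert to fairly dark grays. Thresholding the V (brightness) channel of HSV instead keeps more of that colour on the background side: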

# pip install opencv-python numpy
import cv2
import numpy as np

def threshold_on_value_channel(bgr_img: np.ndarray) -> np.ndarray:
    """
    Extract the V channel from HSV, then apply Otsu.
    Reduces colour interference from stamps, highlights, or watermarks.
    """
    hsv = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2HSV)
    v_channel = hsv[:, :, 2]
    _, thresh = cv2.threshold(v_channel, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return thresh
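
The effect is easy to see on single pixels. The sketch below compares a grayscale conversion (using the common 0.299 / 0.587 / 0.114 luma weights) with the HSV value channel, using only the standard library; the sample colours are invented for illustration.

```python
import colorsys


def luma(rgb: tuple[int, int, int]) -> float:
    """Grayscale intensity in 0..1 using the usual luma weights."""
    r, g, b = rgb
    return (0.299 * r + 0.587 * g + 0.114 * b) / 255


def value_channel(rgb: tuple[int, int, int]) -> float:
    """HSV value in 0..1, which equals the largest of the three channels."""
    r, g, b = (c / 255 for c in rgb)
    return colorsys.rgb_to_hsv(r, g, b)[2]


if __name__ == "__main__":
    samples = {
        "black text": (20, 20, 20),
        "red stamp": (200, 30, 30),
        "white paper": (245, 245, 245),
    }
    for name, rgb in samples.items():
        print(f"{name:12s} gray={luma(rgb):.2f}  V={value_channel(rgb):.2f}")
```

In grayscale the red stamp lands at about 0.32, much closer to the text than to the paper, while in the V channel it lands at about 0.78, close to the paper. A threshold on V therefore tends to keep the stamp out of the text mask.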

Validation

After running the pipeline, verify output quality before committing to a batch run:

# pip install pytesseract Pillow numpy
import numpy as np
import pytesseract
from PIL import Image

def check_ocr_quality(arr: np.ndarray, min_mean_conf: float = 70.0) -> dict:
    """
    Returns mean confidence and a pass/fail flag.
    Fail means the scan quality or preprocessing needs adjustment.
    """
    pil_img = Image.fromarray(arr)
    data = pytesseract.image_to_data(pil_img, output_type=pytesseract.Output.DICT)
    # Skip the -1 rows that mark layout structure rather than words
    confs = [float(c) for c in data["conf"] if float(c) >= 0]
    if not confs:
        return {"mean_conf": 0.0, "ok": False, "word_count": 0}
    mean_conf = sum(confs) / len(confs)
    return {
        "mean_conf": round(mean_conf, 1),
        "ok": mean_conf >= min_mean_conf,
        "word_count": len([t for t in data["text"] if t.strip()]),
    }

A mean confidence below 60 % usually means the DPI is too low, the binarisation is misconfigured, or the scan has severe physical damage. Inspect the preprocessed image with save_debug_image() before investigating further.

Performance and Scale

  • Memory: Each 300-DPI A4 page rasterises to roughly 25–35 MB in RAM as a PIL Image. For batch jobs over 100+ pages, process one page at a time and del/gc between pages.
  • Speed: Each pytesseract call launches the Tesseract binary on a single image, so pages are handled one after another unless you parallelise. Use concurrent.futures.ProcessPoolExecutor to spread pages or files across cores; separate worker processes also sidestep the GIL for the Python-side rasterising and preprocessing, which a ThreadPoolExecutor may not.
  • Chunking large PDFs: Open with fitz.open(), iterate page by page, and call pix = None after rasterising each page to release the pixmap.

# pip install pymupdf pytesseract Pillow
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor
import fitz
import pytesseract
from PIL import Image

def _ocr_page_index(args):
    pdf_path, page_num = args
    doc = fitz.open(pdf_path)
    page = doc[page_num]
    pix = page.get_pixmap(dpi=300)
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    doc.close()
    return page_num, pytesseract.image_to_string(img)

def batch_ocr(pdf_path: Path, max_workers: int = 4) -> dict[int, str]:
    """Returns {page_num: text} for all pages, parallelised."""
    doc = fitz.open(pdf_path)
    page_count = len(doc)
    doc.close()
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(_ocr_page_index, [(str(pdf_path), i) for i in range(page_count)])
    return dict(results)
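
For very long documents it can help to submit pages in bounded batches, so that only one batch of results is held before being written out. The batching itself is plain index arithmetic; the helper name and batch size below are assumptions for illustration.

```python
def page_batches(page_count: int, batch_size: int) -> list[range]:
    """Split page indices 0..page_count-1 into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        range(start, min(start + batch_size, page_count))
        for start in range(0, page_count, batch_size)
    ]


if __name__ == "__main__":
    for batch in page_batches(10, 4):
        print(list(batch))
```

Each batch can then be mapped over a process pool and its text flushed to disk before the next batch starts.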

Troubleshooting

| Error / symptom | Root cause | Fix |
|---|---|---|
| TesseractNotFoundError | Tesseract binary not installed or not on PATH | Install the binary; see Fix TesseractNotFoundError |
| Empty string from image_to_string | PDF has a text layer (not a scan), or image too small / too low DPI | Run classify_pdf() first; enforce dpi=300 |
| Garbled multi-column text | Wrong PSM: Tesseract reads columns left-to-right as a single line | Set --psm 3 or --psm 4 |
| Error, could not initialize tesseract API with language "xxx" | Language pack not installed | apt-get install tesseract-ocr-xxx or set TESSDATA_PREFIX |
| Output truncated mid-page | Tesseract timeout on very large images | Resize to 300 DPI before passing; split oversized scans into quadrants |
| Deskew rotates page 90° | minAreaRect angle ambiguity on near-vertical text blocks | Clamp angle: if abs(angle) > 45: angle = 0 and skip rotation |
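
For the quadrant split mentioned in the table, the crop boxes can be computed up front. This is a sketch under stated assumptions: the function name is hypothetical, the boxes use the (left, upper, right, lower) order that PIL's Image.crop expects, and the optional overlap is an assumption added so that a line of text on a seam appears whole in at least one tile.

```python
def quadrant_boxes(
    width: int, height: int, overlap: int = 0
) -> list[tuple[int, int, int, int]]:
    """Return four crop boxes covering the image, optionally overlapping."""
    mid_x, mid_y = width // 2, height // 2
    left_edge = max(0, mid_x - overlap)
    right_edge = min(width, mid_x + overlap)
    top_edge = max(0, mid_y - overlap)
    bottom_edge = min(height, mid_y + overlap)
    return [
        (0, 0, right_edge, bottom_edge),          # top-left
        (left_edge, 0, width, bottom_edge),       # top-right
        (0, top_edge, right_edge, height),        # bottom-left
        (left_edge, top_edge, width, height),     # bottom-right
    ]


if __name__ == "__main__":
    for box in quadrant_boxes(100, 80):
        print(box)
```

With a PIL image, the tiles would be produced by something like [img.crop(box) for box in quadrant_boxes(*img.size, overlap=50)]. Words inside the overlap can be recognised twice, so deduplicate when joining the results.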

Complete Pipeline Script

#!/usr/bin/env python3
# pip install pymupdf pytesseract opencv-python Pillow numpy
"""
ocr_pipeline.py — Rasterize a scanned PDF, preprocess pages, run OCR,
and write a searchable PDF to the output path.

Usage:
    python ocr_pipeline.py input.pdf output_searchable.pdf --lang eng --psm 3
"""
import argparse
import gc
from pathlib import Path

import cv2
import fitz
import numpy as np
import pytesseract
from PIL import Image


def preprocess(pil_img: Image.Image) -> np.ndarray:
    img = cv2.cvtColor(np.array(pil_img), cv2.COLOR_RGB2GRAY)
    _, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    denoised = cv2.fastNlMeansDenoising(thresh, h=10)
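    # Text is dark after THRESH_BINARY, so deskew on the dark pixels
    coords = np.column_stack(np.where(denoised < 128)).astype(np.float32)
    if len(coords) >= 4:
        rect = cv2.minAreaRect(coords)
        angle = rect[-1]
        # Fold both OpenCV angle conventions into a correction in [-45, 45]
        if angle < -45:
            angle = -(90 + angle)
        elif angle > 45:
            angle = 90 - angle
        else:
            angle = -angle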
        h, w = denoised.shape[:2]
        M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
        denoised = cv2.warpAffine(denoised, M, (w, h),
                                   flags=cv2.INTER_CUBIC,
                                   borderMode=cv2.BORDER_REPLICATE)
    return denoised


def run(input_pdf: Path, output_pdf: Path, lang: str, psm: int) -> None:
    doc = fitz.open(input_pdf)
    for i, page in enumerate(doc):
        pix = page.get_pixmap(dpi=300)
        raw_img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        pix = None  # release pixmap memory

        preprocessed = preprocess(raw_img)
        pil_preprocessed = Image.fromarray(preprocessed)

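        # textonly_pdf=1 emits only the invisible text, so the original scan
        # (not the binarised working image) stays visible in the output
        ocr_pdf_bytes = pytesseract.image_to_pdf_or_hocr(
            pil_preprocessed, extension="pdf",
            lang=lang, config=f"--psm {psm} -c textonly_pdf=1",
        )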
        overlay = fitz.open("pdf", ocr_pdf_bytes)
        page.show_pdf_page(page.rect, overlay, 0)
        overlay.close()
        gc.collect()
        print(f"  processed page {i + 1}/{len(doc)}")

    doc.save(str(output_pdf), garbage=4, deflate=True)
    doc.close()
    print(f"Done → {output_pdf}")


def main() -> None:
    ap = argparse.ArgumentParser(description="OCR pipeline for scanned PDFs")
    ap.add_argument("input", type=Path, help="Input scanned PDF")
    ap.add_argument("output", type=Path, help="Output searchable PDF")
    ap.add_argument("--lang", default="eng", help="Tesseract language (default: eng)")
    ap.add_argument("--psm", type=int, default=3, help="Page segmentation mode (default: 3)")
    args = ap.parse_args()

    if not args.input.exists():
        raise FileNotFoundError(f"Input not found: {args.input}")
    args.output.parent.mkdir(parents=True, exist_ok=True)

    try:
        run(args.input, args.output, args.lang, args.psm)
    except pytesseract.pytesseract.TesseractNotFoundError as exc:
        raise SystemExit(
            "Tesseract not found. See: "
            "/automating-pdf-extraction-generation/scanning-and-ocr-processing-with-python/fix-tesseract-not-found-error/"
        ) from exc


if __name__ == "__main__":
    main()

Part of Automating PDF Extraction & Generation.
