pdfplumber vs camelot vs tabula

Three libraries dominate Python PDF table extraction: pdfplumber, camelot, and tabula-py. Each wraps a different extraction engine with different runtime requirements, and each wins in a distinct scenario. Choosing the wrong one costs hours of debugging — choose based on your table type, your infrastructure constraints, and whether you can install system-level dependencies.

This guide covers installation, runtime requirements, accuracy across table types, output format, speed, and a complete decision script.

1. Problem Framing

Generic PDF-to-text converters strip structural information. A PDF table is not a semantic grid — it is a set of independently positioned rectangles or text runs with no native row/column semantics. Each library reconstructs that structure differently:

pdfplumber uses geometric line detection and spatial clustering on the raw PDF content stream. No system deps. Pure Python install.
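camelot implements two algorithms: lattice (explicit border lines) and stream (whitespace inference). Lattice mode renders each page to an image and detects ruling lines with OpenCV, so it needs OpenCV plus an image-conversion backend (Ghostscript in this guide) at runtime.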
tabula-py wraps the Tabula Java library. It shells out to a bundled JAR, so a JRE/JDK must be on PATH. Fast for standard bordered tables on documents without unusual encodings.

If your deployment environment is a locked-down container or a serverless function, camelot's Ghostscript dependency and tabula-py's JVM requirement are immediate blockers. pdfplumber installs with a single pip install.
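
One way to check these constraints before committing to a library is a small preflight probe. The sketch below uses only the standard library. The function name `detect_runtime` is illustrative, and the Ghostscript executable names (`gs`, `gswin64c`) are assumptions that may differ on your platform.

```python
import importlib.util
import shutil


def detect_runtime() -> dict[str, bool]:
    """Report which extraction backends look usable in this environment."""
    def has_module(name: str) -> bool:
        # find_spec returns None when the top-level package is not importable
        return importlib.util.find_spec(name) is not None

    has_java = shutil.which("java") is not None
    # Assumed executable names; adjust for your platform
    has_gs = any(shutil.which(exe) for exe in ("gs", "gswin64c"))
    return {
        "pdfplumber": has_module("pdfplumber"),
        "camelot": has_module("camelot"),
        # tabula-py is only usable when a Java runtime is also on PATH
        "tabula": has_module("tabula") and has_java,
        "java": has_java,
        "ghostscript": has_gs,
    }


if __name__ == "__main__":
    for name, ok in detect_runtime().items():
        print(f"{name:12} {'ok' if ok else 'missing'}")
```

Running this inside the target container or function image, rather than on a development machine, shows which of the three libraries are realistic options there.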

2. Prerequisites

Install all three libraries in a virtual environment to run the comparisons in this guide:

# System deps — install before pip installs
# Ghostscript (camelot lattice mode)
# Ubuntu/Debian:
sudo apt-get install ghostscript libgs-dev
# macOS:
brew install ghostscript

# Java JRE/JDK (tabula-py)
# Ubuntu/Debian:
sudo apt-get install default-jre
# macOS:
brew install openjdk
# Windows: download from https://adoptium.net/ and add to PATH

# Python packages
# Quote the extras specifier so shells such as zsh do not expand the brackets
pip install pdfplumber "camelot-py[cv]" tabula-py pandas opencv-python-headless

# Verify Java is available to Python subprocess
java -version

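Confirm that camelot imports and that the Ghostscript library is discoverable:

# pip install camelot-py[cv]
from ctypes.util import find_library
import camelot

print(camelot.__version__)
# On Linux/macOS this prints the library name or path, or None if Ghostscript is missing
print(find_library("gs"))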

If you see OSError: Ghostscript is not installed, see Fix Camelot Import Error on Linux.

If tabula-py raises JavaNotFoundError, see Fix tabula-py "java not found" Error.

3. Diagnostic Step: Classify Your Table Before Choosing a Library

Run this inspection snippet on any new PDF before committing to a library. It checks for explicit vector lines (bordered table indicator) and text density (borderless indicator):

# pip install pdfplumber
from pathlib import Path
import pdfplumber

PDF_PATH = Path("data/report.pdf")

def classify_table_type(pdf_path: Path) -> list[dict]:
    """Inspect page geometry to recommend an extraction strategy per sampled page."""
    results = []
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for i, page in enumerate(pdf.pages[:3]):  # Sample first 3 pages
                lines = page.lines
                rects = page.rects
                words = page.extract_words()
                results.append({
                    "page": i + 1,
                    "vector_lines": len(lines),
                    "rects": len(rects),
                    "words": len(words),
                    "recommendation": (
                        "lattice"  if len(lines) > 10 or len(rects) > 5
                        else "stream" if len(words) > 20
                        else "ocr"
                    ),
                })
    except Exception as e:
        # Keep the return type consistent: a one-item list describing the failure
        return [{"error": str(e)}]
    return results

if __name__ == "__main__":
    import json
    print(json.dumps(classify_table_type(PDF_PATH), indent=2))

vector_lines > 10 or rects > 5: bordered table — use camelot lattice or pdfplumber.
Few lines, many words: borderless/stream table — use camelot stream or pdfplumber with vertical_strategy="text".
Near-zero words: scanned image — OCR required first. See Scanning and OCR Processing with Python.
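
The snippet above produces one recommendation per sampled page, and pages can disagree. A minimal sketch of one way to reduce them to a single document-level choice is a majority vote. The function name `document_strategy` and the sample data are illustrative, not part of any library.

```python
from collections import Counter


def document_strategy(page_results: list[dict]) -> str:
    """Pick the most common per-page recommendation; ties go to the earliest page."""
    votes = Counter(
        r["recommendation"] for r in page_results if "recommendation" in r
    )
    if not votes:
        return "unknown"
    # most_common orders equal counts by first insertion, i.e. the earliest page
    return votes.most_common(1)[0][0]


# Hypothetical per-page results, shaped like the classifier output above
sample = [
    {"page": 1, "recommendation": "lattice"},
    {"page": 2, "recommendation": "stream"},
    {"page": 3, "recommendation": "lattice"},
]
print(document_strategy(sample))  # lattice
```

For documents that genuinely mix bordered and borderless tables, a per-page choice is likely the better design; the vote only suits documents with one dominant layout.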

4. Library-by-Library Walkthrough

Step 1 — Extract with pdfplumber

# pip install pdfplumber pandas
from pathlib import Path
import pdfplumber
import pandas as pd

PDF_PATH = Path("data/report.pdf")

def extract_pdfplumber(pdf_path: Path, page_num: int = 0) -> list[pd.DataFrame]:
    """Extract all tables from a single page using pdfplumber."""
    dfs = []
    try:
        with pdfplumber.open(pdf_path) as pdf:
            page = pdf.pages[page_num]
            # extract_tables() uses both horizontal and vertical line detection
            raw = page.extract_tables({
                "vertical_strategy": "lines",
                "horizontal_strategy": "lines",
                "snap_tolerance": 3,
                "join_tolerance": 3,
            })
            for tbl in raw:
                if tbl and len(tbl) > 1:
                    df = pd.DataFrame(tbl[1:], columns=tbl[0])
                    dfs.append(df)
    except Exception as e:
        print(f"pdfplumber error: {e}")
    return dfs

if __name__ == "__main__":
    tables = extract_pdfplumber(PDF_PATH)
    for i, df in enumerate(tables):
        print(f"Table {i}: {df.shape}")
        print(df.head())

When pdfplumber wins: borderless tables (set vertical_strategy and horizontal_strategy to "text" instead of "lines"), complex coordinate geometry, embedded font edge cases, or any environment where system deps are unavailable.

Step 2 — Extract with camelot (lattice and stream)

# pip install camelot-py[cv] pandas
# System: ghostscript must be on PATH
from pathlib import Path
import camelot
import pandas as pd

PDF_PATH = Path("data/report.pdf")

def extract_camelot(pdf_path: Path, flavor: str = "lattice") -> list[pd.DataFrame]:
    """
    Extract tables with camelot.
    flavor="lattice"  — bordered tables with explicit grid lines
    flavor="stream"   — borderless tables inferred from whitespace
    """
    dfs = []
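    try:
        # copy_text is a lattice-only option; camelot rejects it when flavor="stream"
        kwargs = {"copy_text": ["v"]} if flavor == "lattice" else {}
        tables = camelot.read_pdf(
            str(pdf_path),
            pages="all",
            flavor=flavor,
            **kwargs,  # copy_text=["v"] copies text down into vertically spanning cells
        )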
        print(f"camelot ({flavor}): found {tables.n} table(s)")
        for tbl in tables:
            # tbl.parsing_report gives accuracy score (0-100)
            print(f"  accuracy={tbl.parsing_report['accuracy']:.1f}%  "
                  f"whitespace={tbl.parsing_report['whitespace']:.1f}%")
            dfs.append(tbl.df)
    except Exception as e:
        print(f"camelot error: {e}")
    return dfs

if __name__ == "__main__":
    lattice_dfs = extract_camelot(PDF_PATH, flavor="lattice")
    stream_dfs  = extract_camelot(PDF_PATH, flavor="stream")

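Camelot's parsing_report is unique among the three libraries. Its accuracy value reflects how cleanly text was assigned to the detected cells, and its whitespace value reports the percentage of empty cells. Together they give you a confidence signal without manually inspecting every table.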

When camelot wins: two-pass comparison (run both flavors, take the higher accuracy score), official financial or government PDFs with clean bordered grids, or situations where you need that accuracy signal for automated QA.

Step 3 — Extract with tabula-py

# pip install tabula-py pandas
# System: Java JRE/JDK must be on PATH
from pathlib import Path
import tabula
import pandas as pd

PDF_PATH = Path("data/report.pdf")

def extract_tabula(pdf_path: Path) -> list[pd.DataFrame]:
    """Extract all tables from all pages using tabula-py."""
    try:
        # read_pdf returns a list of DataFrames, one per detected table
        dfs = tabula.read_pdf(
            str(pdf_path),
            pages="all",
            multiple_tables=True,
            pandas_options={"header": 0},
            lattice=True,   # bordered tables; for borderless tables pass stream=True instead
            silent=True,    # suppress Java stderr
        )
        print(f"tabula: found {len(dfs)} table(s)")
        return dfs
    except Exception as e:
        print(f"tabula error: {e}")
        return []

if __name__ == "__main__":
    tables = extract_tabula(PDF_PATH)
    for i, df in enumerate(tables):
        print(f"Table {i}: {df.shape}")
        print(df.head())

When tabula wins: standard bordered tables from mainstream PDF generators (Word exports, Excel-to-PDF, LibreOffice), bulk processing where JVM startup time amortizes over many files, or teams already running Java tooling.

5. Comparison Matrix

Feature	pdfplumber	camelot	tabula-py
Bordered tables (lattice)	Good	Excellent	Excellent
Borderless tables (stream)	Good	Good	Fair
Scanned / image tables	None (needs OCR first)	None	None
Output type	`list[list]` → DataFrame	`TableList` → DataFrame via `.df`	`list[DataFrame]` direct
Speed (single page)	Fast	Moderate	Moderate (JVM warmup)
Speed (batch 100 pages)	Fast	Moderate	Fast (JVM amortized)
Accuracy signal	None built-in	`parsing_report` score	None built-in
Ghostscript required	No	Yes	No
Java JRE required	No	No	Yes
License	MIT	MIT	MIT
pip install	`pdfplumber`	`camelot-py[cv]`	`tabula-py`
Stream/whitespace mode	`vertical_strategy="text"`	`flavor="stream"`	`stream=True`

6. Edge Cases and Variants

Variant A: Rotated Tables

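pdfplumber exposes a page's rotation as page.rotation, but it has no method for rotating a page, and its table extractor assumes upright coordinate axes. For tables that sit sideways on the page, rotate the pages with pypdf first, then extract from the rotated copy:

# pip install pdfplumber pypdf pandas
import io
from pathlib import Path
import pdfplumber
import pandas as pd
from pypdf import PdfReader, PdfWriter

PDF_PATH = Path("data/rotated_report.pdf")

def rotated_copy(pdf_path: Path, angle: int) -> io.BytesIO:
    """Return an in-memory copy of the PDF with every page rotated clockwise by angle."""
    writer = PdfWriter()
    for page in PdfReader(str(pdf_path)).pages:
        # rotate() accepts multiples of 90 and updates the page's /Rotate entry
        writer.add_page(page).rotate(angle)
    buf = io.BytesIO()
    writer.write(buf)
    buf.seek(0)
    return buf

def extract_rotated(pdf_path: Path, angle: int = 90) -> list[pd.DataFrame]:
    """Extract tables after rotating pages; try angle=90 or 270 depending on how the table is turned."""
    dfs = []
    try:
        with pdfplumber.open(rotated_copy(pdf_path, angle)) as pdf:
            for page in pdf.pages:
                for tbl in page.extract_tables():
                    if tbl and len(tbl) > 1:
                        dfs.append(pd.DataFrame(tbl[1:], columns=tbl[0]))
    except Exception as e:
        print(f"Error: {e}")
    return dfs

if __name__ == "__main__":
    tables = extract_rotated(PDF_PATH)
    print(f"Extracted {len(tables)} table(s) from rotated PDF")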

Variant B: Multi-Page Tables Spanning Page Breaks

Tables split across pages arrive as separate DataFrames. Detect repeating header rows and concatenate:

# pip install pdfplumber pandas
from pathlib import Path
import pdfplumber
import pandas as pd

PDF_PATH = Path("data/multi_page_table.pdf")

def extract_multipage_table(pdf_path: Path) -> pd.DataFrame:
    """Concatenate tables across pages, stripping repeated headers."""
    all_dfs: list[pd.DataFrame] = []
    header: list | None = None
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                for tbl in page.extract_tables():
                    if not tbl:
                        continue
                    if header is None:
                        header = tbl[0]
                        rows = tbl[1:]
                    else:
                        # Skip row if it repeats the header (page-break artefact)
                        rows = tbl[1:] if tbl[0] == header else tbl
                    if rows:
                        all_dfs.append(pd.DataFrame(rows, columns=header))
    except Exception as e:
        print(f"Error reading {pdf_path}: {e}")
        return pd.DataFrame()
    return pd.concat(all_dfs, ignore_index=True) if all_dfs else pd.DataFrame()

if __name__ == "__main__":
    df = extract_multipage_table(PDF_PATH)
    print(df.shape)
    Path("output").mkdir(parents=True, exist_ok=True)  # to_csv does not create directories
    df.to_csv("output/merged_table.csv", index=False)

The Extracting Tables from PDFs guide covers deduplication patterns in more detail.

Variant C: Password-Protected PDFs

All three libraries accept a password parameter. Load it from the environment rather than hardcoding:

# pip install pdfplumber
import os
from pathlib import Path
import pdfplumber

PDF_PATH = Path("data/protected.pdf")
PDF_PASSWORD = os.environ.get("PDF_PASSWORD", "")

with pdfplumber.open(PDF_PATH, password=PDF_PASSWORD) as pdf:
    page = pdf.pages[0]
    # Guard against an empty result before slicing
    print((page.extract_text() or "")[:200])

For tabula-py, pass password=PDF_PASSWORD to tabula.read_pdf(). For camelot, pass password=PDF_PASSWORD to camelot.read_pdf().

7. Validation

After extraction, verify correctness before downstream processing. Do not trust shape alone — check a known cell value:

# pip install pandas
import pandas as pd

def validate_extraction(df: pd.DataFrame, expected_cols: int, sample_cell: str | None = None) -> bool:
    """Assert basic structural correctness of an extracted DataFrame."""
    assert not df.empty, "Extracted DataFrame is empty"
    assert df.shape[1] == expected_cols, (
        f"Column count mismatch: got {df.shape[1]}, expected {expected_cols}"
    )
    if sample_cell:
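        # regex=False matches the sample literally, so "1.5" or "(45)" are not read as patterns
        found = df.apply(lambda col: col.astype(str).str.contains(sample_cell, na=False, regex=False)).any().any()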
        assert found, f"Expected sample value '{sample_cell}' not found in DataFrame"
    print(f"Validation passed: {df.shape[0]} rows x {df.shape[1]} cols")
    return True

For data that feeds reporting pipelines — see Python for Excel & CSV Data Processing — also check numeric column dtypes after coercion.

8. Performance and Scale Notes

JVM startup cost for tabula-py: By default, tabula-py launches a new Java process for every read_pdf() call, so the JVM startup cost is paid per call rather than once per session. Amortize it by extracting all pages of a document in one call (pages="all") instead of calling read_pdf() page by page. For batch processing 500+ PDFs, consider tabula.convert_into_by_batch(), which passes a directory to the Java JAR directly.

camelot memory use: In lattice mode, camelot loads each page as an OpenCV image matrix. High-DPI PDFs or documents with many pages can exhaust RAM. Process pages in chunks and delete intermediate TableList objects explicitly.
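
Chunking for camelot can be sketched as a helper that turns a page count into range strings for the `pages` argument of `camelot.read_pdf`. The name `page_chunks` and the default chunk size of 10 are assumptions for illustration.

```python
def page_chunks(total_pages: int, chunk_size: int = 10) -> list[str]:
    """Split 1..total_pages into range strings such as '1-10', '11-20'."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunks = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A single trailing page is written as '21' rather than '21-21'
        chunks.append(str(start) if start == end else f"{start}-{end}")
    return chunks


print(page_chunks(25, 10))  # ['1-10', '11-20', '21-25']
```

Each string can then be passed as `camelot.read_pdf(path, pages=chunk, flavor="lattice")`, with the tables written to disk and the returned TableList deleted before the next chunk is read.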

pdfplumber chunking: For very large files, iterate pages lazily and write each table to CSV immediately rather than holding all DataFrames in memory:

# pip install pdfplumber pandas
from pathlib import Path
import pdfplumber
import pandas as pd

PDF_PATH = Path("data/large_report.pdf")
OUT_DIR  = Path("output/tables")
OUT_DIR.mkdir(parents=True, exist_ok=True)

with pdfplumber.open(PDF_PATH) as pdf:
    for page_num, page in enumerate(pdf.pages):
        for tbl_num, tbl in enumerate(page.extract_tables()):
            if tbl and len(tbl) > 1:
                df = pd.DataFrame(tbl[1:], columns=tbl[0])
                path = OUT_DIR / f"page{page_num+1}_tbl{tbl_num+1}.csv"
                df.to_csv(path, index=False)

9. Troubleshooting

Error	Root cause	Fix
`OSError: Ghostscript is not installed`	camelot needs `gs` on PATH	`apt install ghostscript` or `brew install ghostscript`; see Fix Camelot Import Error on Linux
`JavaNotFoundError` / `java not found`	tabula-py cannot find JRE	Install JDK, add to PATH; see Fix tabula-py "java not found" Error
`camelot returns 0 tables`	Wrong flavor for table type	Try `flavor="stream"` if no visible borders; inspect `parsing_report`
`tabula returns garbled Unicode`	PDF uses CIDFont or custom encoding	Switch to pdfplumber; tabula's Java layer does not handle all font encodings
`pdfplumber returns None cells`	Table has merged/spanning cells	Use `.extract_table(table_settings={"snap_tolerance": 5})` and forward-fill
`tabula.read_pdf` hangs	JVM OOM on large PDF	Raise the JVM heap, e.g. `java_options=["-Xmx2g"]` (size it to your machine); split the PDF first if it still fails

10. Decision Guide

Pick your library with three questions:

Can you install system dependencies?
- No → use pdfplumber (pure Python, no system deps).
- Yes → continue.
Does your table have visible borders?
- Yes, clean grid lines → try camelot lattice first (best accuracy signal); tabula is a solid alternative.
- No border lines → camelot stream or pdfplumber with vertical_strategy="text".
Is this a scanned image PDF?
- Yes → OCR it first with Tesseract (see Scanning and OCR Processing with Python), then re-run extraction on the text-layer output.
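
The three questions can also be written down as a small function, which is convenient when the choice has to be made per document in a pipeline. This is a sketch, not a library API: `choose_library` and its return strings are illustrative. It checks the scanned case first, since none of the three libraries can read an image-only page.

```python
def choose_library(can_install_system_deps: bool, has_borders: bool, is_scanned: bool) -> str:
    """Encode the three decision questions as a single lookup."""
    if is_scanned:
        return "OCR first, then re-run extraction"
    if not can_install_system_deps:
        return "pdfplumber"
    if has_borders:
        return "camelot lattice (tabula-py as an alternative)"
    return "camelot stream or pdfplumber with vertical_strategy='text'"


print(choose_library(can_install_system_deps=False, has_borders=True, is_scanned=False))
# pdfplumber
```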

11. Complete Script: camelot with pdfplumber Fallback

This script tries camelot lattice, falls back to camelot stream, then falls back to pdfplumber. It accepts a file path via argparse and writes one CSV per extracted table.

#!/usr/bin/env python3
# pip install camelot-py[cv] pdfplumber pandas opencv-python-headless
# System deps: ghostscript (for camelot)
"""
extract_tables.py — extract all tables from a PDF, camelot → pdfplumber fallback.
Usage: python extract_tables.py report.pdf --out output/
"""
import argparse
import sys
from pathlib import Path

import pandas as pd

def _try_camelot(pdf_path: Path, out_dir: Path) -> bool:
    """Attempt extraction with camelot lattice then stream. Return True on success."""
    try:
        import camelot
    except ImportError:
        print("camelot not installed, skipping")
        return False

    for flavor in ("lattice", "stream"):
        try:
            tables = camelot.read_pdf(str(pdf_path), pages="all", flavor=flavor)
            if tables.n == 0:
                continue
            for i, tbl in enumerate(tables):
                df = tbl.df.copy()
                # Promote the first row to the header (assumes every table has a header row)
                if df.shape[0] > 1:
                    df.columns = df.iloc[0].tolist()
                    df = df.iloc[1:].reset_index(drop=True)
                out_path = out_dir / f"camelot_{flavor}_tbl{i+1}.csv"
                df.to_csv(out_path, index=False)
                print(f"Saved {out_path}  (accuracy {tbl.parsing_report['accuracy']:.1f}%)")
            return True
        except Exception as exc:
            print(f"camelot {flavor} failed: {exc}")
    return False


def _try_pdfplumber(pdf_path: Path, out_dir: Path) -> bool:
    """Fallback extraction with pdfplumber. Return True on success."""
    try:
        import pdfplumber
    except ImportError:
        print("pdfplumber not installed, cannot fall back")
        return False

    count = 0
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page_num, page in enumerate(pdf.pages):
                for tbl_num, raw in enumerate(page.extract_tables()):
                    if not raw or len(raw) < 2:
                        continue
                    df = pd.DataFrame(raw[1:], columns=raw[0])
                    out_path = out_dir / f"pdfplumber_p{page_num+1}_tbl{tbl_num+1}.csv"
                    df.to_csv(out_path, index=False)
                    print(f"Saved {out_path}")
                    count += 1
    except Exception as exc:
        print(f"pdfplumber failed: {exc}")
        return False
    return count > 0


def main() -> None:
    parser = argparse.ArgumentParser(description="Extract tables from a PDF file.")
    parser.add_argument("pdf", type=Path, help="Path to input PDF")
    parser.add_argument("--out", type=Path, default=Path("output"), help="Output directory")
    args = parser.parse_args()

    if not args.pdf.exists():
        sys.exit(f"File not found: {args.pdf}")

    args.out.mkdir(parents=True, exist_ok=True)

    if not _try_camelot(args.pdf, args.out):
        print("camelot produced no tables, falling back to pdfplumber")
        if not _try_pdfplumber(args.pdf, args.out):
            sys.exit("All extraction methods failed. Check that the PDF contains selectable text.")
        else:
            print("Extraction complete via pdfplumber fallback.")
    else:
        print("Extraction complete via camelot.")


if __name__ == "__main__":
    main()

Run it:

python extract_tables.py report.pdf --out output/tables/

Extracting Tables from PDFs — coordinate mapping, multi-page iteration, and structured CSV export
Fix Camelot Import Error on Linux — Ghostscript and OpenCV install issues on Debian/Ubuntu
Fix tabula-py "java not found" Error — JRE install, PATH config, and Docker variants
Scanning and OCR Processing with Python — pre-process scanned PDFs before running any of these libraries
Python for Excel & CSV Data Processing — downstream pandas workflows for the DataFrames you extract

Part of Automating PDF Extraction & Generation.
