Fix tabula-py "java not found" Error

tabula-py raises JavaNotFoundError, with the message "`java` command is not found from this Python process", when it cannot locate a Java runtime on PATH. This is the single most common tabula-py failure and has nothing to do with your PDF or your Python code. The fix is to install a JRE/JDK and make sure the java binary is on the PATH visible to the Python subprocess.

Root Cause

tabula-py does not extract PDF tables in Python. It calls a bundled Java JAR (tabula-1.x.x-jar-with-dependencies.jar) via subprocess.run(["java", "-jar", ...]). If java is not on the system PATH — or if it is installed but not in the PATH that Python's subprocess environment inherits — the call fails immediately with one of these messages:

java.lang.Exception: Error in Java call
tabula.errors.JavaNotFoundError: `java` command is not found from this Python process.
Please ensure Java is installed and PATH is set for `java`
FileNotFoundError: [Errno 2] No such file or directory: 'java'

The error is raised before tabula-py reads a single byte of your PDF. Confirming java is on PATH in your shell is not sufficient — the PATH seen by a Python subprocess can differ, especially in virtual environments, Docker containers, or GUI-launched processes on macOS.
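
To see what the Python process itself can resolve, one option is shutil.which, which searches a PATH string the same way a child process lookup would. This is a minimal sketch; find_java is a hypothetical helper name and not part of tabula-py.

```python
import os
import shutil
from typing import Optional


def find_java(path: Optional[str] = None) -> Optional[str]:
    """Return the full path of the java executable, or None if it is not found.

    path: a PATH-style string to search; defaults to this process's PATH.
    """
    search_path = os.environ.get("PATH", "") if path is None else path
    return shutil.which("java", path=search_path)


if __name__ == "__main__":
    located = find_java()
    if located is None:
        print("java is not on the PATH seen by this Python process")
        for entry in os.environ.get("PATH", "").split(os.pathsep):
            print("  PATH entry:", entry)
    else:
        print("java resolves to:", located)
```

Run it with the same interpreter and launcher (IDE, service, container) that runs your tabula-py code, since that is the environment whose PATH matters.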

Minimal Diagnostic

Reproduce the failure with the shortest possible snippet to confirm whether java is reachable from Python:

import subprocess
import sys

try:
    result = subprocess.run(
        ["java", "-version"],
        capture_output=True,
        text=True,
    )
except FileNotFoundError:
    # subprocess raises this when no 'java' executable exists on PATH
    print("java NOT found by Python subprocess")
    sys.exit(1)

if result.returncode != 0:
    print("java was found but exited with an error")
    print("stderr:", result.stderr)
    sys.exit(1)

print("java found:")
print(result.stderr)  # java -version writes to stderr by convention

If this prints java NOT found, the issue is PATH or a missing JRE. If it prints the version string, tabula-py should work — re-run tabula.read_pdf() and capture the exact error for the variant fixes below.

Fix: Install Java and Add It to PATH

Ubuntu / Debian

# Install the default JRE (the Java version depends on the release,
# e.g. OpenJDK 11 on Ubuntu 22.04, OpenJDK 21 on Ubuntu 24.04)
sudo apt-get update && sudo apt-get install -y default-jre

# Confirm the binary is on PATH
java -version
# Expected: openjdk version "<major>.x.x" ...

# Find the install location if you need JAVA_HOME
readlink -f "$(which java)"
# e.g. /usr/lib/jvm/java-17-openjdk-amd64/bin/java
# → JAVA_HOME = /usr/lib/jvm/java-17-openjdk-amd64

After installation, re-run the diagnostic snippet above from within your virtual environment to confirm the Python subprocess sees java.
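
The readlink step can also be done from Python, which is handy when you want to derive JAVA_HOME inside the same process that will call tabula-py. A minimal sketch, assuming the usual <JAVA_HOME>/bin/java layout; java_home_from_binary is a hypothetical helper name.

```python
import shutil
from pathlib import Path
from typing import Optional


def java_home_from_binary(java_path: Optional[str] = None) -> Optional[Path]:
    """Resolve symlinks on the java binary and return the directory two levels up.

    Mirrors `readlink -f "$(which java)"`: <JAVA_HOME>/bin/java -> <JAVA_HOME>.
    Returns None when no java binary can be located.
    """
    java_path = java_path or shutil.which("java")
    if java_path is None:
        return None
    real_binary = Path(java_path).resolve()
    return real_binary.parent.parent


if __name__ == "__main__":
    print("JAVA_HOME candidate:", java_home_from_binary())
```

Treat the result as a candidate only: distributions that do not follow the bin/java layout will need a manual path.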

macOS

# Install OpenJDK via Homebrew
brew install openjdk

# Homebrew does not symlink openjdk to /usr/local/bin by default.
# Add it to PATH manually (add this line to ~/.zshrc or ~/.bash_profile):
export PATH="/opt/homebrew/opt/openjdk/bin:$PATH"

# Reload the profile and verify
source ~/.zshrc
java -version

On Apple Silicon, the Homebrew prefix is /opt/homebrew. On Intel Macs it is /usr/local. Check with brew --prefix openjdk.
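
If you are unsure which prefix applies, a short check of both default locations can tell you which directory to add to PATH. This is an illustrative sketch: the two paths follow the default Homebrew prefixes mentioned above, and first_java_dir is a hypothetical helper name.

```python
from pathlib import Path
from typing import Iterable, Optional

# Default Homebrew locations for the openjdk formula (assumed, not queried from brew).
HOMEBREW_OPENJDK_BINS = (
    Path("/opt/homebrew/opt/openjdk/bin"),  # Apple Silicon
    Path("/usr/local/opt/openjdk/bin"),     # Intel
)


def first_java_dir(
    candidates: Iterable[Path] = HOMEBREW_OPENJDK_BINS,
) -> Optional[Path]:
    """Return the first candidate directory that contains a `java` file."""
    for directory in candidates:
        if (directory / "java").is_file():
            return directory
    return None


if __name__ == "__main__":
    found = first_java_dir()
    print("Add to PATH:", found if found else "no Homebrew openjdk found")
```

When Homebrew is installed in a custom location, brew --prefix openjdk remains the authoritative answer.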

Windows

  1. Download a JDK installer from Adoptium (choose the .msi for Temurin 17 LTS).
  2. Run the installer — it offers to set JAVA_HOME and add to PATH automatically. Accept both.
  3. Open a new Command Prompt or PowerShell (existing terminals do not inherit the new PATH):
java -version
# Expected: openjdk version "17.x.x" ...
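  4. If the installer did not update PATH, add it manually:
# PowerShell (permanent, current user)
$javaHome = "C:\Program Files\Eclipse Adoptium\jdk-17.x.x-hotspot"
[System.Environment]::SetEnvironmentVariable("JAVA_HOME", $javaHome, "User")

# Append to the user-level PATH only; $env:PATH also contains machine-level entries
$userPath = [System.Environment]::GetEnvironmentVariable("PATH", "User")
[System.Environment]::SetEnvironmentVariable("PATH", "$userPath;$javaHome\bin", "User")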

Then restart your terminal and relaunch any IDE or script runner.

Fix Implementation: Corrected tabula-py Call

Once java -version works from the Python diagnostic script, this minimal call should succeed:

# pip install tabula-py pandas
from pathlib import Path
import tabula
import pandas as pd
from tabula.errors import JavaNotFoundError

PDF_PATH = Path("data/report.pdf")

try:
    # read_pdf shells out to the bundled JAR; java must be on PATH
    dfs: list[pd.DataFrame] = tabula.read_pdf(
        str(PDF_PATH),       # a plain str path works across tabula-py versions
        pages="all",
        multiple_tables=True,
        silent=True,         # suppress Java stderr noise in output
    )
    print(f"Extracted {len(dfs)} table(s)")
    for i, df in enumerate(dfs):
        print(f"Table {i}: {df.shape}")
        print(df.head(3))
except JavaNotFoundError as e:
    # tabula-py raises this when the 'java' binary cannot be launched
    raise RuntimeError(
        "java not found. Install a JRE and ensure 'java' is on PATH. "
        "Run: java -version to verify."
    ) from e

Key changed lines:

  • str(PDF_PATH): tabula-py passes this to Java; a plain string is the safest input.
  • silent=True: suppresses Java's own stderr output so your logs stay clean.
  • JavaNotFoundError catch: tabula-py raises this when the java binary cannot be found; re-raise with a clear message. In current tabula-py releases, a FileNotFoundError from read_pdf usually means the PDF path is wrong, not that Java is missing.
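
An alternative to catching the error afterwards is a preflight check that fails fast before tabula-py is even imported. The sketch below is one way to do that; JavaMissing, require_java, and read_tables are hypothetical names introduced for illustration, and only tabula.read_pdf is the library's own call.

```python
import shutil


class JavaMissing(RuntimeError):
    """Hypothetical error type for this sketch; not part of tabula-py."""


def require_java() -> str:
    """Return the resolved java path, or raise with an actionable message."""
    java = shutil.which("java")
    if java is None:
        raise JavaMissing(
            "java not found on PATH. Install a JRE and run `java -version` to verify."
        )
    return java


def read_tables(pdf_path: str, **kwargs):
    """Check for java first, then delegate to tabula.read_pdf."""
    require_java()
    import tabula  # imported lazily so the preflight check works without tabula-py

    return tabula.read_pdf(pdf_path, **kwargs)
```

Usage would look like read_tables("data/report.pdf", pages="all", silent=True); keyword arguments are passed straight through to tabula.read_pdf.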

Variant Fixes

Variant A: Java Is Installed but Not in the Subprocess PATH

This happens with virtual environments activated in terminals where JAVA_HOME was set for a different shell session, or with systemd services that use a stripped environment.

Set JAVA_HOME and prepend it to PATH before your Python process starts, or set it inside the process:

# pip install tabula-py pandas
import os
from pathlib import Path
import tabula

# Explicitly extend PATH before calling tabula
# Replace with your actual JDK path
JAVA_HOME = Path("/usr/lib/jvm/java-17-openjdk-amd64")
os.environ["JAVA_HOME"] = str(JAVA_HOME)
os.environ["PATH"] = str(JAVA_HOME / "bin") + os.pathsep + os.environ.get("PATH", "")

PDF_PATH = Path("data/report.pdf")

dfs = tabula.read_pdf(str(PDF_PATH), pages="1", silent=True)
print(f"Extracted {len(dfs)} table(s)")

Setting os.environ before the first tabula.read_pdf() call is sufficient — Python's subprocess inherits os.environ at call time, not at import time.
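
The call-time behaviour is easy to confirm without Java at all. The following sketch spawns a child Python interpreter and reads back a variable that was set only after the parent's imports had finished; TABULA_DEMO_VAR and child_sees are made-up names for the demonstration.

```python
import os
import subprocess
import sys


def child_sees(name: str) -> str:
    """Spawn a child Python process and return its value of the env var `name`."""
    code = f"import os; print(os.environ.get({name!r}, ''))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Set after import time, immediately before the child process is launched.
os.environ["TABULA_DEMO_VAR"] = "set-at-call-time"
print(child_sees("TABULA_DEMO_VAR"))
# set-at-call-time
```

The same mechanism is what lets a PATH change made in os.environ reach the java process that tabula-py starts.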

Variant B: Docker Container with No JRE

A minimal Python base image (python:3.12-slim) ships without Java. Add the install step to your Dockerfile:

FROM python:3.12-slim

# Install a headless JRE (add ghostscript to this list if you also use camelot)
RUN apt-get update \
 && apt-get install -y --no-install-recommends \
      default-jre-headless \
 && rm -rf /var/lib/apt/lists/*

# Verify java is on PATH for the Python subprocess
RUN java -version

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . /app
WORKDIR /app
CMD ["python", "extract_tables.py"]

default-jre-headless is smaller than default-jre — it omits graphical components that a server container does not need.

Variant C: Passing java_options to tabula-py

If Java is found but the extraction fails with OutOfMemoryError on large files, pass JVM flags:

# pip install tabula-py
import tabula
from pathlib import Path

PDF_PATH = Path("data/large_report.pdf")

dfs = tabula.read_pdf(
    str(PDF_PATH),
    pages="all",
    multiple_tables=True,
    java_options=["-Xmx2g"],  # raise the max heap to 2 GB; size it to available RAM
    silent=True,
)
print(f"Extracted {len(dfs)} table(s)")

java_options is a list of strings appended to the java command before -jar. Common options: -Xmx2g (maximum heap size; raise it when you hit OutOfMemoryError), -Djava.awt.headless=true (suppress AWT warnings in headless environments).
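
If the heap size comes from configuration, a small builder keeps the flag formatting in one place. This is a sketch only; build_java_options is a hypothetical helper, and the flags it emits are the two mentioned above.

```python
from typing import List


def build_java_options(max_heap_mb: int, headless: bool = True) -> List[str]:
    """Build a java_options list for tabula.read_pdf from a heap size in MB."""
    if max_heap_mb <= 0:
        raise ValueError("max_heap_mb must be positive")
    options = [f"-Xmx{max_heap_mb}m"]
    if headless:
        options.append("-Djava.awt.headless=true")
    return options


print(build_java_options(2048))
# ['-Xmx2048m', '-Djava.awt.headless=true']
```

The returned list can be passed directly as java_options=build_java_options(2048).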

Verification

Run the diagnostic snippet from the Minimal Diagnostic section one final time. Then confirm a real extraction works end-to-end:

# pip install tabula-py pandas
import shutil
import subprocess
from pathlib import Path

import pandas as pd
import tabula

# Step 1: confirm java is reachable
# shutil.which avoids an unhandled FileNotFoundError when java is missing
assert shutil.which("java"), "java still not found on PATH"
r = subprocess.run(["java", "-version"], capture_output=True, text=True)
assert r.returncode == 0, f"java failed to run: {r.stderr}"
print("java OK:", r.stderr.splitlines()[0])

# Step 2: confirm tabula can read a PDF
PDF_PATH = Path("data/report.pdf")
assert PDF_PATH.exists(), f"Test PDF not found at {PDF_PATH}"

dfs = tabula.read_pdf(str(PDF_PATH), pages="1", silent=True)
assert len(dfs) > 0, "tabula returned no tables — check the PDF has a bordered table on page 1"
assert isinstance(dfs[0], pd.DataFrame), "Expected a DataFrame"
print(f"tabula OK: extracted {len(dfs)} table(s), first table shape {dfs[0].shape}")

Expected output:

java OK: openjdk version "17.0.x" ...
tabula OK: extracted 1 table(s), first table shape (12, 5)

If you still see JavaNotFoundError after installing Java, check whether your IDE or process runner has its own PATH override that excludes the Java binary directory. In VS Code, set "terminal.integrated.env.linux" (or .osx / .windows) to include the Java bin path.

If Java is found but your PDF returns 0 tables or garbled Unicode, see the pdfplumber vs camelot vs tabula comparison and try the pdfplumber fallback; tabula-py does not handle all PDF font encodings.
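
One way to wire up such a fallback is to try extractors in order and keep the first non-empty result. The sketch below assumes pdfplumber is installed (pip install pdfplumber); first_nonempty and the two extract_with_* wrappers are hypothetical names, and the tabula wrapper drops DataFrame headers for simplicity.

```python
from typing import Callable, List, Sequence

Table = List[List[object]]
Extractor = Callable[[str], List[Table]]


def extract_with_tabula(pdf_path: str) -> List[Table]:
    import tabula  # lazy import: only needed when this extractor runs

    frames = tabula.read_pdf(pdf_path, pages="all", silent=True)
    # Convert each DataFrame to plain rows (column headers are not included).
    return [frame.values.tolist() for frame in frames]


def extract_with_pdfplumber(pdf_path: str) -> List[Table]:
    import pdfplumber  # lazy import: only needed when this extractor runs

    tables: List[Table] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables


def first_nonempty(pdf_path: str, extractors: Sequence[Extractor]) -> List[Table]:
    """Try each extractor in order and return the first non-empty result."""
    for extract in extractors:
        try:
            tables = extract(pdf_path)
        except Exception as exc:  # broad on purpose: any backend failure triggers the fallback
            print(f"{extract.__name__} failed: {exc}")
            continue
        if tables:
            return tables
    return []
```

A call such as first_nonempty("data/report.pdf", [extract_with_tabula, extract_with_pdfplumber]) would then fall through to pdfplumber when tabula-py fails or finds nothing.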

FAQ

Why does java -version work in my terminal but tabula-py still raises JavaNotFoundError? Your interactive shell sources ~/.bashrc or ~/.zshrc, which sets PATH. Python's subprocess inherits the environment of the process that launched it — not your shell profile. If you start Python from an IDE launcher, a cron job, or a systemd service, the PATH in that environment may not include the Java bin directory. Fix: set JAVA_HOME and prepend $JAVA_HOME/bin to PATH in the same script, or set them as system-level environment variables (see the Windows SetEnvironmentVariable example above).
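
To pin down exactly which directories differ, you can compare the PATH your terminal reports (echo $PATH) with the one Python sees (os.environ["PATH"]). A minimal sketch; missing_path_entries is a hypothetical helper and the example paths are illustrative.

```python
import os
from typing import List


def missing_path_entries(
    process_path: str, shell_path: str, sep: str = os.pathsep
) -> List[str]:
    """Return entries present in shell_path but absent from process_path, in order."""
    have = set(filter(None, process_path.split(sep)))
    missing: List[str] = []
    for entry in shell_path.split(sep):
        if entry and entry not in have and entry not in missing:
            missing.append(entry)
    return missing


shell = "/usr/lib/jvm/java-17-openjdk-amd64/bin:/usr/local/bin:/usr/bin"
process = "/usr/local/bin:/usr/bin"
print(missing_path_entries(process, shell, sep=":"))
# ['/usr/lib/jvm/java-17-openjdk-amd64/bin']
```

Any Java bin directory that shows up in the result is the one to add to the launcher's environment.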

Does tabula-py require a full JDK or just a JRE? A JRE (Java Runtime Environment) is sufficient: tabula-py only runs a pre-compiled JAR and does not compile Java code. default-jre on Debian/Ubuntu installs a JRE; Homebrew's openjdk formula installs a full JDK, which works just as well.

Which Java version does tabula-py support? tabula-py's bundled JAR targets Java 8+ and has been tested through Java 21 LTS. Java 17 LTS is a reasonable choice for new installs: it is available on all major platforms via Adoptium. Note that default-jre on Ubuntu does not necessarily give you Java 17; the version depends on the release. Prefer a current LTS release over Java 8 on new setups.

Can I use tabula-py in a GitHub Actions workflow? Yes. The ubuntu-latest runner ships with Java pre-installed. Confirm with java -version in a run: step before calling your Python script. If you use the python:3.x-slim Docker container action instead, add a RUN apt-get update && apt-get install -y default-jre-headless step to your Dockerfile first.

tabula-py works locally but fails in production. What is different? Common causes: (1) production runs in a Docker container based on python:slim, which has no JRE; (2) a systemd service uses an EnvironmentFile that does not set JAVA_HOME; (3) an AWS Lambda / Cloud Run function uses a minimal runtime image. Use the Dockerfile fix in Variant B, or switch to pdfplumber for serverless environments where system dependencies are impractical to install.
