Fix "TesseractNotFoundError" in Python
Calling pytesseract.image_to_string() raises the following error the first time it is run in a fresh environment:
pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your PATH
The error means pytesseract — the Python wrapper — is installed, but the Tesseract binary it wraps is not. The two are separate: pip install pytesseract installs only the Python adapter; the actual OCR engine must be installed at the OS level independently.
Root Cause
pytesseract works by shelling out to the tesseract executable. When you call image_to_string, it launches the tesseract command as a subprocess. If the binary cannot be found on PATH (or at the configured tesseract_cmd), pytesseract raises TesseractNotFoundError before any image processing happens. Installing or updating pytesseract alone cannot fix this; only installing the binary (or telling pytesseract where to find an existing one) resolves it.
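The failure mode can be reproduced without pytesseract at all. The sketch below is illustrative only and is not pytesseract's actual source: `BinaryNotFoundError` and `run_binary` are hypothetical names, used to show how a wrapper can turn the operating system's "file not found" into a friendlier exception.

```python
import subprocess


class BinaryNotFoundError(RuntimeError):
    """Hypothetical stand-in for a wrapper-specific 'binary missing' error."""


def run_binary(cmd: str, *args: str) -> str:
    """Run an external program and return its stdout as text."""
    try:
        completed = subprocess.run(
            [cmd, *args], capture_output=True, text=True, check=True
        )
    except FileNotFoundError as exc:
        # The OS could not locate the executable on PATH or at the given path.
        raise BinaryNotFoundError(
            f"{cmd} is not installed or it's not in your PATH"
        ) from exc
    return completed.stdout
```

Calling `run_binary("tesseract", "--version")` on a machine without the binary raises the custom error, which mirrors what you see from pytesseract in a fresh environment.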
The architecture looks like this: your Python script → pytesseract (PyPI package) → tesseract OS binary → tessdata language files. All three layers beneath your script must be present. A fresh virtual environment does not inherit system packages, but it does inherit the system PATH, so if tesseract is installed at the system level, pytesseract inside a venv will find it without any extra configuration.
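The PATH inheritance described above can be confirmed from inside the interpreter. This is a minimal sketch using only the standard library; `describe_environment` is a hypothetical helper name, and it checks the binary layer only, not the tessdata files.

```python
import shutil
import sys


def describe_environment(binary: str = "tesseract") -> dict:
    """Report whether we are in a venv and where PATH resolves the binary."""
    return {
        # In a venv, sys.prefix points at the venv and differs from base_prefix
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # shutil.which performs the same PATH lookup the OS uses; None if absent
        "binary_path": shutil.which(binary),
    }


info = describe_environment()
print("virtualenv active:", info["in_virtualenv"])
print("tesseract resolved to:", info["binary_path"] or "not found on PATH")
```

If the second line prints a path while a virtual environment is active, the venv is seeing the system-level install as expected.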
Minimal Diagnostic
Confirm the binary is missing before reaching for any fix:
# pip install pytesseract
import subprocess
import pytesseract
# Check 1: is the binary on PATH at all?
try:
    out = subprocess.check_output(["tesseract", "--version"], text=True)
    print("Binary found:", out.splitlines()[0])
except FileNotFoundError:
    print("Binary NOT found on PATH")
# Check 2: what path is pytesseract looking for?
print("pytesseract is looking for:", pytesseract.pytesseract.tesseract_cmd)
If Check 1 prints Binary NOT found on PATH, install the binary (see below). If the binary exists but Check 2 shows a wrong path, set tesseract_cmd directly.
Fix 1 — Install the Tesseract Binary
Linux (Ubuntu / Debian)
sudo apt-get update
sudo apt-get install -y tesseract-ocr
# Verify
tesseract --version
Additional language packs are separate packages; install the ones you need the same way:
sudo apt-get install -y tesseract-ocr-deu tesseract-ocr-fra tesseract-ocr-spa
After installation tesseract --version should print tesseract 4.x.x or 5.x.x. Restart your Python process — no reboot needed.
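The major version can also be checked programmatically, which helps when language data must match the engine version. The following is a small sketch: `parse_major_version` is a hypothetical helper, and the sample strings are illustrative rather than captured from a specific build.

```python
import re
from typing import Optional


def parse_major_version(version_line: str) -> Optional[int]:
    """Extract the major version from the first line of `tesseract --version`."""
    match = re.search(r"(\d+)\.\d+", version_line)
    return int(match.group(1)) if match else None


print(parse_major_version("tesseract 5.3.0"))  # → 5
```

Feed it the first line of the command's output, for example the `out.splitlines()[0]` value from the diagnostic script above.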
macOS
brew install tesseract
# With extra language packs
brew install tesseract-lang
# Verify
tesseract --version
which tesseract
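Homebrew installs to /usr/local/bin/tesseract (Intel) or /opt/homebrew/bin/tesseract (Apple Silicon). Both are on PATH in a standard Homebrew setup; if which tesseract prints nothing on Apple Silicon, confirm that your shell profile runs Homebrew's shellenv setup step.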
Windows
- Download the installer from the UB Mannheim builds: https://github.com/UB-Mannheim/tesseract/wiki
- Run the installer. Take note of the install path, typically C:\Program Files\Tesseract-OCR\.
- Add that folder to your system PATH:
  - Open System Properties → Environment Variables.
  - Select Path under System variables → Edit → New.
  - Paste C:\Program Files\Tesseract-OCR\ → OK all dialogs.
- Open a new terminal (the current one will not pick up the change) and verify:
tesseract --version
If you cannot modify the system PATH (corporate machine, CI environment), set tesseract_cmd in code instead — see Fix 2 below.
Windows PATH Troubleshooting
The most common Windows failure mode: the installer ran, C:\Program Files\Tesseract-OCR\tesseract.exe exists, but tesseract --version still fails in a new PowerShell window. The cause is almost always a session that predates the PATH change. Open a brand-new terminal after editing PATH — Windows does not propagate environment variable changes to already-open sessions.
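To check whether the PATH entry was saved without closing your current session, read the stored values directly. The session's own $env:PATH is a snapshot taken when the window opened, so it will not show a later edit:
# List stored PATH entries (machine and user) that contain "tesseract" (case-insensitive)
$stored = [Environment]::GetEnvironmentVariable("Path", "Machine") + ";" + [Environment]::GetEnvironmentVariable("Path", "User")
$stored -split ";" | Where-Object { $_ -match "tesseract" }
If that returns nothing, the PATH entry was either not saved or was added to the User variables of a different account. Repeat the PATH edit, then open a new terminal (or log out and back in) and try again.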
A second Windows-specific gotcha: spaces in the path. If your install is under C:\Program Files\, the space is handled correctly by Windows PATH, but older tools that use raw string concatenation may fail. Setting tesseract_cmd to the full path (Fix 2) sidesteps this entirely.
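The same PATH inspection can be done from Python, which is useful when the script runs under a different account or service than your terminal. This is a minimal sketch; `path_entries_with` is a hypothetical helper, and it inspects the PATH of the current process, so the caveat about stale sessions applies here too.

```python
import os
from pathlib import Path
from typing import List, Optional


def path_entries_with(exe_name: str, path_value: Optional[str] = None) -> List[str]:
    """Return the PATH directories that contain a file named exe_name."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for entry in path_value.split(os.pathsep):
        # Skip empty entries produced by stray separators
        if entry and (Path(entry) / exe_name).is_file():
            hits.append(entry)
    return hits


exe = "tesseract.exe" if os.name == "nt" else "tesseract"
print(path_entries_with(exe) or "no PATH entry contains the binary")
```

Because the path is joined with `pathlib` rather than by string concatenation, directories with spaces such as C:\Program Files\ need no special handling.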
Fix 2 — Set tesseract_cmd in Code
If the binary is installed but not on PATH, or if you need to point to a non-default install location, set the path explicitly before calling any pytesseract function:
# pip install pytesseract Pillow
from pathlib import Path
import pytesseract
from PIL import Image
# Adjust this path to match your actual install location
# Windows example:
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# macOS Homebrew (Apple Silicon):
# pytesseract.pytesseract.tesseract_cmd = "/opt/homebrew/bin/tesseract"
# Linux custom install:
# pytesseract.pytesseract.tesseract_cmd = "/usr/local/bin/tesseract"
# Now calls will succeed
try:
    img = Image.open(Path("scan.png"))
    text = pytesseract.image_to_string(img)
    print(text[:200])
except pytesseract.pytesseract.TesseractNotFoundError as exc:
    print(f"Still not found at: {pytesseract.pytesseract.tesseract_cmd}")
    raise
Use pathlib.Path to resolve the actual location at runtime so the script is portable:
# pip install pytesseract
from pathlib import Path
import pytesseract
# Auto-detect on Windows where the installer places the binary
_WIN_DEFAULT = Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe")
if _WIN_DEFAULT.exists():
    pytesseract.pytesseract.tesseract_cmd = str(_WIN_DEFAULT)
# On Linux/macOS the binary should already be on PATH after apt/brew install
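The single-path check can be generalised to a short list of candidates. The sketch below uses only the install locations already mentioned in this article; `CANDIDATES` and `first_existing` are hypothetical names, and the list is an assumption you should adjust to your own machines.

```python
from pathlib import Path
from typing import Iterable, Optional

# Candidate locations taken from the install sections of this article
CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/opt/homebrew/bin/tesseract",
    "/usr/local/bin/tesseract",
]


def first_existing(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that exists as a file, or None."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


found = first_existing(CANDIDATES)
print(found or "no candidate path exists; rely on PATH instead")
```

With pytesseract installed, assign a non-None result to `pytesseract.pytesseract.tesseract_cmd` and otherwise leave the default untouched so the PATH lookup still applies.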
Making tesseract_cmd Environment-Aware
Hard-coding a path makes scripts non-portable. A better pattern reads the path from an environment variable with a sensible fallback:
# pip install pytesseract Pillow
import os
from pathlib import Path
import pytesseract
from PIL import Image
def configure_tesseract() -> None:
    """
    Set tesseract_cmd from TESSERACT_CMD env var, or auto-detect
    the default Windows install location as a fallback.
    Does nothing on Linux/macOS where the binary is typically on PATH.
    """
    env_cmd = os.environ.get("TESSERACT_CMD")
    if env_cmd:
        # Explicitly configured: trust it
        pytesseract.pytesseract.tesseract_cmd = env_cmd
        return
    # Windows auto-detect
    win_default = Path(r"C:\Program Files\Tesseract-OCR\tesseract.exe")
    if win_default.exists():
        pytesseract.pytesseract.tesseract_cmd = str(win_default)
# Call once at module load time
configure_tesseract()
# Then use pytesseract normally
img = Image.open(Path("scan.png"))
try:
    text = pytesseract.image_to_string(img)
    print(text[:200])
except pytesseract.pytesseract.TesseractNotFoundError as exc:
    raise RuntimeError(
        "Set TESSERACT_CMD=/path/to/tesseract or install via apt/brew"
    ) from exc
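Set TESSERACT_CMD in your shell profile, in a .env file that your process manager or a dotenv loader reads, or in your CI environment variables, and the same code works across developer machines and deployment targets without any per-machine edits. Note that os.environ does not read .env files by itself; something must load the file into the process environment first.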
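Trusting the environment variable blindly postpones a typo until the first OCR call. One option, sketched here under the assumption that failing early is preferable, is to validate the value at startup; `resolve_cmd_from_env` is a hypothetical helper and is not part of pytesseract.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cmd_from_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a validated TESSERACT_CMD value, or None when it is unset."""
    value = env.get("TESSERACT_CMD")
    if not value:
        return None
    if not Path(value).is_file():
        raise ValueError(f"TESSERACT_CMD points to a missing file: {value}")
    if not os.access(value, os.X_OK):
        raise ValueError(f"TESSERACT_CMD is not executable: {value}")
    return value


# Usage: call once at startup and assign a non-None result to
# pytesseract.pytesseract.tesseract_cmd; None means "fall back to PATH".
```

Passing the environment as a parameter keeps the function easy to test with a plain dictionary.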
Fix 3 — Docker / CI Environments
In a Docker-based build or CI pipeline, the Python dependencies and the system binary are installed in separate layers. A common mistake is installing pytesseract via pip in a requirements.txt without also installing the binary in the Dockerfile.
FROM python:3.12-slim
# Install Tesseract binary first
RUN apt-get update && apt-get install -y --no-install-recommends \
    tesseract-ocr \
    tesseract-ocr-eng \
    && rm -rf /var/lib/apt/lists/*
# Then install Python deps
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app
WORKDIR /app
CMD ["python", "ocr_pipeline.py"]
Verify inside the container:
docker run --rm your-image tesseract --version
For GitHub Actions, add a step before pip install:
- name: Install Tesseract
  run: sudo apt-get update && sudo apt-get install -y tesseract-ocr
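A pre-flight check at the start of a container entrypoint or CI job makes a missing binary fail fast with a clear message instead of surfacing halfway through a pipeline. This is a minimal standard-library sketch; `preflight` is a hypothetical function name.

```python
import shutil
import subprocess


def preflight(binary: str = "tesseract") -> int:
    """Return 0 if the binary is found and runs, 1 otherwise."""
    path = shutil.which(binary)
    if path is None:
        print(f"{binary}: not found on PATH")
        return 1
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Use whichever stream carries the version banner
    output = (result.stdout or result.stderr).strip()
    first_line = output.splitlines()[0] if output else "(no version output)"
    print(f"{binary}: {path} ({first_line})")
    return 0 if result.returncode == 0 else 1
```

In a CI script, pass the return value to `sys.exit()` so the job stops before any OCR work is attempted.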
GitHub Actions: Caching the Tesseract Install
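On hosted GitHub Actions runners apt-get install tesseract-ocr takes 15–30 seconds. The steps below cache the Tesseract data directory and skip the install when the binary is already present. Be aware that /usr/share/tesseract-ocr holds only the language data; the binary itself is installed elsewhere (under /usr/bin on Ubuntu) and is not covered by this cache path, so on a fresh hosted runner the install step still runs. The pattern pays off mainly on self-hosted or otherwise persistent runners: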
- name: Cache Tesseract
  uses: actions/cache@v4
  with:
    path: /usr/share/tesseract-ocr
    key: tesseract-${{ runner.os }}-v5
- name: Install Tesseract (if cache miss)
  run: |
    if ! command -v tesseract &> /dev/null; then
      sudo apt-get install -y tesseract-ocr
    fi
- name: Install Python deps
  run: pip install -r requirements.txt
Fix 4 — Missing Language Data (TESSDATA_PREFIX)
A related but distinct error appears when the binary is found but a specific language pack is missing:
Error, could not initialize tesseract API with language "deu".
This means the language .traineddata file is absent. Fix:
# Install the pack
sudo apt-get install -y tesseract-ocr-deu
# Or set TESSDATA_PREFIX to a custom directory containing .traineddata files
export TESSDATA_PREFIX=/opt/tessdata/
# Verify available languages
tesseract --list-langs
In Python:
# pip install pytesseract Pillow
import os
from pathlib import Path
import pytesseract
from PIL import Image
# Point to a custom tessdata directory if needed
os.environ["TESSDATA_PREFIX"] = "/opt/tessdata/"
img = Image.open(Path("german_invoice.png"))
try:
    text = pytesseract.image_to_string(img, lang="deu")
    print(text[:200])
except pytesseract.pytesseract.TesseractNotFoundError:
    print("Binary missing: install tesseract-ocr")
except pytesseract.pytesseract.TesseractError as exc:
    # Raised when the binary runs but fails, e.g. a missing language pack
    print(f"OCR error: {exc}")
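Rather than waiting for the OCR call to fail, the requested languages can be compared against the installed ones up front. The sketch below is a pure function so it runs without Tesseract; `missing_languages` is a hypothetical helper, and the hard-coded list stands in for the real one.

```python
from typing import Iterable, List


def missing_languages(requested: str, available: Iterable[str]) -> List[str]:
    """Return requested codes (e.g. "deu+eng") that are absent from available."""
    have = set(available)
    return [code for code in requested.split("+") if code and code not in have]


# Hard-coded example list; with pytesseract installed, the list can come
# from pytesseract.get_languages() instead.
print(missing_languages("deu+eng", ["eng", "osd"]))  # → ['deu']
```

An empty result means every language in the `lang` argument has a matching pack installed.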
Variant: TesseractError After the Binary Is Found
Once the binary is installed and on PATH, a second class of error can appear:
pytesseract.pytesseract.TesseractError: (1, 'Error, could not initialize tesseract API')
This is not the same as TesseractNotFoundError. The binary was found but crashed during initialisation. Common causes:
- Corrupt tessdata directory: reinstall with sudo apt-get install --reinstall tesseract-ocr.
- Wrong TESSDATA_PREFIX: the environment variable points to a directory that does not contain the expected .traineddata files. Unset TESSDATA_PREFIX and let Tesseract use its compiled-in default path.
- Version mismatch: a Tesseract 5 binary with Tesseract 4 language files (or vice versa). Check tesseract --version and ensure your language pack packages match.
Quick diagnostic:
# Print the tessdata directory Tesseract is actually using
tesseract --print-parameters 2>&1 | grep tessdata_dir
# List detected languages
tesseract --list-langs
If --list-langs returns an empty list or crashes, the tessdata directory is either missing or misconfigured.
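A wrong TESSDATA_PREFIX can also be spotted from Python by listing the .traineddata files in the directory it names. This sketch assumes the variable points directly at the directory holding the files, as in the export example earlier; `traineddata_languages` is a hypothetical helper.

```python
import os
from pathlib import Path
from typing import List


def traineddata_languages(tessdata_dir: str) -> List[str]:
    """List language codes that have a .traineddata file in tessdata_dir."""
    directory = Path(tessdata_dir)
    if not directory.is_dir():
        return []
    return sorted(p.stem for p in directory.glob("*.traineddata"))


prefix = os.environ.get("TESSDATA_PREFIX")
if prefix:
    print(f"{prefix}: {traineddata_languages(prefix) or 'no .traineddata files'}")
else:
    print("TESSDATA_PREFIX is not set; Tesseract uses its default path")
```

An empty list for a directory you expected to be populated points to the misconfiguration described in the second bullet above.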
Variant: ImportError for pytesseract Itself
If your error is:
ModuleNotFoundError: No module named 'pytesseract'
the Python wrapper is missing from the current environment. Install it:
pip install pytesseract
This is distinct from TesseractNotFoundError. In a virtual environment, always confirm you are installing into the correct env:
which python # should point inside your venv
pip show pytesseract # should show a Version line
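The same check works from inside Python, which removes any doubt about which interpreter is being inspected. This is a minimal standard-library sketch; `locate_module` is a hypothetical helper name.

```python
import importlib.util
import sys
from typing import Optional


def locate_module(name: str) -> Optional[str]:
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


print("interpreter:", sys.executable)
print("pytesseract:", locate_module("pytesseract") or "not installed in this environment")
```

If the interpreter path is outside your venv, or the module location is in a different environment's site-packages, activate the right environment before installing.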
Verification
After applying any fix, run this end-to-end smoke test to confirm the full stack is working:
# pip install pytesseract Pillow
from PIL import Image, ImageDraw
import pytesseract
def smoke_test_ocr() -> None:
    """
    Create a minimal in-memory image with known text, run OCR,
    and assert the result matches.
    """
    # Draw "Hello OCR" on a white image
    img = Image.new("RGB", (200, 60), color=(255, 255, 255))
    draw = ImageDraw.Draw(img)
    draw.text((10, 15), "Hello OCR", fill=(0, 0, 0))
    # Pillow's default font is small and may be too small to recognise
    # reliably, so upscale the image before running OCR
    img = img.resize((img.width * 3, img.height * 3))
    result = pytesseract.image_to_string(img).strip()
    assert "Hello" in result, f"OCR smoke test failed, got: {result!r}"
    version = pytesseract.get_tesseract_version()
    print(f"OK: Tesseract {version}, recognised: {result!r}")
smoke_test_ocr()
# Example output: OK: Tesseract 5.3.x, recognised: 'Hello OCR'
If the assertion passes, the binary, the Python wrapper, and the default language pack are all correctly configured, and the full Scanning and OCR Processing with Python pipeline should run without further Tesseract setup.
Related
- Scanning and OCR Processing with Python — full pipeline: rasterize, preprocess, OCR, searchable PDF
- How to Extract Tables from Scanned PDFs — coordinate-clustering to extract tabular data once OCR is working
- Extracting Tables from PDFs — standard (non-OCR) table extraction for vector PDFs