Converting DOCX to PDF with Python

.docx is the source format; PDF is what you send. Doing this manually — open Word, File → Export, repeat — falls apart the moment you have a folder of 200 contracts. Python can drive the conversion engine directly, whether that engine is Microsoft Word on Windows/macOS or LibreOffice headless on a Linux server.

Two common tools exist: docx2pdf, which wraps Word's COM automation, and LibreOffice's soffice --headless command, which runs the LibreOffice rendering engine without a GUI. Choosing the wrong one for your environment is the single most common conversion failure — docx2pdf simply will not run on Linux. This guide diagnoses your environment first, then provides battle-tested code for each path.

For background on dynamically generating the .docx files you will be converting, see Automating Word Document Creation. Once converted, if you need to assemble the resulting PDFs into a single deliverable, see Generating PDF Reports Dynamically.

Prerequisites

# Windows / macOS (requires Microsoft Word installed)
pip install docx2pdf

# Linux / server environments (requires LibreOffice)
# Ubuntu/Debian: sudo apt install libreoffice
# RHEL/CentOS:   sudo yum install libreoffice
# Verify: soffice --version

Set up a test file before proceeding:

mkdir -p input_docs output_pdfs
# Place a sample.docx in input_docs/ before running the examples below

1. Detect Your Environment

Before writing a single conversion line, check which engine is available. Running docx2pdf on Linux raises NotImplementedError immediately; running LibreOffice on a machine where only Word is available wastes time. The snippet below makes the engine explicit at startup.

# pip install docx2pdf  (Windows/macOS only)
import platform
import shutil
from pathlib import Path

def detect_engine() -> str:
    """Return 'docx2pdf' on Windows/macOS if the package imports, else 'libreoffice' if soffice is on PATH."""
    system = platform.system()
    if system in ("Windows", "Darwin"):
        try:
            import docx2pdf  # noqa: F401
            return "docx2pdf"
        except ImportError:
            pass
    soffice = shutil.which("soffice") or shutil.which("libreoffice")
    if soffice:
        return "libreoffice"
    raise RuntimeError(
        "No conversion engine found. "
        "Install docx2pdf (Windows/macOS) or LibreOffice (Linux/server)."
    )

print(detect_engine())

Figure: DOCX to PDF engine decision tree. A .docx source document feeds into an OS check (platform.system()); Windows or macOS leads to docx2pdf (Word COM automation), Linux or a server leads to LibreOffice headless (soffice CLI), and both paths produce a PDF output.

2. Convert a Single File with docx2pdf (Windows / macOS)

docx2pdf drives Microsoft Word itself: through COM automation on Windows and through a JXA (JavaScript for Automation) script on macOS. Because Word does the rendering, the output PDF preserves Word styles, embedded fonts, and track-changes markup exactly as Word displays them.

# pip install docx2pdf
from pathlib import Path
from docx2pdf import convert

INPUT = Path("input_docs/contract.docx")
OUTPUT = Path("output_pdfs/contract.pdf")

try:
    OUTPUT.parent.mkdir(parents=True, exist_ok=True)
    convert(INPUT, OUTPUT)
    print(f"Converted: {OUTPUT}")
except FileNotFoundError as exc:
    print(f"Input not found: {exc}")
except Exception as exc:
    # On Linux this raises NotImplementedError; on Windows it may raise
    # com_error if Word is not installed.
    print(f"Conversion failed: {exc}")

convert() accepts a file path or a directory path. When you pass a directory it converts every .docx it finds into the same folder. For custom output directories use the two-argument form shown above.
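The directory form can be sketched as follows. This is a minimal illustration, not docx2pdf's own API: convert_folder and its converter parameter are hypothetical names introduced here (the parameter exists only so the function can be exercised without Word), and the only docx2pdf call assumed is convert(input_path, output_path). The sketch also assumes that docx2pdf expects the output folder to exist already and that it does not descend into subfolders.

```python
from pathlib import Path
from typing import Callable, Optional


def convert_folder(
    input_dir: Path,
    output_dir: Path,
    converter: Optional[Callable[[str, str], None]] = None,
) -> list[Path]:
    """Convert every top-level .docx in input_dir; return the PDFs that did not appear."""
    if converter is None:
        # Imported lazily so this module still loads on machines without docx2pdf.
        from docx2pdf import convert as converter

    # Create the target folder up front (assumed not to be created by docx2pdf).
    output_dir.mkdir(parents=True, exist_ok=True)

    # Word writes lock files named "~$name.docx" while a document is open; skip them.
    sources = sorted(
        p for p in input_dir.glob("*.docx") if not p.name.startswith("~$")
    )

    # One call handles the whole folder.
    converter(str(input_dir), str(output_dir))

    # Compare what was expected with what actually landed on disk.
    expected = [output_dir / (p.stem + ".pdf") for p in sources]
    return [pdf for pdf in expected if not pdf.exists()]
```

Calling convert_folder(Path("input_docs"), Path("output_pdfs")) on a machine with Word installed should return an empty list when every file converted; any paths it returns are documents worth retrying individually.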

3. Convert a Single File with LibreOffice Headless (Linux / Server)

LibreOffice's --headless flag runs the full rendering pipeline without spawning a window. The --outdir flag controls where the PDF lands. Use subprocess so Python captures errors and exit codes.

# No pip install needed — requires: sudo apt install libreoffice
import subprocess
from pathlib import Path

INPUT = Path("input_docs/contract.docx")
OUTPUT_DIR = Path("output_pdfs")

try:
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        [
            "soffice",
            "--headless",
            "--convert-to", "pdf",
            "--outdir", str(OUTPUT_DIR),
            str(INPUT),
        ],
        capture_output=True,
        text=True,
        timeout=60,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    print(f"Converted: {OUTPUT_DIR / (INPUT.stem + '.pdf')}")
except FileNotFoundError:
    print("soffice not found — install LibreOffice and ensure it is on PATH.")
except subprocess.TimeoutExpired:
    print("Conversion timed out. File may be corrupt or very large.")
except Exception as exc:
    print(f"Conversion failed: {exc}")

Note: LibreOffice creates a user profile directory on first run. On headless servers this can default to a locked or read-only location. If you see user installation could not be completed errors, see the variant fix in Fix docx2pdf Error on Linux.
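The example above trusts the exit code alone. As a hedged refinement, the sketch below also confirms that a non-empty PDF actually exists afterwards, on the assumption that soffice can exit with status 0 in some failure cases without writing a file. build_soffice_command and convert_and_verify are hypothetical helper names; the flags are the same ones used in this section.

```python
import subprocess
from pathlib import Path


def build_soffice_command(docx_path: Path, output_dir: Path, soffice: str = "soffice") -> list[str]:
    """Assemble the headless conversion command used in this section."""
    return [
        soffice,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(output_dir),
        str(docx_path),
    ]


def convert_and_verify(docx_path: Path, output_dir: Path, timeout: int = 60) -> Path:
    """Run soffice and return the PDF path, raising if no usable file was written."""
    output_dir.mkdir(parents=True, exist_ok=True)
    pdf_path = output_dir / (docx_path.stem + ".pdf")
    # Remove a stale PDF so an old file cannot mask a failed run.
    pdf_path.unlink(missing_ok=True)

    result = subprocess.run(
        build_soffice_command(docx_path, output_dir),
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip() or "soffice exited non-zero")
    if not pdf_path.exists() or pdf_path.stat().st_size == 0:
        raise RuntimeError(f"soffice reported success but wrote no PDF for {docx_path.name}")
    return pdf_path
```

Keeping the command in its own function makes it easy to log or unit-test the exact arguments without launching LibreOffice.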

4. docx2pdf vs LibreOffice vs Cloud — Engine Comparison

| Engine | Platform | Fidelity | Speed | Requires |
|---|---|---|---|---|
| docx2pdf | Windows, macOS | Highest (Word renders it) | Medium (COM overhead) | Microsoft Word installed |
| LibreOffice headless | Linux, macOS, Windows | Good (minor font diffs possible) | Fast, parallelisable | LibreOffice ≥ 7 on PATH |
| Cloud APIs (Adobe, Zamzar, GroupDocs) | Any | Varies by vendor | Network-bound | API key + internet |
| WeasyPrint | Any | HTML/CSS only | Fast | HTML intermediate step |

If your documents use complex Word-specific features — mail merge fields, ActiveX controls, or custom VBA macros — only docx2pdf (which uses Word itself) will reproduce them faithfully. LibreOffice handles standard paragraph styles, tables, headers/footers, and embedded images reliably; fidelity degrades with advanced typography or proprietary OOXML extensions.

5. Batch Convert a Folder

The snippet below converts a folder recursively and mirrors the relative directory structure of the input in the output. The two engines are invoked differently, so the function wraps both behind a single interface.

# pip install docx2pdf   (Windows/macOS path)
# Linux path requires: sudo apt install libreoffice
import platform
import subprocess
from pathlib import Path

def batch_convert(input_dir: Path, output_dir: Path) -> list[Path]:
    """Convert all .docx files in input_dir to PDF and write them to output_dir."""
    input_dir = input_dir.resolve()
    output_dir = output_dir.resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    docx_files = list(input_dir.rglob("*.docx"))
    if not docx_files:
        print("No .docx files found.")
        return []

    system = platform.system()
    converted: list[Path] = []

    for docx_path in docx_files:
        # Mirror subdirectory structure
        rel = docx_path.relative_to(input_dir)
        out_subdir = output_dir / rel.parent
        out_subdir.mkdir(parents=True, exist_ok=True)
        pdf_path = out_subdir / (docx_path.stem + ".pdf")

        try:
            if system in ("Windows", "Darwin"):
                from docx2pdf import convert
                convert(docx_path, pdf_path)
            else:
                result = subprocess.run(
                    ["soffice", "--headless", "--convert-to", "pdf",
                     "--outdir", str(out_subdir), str(docx_path)],
                    capture_output=True, text=True, timeout=120,
                )
                if result.returncode != 0:
                    raise RuntimeError(result.stderr.strip())
            converted.append(pdf_path)
            print(f"  OK: {rel}")
        except Exception as exc:
            print(f"  FAIL: {rel}: {exc}")

    return converted


if __name__ == "__main__":
    results = batch_convert(
        input_dir=Path("input_docs"),
        output_dir=Path("output_pdfs"),
    )
    print(f"\nConverted {len(results)} file(s).")

6. Font and Layout Fidelity Caveats

Embedded vs. System Fonts

docx2pdf asks Word to render the document, so every font Word can access — including fonts embedded in the .docx — is available. LibreOffice renders with its own font engine; fonts embedded in the OOXML container are extracted at conversion time but system fonts referenced by name must be installed on the server.

Action: On Linux servers, install the Microsoft core fonts package to cover the most common Word typefaces:

# Ubuntu/Debian
sudo apt install ttf-mscorefonts-installer
sudo fc-cache -f -v

If documents use custom brand fonts, copy the .ttf/.otf files to /usr/share/fonts/custom/ and run fc-cache again before converting.
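Before converting, it can help to confirm that fontconfig actually sees the fonts a document needs. The sketch below is one way to do that, assuming the fc-list utility from fontconfig is on PATH; parse_fc_list, missing_fonts, and installed_font_families are hypothetical helper names, and the font list in the usage note is only an example.

```python
import subprocess


def parse_fc_list(output: str) -> set[str]:
    """Parse `fc-list : family` output into a set of lower-cased family names."""
    families: set[str] = set()
    for line in output.splitlines():
        # A line may list several comma-separated names for the same font.
        for name in line.split(","):
            name = name.strip()
            if name:
                families.add(name.lower())
    return families


def missing_fonts(required: list[str], installed: set[str]) -> list[str]:
    """Return the required families that are absent from the installed set."""
    return [name for name in required if name.lower() not in installed]


def installed_font_families() -> set[str]:
    """Ask fontconfig for every installed family (requires fc-list on PATH)."""
    result = subprocess.run(
        ["fc-list", ":", "family"],
        capture_output=True, text=True, check=True,
    )
    return parse_fc_list(result.stdout)
```

For example, missing_fonts(["Arial", "Times New Roman"], installed_font_families()) returns the families still to be installed; an empty list suggests the server can render those typefaces without substitution.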

Complex Layouts

LibreOffice may misplace:

  • Text boxes anchored to a character position (rather than the page)
  • Word Art and SmartArt graphics (rendered as bitmaps at low resolution)
  • Tables that span a page break with Keep together enabled
  • Headers/footers using linked-story chains

For documents with these features, prefer docx2pdf on a Windows CI runner. Alternatively, convert the .docx to HTML first and use WeasyPrint for the PDF step; python-docx does not export HTML, so this route needs a separate converter such as mammoth or pandoc, and the HTML conversion itself loses complex formatting.
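To decide which documents deserve the Word path, a rough pre-scan can flag files that contain some of the features listed above. The sketch below is a heuristic under stated assumptions: it treats the substring `<w:txbxContent` as a sign of a text box and the DrawingML diagram namespace as a sign of SmartArt, and it only looks at word/document.xml. layout_risks and RISK_MARKERS are hypothetical names, and the check can miss features or report false positives.

```python
import zipfile
from pathlib import Path

# Substrings assumed to indicate each feature in word/document.xml.
RISK_MARKERS = {
    "text box": b"<w:txbxContent",
    "SmartArt": b"drawingml/2006/diagram",
}


def layout_risks(docx_path: Path) -> list[str]:
    """Return labels of risky features whose marker appears in the main document part."""
    # A .docx is a ZIP archive; the body text lives in word/document.xml.
    with zipfile.ZipFile(docx_path) as archive:
        xml = archive.read("word/document.xml")
    return [label for label, marker in RISK_MARKERS.items() if marker in xml]
```

A batch job could send any file with a non-empty result to a Windows runner and let LibreOffice handle the rest.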

7. Edge Cases and Variants

Password-Protected DOCX Files

Both engines will fail silently or raise on encrypted .docx files. Strip the password first:

# pip install msoffcrypto-tool
import msoffcrypto
from pathlib import Path
import io

def decrypt_docx(encrypted_path: Path, password: str) -> bytes:
    """Return decrypted .docx bytes, ready for conversion."""
    try:
        with open(encrypted_path, "rb") as f:
            office_file = msoffcrypto.OfficeFile(f)
            office_file.load_key(password=password)
            output = io.BytesIO()
            office_file.decrypt(output)
            return output.getvalue()
    except msoffcrypto.exceptions.InvalidKeyError as exc:
        # Chain the original error so the traceback keeps the underlying cause.
        raise ValueError("Incorrect password for the .docx file.") from exc
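Two small helpers round this out; both are sketches with hypothetical names. looks_encrypted relies on the fact that a password-protected .docx is stored as an OLE compound file rather than a ZIP archive, so its first bytes differ (a legacy .doc renamed to .docx would also match, so treat a positive result as a hint). write_decrypted_copy saves decrypted bytes to disk, since both conversion engines take a file path rather than in-memory data.

```python
from pathlib import Path

# First eight bytes of an OLE compound file, the container used for encrypted OOXML.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_encrypted(docx_path: Path) -> bool:
    """Return True if the file starts with the OLE header instead of a ZIP header."""
    with open(docx_path, "rb") as f:
        header = f.read(len(OLE_MAGIC))
    return header == OLE_MAGIC


def write_decrypted_copy(data: bytes, original: Path, work_dir: Path) -> Path:
    """Write decrypted bytes into work_dir under the original file name and return the path."""
    work_dir.mkdir(parents=True, exist_ok=True)
    target = work_dir / original.name
    target.write_bytes(data)
    return target
```

A typical flow would call looks_encrypted on each input, run the decryption function from this section on the positives, and hand the path returned by write_decrypted_copy to the converter. Placing work_dir inside a temporary directory keeps unprotected copies from lingering on disk.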

Skipping Already-Converted Files

In incremental runs you want to skip files whose PDF is newer than the source .docx:

from pathlib import Path

def needs_conversion(docx_path: Path, pdf_path: Path) -> bool:
    """Return True if the PDF is missing or older than the .docx."""
    if not pdf_path.exists():
        return True
    return docx_path.stat().st_mtime > pdf_path.stat().st_mtime

Running LibreOffice in Docker

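For reproducible server deployments, avoid relying on whatever LibreOffice version the host happens to have installed. A minimal Dockerfile follows; on Debian-based images ttf-mscorefonts-installer lives in the contrib component, so that component may need to be enabled for the install step to succeed: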

FROM python:3.12-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
    libreoffice \
    ttf-mscorefonts-installer \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir docx2pdf  # kept for Windows/macOS parity; Linux path uses soffice
CMD ["python", "batch_docx_to_pdf.py", "input_docs", "output_pdfs"]

8. Validation

After conversion, open each PDF with pypdf to confirm that it is readable and has at least the expected number of pages:

# pip install pypdf
from pathlib import Path
from pypdf import PdfReader

def validate_pdf(pdf_path: Path, expected_min_pages: int = 1) -> bool:
    """Return True if the PDF is readable and has at least expected_min_pages."""
    try:
        reader = PdfReader(pdf_path)
        actual = len(reader.pages)
        if actual < expected_min_pages:
            print(f"[WARN] {pdf_path.name}: only {actual} page(s), expected ≥ {expected_min_pages}")
            return False
        return True
    except Exception as exc:
        print(f"[FAIL] {pdf_path.name}: {exc}")
        return False

# Quick batch check
output_dir = Path("output_pdfs")
for pdf in output_dir.rglob("*.pdf"):
    validate_pdf(pdf)

9. Performance and Scale

LibreOffice parallelism: Each soffice process locks its user-profile directory. Running multiple conversions in parallel fails unless each process gets its own profile:

import subprocess
import tempfile
from pathlib import Path

def soffice_convert_isolated(docx_path: Path, output_dir: Path) -> None:
    """Run soffice with a per-process user profile to allow parallelism."""
    with tempfile.TemporaryDirectory() as tmp:
        profile_dir = Path(tmp) / "lo_profile"
        profile_dir.mkdir()
        subprocess.run(
            [
                "soffice",
                # as_uri() builds a valid file:// URI on every platform.
                f"-env:UserInstallation={profile_dir.as_uri()}",
                "--headless",
                "--convert-to", "pdf",
                "--outdir", str(output_dir),
                str(docx_path),
            ],
            capture_output=True, text=True, timeout=120, check=True,
        )

docx2pdf parallelism: Passing a directory path to docx2pdf converts its files one after another through a single Word instance, so it batches work but does not parallelise it. True parallelism would mean driving several Word processes through separate COM instances, which is rarely stable in practice. Prefer sequential conversion with docx2pdf and parallel conversion with LibreOffice.

Memory: LibreOffice loads the full document model into memory. Documents with many embedded images can exceed 2 GB RAM per process. Monitor with psutil and cap concurrent workers accordingly.
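The LibreOffice advice above (isolated profiles plus a cap on concurrent workers) might be combined as in the sketch below. convert_parallel is a hypothetical helper: it takes any per-file conversion function, such as soffice_convert_isolated from this section, and runs it through a thread pool. Threads are enough here on the assumption that the real work happens in the soffice child processes; the default of two workers is a placeholder to be tuned against available memory.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def convert_parallel(
    docx_files: list[Path],
    output_dir: Path,
    convert_one: Callable[[Path, Path], None],
    max_workers: int = 2,
) -> tuple[list[Path], dict[Path, str]]:
    """Run convert_one over docx_files with at most max_workers conversions at a time."""
    done: list[Path] = []
    failed: dict[Path, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its source file so failures can be reported by name.
        futures = {pool.submit(convert_one, path, output_dir): path for path in docx_files}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()
                done.append(path)
            except Exception as exc:
                failed[path] = str(exc)
    return sorted(done), failed
```

A call such as convert_parallel(files, Path("output_pdfs"), soffice_convert_isolated, max_workers=4) returns the successes and a mapping of failures to error messages. Because every PDF lands in one output_dir, two inputs with the same file name in different subfolders would overwrite each other; mirror the subfolders as in the batch example if that can happen.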

10. Troubleshooting

| Error | Root cause | Fix |
|---|---|---|
| NotImplementedError: docx2pdf is not implemented for linux | docx2pdf requires Word, unavailable on Linux | Switch to LibreOffice headless; see Fix docx2pdf Error on Linux |
| com_error: -2147221005 | Word COM server not registered / Word not installed | Install Microsoft Word or use LibreOffice |
| soffice: command not found | LibreOffice not on PATH | sudo apt install libreoffice or add LibreOffice bin dir to PATH |
| user installation could not be completed | LibreOffice profile dir locked or read-only | Use -env:UserInstallation with a per-run temp dir (see section 9) |
| Garbled or missing text in PDF | Font not installed on Linux server | Install ttf-mscorefonts-installer and custom fonts; run fc-cache |
| PDF page count is 0 | Empty or corrupt .docx | Validate with python-docx before converting: Document(path).paragraphs |
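Where python-docx is not installed, a coarser pre-check is possible with the standard library alone, because a .docx is a ZIP archive whose body lives in word/document.xml. The sketch below only confirms that structure; is_plausible_docx is a hypothetical name, and passing the check does not guarantee that Word or LibreOffice can open the file.

```python
import zipfile
from pathlib import Path


def is_plausible_docx(docx_path: Path) -> bool:
    """Return True if the file is a ZIP archive with a non-empty word/document.xml."""
    if not zipfile.is_zipfile(docx_path):
        return False
    try:
        with zipfile.ZipFile(docx_path) as archive:
            # getinfo raises KeyError when the main document part is missing.
            return archive.getinfo("word/document.xml").file_size > 0
    except (KeyError, zipfile.BadZipFile):
        return False
```

Running this over the input folder before a batch makes it cheap to set aside truncated uploads and zero-byte files instead of spending a conversion timeout on each.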

11. Complete Script

#!/usr/bin/env python3
"""
batch_docx_to_pdf.py — Convert a folder of .docx files to PDF.

Usage:
    python batch_docx_to_pdf.py input_docs/ output_pdfs/

Engines:
    Windows / macOS: docx2pdf (requires Microsoft Word)
    Linux / server:  LibreOffice headless (requires soffice on PATH)

pip install docx2pdf pypdf    # docx2pdf only needed on Windows/macOS
"""
import argparse
import platform
import subprocess
import tempfile
from pathlib import Path

try:
    from pypdf import PdfReader
    PYPDF_AVAILABLE = True
except ImportError:
    PYPDF_AVAILABLE = False

SYSTEM = platform.system()


def soffice_convert(docx_path: Path, output_dir: Path) -> None:
    """Convert via LibreOffice headless with an isolated user profile."""
    with tempfile.TemporaryDirectory() as tmp:
        profile = Path(tmp) / "lo_profile"
        profile.mkdir()
        result = subprocess.run(
            [
                "soffice",
                # as_uri() builds a valid file:// URI on every platform.
                f"-env:UserInstallation={profile.as_uri()}",
                "--headless",
                "--convert-to", "pdf",
                "--outdir", str(output_dir),
                str(docx_path),
            ],
            capture_output=True, text=True, timeout=120,
        )
        if result.returncode != 0:
            raise RuntimeError(result.stderr.strip() or "soffice exited non-zero")


def docx2pdf_convert(docx_path: Path, output_path: Path) -> None:
    """Convert via docx2pdf (Windows/macOS only)."""
    from docx2pdf import convert  # noqa: PLC0415
    convert(docx_path, output_path)


def validate(pdf_path: Path) -> bool:
    if not PYPDF_AVAILABLE:
        return pdf_path.exists() and pdf_path.stat().st_size > 0
    try:
        return len(PdfReader(pdf_path).pages) > 0
    except Exception:
        return False


def batch_convert(input_dir: Path, output_dir: Path, skip_existing: bool = True) -> None:
    input_dir = input_dir.resolve()
    output_dir = output_dir.resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    files = sorted(input_dir.rglob("*.docx"))
    if not files:
        print("No .docx files found.")
        return

    ok = fail = skipped = 0

    for docx_path in files:
        rel = docx_path.relative_to(input_dir)
        out_subdir = output_dir / rel.parent
        out_subdir.mkdir(parents=True, exist_ok=True)
        pdf_path = out_subdir / (docx_path.stem + ".pdf")

        if skip_existing and pdf_path.exists() and pdf_path.stat().st_mtime >= docx_path.stat().st_mtime:
            skipped += 1
            continue

        try:
            if SYSTEM in ("Windows", "Darwin"):
                docx2pdf_convert(docx_path, pdf_path)
            else:
                soffice_convert(docx_path, out_subdir)

            if validate(pdf_path):
                print(f"  OK:   {rel}")
                ok += 1
            else:
                print(f"  WARN: {rel} — PDF validation failed")
                fail += 1
        except Exception as exc:
            print(f"  FAIL: {rel}: {exc}")
            fail += 1

    print(f"\nDone: {ok} converted, {fail} failed, {skipped} skipped.")


def main() -> None:
    parser = argparse.ArgumentParser(description="Batch convert .docx to PDF")
    parser.add_argument("input_dir", type=Path)
    parser.add_argument("output_dir", type=Path)
    parser.add_argument("--no-skip", action="store_true", help="Re-convert even if PDF exists")
    args = parser.parse_args()
    batch_convert(args.input_dir, args.output_dir, skip_existing=not args.no_skip)


if __name__ == "__main__":
    main()

Part of Word Document Templating & Batch Processing.
