Converting DOCX to PDF with Python
.docx is the source format; PDF is what you send. Doing this manually — open Word, File → Export, repeat — falls apart the moment you have a folder of 200 contracts. Python can drive the conversion engine directly, whether that engine is Microsoft Word on Windows/macOS or LibreOffice headless on a Linux server.
Two common tools exist: docx2pdf, which automates Microsoft Word itself, and LibreOffice's soffice --headless command, which runs the LibreOffice rendering engine without a GUI. Choosing the wrong one for your environment is a common cause of conversion failures: docx2pdf simply will not run on Linux. This guide diagnoses your environment first, then provides working code for each path.
For background on dynamically generating the .docx files you will be converting, see Automating Word Document Creation. Once converted, if you need to assemble the resulting PDFs into a single deliverable, see Generating PDF Reports Dynamically.
Prerequisites
# Windows / macOS (requires Microsoft Word installed)
pip install docx2pdf
# Linux / server environments (requires LibreOffice)
# Ubuntu/Debian: sudo apt install libreoffice
# RHEL/CentOS: sudo yum install libreoffice
# Verify: soffice --version
Set up a test file before proceeding:
mkdir -p input_docs output_pdfs
# Place a sample .docx in input_docs/ before running; the examples below expect input_docs/contract.docx
1. Detect Your Environment
Before writing a single conversion line, check which engine is available. Running docx2pdf on Linux raises NotImplementedError immediately; running LibreOffice on a machine where only Word is available wastes time. The snippet below makes the engine explicit at startup.
# pip install docx2pdf  (Windows/macOS only)
import platform
import shutil


def detect_engine() -> str:
    """Return 'docx2pdf' on Windows/macOS if the package is importable, else 'libreoffice'.

    This only checks that docx2pdf is installed, not that Word itself is present.
    """
    system = platform.system()
    if system in ("Windows", "Darwin"):
        try:
            import docx2pdf  # noqa: F401
            return "docx2pdf"
        except ImportError:
            pass
    soffice = shutil.which("soffice") or shutil.which("libreoffice")
    if soffice:
        return "libreoffice"
    raise RuntimeError(
        "No conversion engine found. "
        "Install docx2pdf (Windows/macOS) or LibreOffice (Linux/server)."
    )


print(detect_engine())
2. Convert a Single File with docx2pdf (Windows / macOS)
docx2pdf drives Microsoft Word itself: through COM automation on Windows and through a JXA (JavaScript for Automation) script run by osascript on macOS. Because Word does the rendering, the output PDF matches what Word would produce on export, including styles, fonts, and any visible track-changes markup.
# pip install docx2pdf
from pathlib import Path

from docx2pdf import convert

INPUT = Path("input_docs/contract.docx")
OUTPUT = Path("output_pdfs/contract.pdf")

try:
    OUTPUT.parent.mkdir(parents=True, exist_ok=True)
    convert(INPUT, OUTPUT)
    print(f"Converted: {OUTPUT}")
except FileNotFoundError as exc:
    print(f"Input not found: {exc}")
except Exception as exc:
    # On Linux this raises NotImplementedError; on Windows it may raise
    # com_error if Word is not installed.
    print(f"Conversion failed: {exc}")
convert() accepts a file path or a directory path. When you pass a directory it converts every .docx it finds into the same folder. For custom output directories use the two-argument form shown above.
3. Convert a Single File with LibreOffice Headless (Linux / Server)
LibreOffice's --headless flag runs the full rendering pipeline without spawning a window. The --outdir flag controls where the PDF lands. Use subprocess so Python captures errors and exit codes.
# No pip install needed; requires LibreOffice (e.g. sudo apt install libreoffice)
import subprocess
from pathlib import Path

INPUT = Path("input_docs/contract.docx")
OUTPUT_DIR = Path("output_pdfs")

try:
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        [
            "soffice",
            "--headless",
            "--convert-to", "pdf",
            "--outdir", str(OUTPUT_DIR),
            str(INPUT),
        ],
        capture_output=True,
        text=True,
        timeout=60,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    pdf_path = OUTPUT_DIR / (INPUT.stem + ".pdf")
    # soffice can exit 0 without writing a file (e.g. unreadable input),
    # so confirm the PDF actually exists before reporting success.
    if not pdf_path.exists():
        raise RuntimeError(result.stderr.strip() or "soffice produced no output file")
    print(f"Converted: {pdf_path}")
except FileNotFoundError:
    print("soffice not found. Install LibreOffice and ensure it is on PATH.")
except subprocess.TimeoutExpired:
    print("Conversion timed out. File may be corrupt or very large.")
except Exception as exc:
    print(f"Conversion failed: {exc}")
Note: LibreOffice creates a user profile directory on first run. On headless servers this can default to a locked or read-only location. If you see user installation could not be completed errors, see the variant fix in Fix docx2pdf Error on Linux.
4. docx2pdf vs LibreOffice vs Cloud — Engine Comparison
| Engine | Platform | Fidelity | Speed | Requires |
|---|---|---|---|---|
| docx2pdf | Windows, macOS | Highest — Word renders it | Medium (COM overhead) | Microsoft Word installed |
| LibreOffice headless | Linux, macOS, Windows | Good — minor font diffs possible | Fast, parallelisable | LibreOffice ≥ 7 on PATH |
| Cloud APIs (Adobe, Zamzar, GroupDocs) | Any | Varies by vendor | Network-bound | API key + internet |
| WeasyPrint | Any | HTML/CSS only | Fast | HTML intermediate step |
If your documents use complex Word-specific features (mail merge fields, ActiveX controls, or layouts that depend on VBA macros, which live in macro-enabled .docm files rather than plain .docx), only docx2pdf, which uses Word itself, will reproduce them faithfully. LibreOffice handles standard paragraph styles, tables, headers/footers, and embedded images reliably; fidelity degrades with advanced typography or proprietary OOXML extensions.
5. Batch Convert a Folder
The function below converts a folder recursively and mirrors the relative directory structure in the output. The two engines are invoked differently, so the snippet wraps both in a unified interface.
# pip install docx2pdf  (Windows/macOS path)
# Linux path requires: sudo apt install libreoffice
import platform
import subprocess
from pathlib import Path


def batch_convert(input_dir: Path, output_dir: Path) -> list[Path]:
    """Convert all .docx files in input_dir to PDF and write them to output_dir."""
    input_dir = input_dir.resolve()
    output_dir = output_dir.resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    docx_files = list(input_dir.rglob("*.docx"))
    if not docx_files:
        print("No .docx files found.")
        return []

    system = platform.system()
    converted: list[Path] = []
    for docx_path in docx_files:
        # Mirror subdirectory structure
        rel = docx_path.relative_to(input_dir)
        out_subdir = output_dir / rel.parent
        out_subdir.mkdir(parents=True, exist_ok=True)
        pdf_path = out_subdir / (docx_path.stem + ".pdf")
        try:
            if system in ("Windows", "Darwin"):
                from docx2pdf import convert
                convert(docx_path, pdf_path)
            else:
                result = subprocess.run(
                    ["soffice", "--headless", "--convert-to", "pdf",
                     "--outdir", str(out_subdir), str(docx_path)],
                    capture_output=True, text=True, timeout=120,
                )
                if result.returncode != 0:
                    raise RuntimeError(result.stderr.strip())
            converted.append(pdf_path)
            print(f"  OK: {rel}")
        except Exception as exc:
            print(f"  FAIL: {rel} - {exc}")
    return converted


if __name__ == "__main__":
    results = batch_convert(
        input_dir=Path("input_docs"),
        output_dir=Path("output_pdfs"),
    )
    print(f"\nConverted {len(results)} file(s).")
6. Font and Layout Fidelity Caveats
Embedded vs. System Fonts
docx2pdf asks Word to render the document, so every font Word can access — including fonts embedded in the .docx — is available. LibreOffice renders with its own font engine; fonts embedded in the OOXML container are extracted at conversion time but system fonts referenced by name must be installed on the server.
Action: On Linux servers, install the Microsoft core fonts package to cover the most common Word typefaces:
# Ubuntu/Debian
sudo apt install ttf-mscorefonts-installer
sudo fc-cache -f -v
If documents use custom brand fonts, copy the .ttf/.otf files to /usr/share/fonts/custom/ and run fc-cache again before converting.
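To find out which fonts a server is missing, one option is to compare the fonts a document declares with what fontconfig reports. The following is a minimal sketch under stated assumptions: it expects the .docx to carry the usual word/fontTable.xml part, it relies on fc-list being on PATH, and the helper names (declared_fonts, parse_fc_list, installed_families, missing_fonts) are hypothetical, not part of any library.

```python
import subprocess
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML main namespace used by word/fontTable.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def declared_fonts(docx_path: Path) -> set[str]:
    """Return font names listed in the document's font table (empty if absent)."""
    with zipfile.ZipFile(docx_path) as zf:
        if "word/fontTable.xml" not in zf.namelist():
            return set()
        root = ET.fromstring(zf.read("word/fontTable.xml"))
    names = set()
    for font in root.iter(f"{{{W_NS}}}font"):
        name = font.get(f"{{{W_NS}}}name")
        if name:
            names.add(name)
    return names


def parse_fc_list(output: str) -> set[str]:
    """Parse `fc-list : family` output; a line may hold comma-separated aliases."""
    families = set()
    for line in output.splitlines():
        for family in line.split(","):
            if family.strip():
                families.add(family.strip().lower())
    return families


def installed_families() -> set[str]:
    """Ask fontconfig for installed family names (requires fc-list on PATH)."""
    result = subprocess.run(
        ["fc-list", ":", "family"], capture_output=True, text=True, check=True
    )
    return parse_fc_list(result.stdout)


def missing_fonts(docx_path: Path, installed: set[str]) -> set[str]:
    """Fonts the document declares that are not in the installed set."""
    return {f for f in declared_fonts(docx_path) if f.lower() not in installed}
```

Calling missing_fonts(Path("input_docs/contract.docx"), installed_families()) before a batch run gives a list to act on. The font table can name fonts the document never actually uses, so treat the result as a hint rather than a hard failure.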
Complex Layouts
LibreOffice may misplace:
- Text boxes anchored to a character position (rather than the page)
- Word Art and SmartArt graphics (rendered as bitmaps at low resolution)
- Tables that span a page break with Keep together enabled
- Headers/footers using linked-story chains
For documents with these features, prefer docx2pdf on a Windows CI runner. Alternatively, convert .docx to HTML first with a converter such as mammoth (python-docx reads and writes .docx but does not export HTML) and then use WeasyPrint for the PDF step, though HTML conversion itself loses complex formatting.
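A rough way to route documents to the right engine is to look inside the .docx package, which is a ZIP archive, for parts that hint at the features listed above. This is a heuristic sketch only: the part-name prefixes are assumptions based on how Word typically lays out a package, the function names are hypothetical, and anchored text boxes cannot be detected this way because they live inside the main document XML.

```python
import zipfile
from pathlib import Path

# Assumed part-name prefixes, based on how Word typically names package parts.
# They are heuristics, not guarantees from the OOXML specification.
RISKY_PART_PREFIXES = {
    "word/diagrams/": "SmartArt",
    "word/activeX/": "ActiveX controls",
}


def risky_features(docx_path: Path) -> set[str]:
    """Return labels for features that may render poorly in LibreOffice."""
    found = set()
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            for prefix, label in RISKY_PART_PREFIXES.items():
                if name.startswith(prefix):
                    found.add(label)
    return found


def preferred_engine(docx_path: Path) -> str:
    """Suggest 'docx2pdf' when risky parts are present, else 'libreoffice'."""
    return "docx2pdf" if risky_features(docx_path) else "libreoffice"
```

In a mixed pipeline, files flagged for docx2pdf could be queued for a Windows runner while the rest go through LibreOffice on Linux.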
7. Edge Cases and Variants
Password-Protected DOCX Files
Both engines will fail silently or raise on encrypted .docx files. Strip the password first:
# pip install msoffcrypto-tool
import io
from pathlib import Path

import msoffcrypto


def decrypt_docx(encrypted_path: Path, password: str) -> bytes:
    """Return decrypted .docx bytes, ready for conversion."""
    try:
        with open(encrypted_path, "rb") as f:
            office_file = msoffcrypto.OfficeFile(f)
            office_file.load_key(password=password)
            output = io.BytesIO()
            office_file.decrypt(output)
            return output.getvalue()
    except msoffcrypto.exceptions.InvalidKeyError as exc:
        # Chain the original error so the traceback keeps the root cause.
        raise ValueError("Incorrect password for the .docx file.") from exc
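Two small helpers make decryption easier to fit into a batch run; both are illustrative sketches with hypothetical names. The first guesses whether a file needs decrypting at all: a password-protected .docx is stored as an OLE compound file, while a normal one is a ZIP archive, so the leading bytes differ. Legacy .doc files are also OLE containers, so this is a heuristic and only meaningful for files that carry a .docx extension. The second writes decrypted bytes to a temporary .docx that either engine can read.

```python
import tempfile
from pathlib import Path

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file signature
ZIP_MAGIC = b"PK\x03\x04"                      # regular (unencrypted) OOXML package


def container_kind(path: Path) -> str:
    """Return 'ole' (likely encrypted), 'zip' (plain .docx), or 'unknown'."""
    with open(path, "rb") as f:
        header = f.read(8)
    if header.startswith(OLE_MAGIC):
        return "ole"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def write_temp_docx(data: bytes, directory: Path) -> Path:
    """Write decrypted bytes to a temporary .docx file and return its path."""
    with tempfile.NamedTemporaryFile(
        suffix=".docx", dir=directory, delete=False
    ) as tmp:
        tmp.write(data)
    return Path(tmp.name)
```

The caller is responsible for deleting the temporary file after conversion. With LibreOffice the PDF takes its name from the temporary file's stem, so rename the result to match the original document.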
Skipping Already-Converted Files
In incremental runs you want to skip files whose PDF is newer than the source .docx:
from pathlib import Path


def needs_conversion(docx_path: Path, pdf_path: Path) -> bool:
    """Return True if the PDF is missing or older than the .docx."""
    if not pdf_path.exists():
        return True
    return docx_path.stat().st_mtime > pdf_path.stat().st_mtime
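Applied to a whole tree, the same timestamp comparison yields a work list for an incremental run. The sketch below is one possible arrangement, assuming the output folder mirrors the input folder's layout; pending_conversions is a hypothetical helper name.

```python
from pathlib import Path


def pending_conversions(input_dir: Path, output_dir: Path) -> list[tuple[Path, Path]]:
    """Return (docx, pdf) pairs whose PDF is missing or older than the source."""
    pairs = []
    for docx_path in sorted(input_dir.rglob("*.docx")):
        # Mirror the relative path and swap the extension for .pdf
        pdf_path = (output_dir / docx_path.relative_to(input_dir)).with_suffix(".pdf")
        if not pdf_path.exists() or docx_path.stat().st_mtime > pdf_path.stat().st_mtime:
            pairs.append((docx_path, pdf_path))
    return pairs
```

Modification times are only as reliable as the filesystem: copying or checking out files can reset them, so a forced full rebuild option remains useful.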
Running LibreOffice in Docker
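For reproducible server deployments, pin LibreOffice in a container image rather than relying on whatever the host has installed. A minimal Dockerfile:
FROM python:3.12-slim
# fonts-liberation is metric-compatible with Arial, Times New Roman, and Courier New.
# ttf-mscorefonts-installer lives in Debian's contrib component, which this base
# image does not enable, and it requires EULA acceptance, so it is not installed here.
RUN apt-get update && apt-get install -y --no-install-recommends \
        libreoffice \
        fonts-liberation \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY . .
# docx2pdf is kept for Windows/macOS parity; the Linux path uses soffice
RUN pip install --no-cache-dir docx2pdf
CMD ["python", "batch_convert.py"]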
8. Validation
After conversion, open each PDF with pypdf to confirm that it is not corrupt and has at least the expected number of pages:
# pip install pypdf
from pathlib import Path

from pypdf import PdfReader


def validate_pdf(pdf_path: Path, expected_min_pages: int = 1) -> bool:
    """Return True if the PDF is readable and has at least expected_min_pages."""
    try:
        reader = PdfReader(pdf_path)
        actual = len(reader.pages)
        if actual < expected_min_pages:
            print(f"[WARN] {pdf_path.name}: only {actual} page(s), expected ≥ {expected_min_pages}")
            return False
        return True
    except Exception as exc:
        print(f"[FAIL] {pdf_path.name}: {exc}")
        return False


# Quick batch check
output_dir = Path("output_pdfs")
for pdf in output_dir.rglob("*.pdf"):
    validate_pdf(pdf)
9. Performance and Scale
LibreOffice parallelism: Each soffice process locks its user-profile directory. Running multiple conversions in parallel fails unless each process gets its own profile:
import subprocess
import tempfile
from pathlib import Path


def soffice_convert_isolated(docx_path: Path, output_dir: Path) -> None:
    """Run soffice with a per-process user profile to allow parallelism."""
    with tempfile.TemporaryDirectory() as tmp:
        profile_dir = Path(tmp) / "lo_profile"
        profile_dir.mkdir()
        subprocess.run(
            [
                "soffice",
                # as_uri() builds a well-formed file:// URL on every platform
                f"-env:UserInstallation={profile_dir.as_uri()}",
                "--headless",
                "--convert-to", "pdf",
                "--outdir", str(output_dir),
                str(docx_path),
            ],
            capture_output=True, text=True, timeout=120, check=True,
        )
docx2pdf parallelism: On Windows, docx2pdf supports passing a directory path. For true parallelism, spawn multiple Word processes via separate COM instances — but in practice this is rarely stable. Prefer sequential conversion with docx2pdf and parallel conversion with LibreOffice.
Memory: LibreOffice loads the full document model into memory. Documents with many embedded images can exceed 2 GB RAM per process. Monitor with psutil and cap concurrent workers accordingly.
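Capping concurrent workers can be done with a bounded thread pool. Threads are enough here because each worker spends its time waiting on an external soffice process. The sketch below is an assumption-laden outline rather than a tuned implementation: convert_parallel is a hypothetical helper, the converter is passed in as a callable (for example soffice_convert_isolated from the snippet above), and the default of four workers is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def convert_parallel(
    files: list[Path],
    output_dir: Path,
    convert_one: Callable[[Path, Path], None],
    max_workers: int = 4,
) -> tuple[list[Path], dict[Path, str]]:
    """Run convert_one(docx, output_dir) across a bounded thread pool.

    Returns (succeeded, failed), where failed maps each path to its error text.
    """
    succeeded: list[Path] = []
    failed: dict[Path, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(convert_one, f, output_dir): f for f in files}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()
                succeeded.append(path)
            except Exception as exc:  # one bad file should not stop the batch
                failed[path] = str(exc)
    return succeeded, failed
```

Because every file is written to the same output_dir, two inputs with the same stem in different subfolders would overwrite each other; pass per-file output directories if the tree is nested. Lower max_workers if memory use per process is high.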
10. Troubleshooting
| Error | Root cause | Fix |
|---|---|---|
| NotImplementedError: docx2pdf is not implemented for linux | docx2pdf requires Word, unavailable on Linux | Switch to LibreOffice headless; see Fix docx2pdf Error on Linux |
| com_error: -2147221005 | Word COM server not registered / Word not installed | Install Microsoft Word or use LibreOffice |
| soffice: command not found | LibreOffice not on PATH | sudo apt install libreoffice or add LibreOffice bin dir to PATH |
| user installation could not be completed | LibreOffice profile dir locked or read-only | Use -env:UserInstallation with a per-run temp dir (see section 9) |
| Garbled or missing text in PDF | Font not installed on Linux server | Install ttf-mscorefonts-installer and custom fonts; run fc-cache |
| PDF page count is 0 | Empty or corrupt .docx | Validate with python-docx before converting: Document(path).paragraphs |
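In a batch job it can help to attach the matching fix to each failure in the log. The sketch below maps substrings from the table above to its suggested fixes; the matching is a simple case-insensitive substring test, suggest_fix and KNOWN_ERRORS are hypothetical names, and real error text may vary between versions.

```python
from typing import Optional

# Substrings taken from the troubleshooting table, mapped to its suggested fixes.
KNOWN_ERRORS = [
    ("not implemented for linux", "Switch to LibreOffice headless."),
    ("-2147221005", "Install Microsoft Word or use LibreOffice."),
    ("command not found", "Install LibreOffice or add its bin dir to PATH."),
    (
        "user installation could not be completed",
        "Pass -env:UserInstallation with a per-run temp dir.",
    ),
]


def suggest_fix(error_text: str) -> Optional[str]:
    """Return the table's suggested fix for a known error message, else None."""
    lowered = error_text.lower()
    for needle, fix in KNOWN_ERRORS:
        if needle in lowered:
            return fix
    return None
```

For example, a batch loop could print suggest_fix(str(exc)) next to each FAIL line when it returns a value.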
11. Complete Script
#!/usr/bin/env python3
"""
batch_docx_to_pdf.py: Convert a folder of .docx files to PDF.

Usage:
    python batch_docx_to_pdf.py input_docs/ output_pdfs/

Engines:
    Windows / macOS: docx2pdf (requires Microsoft Word)
    Linux / server:  LibreOffice headless (requires soffice on PATH)

pip install docx2pdf pypdf   # docx2pdf only needed on Windows/macOS
"""
import argparse
import platform
import subprocess
import tempfile
from pathlib import Path

try:
    from pypdf import PdfReader
    PYPDF_AVAILABLE = True
except ImportError:
    PYPDF_AVAILABLE = False

SYSTEM = platform.system()


def soffice_convert(docx_path: Path, output_dir: Path) -> None:
    """Convert via LibreOffice headless with an isolated user profile."""
    with tempfile.TemporaryDirectory() as tmp:
        profile = Path(tmp) / "lo_profile"
        profile.mkdir()
        result = subprocess.run(
            [
                "soffice",
                # as_uri() builds a well-formed file:// URL on every platform
                f"-env:UserInstallation={profile.as_uri()}",
                "--headless",
                "--convert-to", "pdf",
                "--outdir", str(output_dir),
                str(docx_path),
            ],
            capture_output=True, text=True, timeout=120,
        )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip() or "soffice exited non-zero")


def docx2pdf_convert(docx_path: Path, output_path: Path) -> None:
    """Convert via docx2pdf (Windows/macOS only)."""
    from docx2pdf import convert  # noqa: PLC0415
    convert(docx_path, output_path)


def validate(pdf_path: Path) -> bool:
    """Return True if the PDF exists and, when pypdf is available, has pages."""
    if not PYPDF_AVAILABLE:
        return pdf_path.exists() and pdf_path.stat().st_size > 0
    try:
        return len(PdfReader(pdf_path).pages) > 0
    except Exception:
        return False


def batch_convert(input_dir: Path, output_dir: Path, skip_existing: bool = True) -> None:
    """Convert every .docx under input_dir, mirroring the tree in output_dir."""
    input_dir = input_dir.resolve()
    output_dir = output_dir.resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    files = sorted(input_dir.rglob("*.docx"))
    if not files:
        print("No .docx files found.")
        return

    ok = fail = skipped = 0
    for docx_path in files:
        rel = docx_path.relative_to(input_dir)
        out_subdir = output_dir / rel.parent
        out_subdir.mkdir(parents=True, exist_ok=True)
        pdf_path = out_subdir / (docx_path.stem + ".pdf")

        # Skip files whose PDF is at least as new as the source
        if skip_existing and pdf_path.exists() and pdf_path.stat().st_mtime >= docx_path.stat().st_mtime:
            skipped += 1
            continue

        try:
            if SYSTEM in ("Windows", "Darwin"):
                docx2pdf_convert(docx_path, pdf_path)
            else:
                soffice_convert(docx_path, out_subdir)
            if validate(pdf_path):
                print(f"  OK:   {rel}")
                ok += 1
            else:
                print(f"  WARN: {rel} - PDF validation failed")
                fail += 1
        except Exception as exc:
            print(f"  FAIL: {rel} - {exc}")
            fail += 1

    print(f"\nDone: {ok} converted, {fail} failed, {skipped} skipped.")


def main() -> None:
    parser = argparse.ArgumentParser(description="Batch convert .docx to PDF")
    parser.add_argument("input_dir", type=Path)
    parser.add_argument("output_dir", type=Path)
    parser.add_argument("--no-skip", action="store_true", help="Re-convert even if PDF exists")
    args = parser.parse_args()
    batch_convert(args.input_dir, args.output_dir, skip_existing=not args.no_skip)


if __name__ == "__main__":
    main()
Related
- Fix docx2pdf Error on Linux: detailed fix for the NotImplementedError and COM errors on Linux
- Automating Word Document Creation: generate the .docx files you will be converting
- Generating PDF Reports Dynamically: build PDFs directly from data without the DOCX intermediate step
- Merging and Splitting PDF Documents: combine the PDFs produced by batch conversion