Word Document Templating & Batch Processing
Manual Word workflows fail the moment volume arrives. Copy a template, paste in a client name, fix the date, save with a unique filename, repeat 400 times: every repetition is a chance for a stale figure, a broken style, or a typo in a contract clause. The effort grows with the number of fields times the number of records, and human attention degrades long before the batch finishes. Python replaces that loop with a deterministic pipeline. You author one template, bind it to a row of structured data, and render a .docx of identical quality per record: auditable, repeatable, and fast. This guide covers the full path: designing templates, choosing the right library, ingesting CSV/Excel/JSON data, binding context, looping over records, exporting to PDF, and hardening the job for unattended scheduled runs.
How the pipeline fits together
Every batch job is the same shape: a static template plus a table of variable data, fed through a render step, emitted as one file per row, with an optional PDF conversion at the end. Hold this diagram in mind for the rest of the page — each later section maps to one box.
The four phases — template design, data binding, render loop, output — stay constant whether you generate one offer letter or fifty thousand. What changes at scale is everything around the loop: memory discipline, logging, idempotent naming, and recovery from partial failure. A single-document script and a nightly batch share the same render call but almost nothing else; the gap between them is where most teams lose time. If you are still finding your footing with the single-document case, Automating Word Document Creation covers library selection and the structural API before you scale it into a batch.
The mental model worth internalizing is separation of concerns. The template owns layout and styling. The data source owns content. The script owns the mapping between them and the orchestration around the loop. When those three stay decoupled, a marketing change to the letterhead never touches your code, a new column in the data never breaks rendering, and a bug in the loop never corrupts the template. The sections below walk each box of the diagram in that order, with one substantial, runnable snippet per phase that you can lift directly into a project.
Library ecosystem
There is no single "Word library" in Python. The ecosystem splits cleanly by job: structural editing versus templating versus format conversion. Pick by the question you are answering, not by popularity.
| Library | Best for | Install | When NOT to use |
|---|---|---|---|
| python-docx | Building or editing .docx structurally — paragraphs, tables, runs, styles, core metadata | pip install python-docx | Filling a designer-authored template; its API rebuilds documents node by node and loses layout nuance |
| docxtpl | Rendering a Word-authored template with {{ vars }} and {% loops %} while preserving every style | pip install docxtpl | Generating documents from scratch with no template; you would be fighting it |
| Jinja2 | The expression and control-flow engine inside docxtpl (filters, conditionals, loops) | pip install jinja2 (pulled in by docxtpl) | Direct use against .docx — Jinja2 only understands text, not the document XML wrapper |
| docx2pdf | Quick .docx → .pdf on a machine with Microsoft Word installed (Windows/macOS) | pip install docx2pdf | Linux servers or any host without Word; it drives Word via COM/AppleScript and will fail headless |
| LibreOffice (soffice) | Cross-platform, server-safe headless PDF conversion | apt install libreoffice (system package) | Pixel-perfect fidelity to Word's renderer; minor layout drift is possible |
| mailmerge | Filling Word's native MERGEFIELD merge fields without rewriting the template | pip install docx-mailmerge | Conditional logic or loops — it has no expression engine; reach for docxtpl instead |
For most batch work the answer is docxtpl for rendering plus LibreOffice for PDF. python-docx joins in when you need post-render structural edits or metadata. Reserve docx2pdf for Windows desktops and mailmerge for legacy templates that already carry MERGEFIELD markers.
Environment setup
Isolate the project so a global package upgrade never silently changes your rendered output. Pin versions in requirements.txt so a colleague, or a scheduled job six months from now, reproduces the same rendered output from the same template and data.
# Create and activate an isolated environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
# Headless PDF conversion needs the system package (not pip)
# Debian/Ubuntu:
sudo apt-get install -y libreoffice
# requirements.txt — pin everything for reproducible batches
python-docx==1.1.2
docxtpl==0.18.0
jinja2==3.1.4
pandas==2.2.2
openpyxl==3.1.5 # Excel ingestion engine for pandas
docx2pdf==0.1.8 # only on Windows/macOS with Word installed
Verify the toolchain before you write a loop. A short import check now saves a failed overnight batch later.
# pip install python-docx docxtpl pandas
from pathlib import Path
import docxtpl
import docx
print("docxtpl", docxtpl.__version__)
print("python-docx", docx.__version__)
print("cwd", Path.cwd())
Designing the template
Reliable rendering depends entirely on how the .docx is authored. Word stores text as a sequence of runs (styled spans), and it can silently split a placeholder like {{ client_name }} across several runs if autocorrect touches it or you edit it mid-word. The placeholder still looks intact on screen, but in the underlying XML the pieces {{ clie, nt_na, and me }} sit in separate runs, and docxtpl can then fail to substitute it.
Three rules keep templates render-safe:
- Type placeholders in one pass. Open the template, type the full {{ variable }} without backspacing or letting autocorrect touch it. If a placeholder fails to render, select it, delete it, and retype it cleanly to collapse the runs.
- Match names to your data exactly. A {{ invoice_total }} placeholder needs an invoice_total key in the context. Align placeholder names to your CSV/Excel column headers or JSON keys up front; see the ingestion section for normalizing messy headers.
- Keep static content static. Headers, footers, logos, and boilerplate clauses stay as plain Word content. Only the cells and paragraphs that vary become placeholders. For table rows that grow with the data, use docxtpl's row loop: {%tr for item in items %}…{%tr endfor %}, which clones the <w:tr> XML node per item and preserves borders and the header row.
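One way to catch split placeholders before a batch runs is to compare what a reader sees with what the raw XML contains. The sketch below is a stdlib-only check and not part of docxtpl: the function name audit_placeholders and the regular expressions are assumptions for illustration, it handles only simple {{ name }} placeholders, and it inspects only word/document.xml (headers and footers live in separate parts).

```python
import re
import zipfile
from pathlib import Path

# Simple {{ name }} placeholders only; filters and expressions are out of scope.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_.]*)\s*\}\}")
XML_TAG = re.compile(r"<[^>]+>")
PARAGRAPH = re.compile(r"<w:p[ >].*?</w:p>", re.DOTALL)


def audit_placeholders(docx_path: Path) -> dict[str, list[str]]:
    """Report placeholders that are intact in the XML versus split across runs."""
    with zipfile.ZipFile(docx_path) as zf:
        xml = zf.read("word/document.xml").decode("utf-8")

    intact: set[str] = set()
    split: set[str] = set()
    for para in PARAGRAPH.findall(xml):
        in_raw_xml = set(PLACEHOLDER.findall(para))
        # Stripping tags approximates the text a reader sees on screen.
        as_displayed = set(PLACEHOLDER.findall(XML_TAG.sub("", para)))
        intact |= in_raw_xml
        # Visible on screen but broken up by run markup in the XML.
        split |= as_displayed - in_raw_xml
    return {"intact": sorted(intact), "split": sorted(split)}
```

Any name reported under split is a candidate for the delete-and-retype fix, and the names under intact can be compared against the data headers before the first render.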
Ingestion: loading templates and data sources
The data side of the pipeline is just structured rows. Whether the source is CSV, Excel, or JSON, normalize it to a list of dictionaries — one dict per document — so the render loop stays format-agnostic. pandas handles all three with a uniform interface; the data-cleaning patterns in Cleaning Messy CSV Data with pandas apply directly here, since a malformed source row becomes a malformed document.
# pip install pandas openpyxl
from pathlib import Path
import json
import pandas as pd
def load_records(source: Path) -> list[dict]:
    """Load CSV, Excel, or JSON into a uniform list of context dicts."""
    suffix = source.suffix.lower()
    try:
        if suffix == ".csv":
            df = pd.read_csv(source, dtype=str, keep_default_na=False)
        elif suffix in {".xlsx", ".xlsm"}:
            # openpyxl reads .xlsx/.xlsm only; legacy .xls needs a different engine
            df = pd.read_excel(
                source, dtype=str, keep_default_na=False, engine="openpyxl"
            )
        elif suffix == ".json":
            # Expects a JSON array of objects whose keys already match the placeholders
            return json.loads(source.read_text(encoding="utf-8"))
        else:
            raise ValueError(f"Unsupported source type: {suffix}")
    except FileNotFoundError:
        raise SystemExit(f"Data source not found: {source}")

    # Normalize headers -> snake_case keys matching the template placeholders
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
    return df.to_dict(orient="records")


if __name__ == "__main__":
    records = load_records(Path("data") / "clients.csv")
    first_keys = list(records[0]) if records else []
    print(f"Loaded {len(records)} records; first keys: {first_keys}")
Reading with dtype=str and keep_default_na=False is deliberate: it stops pandas from turning an empty cell into the float NaN, which would render as the literal text nan in your document. Cast specific numeric or date fields explicitly in the transformation step instead.
The header-normalization line is the seam between data and template. Source spreadsheets arrive with headers like Invoice Total, Client Name, or trailing whitespace from a careless export, while your placeholders read invoice_total and client_name. Lowercasing, stripping, and replacing spaces with underscores collapses that variability into one predictable key shape, so the same template renders against a CSV from accounting and an Excel export from a different team without per-source special-casing. If a source uses wildly different header text, add an explicit rename map rather than relying on the placeholder names to drift toward the data — the template is the contract, and the ingestion layer adapts to it. For Excel sources with multiple sheets or a header row that is not the first row, pass sheet_name= and header= to read_excel; the broader reading patterns in Reading Excel Files with Python cover the messier real-world layouts.
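An explicit rename map can be sketched in a few lines of plain Python. The example below works on the list of dicts that ingestion produces; the names RENAME_MAP and align_to_template, and the source headers "Amount Due" and "Customer", are hypothetical and stand in for whatever a given team actually exports.

```python
def normalize_key(header: str) -> str:
    """Same rule as the ingestion step: strip, lowercase, spaces -> underscores."""
    return header.strip().lower().replace(" ", "_")


# Hypothetical mapping from one source's normalized headers to placeholder names.
RENAME_MAP = {"amount_due": "invoice_total", "customer": "client_name"}


def align_to_template(records: list[dict], rename_map: dict[str, str]) -> list[dict]:
    """Normalize every key, then rename the ones the template calls something else."""
    aligned = []
    for record in records:
        row = {}
        for header, value in record.items():
            key = normalize_key(header)
            row[rename_map.get(key, key)] = value
        aligned.append(row)
    return aligned
```

Keeping the map as data rather than scattering renames through the loop means a new source only adds entries to one dictionary, and the template stays untouched.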
Transformation: binding context and styles
Raw cells rarely render cleanly. A currency column arrives as 1234.5 and needs to read $1,234.50; a date arrives as 2026-06-18 and should read June 18, 2026. Do this shaping in Python before binding, not inside the template, so formatting logic lives in version control rather than buried in a .docx.
# pip install docxtpl
from datetime import datetime
from pathlib import Path
from docxtpl import DocxTemplate
def build_context(row: dict) -> dict:
    """Coerce raw strings into display-ready values for the template."""
    ctx = dict(row)  # copy so the source record is untouched
    if row.get("invoice_total"):
        ctx["invoice_total"] = f"${float(row['invoice_total']):,.2f}"
    if row.get("issued_on"):
        issued = datetime.strptime(row["issued_on"], "%Y-%m-%d")
        # Build the day by hand: %d would zero-pad ("June 05, 2026")
        ctx["issued_on"] = f"{issued:%B} {issued.day}, {issued.year}"
    # A missing key renders as empty text by default; set optional fields
    # explicitly so the template never depends on that behavior
    ctx.setdefault("notes", "")
    return ctx


def render_one(template: Path, row: dict, out_path: Path) -> None:
    tpl = DocxTemplate(template)     # fresh instance per document
    tpl.render(build_context(row))   # bind and substitute
    out_path.parent.mkdir(parents=True, exist_ok=True)
    tpl.save(out_path)
Re-instantiating DocxTemplate inside the loop is not optional. A DocxTemplate object mutates in place on render(), so reusing one instance across rows leaks the previous document's content into the next — a classic source of every output file containing the first record's data. Styling is inherited from the template itself: define Heading 1, Normal, and any custom table styles in Word, reference them by name, and avoid inline formatting inside loops. For finer control over fonts and runs after rendering, python-docx can reopen the saved file and adjust styles programmatically.
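The row loop described in the template section expects a list under one context key, while flat sources usually carry one line item per row. A minimal sketch of the grouping step follows, assuming hypothetical column names invoice_id, client_name, description, and amount; the helper name group_line_items is likewise illustrative.

```python
def group_line_items(
    rows: list[dict],
    key: str = "invoice_id",
    item_fields: tuple[str, ...] = ("description", "amount"),
) -> list[dict]:
    """Collapse flat rows into one context per key with a nested `items` list."""
    contexts: dict[str, dict] = {}
    for row in rows:
        if row[key] not in contexts:
            # Document-level fields come from the first row seen for this key.
            header = {k: v for k, v in row.items() if k not in item_fields}
            header["items"] = []
            contexts[row[key]] = header
        contexts[row[key]]["items"].append({f: row[f] for f in item_fields})
    return list(contexts.values())
```

Each returned dict can then pass through the same formatting step as any other record, with the items list feeding the {%tr for item in items %} loop in the template.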
Consolidation: the batch loop
The batch loop ties ingestion, transformation, and rendering together. Two concerns dominate at scale: deterministic naming and resilience. Filenames must be unique and idempotent so a re-run overwrites cleanly rather than producing doc_1 (2).docx collisions, and a single bad row must not abort the whole run.
# pip install docxtpl pandas
import logging
import re
from pathlib import Path
from docxtpl import DocxTemplate
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
)


def safe_name(value: str) -> str:
    """Filesystem-safe slug for deterministic, dedup-friendly filenames."""
    slug = re.sub(r"[^A-Za-z0-9_-]+", "_", value.strip())
    return slug.strip("_") or "unnamed"


def process_batch(template: Path, records: list[dict], out_dir: Path) -> dict:
    out_dir.mkdir(parents=True, exist_ok=True)
    seen: set[str] = set()
    ok, failed = 0, 0
    for idx, row in enumerate(records):
        # Tie the filename to a stable business key, not the loop index
        key = safe_name(row.get("invoice_id") or f"row_{idx}")
        # Dedup: if the key repeats, disambiguate instead of overwriting
        name = key
        n = 1
        while name in seen:
            n += 1
            name = f"{key}_{n}"
        seen.add(name)
        out_path = out_dir / f"{name}.docx"
        try:
            tpl = DocxTemplate(template)
            tpl.render(build_context(row))  # from the transformation step
            tpl.save(out_path)
            logging.info("rendered %s", out_path.name)
            ok += 1
        except Exception as exc:  # isolate the failing row
            logging.error("row %s (%s) failed: %s", idx, key, exc)
            failed += 1
    logging.info("done: %s ok, %s failed", ok, failed)
    return {"ok": ok, "failed": failed}
Naming deserves more thought than it usually gets. Tying the filename to the loop index (doc_0.docx, doc_1.docx) is fragile: re-run the batch after the source gains a row and every file shifts by one, breaking any downstream reference. Tie it to a stable business key instead — an invoice ID, a client code, a contract number — so the same record always produces the same filename across runs. The safe_name helper strips out anything a filesystem or downstream system (SharePoint, a network share, a CI artifact store) would choke on, and the seen set guarantees that two records sharing a key disambiguate rather than silently overwrite each other. That combination makes the batch both idempotent and dedup-safe: a clean re-run replaces outputs in place, while genuine duplicates surface as _2, _3 suffixes you can investigate.
Resilience is the other half. Wrapping each render in its own try/except isolates a single malformed row — a bad date, a missing required field — so it logs and the batch continues, rather than one corrupt record aborting 9,999 good documents at 3 a.m. Collect the failed keys and re-run just those once the source is fixed; because the job is idempotent, re-processing the full batch would also work and produce identical output.
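Collecting the failed keys can be as small as returning them from the loop. The sketch below takes a stand-in render callable instead of calling docxtpl, so it runs anywhere; run_rows, select_rows, and the invoice_id key are illustrative assumptions rather than part of the batch function.

```python
from typing import Callable


def run_rows(
    rows: list[dict],
    render: Callable[[dict], None],
    key: str = "invoice_id",
) -> list[str]:
    """Render every row and return the keys of the rows that raised."""
    failed: list[str] = []
    for idx, row in enumerate(rows):
        row_key = row.get(key) or f"row_{idx}"
        try:
            render(row)
        except Exception:
            # Isolate the failure; the remaining rows still get processed.
            failed.append(row_key)
    return failed


def select_rows(rows: list[dict], keys: list[str], key: str = "invoice_id") -> list[dict]:
    """Pick out only the rows whose key appears in the failed list."""
    wanted = set(keys)
    return [row for row in rows if row.get(key) in wanted]
```

After the source is corrected, reloading it and passing select_rows(records, failed) back through run_rows re-renders only the documents that were missing.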
For very large inputs, stream rows instead of materializing the whole table. pandas read_csv(..., chunksize=1000) yields DataFrame chunks you can iterate without loading the full file into RAM, which keeps memory roughly flat regardless of batch size. tpl.save() writes and closes each output file itself, so no extra file handling is needed; just never hold more than one rendered document in memory at a time.
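The same one-record-at-a-time discipline can be sketched without pandas. The generator below swaps in the standard library's csv.DictReader as a stand-in for the chunksize approach; it is not the pandas API, and the name stream_records is an assumption. It happens to share two properties with the dtype=str, keep_default_na=False settings: every value is a string, and an empty cell is an empty string.

```python
import csv
from pathlib import Path
from typing import Iterator


def stream_records(source: Path) -> Iterator[dict]:
    """Yield one normalized record at a time instead of loading the whole file."""
    with source.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Apply the same header rule as ingestion; skip the None key that
            # DictReader uses for surplus cells in an over-long row.
            yield {
                k.strip().lower().replace(" ", "_"): v
                for k, v in row.items()
                if k is not None
            }
```

Because the function is a generator, a render loop that iterates over it holds only the current record, whatever the size of the file.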
Output and serialization: converting to PDF
Distribution almost always means PDF — it is immutable, renders identically everywhere, and cannot be accidentally edited. The conversion strategy depends on the host. On a Linux server or CI runner, drive LibreOffice headless; on a Windows or macOS desktop with Word installed, docx2pdf is simpler. The same docx-to-PDF tradeoffs are covered in depth in Converting DOCX to PDF with Python, and the resulting PDFs slot directly into the report flows in Generating PDF Reports Dynamically.
# pip install docx2pdf (Windows/macOS only)
import shutil
import subprocess
from pathlib import Path
def docx_to_pdf(docx_path: Path, out_dir: Path) -> Path:
    """Convert via LibreOffice headless on Linux/servers, docx2pdf elsewhere."""
    out_dir.mkdir(parents=True, exist_ok=True)
    soffice = shutil.which("soffice") or shutil.which("libreoffice")
    if soffice:  # cross-platform, server-safe path
        subprocess.run(
            [soffice, "--headless", "--convert-to", "pdf",
             "--outdir", str(out_dir), str(docx_path)],
            check=True, capture_output=True, timeout=120,
        )
    else:  # desktop fallback (requires Microsoft Word)
        from docx2pdf import convert
        convert(str(docx_path), str(out_dir / f"{docx_path.stem}.pdf"))
    pdf_path = out_dir / f"{docx_path.stem}.pdf"
    if not pdf_path.exists() or pdf_path.stat().st_size == 0:
        raise RuntimeError(f"PDF conversion produced no output for {docx_path.name}")
    return pdf_path
The post-conversion check matters: LibreOffice can exit 0 yet silently emit nothing if the input is locked or malformed, so assert the file exists and is non-empty. Validate the .docx itself the same way — scan the rendered output for stray {{ or }} markers, which signal an unbound placeholder that slipped through.
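The scan for stray markers needs nothing beyond the standard library, since a .docx is a zip archive of XML parts. This is a rough sketch under stated assumptions: the name unrendered_parts is made up for the example, it checks every XML part under word/ (body, headers, footers), and it will also flag a document whose legitimate text happens to contain {{ or {%.

```python
import re
import zipfile
from pathlib import Path

XML_TAG = re.compile(r"<[^>]+>")
LEFTOVER = re.compile(r"\{\{|\}\}|\{%|%\}")


def unrendered_parts(docx_path: Path) -> list[str]:
    """Return the XML parts of a rendered .docx that still contain template markers."""
    flagged = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            if not (name.startswith("word/") and name.endswith(".xml")):
                continue
            # Strip tags first so a marker split across runs is still detected.
            text = XML_TAG.sub("", zf.read(name).decode("utf-8"))
            if LEFTOVER.search(text):
                flagged.append(name)
    return sorted(flagged)
```

An empty list means no markers were found; anything else names the part to open and inspect before the file is converted or sent.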
Production hardening
A batch job that runs once on your laptop and one that runs unattended every night are different programs. The unattended version needs scheduling, durable logging, and recovery from transient failures.
- Scheduling. On Linux, a cron entry (0 6 * * 1 /path/.venv/bin/python /path/run_batch.py) runs the job weekly. In CI, a scheduled GitHub Actions workflow with a cron: trigger gives you logs, artifacts, and a clean environment for free. The end-to-end scheduling and logging patterns generalize across document types in Scheduling and Logging Automation Jobs.
- Logging to a file. Replace StreamHandler with a RotatingFileHandler so each run appends to a rotating log you can inspect after the fact. Log the source filename, record count, and per-row outcomes, not just "done".
- Retries for I/O. PDF conversion and network-mounted output drives fail transiently. Wrap the conversion call in a small retry with backoff so a momentary file lock does not kill an otherwise-good document.
# Standard library only; nothing to pip install
import logging
import time
from logging.handlers import RotatingFileHandler
from pathlib import Path


def configure_logging(log_file: Path) -> None:
    log_file.parent.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(log_file, maxBytes=1_000_000, backupCount=3)
    handler.setFormatter(logging.Formatter("%(asctime)s | %(levelname)s | %(message)s"))
    # force=True replaces handlers installed by an earlier basicConfig call,
    # which would otherwise turn this call into a silent no-op (Python 3.8+)
    logging.basicConfig(level=logging.INFO, handlers=[handler], force=True)


def with_retries(func, *args, attempts: int = 3, base_delay: float = 1.0):
    """Retry a flaky I/O operation with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except Exception as exc:
            if attempt == attempts:
                logging.error("gave up after %s attempts: %s", attempts, exc)
                raise
            wait = base_delay * 2 ** (attempt - 1)
            logging.warning("attempt %s failed (%s); retrying in %.1fs", attempt, exc, wait)
            time.sleep(wait)
Make the whole job idempotent: deterministic filenames mean a re-run after a crash overwrites the same outputs rather than duplicating them, so you can safely restart a failed batch from the top.
One more production concern is observability after the fact. A scheduled job that runs while no one watches needs to leave a trail you can reconstruct a problem from days later. Log the source file path and its modification time, the record count read, and a per-row success or failure line keyed by the same business identifier used in the filename — that way a complaint about "the wrong total on invoice INV-2207" maps straight to a log line and the source row behind it. Emit a single summary line at the end (done: 412 ok, 3 failed) and, in CI, fail the job's exit status when the failure count crosses a threshold so a broken upstream feed raises an alert instead of quietly shipping blanks. Treat the rendered documents themselves as the final validation gate: a quick pass that opens each output and asserts no {{ or }} survived catches unbound placeholders before they reach a client's inbox.
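Turning the summary into an exit status could look like the sketch below. The function name exit_code_for, the 1% default threshold, and the choice to treat an empty batch as a failure are all assumptions to adjust per job; the summary dict has the same ok/failed shape the batch loop returns.

```python
def exit_code_for(summary: dict, max_failure_rate: float = 0.01) -> int:
    """Map a batch summary to a process exit status (0 = pass, 1 = fail)."""
    total = summary["ok"] + summary["failed"]
    if total == 0:
        # Assumption: an empty batch usually signals a broken upstream feed.
        return 1
    return 1 if summary["failed"] / total > max_failure_rate else 0


# Typical use at the end of the entry-point script:
#     import sys
#     sys.exit(exit_code_for(process_batch(template, records, out_dir)))
```

A nonzero status is what lets cron wrappers and CI runners mark the run as failed without anyone parsing the log.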
Common mistakes
| Issue | Root cause | Fix |
|---|---|---|
| Every output file contains the first row's data | A single DocxTemplate instance reused across the loop mutates in place | Re-instantiate DocxTemplate(template) inside the loop for each record |
| Placeholder renders as literal {{ name }} | The placeholder text is split across multiple Word runs by autocorrect or mid-word edits | Delete and retype the placeholder in one clean pass to collapse the runs |
| Empty cells render as nan | pandas converts blank cells to the float NaN on load | Read with dtype=str, keep_default_na=False, then cast specific columns explicitly |
| PDF conversion fails on the server | docx2pdf drives Microsoft Word, which is absent on Linux | Use LibreOffice --headless --convert-to pdf on servers; reserve docx2pdf for desktops |
| MemoryError on large batches | The entire data file is loaded and all documents held in memory | Stream with read_csv(chunksize=...) and write each document before rendering the next |
Frequently Asked Questions
Which library should I use — python-docx or docxtpl?
Use docxtpl when you fill a Word-authored template with {{ placeholders }}; it preserves every style and layout detail. Use python-docx when you build or restructure a document programmatically, or to edit metadata and styles after rendering. Most batch jobs use docxtpl for the render and python-docx only for post-processing.
Can I generate thousands of documents without crashing?
Yes. Stream the data with pandas chunksize instead of loading it all at once, re-instantiate the template per row, and write each file before moving to the next so only one document is in memory at a time. The bottleneck at scale is usually PDF conversion, not rendering — parallelize that step across a process pool if it dominates runtime.
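A possible shape for that parallel step is sketched here. Two assumptions are worth stating: it uses a thread pool rather than a process pool, because each conversion already runs in its own soffice subprocess, and it gives every conversion a throwaway LibreOffice profile through -env:UserInstallation, since concurrent instances sharing one profile are commonly reported to block or fail. The helper names are illustrative.

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def soffice_command(soffice: str, docx_path: Path, out_dir: Path, profile_dir: Path) -> list[str]:
    """Build a headless conversion command that uses its own profile directory."""
    return [
        soffice,
        f"-env:UserInstallation={profile_dir.resolve().as_uri()}",
        "--headless", "--convert-to", "pdf",
        "--outdir", str(out_dir), str(docx_path),
    ]


def convert_one(soffice: str, docx_path: Path, out_dir: Path) -> Path:
    # A temporary profile per conversion keeps parallel instances isolated.
    with tempfile.TemporaryDirectory(prefix="lo_profile_") as profile:
        subprocess.run(
            soffice_command(soffice, docx_path, out_dir, Path(profile)),
            check=True, capture_output=True, timeout=120,
        )
    return out_dir / f"{docx_path.stem}.pdf"


def convert_many(soffice: str, docx_paths: list[Path], out_dir: Path, workers: int = 4) -> list[Path]:
    """Convert several files concurrently; each worker waits on one subprocess."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: convert_one(soffice, p, out_dir), docx_paths))
```

The worker count is a tuning knob: each LibreOffice instance is memory-hungry, so a small number per host is a sensible starting point.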
How do I build tables whose row count varies per document?
Use docxtpl's row loop, {%tr for item in items %} … {%tr endfor %}, placed inside the template table. It clones the underlying <w:tr> table-row XML for each item in the bound list, preserving the header row and cell borders, so a record with three line items and one with thirty both render correctly.
Does this work on Linux and macOS, or only Windows?
Template rendering and data binding are fully cross-platform — they only touch the .docx XML. The platform-dependent step is PDF conversion: docx2pdf needs Microsoft Word (Windows/macOS only), while LibreOffice headless runs anywhere, including Linux servers and CI runners.
Where should data formatting live — in Python or in the template?
In Python. Coerce currency, dates, and number formats into display-ready strings before binding the context. Keeping formatting in version-controlled code rather than inside the binary .docx makes the logic reviewable, testable, and consistent across every document in the batch.
Related
- Automating Word Document Creation — the python-docx structural API and library selection before you scale to batches
- Dynamic Mail Merge with Python — conditional blocks, nested data, and personalized bulk output with docxtpl and Jinja2
- Inserting Images into Word Documents — embedding logos and per-record images, with sizing that survives rendering
- Converting DOCX to PDF with Python — headless LibreOffice versus docx2pdf, fidelity tradeoffs, and batch conversion
Part of Python Doc & Data Automation.