Batch Merge PDFs with a Python Script
When a batch merge across a large directory halts, the failure is almost always one of three things: PdfReadError (corrupted or truncated header), PermissionError (unclosed file handle on Windows), or silent misordering because Python's default sorted() puts Report_10.pdf before Report_2.pdf. This guide provides a production-ready merge script using pypdf and pathlib that handles all three.
For the foundational merge and split primitives — PdfWriter.append(), outline preservation, and page-size normalization — see Merging and Splitting PDF Documents.
Root Cause
Three failure modes cover the vast majority of batch merge errors:
- Malformed PDF headers. Missing %PDF- signatures or truncated cross-reference tables raise pypdf.errors.PdfReadError. One corrupt file in 200 kills the whole run if not caught.
- Encrypted files. Password-protected documents raise FileNotDecryptedError the moment you access any page attribute without first calling reader.decrypt(password).
- Unclosed file handles. On Windows, PdfReader objects that remain open after the function exits hold OS locks. A second run on the same directory raises PermissionError: [Errno 13] Permission denied.
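The first failure mode can be screened cheaply before any parsing. The sketch below is a minimal pre-check, not a replacement for opening the file with PdfReader: it only detects a missing %PDF- marker and says nothing about truncated cross-reference tables. The helper name has_pdf_signature and the 1024-byte scan window are assumptions made for illustration.

```python
from pathlib import Path

PDF_SIGNATURE = b"%PDF-"


def has_pdf_signature(path: Path, scan_bytes: int = 1024) -> bool:
    """Return True if the %PDF- marker appears within the first scan_bytes bytes."""
    with open(path, "rb") as fh:  # binary mode; the handle closes when the block exits
        head = fh.read(scan_bytes)
    return PDF_SIGNATURE in head
```

A file that fails this check is almost certainly not a usable PDF; a file that passes may still be corrupt further in, so the per-file try/except remains necessary.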
Diagnostic Snippet
Before writing a merge script, confirm which files are problematic:
# pip install pypdf
from pypdf import PdfReader
from pypdf.errors import PdfReadError
from pathlib import Path

def audit_pdfs(input_dir: Path) -> None:
    """Print status of every PDF in a directory: ok / encrypted / corrupt."""
    for pdf in sorted(input_dir.glob("*.pdf")):
        try:
            with open(pdf, "rb") as fh:
                reader = PdfReader(fh)
                status = "encrypted" if reader.is_encrypted else f"ok ({len(reader.pages)} pages)"
        except PdfReadError as exc:
            status = f"corrupt: {exc}"
        except Exception as exc:
            status = f"error: {exc}"
        print(f"{pdf.name:40s} {status}")

if __name__ == "__main__":
    audit_pdfs(Path("./input_pdfs"))
Run this first. Files marked corrupt need manual repair with pikepdf or removal. Files marked encrypted require either a password or exclusion from the batch.
Fix Implementation: Robust Merge Script
# pip install pypdf
import re
from pathlib import Path
from pypdf import PdfWriter, PdfReader
from pypdf.errors import PdfReadError, FileNotDecryptedError

def natural_sort_key(filepath: Path) -> list:
    """
    Split filename into text/integer tokens so that
    Report_2.pdf sorts before Report_10.pdf.
    """
    return [
        int(c) if c.isdigit() else c.lower()
        for c in re.split(r"(\d+)", filepath.name)
    ]

def batch_merge_pdfs(
    input_dir: Path,
    output_path: Path,
    password: str = "",
) -> int:
    """
    Merge all readable PDFs in input_dir (natural sort) into output_path.
    Returns the number of successfully merged files.
    Skips corrupt, encrypted-without-password, and locked files with a log line.
    """
    if not input_dir.is_dir():
        raise FileNotFoundError(f"Input directory not found: {input_dir}")

    pdf_files = sorted(input_dir.glob("*.pdf"), key=natural_sort_key)
    writer = PdfWriter()
    merged_count = 0

    for pdf in pdf_files:
        try:
            with open(pdf, "rb") as fh:  # 'with' guarantees handle closure
                reader = PdfReader(fh)
                if reader.is_encrypted:
                    if not password:
                        print(f"[SKIP] Encrypted (no password supplied): {pdf.name}")
                        continue
                    result = reader.decrypt(password)
                    if result == 0:
                        print(f"[SKIP] Wrong password for: {pdf.name}")
                        continue
                writer.append(reader, import_outline=True)  # preserves bookmarks
                merged_count += 1
        # FileNotDecryptedError subclasses PdfReadError, so it must be caught first
        except FileNotDecryptedError as exc:
            print(f"[SKIP] Decrypt failed: {pdf.name} — {exc}")
        except PdfReadError as exc:
            print(f"[SKIP] Corrupt PDF: {pdf.name} — {exc}")
        except PermissionError as exc:
            print(f"[SKIP] File locked: {pdf.name} — {exc}")
        except Exception as exc:
            print(f"[SKIP] Unexpected error on {pdf.name}: {exc}")

    if merged_count == 0:
        print("[WARN] No valid PDFs found; output not written.")
        writer.close()
        return 0

    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "wb") as out:
        writer.write(out)
    writer.close()  # writer is finished; no further appends

    print(f"[OK] Merged {merged_count}/{len(pdf_files)} files → {output_path}")
    return merged_count

if __name__ == "__main__":
    batch_merge_pdfs(
        Path("./input_pdfs"),
        Path("./output/merged.pdf"),
    )
Key decisions in this script:
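- Natural sort. natural_sort_key splits filenames on digit boundaries so Report_10.pdf follows Report_9.pdf, not Report_1.pdf.
- with open() per file. Each reader is closed after append() executes. Never accumulate open PdfReader objects across iterations.
- import_outline=True. Carries each source document's bookmarks into the writer. Recent pypdf releases already default to True, so the explicit argument mainly documents intent; passing False drops the navigation structure.
- writer.close(). Marks the writer as finished. In recent pypdf releases this call exists mainly for compatibility with the older PdfMerger API, so treat it as tidy cleanup rather than a guaranteed memory release. In long-running processes, also let the writer go out of scope once the output is written.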
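To make the ordering difference concrete, the sketch below compares the default sort with the natural sort key on four illustrative filenames. It uses a variant of natural_sort_key that takes a plain string rather than a Path so it runs without any files on disk; the sample names are made up.

```python
import re


def natural_sort_key(name: str) -> list:
    """Split a filename into lowercase text tokens and integer tokens."""
    return [
        int(tok) if tok.isdigit() else tok.lower()
        for tok in re.split(r"(\d+)", name)
    ]


names = ["Report_10.pdf", "Report_2.pdf", "Report_1.pdf", "Report_9.pdf"]

# Default sort compares character by character, so "10" lands before "2".
print(sorted(names))
# ['Report_1.pdf', 'Report_10.pdf', 'Report_2.pdf', 'Report_9.pdf']

# Natural sort compares the digit runs as integers.
print(sorted(names, key=natural_sort_key))
# ['Report_1.pdf', 'Report_2.pdf', 'Report_9.pdf', 'Report_10.pdf']
```

Because re.split with a capturing group always returns alternating text and digit tokens starting with a (possibly empty) text token, keys for different filenames hold the same type at each position and compare without a TypeError.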
Variant: Merge with argparse CLI
The function above is importable. Wrap it with argparse for command-line use:
#!/usr/bin/env python3
# pip install pypdf
"""
merge_folder.py — merge all PDFs in a directory.
Usage: python merge_folder.py --input ./docs --output merged.pdf [--password SECRET]
"""
import argparse
import re
from pathlib import Path
from pypdf import PdfWriter, PdfReader
from pypdf.errors import PdfReadError, FileNotDecryptedError

def natural_sort_key(p: Path) -> list:
    return [int(c) if c.isdigit() else c.lower() for c in re.split(r"(\d+)", p.name)]

def run(args: argparse.Namespace) -> None:
    input_dir = Path(args.input)
    output_path = Path(args.output)
    password = args.password or ""

    if not input_dir.is_dir():
        raise SystemExit(f"Not a directory: {input_dir}")

    pdf_files = sorted(input_dir.glob("*.pdf"), key=natural_sort_key)
    print(f"Found {len(pdf_files)} PDFs in {input_dir}")

    writer = PdfWriter()
    merged = 0
    for pdf in pdf_files:
        try:
            with open(pdf, "rb") as fh:
                reader = PdfReader(fh)
                if reader.is_encrypted:
                    if not password or reader.decrypt(password) == 0:
                        print(f"[SKIP] {pdf.name} — encrypted")
                        continue
                writer.append(reader, import_outline=True)
                merged += 1
                print(f" + {pdf.name}")
        except (PdfReadError, FileNotDecryptedError, PermissionError) as exc:
            print(f"[SKIP] {pdf.name} — {exc}")

    if merged == 0:
        raise SystemExit("No files merged; output not written.")

    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "wb") as out:
        writer.write(out)
    writer.close()
    print(f"\nMerged {merged}/{len(pdf_files)} → {output_path}")

def main() -> None:
    ap = argparse.ArgumentParser(description="Merge all PDFs in a folder")
    ap.add_argument("--input", required=True, help="Directory of source PDFs")
    ap.add_argument("--output", required=True, help="Output PDF path")
    ap.add_argument("--password", default="", help="Shared password for encrypted files")
    run(ap.parse_args())

if __name__ == "__main__":
    main()
Variant: Chunked Merge for 500+ Files
Holding page references for 500 PDFs in a single PdfWriter can exhaust memory if any source contains large embedded images. Merge in chunks of 100, write to temporary files, then merge the temporaries:
# pip install pypdf
import math
import shutil
import tempfile
from pathlib import Path
from pypdf import PdfWriter, PdfReader
from pypdf.errors import PdfReadError

def chunked_merge(
    files: list[Path],
    output: Path,
    chunk_size: int = 100,
) -> None:
    """Merge a large file list in bounded-memory chunks."""
    tmp_dir = Path(tempfile.mkdtemp(prefix="pdf_merge_"))
    try:
        chunk_paths: list[Path] = []
        n_chunks = math.ceil(len(files) / chunk_size)
        for chunk_idx in range(n_chunks):
            chunk = files[chunk_idx * chunk_size : (chunk_idx + 1) * chunk_size]
            chunk_out = tmp_dir / f"chunk_{chunk_idx:04d}.pdf"
            writer = PdfWriter()
            for f in chunk:
                try:
                    with open(f, "rb") as fh:
                        writer.append(PdfReader(fh), import_outline=True)
                except PdfReadError as exc:
                    print(f"[SKIP] {f.name}: {exc}")
            with open(chunk_out, "wb") as out:
                writer.write(out)
            writer.close()
            chunk_paths.append(chunk_out)
            print(f"Chunk {chunk_idx + 1}/{n_chunks} written ({len(chunk)} files)")

        # Final pass: merge chunk files
        final = PdfWriter()
        for cp in chunk_paths:
            with open(cp, "rb") as fh:
                final.append(PdfReader(fh), import_outline=True)
        output.parent.mkdir(parents=True, exist_ok=True)
        with open(output, "wb") as out:
            final.write(out)
        final.close()
        print(f"Final merge complete → {output}")
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)
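The slicing arithmetic in chunked_merge is easy to get off by one, so it can help to look at it in isolation. The sketch below reproduces only the chunk boundaries, with no PDF handling; chunk_bounds is a hypothetical helper and the figure of 520 files is illustrative.

```python
import math


def chunk_bounds(n_files: int, chunk_size: int = 100) -> list[tuple[int, int]]:
    """Return the (start, stop) slice bounds of each chunk, in order."""
    n_chunks = math.ceil(n_files / chunk_size)
    return [
        (i * chunk_size, min((i + 1) * chunk_size, n_files))
        for i in range(n_chunks)
    ]


print(chunk_bounds(520))
# [(0, 100), (100, 200), (200, 300), (300, 400), (400, 500), (500, 520)]
```

The last chunk is simply shorter than the rest, and an empty file list produces no chunks at all, which is why the loop in chunked_merge needs no special case for either.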
Verification
After every batch run, confirm the merged file opens cleanly and has the expected page count:
# pip install pypdf
from pypdf import PdfReader
from pathlib import Path

def verify_merge(output_path: Path, expected_pages: int | None = None) -> bool:
    """Return True if the PDF opens without errors and matches expected page count."""
    try:
        reader = PdfReader(output_path)
        actual = len(reader.pages)
        if expected_pages is not None and actual != expected_pages:
            print(f"FAIL: expected {expected_pages} pages, got {actual}")
            return False
        print(f"OK: {output_path.name} ({actual} pages, {len(reader.outline)} outline items)")
        return True
    except Exception as exc:
        print(f"FAIL: {exc}")
        return False

if __name__ == "__main__":
    verify_merge(Path("./output/merged.pdf"))
Run this check in CI or as the final step of any scheduled merge job. If the expected page count is known (the sum of all source page counts, minus the pages of any skipped files), pass it explicitly to catch silent data loss.
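The expected page count is plain arithmetic once per-file page counts and the skipped filenames are available. The sketch below assumes both have been collected separately, for example from the audit output and the [SKIP] log lines; the merge functions above do not return them as written, and expected_page_total and the sample data are hypothetical.

```python
def expected_page_total(page_counts: dict[str, int], skipped: set[str]) -> int:
    """Sum page counts over every source file that was not skipped."""
    return sum(pages for name, pages in page_counts.items() if name not in skipped)


# Hypothetical audit results: filename -> page count
page_counts = {"Report_1.pdf": 12, "Report_2.pdf": 8, "Report_3.pdf": 20}
skipped = {"Report_2.pdf"}  # e.g. reported as [SKIP] during the merge

print(expected_page_total(page_counts, skipped))
# 32
```

The result can be passed as expected_pages to verify_merge.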
After verifying, the merged PDF can be handed off to the assembly pipeline described in Generating PDF Reports Dynamically, or secured as shown in Watermarking and Securing PDFs.
Common Mistakes
| Issue | Root cause | Fix |
|---|---|---|
| Report_10.pdf merged before Report_2.pdf | sorted() uses lexicographic string comparison | Use natural_sort_key with regex digit splitting |
| PermissionError on second run (Windows) | PdfReader objects not closed between iterations | Wrap every PdfReader(fh) call inside with open(path, "rb") as fh: |
| Bookmarks absent in merged output | append() called with import_outline=False, or pages copied one at a time with add_page() | Use append(reader, import_outline=True); confirm pypdf version ≥ 3.0 |
| MemoryError on large batches | All page objects held in one PdfWriter | Use chunked merge (100 files per chunk) |
| Merged file is empty or has 0 pages | Every input was skipped, so writer.write() ran on an empty writer | Guard with if len(writer.pages) > 0 before writing |
FAQ
Why does my script fail on the 50th PDF in a batch?
Likely an accumulated file handle or a corrupt file at position 50. Add try/except PdfReadError per iteration, and ensure every open() is inside a with block. Run the audit snippet above first to identify the bad file.
Can I merge password-protected PDFs automatically?
Only if you know the password. Call reader.decrypt("password") before append() and check the return value: 0 means wrong password, 1 means success with the user password, 2 means success with the owner password.
Does pypdf preserve bookmarks and metadata?
PdfWriter.append(reader, import_outline=True) retains hierarchical bookmarks. Do not rely on source metadata carrying over to the merged file; set it explicitly with writer.add_metadata({"/Title": "...", "/Author": "..."}) after all appends.
How do I merge in a specific custom order rather than sorted filename order?
batch_merge_pdfs takes a directory and sorts the globbed files itself, so it cannot accept a pre-ordered list as written. Build the files list yourself (e.g., from a manifest CSV), then either replace the sorted(..., key=natural_sort_key) line with that list or pass it to chunked_merge, which already accepts a list of paths.
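One way to build such a list is sketched below. It assumes a manifest with no header row and one filename in the first column of each row; files_from_manifest and that layout are illustrative choices, not a format defined elsewhere in this guide.

```python
import csv
from pathlib import Path


def files_from_manifest(manifest: Path, base_dir: Path) -> list[Path]:
    """Read filenames from the first CSV column, preserving row order."""
    ordered: list[Path] = []
    with open(manifest, newline="", encoding="utf-8") as fh:
        for row in csv.reader(fh):
            if not row or not row[0].strip():
                continue  # skip blank rows
            candidate = base_dir / row[0].strip()
            if candidate.is_file():
                ordered.append(candidate)
            else:
                print(f"[WARN] Listed in manifest but missing: {candidate.name}")
    return ordered
```

The returned list keeps the manifest's order exactly, with missing entries reported rather than silently dropped, and can be passed as the files argument of chunked_merge.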
Related
- Merging and Splitting PDF Documents — full reference: append() vs add_page(), outline inspection, page reordering, and chunked streaming
- Split a PDF by Page Ranges with Python — inverse operation: parse a ranges string and write one file per range
- Watermarking and Securing PDFs — apply access controls to the merged output
Part of Merging and Splitting PDF Documents.