Batch Merge PDFs with a Python Script

When a batch merge across a large directory halts, the failure is almost always one of three things: PdfReadError (corrupted or truncated header), PermissionError (unclosed file handle on Windows), or silent misordering because Python's default sorted() puts Report_10.pdf before Report_2.pdf. This guide provides a production-ready merge script using pypdf and pathlib that handles all three.

For the foundational merge and split primitives — PdfWriter.append(), outline preservation, and page-size normalization — see Merging and Splitting PDF Documents.

Root Cause

Misordering is silent, so it never stops a run. The failures that actually halt a batch merge come from three sources:

Malformed PDF headers. Missing %PDF- signatures or truncated cross-reference tables raise pypdf.errors.PdfReadError. One corrupt file in 200 kills the whole run if not caught.
Encrypted files. Password-protected documents raise FileNotDecryptedError the moment you access any page attribute without first calling reader.decrypt(password).
Unclosed file handles. On Windows, PdfReader objects that remain open after the function exits hold OS locks. A second run on the same directory raises PermissionError: [Errno 13] Permission denied.
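
As a cheap first filter for the missing-signature case, the opening bytes of each file can be checked with the standard library alone. This is a minimal sketch, not part of pypdf: the helper name has_pdf_signature and the 1024-byte search window are assumptions made for illustration. Passing the check does not prove the file is valid, since a truncated cross-reference table would still fail later.

```python
from pathlib import Path


def has_pdf_signature(path: Path, window: int = 1024) -> bool:
    """Return True if the %PDF- marker appears in the first `window` bytes.

    Hypothetical helper: the window size is an assumption, not a pypdf setting.
    """
    with open(path, "rb") as fh:  # 'with' closes the handle even on error
        head = fh.read(window)
    return b"%PDF-" in head
```

Files that fail this check can be set aside before the slower pypdf audit runs.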

Diagnostic Snippet

Before writing a merge script, confirm which files are problematic:

# pip install pypdf
from pypdf import PdfReader
from pypdf.errors import PdfReadError
from pathlib import Path

def audit_pdfs(input_dir: Path) -> None:
    """Print status of every PDF in a directory: ok / encrypted / corrupt."""
    for pdf in sorted(input_dir.glob("*.pdf")):
        try:
            with open(pdf, "rb") as fh:
                reader = PdfReader(fh)
                status = "encrypted" if reader.is_encrypted else f"ok ({len(reader.pages)} pages)"
        except PdfReadError as exc:
            status = f"corrupt: {exc}"
        except Exception as exc:
            status = f"error: {exc}"
        print(f"{pdf.name:40s}  {status}")

if __name__ == "__main__":
    audit_pdfs(Path("./input_pdfs"))

Run this first. Files marked corrupt need manual repair with pikepdf or removal. Files marked encrypted require either a password or exclusion from the batch.

Fix Implementation: Robust Merge Script

# pip install pypdf
import re
from pathlib import Path

from pypdf import PdfWriter, PdfReader
from pypdf.errors import PdfReadError, FileNotDecryptedError


def natural_sort_key(filepath: Path) -> list:
    """
    Split filename into text/integer tokens so that
    Report_2.pdf sorts before Report_10.pdf.
    """
    return [
        int(c) if c.isdigit() else c.lower()
        for c in re.split(r"(\d+)", filepath.name)
    ]


def batch_merge_pdfs(
    input_dir: Path,
    output_path: Path,
    password: str = "",
) -> int:
    """
    Merge all readable PDFs in input_dir (natural sort) into output_path.
    Returns the number of successfully merged files.
    Skips corrupt, encrypted-without-password, and locked files with a log line.
    """
    if not input_dir.is_dir():
        raise FileNotFoundError(f"Input directory not found: {input_dir}")

    pdf_files = sorted(input_dir.glob("*.pdf"), key=natural_sort_key)
    writer = PdfWriter()
    merged_count = 0

    for pdf in pdf_files:
        try:
            with open(pdf, "rb") as fh:           # 'with' guarantees handle closure
                reader = PdfReader(fh)
                if reader.is_encrypted:
                    if not password:
                        print(f"[SKIP] Encrypted (no password supplied): {pdf.name}")
                        continue
                    result = reader.decrypt(password)
                    if result == 0:
                        print(f"[SKIP] Wrong password for: {pdf.name}")
                        continue
                writer.append(reader, import_outline=True)  # preserves bookmarks
                merged_count += 1
        except FileNotDecryptedError as exc:  # subclass of PdfReadError, so it must be caught first
            print(f"[SKIP] Decrypt failed: {pdf.name}: {exc}")
        except PdfReadError as exc:
            print(f"[SKIP] Corrupt PDF: {pdf.name}: {exc}")
        except PermissionError as exc:
            print(f"[SKIP] File locked: {pdf.name}: {exc}")
        except Exception as exc:
            print(f"[SKIP] Unexpected error on {pdf.name}: {exc}")

    if merged_count == 0:
        print("[WARN] No valid PDFs found; output not written.")
        writer.close()
        return 0

    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "wb") as out:
        writer.write(out)
    writer.close()   # release internal buffer references
    print(f"[OK] Merged {merged_count}/{len(pdf_files)} files → {output_path}")
    return merged_count


if __name__ == "__main__":
    batch_merge_pdfs(
        Path("./input_pdfs"),
        Path("./output/merged.pdf"),
    )

Key decisions in this script:

Natural sort. natural_sort_key splits filenames on digit boundaries so Report_10.pdf follows Report_9.pdf, not Report_1.pdf.
with open() per file. Each reader is closed after append() executes. Never accumulate open PdfReader objects across iterations.
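import_outline=True. Copies each source document's bookmarks into the writer. append() enables this by default, so passing it explicitly mainly documents the intent; setting it to False drops the navigation structure.
writer.close(). Marks the writer as finished once the output is on disk. pypdf documents this method mainly for compatibility with the older PdfMerger interface, so treat it as inexpensive hygiene rather than a guaranteed memory release. Memory is reclaimed when the writer object itself goes out of scope.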
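
The difference between the two orderings is easy to see on three filenames. The sketch below repeats the same token-splitting idea on plain strings instead of Path objects, which is a simplification for the demo; the sample names are illustrative.

```python
import re


def natural_sort_key(name: str) -> list:
    """Split a filename into text and integer tokens for number-aware sorting."""
    return [
        int(tok) if tok.isdigit() else tok.lower()
        for tok in re.split(r"(\d+)", name)
    ]


names = ["Report_10.pdf", "Report_2.pdf", "Report_1.pdf"]

lexicographic = sorted(names)                  # compares character by character
natural = sorted(names, key=natural_sort_key)  # compares digit runs as integers

print(lexicographic)  # ['Report_1.pdf', 'Report_10.pdf', 'Report_2.pdf']
print(natural)        # ['Report_1.pdf', 'Report_2.pdf', 'Report_10.pdf']
```

Because re.split with a capturing group always alternates text and digit tokens, two keys are compared string against string and integer against integer, position by position.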

Variant: Merge with argparse CLI

The function above is importable. Wrap it with argparse for command-line use:

#!/usr/bin/env python3
# pip install pypdf
"""
merge_folder.py — merge all PDFs in a directory.
Usage: python merge_folder.py --input ./docs --output merged.pdf [--password SECRET]
"""
import argparse
import re
from pathlib import Path

from pypdf import PdfWriter, PdfReader
from pypdf.errors import PdfReadError, FileNotDecryptedError


def natural_sort_key(p: Path) -> list:
    return [int(c) if c.isdigit() else c.lower() for c in re.split(r"(\d+)", p.name)]


def run(args: argparse.Namespace) -> None:
    input_dir = Path(args.input)
    output_path = Path(args.output)
    password = args.password or ""

    if not input_dir.is_dir():
        raise SystemExit(f"Not a directory: {input_dir}")

    pdf_files = sorted(input_dir.glob("*.pdf"), key=natural_sort_key)
    print(f"Found {len(pdf_files)} PDFs in {input_dir}")

    writer = PdfWriter()
    merged = 0
    for pdf in pdf_files:
        try:
            with open(pdf, "rb") as fh:
                reader = PdfReader(fh)
                if reader.is_encrypted:
                    if not password or reader.decrypt(password) == 0:
                        print(f"[SKIP] {pdf.name} — encrypted")
                        continue
                writer.append(reader, import_outline=True)
                merged += 1
                print(f"  + {pdf.name}")
        except (PdfReadError, FileNotDecryptedError, PermissionError) as exc:
            print(f"[SKIP] {pdf.name} — {exc}")

    if merged == 0:
        raise SystemExit("No files merged; output not written.")

    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "wb") as out:
        writer.write(out)
    writer.close()
    print(f"\nMerged {merged}/{len(pdf_files)} → {output_path}")


def main() -> None:
    ap = argparse.ArgumentParser(description="Merge all PDFs in a folder")
    ap.add_argument("--input", required=True, help="Directory of source PDFs")
    ap.add_argument("--output", required=True, help="Output PDF path")
    ap.add_argument("--password", default="", help="Shared password for encrypted files")
    run(ap.parse_args())


if __name__ == "__main__":
    main()

Variant: Chunked Merge for 500+ Files

Holding page references for 500 PDFs in a single PdfWriter can exhaust memory if any source contains large embedded images. Merge in chunks of 100, write to temporary files, then merge the temporaries:

# pip install pypdf
import math
import shutil
import tempfile
from pathlib import Path

from pypdf import PdfWriter, PdfReader
from pypdf.errors import PdfReadError


def chunked_merge(
    files: list[Path],
    output: Path,
    chunk_size: int = 100,
) -> None:
    """Merge a large file list in bounded-memory chunks."""
    tmp_dir = Path(tempfile.mkdtemp(prefix="pdf_merge_"))
    try:
        chunk_paths: list[Path] = []
        n_chunks = math.ceil(len(files) / chunk_size)

        for chunk_idx in range(n_chunks):
            chunk = files[chunk_idx * chunk_size : (chunk_idx + 1) * chunk_size]
            chunk_out = tmp_dir / f"chunk_{chunk_idx:04d}.pdf"
            writer = PdfWriter()
            for f in chunk:
                try:
                    with open(f, "rb") as fh:
                        writer.append(PdfReader(fh), import_outline=True)
                except PdfReadError as exc:
                    print(f"[SKIP] {f.name}: {exc}")
            with open(chunk_out, "wb") as out:
                writer.write(out)
            writer.close()
            chunk_paths.append(chunk_out)
            print(f"Chunk {chunk_idx + 1}/{n_chunks} written ({len(chunk)} files)")

        # Final pass: merge chunk files
        final = PdfWriter()
        for cp in chunk_paths:
            with open(cp, "rb") as fh:
                final.append(PdfReader(fh), import_outline=True)
        output.parent.mkdir(parents=True, exist_ok=True)
        with open(output, "wb") as out:
            final.write(out)
        final.close()
        print(f"Final merge complete → {output}")
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)
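
The slicing arithmetic inside chunked_merge can be checked in isolation. The helper below is a hypothetical illustration (chunk_bounds is not part of the script above); it returns the start and stop indices each chunk would cover.

```python
import math


def chunk_bounds(n_files: int, chunk_size: int = 100) -> list[tuple[int, int]]:
    """Return (start, stop) index pairs that cover n_files in chunk_size slices."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    n_chunks = math.ceil(n_files / chunk_size)
    return [
        (i * chunk_size, min((i + 1) * chunk_size, n_files))
        for i in range(n_chunks)
    ]


print(chunk_bounds(250))  # [(0, 100), (100, 200), (200, 250)]
```

The last chunk is simply shorter when the file count is not a multiple of the chunk size, which matches how Python list slicing behaves past the end of a list.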

Verification

After every batch run, confirm the merged file opens cleanly and has the expected page count:

# pip install pypdf
from pypdf import PdfReader
from pathlib import Path


def verify_merge(output_path: Path, expected_pages: int | None = None) -> bool:
    """Return True if the PDF opens without errors and matches expected page count."""
    try:
        reader = PdfReader(output_path)
        actual = len(reader.pages)
        if expected_pages is not None and actual != expected_pages:
            print(f"FAIL: expected {expected_pages} pages, got {actual}")
            return False
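        # len(reader.outline) counts top-level elements only; a nested list of children counts as one
        print(f"OK: {output_path.name}  ({actual} pages, {len(reader.outline)} top-level outline elements)")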
        return True
    except Exception as exc:
        print(f"FAIL: {exc}")
        return False


if __name__ == "__main__":
    verify_merge(Path("./output/merged.pdf"))

Run this check in CI or as the final step of any scheduled merge job. If the expected page count is known (the sum of page counts for the files that were actually merged, excluding any that were skipped), pass it explicitly to catch silent data loss.
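
One way to arrive at that expected count is to record each source's page count during the audit and subtract the skipped files. The sketch below assumes those two inputs already exist; the function name and the sample numbers are hypothetical.

```python
def expected_page_total(page_counts: dict[str, int], skipped: set[str]) -> int:
    """Sum page counts for every source file that was not skipped."""
    return sum(
        pages for name, pages in page_counts.items() if name not in skipped
    )


# Hypothetical numbers for illustration only.
page_counts = {"Report_1.pdf": 12, "Report_2.pdf": 8, "Report_10.pdf": 30}
skipped = {"Report_2.pdf"}

print(expected_page_total(page_counts, skipped))  # 42
```

The result can then be passed as expected_pages to verify_merge.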

After verifying, the merged PDF can be handed off to the assembly pipeline described in Generating PDF Reports Dynamically, or secured as shown in Watermarking and Securing PDFs.

Common Mistakes

Issue	Root cause	Fix
`Report_10.pdf` merged before `Report_2.pdf`	`sorted()` uses lexicographic string comparison	Use `natural_sort_key` with regex digit splitting
`PermissionError` on second run (Windows)	`PdfReader` objects not closed between iterations	Wrap every `PdfReader(fh)` call inside `with open(path, "rb") as fh:`
Bookmarks absent in merged output	`append()` called with `import_outline=False`, or pages copied one at a time with `add_page()`	Call `append(reader, import_outline=True)`; confirm pypdf version ≥ 3.0
`MemoryError` on large batches	All page objects held in one `PdfWriter`	Use chunked merge (100 files per chunk)
Merged file contains no pages	`writer.write()` called on a writer with nothing appended	Guard with `if len(writer.pages) > 0` before writing

FAQ

Why does my script fail on the 50th PDF in a batch? Likely an accumulated file handle or a corrupt file at position 50. Add try/except PdfReadError per iteration, and ensure every open() is inside a with block. Run the audit snippet above first to identify the bad file.

Can I merge password-protected PDFs automatically? Only if you know the password. Call reader.decrypt("password") before append() and check the return value: 0 means wrong password, 1 means success with the user password, 2 means success with the owner password.
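
For log output, the three return values can be mapped to readable labels. This is a small sketch; describe_decrypt_result is a hypothetical helper, and it relies only on the integer values listed above.

```python
def describe_decrypt_result(result: int) -> str:
    """Translate the integer value returned by reader.decrypt() into a label."""
    labels = {
        0: "wrong password",
        1: "decrypted with user password",
        2: "decrypted with owner password",
    }
    # int() also accepts enum-style integer results
    return labels.get(int(result), f"unrecognized result: {result!r}")


print(describe_decrypt_result(0))  # wrong password
```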

Does pypdf preserve bookmarks and metadata? PdfWriter.append(reader, import_outline=True) retains hierarchical bookmarks. Do not rely on document metadata such as title and author carrying over from the sources in a predictable way; set it explicitly with writer.add_metadata({"/Title": "...", "/Author": "..."}) after all appends.

How do I merge in a specific custom order rather than sorted filename order? Build the files list yourself (e.g., from a manifest CSV). batch_merge_pdfs globs and sorts internally, so as written it cannot accept a list: replace its pdf_files = sorted(..., key=natural_sort_key) line with your pre-ordered list, or pass the list to chunked_merge, which already takes one.
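
A manifest-driven order might be loaded as follows. This is a sketch under stated assumptions: the CSV is assumed to have a header row with a column named filename, and both that column name and the helper files_from_manifest are hypothetical.

```python
import csv
from pathlib import Path


def files_from_manifest(manifest: Path, input_dir: Path) -> list[Path]:
    """Return PDF paths in manifest row order, skipping names that do not exist."""
    ordered: list[Path] = []
    with open(manifest, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            # 'filename' is an assumed column name, not a fixed convention
            name = (row.get("filename") or "").strip()
            if not name:
                continue
            candidate = input_dir / name
            if candidate.is_file():
                ordered.append(candidate)
            else:
                print(f"[SKIP] Listed in manifest but missing: {name}")
    return ordered
```

The returned list preserves the manifest's row order, so no sorting step should be applied to it afterwards.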

Merging and Splitting PDF Documents — full reference: append() vs add_page(), outline inspection, page reordering, and chunked streaming
Split a PDF by Page Ranges with Python — inverse operation: parse a ranges string and write one file per range
Watermarking and Securing PDFs — apply access controls to the merged output
