Split a PDF by Page Ranges with Python

You have a 40-page PDF (a monthly report, a contract, a scanned bundle) and you need to produce separate files for pages 1–10, 11–25, and 26–40. A naive slice loop either writes the wrong pages or crashes, because pypdf uses 0-based indexing while every PDF viewer shows 1-based page numbers. This guide covers the correct split pattern, the range-string parser that powers a --ranges "1-10,11-25,26-40" CLI argument, and three variants: splitting every N pages, splitting on bookmarks, and handling invalid ranges gracefully.

All split operations use pypdf's PdfReader and PdfWriter; the same library handles the merging side of the workflow in Merging and Splitting PDF Documents.

Root Cause: Off-by-One Between UI and pypdf

PDF viewers display page 1 as the first page. pypdf stores pages in a zero-indexed list: reader.pages[0] is page 1, reader.pages[1] is page 2, and so on.

The error appears silently: a script that calls range(start, end) with 1-based start and end values skips the first requested page of every range, so each output starts one page late and comes out one page short.

The fix is mechanical — always subtract 1 from the user-visible start when entering pypdf's index space, and use end (not end - 1) as the exclusive stop of range():

User says "pages 1–5"
pypdf indices: 0, 1, 2, 3, 4
range() call:  range(1 - 1, 5)  →  range(0, 5)  ✓
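
The effect is easy to reproduce without a PDF at all. In the sketch below, a plain list of page labels stands in for reader.pages; the names pages, naive, and correct are illustrative only and are not part of pypdf.

```python
# Stand-in for reader.pages: index 0 holds "page 1", index 1 holds "page 2", ...
pages = [f"page {n}" for n in range(1, 11)]

start, end = 1, 5  # what the user asked for: 1-based, inclusive

# Forgets the conversion: 1-based values used directly as 0-based indices
naive = [pages[i] for i in range(start, end)]

# Converts start only; the inclusive end already equals the exclusive stop
correct = [pages[i] for i in range(start - 1, end)]

print(naive)    # ['page 2', 'page 3', 'page 4', 'page 5']
print(correct)  # ['page 1', 'page 2', 'page 3', 'page 4', 'page 5']
```

The naive version starts at page 2 and returns four pages instead of five, without raising any error.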

Diagnostic: Confirm Page Count Before Splitting

# pip install pypdf
from pypdf import PdfReader
from pathlib import Path

def inspect(path: Path) -> None:
    """Print page count and first-level outline entries."""
    try:
        reader = PdfReader(path)
        print(f"{path.name}: {len(reader.pages)} pages, encrypted={reader.is_encrypted}")
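        for item in reader.outline:
            if hasattr(item, "title"):  # nested lists hold child bookmarks; skip them
                pg = reader.get_destination_page_number(item)
                # The lookup can return None when a destination has no resolvable page
                label = f"p{pg + 1}" if pg is not None else "p?"
                print(f"  [{label}] {item.title}")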
    except Exception as exc:
        print(f"Could not inspect {path.name}: {exc}")

if __name__ == "__main__":
    inspect(Path("./source_document.pdf"))

Run this before splitting. Knowing the total page count lets you validate every requested range before writing a single byte. The outline listing is useful for the bookmark-split variant below.
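
One way to do that validation up front is a pure function that takes the ranges and the page count and reports every problem at once. This is a minimal sketch; validate_ranges is a hypothetical helper name, and total is assumed to come from len(reader.pages).

```python
def validate_ranges(ranges: list[tuple[int, int]], total: int) -> list[str]:
    """Return one message per invalid 1-based (start, end) pair; empty means all valid."""
    problems: list[str] = []
    for idx, (start, end) in enumerate(ranges, start=1):
        if start < 1:
            problems.append(f"Range {idx}: start={start} must be >= 1")
        elif start > end:
            problems.append(f"Range {idx}: start={start} > end={end}")
        elif end > total:
            problems.append(f"Range {idx}: end={end} exceeds {total} pages")
    return problems


if __name__ == "__main__":
    # total is hard-coded here; in practice it would be len(reader.pages)
    for message in validate_ranges([(1, 10), (26, 45), (8, 3)], total=40):
        print(message)
```

Collecting messages instead of raising on the first failure lets a caller show the user all bad ranges in a single pass.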

Fix Implementation: Range-Based Split

# pip install pypdf
from pypdf import PdfReader, PdfWriter
from pathlib import Path


def split_by_ranges(
    input_path: Path,
    output_dir: Path,
    ranges: list[tuple[int, int]],
) -> list[Path]:
    """
    Split input_path into one PDF per range.

    ranges is a list of (start, end) tuples using 1-based page numbers
    (matching what PDF viewers show).  Both start and end are inclusive.

    Returns the list of output paths created.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    created: list[Path] = []

    try:
        with open(input_path, "rb") as fh:
            reader = PdfReader(fh)
            total = len(reader.pages)

            for idx, (start, end) in enumerate(ranges, start=1):
                # Validate before touching the writer
                if start < 1:
                    raise ValueError(f"Range {idx}: start={start} must be >= 1")
                if end > total:
                    raise ValueError(
                        f"Range {idx}: end={end} exceeds document length ({total} pages)"
                    )
                if start > end:
                    raise ValueError(f"Range {idx}: start={start} > end={end}")

                writer = PdfWriter()
                # KEY: subtract 1 from start to convert 1-based → 0-based;
                #      end is the exclusive upper bound for range(), so no adjustment.
                for page_idx in range(start - 1, end):
                    writer.add_page(reader.pages[page_idx])

                out_path = output_dir / f"{input_path.stem}_part{idx:02d}.pdf"
                with open(out_path, "wb") as out:
                    writer.write(out)
                writer.close()
                created.append(out_path)
                print(f"part{idx:02d}: pages {start}–{end} ({end - start + 1} pages) → {out_path.name}")

    except Exception as exc:
        print(f"Split failed: {exc}")
        raise

    return created


if __name__ == "__main__":
    split_by_ranges(
        Path("./annual_report.pdf"),
        Path("./output/splits"),
        [(1, 10), (11, 25), (26, 40)],
    )

Each range gets its own fresh PdfWriter. Reusing a single writer across ranges would concatenate all ranges into one file instead of producing separate outputs.

Parse a Ranges String from the Command Line

Rather than hard-coding tuples, accept a string like "1-10,11-25,26-40":

# Standard library only: parsing the range string does not need pypdf
import re


def parse_ranges(spec: str) -> list[tuple[int, int]]:
    """
    Parse '1-10,11-25,26-40' → [(1,10),(11,25),(26,40)].
    Single-page entries like '5' become (5,5).
    Raises ValueError for malformed tokens.
    """
    result: list[tuple[int, int]] = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        match = re.fullmatch(r"(\d+)(?:-(\d+))?", token)
        if not match:
            raise ValueError(f"Invalid range token: {token!r}")
        a = int(match.group(1))
        b = int(match.group(2)) if match.group(2) else a
        if a > b:
            raise ValueError(f"Start {a} > end {b} in token {token!r}")
        result.append((a, b))
    return result


if __name__ == "__main__":
    print(parse_ranges("1-10,11-25,26-40"))   # [(1, 10), (11, 25), (26, 40)]
    print(parse_ranges("5"))                   # [(5, 5)]
    print(parse_ranges("1-5, 7-10"))           # [(1, 5), (7, 10)]

Connect this to the CLI:

#!/usr/bin/env python3
# pip install pypdf
"""
split_pdf.py — split a PDF by page ranges.
Usage: python split_pdf.py --input report.pdf --output ./splits --ranges "1-10,11-25"
"""
import argparse
import re
from pathlib import Path

from pypdf import PdfReader, PdfWriter


def parse_ranges(spec: str) -> list[tuple[int, int]]:
    result = []
    for token in spec.split(","):
        token = token.strip()
        m = re.fullmatch(r"(\d+)(?:-(\d+))?", token)
        if not m:
            raise argparse.ArgumentTypeError(f"Bad range: {token!r}")
        a, b = int(m.group(1)), int(m.group(2) or m.group(1))
        if a > b:
            raise argparse.ArgumentTypeError(f"start > end in {token!r}")
        result.append((a, b))
    return result


def split(input_path: Path, output_dir: Path, ranges: list[tuple[int, int]]) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    with open(input_path, "rb") as fh:
        reader = PdfReader(fh)
        total = len(reader.pages)
        print(f"{input_path.name}: {total} pages")
        for idx, (start, end) in enumerate(ranges, 1):
            if not (1 <= start <= end <= total):
                print(f"[SKIP] Range ({start}-{end}) invalid for {total}-page document")
                continue
            writer = PdfWriter()
            for i in range(start - 1, end):       # 0-based index
                writer.add_page(reader.pages[i])
            out = output_dir / f"{input_path.stem}_part{idx:02d}.pdf"
            with open(out, "wb") as f:
                writer.write(f)
            writer.close()
            print(f"  part{idx:02d}: p{start}–p{end} → {out.name}")


def main() -> None:
    ap = argparse.ArgumentParser(description="Split a PDF by page ranges")
    ap.add_argument("--input", required=True, type=Path)
    ap.add_argument("--output", required=True, type=Path)
    # type=parse_ranges lets argparse turn ArgumentTypeError into a clean usage error
    ap.add_argument("--ranges", required=True, type=parse_ranges, help='e.g. "1-10,11-25,26-40"')
    args = ap.parse_args()
    split(args.input, args.output, args.ranges)


if __name__ == "__main__":
    main()

Variant: Split Every N Pages

When you do not know the ranges in advance — just "give me 10-page chunks":

# pip install pypdf
import math
from pathlib import Path
from pypdf import PdfReader, PdfWriter


def split_every_n(input_path: Path, output_dir: Path, n: int) -> list[Path]:
    """
    Split input_path into files of n pages each.
    The last file may have fewer than n pages.
    Returns list of created file paths.
    """
    if n < 1:
        raise ValueError(f"n must be >= 1, got {n}")

    output_dir.mkdir(parents=True, exist_ok=True)
    created: list[Path] = []

    with open(input_path, "rb") as fh:
        reader = PdfReader(fh)
        total = len(reader.pages)
        n_chunks = math.ceil(total / n)

        for chunk in range(n_chunks):
            start_idx = chunk * n              # 0-based
            end_idx = min(start_idx + n, total)  # exclusive
            writer = PdfWriter()
            for i in range(start_idx, end_idx):
                writer.add_page(reader.pages[i])
            out_path = output_dir / f"{input_path.stem}_chunk{chunk + 1:03d}.pdf"
            with open(out_path, "wb") as out:
                writer.write(out)
            writer.close()
            created.append(out_path)
            # Report in 1-based page numbers for readability
            print(f"chunk{chunk + 1:03d}: pages {start_idx + 1}–{end_idx} → {out_path.name}")

    return created


if __name__ == "__main__":
    split_every_n(Path("./large_export.pdf"), Path("./chunks"), n=10)

Note: start_idx is already 0-based here because we compute it directly from the chunk index. No subtraction needed — the subtraction is only required when converting a 1-based user input.
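
The boundary arithmetic can be checked in isolation, with no PDF involved. A minimal sketch follows; chunk_bounds is a hypothetical helper name used only for illustration.

```python
import math


def chunk_bounds(total: int, n: int) -> list[tuple[int, int]]:
    """Return 0-based (start, exclusive_end) index pairs for chunks of n pages."""
    if n < 1:
        raise ValueError(f"n must be >= 1, got {n}")
    return [
        (chunk * n, min(chunk * n + n, total))  # clamp the last chunk to the document end
        for chunk in range(math.ceil(total / n))
    ]


print(chunk_bounds(25, 10))  # [(0, 10), (10, 20), (20, 25)]
```

Each pair can be passed straight to range(), since both values are already in pypdf's 0-based index space.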

Variant: Split on Bookmarks

When the PDF has a well-structured outline, split at each top-level bookmark:

# pip install pypdf
from pathlib import Path
from pypdf import PdfReader, PdfWriter


def split_on_bookmarks(input_path: Path, output_dir: Path) -> list[Path]:
    """
    Split a PDF at each top-level bookmark.
    Each section runs from its bookmark's page to the page before the next bookmark.
    Returns list of created file paths.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    created: list[Path] = []

    with open(input_path, "rb") as fh:
        reader = PdfReader(fh)
        total = len(reader.pages)

        # Collect top-level bookmarks (skip nested lists)
        top_level = [
            item for item in reader.outline
            if hasattr(item, "title")
        ]
        if not top_level:
            print("No top-level bookmarks found; nothing to split on.")
            return created

        # Build (start_page_0based, title) pairs
        sections: list[tuple[int, str]] = []
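        for item in top_level:
            pg_0based = reader.get_destination_page_number(item)
            if pg_0based is None:  # destination does not resolve to a page
                print(f"[SKIP] No page found for bookmark: {item.title!r}")
                continue
            sections.append((pg_0based, item.title))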

        # Each section ends one page before the next section starts
        for i, (start_idx, title) in enumerate(sections):
            end_idx = sections[i + 1][0] if i + 1 < len(sections) else total
            if start_idx >= end_idx:
                print(f"[SKIP] Empty section: {title!r}")
                continue

            writer = PdfWriter()
            for page_idx in range(start_idx, end_idx):
                writer.add_page(reader.pages[page_idx])

            # Sanitize title for use as filename
            safe_name = "".join(c if c.isalnum() or c in " _-" else "_" for c in title)
            out_path = output_dir / f"{i + 1:02d}_{safe_name[:40]}.pdf"
            with open(out_path, "wb") as out:
                writer.write(out)
            writer.close()
            created.append(out_path)
            # Display in 1-based page numbers
            print(f"  [{i+1}] '{title}' (p{start_idx+1}–p{end_idx}) → {out_path.name}")

    return created


if __name__ == "__main__":
    split_on_bookmarks(Path("./report_with_bookmarks.pdf"), Path("./sections"))

get_destination_page_number() returns a 0-based index — no subtraction needed since we use it directly as a slice start, not from user input.
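
The section boundaries themselves are plain index arithmetic. The sketch below mirrors the pairing logic used in split_on_bookmarks, with a list of 0-based start indices standing in for the bookmark lookups; section_bounds is a hypothetical helper name.

```python
def section_bounds(starts: list[int], total: int) -> list[tuple[int, int]]:
    """Pair each 0-based bookmark start with the next start (or total) as exclusive end."""
    bounds: list[tuple[int, int]] = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else total
        if start < end:  # two bookmarks on the same page produce an empty section; skip it
            bounds.append((start, end))
    return bounds


print(section_bounds([0, 4, 4, 12], total=20))  # [(0, 4), (4, 12), (12, 20)]
print(section_bounds([2, 5], total=8))          # [(2, 5), (5, 8)]
```

The second call shows a property worth knowing: pages before the first bookmark (indices 0 and 1 here) fall into no section and are not written to any output.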

Verification

# pip install pypdf
from pypdf import PdfReader
from pathlib import Path


def verify_splits(
    source: Path,
    output_dir: Path,
    ranges: list[tuple[int, int]],
) -> bool:
    """
    Confirm each split file has the expected page count.
    Assumes files are named *_part01.pdf, *_part02.pdf, ...
    """
    all_ok = True
    for idx, (start, end) in enumerate(ranges, 1):
        expected = end - start + 1
        out_path = output_dir / f"{source.stem}_part{idx:02d}.pdf"
        try:
            actual = len(PdfReader(out_path).pages)
            status = "OK" if actual == expected else f"FAIL (expected {expected}, got {actual})"
        except Exception as exc:
            status = f"FAIL ({exc})"
            actual = -1
        print(f"  part{idx:02d}: {status}")
        if "FAIL" in status:
            all_ok = False
    return all_ok


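if __name__ == "__main__":
    # split_by_ranges is the function from the range-based split section above;
    # define it in this module or import it before running this block.
    source = Path("./annual_report.pdf")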
    ranges = [(1, 10), (11, 25), (26, 40)]
    split_by_ranges(source, Path("./output/splits"), ranges)
    ok = verify_splits(source, Path("./output/splits"), ranges)
    print("All splits verified." if ok else "Some splits failed verification.")

The page counts of the split files should add up to the sum of the range lengths, that is, the sum of end - start + 1 over all ranges. If ranges overlap, a page appears in more than one output; that is valid but worth logging explicitly.
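
Both checks can be done on the range list alone, before any file is written. This is a minimal sketch, and find_overlaps is a hypothetical helper name.

```python
def find_overlaps(ranges: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return pairs of 1-based range positions whose inclusive page ranges share a page."""
    overlaps: list[tuple[int, int]] = []
    for i in range(len(ranges)):
        for j in range(i + 1, len(ranges)):
            (a1, b1), (a2, b2) = ranges[i], ranges[j]
            if a1 <= b2 and a2 <= b1:  # inclusive intervals intersect
                overlaps.append((i + 1, j + 1))
    return overlaps


ranges = [(1, 5), (3, 8), (9, 12)]
expected_total = sum(end - start + 1 for start, end in ranges)

print(expected_total)         # 15
print(find_overlaps(ranges))  # [(1, 2)]
```

Here ranges 1 and 2 share pages 3 to 5, so the outputs hold 15 pages in total even though only pages 1 to 12 of the source are covered.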

After splitting, the individual files can be watermarked or password-protected per recipient using the patterns in Watermarking and Securing PDFs, or re-assembled with cover pages via Generating PDF Reports Dynamically.

Common Mistakes

Mistake	Symptom	Fix
`range(start, end)` with 1-based `start`	Each output starts one page late (a request for pages 1–5 yields pages 2–5) and is one page short	Use `range(start - 1, end)`; subtract 1 from start only
Reusing one `PdfWriter` across ranges	All ranges concatenated into a single output file	Create a new `PdfWriter()` inside the loop, before each range
Not checking `end <= total`	`IndexError` when the loop reaches an index past the last page, possibly after earlier parts were already written	Validate each range against `len(reader.pages)` before writing
Opening the source without a context manager	File handle stays open after the split; `PermissionError` on Windows when the source is later moved or deleted	Open with `with open(path, "rb") as fh: reader = PdfReader(fh)` and finish all ranges inside that block
Using `get_destination_page_number()` output as 1-based	Off-by-one in bookmark splits	That method already returns 0-based; use directly in `range()` without subtracting

FAQ

Why does my split file have one fewer page than expected? You most likely called range(start, end) with a 1-based start, which skips the first requested page. The fix is range(start - 1, end). end stays unchanged because the 1-based inclusive end is numerically the same as the 0-based exclusive stop that range() expects.

Can I split overlapping ranges (e.g., pages 1–5 and 3–8)? Yes. Each range gets its own PdfWriter and output file. Pages 3–5 will appear in both outputs. This is intentional for use cases like generating an "executive summary" (pages 1–5) that overlaps a "full section" (pages 3–8).

How do I split a scanned PDF where there are no bookmarks? Use the range-based approach with manually specified page boundaries, or run OCR first with the pattern in Scanning and OCR Processing with Python to detect section boundaries from text content.

Does splitting strip PDF forms or annotations? writer.add_page() does a shallow copy that preserves page-level annotations (highlights, comments, form fields on that page). Interactive form state (/AcroForm) at the document level is not copied. writer.clone_reader_document_root(reader) can re-attach it if needed, but be aware that it copies the reader's whole document root, pages included, so check the resulting page count.

Merging and Splitting PDF Documents — full reference including bookmark preservation, page reordering, and large-batch streaming
Batch Merge PDFs with a Python Script — the inverse operation: merge a folder of PDFs with natural sort and error recovery
Watermarking and Securing PDFs — apply per-recipient watermarks or password protection to the split output files
