Split a PDF by Page Ranges with Python
You have a 40-page PDF (a monthly report, a contract, a scanned bundle) and you need to produce separate files for pages 1–10, 11–25, and 26–40. A naive slice loop either writes the wrong pages or crashes, because pypdf uses 0-based indexing while every PDF viewer shows 1-based page numbers. This guide covers the correct split pattern, the range-string parser that powers a --ranges "1-10,11-25,26-40" CLI argument, and three variants: split every N pages, split on bookmarks, and handle invalid ranges gracefully.
All split operations use pypdf's PdfReader and PdfWriter; the same library handles the merging side of the workflow in Merging and Splitting PDF Documents.
Root Cause: Off-by-One Between UI and pypdf
PDF viewers display page 1 as the first page. pypdf stores pages in a zero-indexed list: reader.pages[0] is page 1, reader.pages[1] is page 2, and so on.
The error appears silently: a script that calls range(start, end) with 1-based start/end values skips the first page of each range, so every output file starts one page late and comes out one page short.
The fix is mechanical — always subtract 1 from the user-visible start when entering pypdf's index space, and use end (not end - 1) as the exclusive stop of range():
User says "pages 1–5"
pypdf indices: 0, 1, 2, 3, 4
range() call: range(1 - 1, 5) → range(0, 5) ✓
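The index arithmetic above can be checked without opening a PDF. The following is a minimal sketch, assuming a plain list of page labels stands in for reader.pages; the helper name select_pages is hypothetical and not part of pypdf.

```python
def select_pages(pages: list[str], start: int, end: int) -> list[str]:
    """Return the pages for a 1-based, inclusive (start, end) range."""
    # start - 1 converts to a 0-based index; end already works as the
    # exclusive stop of range(), so it needs no adjustment.
    return [pages[i] for i in range(start - 1, end)]


# Stand-in for reader.pages: index 0 holds the page a viewer labels "1".
pages = [f"page {n}" for n in range(1, 11)]

correct = select_pages(pages, 1, 5)
naive = [pages[i] for i in range(1, 5)]  # 1-based values used as raw indices

print(correct)  # ['page 1', 'page 2', 'page 3', 'page 4', 'page 5']
print(naive)    # ['page 2', 'page 3', 'page 4', 'page 5']
```

The naive version loses the first page and returns four pages instead of five, which is the symptom to look for in a real split.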
Diagnostic: Confirm Page Count Before Splitting
# pip install pypdf
from pathlib import Path

from pypdf import PdfReader


def inspect(path: Path) -> None:
    """Print page count and first-level outline entries."""
    try:
        reader = PdfReader(path)
        print(f"{path.name}: {len(reader.pages)} pages, encrypted={reader.is_encrypted}")
        for item in reader.outline:
            # Nested outline levels arrive as lists; only real entries have a title.
            if hasattr(item, "title"):
                pg = reader.get_destination_page_number(item)  # 0-based, or None
                label = pg + 1 if pg is not None else "?"
                print(f"  [p{label}] {item.title}")
    except Exception as exc:
        print(f"Could not inspect {path.name}: {exc}")


if __name__ == "__main__":
    inspect(Path("./source_document.pdf"))
Run this before splitting. Knowing the total page count lets you validate every requested range before writing a single byte. The outline listing is useful for the bookmark-split variant below.
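Validating every range up front can be sketched as a pure function that takes the page count reported by the diagnostic. This is an illustration under assumptions: the name validate_ranges and its return shape (a list of problem strings) are hypothetical choices, not part of pypdf.

```python
def validate_ranges(ranges: list[tuple[int, int]], total: int) -> list[str]:
    """Return a list of problems; an empty list means every range is usable.

    ranges holds 1-based, inclusive (start, end) tuples; total is the
    document's page count, e.g. len(reader.pages).
    """
    problems: list[str] = []
    for idx, (start, end) in enumerate(ranges, start=1):
        if start < 1:
            problems.append(f"range {idx}: start={start} must be >= 1")
        elif start > end:
            problems.append(f"range {idx}: start={start} > end={end}")
        elif end > total:
            problems.append(f"range {idx}: end={end} exceeds {total} pages")
    return problems


print(validate_ranges([(1, 10), (11, 25), (26, 40)], total=40))  # []
print(validate_ranges([(1, 10), (26, 41)], total=40))
# ['range 2: end=41 exceeds 40 pages']
```

Collecting all problems before writing anything means a bad third range cannot leave the first two outputs half-produced on disk.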
Fix Implementation: Range-Based Split
# pip install pypdf
from pathlib import Path

from pypdf import PdfReader, PdfWriter


def split_by_ranges(
    input_path: Path,
    output_dir: Path,
    ranges: list[tuple[int, int]],
) -> list[Path]:
    """
    Split input_path into one PDF per range.

    ranges is a list of (start, end) tuples using 1-based page numbers
    (matching what PDF viewers show). Both start and end are inclusive.
    Returns the list of output paths created.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    created: list[Path] = []
    try:
        with open(input_path, "rb") as fh:
            reader = PdfReader(fh)
            total = len(reader.pages)
            for idx, (start, end) in enumerate(ranges, start=1):
                # Validate before touching the writer
                if start < 1:
                    raise ValueError(f"Range {idx}: start={start} must be >= 1")
                if end > total:
                    raise ValueError(
                        f"Range {idx}: end={end} exceeds document length ({total} pages)"
                    )
                if start > end:
                    raise ValueError(f"Range {idx}: start={start} > end={end}")

                writer = PdfWriter()
                # KEY: subtract 1 from start to convert 1-based → 0-based;
                # end is the exclusive upper bound for range(), so no adjustment.
                for page_idx in range(start - 1, end):
                    writer.add_page(reader.pages[page_idx])

                out_path = output_dir / f"{input_path.stem}_part{idx:02d}.pdf"
                with open(out_path, "wb") as out:
                    writer.write(out)
                writer.close()
                created.append(out_path)
                print(f"part{idx:02d}: pages {start}–{end} ({end - start + 1} pages) → {out_path.name}")
    except Exception as exc:
        print(f"Split failed: {exc}")
        raise
    return created


if __name__ == "__main__":
    split_by_ranges(
        Path("./annual_report.pdf"),
        Path("./output/splits"),
        [(1, 10), (11, 25), (26, 40)],
    )
Each range gets its own fresh PdfWriter. Reusing a single writer across ranges would concatenate all ranges into one file instead of producing separate outputs.
Parse a Ranges String from the Command Line
Rather than hard-coding tuples, accept a string like "1-10,11-25,26-40":
import re


def parse_ranges(spec: str) -> list[tuple[int, int]]:
    """
    Parse '1-10,11-25,26-40' → [(1,10),(11,25),(26,40)].

    Single-page entries like '5' become (5,5).
    Raises ValueError for malformed tokens.
    """
    result: list[tuple[int, int]] = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate trailing or doubled commas
        match = re.fullmatch(r"(\d+)(?:-(\d+))?", token)
        if not match:
            raise ValueError(f"Invalid range token: {token!r}")
        a = int(match.group(1))
        b = int(match.group(2)) if match.group(2) else a
        if a > b:
            raise ValueError(f"Start {a} > end {b} in token {token!r}")
        result.append((a, b))
    return result


if __name__ == "__main__":
    print(parse_ranges("1-10,11-25,26-40"))  # [(1, 10), (11, 25), (26, 40)]
    print(parse_ranges("5"))                 # [(5, 5)]
    print(parse_ranges("1-5, 7-10"))         # [(1, 5), (7, 10)]
Connect this to the CLI:
#!/usr/bin/env python3
# pip install pypdf
"""
split_pdf.py — split a PDF by page ranges.
Usage: python split_pdf.py --input report.pdf --output ./splits --ranges "1-10,11-25"
"""
import argparse
import re
from pathlib import Path

from pypdf import PdfReader, PdfWriter


def parse_ranges(spec: str) -> list[tuple[int, int]]:
    """Parse a ranges string; used as the argparse type= callback."""
    result = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate trailing or doubled commas
        m = re.fullmatch(r"(\d+)(?:-(\d+))?", token)
        if not m:
            raise argparse.ArgumentTypeError(f"Bad range: {token!r}")
        a, b = int(m.group(1)), int(m.group(2) or m.group(1))
        if a > b:
            raise argparse.ArgumentTypeError(f"start > end in {token!r}")
        result.append((a, b))
    return result


def split(input_path: Path, output_dir: Path, ranges: list[tuple[int, int]]) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    with open(input_path, "rb") as fh:
        reader = PdfReader(fh)
        total = len(reader.pages)
        print(f"{input_path.name}: {total} pages")
        for idx, (start, end) in enumerate(ranges, 1):
            if not (1 <= start <= end <= total):
                print(f"[SKIP] Range ({start}-{end}) invalid for {total}-page document")
                continue
            writer = PdfWriter()
            for i in range(start - 1, end):  # 0-based index
                writer.add_page(reader.pages[i])
            out = output_dir / f"{input_path.stem}_part{idx:02d}.pdf"
            with open(out, "wb") as f:
                writer.write(f)
            writer.close()
            print(f"  part{idx:02d}: p{start}–p{end} → {out.name}")


def main() -> None:
    ap = argparse.ArgumentParser(description="Split a PDF by page ranges")
    ap.add_argument("--input", required=True, type=Path)
    ap.add_argument("--output", required=True, type=Path)
    # type=parse_ranges lets argparse turn ArgumentTypeError into a clean usage error
    ap.add_argument("--ranges", required=True, type=parse_ranges,
                    help='e.g. "1-10,11-25,26-40"')
    args = ap.parse_args()
    split(args.input, args.output, args.ranges)


if __name__ == "__main__":
    main()
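The CLI above skips any range that falls outside the document. A softer way to handle invalid ranges is to trim them to the pages that exist. The sketch below is an assumption about how that could look, not the article's method; clamp_ranges is a hypothetical helper name.

```python
def clamp_ranges(ranges: list[tuple[int, int]], total: int) -> list[tuple[int, int]]:
    """Trim 1-based inclusive ranges to [1, total]; drop ranges left empty."""
    usable: list[tuple[int, int]] = []
    for start, end in ranges:
        lo = max(start, 1)    # pages before 1 do not exist
        hi = min(end, total)  # pages after the last one do not exist
        if lo <= hi:
            usable.append((lo, hi))
    return usable


print(clamp_ranges([(1, 10), (36, 45), (50, 60)], total=40))
# [(1, 10), (36, 40)]
```

Clamping changes what the user asked for, so it is worth printing the adjusted ranges before writing any files.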
Variant: Split Every N Pages
When you do not know the ranges in advance — just "give me 10-page chunks":
# pip install pypdf
import math
from pathlib import Path

from pypdf import PdfReader, PdfWriter


def split_every_n(input_path: Path, output_dir: Path, n: int) -> list[Path]:
    """
    Split input_path into files of n pages each.

    The last file may have fewer than n pages.
    Returns list of created file paths.
    """
    if n < 1:
        raise ValueError(f"n must be >= 1, got {n}")
    output_dir.mkdir(parents=True, exist_ok=True)
    created: list[Path] = []
    with open(input_path, "rb") as fh:
        reader = PdfReader(fh)
        total = len(reader.pages)
        n_chunks = math.ceil(total / n)
        for chunk in range(n_chunks):
            start_idx = chunk * n                # 0-based
            end_idx = min(start_idx + n, total)  # exclusive
            writer = PdfWriter()
            for i in range(start_idx, end_idx):
                writer.add_page(reader.pages[i])
            out_path = output_dir / f"{input_path.stem}_chunk{chunk + 1:03d}.pdf"
            with open(out_path, "wb") as out:
                writer.write(out)
            writer.close()
            created.append(out_path)
            # Report in 1-based page numbers for readability
            print(f"chunk{chunk + 1:03d}: pages {start_idx + 1}–{end_idx} → {out_path.name}")
    return created


if __name__ == "__main__":
    split_every_n(Path("./large_export.pdf"), Path("./chunks"), n=10)
Note: start_idx is already 0-based here because we compute it directly from the chunk index. No subtraction needed — the subtraction is only required when converting a 1-based user input.
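An alternative to a dedicated chunking function is to generate 1-based ranges and pass them to the range-based splitter. The following is a minimal sketch of that idea; chunk_ranges is a hypothetical helper, and it returns the same inclusive (start, end) tuples the range-based split expects.

```python
def chunk_ranges(total: int, n: int) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (start, end) ranges of at most n pages each."""
    if n < 1:
        raise ValueError(f"n must be >= 1, got {n}")
    # Starts are 1, 1 + n, 1 + 2n, ...; the last range is cut off at total.
    return [(start, min(start + n - 1, total)) for start in range(1, total + 1, n)]


print(chunk_ranges(25, 10))  # [(1, 10), (11, 20), (21, 25)]
print(chunk_ranges(40, 10))  # [(1, 10), (11, 20), (21, 30), (31, 40)]
```

Because the output is already 1-based, the usual start - 1 conversion still happens in exactly one place: inside the splitter.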
Variant: Split on Bookmarks
When the PDF has a well-structured outline, split at each top-level bookmark:
# pip install pypdf
from pathlib import Path

from pypdf import PdfReader, PdfWriter


def split_on_bookmarks(input_path: Path, output_dir: Path) -> list[Path]:
    """
    Split a PDF at each top-level bookmark.

    Each section runs from its bookmark's page to the page before the next bookmark.
    Returns list of created file paths.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    created: list[Path] = []
    with open(input_path, "rb") as fh:
        reader = PdfReader(fh)
        total = len(reader.pages)
        # Collect top-level bookmarks (skip nested lists)
        top_level = [
            item for item in reader.outline
            if hasattr(item, "title")
        ]
        if not top_level:
            print("No top-level bookmarks found; nothing to split on.")
            return created

        # Build (start_page_0based, title) pairs
        sections: list[tuple[int, str]] = []
        for item in top_level:
            pg_0based = reader.get_destination_page_number(item)
            if pg_0based is None:
                # Bookmark does not resolve to a page in this document
                print(f"[SKIP] No page target: {item.title!r}")
                continue
            sections.append((pg_0based, item.title))

        # Each section ends one page before the next section starts
        for i, (start_idx, title) in enumerate(sections):
            end_idx = sections[i + 1][0] if i + 1 < len(sections) else total
            if start_idx >= end_idx:
                print(f"[SKIP] Empty section: {title!r}")
                continue
            writer = PdfWriter()
            for page_idx in range(start_idx, end_idx):
                writer.add_page(reader.pages[page_idx])
            # Sanitize title for use as filename
            safe_name = "".join(c if c.isalnum() or c in " _-" else "_" for c in title)
            out_path = output_dir / f"{i + 1:02d}_{safe_name[:40]}.pdf"
            with open(out_path, "wb") as out:
                writer.write(out)
            writer.close()
            created.append(out_path)
            # Display in 1-based page numbers
            print(f"  [{i+1}] '{title}' (p{start_idx+1}–p{end_idx}) → {out_path.name}")
    return created


if __name__ == "__main__":
    split_on_bookmarks(Path("./report_with_bookmarks.pdf"), Path("./sections"))
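get_destination_page_number() returns a 0-based index, so it goes straight into range() with no subtraction; the subtraction applies only to page numbers typed by a user. It can also return None when a bookmark does not resolve to a page, so check for that before using the value.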
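The boundary logic (each section stops where the next one starts) can be isolated from pypdf and tested on plain integers. This is a sketch under assumptions: section_bounds is a hypothetical helper, and unlike the function above it sorts the start indices and merges bookmarks that point at the same page rather than skipping them as empty.

```python
def section_bounds(starts: list[int], total: int) -> list[tuple[int, int]]:
    """Turn 0-based bookmark page indices into (start, stop) pairs.

    stop is exclusive, so each pair can be passed directly to range().
    """
    # Drop indices outside the document, remove duplicates, and sort so
    # out-of-order outlines cannot produce negative-length sections.
    ordered = sorted({s for s in starts if 0 <= s < total})
    return [
        (s, ordered[i + 1] if i + 1 < len(ordered) else total)
        for i, s in enumerate(ordered)
    ]


print(section_bounds([0, 4, 4, 9], total=12))  # [(0, 4), (4, 9), (9, 12)]
```

Pages before the first bookmark are not covered by any pair; if the document has front matter ahead of its first bookmark, prepend 0 to starts.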
Verification
# pip install pypdf
from pathlib import Path

from pypdf import PdfReader

# Assumes split_by_ranges() from the range-based split above is defined in
# this module or imported from wherever you saved it.


def verify_splits(
    source: Path,
    output_dir: Path,
    ranges: list[tuple[int, int]],
) -> bool:
    """
    Confirm each split file has the expected page count.

    Assumes files are named *_part01.pdf, *_part02.pdf, ...
    """
    all_ok = True
    for idx, (start, end) in enumerate(ranges, 1):
        expected = end - start + 1
        out_path = output_dir / f"{source.stem}_part{idx:02d}.pdf"
        try:
            actual = len(PdfReader(out_path).pages)
            status = "OK" if actual == expected else f"FAIL (expected {expected}, got {actual})"
        except Exception as exc:
            status = f"FAIL ({exc})"
        print(f"  part{idx:02d}: {status}")
        if status.startswith("FAIL"):
            all_ok = False
    return all_ok


if __name__ == "__main__":
    source = Path("./annual_report.pdf")
    ranges = [(1, 10), (11, 25), (26, 40)]
    split_by_ranges(source, Path("./output/splits"), ranges)
    ok = verify_splits(source, Path("./output/splits"), ranges)
    print("All splits verified." if ok else "Some splits failed verification.")
The page counts of all split files should add up to the sum of end - start + 1 over the requested ranges. If ranges overlap, a page may appear in multiple outputs; that is valid but worth logging explicitly.
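Both checks, the expected page total and the overlap report, can be computed from the ranges alone. The sketch below is an illustration with hypothetical helper names (expected_total, overlapping_pages); it works on 1-based inclusive tuples and does not touch any files.

```python
from collections import Counter


def expected_total(ranges: list[tuple[int, int]]) -> int:
    """Sum of page counts across all ranges (overlaps counted twice)."""
    return sum(end - start + 1 for start, end in ranges)


def overlapping_pages(ranges: list[tuple[int, int]]) -> list[int]:
    """Return the 1-based page numbers that appear in more than one range."""
    counts: Counter[int] = Counter()
    for start, end in ranges:
        counts.update(range(start, end + 1))  # +1 because end is inclusive
    return sorted(page for page, seen in counts.items() if seen > 1)


ranges = [(1, 5), (3, 8)]
print(expected_total(ranges))     # 11
print(overlapping_pages(ranges))  # [3, 4, 5]
```

An empty list from overlapping_pages means the outputs partition the requested pages with no duplicates.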
After splitting, the individual files can be watermarked or password-protected per recipient using the patterns in Watermarking and Securing PDFs, or re-assembled with cover pages via Generating PDF Reports Dynamically.
Common Mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| range(start, end) with 1-based start | Each output starts at the second page of its range and is one page short | Use range(start - 1, end): subtract 1 from start only |
| Reusing one PdfWriter across ranges | All ranges concatenated into a single output file | Create a new PdfWriter() inside the loop, before each range |
| Not checking end <= total | IndexError when the loop reaches an index past the last page | Validate each range against len(reader.pages) before writing |
| Leaving the source file handle open after splitting | PermissionError on Windows when the source is later moved or deleted; memory growth | Open with with open(path, "rb") as fh: reader = PdfReader(fh) so the handle closes once the ranges are written |
| Using get_destination_page_number() output as 1-based | Off-by-one in bookmark splits | That method already returns 0-based; use directly in range() without subtracting |
FAQ
Why does my split file have one fewer page than expected?
You most likely called range(start, end) with 1-based values, which skips the first page of the range. The fix is range(start - 1, end): end stays unchanged because range() excludes its upper bound, so the 1-based inclusive end is already the correct 0-based exclusive stop.
Can I split overlapping ranges (e.g., pages 1–5 and 3–8)?
Yes. Each range gets its own PdfWriter and output file. Pages 3–5 will appear in both outputs. This is intentional for use cases like generating an "executive summary" (pages 1–5) that overlaps a "full section" (pages 3–8).
How do I split a scanned PDF where there are no bookmarks?
Use the range-based approach with manually specified page boundaries, or run OCR first with the pattern in Scanning and OCR Processing with Python to detect section boundaries from text content.
Does splitting strip PDF forms or annotations?
writer.add_page() does a shallow copy that preserves page-level annotations (highlights, comments, form fields on that page). Interactive form state (/AcroForm) at the document level is not copied; re-attach it with writer.clone_reader_document_root(reader) if needed.
Related
- Merging and Splitting PDF Documents — full reference including bookmark preservation, page reordering, and large-batch streaming
- Batch Merge PDFs with a Python Script — the inverse operation: merge a folder of PDFs with natural sort and error recovery
- Watermarking and Securing PDFs — apply per-recipient watermarks or password protection to the split output files