feedback_pdf_text_extraction

For any Claude agent reading a book from ~/off-github/library/indexed/<slug>/, the primary read source is <slug>-text.txt (produced by ~/off-github/library/extract-pdf-text.sh via pdftotext -layout + [[PAGE N]] markers at form-feeds). The PDF stays canonical for pagination + visuals; only open it directly to verify a diagram, table, or fine typography.
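As a rough illustration of the extraction step described above, the sketch below runs pdftotext -layout and turns its form feeds into [[PAGE N]] markers. It is not the contents of extract-pdf-text.sh, which this note does not show. The function names and the choice to put each marker on its own line at the start of the page are assumptions made for the example.

```python
import subprocess
from pathlib import Path


def add_page_markers(raw: str) -> str:
    """Prefix each page of pdftotext output with a [[PAGE N]] marker line.

    pdftotext ends every page with a form feed, so splitting on it gives
    one chunk per page. Marker placement here is an assumption.
    """
    pages = raw.split("\f")
    # The form feed after the last page leaves an empty trailing chunk.
    if pages and pages[-1] == "":
        pages.pop()
    out = []
    for number, page in enumerate(pages, start=1):
        body = page.rstrip("\n")
        out.append(f"[[PAGE {number}]]\n{body}\n")
    return "".join(out)


def extract_text(pdf_path: Path, txt_path: Path) -> None:
    """Hypothetical wrapper: extract layout-preserving text and add markers."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],  # "-" writes to stdout
        check=True,
        capture_output=True,
        encoding="utf-8",
    )
    txt_path.write_text(add_page_markers(result.stdout), encoding="utf-8")
```

Keeping the marker logic in a pure function means it can be checked without a PDF or poppler-utils on hand.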

Why: Phase 2.5 parallel dispatch on Tuesday 21 April 2026 had 2 of 7 agents fail against the Claude Code Read tool’s 32 MB request-size limit when reading image-heavy PDFs (Cheshire 18 MB, Hallaq 4.9 MB). Recovery via EPUB worked but Rich pushed back on “EPUB-first” — EPUBs are flowable (page refs not stable), not all books have them (Meeker, Labra Gayo are PDF-only), and the Read tool doesn’t support EPUB natively (requires Python zip extraction). Text extraction via pdftotext -layout is the standard Unix approach: poppler-utils already installed, 1-2 sec per book, 3-60× size reduction (Allemang 52 MB → 0.9 MB extreme; Cheshire 17.5 MB → 6.1 MB typical), page numbers preserved via [[PAGE N]] markers.

How to apply:

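In agent prompts (note-taking, critique, research), direct agents at <slug>-text.txt with Read(offset, limit) + grep -nF '[[PAGE 345]]' for navigation (-F because unescaped square brackets are parsed as a regex bracket expression; -n gives the line number to use as the Read offset).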
Do NOT send large PDF page ranges to the model unless the agent specifically needs to see a diagram — send text.
If <slug>-text.txt is missing for a book, run ~/off-github/library/extract-pdf-text.sh <slug>. The script is idempotent: it skips existing files unless --force is passed.
At indexing time, always extract text as step 4 of the per-book indexing workflow (see filing-system.md v1.2).
The principle is general: if Claude keeps biting off more than it can chew, the fix is pre-extraction + chunked consumption, not bigger reads or format-switching.
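The chunked-consumption pattern in the steps above can be sketched as a page lookup over the extracted text. This is a minimal example, assuming each [[PAGE N]] marker sits on its own line; page_slice and MARKER are hypothetical names, not part of any existing tool.

```python
import re

# Assumption: a marker occupies a whole line, e.g. "[[PAGE 345]]".
MARKER = re.compile(r"\[\[PAGE (\d+)\]\]")


def page_slice(lines: list[str], first: int, last: int) -> list[str]:
    """Return the lines covering pages first..last (inclusive)."""
    start = None
    end = None
    for index, line in enumerate(lines):
        match = MARKER.fullmatch(line.strip())
        if match is None:
            continue
        number = int(match.group(1))
        if number == first and start is None:
            start = index
        elif number > last and start is not None:
            # First marker past the requested range closes the slice.
            end = index
            break
    if start is None:
        raise ValueError(f"no marker found for page {first}")
    # end stays None when the range runs to the end of the file.
    return lines[start:end]
```

An agent would get the same effect with grep for the marker followed by Read(offset, limit); the point is that only the requested pages are loaded, never the whole book.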

Do NOT:

Default to EPUB-first. Rich rejected this: EPUB page references are not stable, and some books are PDF-only.
Feed raw PDF page ranges to agents for bulk text extraction. That is what tripped the 32 MB limit twice.
Assume the Read tool will handle large PDFs gracefully — it won’t if image content is present.
