When using the PDF Services API, the Adobe service fails to recognize multiple pages of the input PDF, even though all pages in the document are identically formatted.
I used Claude.AI to troubleshoot; here is the feedback:
Adobe Extract PDF API is silently dropping pages from this 94-page PDF. The text is fine: there's no difference in text format or layout between the missed and found contracts, and pdftotext extracts every contract perfectly. So the common thread isn't the content, it's which pages Adobe Extract chose to skip. Adobe just doesn't return text elements for certain pages, so the workflow never sees those Contract IDs.
- The input PDF has 94 pages with 52 unique contracts
- All 20 missing contracts have "Contract ID: NNNNNNN" in identical format to the 32 that succeeded, so there's no text/regex issue
- Every page has a footer: "Affidavit: Page X of Y" (that's the numbering you mentioned)
- 18 of 20 missing contracts are single-page (Page 1 of 1)
- Pages 1-6 are ALL missing — the first 6 contracts were entirely skipped
- The remaining missing contracts are scattered (pages 21, 29, 48-55, 74-75, 77, 84, 94)
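
For anyone who wants to reproduce the comparison, here is a minimal Python sketch of the check described above. It is not the actual workflow code. It assumes the Extract output is the usual structuredData.json with an "elements" array whose entries carry a 0-based "Page" and a "Text" field, that the Contract ID is a run of digits, and that pdftotext separates pages with a form feed character. The function names are hypothetical.

```python
import re

# Assumption: the ID after "Contract ID:" is a run of digits.
CONTRACT_ID = re.compile(r"Contract ID:\s*(\d+)")


def pages_without_text(structured_data, total_pages):
    """Return 1-based page numbers for which Extract returned no text elements."""
    seen = set()
    for element in structured_data.get("elements", []):
        # Assumption: "Page" is 0-based and "Text" holds the extracted string.
        if "Page" in element and element.get("Text", "").strip():
            seen.add(element["Page"] + 1)
    return [p for p in range(1, total_pages + 1) if p not in seen]


def contract_ids_by_page(pdftotext_output):
    """Map 1-based page number to the contract IDs pdftotext found on that page."""
    pages = pdftotext_output.split("\f")  # pdftotext ends each page with a form feed
    return {
        number: CONTRACT_ID.findall(text)
        for number, text in enumerate(pages, start=1)
        if CONTRACT_ID.search(text)
    }


if __name__ == "__main__":
    # Tiny synthetic sample, not data from the real 94-page file.
    sample = {"elements": [{"Page": 0, "Text": "Contract ID: 1234567"}]}
    print(pages_without_text(sample, total_pages=3))
    print(contract_ids_by_page("Contract ID: 1234567\fContract ID: 7654321\f"))
```

Running pages_without_text on the real structuredData.json with total_pages=94 and comparing the result with the keys of contract_ids_by_page should show whether the skipped pages line up with the missing Contract IDs.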
