October 13, 2025
Question

OCR sometimes turns entire scanned PDF page into a single image — no text extracted during OCR + Ext


Hi everyone,

I’m using Adobe PDF Services API with NestJS to process scanned PDFs.
My goal is to run OCR on the scanned file, then extract structured text and layout information for rendering in a Flutter frontend.

However, I’ve noticed that sometimes the entire page is treated as one large image, and no text is extracted — even though the PDF clearly contains readable text after OCR.
This causes layout issues in my Flutter app, where the image layer overlaps or replaces text.

Implementation details

OCR step:

// Submit the OCR job, then poll until its result is available.
const pollingURL = await pdfServices.submit({ job });
const pdfServicesResponse = await pdfServices.getJobResult({
  pollingURL,
  resultType: OCRResult,
});


Extract step:

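const params = new ExtractPDFParams({
  // Extract text and table structure into the JSON output.
  elementsToExtract: [ExtractElementType.TEXT, ExtractElementType.TABLES],
  // Include character-level bounds and font/styling details for layout.
  addCharInfo: true,
  getStylingInfo: true,
  // Also export figures and tables as image renditions.
  elementsToExtractRenditions: [
    ExtractRenditionsElementType.FIGURES,
    ExtractRenditionsElementType.TABLES,
  ],
});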

The issue

  • For some scanned PDFs, Adobe returns text and layout perfectly.

  • But for others, after OCR, the ExtractPDFOperation returns only an image rendition of the page, with few or no text elements.

  • It looks like the OCR recognized the content, but the extract phase still treats the entire page as an image.


My questions

  • Is there a way to ensure OCR always embeds or exposes recognized text, instead of producing a full-page image?

  • Can we configure OCR or ExtractPDF to force text-layer extraction even for low-quality scans?

  • How can I detect programmatically (from the JSON output) when a page has only image content and no text layer?

  • Are there known best practices for chaining OCR → ExtractPDF to ensure consistent text extraction results?
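
On the third question, here is a minimal sketch of one possible check. It assumes the Extract JSON output has a top-level `elements` array whose entries carry a `Page` index, a `Path`, and a `Text` field when they hold text. `summarizePages`, `ExtractElement`, and `PageSummary` are hypothetical names for illustration, not part of the Adobe SDK, and matching `/Figure` in the path is an assumption about how image content is labeled:

```typescript
// Assumed shape of one entry in the "elements" array of the Extract JSON.
interface ExtractElement {
  Page?: number;  // page index of the element (assumption)
  Path?: string;  // structure path, e.g. "//Document/Figure" (assumption)
  Text?: string;  // present only when the element carries text (assumption)
}

interface PageSummary {
  page: number;
  textElements: number;
  figureElements: number;
  imageOnly: boolean;
}

// Hypothetical helper: counts text and figure elements per page and flags
// pages that have at least one figure but no element with non-empty text.
function summarizePages(elements: ExtractElement[]): PageSummary[] {
  const byPage: { [page: number]: PageSummary } = {};

  for (const el of elements) {
    if (el.Page === undefined) continue; // skip elements without a page index
    if (!byPage[el.Page]) {
      byPage[el.Page] = {
        page: el.Page,
        textElements: 0,
        figureElements: 0,
        imageOnly: false,
      };
    }
    const summary = byPage[el.Page];
    if (el.Text !== undefined && el.Text.trim().length > 0) {
      summary.textElements++;
    }
    if (el.Path !== undefined && el.Path.indexOf("/Figure") !== -1) {
      summary.figureElements++;
    }
  }

  return Object.keys(byPage)
    .map((key) => byPage[Number(key)])
    .sort((a, b) => a.page - b.page)
    .map((s) => ({
      ...s,
      imageOnly: s.textElements === 0 && s.figureElements > 0,
    }));
}
```

Pages with no elements at all never appear in the summary, so a caller would still need to compare it against the known page count. The per-page `textElements` count is exposed so that a threshold can catch pages where only part of the text came through.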


💻 Tech context

  • Backend: NestJS

  • Adobe PDF Services SDK: Node.js

  • Frontend: Flutter (renders text + layout from extracted JSON)

  • File type: scanned invoices and forms (mixed quality)


Any guidance, configuration examples, or recommended OCR parameters would be super helpful 🙏
Thanks in advance!