Adobe PDF Extract API Question on Extraction Method

Aug 23, 2023

How does the Adobe PDF Extract API extract text from a PDF (to go from PDF to CSV)?

Does it extract the PDF's native text and convert it to CSV, as if we were doing it ourselves in Acrobat on the desktop? Or does it always use OCR and Sensei AI to extract and structure the text?

Basically, I am trying to understand how much this relies on AI versus Adobe's native ability to convert a PDF into CSV based on the actual text/characters.

Aug 23, 2023

We use both AI and algorithms, but we only run OCR when we get an image-only PDF. Most of the time we operate on the native PDF content.
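To make the "OCR only for image-only PDFs" rule concrete, here is a minimal sketch of that kind of decision. It is not Adobe's actual logic, which isn't public in this thread. The names `needs_ocr`, `choose_path`, and `min_chars` are hypothetical, and `page_texts` is assumed to be the per-page text you already pulled out with whatever PDF text extractor you use.

```python
from typing import Sequence


def needs_ocr(page_texts: Sequence[str], min_chars: int = 1) -> bool:
    """Return True when no page has a usable text layer.

    Hypothetical heuristic: a page counts as "native" if it has at least
    `min_chars` non-whitespace-trimmed characters of extractable text.
    """
    return all(len(text.strip()) < min_chars for text in page_texts)


def choose_path(page_texts: Sequence[str]) -> str:
    """Pick the processing path: OCR only when the PDF looks image-only."""
    return "ocr" if needs_ocr(page_texts) else "native"


if __name__ == "__main__":
    scanned = ["", "   "]            # no text layer on any page
    digital = ["Invoice 42", ""]     # at least one page has real text
    print(choose_path(scanned))
    print(choose_path(digital))
```

Under this assumption a document with even one page of real text stays on the native path, which matches the "most of the time we operate on native PDF" behavior described above.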

Aug 23, 2023

So Export / Convert PDF converts PDF to XLSX using the native PDF content, as if I were doing it in Acrobat Desktop, with no AI or OCR.

And then Extract PDF uses AI / algorithms to extract text, images (OCR), and tables.

Is this the correct way to understand it?

Aug 23, 2023

Correct. The AI in Extract does a much better job of "understanding" complex tables, for example tables with merged cells and rows with vertically and horizontally centered cells.