Fast ways to extract flat text from a PDF document?

Report · Feb 25, 2019

Good evening. My coworker and I are trying to find the fastest way to extract all the text from a flat text PDF and use that text in the rest of our program.

Currently, we are able to extract all the text, but Acrobat is slow at it. Our program was initially very fast at extracting all of the text, and we suspect an Adobe update may have made text extraction take longer.

Any help or advice would be greatly appreciated!

Report · Feb 25, 2019

What method do you use now: saveAs, getPageNthWord, or something else?

There is no such thing as a flat text PDF. All text in a PDF consists of graphical objects and text streams, which have to be analyzed.

Report · Feb 25, 2019

We don't use either of those methods in our program. We currently use getPageNumWords in a loop to find specific strings in our PDF.

Report · Feb 25, 2019

Using getPageNumWords in a loop will give you the number of words on each page, but not the words themselves.
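To illustrate the difference, here is a minimal sketch that pairs getPageNumWords (the count) with getPageNthWord (the word itself). It assumes it runs in Acrobat's JavaScript console or another context where the open document is available; collectWords is a hypothetical helper name, not part of the Acrobat API.

```javascript
// Hypothetical helper: gather the words of every page of a document.
// "doc" is assumed to be an Acrobat Doc object (for example "this" in the console).
function collectWords(doc) {
  var pages = [];
  for (var p = 0; p < doc.numPages; p++) {
    // getPageNumWords only returns how many words the page has.
    var count = doc.getPageNumWords(p);
    var words = [];
    for (var i = 0; i < count; i++) {
      // getPageNthWord returns the i-th word; the third argument (true)
      // asks Acrobat to strip punctuation and whitespace from it.
      words.push(doc.getPageNthWord(p, i, true));
    }
    pages.push(words.join(" "));
  }
  return pages; // one string per page
}

// In the Acrobat console, usage would be along the lines of:
//   var pages = collectWords(this);
```

Each word still costs one JavaScript call, so this is likely to be slow on long documents.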

Report · Feb 25, 2019

Ah, is there a fast way to extract all of the words themselves? I am tempted to try saving specific pages as a text file, then using that text file to grab information.

Report · Feb 25, 2019

The fastest way is to export the PDF file to a text file and then process it elsewhere.

The text processing capabilities of Acrobat are quite limited and very slow. I've written many tools that do that and they almost never work if you try to process more than 100 pages at a time (plus or minus, depending on the complexity and length, of course).

Also, you need to restart Acrobat between each run because its memory handling is terrible and it gets extremely slow if you try to run such a script multiple times.
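The export approach described above could look roughly like the sketch below. It assumes a privileged context (the console or a batch/Action), since saveAs is restricted elsewhere. The helper names exportAsPlainText and findPhrases and the example path are hypothetical; the search function is plain JavaScript and could just as well run outside Acrobat on the exported file.

```javascript
// Hypothetical helper: export an Acrobat Doc object as a plain-text file.
function exportAsPlainText(doc, devicePath) {
  doc.saveAs({
    cPath: devicePath,                       // device-independent path, e.g. "/c/temp/out.txt" (example only)
    cConvID: "com.adobe.acrobat.plain-text"  // conversion ID for plain-text export
  });
  return devicePath;
}

// Hypothetical helper: report which of the given phrases occur in the exported text.
function findPhrases(text, phrases) {
  var found = [];
  for (var i = 0; i < phrases.length; i++) {
    if (text.indexOf(phrases[i]) !== -1) {
      found.push(phrases[i]);
    }
  }
  return found;
}

// In the Acrobat console, usage would be along the lines of:
//   exportAsPlainText(this, "/c/temp/out.txt");
```

Doing the string search on the exported file keeps the per-word work out of Acrobat's JavaScript engine entirely.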

Report · Feb 25, 2019

getPageNthWord is essentially what Acrobat does to get each word in every PDF, for export or anything else. To get the text on a page, Acrobat has to scan all graphics on the page, and assemble the text, using guesswork and fuzzy logic. But getPageNthWord is slow essentially because of the overhead of handling each single word in JavaScript. The plug-in API does the same job, but in C, so with much less overhead. So a plug-in may be the way to go.
