
Fast ways to extract flat text from a PDF document?

February 25, 2019

Good evening. My coworker and I are trying to find the fastest way to extract all the text from a flat-text PDF and then use that text in the rest of our program.

Currently, we are able to extract all the text, but Acrobat is slow at it. Our program was initially very fast at extracting the text. Could an Adobe update have made text extraction take longer?

Any help or advice would be greatly appreciated!


Correct answer by Test Screen Name

getPageNthWord is essentially what Acrobat itself does to get each word in any PDF, for export or anything else. To get the text on a page, Acrobat has to scan all the graphics on the page and assemble the text using guesswork and fuzzy logic. getPageNthWord is slow mainly because of the overhead of handling every single word in JavaScript. The plug-in API does the same job in C, with much less overhead, so a plug-in may be the way to go.
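For reference, the per-word loop described above looks roughly like the sketch below. It assumes `doc` is an Acrobat Doc object (for example `this` in the JavaScript console); the function names `extractPageWords` and `extractAllText` are made up for illustration. Every word costs one call across the JavaScript boundary, which is where the overhead comes from.

```javascript
// Sketch only: `doc` is assumed to be an Acrobat Doc object.
// extractPageWords / extractAllText are hypothetical helper names.
function extractPageWords(doc, pageIndex) {
  var count = doc.getPageNumWords(pageIndex); // number of words on this page
  var words = new Array(count);
  for (var i = 0; i < count; i++) {
    // One API call per word; the third argument (bStrip = true)
    // asks Acrobat to strip punctuation and white space.
    words[i] = doc.getPageNthWord(pageIndex, i, true);
  }
  return words;
}

function extractAllText(doc) {
  var pages = [];
  for (var p = 0; p < doc.numPages; p++) {
    // Join once per page instead of concatenating strings word by word.
    pages.push(extractPageWords(doc, p).join(" "));
  }
  return pages; // one string per page
}
```

In the console this could be called as `extractAllText(this)`. It does not remove the per-word cost; it only avoids adding more on top of it.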



Test Screen Name

Legend

Using getPageNumWords in a loop will give you the number of words on each page, but not the words themselves.

logistics227043683 (Author)

Known Participant

Ah, is there a fast way to extract all of the words themselves? I am tempted to try saving specific pages as a text file, then using that text file to grab information.

try67

Community Expert

The fastest way is to export the PDF file to a text file and then process it elsewhere.

The text processing capabilities of Acrobat are quite limited and very slow. I've written many tools that do this, and they almost never work if you try to process more than about 100 pages at a time (give or take, depending on the complexity and length of the file, of course).

Also, you need to restart Acrobat between each run because its memory handling is terrible and it gets extremely slow if you try to run such a script multiple times.
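A rough sketch of that export-then-process approach, under a few assumptions: `exportAsPlainText` and `findStrings` are hypothetical names, `doc.saveAs` has to run in a privileged context (an Action/batch sequence or a trusted function), and the output path has to be a device-independent path such as `/c/temp/out.txt`. The search half is plain JavaScript and is meant to run outside Acrobat on the contents of the exported file.

```javascript
// Inside Acrobat (privileged context assumed): write the document out as plain text.
// `doc` is assumed to be a Doc object; outputPath is a device-independent path.
function exportAsPlainText(doc, outputPath) {
  doc.saveAs({
    cPath: outputPath,
    cConvID: "com.adobe.acrobat.plain-text" // plain-text conversion ID
  });
}

// Outside Acrobat: given the exported text as a string, report which
// lines contain which of the search strings (1-based line numbers).
function findStrings(text, needles) {
  var hits = [];
  var lines = text.split(/\r?\n/);
  for (var n = 0; n < lines.length; n++) {
    for (var k = 0; k < needles.length; k++) {
      if (lines[n].indexOf(needles[k]) !== -1) {
        hits.push({ line: n + 1, needle: needles[k] });
      }
    }
  }
  return hits;
}
```

The point of the split is that Acrobat does the text assembly once, natively, and the string matching happens on an ordinary text file where it is cheap.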


Test Screen Name

Legend

What method do you use now? saveAs, getPageNthWord, or something else?

There is no such thing as a flat-text PDF. All text in a PDF consists of graphical objects and text streams, which have to be analyzed.

logistics227043683 (Author)

Known Participant

We don't use either of those methods in our program. We currently use getPageNumWords in a loop to find specific strings in our PDF.
