Good evening. My coworker and I are trying to find the fastest way to extract all the text from a flat text PDF so we can use that text in the rest of our program.
Currently, we are able to extract all the text, but Acrobat seems to be slow. Our program was initially very fast at extracting all of the text, but we think Adobe might have released an update that made text extraction take longer?
Any help or advice would be greatly appreciated!
What method do you use now? saveAs, getPageNthWord, or something else?
There is no such thing as a flat text PDF. All text in a PDF consists of graphical objects and text streams, which have to be analyzed.
We don't use either of those methods in our program. We currently use getPageNumWords in a loop to find specific strings in our PDF.
Using getPageNumWords in a loop will only give you the number of words on each page, not the words themselves.
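To actually read the words, getPageNumWords is normally paired with getPageNthWord. Here is a minimal sketch of that loop, not necessarily what your program does. The function name collectWords is made up for illustration, and doc is assumed to be an Acrobat Doc object (for example, this in a document-level script).

```javascript
// Sketch: gather every word in a document, page by page.
// `collectWords` is a hypothetical helper; `doc` is assumed to be an Acrobat Doc object.
function collectWords(doc) {
  var pages = [];
  for (var p = 0; p < doc.numPages; p++) {
    var count = doc.getPageNumWords(p); // number of words on page p
    var words = [];
    for (var i = 0; i < count; i++) {
      // Third argument (bStrip) = true removes punctuation and whitespace.
      words.push(doc.getPageNthWord(p, i, true));
    }
    pages.push(words);
  }
  return pages; // one array of words per page
}
```

Every word costs one JavaScript call here, so expect this to get slow on long documents.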
Ah, is there a fast way to extract all of the words themselves? I am tempted to try saving specific pages as a text file, then using that text file to grab information.
The fastest way is to export the PDF file to a text file and then process it elsewhere.
The text processing capabilities of Acrobat are quite limited and very slow. I've written many tools that do that and they almost never work if you try to process more than 100 pages at a time (plus or minus, depending on the complexity and length, of course).
Also, you need to restart Acrobat between each run because its memory handling is terrible and it gets extremely slow if you try to run such a script multiple times.
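For the export route, a rough sketch using the Doc saveAs method with the plain-text conversion ID might look like the following. The helper name exportAsText and the example path are hypothetical, and this assumes the script runs somewhere saveAs is permitted (the console, a batch sequence, or a trusted function), since saveAs is security-restricted.

```javascript
// Sketch: export the document as plain text so another tool can process it.
// `exportAsText` is a hypothetical helper; `doc` is assumed to be an Acrobat Doc object.
function exportAsText(doc, cPath) {
  doc.saveAs({
    cPath: cPath,                            // device-independent path, e.g. "/c/temp/out.txt" (made-up example)
    cConvID: "com.adobe.acrobat.plain-text"  // conversion ID for plain-text output
  });
  return cPath; // hand the path to whatever processes the text afterwards
}
```

The resulting text file can then be searched by your own program outside Acrobat, which avoids the per-word scripting overhead.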
getPageNthWord is essentially what Acrobat does to get each word in every PDF, for export or anything else. To get the text on a page, Acrobat has to scan all graphics on the page, and assemble the text, using guesswork and fuzzy logic. But getPageNthWord is slow essentially because of the overhead of handling each single word in JavaScript. The plug-in API does the same job, but in C, so with much less overhead. So a plug-in may be the way to go.