Good evening. My coworker and I are trying to find the fastest way to extract all the text from a flat text PDF and use that text in the rest of our program.
Currently we are able to extract all the text, but Acrobat has become slow. Our program was initially very fast at extracting the text, and we suspect Adobe released an update that made text extraction take longer.
Any help or advice would be greatly appreciated!
What method do you use now? saveAs, getPageNthWord, or something else?
There is no such thing as a flat text PDF. All text in a PDF consists of graphical objects and text streams, which have to be analyzed.
We don't use either of those methods in our program. We currently use getPageNumWords in a loop to find specific strings in our PDF.
Using getPageNumWords in a loop will give you the number of words on each page, but not the words themselves.
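To illustrate the difference, here is a minimal sketch of how getPageNumWords is usually paired with getPageNthWord to get the actual words. The helper name collectWords is made up for this example; it only assumes a Doc object exposing numPages, getPageNumWords and getPageNthWord, as the Acrobat JavaScript API does.

```javascript
// Sketch only: collectWords is a hypothetical helper, not part of the Acrobat API.
// It expects a Doc object (for example `this` in the Acrobat JavaScript console).
function collectWords(doc) {
  var pages = [];
  for (var p = 0; p < doc.numPages; p++) {
    // getPageNumWords returns only the word count for page p.
    var count = doc.getPageNumWords(p);
    var words = [];
    for (var i = 0; i < count; i++) {
      // getPageNthWord returns the i-th word on page p;
      // passing true strips punctuation and whitespace.
      words.push(doc.getPageNthWord(p, i, true));
    }
    pages.push(words);
  }
  // One array of words per page.
  return pages;
}
```

In the console this would be called as collectWords(this). It makes one API call per word, so expect it to be slow on long documents.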
Ah, is there a fast way to extract all of the words themselves? I am tempted to try saving specific pages as a text file, then using that text file to grab information.
The fastest way is to export the PDF file to a text file and then process it elsewhere.
The text processing capabilities of Acrobat are quite limited and very slow. I've written many tools that do that and they almost never work if you try to process more than 100 pages at a time (plus or minus, depending on the complexity and length, of course).
Also, you need to restart Acrobat between each run because its memory handling is terrible and it gets extremely slow if you try to run such a script multiple times.
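As a rough sketch of the "process it elsewhere" part, the following Node.js code searches a text file that has already been exported from the PDF. Everything named here (findMatches, searchExportedFile, the file path you pass in) is a placeholder for illustration, and the export step itself is assumed to have happened beforehand in Acrobat.

```javascript
// Sketch only: all names below are hypothetical placeholders.

// Return one entry per (line, search string) hit, with a 1-based line number.
function findMatches(text, needles) {
  const matches = [];
  const lines = text.split(/\r?\n/); // handles both Windows and Unix line endings
  lines.forEach(function (line, index) {
    for (const needle of needles) {
      if (line.includes(needle)) {
        matches.push({ lineNumber: index + 1, needle: needle, line: line });
      }
    }
  });
  return matches;
}

// Read the exported text file from disk and search it.
// Assumes the file is UTF-8 encoded.
function searchExportedFile(path, needles) {
  const fs = require("fs");
  return findMatches(fs.readFileSync(path, "utf8"), needles);
}
```

For example, searchExportedFile("export.txt", ["Invoice"]) would return every line of that file containing "Invoice". Since this runs outside Acrobat, it is not affected by the slowdown and memory issues described above.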