Fast ways to extract flat text from a PDF document?

Explorer, Feb 25, 2019

Good evening. My coworker and I are trying to find the fastest way to extract all the text from a flat-text PDF so that we can use it in the rest of our program.

Currently we can extract all the text, but Acrobat is slow at it. Our program was initially very fast at extracting the text, and we suspect an Adobe update may have made extraction take longer.

Any help or advice would be greatly appreciated!

TOPICS
Acrobat SDK and JavaScript, Windows

Correct answer

LEGEND, Feb 25, 2019

getPageNthWord is essentially what Acrobat does to get each word in any PDF, for export or anything else. To get the text on a page, Acrobat has to scan all the graphics on the page and assemble the text using guesswork and fuzzy logic. getPageNthWord is slow mainly because of the overhead of handling each individual word in JavaScript. The plug-in API does the same job in C, with much less overhead, so a plug-in may be the way to go.
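
For reference, here is a rough sketch of the word-by-word approach described above, so it is clear where the per-word overhead comes from. It assumes `doc` is an Acrobat Doc object (for example `this` in the JavaScript console); `extractPageText` and `extractAllText` are hypothetical helper names, not part of the Acrobat API. Every word costs one `getPageNthWord` call, which is why the time grows with the word count.

```javascript
// Sketch only: `doc` is assumed to be an Acrobat Doc object.
// extractPageText and extractAllText are hypothetical helpers.
function extractPageText(doc, pageNum) {
  var count = doc.getPageNumWords(pageNum); // number of words on this page
  var words = new Array(count);
  for (var i = 0; i < count; i++) {
    // One API call per word. The third argument (bStrip = true) asks Acrobat
    // to strip punctuation and whitespace from the returned word.
    words[i] = doc.getPageNthWord(pageNum, i, true);
  }
  // Join once at the end instead of concatenating inside the loop.
  return words.join(" ");
}

function extractAllText(doc) {
  var pages = [];
  for (var p = 0; p < doc.numPages; p++) {
    pages.push(extractPageText(doc, p));
  }
  return pages; // one string per page, in page order
}
```

In Acrobat this could be called as `extractAllText(this)`. Because punctuation is stripped, the result suits word matching better than exact phrase matching.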

LEGEND, Feb 25, 2019

What method do you use now: saveAs, getPageNthWord, or something else?

There is no such thing as a flat text PDF. All text in a PDF consists of graphical objects and text streams, which have to be analyzed.

Explorer, Feb 25, 2019

We don't use either of those methods in our program. We currently use getPageNumWords in a loop to find specific strings in our PDF.

LEGEND, Feb 25, 2019

Using getPageNumWords in a loop will give you the number of words on each page, but not the words themselves.
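
To illustrate the difference, a search loop typically uses `getPageNumWords` only as the loop bound and calls `getPageNthWord` to fetch each word. The sketch below assumes `doc` is an Acrobat Doc object; `findPagesWithWord` is a hypothetical helper name, and the case-insensitive comparison is an assumption about what the search should do.

```javascript
// Sketch only: `doc` is assumed to be an Acrobat Doc object.
// findPagesWithWord is a hypothetical helper, not an Acrobat API.
function findPagesWithWord(doc, target) {
  var wanted = target.toLowerCase();
  var hits = [];
  for (var p = 0; p < doc.numPages; p++) {
    var count = doc.getPageNumWords(p); // a count only, no text
    for (var i = 0; i < count; i++) {
      var word = doc.getPageNthWord(p, i, true); // the i-th word itself
      if (word.toLowerCase() === wanted) {
        hits.push(p); // 0-based page index
        break;        // stop scanning this page after the first match
      }
    }
  }
  return hits;
}
```

Breaking out of the inner loop on the first match avoids further per-word calls on that page, which is where most of the time goes.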

Explorer, Feb 25, 2019

Ah, is there a fast way to extract all of the words themselves? I am tempted to try saving specific pages as a text file, then using that text file to grab information.

Community Expert, Feb 25, 2019

The fastest way is to export the PDF file to a text file and then process it elsewhere.

The text processing capabilities of Acrobat are quite limited and very slow. I've written many tools that do that and they almost never work if you try to process more than 100 pages at a time (plus or minus, depending on the complexity and length, of course).

Also, you need to restart Acrobat between each run because its memory handling is terrible and it gets extremely slow if you try to run such a script multiple times.
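
As a rough sketch of that export-then-process approach: the first function assumes `doc` is an Acrobat Doc object and uses `saveAs` with the plain-text conversion ID. Acrobat restricts `saveAs` to privileged contexts such as the console or a batch sequence, and the path must be device-independent (the example path is an assumption). The second function is meant to run outside Acrobat, for example in Node.js, on the contents of the exported file. Both helper names are hypothetical.

```javascript
// Acrobat side. Sketch only: `doc` is assumed to be an Acrobat Doc object and
// `devicePath` a device-independent path such as "/c/temp/out.txt" (assumed).
// exportAsPlainText is a hypothetical helper.
function exportAsPlainText(doc, devicePath) {
  doc.saveAs({
    cPath: devicePath,
    cConvID: "com.adobe.acrobat.plain-text" // plain-text conversion ID
  });
}

// Outside Acrobat: search the exported text, passed in as one string.
// findLines is a hypothetical helper returning 1-based line numbers.
function findLines(text, needle) {
  var lines = text.split(/\r?\n/); // handle Windows and Unix line endings
  var hits = [];
  for (var i = 0; i < lines.length; i++) {
    if (lines[i].indexOf(needle) !== -1) {
      hits.push({ line: i + 1, text: lines[i] });
    }
  }
  return hits;
}
```

With this split, Acrobat does the text assembly once during export, and the string search runs without any per-word calls into Acrobat.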
