Extract text from PDF with JavaScript

New Here, Sep 07, 2022

I have a slew of PDF documents from which I need to extract data. I am using 32-bit Adobe Acrobat Pro. I believe the files were originally scanned from Excel, but I have no way of knowing. I have tried to pull the data using various tools, including the most straightforward method, exporting to Excel. This works, but I am looking for a more elegant solution. I would like to use JavaScript to iterate through the documents, which all have the same structure. My current stumbling block is that there are undefined fields, mostly text fields, and I am not familiar enough with the object model to iterate through them programmatically. How can I use the debugger to list each field? When I look at the document as a form, the fields I need to identify have no properties window, but I am assuming I can still manipulate them with a script.

Thanks! (I can't upload a sample file, btw)

TOPICS
Acrobat SDK and JavaScript

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more

1 Correct answer: LEGEND, Sep 09, 2022 (the full reply appears in its place later in the thread)

Community Expert, Sep 07, 2022

What do you see when you use Tools > Prepare Form?

Community Expert, Sep 07, 2022

Hi,

What do you mean by "undefined field"? A blank field?

If possible please share an example.

@+

LEGEND, Sep 08, 2022

Do you mean that they do not respond in the Form editor, so they aren't form fields at all? Or are you using "field" to just mean "text on page"?
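
One way to check which items are real form fields is to list them from the JavaScript console. The following is a minimal sketch, not a definitive implementation: the helper name listFields is made up for illustration, and it assumes it is handed the Acrobat Doc object (this in the console). It relies on the Doc members numFields, getNthFieldName and getField. Any text that does not show up in the result is page content rather than a form field.

```javascript
// Hypothetical helper: returns one record per form field in the document.
// `doc` is assumed to be the Acrobat Doc object (pass `this` from the console).
function listFields(doc) {
    var rows = [];
    for (var i = 0; i < doc.numFields; i++) {
        var name = doc.getNthFieldName(i);
        var field = doc.getField(name);
        // Skip names that do not resolve to a field object.
        if (field === null || field === undefined) {
            continue;
        }
        rows.push({ name: name, type: field.type, value: field.value });
    }
    return rows;
}
```

In the Acrobat console, the result of listFields(this) can be looped over and each record printed with console.println, for example as name, type and value on one line.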

New Here, Sep 09, 2022

I'll try to explain what I am seeing; the form I am using has too much PHI on it to share.

On the right-hand side there is a listing, by page, of the fields Acrobat identifies or generates when I run Prepare Form. Some of the text on the page is not picked up in the field list. I can select this text, but I can't "do" anything with it. I am guessing this is a function of how the document was originally made into a PDF, as well as how the Acrobat form engine then processes the PDF. The data I'd like to inspect and manipulate is there, and when I export it to an XML file, for example, I can see it and go from there. That is not, for me, an optimal solution. I'd rather do all the necessary processing within Acrobat, leveraging the built-in JavaScript engine.

Thanks!!

LEGEND, Sep 09, 2022

OK, I probably misunderstand, but if I am right, you are trying to use Prepare Form to get the information on the page organised so you can extract the text as form fields? I'd say that's a complete non-starter. Prepare Form is wild guesswork: it looks at where the text is on the page and what lines are drawn. Only the things it decides (somehow) are form fields become fields; the rest is considered background and left alone.

The canonical (but difficult) way to extract text with JavaScript is the pair of Doc methods getPageNthWord and getPageNthWordQuads. These give you the text and position of each word, separately, one at a time. If your target layout is absolutely fixed, they can do a pretty good job. Otherwise, you are going to have to do a lot of guesswork. Reusing PDF files is often necessary, but isn't what they were designed for.
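
As a rough illustration of that word-by-word approach, here is a minimal sketch under some assumptions. The function names collectWords and boundingBox and the shape of the returned records are invented for this example. The Doc members used are numPages, getPageNumWords, getPageNthWord and getPageNthWordQuads. The bounding box is computed as the minimum and maximum of the quad coordinates, so it does not depend on the order in which the corners are reported.

```javascript
// Hypothetical helper: collect every word on every page with its position.
// `doc` is assumed to be the Acrobat Doc object (pass `this` from the console).
function collectWords(doc) {
    var words = [];
    for (var p = 0; p < doc.numPages; p++) {
        var count = doc.getPageNumWords(p);
        for (var i = 0; i < count; i++) {
            // Third argument true asks for punctuation and whitespace to be stripped.
            var text = doc.getPageNthWord(p, i, true);
            var quads = doc.getPageNthWordQuads(p, i);
            words.push({ page: p, index: i, text: text, box: boundingBox(quads) });
        }
    }
    return words;
}

// Reduce a quads list (an array of 8-number arrays) to one rectangle.
// Returns null when there are no quads.
function boundingBox(quads) {
    var xs = [];
    var ys = [];
    for (var q = 0; q < quads.length; q++) {
        for (var k = 0; k + 1 < quads[q].length; k += 2) {
            xs.push(quads[q][k]);
            ys.push(quads[q][k + 1]);
        }
    }
    if (xs.length === 0) {
        return null;
    }
    return {
        left: Math.min.apply(null, xs),
        right: Math.max.apply(null, xs),
        bottom: Math.min.apply(null, ys),
        top: Math.max.apply(null, ys)
    };
}
```

Called as collectWords(this) in the console, this yields one record per word. What to do with the positions still depends entirely on the layout of the documents.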

Likes

Translate

Translate

Report

Report
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
New Here, Sep 09, 2022

I was afraid of that, but not surprised. I have used getPageNthWord, as you suggested, but I need a bit more functionality than that. The layout is fixed, but there is variability in the length of the recognized fields, etc. I am having some luck using the "M" ETL language that now comes with Excel on the XML data I pull from the PDF. Not great, but workable. Having the original doc would be best... so it goes! Thanks.
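
For a fixed layout with values of varying length, one possible approach is to define a rectangle per value and join whatever words fall inside it. The sketch below is only an illustration: the function name textInRegion, the word record shape ({ page, text, box }) and the region coordinates are all hypothetical. The word records are assumed to have been built beforehand from getPageNthWord and getPageNthWordQuads. Words are kept in the order they are passed in, so no reading-order guess is made here.

```javascript
// Hypothetical helper: join the words whose box centre lies inside a region.
// words:  array of { page, text, box: { left, right, bottom, top } }
// region: { page, left, right, bottom, top } in the same coordinate space
function textInRegion(words, region) {
    var parts = [];
    for (var i = 0; i < words.length; i++) {
        var w = words[i];
        if (w.page !== region.page || !w.box) {
            continue;
        }
        var cx = (w.box.left + w.box.right) / 2;
        var cy = (w.box.bottom + w.box.top) / 2;
        var inside = cx >= region.left && cx <= region.right &&
                     cy >= region.bottom && cy <= region.top;
        if (inside) {
            parts.push(w.text);
        }
    }
    // Input order is preserved; a value of any length is picked up
    // as long as it stays within the rectangle.
    return parts.join(" ");
}
```

Testing the centre of each word rather than its full box means a word that slightly overhangs the rectangle is still counted, which may or may not suit a given form.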
