Extract text from PDF with JavaScript

Report · Sep 07, 2022

I have a slew of PDF documents from which I need to extract data. I am using 32-bit Adobe Acrobat Pro. I believe the files were originally scanned from Excel, but I have no way of knowing. I have tried to pull the data using various tools, including the most straightforward method: exporting to Excel. This works, but I am looking for a more elegant solution. I would like to use JavaScript to iterate through the documents, which all have the same structure. My current stumbling block is that there are undefined fields, mostly text fields, and I am not familiar enough with the object model to iterate through them programmatically. How can I use the debugger to iterate through and list each field? When I look at the document as a form, the fields I need to identify have no properties window, but I am assuming I can still manipulate them with a script.

Thanks! (I can't upload a sample file, btw)

Report · Sep 07, 2022

What do you see when you use Tools > Prepare Form?

Report · Sep 07, 2022

Hi,

What do you mean by "undefined field"? A blank field?

If possible, please share an example.

@+

Report · Sep 08, 2022

Do you mean that they do not respond in the Form editor, so they aren't form fields at all? Or are you using "field" to just mean "text on page"?

Report · Sep 09, 2022

I'll try to explain what I am seeing - the form I am using has too much PHI on it to share.

On the right-hand side, there is a listing, by page, of the fields identified or generated by Acrobat when I run the Prepare Form tool. Some of the text on the page is not picked up in the field list. I can select this text, but I can't "do" anything with it. I am guessing this is a function of how the document was originally made into a PDF, as well as how the Acrobat form engine then processes the PDF. The data I'd like to inspect and manipulate is there, and when I export it to an XML file, for example, I can see it and go from there. That is not, for me, an optimal solution. I'd rather do all the necessary processing within Acrobat, leveraging the built-in JavaScript engine.

Thanks!!

Report · Sep 09, 2022

Ok, I probably misunderstand, but if I am right, you are trying to use Prepare Form to get the information on the page organised so you can extract the text as form fields? I'd say that's a complete non-starter. Prepare Form is wild guesswork looking at where the text is on the page and what lines are drawn. Only things that it decides (somehow) are form fields become fields; the rest is considered background and left alone.
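To see exactly which items Prepare Form turned into fields (and so which text was left as background), a short script can list them. This is only a sketch: listFields is a made-up helper name, and it assumes it is handed an Acrobat Doc object (for example `this` in the JavaScript console) exposing numFields, getNthFieldName and getField.

```javascript
// Sketch: collect name, type, page and value for every form field in a document.
// `doc` is assumed to be an Acrobat Doc object; listFields is a hypothetical helper.
function listFields(doc) {
  var rows = [];
  for (var i = 0; i < doc.numFields; i++) {
    var name = doc.getNthFieldName(i);
    var field = doc.getField(name); // may be null if the name cannot be resolved
    rows.push({
      name: name,
      type: field ? field.type : null,
      page: field ? field.page : null,
      value: field ? field.value : null
    });
  }
  return rows;
}
```

In the console, the result of listFields(this) can be looped over and printed with console.println. Any text that does not show up in this list is not a field and has to be read as page text instead.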

The canonical (but difficult) way to extract text with JavaScript is getPageNthWord and getPageNthWordQuads. This gives you the text and position of each word, separately, one at a time. If your target layout is absolutely fixed, it can do a pretty good job. Otherwise, you are going to have to do a lot of guesswork. Reusing PDF files is often necessary, but isn't what they were designed for.
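As a rough illustration of that approach, the sketch below walks one page word by word and records each word with the left and top edge of its first quad. extractWords is a made-up helper name, and the code assumes a Doc object providing getPageNumWords, getPageNthWord and getPageNthWordQuads, with each quad being a list of eight numbers (four x/y pairs) in default user space, where y increases up the page.

```javascript
// Sketch: return every word on a page together with an approximate position.
// `doc` is assumed to be an Acrobat Doc object; pageNum is zero-based.
function extractWords(doc, pageNum) {
  var words = [];
  var count = doc.getPageNumWords(pageNum);
  for (var i = 0; i < count; i++) {
    var quads = doc.getPageNthWordQuads(pageNum, i);
    // A word can have several quads (e.g. when hyphenated); use the first one.
    var q = quads && quads.length ? quads[0] : null;
    words.push({
      index: i,
      // Third argument true strips punctuation and white space from the word.
      text: doc.getPageNthWord(pageNum, i, true),
      // Take min x and max y so the result does not depend on corner order.
      left: q ? Math.min(q[0], q[2], q[4], q[6]) : null,
      top: q ? Math.max(q[1], q[3], q[5], q[7]) : null
    });
  }
  return words;
}
```

Calling extractWords(this, 0) in the console would return the words of the first page; passing false instead of true to getPageNthWord keeps punctuation, which may matter for values such as dates or amounts.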

Report · Sep 09, 2022

I was afraid of that, but not surprised. I have used getPageNthWord, as you suggested, but I need a bit more functionality than that. The layout is fixed, but there is variability in the length of recognized fields, etc. I am having some luck using the "M" ETL language, which now comes with Excel, on the XML data I pull from the PDF. Not great, but workable. Having the original doc would be best... so it goes! Thanks.
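For a fixed layout whose values vary in length, one possible way to get more out of the word-by-word data is to group words into rows by their vertical position and then read each row left to right. The sketch below assumes an array of records shaped like { text, left, top } built from getPageNthWord and getPageNthWordQuads; groupIntoRows and the tolerance value are illustrative, not part of the Acrobat API.

```javascript
// Sketch: group word records into text rows.
// words: array of { text, left, top }; tolerance: max vertical distance
// (in page units) for two words to count as being on the same row.
function groupIntoRows(words, tolerance) {
  // Sort a copy from the top of the page down (larger y is higher).
  var sorted = words.slice().sort(function (a, b) { return b.top - a.top; });
  var rows = [];
  for (var i = 0; i < sorted.length; i++) {
    var w = sorted[i];
    var last = rows.length ? rows[rows.length - 1] : null;
    if (last && Math.abs(last.top - w.top) <= tolerance) {
      last.words.push(w);
    } else {
      rows.push({ top: w.top, words: [w] });
    }
  }
  // Within each row, order words left to right and join them.
  return rows.map(function (row) {
    row.words.sort(function (a, b) { return a.left - b.left; });
    return row.words.map(function (w) { return w.text; }).join(" ");
  });
}
```

Each returned string is then one line of the page, so a label and its variable-length value end up together regardless of how many words the value contains. The tolerance needs tuning per document, since scanned pages are rarely perfectly aligned.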
