Extract text from PDF with JavaScript

Question

I have a slew of PDF documents from which I need to extract data. I am using 32-bit Adobe Acrobat Pro. I believe the files were originally scanned from Excel, but I have no way of knowing. I have tried to pull the data using various tools, including the most straightforward method, exporting to Excel. This works, but I am looking for a more elegant solution. I would like to use JavaScript to iterate through the documents, which all have the same structure. My current stumbling block is that there are undefined fields, mostly text fields, and I am not familiar enough with the object model to iterate through them programmatically. How can I use the debugger to iterate through the document and list each field? When I look at the document as a form, the fields I need to identify have no properties window, but I am assuming I can still manipulate them with a script.

Thanks! (I can't upload a sample file, btw)

Test Screen Name · Accepted Answer

Ok, I probably misunderstand, but if I am right, you are trying to use Prepare Form to get the information on the page organised so you can extract the text as form fields? I'd say that's a complete non-starter. Prepare Form is wild guesswork looking at where the text is on the page and what lines are drawn. Only things that it decides (somehow) are form fields become fields; the rest is considered background and left alone.

The canonical (but difficult) way to extract text with JavaScript is getPageNthWord and getPageNthWordQuads. This gives you the text and position of each word, separately, one at a time. If your target layout is absolutely fixed, it can do a pretty good job. Otherwise, you are going to have to do a lot of guesswork. Reusing PDF files is often necessary, but isn't what they were designed for.
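
As a rough illustration of that approach, here is a minimal sketch that walks every page and records each word with its quads. The function name extractWords and the shape of the returned objects are my own choices for the example, not part of the Acrobat API; the doc parameter is assumed to be an Acrobat Doc object (for example `this` in the JavaScript console). Note that getPageNthWordQuads can fail if the document's security settings do not allow extraction.

```javascript
// Sketch only: collect every word in a document along with its position.
// `doc` is assumed to be an Acrobat Doc object; extractWords is a
// hypothetical helper name, not an Acrobat built-in.
function extractWords(doc) {
  var words = [];
  for (var p = 0; p < doc.numPages; p++) {
    // Number of words Acrobat finds on page p (pages are zero-based).
    var count = doc.getPageNumWords(p);
    for (var w = 0; w < count; w++) {
      words.push({
        page: p,
        index: w,
        // Third argument true strips punctuation and whitespace.
        text: doc.getPageNthWord(p, w, true),
        // Array of quads (8 numbers each) bounding the word on the page.
        quads: doc.getPageNthWordQuads(p, w)
      });
    }
  }
  return words;
}
```

In the console you would call something like extractWords(this) and then filter the result by page and coordinates to pick out the cells you care about. That filtering step is where a fixed layout pays off, since the same coordinate ranges can be reused for every file.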

Test Screen Name · Answer

Do you mean that they do not respond in the Form editor, so they aren't form fields at all? Or are you using "field" to just mean "text on page"?
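
One way to check, as a rough sketch: ask the document itself which form fields it contains. The helper name listFields is made up for this example; numFields, getNthFieldName, getField and the field's type property are the Acrobat calls it relies on, and doc is assumed to be an Acrobat Doc object such as `this` in the JavaScript console.

```javascript
// Sketch only: list the name and type of every real form field in a document.
// `doc` is assumed to be an Acrobat Doc object; listFields is a hypothetical
// helper name, not an Acrobat built-in.
function listFields(doc) {
  var result = [];
  for (var i = 0; i < doc.numFields; i++) {
    var name = doc.getNthFieldName(i);
    var field = doc.getField(name);
    // getField returns null when no field with that name exists.
    result.push({ name: name, type: field ? field.type : null });
  }
  return result;
}
```

If listFields(this) comes back empty, or lists only the fields that Prepare Form guessed at, then the data you want is plain page text rather than form fields, and you can print each entry with console.println to confirm.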
