Known Participant
July 31, 2017
Answered

Grabbing text data from a pdf to use in javascript


I need to be able to grab the invoice number from PDFs and add it to the filename. The customer always sends their invoices in the same format. Is there a way to get the text from the PDF and add it to the filename while resaving the document?

I am using Acrobat DC Professional.

Correct answer Karl Heinz Kremer

So, you can't just point to the x-y position of the text, even if its page and position do not change from document to document?


If you know exactly where the text is, you can crop the page down to just that portion and then iterate over all words in that area using Doc.getPageNthWord() (see the Acrobat DC SDK Documentation). That way you should be able to extract just the text you are interested in. If you look through the archives and search for getPageNthWord, you should find a number of examples.

3 replies

Legend
August 2, 2017

You can use app.alert to display the file name in a dialog (or console.println to write it to the console) and see what is going on as the script runs.

iu-userAuthor
Known Participant
August 2, 2017

Thanks, just tried this and app.alert brings up each filename correctly, I click OK and then I get the error message with no file save.

try67
Community Expert
August 2, 2017

Copy the actual file name that you see in the alert (or output it to the console, and then copy it from there) into your saveAs command and run it manually from the console. Does it work?

If a file with the same name exists, it will simply be overwritten. However, if that file is open, locked, or set as read-only, the save will fail and an error message will appear.

Legend
August 1, 2017

1. I don't like the look of trying to save as test. Even if it succeeds it will just be called test and won't automatically open in Acrobat. Try test.pdf.

2. Are you able to save to the folder "O:\1_invoice staging" manually?

iu-userAuthor
Known Participant
August 2, 2017

Okay,

Adding the ".pdf" extension to the code makes it work in the console window. So, executing that line with a named test file works. I can't use the script line exactly as is, because the filename contains one variable and a user-entered value (+ .pdf).

The error seems to me to be either that the file is viewed as open ("exception in line 56 of function top level, script Batch:exec  Raise error: the file may be read only ...."), or that the pronmbr variable is not changing as the script iterates through the selected files, so it thinks it is trying to save under the exact same name again. Maybe? I'm at a loss.
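For what it's worth, here is a minimal sketch of how the target path could be assembled so that the ".pdf" extension is always present and each file gets its own name. The function name buildSavePath and the "number_originalname.pdf" naming scheme are assumptions made for illustration, not something from the original script. The folder is written as "/O/1_invoice staging" because saveAs expects a device-independent path rather than "O:\1_invoice staging".

```javascript
// Hypothetical helper: builds a device-independent path of the form
// <folder>/<invoiceNumber>_<originalName>.pdf
function buildSavePath(folderPath, invoiceNumber, originalName) {
    // Drop characters Windows does not allow in file names, then trim spaces.
    var prefix = String(invoiceNumber)
        .replace(/[\\\/:*?"<>|]/g, "")
        .replace(/^\s+|\s+$/g, "");
    // Strip an existing .pdf extension so it is not doubled.
    var baseName = originalName.replace(/\.pdf$/i, "");
    // Strip trailing slashes so the separator is not doubled.
    var folder = folderPath.replace(/\/+$/, "");
    return folder + "/" + prefix + "_" + baseName + ".pdf";
}

// Possible use inside an Action (Acrobat only, so left as comments here):
// var target = buildSavePath("/O/1_invoice staging", pronmbr, this.documentFileName);
// console.println(target);  // confirm the path differs for every file
// this.saveAs(target);
```

Printing the path on every iteration should show quickly whether pronmbr really changes from file to file or whether the script keeps trying to save under the same name.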

try67
Community Expert
July 31, 2017

Assuming this is "real" text and not an image of text then yes, it might be possible.

However, it requires a way of identifying the invoice number, for example based on its format, location on the page or context, or a combination of these methods. Each one will require a different kind of script, though, and of course it will only work if the files are fairly consistent with each other.
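As a rough illustration of the "format plus context" approach, the sketch below collects the words of a page with getPageNumWords and getPageNthWord and returns the first word that matches a pattern after a given label word. The label "Invoice" and the pattern of five or more digits are pure assumptions; the real invoices would dictate both. The helper names are hypothetical as well.

```javascript
// Gather every word on one page (zero-based page number) into an array.
// The third argument of getPageNthWord (true) strips punctuation and spaces.
function getPageWords(doc, pageNum) {
    var words = [];
    var numWords = doc.getPageNumWords(pageNum);
    for (var i = 0; i < numWords; i++) {
        words.push(doc.getPageNthWord(pageNum, i, true));
    }
    return words;
}

// Return the first word that matches `pattern` and appears after `labelWord`
// (compared case-insensitively), or null if there is none.
// `pattern` should not use the "g" flag, since test() would then keep state.
function findNumberAfterLabel(words, labelWord, pattern) {
    var seenLabel = false;
    for (var i = 0; i < words.length; i++) {
        if (seenLabel && pattern.test(words[i])) {
            return words[i];
        }
        if (words[i].toLowerCase() === labelWord.toLowerCase()) {
            seenLabel = true;
        }
    }
    return null;
}

// Possible use in Acrobat (assumed label and pattern):
// var nmbr = findNumberAfterLabel(getPageWords(this, 0), "Invoice", /^\d{5,}$/);
```

This only works if the label and the number format are consistent across files, which is the same caveat as above.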

iu-userAuthor
Known Participant
July 31, 2017

I have the x, y position of the text on the page.  It is real text that can be highlighted and the pdfs from this vendor are very consistent in their format.  I would like to grab the text (actually a number) and add it to the beginning of the filename.

try67
Community Expert
July 31, 2017

OK, in that case it should be possible, but it's a tricky task. You will need to create a loop that iterates over all the words in the page (or the entire file, if it's not always on a specific page), gets their location on the page (using the getPageNthWordQuads method), and then compares it to the area where you expect the target text to be located. Definitely not a simple task if you don't have experience with Acrobat JS...
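A minimal sketch of that kind of loop might look like the following. It assumes each quad returned by getPageNthWordQuads is a flat array of eight numbers (four x, y corner pairs) and that the target rectangle is given as [left, bottom, right, top] in the same coordinate space as the quads. The function names and the rectangle in the usage comment are made up for illustration.

```javascript
// Bounding box [left, bottom, right, top] covering all quads of one word.
// Taking min/max over every corner avoids depending on the corner order.
function quadsBoundingBox(quads) {
    var left = Infinity, bottom = Infinity, right = -Infinity, top = -Infinity;
    for (var q = 0; q < quads.length; q++) {
        for (var i = 0; i < quads[q].length; i += 2) {
            var x = quads[q][i], y = quads[q][i + 1];
            if (x < left) left = x;
            if (x > right) right = x;
            if (y < bottom) bottom = y;
            if (y > top) top = y;
        }
    }
    return [left, bottom, right, top];
}

// Join the words on page `pageNum` (zero-based) whose center point falls
// inside `targetRect`, given as [left, bottom, right, top].
function getTextInRect(doc, pageNum, targetRect) {
    var found = [];
    var numWords = doc.getPageNumWords(pageNum);
    for (var w = 0; w < numWords; w++) {
        var box = quadsBoundingBox(doc.getPageNthWordQuads(pageNum, w));
        var cx = (box[0] + box[2]) / 2;
        var cy = (box[1] + box[3]) / 2;
        if (cx >= targetRect[0] && cx <= targetRect[2] &&
            cy >= targetRect[1] && cy <= targetRect[3]) {
            found.push(doc.getPageNthWord(pageNum, w, true));
        }
    }
    return found.join(" ");
}

// Possible use in Acrobat (the rectangle is a made-up example):
// var invoiceNumber = getTextInRect(this, 0, [400, 690, 560, 730]);
```

Testing the center of each word rather than requiring the whole word to sit inside the rectangle leaves a little tolerance for small shifts between files.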

I've developed many similar tools in the past so if you're interested in hiring someone to do it for you, for a small fee, feel free to contact me privately at try6767 at gmail.com.