Grabbing text data from a PDF to use in JavaScript

Report · Jul 31, 2017

I need to grab the invoice number from PDFs and add it to the filename. The customer always sends their invoices in the same format. Is there a way to get the text from the PDF and add it to the filename while re-saving the document?

I am using Acrobat DC Professional.

Report · Jul 31, 2017

Assuming this is "real" text and not an image of text, then yes, it might be possible.

However, it requires a way of identifying the invoice number, for example based on its format, location on the page or context, or a combination of these methods. Each one will require a different kind of script, though, and of course it will only work if the files are fairly consistent with each other.

Report · Jul 31, 2017

I have the x, y position of the text on the page. It is real text that can be highlighted and the pdfs from this vendor are very consistent in their format. I would like to grab the text (actually a number) and add it to the beginning of the filename.

Report · Jul 31, 2017

OK, in that case it should be possible, but it's a tricky task. You will need to create a loop that iterates over all the words on the page (or in the entire file, if the number is not always on a specific page), gets each word's location on the page (using the getPageNthWordQuads method), and then compares it to the area where you expect the target text to be located. Definitely not a simple task if you don't have experience with Acrobat JS...
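The loop described above could look roughly like the sketch below. It is only an illustration: wordsInRect and quadBounds are hypothetical helper names, rect is an assumed target area you would have to measure yourself (in the same coordinate space the quads use), and doc stands for the Acrobat document object (this inside an Action). Only getPageNumWords, getPageNthWord and getPageNthWordQuads are actual Acrobat JS methods.

```javascript
// Bounding box of one quad. A quad is 8 numbers (four x,y corner pairs);
// min/max is used so the corner order does not matter.
function quadBounds(quad) {
    var xs = [quad[0], quad[2], quad[4], quad[6]];
    var ys = [quad[1], quad[3], quad[5], quad[7]];
    return {
        left: Math.min.apply(null, xs),
        right: Math.max.apply(null, xs),
        bottom: Math.min.apply(null, ys),
        top: Math.max.apply(null, ys)
    };
}

// Collect every word on the page whose centre falls inside rect.
// rect is a hypothetical object: { left, right, bottom, top }.
function wordsInRect(doc, pageNum, rect) {
    var found = [];
    var count = doc.getPageNumWords(pageNum);
    for (var i = 0; i < count; i++) {
        var quads = doc.getPageNthWordQuads(pageNum, i);
        if (!quads || quads.length === 0) {
            continue; // no location information for this word
        }
        var b = quadBounds(quads[0]);
        var cx = (b.left + b.right) / 2;
        var cy = (b.bottom + b.top) / 2;
        if (cx >= rect.left && cx <= rect.right &&
            cy >= rect.bottom && cy <= rect.top) {
            // bStrip = true drops surrounding punctuation and white space
            found.push(doc.getPageNthWord(pageNum, i, true));
        }
    }
    return found;
}
```

In an Action this might be called as wordsInRect(this, 0, {left: 390, right: 460, bottom: 685, top: 705}), where the numbers are placeholders for the area you measured.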

I've developed many similar tools in the past so if you're interested in hiring someone to do it for you, for a small fee, feel free to contact me privately at try6767 at gmail.com.

Report · Aug 01, 2017

So, you can't just point to the x-y position of the text, even if its page and position do not change from document to document?

Report · Aug 01, 2017

If you know exactly where the text is, you can crop the page down to just that portion and then iterate over all words in that area using Doc.getPageNthWord() (Acrobat DC SDK Documentation). That way you should be able to extract just the text you are interested in. If you look through the archives and search for getPageNthWord, you should find a number of examples.

Report · Aug 01, 2017

Actually, I just realized that most of these examples are over at the old AcrobatUsers.com site. Take a look here: Reverse Crop With Javascript (JavaScript)

Report · Aug 01, 2017

So, I ran this script from an example - thanks.

var PageText = "";
for (var j = 0; j < 30; j++) {
    var word = this.getPageNthWord(1, j, false);
    PageText += word;
}
app.alert(PageText);

I found the text I need to be the 13th word on the page. I can now just use the getPageNthWord function and assign a variable then insert the variable in a filename function to put the invoice number into the filename.
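A fixed-position lookup like this can be made a little safer by checking that the word actually looks like an invoice number before using it. The sketch below is only that, a sketch: readInvoiceNumber is a hypothetical helper, doc stands for the document object (this in an Action), and the all-digits check is an assumption about what this vendor's invoice numbers look like. Note that word indexes are zero-based, so the 13th word counted from one is index 12; which index is right depends on how the word was counted.

```javascript
// Return the word at a fixed index if it looks like an invoice number,
// otherwise null so the caller can skip or flag the file.
function readInvoiceNumber(doc, pageNum, wordIndex) {
    if (wordIndex < 0 || wordIndex >= doc.getPageNumWords(pageNum)) {
        return null; // page has fewer words than expected
    }
    // bStrip = true removes surrounding punctuation and white space
    var word = doc.getPageNthWord(pageNum, wordIndex, true);
    // Assumption: invoice numbers consist of digits only
    return /^\d+$/.test(word) ? word : null;
}
```

A null result is a hint that the document does not follow the expected layout, which is better caught before the file is renamed.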

Thank you, I think I can muddle on now.

I don't see a need for cropping or iterating over the whole document. Am I wrong in this?

Report · Aug 01, 2017

Are you sure the number will always be the 13th word on each page of each file? If so then you can do it like that...

Report · Aug 01, 2017

A small sampling shows these documents to be fairly consistent and software generated. Possibly a form that has been flattened or some other structured document.

I will go with this and move on to tackling the problem of making this rename batches of 20 to 100 files at a time. If the documents prove to be inconsistent, I will need to muddle through the more formal way. Right now, down and dirty seems to be working and fits my time schedule. I'm sorry if this proves to be anathema to those wholly invested in the process. Thank you all for your help. I may be back with batch renaming issues.

Report · Aug 01, 2017

If it works, that's all that matters...

Report · Aug 01, 2017

arrgh

I've got it stamping and renaming files properly, and even using the 13th word in the filename. But I am getting this error when it tries to execute this.saveAs: "exception in line 56 of function top level, script Batch:exec Raise error: the file may be read only blah, blah, blah". The path is good, and I have tried many different approaches, even a local path.

Here is what I am working with:

// Begin job
if ( typeof global.counter == "undefined" || global.date_reply == null ) {
    console.println("Begin Job Code");
    global.counter = 0;

    // Grab date from User to be stamped
    var dialogNumber = "Number of Files";
    global.FileCnt = app.response("Number of Files to be Processed:", dialogNumber);
    var dialogTitle = "Date Received";
    var defaultAnswer = util.printd("mm-dd", new Date());
    global.date_reply = app.response("Date Received:",
        dialogTitle, defaultAnswer);
}

// Main code to process each of the selected files
try {
    global.counter++
    console.println("Processing File #" + global.counter);

    // insert batch code here.
    this.addWatermarkFromText({
        cText: "GHC Received " + global.date_reply,
        nTextAlign: app.constants.align.left,
        nHorizAlign: app.constants.align.left,
        nVertAlign: app.constants.align.bottom,
        nHorizValue: 1, nVertValue: 1,
        nFontSize: 8,
    });

    this.addWatermarkFromText({
        cText: "Finance Inbox",
        nTextAlign: app.constants.align.right,
        nHorizAlign: app.constants.align.right,
        nVertAlign: app.constants.align.bottom,
        nHorizValue: -4, nVertValue: 1,
        nFontSize: 8,
        aColor: ["G",.5]
    });
} catch(e) {
    console.println("Batch aborted on run #" + global.counter);
    delete global.counter; // Try again, and avoid End Job code
    event.rc = false; // Abort batch
}

var pronmbr = getPageNthWord(0,13,false)
var re = /\.pdf$/;
var date_replace = global.date_reply.replace(/[?:\\/|<>"*]/g,"");
var fname = this.documentFileName.replace(re,"_");
var filename = pronmbr + "ART INV" + date_replace + ".pdf";
console.println(filename);

// File path must be changed manually to correct directory
this.saveAs("/O/1_invoice staging/" + filename);

// End job
if ( global.counter == global.FileCnt ) {
    console.println("End Job Code");

    // Insert endJob code here
    // Remove any global variables used in case user wants to run
    // another batch sequence using the same variables
    delete global.counter;
    delete global.date_reply;
    delete global.FileCnt;
}

Report · Aug 01, 2017

What's the full file-name that you're trying to use?

Report · Aug 01, 2017

pronmbr + "ART INV" + date_replace + ".pdf";

would be something like "105063 ART INV 08-01.pdf"

with pronmbr being the 13th word, ART INV being inserted text and date_replace being the user date entered in the dialogue box. I get an appropriate filename in the console screen with each error message. One for each file batched - always the same error, but it saves as the original filename.
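One thing that may be worth ruling out, though nothing in this thread confirms it as the cause: the script calls getPageNthWord with bStrip set to false, so the word keeps whatever trailing white space or punctuation follows it on the page, and that ends up in the filename. Below is a minimal sketch of building the name defensively. buildInvoiceFilename is a hypothetical helper, and the list of characters to strip is an assumption based on the one already used in the script plus control characters.

```javascript
// Build "<number> ART INV <date>.pdf" from raw inputs, removing characters
// that are commonly rejected in filenames and trimming stray white space.
function buildInvoiceFilename(invoiceWord, dateText) {
    // Same characters the original script strips, plus control characters
    var illegal = /[?:\\\/|<>"*\x00-\x1f]/g;
    var trim = /^\s+|\s+$/g;
    var number = String(invoiceWord).replace(illegal, "").replace(trim, "");
    var date = String(dateText).replace(illegal, "").replace(trim, "");
    return number + " ART INV " + date + ".pdf";
}
```

With this, the spaces around "ART INV" are added explicitly instead of relying on white space left over from the extracted word.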

Report · Aug 01, 2017

From what context are you running the code?

Does it work if you only execute the saveAs command from the console with the full path, hard-coded into the code?
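For a test like that, it can help to wrap the call so the exception text is captured instead of lost. A minimal sketch follows; trySave is a hypothetical helper and doc stands for the open document (this in the console or in an Action). It simply calls saveAs with a path the same way the script above does.

```javascript
// Attempt the save and report the outcome instead of letting the
// exception stop the script.
function trySave(doc, path) {
    try {
        doc.saveAs(path);
        return { ok: true, error: null };
    } catch (e) {
        return { ok: false, error: String(e) };
    }
}
```

From the console it might be used as console.println(trySave(this, "/O/1_invoice staging/test.pdf").error); where test.pdf is just a placeholder name.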

Report · Aug 01, 2017

It's part of an Action in Acrobat X Pro. I took one I use that works and added the var pronmbr = getPageNthWord(0,13,false) command.

Actually, I get an undefined error when I try the console:

saveAs("/O/1_invoice staging/" test filename)

undefined

Report · Aug 01, 2017

"Undefined" is not an error message. It just means the code executed without returning any values.

Do you see the file saved in the target folder?

Report · Aug 01, 2017

Sorry, no, it is not saving to the target folder.

Report · Aug 01, 2017

Can you post the exact code you're executing?

Report · Aug 01, 2017

// Begin job
if ( typeof global.counter == "undefined" || global.date_reply == null ) {
    console.println("Begin Job Code");
    global.counter = 0;

    // Grab date from User to be stamped
    var dialogNumber = "Number of Files";
    global.FileCnt = app.response("Number of Files to be Processed:", dialogNumber);
    var dialogTitle = "Date Received";
    var defaultAnswer = util.printd("mm-dd", new Date());
    global.date_reply = app.response("Date Received:",
        dialogTitle, defaultAnswer);
}

// Main code to process each of the selected files
try {
    global.counter++
    console.println("Processing File #" + global.counter);

    // insert batch code here.
    this.addWatermarkFromText({
        cText: "GHC Received " + global.date_reply,
        nTextAlign: app.constants.align.left,
        nHorizAlign: app.constants.align.left,
        nVertAlign: app.constants.align.bottom,
        nHorizValue: 1, nVertValue: 1,
        nFontSize: 8,
    });

    this.addWatermarkFromText({
        cText: "Finance Inbox",
        nTextAlign: app.constants.align.right,
        nHorizAlign: app.constants.align.right,
        nVertAlign: app.constants.align.bottom,
        nHorizValue: -4, nVertValue: 1,
        nFontSize: 8,
        aColor: ["G",.5]
    });
} catch(e) {
    console.println("Batch aborted on run #" + global.counter);
    delete global.counter; // Try again, and avoid End Job code
    event.rc = false; // Abort batch
}

var pronmbr = getPageNthWord(0,13,false)
var re = /\.pdf$/;
var date_replace = global.date_reply.replace(/[?:\\/|<>"*]/g,"");
var fname = this.documentFileName.replace(re,"_");
var filename = pronmbr + "ART INV" + date_replace + ".pdf";
console.println(filename);

// File path must be changed manually to correct directory
this.saveAs("/O/1_invoice staging/" + filename);

// End job
if ( global.counter == global.FileCnt ) {
    console.println("End Job Code");

    // Insert endJob code here
    // Remove any global variables used in case user wants to run
    // another batch sequence using the same variables
    delete global.counter;
    delete global.date_reply;
    delete global.FileCnt;
}

Report · Aug 01, 2017

No, I mean when you test just the saveAs command from the console window, what code did you execute, exactly?

Report · Aug 01, 2017

Report · Aug 01, 2017

You can't be executing the code, because it should have failed (because you didn't include the ".pdf" suffix).

To execute it you must first select it and then press Ctrl+Enter.

Report · Aug 01, 2017

The 13th word rule might work for you, but it seems risky to me. Are you quite sure that every word there today will always be there? That there will never be another word? And that you might not get extra words (for example an extra space)?

The "canonical" way to solve this is to use getPageNthWord and getPageNthWordQuads. The Quads give the location of a quadrilateral containing the word. You can't use the size exactly, nor the X,Y directly, but you could use some fuzzy logic to see if this information seems to be from about the right part of the page.
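As a rough illustration of that kind of fuzzy matching, the sketch below picks the word whose quad centre lies closest to an expected point, provided it is within a tolerance. Everything beyond the three Acrobat methods (getPageNumWords, getPageNthWord, getPageNthWordQuads) is an assumption for the example: wordNearPoint is a hypothetical helper, doc stands for the document object, and x, y and tolerance are values you would have to measure and tune for your own files.

```javascript
// Return the word whose centre is nearest to (x, y), or null if no word
// lies within the given tolerance (same units as the quad coordinates).
function wordNearPoint(doc, pageNum, x, y, tolerance) {
    var best = null;
    var bestDist = tolerance;
    var count = doc.getPageNumWords(pageNum);
    for (var i = 0; i < count; i++) {
        var quads = doc.getPageNthWordQuads(pageNum, i);
        if (!quads || quads.length === 0) {
            continue; // no location information for this word
        }
        var q = quads[0];
        // Centre of the quad: average of its four corners
        var cx = (q[0] + q[2] + q[4] + q[6]) / 4;
        var cy = (q[1] + q[3] + q[5] + q[7]) / 4;
        var dist = Math.sqrt((cx - x) * (cx - x) + (cy - y) * (cy - y));
        if (dist <= bestDist) {
            bestDist = dist;
            best = doc.getPageNthWord(pageNum, i, true);
        }
    }
    return best;
}
```

A small tolerance makes the match strict, and a larger one forgives slight shifts in layout at the risk of picking up a neighbouring word.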

Report · Aug 01, 2017

You can, but it's not a trivial task. There's no command that says "give me the text in location x,y on page z"...
