Want to extract document metadata and doc info via a script

Report · Apr 25, 2016

I'm neither a JavaScript nor a Java programmer, so I might be missing one or more steps.

Looking at the JavaScript documentation I have, I see the following code:

var r = new Report();

r.writeText(this.metadata);

r.open("myMetadataReportFile");

save("/c/myreport.pdf"));

The code doesn't work when run from the console. If I execute "this.metadata" on its own, I get the information I expect. This suggests that the problem is with creating the report and/or saving the document.

I haven't yet figured out how to get information out of the Doc Info dictionary. This is another need.

NOTE: In both cases (XMP and DocInfo) we're adding CUSTOM metadata.

Ideally I'd like to save both sets of information, XMP and DocInfo, as XML. This way we can run a comparison between the two.

Finally, whatever code I end up with needs to be able to run in the Action Wizard over about 10,000 files. If the input file is "file.pdf", the output should be "file.xml".

Thanks.

Ira

Report · Apr 25, 2016

Let's start from the end: You will not be able to run an Action on 10,000 files in a single go. If that's your goal then you should abandon it now and look for an alternative to Acrobat, as it simply can't handle that many files without hanging or crashing.

Processing 500 files should be taken as the maximum amount possible.

Report · Apr 25, 2016

Thanks, I didn't realize that Acrobat had such a limit. But even if we have to do this 200 files at a time, it is worthwhile doing.

How do we accomplish that?

Report · Apr 25, 2016

OK, second issue: If you want the output to be an XML file, why are you using the Report object? The Report is a PDF file, you know...

Report · Apr 25, 2016

No, I didn't know that. As I mentioned earlier, I took the code from an Acrobat JavaScript manual that I have.

I hope I'm clear about what we're trying to accomplish.

Report · Apr 25, 2016

I think I understand, but it's not that simple. It might be possible with the Report object if you saved it as an XML file after opening it, but the result might look a bit strange. So let's go back to your code. First of all, you have a syntax error in the last line, as there are two closing parentheses but only one opening one. So you need to fix that.

Beyond that I see two other issues:

1. It doesn't make sense to use both the open command and the save command. If you want to just save the report then use only save. If you want to view it, use open.

2. You can't save files to the root folder of a drive, it's considered unsafe. Change the path to somewhere else, like C:\Temp\ or C:\Reports\ or something like that.

Fix those issues and try again.

Report · Apr 26, 2016

I'm trying to run the script in the console.

When I type:

var Rep = new Report();

The console responds:

undefined

Not sure what's going on.

Report · Apr 26, 2016

What did you expect to happen? This just means that the code completed running without errors or return values.

Report · Apr 26, 2016

I would have expected not to get "undefined".

When I ran:

var r = new Report();

r.writeText(this.metadata);

r.open("myMetadataReportFile");

And the console responded with:

GeneralError: Operation failed.

Report.open:1:Console undefined:Exec

undefined

It did NOT OPEN a file.

Report · Apr 26, 2016

I have a feeling you did not select all of the code when you ran it, because the open command is not in the first line of your code... So you probably only executed the last line, which would have failed. "undefined" in this case means the script ended, and you have the error message before that, which caused it to stop running.

Report · Apr 26, 2016

OK. I figured out what my problem was. I was pressing Ctrl+Enter on each line separately.

So how do I take the code and make it run on a batch of files? Is it as simple as putting the code in the Action Wizard?

If yes, how do I access all the information in the DocInfo Dictionary (both normal and custom)?

Thanks

Report · Apr 26, 2016

To run multiple lines in the console you need to select them all with the mouse and then press Ctrl+Enter.

Later on you can place the code as a part of an Action and run it like that, yes.

The "metadata" property should return the full XMP file, including any custom properties.

Report · Apr 26, 2016

Feel a bit silly that the solution was that simple. Thanks for your patience.

XMP data is half the battle.

I need to do the same kind of data extraction using the information in the DocInfo dictionary. I didn't see anything in the JavaScript documentation about how to get to this info. Any recommendations or suggestions?

Report · Apr 26, 2016

Do you mean the values under the info property of the Document object? If so, you can access them like this:

this.info.Title

this.info.Author

this.info.Subject

etc.

Report · Apr 26, 2016

If I understand correctly what I need to do: to add, say, ISBN and DOI, I would use

this.info.ISBN

this.info.DOI

Is this right?

Thanks again.

Report · Apr 26, 2016

If those properties were defined for that document, yes.

Report · Apr 27, 2016

Is there a way to get a list of ALL the info properties (i.e. both standard and custom)?

Report · Apr 27, 2016

Sure:

for (var i in this.info)
     console.println(i + ": " + this.info[i]);
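Since the stated goal is XML output, the same property loop can build an XML string instead of printing to the console. The following is only a minimal sketch: the function names `escapeXml` and `infoToXml` and the `<docinfo>`/`<entry>` layout are made up for illustration, and `sampleInfo` is a stand-in for `this.info` so the code can also be tried outside Acrobat.

```javascript
// Escape the five characters that are special in XML text and attributes.
function escapeXml(value) {
    return String(value)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;")
        .replace(/'/g, "&apos;");
}

// Build an XML string from a key/value object such as this.info.
// The <docinfo>/<entry> layout is hypothetical, chosen for this example.
function infoToXml(info) {
    var lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<docinfo>"];
    for (var key in info) {
        lines.push('  <entry name="' + escapeXml(key) + '">' +
                   escapeXml(info[key]) + "</entry>");
    }
    lines.push("</docinfo>");
    return lines.join("\n");
}

// Stand-in for this.info (assumption: in Acrobat you would pass this.info).
var sampleInfo = {
    Title: "IEEE Standard for Shunt Power Capacitors",
    DOI: "10.1109/IEEESTD.1980.79668"
};
var xml = infoToXml(sampleInfo);
```

Property names go into a `name` attribute rather than becoming element names, because custom keys such as "IEEE_Publication_No." may not be valid XML element names.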

Report · Apr 27, 2016

Thanks.

This has been a really helpful discussion!!!

Report · Apr 27, 2016

Almost got it

The following code almost works at least for XMP:

//Step 1

var name = (this.documentFileName);

name=name.replace (".pdf", "");

var r = new Report();

r.writeText (this.metadata);

r.save ("/c/Users/ipolans/Desktop/PDF Metadata/XMP-Data/" + name + "-XML.pdf");

//Step 2

app.openDoc("/c/Users/ipolans/Desktop/PDF Metadata/XMP-Data/" + name + "-XML.pdf");

console.println("the current document is "+ this.documentFileName);

var fil = (name + "-XML");

saveAs ("/c/Users/ipolans/Desktop/PDF Metadata/XMP-Data/" + fil + ".txt", "com.adobe.acrobat.plain-text");

The main problem with the code above is that I haven't been able to figure out how to get "saveAs" to use the document opened with "app.openDoc". Instead it uses the document processed in "Step 1", as the "console.println" statement confirms.

Even if this issue is fixed, according to the JavaScript documentation "app.openDoc" is not allowed in a batch sequence (which I assume includes the "Action Wizard"). That raises the question of how much of the code will need modification to work in the "Action Wizard".

Ideally I'd like the JavaScript to avoid having to create a temporary PDF file. Rather I'd want to (1) query the PDF for the XMP metadata and then (2) write directly to a "text" file.

Report · Apr 27, 2016

Instead of using save and then open and then re-save, just use the open command of the Report object to generate a new Document object. Then use saveAs to convert it to a text file. Something like this:

//Step 1
var name = (this.documentFileName);
name=name.replace (".pdf", "");
var r = new Report();
r.writeText(this.metadata);
var newDoc = r.open("XMP Report");
//Step 2
var fil = (name + "-XML");
newDoc.saveAs("/c/Users/ipolans/Desktop/PDF Metadata/XMP-Data/" + fil + ".txt", "com.adobe.acrobat.plain-text");
newDoc.closeDoc(true);
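One detail worth hedging against in a batch run: `name.replace(".pdf", "")` removes only the first lowercase ".pdf" it finds, so a file named "FILE.PDF" would keep its extension. A small sketch of a more forgiving helper follows; `outputName` is a hypothetical name, not part of the Acrobat API.

```javascript
// Replace a trailing ".pdf" (any letter case) with a new extension.
// A name that does not end in ".pdf" simply gets the extension appended.
function outputName(fileName, newExt) {
    return fileName.replace(/\.pdf$/i, "") + newExt;
}

// Example: "file.pdf" becomes "file.xml".
var xmlName = outputName("file.pdf", ".xml");
```

In the script above this could be called as `outputName(this.documentFileName, "-XML.txt")` to build the last part of the save path.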

Report · May 03, 2016

That works.

But I'm finding that "r.writeText" doesn't always produce a new line at the end.

Here's an example showing the problem:

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.6-c015 81.157285, 2014/12/12-00:43:15 "> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:pdf="http://ns.adobe.com/pdf/1.3/" xmlns:xmp="http://ns.adobe.com/xap/1.0/" xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/" dc:format="application/pdf" pdf:Producer="Adobe PDF Library 4.0; modified using iText 2.1.7 by 1T3XT" pdf:keywords="" pdf:Keywords="" xmp:CreateDate="2004-05-12T07:31:12+05:30" xmp:ModifyDate="2016-03-22T08:29:47-04:00" xmp:CreatorTool="Acrobat Capture 3.0" pdfx:Article_Title="IEEE Standard for Shunt Power Capacitors" pdfx:DOI="10.1109/IEEESTD.1980.79668" pdfx:DOI_Link="https://dx.doi.org/10.1109/IEEESTD.1980.79668" pdfx:IEEE_Publication_No.="2459" pdfx:IEEE_Xplore_Article_No.="26642" pdfx:Page_Numbers="1 - 23" pdfx:Publication_Title="ANSI/IEEE Std 18-1980" pdfx:Style="Searchable Image (Exact)"> <dc:description> <rdf:Alt>

I've even tried adding "r.writeText (" ");" without any luck.

Report · May 03, 2016

It should do, but maybe the line-breaks disappear when the file is converted to a text file.

What application are you using to view the text file in? If you're using something like Notepad++ check if there's a CR and an LF char at the end of each line. Maybe there's just a CR, which some applications (like the regular Notepad) do not pick up as a line-break, if I recall correctly.

Report · May 03, 2016

I opened the file in Word. I don't have anything handy on my PC that shows the hex representation.

What I found is that some of the lines have spaces at the end and others have cr/lf pairs (at least as far as Word is concerned).

I'll do some more investigating tomorrow.

Report · May 03, 2016

It might be because you're printing out the entire metadata string, which already includes line-breaks, and those might not come through when you use writeText. In that case I would recommend splitting the metadata string into individual lines and then printing each of those lines to the report on its own.
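A rough sketch of that line-by-line approach is below. The helper names `splitLines` and `writeLines` are made up for this example, and `fakeReport` is a stand-in with a `writeText` method so the code can be tried outside Acrobat. Whether this actually cures the missing line-breaks in the converted text file is untested here.

```javascript
// Split text on CRLF, lone CR, or lone LF.
function splitLines(text) {
    return String(text).split(/\r\n|\r|\n/);
}

// Write each line with its own writeText call.
// "report" is any object with a writeText method, e.g. a Report object.
function writeLines(report, text) {
    var lines = splitLines(text);
    for (var i = 0; i < lines.length; i++) {
        report.writeText(lines[i]);
    }
    return lines.length;
}

// Stand-in for a Report object (assumption: in Acrobat, pass the real one).
var written = [];
var fakeReport = { writeText: function (s) { written.push(s); } };
var count = writeLines(fakeReport, "<a>\r\n<b>\r<c>\n</a>");
```

In the earlier script, `r.writeText(this.metadata);` would then become `writeLines(r, this.metadata);`.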

And I highly recommend Notepad++ for both writing code and examining plain-text files.

Adobe Community
