Programatic access to PDF Meta Data, including XMP Metadata, using Excel VBA.

Report · Nov 19, 2021

Trying to find a way to access the XMP Metadata using Excel VBA with a reference to Adobe Acrobat 10.0 Type Library.

Found some comparable documentation within:

Adobe Acrobat SDK - JS for Acrobat API Reference:

Metadata topic (page 248 of 2021-02-01 ed).

for JavaScript.

Anyway to do this via excel VBA?

GetInfo and SetInfo work very well for the basic Metadata (Creator, Producer, Author, Title, etc.). Is there a way to access the XMP Metadata -- specifically the copyright?

Thanks!

Report · Nov 25, 2021

HI,

Not sure how you would do it in Excel, but the metadata in most PDFs is stored as plain text, so if you can read the file as standard text and get the information you would like :

1 0 obj
<</Length 3546/Subtype/XML/Type/Metadata>>stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 7.0-c000 79.1357c9e, 2021/07/14-00:39:56        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
            xmlns:xmpRights="http://ns.adobe.com/xap/1.0/rights/">
         <xmp:ModifyDate>2021-11-25T11:43:29Z</xmp:ModifyDate>
         <xmp:CreateDate>2021-09-07T20:52:16+01:00</xmp:CreateDate>
         <xmp:MetadataDate>2021-11-25T11:43:29Z</xmp:MetadataDate>
         <xmp:CreatorTool>Adobe Acrobat 21.5</xmp:CreatorTool>
         <dc:format>application/pdf</dc:format>
         <dc:rights>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">Some text here</rdf:li>
            </rdf:Alt>
         </dc:rights>
         <xmpMM:DocumentID>uuid:0f59de51-a56d-8543-960c-877e9d88a112</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:e5b4a1e7-e85d-2f45-aa24-5303d10abc5d</xmpMM:InstanceID>
         <pdf:Producer>Adobe Acrobat 21.5 Image Conversion Plug-in</pdf:Producer>
         <pdf:Keywords/>
         <xmpRights:Marked>True</xmpRights:Marked>
         <xmpRights:WebStatement>someinfohere</xmpRights:WebStatement>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>

The above is taken straight from a PDF file, the tags you are looking for are

<dc:right>

<xmpRights:*>

Note: some PDF's might have more than one xmp metadata entries.

Report · Sep 21, 2022

PNW, did you ever find a way to do this? I know its been done, seen extractions in other apps, but no one seems to know how its done. It has to be a dictionary or object somewhere.

Report · Sep 21, 2022

I tried a few things, and ended up with a mash up that got us through. It's not pretty, so YMMV.

We got an older version of Adobe Acrobat Pro and used a reference to the "Adobe Acrobat 10.0 Type Library" for part of it.
Based on the very helpful post by @BarlaeDC , I used the LineInput statement to scan the PDF.
- This does a nice job of reading them in.
- I then tried using MSXML2 as it is a bit more refined uses DOM -- Critical code components below.

Dim aApp                As Acrobat.AcroApp
Dim AV_Doc              As Acrobat.AcroAVDoc
Dim PD_Doc              As Acrobat.AcroPDDoc
Dim aFlags              As Acrobat.PDDocFlags
Dim a                   As Acrobat.AcroPDAnnot

Dim oXML                As MSXML2.DOMDocument60
Dim XML_NList           As MSXML2.IXMLDOMNodeList

'-- Core Code
    On Error Resume Next
    For Each y In XML_NList
        Set XML_N(0) = y
        sParents = ""
        Do While Not XML_N(0).ParentNode Is Nothing
            sParents = XML_N(0).ParentNode.nodeName & ";" & sParents
            Set XML_N(0) = XML_N(0).ParentNode
        Loop
        With y
            oShtOut.Cells(l, 1) = sParents
            oShtOut.Cells(l, 2) = .Name
            oShtOut.Cells(l, 3) = .nodeName
            oShtOut.Cells(l, 4) = .NodeType
            oShtOut.Cells(l, 5) = .nodeTypeString
            oShtOut.Cells(l, 6) = .NodeValue
            oShtOut.Cells(l, 7) = .tagName
            oShtOut.Cells(l, 8) = .Text
            oShtOut.Cells(l, 9) = .Value
            oShtOut.Cells(l, 10) = .XML
            oShtOut.Cells(l, 11) = .ChildNodes.Length
        End With
        oShtOut.Rows(l).EntireRow.WordWrap = False
        l = l + 1
    Next
    On Error GoTo 0

For bulk update of the PDF files, I ended up using a VBA command line call to exiftool with the following structure:

exiftool -xmp-dc:Rights="DC Rights" -xmp-xmprights:Marked="True" -xmp-xmprights:Owner="xmprights:Owner" -xmp-xmprights:UsageTerms="xmprights:UsageTerms" -xmp-xmprights:WebStatement="xmprights:WebStatement" 1.pdf

Report · Sep 21, 2022

VBA can use the VB:JavaScript connector, and directly call JavaScript functions.

Report · Sep 21, 2022

@Test Screen Name

I had trouble finding good documentation on the VBA:JS implimentation.

Here is some of what I found:

In general, good VBA: JS Object reference documentation is thin at best. Am I missing it somewhere?

Any suggestions on the best VBA: JS Object documentation? -- Thanks!!

Report · Sep 22, 2022

The key thing is that the interface is to an object called JSObject. Using OLE — Acrobat-PDFL SDK: Interapplication Communication (adobe.com)

Programatic access to PDF Meta Data, including XMP Metadata, using Excel VBA.

Photos