Participating Frequently
March 1, 2021
Question

Need to parse a PDF to get all objects from its metadata.

  • March 1, 2021
  • 4 replies
  • 840 views

I need to parse the metadata of a given PDF file to count the different types of objects it contains and to extract those objects, for example objects of type "/JavaScript" or "/ObjStm".


4 replies

Brainiac
March 2, 2021

The Cos layer is what you get. It can enumerate all actual objects. If this isn't enough for you, Adobe don't have anything else, but there are many PDF libraries out there.

VS_NoviceAuthor
Participating Frequently
March 2, 2021

Thanks!
I wanted to check whether there is anything other than the Cos layer that can help (something I may not be aware of),
or whether the Acrobat SDK has some added functionality for this compared to the PDFL SDK.

I tried open-source libraries. They are good, but they fail with internal logic or number errors on a few malicious PDFs, so I thought the SDK would be the most reliable option to go with.

Brainiac
March 1, 2021

The Cos API gives access to all objects. But not to objstm. 

VS_NoviceAuthor
Participating Frequently
March 2, 2021

Yeah, that's one of the cases. I also need to maintain a counter for all kinds of objects, even "/JS" and all "/AA", so I need some sort of parser or enumerator.
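
As a rough illustration of the kind of counter described above, here is a minimal sketch in standard C++ that does not use the Adobe SDK at all. It scans the raw bytes of a file and counts complete name tokens such as "/JS" or "/AA". The function names `endsName` and `countNames` are made up for this example. It is a byte-level scan rather than a parser, so it assumes names are written without #xx escapes, it also counts names that happen to appear inside strings or stream data, and it sees nothing inside compressed streams (including the contents of an "/ObjStm").

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>

// True for bytes that end a PDF name token: whitespace or a delimiter.
bool endsName(unsigned char c) {
    switch (c) {
        case 0: case '\t': case '\n': case '\f': case '\r': case ' ':
        case '(': case ')': case '<': case '>': case '[': case ']':
        case '{': case '}': case '/': case '%':
            return true;
        default:
            return false;
    }
}

// Counts how often each name in `wanted` occurs as a complete token in `data`.
// `data` is the raw file content; names in `wanted` include the leading '/'.
std::map<std::string, std::size_t> countNames(const std::string& data,
                                              const std::set<std::string>& wanted) {
    std::map<std::string, std::size_t> counts;
    for (const auto& name : wanted) counts[name] = 0;  // report zero hits too

    std::size_t i = 0;
    while (i < data.size()) {
        if (data[i] != '/') { ++i; continue; }
        // Extend the token until whitespace, a delimiter, or end of data.
        std::size_t end = i + 1;
        while (end < data.size() &&
               !endsName(static_cast<unsigned char>(data[end]))) {
            ++end;
        }
        const std::string token = data.substr(i, end - i);
        if (wanted.count(token)) ++counts[token];
        i = end;  // may point at the '/' that starts the next name
    }
    return counts;
}
```

To use it on a real file, read the file into a `std::string` through a `std::ifstream` opened with `std::ios::binary` and pass that string as `data`. Matching whole tokens keeps "/JS" from being counted inside "/JavaScript".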

try67
Adobe Expert
March 1, 2021

These types of objects are not part of a file's metadata, but of the actual data...

VS_NoviceAuthor
Participating Frequently
March 1, 2021

Yeah, true, they should be called the structural data of the PDF.
What I am trying to do is extract all the structural objects based on their type and categorize them, basically maintaining a counter of objects in each category.

But I couldn't find the right set of APIs, and I am not even sure whether the SDK offers this kind of functionality.
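
To make the idea of one counter per category concrete, here is a hedged sketch in standard C++ (again without any SDK calls) that tallies the name following each "/Type" key in the raw bytes. The helper names `isPdfSpace`, `isPdfDelimiter`, `readName` and `tallyTypes` are hypothetical. It assumes uncompressed, unescaped dictionaries, and it ignores objects that carry no "/Type" entry, so it only approximates a real object enumeration.

```cpp
#include <cstddef>
#include <map>
#include <string>

// PDF whitespace bytes: NUL, tab, line feed, form feed, carriage return, space.
bool isPdfSpace(unsigned char c) {
    return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

// PDF delimiter bytes, any of which ends a name token.
bool isPdfDelimiter(unsigned char c) {
    return std::string("()<>[]{}/%").find(static_cast<char>(c)) != std::string::npos;
}

// Reads the name token starting at data[pos] (which must be '/').
// Sets `next` to the index one past the end of the token.
std::string readName(const std::string& data, std::size_t pos, std::size_t& next) {
    std::size_t end = pos + 1;
    while (end < data.size()) {
        const unsigned char c = static_cast<unsigned char>(data[end]);
        if (isPdfSpace(c) || isPdfDelimiter(c)) break;
        ++end;
    }
    next = end;
    return data.substr(pos, end - pos);
}

// Builds a histogram of the names that directly follow a "/Type" key.
std::map<std::string, std::size_t> tallyTypes(const std::string& data) {
    std::map<std::string, std::size_t> tally;
    std::size_t i = 0;
    while (i < data.size()) {
        if (data[i] != '/') { ++i; continue; }
        std::size_t next = 0;
        const std::string key = readName(data, i, next);
        i = next;
        if (key != "/Type") continue;  // "/Type1", "/Subtype" etc. are other tokens
        // Skip optional whitespace between the key and its value.
        while (i < data.size() && isPdfSpace(static_cast<unsigned char>(data[i]))) ++i;
        if (i < data.size() && data[i] == '/') {
            ++tally[readName(data, i, next)];
            i = next;
        }
    }
    return tally;
}
```

The returned map can be printed directly as a per-category count. Objects stored inside an "/ObjStm" will not show up unless the stream is decompressed first.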

VS_NoviceAuthor
Participating Frequently
March 1, 2021

I am trying to do this with the PDF Library SDK for C++.


Any leads would be really helpful.


Thanks in advance!