Need to parse a PDF to get all objects from its metadata.

Report · Mar 01, 2021

I need to parse the metadata of a given PDF file to count the different types of objects it contains and to extract those objects, for example objects of type "/JavaScript" or "/ObjStm".

Report · Mar 01, 2021

I am trying to do this with the PDF Library SDK for C++.

Any leads would be really helpful.

Thanks in advance!

Report · Mar 01, 2021

These types of objects are not part of a file's metadata; they are the actual data...

Report · Mar 01, 2021

Yeah, true, they should be called the structural data of the PDF.
What I am trying to do is extract all the structural objects based on their type and categorize them, basically maintaining a counter of objects in each category.

But I couldn't find the right set of APIs, and I'm not even sure whether the SDK offers this kind of functionality.

Report · Mar 01, 2021

The Cos API gives access to all objects, but not to object streams (/ObjStm).

Report · Mar 01, 2021

Yeah, that's one of the cases. I also need to maintain a counter of all kinds of objects, even "/JS" and "/AA". So I need some sort of parser or enumerator.

Report · Mar 02, 2021

The Cos layer is what you get. It can enumerate all actual objects. If this isn't enough for you, Adobe don't have anything else, but there are many PDF libraries out there.
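
For illustration only, here is a minimal sketch of the counting idea that does not use the Cos layer or any SDK at all. It is a plain byte-level scan of the raw file for name tokens, which is a different and much cruder technique than enumerating Cos objects. The function names `countNames` and `endsName` are hypothetical. The sketch assumes the whole file has been read into a `std::string`; it does not decompress streams, so names stored inside an /ObjStm are not seen, and it does not decode names written with #xx hex escapes.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// True if c terminates a PDF name token: a whitespace character
// (NUL, TAB, LF, FF, CR, space) or a delimiter ( ) < > [ ] { } / %.
static bool endsName(char c) {
    static const std::string terminators("\0\t\n\f\r ()<>[]{}/%", 16);
    return terminators.find(c) != std::string::npos;
}

// Counts whole-token occurrences of each name (e.g. "/JS") in the raw bytes.
// Assumption: a plain byte scan, with no stream decoding and no object parsing.
std::map<std::string, std::size_t> countNames(
        const std::string& data, const std::vector<std::string>& names) {
    std::map<std::string, std::size_t> counts;
    for (const std::string& name : names) {
        std::size_t n = 0;
        std::size_t pos = data.find(name);
        while (pos != std::string::npos) {
            const std::size_t end = pos + name.size();
            // Accept the hit only if the name ends here, so "/JS" does not match "/JSX".
            if (end == data.size() || endsName(data[end])) {
                ++n;
            }
            pos = data.find(name, pos + 1);
        }
        counts[name] = n;
    }
    return counts;
}
```

To try it, read the file in binary mode into a `std::string` and call something like `countNames(bytes, {"/JavaScript", "/JS", "/AA", "/ObjStm"})`. A count obtained this way is only a rough indicator, since the same bytes can also appear inside string or stream data.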

Report · Mar 02, 2021

Thanks!
I wanted to check whether there is anything other than the Cos layer that can help (I may not be aware of it),
or whether the Acrobat SDK has some added functionality for this compared to the PDFL SDK.

I tried open-source libraries. They are good, but they fail with internal logic or number errors on a few malicious PDFs. So I thought this would be the most reliable one to go with.
