This may sound very specific and advanced, but I really don't know much about PDFs yet.
What I want to do: take an existing PDF and export it from Adobe Acrobat Pro DC so that it conforms to the PDF/A-2u standard, which is meant for the long-term archiving of PDFs. I don't think any experience with the standard is necessary to answer my question, though.
Some PDFs can't be exported that way because they contain ToUnicode mappings that don't conform to the standard. More specifically, the error is: "'ToUnicode'-cmap contains zero as a Unicode value". I reckon this is not a huge issue, but I'd still like my PDFs to conform to the standard in the end.
Is there any way to access these mappings? I imagine them as a simple dictionary with key-value pairs of glyphs and Unicode values. As such, they should be easy to change. I can't find anything in Acrobat, though.
Can anyone help me with this please?
In PDF internals, a ToUnicode map is a text stream embedded in the PDF. It is almost always compressed, so it is not simply editable or readable without software that can decode PDF structures.
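To make that concrete, here is a minimal Python sketch of what "compressed text stream" means in practice. It assumes the stream uses the common FlateDecode filter (zlib/deflate) with no predictor, and the CMap text in `SAMPLE_CMAP` is a made-up example, not taken from any real file. The helper `parse_bfchar` is hypothetical and only reads `bfchar` entries; `bfrange` entries are ignored.

```python
import re
import zlib

# Hypothetical ToUnicode CMap, invented for illustration only.
SAMPLE_CMAP = b"""/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<0003> <0020>
<0024> <0041>
<0025> <0000>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
"""


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} for bfchar entries only."""
    mapping = {}
    for section in re.findall(rb"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            code = int(src.decode("ascii"), 16)
            # Destination values are UTF-16BE code units written as hex.
            mapping[code] = bytes.fromhex(dst.decode("ascii")).decode("utf-16-be")
    return mapping


# Simulate what a PDF writer does (compress), then what a reader must do (inflate).
compressed = zlib.compress(SAMPLE_CMAP)
decoded = zlib.decompress(compressed)

mapping = parse_bfchar(decoded)
# Codes whose Unicode value contains U+0000, the case the preflight error complains about.
zero_codes = [code for code, text in mapping.items() if "\x00" in text]
print("codes mapped to U+0000:", [f"{code:04X}" for code in zero_codes])
```

Once decompressed, the map really is close to the "dictionary of key-value pairs" described in the question: each `<code> <unicode>` pair is one entry, and the offending entries are the ones whose right-hand side is zero.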
Thank you for the reply! Do you happen to know of any software that can decode those structures? I would assume only Adobe software can do that truly well. Since I'm a bit of a programmer, I'm keen on learning more!
The PDF format is not a secret. It's described in this 1000-page book: https://www.adobe.com/content/dam/acom/en/devnet/pdf/PDF32000_2008.pdf
While I'm thankful for this and will certainly look into it, it does not directly answer the question of whether there is a reliable way or piece of software for uncompressing and editing the text streams within a PDF. Maybe reading the document will help, though. I've also ordered a book by one of the devs at Adobe who worked on the format.
You might look into the tool PDF Can Opener.