I'm working on an application that ingests PDF files, parses them, and produces metrics. I ran into issues with these files because they are generated by different applications and I don't have control over how the PDF files are generated. The contents of the files are well defined, but the applications produce PDF files via 3rd party libraries, tools, or APIs. Some of these applications are old and are possibly using PDF generating libraries that are outdated.
I'm looking for a good way to verify and validate PDF files. Could anyone point me to a tool, library, or API that I can use to get some insight into each PDF file? I'd like to include a step in my workflow where I can get some metadata out of each PDF file and determine whether the file is valid or could lead to issues later in my workflow. It would be even better if there were a way to pass a PDF file through a "cleaning" process that would make the file more up to date.
There should be no problem with old libraries as such. There is no such thing as an outdated PDF, since the original PDF 1.0 files are still completely valid today. Likewise there is no such thing as an "up to date" PDF, and hence there are no tools to make one. However, libraries, both old and new, can have bugs. You say you have problems, though. What sort of problems? Are they limitations in your reading software, or violations of the rules of PDF?
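To help tell those two cases apart, here is a rough first-pass triage in Python, using only the standard library. It is a sketch, not a validator: it only looks for the `%PDF-x.y` header near the start of the file and for the `startxref` keyword and `%%EOF` marker near the end, which a well-formed PDF is expected to have. The function names `quick_pdf_check` and `quick_pdf_check_file` and the 1024-byte windows are my own choices for illustration. A file that passes can still be broken in other ways, and a file that fails may still open in a tolerant reader.

```python
import re
from pathlib import Path

# Matches the version header, e.g. b"%PDF-1.4" or b"%PDF-2.0".
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def quick_pdf_check(data: bytes) -> dict:
    """Cheap structural triage of raw PDF bytes (hypothetical helper).

    This does not parse objects or the cross-reference table; it only
    reports whether the expected markers are present.
    """
    head = data[:1024]    # window size is an arbitrary assumption
    tail = data[-1024:]
    match = HEADER_RE.search(head)
    return {
        # Declared version string such as "1.4", or None if no header found.
        "version": match.group(0)[5:].decode("ascii") if match else None,
        # True only if the header is the very first thing in the file.
        "header_at_start": bool(match) and match.start() == 0,
        # End-of-file marker and xref pointer keyword near the end.
        "has_eof_marker": b"%%EOF" in tail,
        "has_startxref": b"startxref" in tail,
    }


def quick_pdf_check_file(path: str) -> dict:
    """Convenience wrapper that reads the whole file into memory."""
    return quick_pdf_check(Path(path).read_bytes())
```

Running this over a batch of incoming files and grouping the results by declared version and by which markers are missing should at least show whether the trouble clusters around particular producers, which would help answer the questions above.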