Verifying and validating PDF files
Hi,
I'm working on an application that ingests PDF files, parses them, and produces metrics. I've run into issues because the files are generated by different applications, and I have no control over how they are generated. The content of each file is well defined, but the applications produce the PDFs via third-party libraries, tools, or APIs. Some of these applications are old and may be using outdated PDF-generation libraries.
I'm looking for a good way to verify and validate PDF files. Could anyone point me to a tool, library, or API that I can use to get some insight into each PDF file? I'd like to include a step in my workflow that extracts some metadata from each PDF file and determines whether the file is valid or could cause problems later in the workflow. It would be even better if there were a way to pass a PDF file through a "cleaning" process that brings the file up to date.
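To make the kind of check I have in mind concrete, here is a rough sketch using only the Python standard library. The function name `quick_pdf_check` and the 1024-byte search windows are my own assumptions for illustration, and this only looks at the header and trailer markers, so it is nowhere near a real validator. I'm hoping there is a tool that does this properly.

```python
import re


def quick_pdf_check(data: bytes) -> dict:
    """Shallow sanity check on raw PDF bytes (hypothetical helper, not a validator)."""
    # A PDF should begin with a "%PDF-M.m" header; search a small window
    # because some generators emit a few junk bytes before it.
    header = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    # The end of the file should contain "startxref" followed by "%%EOF".
    tail = data[-1024:]
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "header_at_start": data.startswith(b"%PDF-"),
        "has_startxref": b"startxref" in tail,
        "has_eof": b"%%EOF" in tail,
    }
```

In my workflow I would read each file with `open(path, "rb").read()`, pass the bytes to this function, and flag any file where the version is missing or the trailer markers are absent before it reaches the parser.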
Any help would be greatly appreciated.
-M
