Hi @S. S,
Thank you for your reply. I appreciate it. I'm planning to switch to PDF IFilter 9. However, I'm encountering an issue: PDF documents generated with PDFium that contain images cannot be parsed. Have you encountered this or have any suggestions to resolve it?
Thanks
Hi @xiangtian_2217,
Thanks for the response.
Since the PDF iFilter was released back in 2010 and has been End of Life for a long time now, I may not have the most concrete solution you are looking for.
However, I did some research and found several limitations that you may be running into:
- No support for image-only PDFs (scanned PDFs without an OCR text layer)
- No support for some modern compression schemes (JPEG2000, JBIG2 variations, Flate streams produced by certain libraries)
- No support for pages that use XObject-only layouts where no text operators exist
- Problems with PDFs missing a ToUnicode map
- Incomplete support for incremental updates
- Does not load or invoke OCR, so it cannot extract text from images
So, if the PDFs you generate with PDFium:
- embed a bitmap or vector graphic as the page content
- do NOT include any real text operators (Tj, TJ, etc.)
→ iFilter 9 will report the document as “empty” or fail to parse it.
In either case, those documents will not yield any extractable text.
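If you want a rough way to check whether a given file falls into this category, you can look for text-showing operators in its content streams. Below is a minimal Python sketch using only the standard library. It is not an Adobe or PDFium tool, and the function name `has_text_operators` is my own. It assumes the content streams are either uncompressed or Flate-compressed and sit directly in the file body, so it will miss text in encrypted files, object streams, or streams using other filters. Treat a `False` result as a hint, not proof.

```python
import re
import zlib

# Raw bytes between each "stream" keyword and the next "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# A text object (BT ... ET) that contains a Tj or TJ text-showing operator.
# The ' and " operators also show text but are ignored here for brevity.
TEXT_OP_RE = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b.*?\bET\b", re.DOTALL)


def has_text_operators(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any stream appears to contain real text operators."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the trailing end-of-line before "endstream".
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not Flate data (uncompressed or another filter): scan the raw bytes.
            data = raw
        if TEXT_OP_RE.search(data):
            return True
    return False
```

To try it, read the file in binary mode and pass the bytes in, for example `has_text_operators(open("sample.pdf", "rb").read())`, where `sample.pdf` is a placeholder for one of your PDFium-generated files. If it returns `False` for a file that visibly shows text, the page content is most likely an image or outlined vector paths, which matches the cases above.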
You can check the documentation for further reference: https://adobe.ly/4iIkFWw
I hope this provides some clarity on the issue.
Regards,
Souvik.