I’m interested in determining when I can rely on extracted PDF data to be 100% accurate.
I have been investigating how text is rendered into a PDF and have a couple of questions I would appreciate some clarification on.
My basic understanding is that text is rendered from a PDF by:
The encoded text can be easily extracted via copy and paste, but the image-based fonts require OCR. I understand that OCR'd text should not be considered 100% accurate. Therefore, my questions:
I would appreciate any insight into these topics.
Note: This is with regard to converted text files, such as a Word document that contains 1) text and 2) images that contain text, and not scanned files.
If your document contains ligatures, such as fi, fl, etc., copy and paste may be inaccurate. I don’t know how Word handles ligatures, but InDesign uses them.
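To illustrate the ligature issue, here is a minimal Python sketch. It assumes the text you copied or extracted comes back containing Unicode ligature code points (U+FB00 through U+FB06); whether that actually happens depends on how the PDF was produced, so treat this as a check you could run and not as a description of what Acrobat does. The function names `find_ligatures` and `expand_ligatures` are made up for this example.

```python
import unicodedata

# Latin ligature code points in the Alphabetic Presentation Forms block
# (U+FB00 "ff" through U+FB06 "st").
LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB07)}


def find_ligatures(text):
    """Return (index, character) pairs for each ligature code point in text."""
    return [(i, ch) for i, ch in enumerate(text) if ch in LIGATURES]


def expand_ligatures(text):
    """Replace each ligature code point with its plain-letter equivalent.

    Only the ligature code points are normalized; all other characters
    are left untouched.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if ch in LIGATURES else ch
        for ch in text
    )


# Hypothetical extracted text containing the "fi" and "fl" ligatures.
sample = "\ufb01nal work\ufb02ow"
print(find_ligatures(sample))
print(expand_ligatures(sample))  # final workflow
```

If `find_ligatures` returns anything for your extracted text, a plain search for a word like "final" would miss it until the ligatures are expanded.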
I also do not know how accurate text is when the PDF is converted back to a Word document. Are you wanting just visual accuracy when viewing the PDF, or accuracy when it is converted to another file format or viewed in another way, like in Read mode?
Regarding OCR of an image containing text, I am assuming you mean a raster image. I believe that Acrobat would treat that image the same as it would treat a scanned image, since both are raster. The accuracy would depend upon the resolution of the image and the ability of Acrobat to run OCR on it: is the text over a background image, is it a legible font, etc.
How will you be extracting the text?