PDF text rendering questions
I’m interested in determining when I can rely on extracted PDF data to be 100% accurate.
I have been investigating how text is rendered into a PDF and have a couple of questions I would appreciate some clarification on.
My basic understanding is that text is rendered from a PDF by:
- converting the encoded text letter by letter, or
- using image-based fonts.
The encoded text can be easily extracted via copy and paste, but image-based fonts require OCR. I understand that OCR'd text should not be considered 100% accurate. Therefore, my questions:
- Is my understanding of the rendering mechanisms accurate?
- Is text that has been converted through text encoding always accurate?
- Is there a way to tell which mechanism is used if only provided with a PDF?
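To make the last question concrete, here is a minimal sketch of the kind of check I have in mind. It assumes I already have a page's decoded (decompressed) content stream as bytes, and it simply counts the PDF text-showing operators (Tj, TJ, ' and ") against the Do operator that paints an external object such as an image. The function name and the sample stream are hypothetical, and the tokenizing is deliberately crude: it ignores nested parentheses in strings and inline images, and Do can also paint non-image objects.

```python
import re

# Text-showing operators from the PDF specification.
TEXT_SHOW_OPS = {b"Tj", b"TJ", b"'", b'"'}
# Do paints an external object (an image, but possibly also a form).
XOBJECT_OP = b"Do"

# Simplified string patterns: no nested parentheses in literal strings.
_LITERAL_STRING = re.compile(rb"\((?:\\.|[^\\()])*\)", re.DOTALL)
_HEX_STRING = re.compile(rb"<[0-9A-Fa-f\s]*>")


def count_operators(decoded_stream: bytes) -> dict:
    """Count text-showing and object-painting operators in a decoded stream."""
    # Drop string operands so their contents are not mistaken for operators.
    stripped = _LITERAL_STRING.sub(b" ", decoded_stream)
    stripped = _HEX_STRING.sub(b" ", stripped)
    # Array brackets are delimiters, so treat them as whitespace.
    tokens = stripped.replace(b"[", b" ").replace(b"]", b" ").split()
    return {
        "text_show": sum(1 for t in tokens if t in TEXT_SHOW_OPS),
        "xobject_paint": sum(1 for t in tokens if t == XOBJECT_OP),
    }


# Hypothetical page: two text-showing operations and one painted image.
sample = (
    b"BT /F1 12 Tf 72 720 Td (Hello) Tj [(Wor) -20 (ld)] TJ ET "
    b"q 200 0 0 100 72 500 cm /Im1 Do Q"
)
print(count_operators(sample))
```

A page with text-showing operators at least contains encoded text, while a page with only painted objects presumably needs OCR. This check says nothing about whether the character codes map back to the correct characters, which is what my second question is about.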
I would appreciate any insight into these topics.
Note: This is with regard to converted text files, such as a Word document that contains 1) text and 2) images that contain text, and not scanned files.
