Participant
September 28, 2018
Answered

PDF text rendering questions

I’m interested in determining when I can rely on extracted PDF data to be 100% accurate.

I have been investigating how text is rendered into a PDF and have a couple of questions I would appreciate some clarification on.

My basic understanding is that text is rendered from a PDF by:

  1. converting the encoded text letter by letter, or
  2. using image-based fonts.

The encoded text can be easily extracted via copy and paste, but the image-based fonts require OCR. I understand that OCR'd text should not be considered 100% accurate. Therefore, my questions:

  1. Is my understanding of the rendering mechanisms accurate?
  2. Is text that has been converted through text encoding always accurate?
  3. Is there a way to tell which mechanism is used if only provided with a PDF?

I would appreciate any insight into these topics.

Note: This is with regard to converted files, such as a Word document that contains 1) text and 2) images that contain text, and not scanned files.

Correct answer Test Screen Name

What you see on screen might look like text but could come from different kinds of input. You have identified just two of them.

* From actual text, with a named font, character point, encoding

* From raster (bitmap) data from scanning

* From raster (bitmap) data from other sources

* From raster fonts (these are vanishingly rare; you'll probably never see one, and I haven't).

* From vector (outlined) data
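As a rough illustration of how you might tell these apart, here is a minimal sketch that counts operators in an already-decoded page content stream. The function name `classify_stream` and the sample stream are made up for this example. It assumes you have already decompressed the stream with a PDF library, and it ignores inline images. Treat the counts as a hint only: a `Do` can invoke a form XObject that itself contains text, outlined text looks like any other vector drawing, and a scan with an OCR layer will show both text and image operators.

```python
import re

# Operators defined by the PDF content-stream syntax.
TEXT_SHOW = {"Tj", "TJ", "'", '"'}          # show text with the current font
XOBJECT = {"Do"}                            # paint an XObject (often an image)
PATH_PAINT = {"f", "F", "f*", "B", "B*", "b", "b*", "S", "s"}  # fill/stroke paths


def classify_stream(content: str) -> dict:
    """Count text, XObject and path-painting operators in a decoded stream."""
    # Drop literal strings "(...)" so their contents are not read as operators.
    # This simple pattern does not handle nested, unescaped parentheses.
    stripped = re.sub(r"\((?:\\.|[^\\()])*\)", " ", content)
    # Drop hex strings "<...>" (dictionary markers "<<" and ">>" do not match).
    stripped = re.sub(r"<[0-9A-Fa-f\s]*>", " ", stripped)
    # Array brackets may touch an operator, as in "[...]TJ".
    stripped = re.sub(r"[\[\]]", " ", stripped)

    counts = {"text": 0, "xobject": 0, "path": 0}
    for token in stripped.split():
        if token in TEXT_SHOW:
            counts["text"] += 1
        elif token in XOBJECT:
            counts["xobject"] += 1
        elif token in PATH_PAINT:
            counts["path"] += 1
    return counts


# Hypothetical page: one line of real text, one placed image, one stroked path.
sample = (
    "BT /F1 12 Tf 72 720 Td (Hello Do world) Tj ET "
    "q 200 0 0 100 72 500 cm /Im1 Do Q "
    "0 0 m 10 10 l S"
)
print(classify_stream(sample))
```

A page with only `xobject` hits is probably a picture of text; a page with `text` hits has real text objects, which still says nothing about whether they extract correctly.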

In the case of actual text, sometimes it will extract accurately, sometimes not. Don't be surprised to find files which are complete gobbledygook. Extraction of the actual characters does not guarantee anything else like line or paragraph breaks or tables; even word breaks are pot luck.
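One common reason for the gobbledygook is that the bytes in a text string are character codes for that particular font, not necessarily Unicode. If the font uses a custom encoding and the file carries no usable mapping back to Unicode (a ToUnicode CMap), an extractor has to guess. The sketch below is a toy model of that, not how any specific tool works: `extract`, the code values and the Latin-1 fallback are all assumptions chosen for illustration.

```python
from typing import Dict, Optional


def extract(codes: bytes, to_unicode: Optional[Dict[int, str]]) -> str:
    """Turn font character codes into text, with or without a Unicode map."""
    if to_unicode is None:
        # Assumed fallback: read the codes as Latin-1. Real extractors may
        # guess differently, but any guess can be wrong.
        return codes.decode("latin-1")
    # Codes missing from the map become U+FFFD (replacement character).
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# Hypothetical subset font whose glyphs are numbered in order of first use.
codes = bytes([1, 2, 3, 3, 4])
to_unicode = {1: "H", 2: "e", 3: "l", 4: "o"}

print(repr(extract(codes, to_unicode)))  # mapping present
print(repr(extract(codes, None)))        # mapping missing: control characters
```

Both versions draw the same glyphs on screen, which is why a file can look perfect and still copy out as rubbish.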

To your original question, I'd say you can NEVER guarantee anything unless you try it and check carefully by hand.
