Why does exporting a PDF to DOCX produce different text from the OCR in the original document?

Report · Jan 12, 2024

I am trying to understand why the result of converting a PDF to DOCX differs so much from the OCR text scan embedded in the original document.

For instance, there is a row of capital letters that just says "RELATED AGENCIES" in the OCR. But when I export to DOCX, that part of the document is "REL.A.TED AGENCIES". This is one of the more benign differences; other discrepancies have letters missing from words in the DOCX that are fine in the OCR.

Why is this happening, and what is the source of the differences? Is Adobe Acrobat re-scanning to create the DOCX? Is it possible to export the text exactly as embedded while still exporting the formatting (e.g. font face and size)?

For context, I want the font sizes of the letters, and simply extracting the OCR text directly from the PDF does not provide that. Another way to arrive at a solution would be to get Adobe Acrobat to scan for font sizes only ... but I don't think that is possible.

Report · Jan 12, 2024

Can you please share a screenshot of the PDF that's been OCRed? Either that, or can you share a page of the PDF?

Thanks

Report · Jan 13, 2024

Sure, here is the page with the issue in the example.

Report · Jan 13, 2024

When exporting a PDF to DOCX using Adobe Acrobat, differences between the OCR text in the original document and the resulting DOCX file may emerge. The conversion process involves interpreting and transforming content, which can lead to variations in text representation and formatting. The OCR text, derived from optical character recognition applied to scanned or image-based content, captures the visual aspect of text but may lack precise font information. Font substitution during conversion, especially in the absence of embedded fonts, and the interpretation of complex formatting elements contribute to discrepancies. Adobe Acrobat typically doesn't rescan the document but interprets existing text and layout information.

Report · Jan 13, 2024

What you are seeing is (unfortunately) fairly common. The poor-quality scan* (where you can see the text on the back side of the page) is partly to blame.

It does appear that the softness of the scan is causing a "merging" of adjoining letters. A common example of this is "ir" being merged into "n." [Please note that this text is sans-serif, so it's not as likely to occur, but with serif text, you'll see what I'm talking about.] As for the periods within the words, this is probably partly caused by the text on the reverse side bleeding through and the poor kerning of the original.

One thing that could help this immensely is if the OCR applications added AI to the scanning process, but to date, that has not happened.

Please read the attached link, try the suggestions I put forth, and see if that helps. Please let me know.

*[For better quality scanning, please check out this blog I wrote years ago:

https://community.adobe.com/t5/adobe-community-professionals/scanning-clean-searchable-pdfs/m-p/4785...]

Report · Jan 13, 2024

I zoomed into your example area, and here you can see what "can" cause these issues.

Here you can see the serif of the "L" goes up and seems to have a bump. Also, the "A" seems to have an extended "foot." The kerning around "A"s in general is often a challenge: OCR software can interpret the gap as a space, which would explain the period. Why this didn't happen with the "A" from "AGENCIES" is anyone's guess.

When it gets right down to it, it's a miracle that OCR works as well as it does. It does beat retyping the whole thing, but sometimes, if all you need is a short passage, it can be faster to just retype it.

Oh, BTW, you asked why the material looks fine in the PDF but copies wrong; that's because one of the options for Acrobat's OCR is to place the OCRed text as an invisible layer in the document. So, what you see is the original document, but what you copy can be something wholly different.

Report · Jan 13, 2024

For instance, there is a row of capital letters that just says "RELATED AGENCIES" in the OCR. But when I export to DOCX, that part of the document is "REL.A.TED AGENCIES". This is one of the more benign differences; other discrepancies have letters missing from words in the DOCX that are fine in the OCR.

By @Dabraham

What you see in your PDF is not the OCRed text, but the scanned image. Adobe places the text "behind" the image so that the PDF is searchable. So, normally you do not see when the OCR is imperfect. Searching for "related", however, may not find the occurrence here.
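To illustrate why such a search can miss, here is a minimal Python sketch. It assumes the hidden text layer contains the same string the DOCX export showed ("REL.A.TED AGENCIES"); `find_phrase` and its `tolerant` option are hypothetical helpers for illustration, not Acrobat features.

```python
import re

# Hypothetical helper, not an Acrobat feature: it searches a plain string
# that stands in for the hidden OCR text layer of the PDF.
def find_phrase(text_layer: str, phrase: str, tolerant: bool = False) -> bool:
    """Return True if phrase occurs in text_layer, ignoring case."""
    if tolerant:
        # Drop periods wedged between two word characters, e.g. "REL.A.TED".
        text_layer = re.sub(r"(?<=\w)\.(?=\w)", "", text_layer)
    return phrase.lower() in text_layer.lower()

# Assumption: the hidden layer holds the mis-recognized text from this thread.
hidden_layer = "REL.A.TED AGENCIES"

print(find_phrase(hidden_layer, "related"))                 # False
print(find_phrase(hidden_layer, "related", tolerant=True))  # True
```

The tolerant mode only handles stray periods inside words; merged or missing letters would need a fuzzier comparison.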

When you export to Word, you get only what the OCR managed to recognize in your text. As @gary_sc correctly says, the quality of your scan is not the best, as the text from the verso shines through. That disturbs the OCR operation. You can normally avoid that by changing the scanning parameters.

If you can't rescan, you may have other options to enhance the quality of what you see, but it really depends on the tools you have at your disposal.
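If the goal is simply to find where the exported DOCX departs from the text layer so the spots can be corrected by hand, a word-level comparison may help. This is only a sketch, assuming both texts have already been extracted to plain strings by whatever tool you have; `word_differences` is a hypothetical name, and the sample strings are the ones quoted in this thread.

```python
import difflib

def word_differences(ocr_text: str, docx_text: str) -> list[tuple[str, str]]:
    """Return (ocr_words, docx_words) pairs for every span where the texts differ."""
    ocr_words = ocr_text.split()
    docx_words = docx_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=docx_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Join the differing words on each side so the pair is easy to read.
            differences.append((" ".join(ocr_words[i1:i2]),
                                " ".join(docx_words[j1:j2])))
    return differences

# Sample strings taken from the example in this thread.
for ocr_part, docx_part in word_differences("RELATED AGENCIES",
                                            "REL.A.TED AGENCIES"):
    print(f"text layer: {ocr_part!r}  ->  DOCX: {docx_part!r}")
```

An empty string on either side of a pair means words were inserted or dropped rather than changed.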

ABAMBO | Hard- and Software Engineer | Photographer