Why does exporting a PDF to DOCX produce different text from the OCR text in the original document?
I am trying to understand why the result of converting a PDF to DOCX differs so much from the OCR text layer embedded in the original document.
For instance, there is a row of capital letters that reads "RELATED AGENCIES" in the OCR text. But when I export to DOCX, that part of the document becomes "REL.A.TED AGENCIES". This is one of the more benign differences. In other places, the DOCX is missing letters from words that are correct in the OCR text.
Why is this happening, and what is the source of the differences? Is Adobe Acrobat re-scanning the pages to create the DOCX? Is it possible to export the text exactly as embedded while still exporting the formatting (e.g. font face and size)?
For context, I want the font sizes of the letters, and simply extracting the OCR text directly from the PDF does not provide them. Another way to arrive at a solution would be to get Adobe Acrobat to scan for font sizes only, but I don't think that is possible.
