
Black box problems with viewing invisible text layer

New Here, Jan 17, 2024

In Acrobat Pro 2023.008.20470 on Windows 11, I'm having trouble viewing the invisible text layer of PDFs that contain OCR'd text. When I make the layer visible, large black boxes appear where the OCR text should be. Each box is shaped like the underlying text, but it is completely unreadable. See attached image.

Clicking Recognize Text > Correct Recognized Text and making sure the "Review recognized text" checkbox is checked doesn't help.

In other Google searches, I saw people solve similar issues by setting Rendering > Smooth Text to None. Another poster mentioned that Accessibility > Replace Document Colors had been checked accidentally. Neither of those options helps.

I attached the PDF in question.

 

Appreciate any help!

 

TOPICS
Scan documents and OCR
Community Expert, Jan 19, 2024

I can confirm that I get a similar issue when using your file. I do not know why.

ABAMBO | Hard- and Software Engineer | Photographer
New Here, Jan 20, 2024

Thanks @Abambo, I appreciate you giving it a try.

One thing I didn't mention is that this PDF was created by ScanSnap, the scanner software used with a Fujitsu iX500 scanner. I've made an interesting discovery: I believe the software creates the OCR text layer using a glyphless font. (I use the term 'layer' loosely. It's not actually a layer, because my weak understanding is that PDFs don't support layers beyond what are called optional content groups.) It's just text present on the page with opacity set to zero.
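
For anyone who wants to check how the text is actually hidden: one common mechanism in the PDF format is text rendering mode 3 (the `Tr` operator), which draws the glyphs with neither fill nor stroke. Whether this particular ScanSnap file uses that mode or real transparency has not been verified here. The sketch below is a naive scan of an already-decompressed page content stream; the function names and the sample stream are made up for illustration, and the regex can be fooled by text such as "3 Tr" sitting inside a string literal.

```python
import re

# Matches "<integer> Tr", the PDF operator that sets the text rendering mode.
_TR_OPERATOR = re.compile(rb"(?<![\w.])(\d+)\s+Tr\b")

INVISIBLE_MODE = 3  # PDF text rendering mode 3: neither fill nor stroke


def text_render_modes(content_stream: bytes) -> list:
    """Return every text rendering mode set in a decoded content stream, in order."""
    return [int(m.group(1)) for m in _TR_OPERATOR.finditer(content_stream)]


def has_invisible_text(content_stream: bytes) -> bool:
    """True if the stream switches to the invisible text rendering mode."""
    return INVISIBLE_MODE in text_render_modes(content_stream)


# Hypothetical content stream, shaped like what an OCR tool might emit.
sample = b"BT\n3 Tr\n/F1 12 Tf\n72 700 Td\n(hello world) Tj\nET\n"
print(text_render_modes(sample))
print(has_invisible_text(sample))
```

If the scan finds no mode 3 in a real file, the text is probably hidden some other way, for example through a transparency setting in the graphics state.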

 

A glyphless font is one that has no rendered shape associated with each character. How this works exactly, I do not know, but I'm pretty sure Acrobat simply cannot render it because, well, there's nothing to render. What I discovered just now, however, is that if you run More Tools > Preflight > make ocr layer, then edit those black boxes and change their font, lo and behold, the text renders. It renders upside down, though. If I copy and paste the image into Photoshop and apply Image > Image Rotation > Flip Canvas Vertical, it reads just fine.

This problem affects more than just PDFs created by ScanSnap. Tesseract, the popular open-source OCR package, also uses a glyphless font for its OCR layer.
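
A quick way to check whether a given PDF is affected might be to list the fonts on each page and look for a telltale name. This is only a sketch: it assumes the font's base name contains "glyphless" (Tesseract's PDF output, as far as I know, calls its font GlyphLessFont; the name ScanSnap uses is not known here). The helper names are hypothetical, and `scan_pdf` needs PyMuPDF installed.

```python
from typing import Dict, Iterable, List, Tuple


def glyphless_font_names(fonts: Iterable[Tuple], marker: str = "glyphless") -> List[str]:
    """Pick out base font names that contain `marker`, ignoring case.

    `fonts` is expected in the shape returned by PyMuPDF's Page.get_fonts():
    tuples whose item at index 3 is the base font name.
    """
    return [f[3] for f in fonts if marker in str(f[3]).lower()]


def scan_pdf(path: str) -> Dict[int, List[str]]:
    """Map page number -> suspicious font names found on that page."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependency

    with fitz.open(path) as doc:
        return {page.number: glyphless_font_names(page.get_fonts()) for page in doc}


# Hypothetical font entries in the get_fonts() tuple layout.
example_fonts = [
    (12, "n/a", "Type0", "GlyphLessFont", "f-0-0", "Identity-H"),
    (15, "ttf", "TrueType", "ArialMT", "F1", "WinAnsiEncoding"),
]
print(glyphless_font_names(example_fonts))
```

The `marker` parameter is there because the name match is a guess; a file from another tool may need a different substring.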

 

I think my practical solution is to generate the OCR layer in a different fashion. I'm using some custom code with Python and PyMuPDF to render the PDF pages, and in that case I can pick the font that is used.
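
A rough sketch of that approach, not the actual code in use: it assumes OCR word boxes are already available from some other step, and it writes them onto the page as invisible text in an ordinary built-in font via PyMuPDF's `Page.insert_text` with `render_mode=3`. The `Word` type, `placement`, and `add_text_layer` are hypothetical names, and the sizing heuristic (font size = box height, baseline = bottom of the box) is a simplification.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    """One OCR result: text plus its box in page coordinates (origin top-left)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def placement(word: Word) -> Tuple[Tuple[float, float], float]:
    """Return (baseline start point, font size) for a word box.

    Rough heuristic: font size is the box height, baseline is the bottom edge.
    """
    return (word.x0, word.y1), word.y1 - word.y0


def add_text_layer(src: str, dst: str, page_no: int, words: List[Word],
                   fontname: str = "helv") -> None:
    """Write `words` onto one page as invisible text in a regular font.

    `dst` should be a different path from `src`.
    """
    import fitz  # PyMuPDF; imported lazily so `placement` works without it

    with fitz.open(src) as doc:
        page = doc[page_no]
        for w in words:
            point, size = placement(w)
            # render_mode=3 keeps the text invisible but still searchable
            page.insert_text(point, w.text, fontname=fontname,
                             fontsize=size, render_mode=3)
        doc.save(dst)


print(placement(Word("hi", 72.0, 100.0, 90.0, 112.0)))
```

Because "helv" is a normal font with real glyphs, a viewer that is asked to make the hidden text visible has something to draw. Matching the width of each word to its box is left out of this sketch.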

 

I'm just surprised more people haven't run into this, or perhaps my Google-fu needs sharpening. Now that I know the reason, maybe a more targeted search will turn up better solutions.

Thanks

 

Community Expert, Jan 20, 2024

Oh, it's a layer in Acrobat. But if Acrobat did not create the file, it's not Acrobat's issue. 

ABAMBO | Hard- and Software Engineer | Photographer