Character encoding issues when a document is autotagged

December 3, 2024
3 replies
1074 views

I've recently been running into a problem: when I autotag a document, it introduces character encoding issues. It doesn't always show up as a failure in the accessibility checker, though. Sometimes letters just *disappear*. I can still see them on the page, but they're no longer in the content containers when I check the tag tree, and a screen reader doesn't voice them. Some examples: "refective, fltered, beneft, specifc, defned". I'm not sure why seemingly random characters go missing. This has happened with several PDFs, and I don't have access to the source documents, either.

When I check character encoding before autotagging, no issues come up. But in most cases I have to autotag, because the documents don't have a space at the end of each line of text in a paragraph, so the last word of one line and the first word of the next get smooshed together, and autotagging is the only way I've found to fix that.

Does anyone know why this is happening, or whether I can fix it? Also, is there any way to catch it early on, rather than while listening to the document after fully remediating it? Thank you.

MikeCraghead_biz

Participating Frequently

In your case the problem is likely an improperly-encoded ligature in the font: there's a combination character for when an "f" and an "i" end up next to each other, but all Acrobat can read is the "f." Calibri is a common culprit.
If you have access to the authoring doc you can turn off ligatures and dodge the issue entirely, but re-typing the missing letters will work too.
When autotag manifests character encoding errors at the ends of sentences, that requires a professional exorcist.
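To illustrate what a ligature looks like at the character level, here is a minimal Python sketch using only the standard library. It lists the Latin "f" ligature code points and shows that Unicode NFKC normalization expands them back into plain letters. This only helps when the extracted text still contains the ligature character; if the font maps the glyph to a bare "f", as seems to be happening here, the second letter is already gone and has to be retyped. The helper name `expand_ligatures` is made up for this example, and note that NFKC also rewrites other compatibility characters (superscripts, for instance), so treat it as a diagnostic rather than a cleanup step.

```python
import unicodedata

# Latin ligatures from the Unicode "Alphabetic Presentation Forms" block:
# ff, fi, fl, ffi, ffl
LIGATURES = ["\ufb00", "\ufb01", "\ufb02", "\ufb03", "\ufb04"]


def expand_ligatures(text: str) -> str:
    """Expand compatibility ligatures into their plain-letter sequences."""
    return unicodedata.normalize("NFKC", text)


for lig in LIGATURES:
    # Print the code point, its official name, and what it expands to.
    print(f"U+{ord(lig):04X} {unicodedata.name(lig)} -> {expand_ligatures(lig)!r}")

# A word stored with a single ligature character reads correctly once expanded.
print(expand_ligatures("speci\ufb01c bene\ufb01t"))  # specific benefit
```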

Souvik Sadhu

Legend

Hi @CM1002,

Hope you are doing well. Sorry for the trouble, and the delayed response.

In case you are still looking for a solution, you might want to try the steps below:

Extract the Text Layer & Check for Encoding Issues:

Run "Save As" → Plain Text (.txt) in Acrobat.
Open the text file to see if letters are already missing before autotagging.
If characters are already missing there, the problem is in the PDF's own text encoding rather than in autotagging.
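Reading a long text export by eye is slow, so the check above could be partly automated. The following is a rough Python sketch, not an Acrobat feature: it takes the exported text plus a word list and flags words that only become real words once the letters of a dropped "f" ligature are put back after an "f". The function name `suspect_ligature_drops` and the tiny `KNOWN` word list are hypothetical; in practice you would read the .txt file from disk and load a proper dictionary file.

```python
import re


def suspect_ligature_drops(text, known_words):
    """Return {damaged_word: [candidate repairs]} for words that are not in
    known_words but become known once letters are re-inserted after an 'f'."""
    suspects = {}
    for word in set(re.findall(r"[A-Za-z]+", text.lower())):
        if word in known_words or "f" not in word:
            continue
        candidates = set()
        for pos, ch in enumerate(word):
            if ch != "f":
                continue
            # Letters that go missing when fi, fl, ff, ffi, ffl collapse to "f".
            for missing in ("i", "l", "f", "fi", "fl"):
                repaired = word[:pos + 1] + missing + word[pos + 1:]
                if repaired in known_words:
                    candidates.add(repaired)
        if candidates:
            suspects[word] = sorted(candidates)
    return suspects


# Hypothetical word list for the demo only.
KNOWN = {"reflective", "filtered", "benefit", "specific", "defined", "the", "of"}
sample = "The refective, fltered beneft of specifc defned terms"

for bad, fixes in sorted(suspect_ligature_drops(sample, KNOWN).items()):
    print(bad, "->", fixes)
```

Words that are simply absent from the word list (like "terms" above) are ignored unless a re-inserted letter turns them into a known word, which keeps the report focused on the ligature pattern.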

Force a Proper Unicode Text Layer

1. Open Preflight (Ctrl + Shift + X).
2. Under Fixups, search for “Embed missing fonts” and apply it.
3. If the font embedding doesn’t help, use OCR (even if text is selectable):
  - Go to Scan & OCR → Recognize Text in This File → Set as Editable Text.

Check & Manually Correct in the Tags Panel

Open Tags Panel (View > Show/Hide > Navigation Panels > Tags).
Check if missing characters exist in the actual tag tree.
If they are missing, try manually retyping the word in the Actual Text field of the tag’s Properties.

If autotagging is corrupting the text, try:

Export the PDF as Word (.docx).
Open in Word → Check text integrity → Reconvert to PDF.
Then manually tag in Acrobat.

Before fully remediating:

Use Read Out Loud (Shift + Ctrl + Y in Acrobat) to test early.
Try exporting as a Tagged PDF and reopening to catch missing characters.
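Another way to catch this early, sketched here in Python under some assumptions: export the text once before autotagging and once after, then compare the two exports word by word. The sketch assumes both exports come out in roughly the same reading order, which may not hold for every document. The function names are made up for this example, and the two strings stand in for the contents of the two exported files.

```python
import difflib
import re


def is_subsequence(short, long):
    """True if the letters of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def dropped_letter_words(before_text, after_text):
    """Return (before, after) word pairs where the 'after' word is the
    'before' word with one or more letters removed."""
    before = re.findall(r"\w+", before_text)
    after = re.findall(r"\w+", after_text)
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    findings = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only look at runs where words were swapped one-for-one; runs of
        # unequal length (e.g. words joined across line ends) are skipped.
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            continue
        for old, new in zip(before[i1:i2], after[j1:j2]):
            if len(new) < len(old) and is_subsequence(new, old):
                findings.append((old, new))
    return findings


before_export = "A specific benefit was defined"
after_export = "A specifc beneft was defned"

for old, new in dropped_letter_words(before_export, after_export):
    print(f"{old!r} became {new!r}")
```

Because runs of unequal length are skipped, words that were smooshed together in one export and separated in the other will not be reported; only one-for-one replacements where letters went missing show up.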

Hope this helps.

-Souvik

ravinderg62643219

Adobe Employee

Hi @CM1002,

Thanks for posting your issue to Adobe. Would it be possible for you to share the PDF file in which you are seeing the issue?

Regards

Ravi
