Bug in Unicode processing of a ligature ﬃ [U+FB03 : LATIN SMALL LIGATURE FFI] and others

Report · Jun 30, 2020

Why do the Android Adobe PDF reader and Chrome support "U+FB03 : LATIN SMALL LIGATURE FFI" while Adobe Acrobat does not? PDF-XChange also handles it fine. See "and all integers p and q with suﬃciently" (ﬃ is a single symbol here) in the first paragraph of https://sites.math.rutgers.edu/~zeilberg/mamarim/mamarimPDF/pimeas.pdf

Also, Acrobat SOMEHOW copies it as U+000E : <control> SHIFT OUT [SO]. Why? LaTeX source: https://sites.math.rutgers.edu/~zeilberg/mamarim/mamarimTeX/pimeas.tex
Also look here https://www.babelstone.co.uk/Unicode/whatisit.html

and here https://github.com/alif-type/libertinus/issues/143 (it has a nice compilation of all (?) ligatures).

P.S.

Neither the "Beta: Use Unicode UTF-8 for worldwide language support" setting nor "Edit-->> Preferences-->> Language" fixes the issue.
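On the U+000E question: one possible explanation, not confirmed anywhere in this thread, is that the PDF's fonts use TeX's classic OT1 (Computer Modern) encoding and carry no Unicode mapping. OT1 stores the f-ligatures in slots 0x0B to 0x0F, with ffi at 0x0E, so a viewer that falls back to the raw glyph code would emit exactly SHIFT OUT. A minimal Python sketch of repairing text copied under that assumption (the function name `repair_ot1_ligatures` is hypothetical):

```python
# Assumption: the copied text came from a PDF whose fonts use TeX's OT1
# encoding without a Unicode mapping, so raw glyph slots leaked through.
# OT1 keeps the f-ligatures in slots 0x0B-0x0F.
OT1_LIGATURES = {
    0x0B: "\uFB00",  # ff
    0x0C: "\uFB01",  # fi
    0x0D: "\uFB02",  # fl
    0x0E: "\uFB03",  # ffi
    0x0F: "\uFB04",  # ffl
}


def repair_ot1_ligatures(text: str) -> str:
    """Map leaked OT1 ligature slot codes to the Unicode ligature characters.

    Only use this on text known to come from such a PDF: 0x0B-0x0D are also
    ordinary control characters (vertical tab, form feed, carriage return).
    """
    return "".join(OT1_LIGATURES.get(ord(ch), ch) for ch in text)


copied = "su\x0eciently"  # what the clipboard holds per the report above
print(repair_ot1_ligatures(copied))  # suﬃciently
```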

Report · Jul 01, 2020

Can you confirm if this is also an issue that could be related to how Unicode is supported by the operating system where Acrobat is installed?

For example, have you been able to test whether your version of Adobe Acrobat Pro DC behaves the same way on a computer running macOS Catalina (or an older version), MS Windows 8, and/or MS Windows 10?

Since you mentioned Android, it may also be worth looking at the operating system Acrobat is running on.

Just recently, the June 2020 update addressed an issue in Acrobat running on MS Windows: users were reporting in the forums that the Weblink plug-in was not encoding/decoding URLs properly, for example.

This is not necessarily related to your inquiry, but since UTF-8 encoded URLs were malformed to begin with, it made some sense to me to ask this question, because the last update only fixed this problem in Acrobat Pro DC for Windows, not macOS.

Meanwhile, some other Acrobat users who have older versions of the product, like Acrobat Pro X, Acrobat Pro XI, Acrobat DC 2017, have reported back as not experiencing the URL issue.

Have you been able to test, or ask friends and/or other users, whether the LATIN SMALL LIGATURE FFI issue manifests consistently across all versions of their Acrobat?

Report · Jul 01, 2020

Indeed, Android supports ligatures much better than the current version of Windows (1909; I did not test 2004 yet) does. In particular, it recognises a Unicode ligature as one symbol and as multiple symbols at the same time. So when you press backspace, it deletes ﬃ (the ligature) and recreates ff (not a ligature). This is how it is supposed to work, so that search still works on multi-code-point Unicode text and finds letters inside ligatures.
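To illustrate the search point, here is a minimal Python sketch of ligature-aware matching. It uses Unicode compatibility normalization (NFKC) as a stand-in; that is an assumption for illustration, not a description of what Android or any PDF viewer does internally, and the helper names are hypothetical.

```python
import unicodedata


def fold_ligatures(text: str) -> str:
    """Expand compatibility characters such as U+FB03 into plain letters.

    Note: NFKC also rewrites other compatibility characters (for example
    full-width forms), which may or may not be wanted in a real search.
    """
    return unicodedata.normalize("NFKC", text)


def contains(haystack: str, needle: str) -> bool:
    """Substring search that treats a ligature and its letters as equal."""
    return fold_ligatures(needle) in fold_ligatures(haystack)


line = "and all integers p and q with su\uFB03ciently"

print(len("\uFB03"), len(fold_ligatures("\uFB03")))  # 1 3
print("sufficiently" in line)                        # False
print(contains(line, "sufficiently"))                # True
```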

Obviously, this has nothing to do with URL processing, which is a complex beast as well. Again, it is very dangerous that Acrobat processes Unicode incorrectly. I have no friends to test it with, and I only use the latest Acrobat DC. I have macOS Catalina, but I only use Windows 10 on my MacBook, so sorry, you will have to test it yourself.

I will ALSO POINT OUT that it is crazy that the Acrobat for Android codebase is different from the one for Catalina (64 bit, hehe, so different) and Windows 10.

Report · Jul 01, 2020

So, I used an online Unicode converter and noticed that when you convert this ﬃ character (LATIN SMALL LIGATURE FFI) you get ufb03, which is its UTF-16 encoding, not UTF-8.

The UTF-8 encoding, on the other hand, returns efac83, and ï¬ƒ as UTF-8 text.

This is weird because UTF-8 is supposed to be backward compatible, and it should also work with both FreeType and OpenType fonts.

My guess is that the encoding/decoding problem happens when UTF-8 is used and for some reason it becomes unmappable.

In my humble opinion, this may explain why Acrobat Reader on Android (and other platforms) seems to work OK: they are not using UTF-8. They're using UTF-16 instead.

To work around this in MS Windows try this:

Go to Control Panel\All Control Panel Items\Region.

Under the "Formats" tab, select "Match Windows display language (recommended)" instead of "English (United States)".

Then click on the "Administrative" tab, and then click on the "Change system locale..." button.

A popup will open next.

In that Region Settings popup, uncheck the box that says "Beta: Use Unicode UTF-8 for worldwide language support", then click OK and restart.

Usually you change the system locale only when your non-Unicode programs are written for a different language, but Adobe Acrobat supports Unicode in many languages.

For this reason, I would also suggest opening Acrobat and, in Edit-->> Preferences-->> Language, selecting "Same as the operating system" instead of setting the application to English.

After these changes you will be able to copy the ﬃ ligature and paste it into MS Word, Notepad, or even Acrobat without it being copied as U+000E (SHIFT OUT). It will (or should) be recognized as a single character too.

There is an interesting discussion in this thread about this particular ligature:

https://apple.stackexchange.com/questions/130638/what-are-these-characters-from-the-os-x-keyboard

Report · Jul 02, 2020

Mmm. UTF-8 encodes the same code points as UTF-16; it just uses a different variable-width encoding (8-bit code units instead of 16-bit ones).
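A quick way to check this is a short Python sketch that encodes the same character both ways and decodes it back. The Windows-1252 step is an assumption about how the online converter produced the "ï¬ƒ" text, not something stated in this thread.

```python
lig = "\uFB03"  # LATIN SMALL LIGATURE FFI

utf8 = lig.encode("utf-8")       # three 8-bit code units
utf16 = lig.encode("utf-16-be")  # one 16-bit code unit, big-endian

print(f"U+{ord(lig):04X}")  # U+FB03
print(utf8.hex())           # efac83
print(utf16.hex())          # fb03

# Both byte sequences decode back to the very same code point.
assert utf8.decode("utf-8") == utf16.decode("utf-16-be") == lig

# Assumption: the converter displayed the UTF-8 bytes as Windows-1252,
# which is how the three bytes turn into three unrelated characters.
print(utf8.decode("cp1252"))  # ï¬ƒ
```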

ufb03 is actually U+FB03. efac83 in UTF-8 is decoded as follows: 0xef is 11101111, so it starts a 3-byte sequence. See the table at https://en.wikipedia.org/wiki/UTF-8#Description. Next, you extract the "x" bits from all 3 bytes as described there: 1111 from the first byte, 101100 from the second byte, and 000011 from the third byte. When you concatenate them you get 1111101100000011, or 0xFB03. So it is the same.
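The same bit extraction can be written out as a small Python sketch. The function name `decode_utf8_3byte` is made up for illustration; it handles only the 3-byte form discussed here and skips the overlong and surrogate checks a real decoder needs.

```python
def decode_utf8_3byte(data: bytes) -> int:
    """Decode one 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx)."""
    if len(data) != 3:
        raise ValueError("expected exactly 3 bytes")
    b0, b1, b2 = data
    if b0 >> 4 != 0b1110:
        raise ValueError("first byte is not a 3-byte lead byte")
    if b1 >> 6 != 0b10 or b2 >> 6 != 0b10:
        raise ValueError("bad continuation byte")
    # Keep the payload ("x") bits: 4 from the lead byte, 6 from each
    # continuation byte, then concatenate them into a 16-bit value.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


cp = decode_utf8_3byte(bytes.fromhex("efac83"))
print(f"{cp:016b}")   # 1111101100000011
print(f"U+{cp:04X}")  # U+FB03
```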

I will test UTF-8.

Report · Jul 02, 2020

Neither "Beta: Use Unicode UTF-8 for worldwide language support" nor "Edit-->> Preferences-->> Language" fixes the issue.

Report · Jul 02, 2020

Thank you for taking the time to break this down all the way to binary.

This is a great lesson for me.

Strangely enough, though, I had this setting enabled by default in MS Windows 10, and in none of my programs was I able to get the right characters, either by pasting or by using the "ALT+" keyboard method.

When I disabled it, I could copy the ligature from an HTML source (a web browser page) and paste it into a document.

It was recognized as a single character too.

I was able to use the "ALT+" keyboard method to invoke other characters.

I was not able to use the "ALT+" keyboard method for this particular ligature or any of its variants, though, if this is what you're referring to as not working.

Report · Jul 02, 2020

I was referring to not being able to copy and paste from that particular document. It is nothing new. https://superuser.com/questions/375449/why-does-the-text-fi-get-cut-when-i-copy-from-a-pdf-or-print-... (please do not read it, it is VERY VERY outdated and most of it is wrong; pdfLaTeX even supports PDF 2.0 already, after all).

Also this https://webcache.googleusercontent.com/search?q=cache:c8tgVo3R4L4J:https://acrobat.uservoice.com/for...

So again, it is an old issue. IMHO, the fix for it should be really simple. Please just tell the devs that they should check whether a double click selects the whole word containing that ligature. Again, see how it works in PDF-XChange.

Report · Jul 02, 2020

Thank you so much for sharing your thorough knowledge in this area.

I read about this same issue in another thread here in the forums. It was posted back in January 2020, which is recent.

A macOS user was asking exactly the same question, and the fix (or workaround) was not that simple, just like you have pointed out. So you're right: there is no easy fix at this time, or rather Acrobat lacks a feature that you've seen in PDF-XChange.

This looks more like a good opportunity to submit a feature request:

https://www.adobe.com/products/wishform.html

Report · Jul 02, 2020

+++ MY LAST UPDATE ON THIS

I forgot to add that my quickest workaround (and maybe you won't agree with it) was to export to an MS Word document and convert back to PDF using the Adobe PDFMaker add-on.

I noticed that the problem in that document was the LaTeX PDF producer that exported the source document to PDF.

It works fine now on my end.

In the original PDF document that you posted I wasn't even able to search for the whole string "and all integers p and q with suﬃciently". Now I am able to.

Report · Jul 02, 2020

Well, this again proves that the bug can be fixed simply 😉 if export to an MS Word document produces the right Unicode.

Report · Jul 05, 2020

I did a little more digging while I was helping another user with an OCR issue and I noticed that the file that you shared is mainly based on scanned images.

So I opened up Acrobat and used the "Scan & OCR" tool to perform a text recognition on this file, and chose to set the output to "Editable Text & Images". An error message said "Acrobat could not perform recognition because: This page contains renderable text".

Then I noted that if you need to copy the "ffi" part of the word "sufficiently" in that document, selecting a word and right-clicking on it brings up a context menu with two copy options:

Copy
Copy With Formatting

Copying the selected text with plain "Copy" won't work because of the rendered text that the producing software laid out on top of the scanned image layer.

Using "Copy With Formatting" instead copies the content to the clipboard as a text string, which can be pasted into any other program or document as text (not as a ligature).

Opening the Edit PDF tool, or right-clicking on the document and selecting "Edit Text" or "Edit Text & Images", lets you copy that ligature with no problem. It is recognized as a single character and can be pasted as is into other documents.

So the Unicode recognition is working.

Now that I noted this, I think there's really not a bug or problem with the Unicode, since the issue is related to renderable text over scanned images. Using the copy method described above really does the trick.

Any thoughts on this?

Report · Jul 05, 2020

Sigh. I did not want to spam further, but yes. When I did "export to Word" I thought it strange that the text came out in different fonts (how is this possible?). Also, when I checked "ffi" it was 3 letters, not a ligature, so I thought it must have partially OCR'd those letters that cannot be converted to Unicode. Logical, BTW; I mean, Abbyy does the same. But it could still be that it converts the ligature to three symbols, which is recommended for fonts that do not support ligatures (as in this case font substitution will be used), or when ligatures are off by default (again the case with my Office 365 beta channel).

Report · Jul 05, 2020

Actually, it looks like Microsoft Word only supports ligatures in OpenType fonts (not TrueType). So Georgia/Bookman Old Style are not automatically ligatured. You can check in Word with right click -> Font -> Advanced -> OpenType Fonts (ligatures). But it still works if you copy 0xFB03 (ﬃ) into Word, though it will use a non-Georgia font (not that obvious, as it will still say Georgia; indeed, if future versions of the font files include ligature definitions, or Word starts supporting TrueType Collections, it will start using the Georgia font)... But then again, maybe it is using Georgia)) There are rules that can do ligatures without fonts supporting them. Who knows.

Report · Jul 05, 2020

That is so interesting though.

I thank you once more for your patience and taking the time to explain such a complicated topic in a very convenient way to understand.

I really have nothing but mad respect for you in whatever line of work you're in.

Report · Jul 12, 2020

Booted into my macOS Catalina today and reinstalled all your crazy 32-bit to 64-bit stuff (crazy Apple, 32 bit is a HARDWARE thing, the CPU still supports it; it is not like with 16 bit, which is not fully backward compatible) and checked it. The bug is there too (SOMEHOW it copies it as U+000E)! Hahah, it is not an issue in the default macOS system viewer app. Really, Adobe loves Apple, so I think they will fix it ASAP)) Right?))

Report · Jul 12, 2020

Hopefully they will.

I would suggest using the wishform/report a bug that I posted earlier.

I read before in the forums that reporting a bug through that channel increases the chance that Adobe's engineering teams will actually look at the issue(s) promptly.

I don't think the user-to-user support forums are the most efficient medium to report a bug.

However, hopefully an Adobe employee is reading this thread and will join the discussion (at least to acknowledge this issue).

Report · Apr 07, 2024

This is now a worse issue: since Edge added support for Adobe's PDF engine, the issue is there too (edge://flags, New Pdf Viewer).