Why do Adobe's PDF reader for Android and Chrome support "U+FB03 : LATIN SMALL LIGATURE FFI" while Adobe Acrobat does not? PDF-XChange handles it well too. See "and all integers p and q with sufficiently" (ffi is one symbol here) in the first paragraph of https://sites.math.rutgers.edu/~zeilberg/mamarim/mamarimPDF/pimeas.pdf
Also, Acrobat SOMEHOW copies it as U+000E : <control> SHIFT OUT [SO]. Why? LaTeX source: https://sites.math.rutgers.edu/~zeilberg/mamarim/mamarimTeX/pimeas.tex
You can inspect the copied character here: https://www.babelstone.co.uk/Unicode/whatisit.html
Neither "Beta: Use Unicode UTF-8 for worldwide language support" nor "Edit-->> Preferences-->> Language" fixes the issue.
Can you confirm whether this issue could be related to how Unicode is supported by the operating system where Acrobat is installed?
For example, have you been able to test whether your version of Adobe Acrobat Pro DC behaves the same way on a computer running macOS Catalina (or an older version), MS Windows 8, and/or MS Windows 10?
Since you mentioned Android, it may also be worth looking at the operating system Acrobat is running on.
The most recent update, in June 2020, addressed an issue in Acrobat running on MS Windows: users were reporting in the forums that the Weblink plug-in was not encoding/decoding URLs properly.
This is not necessarily related to your inquiry, but because the problem there was malformed UTF-8 encoded URLs, it made sense to me to ask: the last update fixed that problem only in Acrobat Pro DC for Windows, not macOS.
Meanwhile, some Acrobat users with older versions of the product, such as Acrobat Pro X, Acrobat Pro XI, and Acrobat DC 2017, have reported that they do not experience the URL issue.
Have you been able to test, or ask friends or other users, whether the LATIN SMALL LIGATURE FFI issue manifests consistently across all versions of Acrobat?
Indeed, Android supports ligatures much better than the current version of Windows (1909; I have not tested 2004 yet). In particular, it recognises Unicode ligatures as simultaneously one symbol and multiple symbols: when you press backspace it deletes the ffi ligature and recreates ff (not a ligature). This is how it is supposed to work, so that search still works on multi-codepoint Unicode and finds letters inside ligatures.
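For anyone following along, the "one symbol and multiple symbols" idea can be approximated with Unicode compatibility normalization (NFKC), which expands U+FB03 into the three letters f, f, i. This is only a sketch using Python's standard `unicodedata` module; the helper name `fold_ligatures` is made up for illustration, and it is not a claim about how Android or Acrobat implement search.

```python
import unicodedata

LIGATURE_FFI = "\ufb03"  # U+FB03 LATIN SMALL LIGATURE FFI


def fold_ligatures(text: str) -> str:
    """Expand compatibility characters (including ligatures) into plain letters."""
    return unicodedata.normalize("NFKC", text)


# "sufficiently" as it might be stored with the ligature as a single code point.
word = "su" + LIGATURE_FFI + "ciently"
folded = fold_ligatures(word)

print(len(word), len(folded))        # code point counts before and after folding
print("ffi" in word, "ffi" in folded)  # a plain search only matches the folded form
```

A search feature that folds both the document text and the query this way can find "ffi" inside the ligature while the displayed glyph stays a single symbol.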
Obviously, this has nothing to do with URL processing, which is a complex beast as well. Again, it is very dangerous that Acrobat processes Unicode incorrectly. I have no friends to test it with, and I only use the latest Acrobat DC. I have macOS Catalina, but I only use Windows 10 on my MacBook, so sorry, you will have to test it yourself.
I will ALSO POINT OUT that it is crazy that Acrobat for Android uses a codebase different from Acrobat for Catalina (64-bit, hehe, so different) and Windows 10.
So, I used an online Unicode converter and noticed that when you convert the ffi character (LATIN SMALL LIGATURE FFI) you get ufb03, which is the UTF-16 form, not UTF-8.
The UTF-8 conversion, on the other hand, returns efac83, and this ffi as UTF-8 text.
This is weird, because the UTF-8 specification should be backward compatible, and recognition should work with both Free Type and Open Type fonts.
My guess is that the encoding/decoding problem happens when UTF-8 is used and, for some reason, the character becomes unmappable.
In my humble opinion, this may explain why Acrobat Reader on Android (and other platforms) seems to work OK: they are not using UTF-8. They are using UTF-16 instead.
To work around this in MS Windows try this:
A popup will open next.
See slide:
Usually you change the "Change the system locale" setting when your non-Unicode programs use a language that doesn't support Unicode, but Adobe Acrobat supports Unicode in many languages.
For this reason, I would also suggest opening Acrobat and, in Edit-->> Preferences-->> Language, selecting "Same as the operating system" instead of setting the application to English.
After these changes you should be able to copy the ffi ligature and paste it into MS Word, Notepad, or even Acrobat without it being copied as U+000E (SHIFT OUT). It will (or should) be recognized as a single character too.
There is an interesting discussion in this thread about this particular ligature:
https://apple.stackexchange.com/questions/130638/what-are-these-characters-from-the-os-x-keyboard
Mmm. UTF-8 and UTF-16 encode the same Unicode code points; they just use different code unit sizes (8-bit versus 16-bit), and both are variable-width.
ufb03 is actually U+FB03. efac83 in UTF-8 decodes as follows: 0xef is 11101111, so it starts a 3-byte sequence. See the table at https://en.wikipedia.org/wiki/UTF-8#Description. Next, you extract the "x" bits from all 3 bytes as described there: 1111 from the first byte, 101100 from the second byte and 000011 from the third byte. Concatenating them gives 1111101100000011, or 0xFB03. So it is the same.
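The bit extraction above can be checked with a few lines of Python. This is a minimal sketch: the function name `decode_utf8_3byte` is hypothetical, it handles only the 3-byte form, and it skips the overlong and surrogate checks a real decoder needs.

```python
def decode_utf8_3byte(data: bytes) -> int:
    """Decode one 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) to a code point."""
    if len(data) != 3:
        raise ValueError("expected exactly 3 bytes")
    b0, b1, b2 = data
    # Check the marker bits: 1110 for the lead byte, 10 for each continuation byte.
    if b0 >> 4 != 0b1110 or b1 >> 6 != 0b10 or b2 >> 6 != 0b10:
        raise ValueError("not a 3-byte UTF-8 sequence")
    # Keep 4 + 6 + 6 payload bits and concatenate them.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


raw = bytes.fromhex("efac83")
code_point = decode_utf8_3byte(raw)

print(f"U+{code_point:04X}")                      # the decoded code point
print(raw.decode("utf-8") == chr(code_point))     # agrees with Python's own decoder
print(chr(code_point).encode("utf-16-be").hex())  # the same character as UTF-16 (big-endian)
```

Both encodings round-trip to the same code point, so the choice between UTF-8 and UTF-16 by itself should not change which character is copied.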
I will test UTF-8.
Neither "Beta: Use Unicode UTF-8 for worldwide language support" nor "Edit-->> Preferences-->> Language" fixes the issue.
Thank you for taking the time to break this down all the way to binary.
This is a great teaching lesson for me.
Strangely enough, though, I had this setting enabled by default in MS Windows 10, and in none of my programs was I able to get the right characters, either by pasting or by using the "ALT+" keyboard method.
When I disabled it, I was able to copy the ligature from an HTML source (a web browser page) and paste it into a document.
It was recognized as a single character too.
I was able to use the "ALT+" keyboard method to invoke other characters.
I was not able to use the "ALT+" keyboard method for this particular ligature or any of its variants, though, if that is what you are referring to as not working.
I was referring to not being able to copy and paste from that particular document. It is nothing new: https://superuser.com/questions/375449/why-does-the-text-fi-get-cut-when-i-copy-from-a-pdf-or-print-... (please do not read it, it is VERY VERY outdated and most of it is wrong; pdfLaTeX already supports even PDF 2.0, after all).
So again, it is an old issue. IMHO, the fix should be really simple. Please just tell the devs that they should check whether a double click selects the word containing that ligature. Again, see how it works in PDF-XChange.
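As a rough illustration of why a fix could be simple: in TeX's classic OT1 text encoding the f-ligatures sit at slots 0x0B to 0x0F, with ffi at 0x0E, which matches the U+000E seen when copying. That this PDF uses OT1 fonts without a Unicode mapping is an assumption on my part and is not confirmed anywhere in this thread. Under that assumption, text that has already been copied can be repaired on the user's side with a small Python sketch (the names `OT1_F_LIGATURES` and `repair_copied_text` are made up for illustration):

```python
# Assumed slots of the f-ligatures in TeX's OT1 text encoding.
# Caveat: 0x0B and 0x0C are also vertical tab and form feed, so this is only
# safe on text where those control characters are not used as whitespace.
OT1_F_LIGATURES = {
    0x0B: "ff",
    0x0C: "fi",
    0x0D: "fl",
    0x0E: "ffi",
    0x0F: "ffl",
}


def repair_copied_text(text: str) -> str:
    """Replace raw OT1 ligature codes left in copied text with plain letters."""
    return text.translate(OT1_F_LIGATURES)


copied = "and all integers p and q with su\x0eciently"
print(repair_copied_text(copied))
```

This only post-processes the clipboard text; it says nothing about how Acrobat or PDF-XChange handle the mapping internally.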
Thank you so much for sharing your thorough knowledge in this area.
I read in another thread here in the forums about this same issue. It was posted back in January 2020 which is recent.
A macOS user was asking exactly the same question, and the fix (or workaround) was not that simple. So there is no easy fix at this time; it looks more like a feature that Acrobat lacks and that, as you have seen, PDF-XChange has.
This looks more like a good opportunity to submit a feature request:
+++ MY LAST UPDATE ON THIS
I forgot to add that my quickest workaround (which you may not agree with) was to export to an MS Word document and convert it back to PDF using the Adobe PDFMaker add-on.
I noticed that the problem in that document came from the LaTeX PDF producer that generated the PDF from the source document.
It works fine now on my end.
In the original PDF document that you posted I wasn't even able to search for the whole string "and all integers p and q with sufficiently". Now I am able to.
Well, this again proves that the bug can be fixed simply 😉, if export to an MS Word document produces the right Unicode.
I did a little more digging while I was helping another user with an OCR issue and I noticed that the file that you shared is mainly based on scanned images.
So I opened up Acrobat and used the "Scan & OCR" tool to perform a text recognition on this file, and chose to set the output to "Editable Text & Images". An error message said "Acrobat could not perform recognition because: This page contains renderable text".
Then I noted that if one needs to copy the "ffi" part of the word "sufficiently" in that document, selecting the word and right-clicking on it brings up a context menu with two copy options:
Copying the selected text with plain "Copy" won't work, because of the rendered text that the producing software laid out on top of the scanned image layer.
Using "Copy With Formatting" instead copies the content to the clipboard as a text string that can be pasted into any other program or document as text (not as a ligature).
Now, opening the Edit PDF tool, or right-clicking on the document and selecting "Edit Text" or "Edit Text & Images", allows copying that ligature with no problem; it is recognized as a single character and can be pasted as is into other documents.
So the Unicode recognition is working.
Now that I noted this, I think there's really not a bug or problem with the Unicode, since the issue is related to renderable text over scanned images. Using the copy method described above really does the trick.
Any thoughts on this?
Sigh. I did not want to spam further, but yes. When I did "export to Word" I thought it strange that the text used different fonts (how is this possible?). Also, when I checked "ffi" it was 3 letters, not a ligature, so I thought it must have partially OCR'd the letters that cannot be converted to Unicode. Logical, BTW; ABBYY does the same. But it could also be that it converts the ligature to three symbols, which is recommended for fonts that do not support ligatures (as in this case, where font substitution is used), or when ligatures are off by default (again the case with my Office 365 beta channel).
Actually, it looks like Microsoft Word only supports ligatures in OpenType (not TrueType) fonts. So Georgia/Bookman Old Style text is not automatically ligatured. You can check in Word with right click -> Font -> Advanced -> OpenType Fonts (ligatures). It still works if you paste 0xFB03 (ffi) into Word, though it will use a non-Georgia font (this is not obvious, since Word will still report Georgia; if future versions of the font files include ligature definitions, or Word starts supporting TrueType Collections, it will start using Georgia)... But then again, maybe it is using Georgia)) There are rules that can form ligatures without the font supporting them. Who knows.
That is so interesting though.
I thank you once more for your patience and for taking the time to explain such a complicated topic in a way that is so easy to understand.
I really have nothing but mad respect for you in whatever line of work you're in.
Booted into my macOS Catalina today, reinstalled all that crazy 32-bit to 64-bit stuff (crazy Apple, 32-bit is a HARDWARE thing and the CPU still supports it; it is not like 16-bit, which is not fully backward compatible) and checked it. The bug is there too (SOMEHOW it copies it as U+000E)! Hahah, it is not an issue in the default macOS viewer app. Really, Adobe loves Apple, so I think they will fix it ASAP)) Right?))
Hopefully they will.
I would suggest using the wishform/report a bug that I posted earlier.
I read before in the forums that reporting a bug through that channel increases the chance that Adobe's engineering teams actually look at the issue(s) promptly.
I don't think the user-to-user support forums are the most efficient medium to report a bug.
However, hopefully an Adobe employee is reading this thread and will join the discussion (at least to acknowledge this issue).
This is now a worse issue: since Edge added Adobe's PDF support, the issue is there too (edge://flags, New Pdf Viewer).