Unicode Text Not Encoding Properly

September 20, 2019
11 replies
33887 views

Hello,

When I convert a Microsoft Word document to PDF (using either the Adobe Printer or Save As > PDF), the Unicode text in the document does not encode properly. The text _appears_ correct, but when I copy and paste it to another program, such as Notepad, there are errors:

For example, the text in Microsoft Word is

बूढ़े पिता ने ऐसी भूमिका ...

The text _appears_ correct in the PDF file, but when I copy and paste it into Notepad, it is:

बूढ़े 􀉟पता ने ऐसी भू􀉠मका ...

Note the boxes.

This erroneous encoding means it is not possible to search the PDF document properly. For example, if I search for "पिता" (the second word), I will not get a match. I would have to search "􀉟पता".
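One way to confirm that the problem is in the text extracted from the PDF, and not in the program it is pasted into, is to inspect the code points of the pasted string. The following is a minimal Python sketch, not a fix. It assumes the box is a private-use character (Unicode category `Co`); the code point U+10025F and the function name `find_private_use` are illustrative assumptions, not values confirmed from the attached files.

```python
import unicodedata


def find_private_use(text):
    """Return (index, code point) pairs for private-use characters in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" = private use
    ]


# The word as typed in Word: PA + VOWEL SIGN I + TA + VOWEL SIGN AA.
expected = "\u092A\u093F\u0924\u093E"

# The word as pasted from the PDF. U+10025F is an assumed private-use
# code point standing in for the box; the vowel sign I is missing.
pasted = "\U0010025F\u092A\u0924\u093E"

print(find_private_use(expected))  # no private-use characters
print(find_private_use(pasted))    # one private-use character at index 0
print(expected in pasted)          # False, which is why search fails
```

If the pasted string contains private-use characters, the PDF's text layer maps those glyphs to code points with no standard meaning, so the result will be the same in any editor.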

Embedding the fonts makes no difference.

I have attached a Word and PDF file for reference.

Please advise how I can create a PDF properly, so that the encoding is exactly like it is in any other program, i.e. without any of these boxes or other issues.

Thank you

Sim

word.docx

word.pdf


Correct answer seligs30151338

I had something that sounds similar. I had English text in the PDF but got gibberish when I copied and pasted it. So I exported the pages as images (JPEG files). Unfortunately, each page became a separate file. Then I pulled them all back in using the Combine Files button. They came back in and the OCR automatically read them. When I copied and pasted, all was fine. I think the author of the original file used a custom font and encoding that was unknown to the program I was pasting into. Not sure how good the OCR is with Asian languages, though.


ls_rbls

Community Expert

See if you can download a better editor like Notepad2, gEdit or UltraEdit (paid for).

See some answers offered in this link:

https://community.adobe.com/t5/Acrobat/Specials-characters-in-app-response/m-p/10621527#M150868


sa222222222 (Author)

Participating Frequently

Hi,

I'm not sure how a different editor would help. The encoding is wrong in the PDF document, so it doesn't matter which program I paste the text to.

ls_rbls

Community Expert

In the link above, one user asked a similar question. He was using BBEdit, which runs on macOS (I believe), and he was advised to change the encoding in the editor. Some text editors can't display an XML file, for example (if that is the case with yours). Test Screen Name suggested changing the encoding, and the other user showed him how that editor can find and replace characters like those. Worst case, could it be that there is no font support for that particular language? Can you share a screenshot of the steps you are following? I would like to understand the problem better before I give you wrong answers.
