Hello,
When I convert a Microsoft Word document to PDF (using either the Adobe Printer or Save As > PDF), the Unicode text in the document does not encode properly. The text _appears_ correct, but when I copy and paste it to another program, such as Notepad, there are errors:
For example, the text in Microsoft Word is
बूढ़े पिता ने ऐसी भूमिका ...
The text _appears_ correct in the PDF file, but when I copy and paste it into Notepad, it is:
बूढ़े पता ने ऐसी भूमका ...
Note the boxes.
This erroneous encoding means it is not possible to search the PDF document properly. For example, if I search for "पिता" (the second word), I will not get a match. I would have to search "पता".
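One way to see exactly what goes missing is to list the code points of the word as typed in Word next to the word as pasted from the PDF. The following is only a minimal sketch: the `describe` helper is a hypothetical name, and the `pasted` string assumes the vowel sign was simply dropped (the real PDF text layer may contain a placeholder character instead).

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

# Word as typed in Word: PA, VOWEL SIGN I, TA, VOWEL SIGN AA
expected = "\u092a\u093f\u0924\u093e"
# Word as pasted from the PDF, ASSUMING the vowel sign I was dropped outright
pasted = "\u092a\u0924\u093e"

for label, word in (("expected", expected), ("pasted", pasted)):
    print(label, describe(word))

# Code points present in the original word but absent from the pasted one
missing = [cp for cp in describe(expected) if cp not in describe(pasted)]
print("missing:", missing)

# A plain substring search for the original word cannot match the pasted text
print("search hit:", expected in pasted)
```

If the two lists differ, the search failure comes from the text stored in the PDF, not from the program the text is pasted into.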
Embedding the fonts makes no difference.
I have attached a Word and PDF file for reference.
Please advise how I can create a PDF properly, so that the encoding is exactly like it is in any other program, i.e. without any of these boxes or other issues.
Thank you
Sim
I had something that sounds similar. I had English in the PDF but gibberish when I copied and pasted. So I exported the pages as images (JPEG files). Unfortunately, each page became a separate file. Then I pulled them all back in using the Combine Files button. They came back in and the OCR automatically read them. When I copied and pasted, all was fine. I think the author of the original file used a custom font and encoding that was unknown to the program I was pasting to. Not sure how good the OCR is with Asian languages, though.
See if you can download a better editor like Notepad2, gEdit or UltraEdit (paid for).
See some answers offered in this link:
https://community.adobe.com/t5/Acrobat/Specials-characters-in-app-response/m-p/10621527#M150868
Hi,
I'm not sure how a different editor would help. The encoding is wrong in the PDF document, so it doesn't matter which program I paste the text to.
I am examining your files right now.
I also found this link for you: https://support.office.com/en-us/article/choose-text-encoding-when-you-open-and-save-files-60d59c21-...
The problem looks like one that another user had yesterday, which I was trying to help with.
There is embedded or merged formatting in those characters in whatever source you are copying and pasting from.
The solution is very easy: all you have to do in Word, or any other text editor, is save the file as plain text.
Look at the slides:
Here the solution is from the last link I sent you. I am doing it from MS Word.
Here I am using a more advanced text editor (Notepad++), with the same procedure: save as plain text. You can see the encrypted or scrambled XML encoding. The file will open fine in Notepad after you save it as plain text.
This is the same file from above, opened in MS Word after it was saved as plain text.
Sir,
I forgot to add one step to your question.
When you copy and paste to Notepad, you have to change the encoding to UTF-8.
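To illustrate why the encoding chosen at save time matters for Devanagari, here is a small sketch. The sample word and the choice of `cp1252` as a stand-in for a legacy "ANSI" code page are assumptions for illustration only. Note that this covers saving the text only; it cannot restore characters that are already missing from the text copied out of the PDF.

```python
# Sample word, written with escapes so the code points are unambiguous
sample = "\u092a\u093f\u0924\u093e"

# UTF-8 can represent every code point; each Devanagari character takes 3 bytes
utf8_bytes = sample.encode("utf-8")
print(len(utf8_bytes), "bytes in UTF-8")

# A legacy Western code page (cp1252, assumed here) has no Devanagari at all
try:
    sample.encode("cp1252")
    ansi_ok = True
except UnicodeEncodeError:
    ansi_ok = False
print("fits in cp1252:", ansi_ok)

# Decoding the UTF-8 bytes gives back exactly the same string
round_trip = utf8_bytes.decode("utf-8")
print("round trip intact:", round_trip == sample)
```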
I uploaded this video here just to make sure you understand the issue: https://youtu.be/JgRINuI36vg
To convert the file, just open it with Adobe Acrobat and then save it as PDF. This method won't work if you are trying it with Acrobat Reader.
The other ways of converting the file to PDF are also very easy. See slides:
This is my last reply on this topic.
Here is how it looks on my end:
I see what you are saying. In my case, I mentioned the fonts earlier because when you choose Edit Text and Images in the PDF, you can copy and paste in there and you will see the pasted text is incomplete. But when you change the font to "Nirmala UI", for example, the whole string becomes visible.
I tried different things: going to Preferences, Export, From MS Word, Edit, Fonts, Embedded Fonts, and changing some settings in there.
In MS Word I did the same thing, and also played around with different encodings.
Have you checked to confirm that the version of the font is OpenType/Unicode? (Nirmala UI, I believe, is what you're using.) There are many knock-off versions of this font on the web that could be causing the encoding problems. The real font is copyrighted by Microsoft, installs with legitimate versions of Windows since version 8, and is OpenType/Unicode.
FYI, Noto Sans Devanagari is an open source (free) font from Google: https://fonts.google.com/. It's an excellent Unicode/OpenType font. Best practice: use the same font the original author used.