Unicode Text Not Encoding Properly

September 20, 2019
11 replies
33887 views

Hello,

When I convert a Microsoft Word document to PDF (using either the Adobe Printer or Save As > PDF), the Unicode text in the document does not encode properly. The text _appears_ correct, but when I copy and paste it to another program, such as Notepad, there are errors:

For example, the text in Microsoft Word is

बूढ़े पिता ने ऐसी भूमिका ...

The text _appears_ correct in the PDF file, but when I copy and paste it into Notepad, it is:

बूढ़े 􀉟पता ने ऐसी भू􀉠मका ...

Note the boxes.

This erroneous encoding means it is not possible to search the PDF document properly. For example, if I search for "पिता" (the second word), I will not get a match. I would have to search "􀉟पता".
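A quick way to see what those boxes actually are is to list the code points of the pasted string. The sketch below assumes the boxes are Private Use Area characters (Unicode general category `Co`), which have no standard meaning and usually render as boxes. The `pasted` string is a hypothetical stand-in built from escapes, not text taken from the attached PDF, and `find_private_use` is just an illustrative helper name.

```python
import unicodedata


def find_private_use(text):
    """Return (index, 'U+XXXX') for every Private Use code point in text."""
    hits = []
    for i, ch in enumerate(text):
        # General category "Co" marks Private Use code points, which carry
        # no standard meaning and usually display as boxes.
        if unicodedata.category(ch) == "Co":
            hits.append((i, f"U+{ord(ch):04X}"))
    return hits


# Hypothetical stand-in for the pasted word: one Private Use code point
# (assumed value) followed by the Devanagari letters PA, TA and vowel sign AA.
pasted = "\U0010025F\u092A\u0924\u093E"
# The word as typed in Word: PA, vowel sign I, TA, vowel sign AA.
expected = "\u092A\u093F\u0924\u093E"

print(find_private_use(pasted))    # lists the suspicious position, if any
print(find_private_use(expected))  # a clean string reports nothing
```

If the real pasted text reports hits like this, the PDF's text layer is mapping some glyphs to placeholder code points instead of Devanagari, which would explain why a search for the original word finds nothing.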

Embedding the fonts makes no difference.

I have attached a Word and PDF file for reference.

Please advise how I can create a PDF properly, so that the encoding is exactly like it is in any other program, i.e. without any of these boxes or other issues.

Thank you

Sim

word.docx

word.pdf

Pranita286878915w0f

New Participant

Hi ,

I am trying to convert the Marathi pdf document to the word document.

It changes the font to incorrect characters as below:

किव कु लगु^ कािलदास सं^ृ त िव^िव^ालय, रामटेक, नागपूर

Could you please help me understand what needs to be done to keep the correct format? Thanks in advance.

ls_rbls

Braniac

Hi,

Are you exporting a document to Microsoft Word with Adobe Acrobat Pro DC?

What operating system are you on?

Pranita286878915w0f

New Participant

Hi ,

Yes. Adobe Acrobat Pro to MS Word 2016.

OS: Windows 10 Pro

seligs30151338 (Correct answer)

New Participant

I had something that sounds similar. I had English in the PDF but got gibberish when I copied and pasted. So I exported the pages as images (JPEG files). Each page unfortunately became a separate file. Then I pulled them all back in using the Combine Files button. They came back in and the OCR automatically read them. When I copied and pasted, all was fine. I think the author of the original file used a custom font and encoding that was unknown to the program I was pasting into. Not sure how good the OCR is with Asian languages though.

ls_rbls

Braniac

Thank you for updating this old thread with that solution.

I learned something new today.

ls_rbls

Braniac

Hey Madhuris,

Just a quick follow up and checking if you were able to resolve your issue.

Thank you.

Bevi Chagnon - PubCom.com

Braniac

Have you checked to confirm that the version of the font is OpenType/Unicode? (Nirmala UI, I believe, is what you're using.) There are many knock-off versions of this font on the web that could be causing the encoding problems. The real font is copyrighted by Microsoft, installs with legitimate versions of Windows since version 8, and is OpenType/Unicode.

Bevi Chagnon | Designer, Trainer, & Technologist for Accessible Documents | PubCom | Classes & Books for Accessible InDesign, PDFs & MS Office

ls_rbls

Braniac

Nirmala UI is the font I was able to find that was recognized in Adobe. The Word document he originally sent uses Noto Sans Devanagari.

ls_rbls

Braniac

Nirmala UI does not seem to drop the characters that Peter was pointing out earlier in the thread. In my case I recreated various scenarios, and it was the only font that allowed me to copy text, paste it into the search within Adobe Acrobat, and actually land on the right documents or words in a document.

ls_rbls

Braniac

I see what you are saying. In my case, I mentioned the fonts earlier because when you choose Edit Text and Images in the PDF, you can copy and paste in there and you will see that the pasted text is incomplete. But when you change the font to "Nirmala UI", for example, the whole string becomes visible.

I tried different things by going to Preferences, Export, From MS Word, Edit, Fonts, Embedded Fonts, and changed some settings in there.

In MS Word I did the same thing, and also experimented with different encodings.

ls_rbls

Braniac

Just make sure that when you save the edited document in MS Word you are using "Nirmala UI" for the whole document. This will allow you to copy and paste text into the search box, and it will find the words without displaying those Wingdings-style symbols.

ls_rbls

Braniac

This is my last reply on this topic.

Here is how it looks on my end:

sa222222222 (Author)

Participating Frequently

Yes, and as I stated, the first word "बूढ़े" is wrong. It reads "बूढ़" - there is a character missing because of the incorrect coding.

ls_rbls

Braniac

To convert the file, just open it with Adobe Acrobat and then save it as PDF. This method won't work if you are trying it with Acrobat Reader.

The other ways of converting the file to PDF are also very easy. See slides:

sa222222222 (Author)

Participating Frequently

Yes, but then try copying that text from the PDF. You will see that the first word is not correct. It pastes as "बूढ़" rather than "बूढ़े".
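To pin down exactly which character goes missing in a case like this, the expected and pasted strings can be compared code point by code point. This is only a sketch: `dropped_code_points` is an illustrative helper, and both strings are written as escapes under the assumption that the word is stored as BA, vowel sign UU, DDHA, nukta, vowel sign E (a PDF or Word file may use the precomposed letter U+095D instead).

```python
import difflib
import unicodedata


def dropped_code_points(expected, pasted):
    """Name the code points present in expected but lost in pasted."""
    matcher = difflib.SequenceMatcher(None, expected, pasted, autojunk=False)
    dropped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" and "replace" both mean characters of expected were lost.
        if tag in ("delete", "replace"):
            dropped.extend(
                f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"
                for ch in expected[i1:i2]
            )
    return dropped


# Assumed encoding of the word as typed in Word.
expected = "\u092C\u0942\u0922\u093C\u0947"
# Assumed paste result: the same word without its final vowel sign.
pasted = "\u092C\u0942\u0922\u093C"

print(dropped_code_points(expected, pasted))
```

Under these assumptions the only difference is the trailing vowel sign, which matches the description of one character missing from the pasted word.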

ls_rbls

Braniac

Sir,

I forgot to add one step to your question.

When you copy and paste into Notepad, you have to change the encoding to UTF-8.
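What the UTF-8 setting does and does not change can be checked with a short round-trip test. This is a sketch under assumptions: `cp1252` stands in for Notepad's legacy "ANSI" option on a Western-locale Windows system, `survives` is an illustrative helper, and the Private Use code point is a hypothetical placeholder for one of the boxes.

```python
def survives(text, encoding):
    """True if text can be encoded with `encoding` and decoded back unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # The encoding has no representation for at least one character.
        return False


word = "\u092A\u093F\u0924\u093E"   # Devanagari word: PA, sign I, TA, sign AA
boxed = "\U0010025F" + word[0] + word[2:]  # hypothetical pasted form with a box

print(survives(word, "utf-8"))    # UTF-8 covers all of Unicode
print(survives(word, "cp1252"))   # a Western legacy code page has no Devanagari
print(survives(boxed, "utf-8"))   # the box is preserved too, not repaired
```

So saving as UTF-8 keeps correctly encoded Devanagari intact, but it also faithfully keeps any placeholder characters that were already wrong when they left the PDF.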

sa222222222 (Author)

Participating Frequently

Okay. This is all fine, but how does this convert the file to a PDF?

sa222222222 (Author)

Participating Frequently

I uploaded this video here just to make sure you understand the issue: https://youtu.be/JgRINuI36vg

ls_rbls

Braniac

The problem is similar to one that happened to another user I was trying to help yesterday.

There is embedded or merged formatting in those characters, wherever you are copying and pasting from.

The solution is very easy. ALL YOU HAVE TO DO WITH WORD, OR ANY OTHER TEXT EDITOR, IS SAVE THE FILE AS PLAIN TEXT.

Look at the slides:

IN HERE THE SOLUTION IS FROM THE LAST LINK I SENT YOU. I AM DOING IT FROM MS WORD

IN HERE I AM USING A MORE ADVANCED TEXT EDITOR (NOTEPAD++)... SAME PROCEDURE : SAVE AS PLAIN TEXT.. YOU CAN SEE THE ENCRYPTED OR SCRAMBLED XML ENCODING. WILL OPEN FINE IN NOTEPAD AFTER YOU SAVE AS PLAIN TEXT

THIS IS THE SAME FILE FROM ABOVE OPENED IN MS WORD AFTER IT WAS SAVED AS PLAIN TEXT

peterc94883614

Participating Frequently

This is Peter Copeland: My problem is that these crunched texts are occurring in Acrobat Pro DC on documents that are saved as .pdf files. Yes, I am sending them to YOU in Word files. But if I copy the section of 'crushed text' in Acrobat Pro DC and paste it in a clear section of the document, it 'uncrushes it'. If I try to retype it in the place it came from, it recrushes MORE, because of 'codes?' within the edited box. The text also 'Left adjusts it' not allowing the line to go to edg

peterc94883614

Participating Frequently

Sorry! Something sent the message before I was ready :backspace. The text also 'Left adjusts the line', not allowing the text to finish at the CR or line edge of the edit box. Notice the 'the prophet of Zerubbabel' text above. See how, in the PDF, 'theprophet' [e and p] have displaced over each other when I tried to reinsert the section of text. It doesn't recognise spaces. Thank you for your trouble. 2. I have also noted that, sometimes, if the edit block doesn't correctly count the number of characters in a line, it removes spaces. I think I showed an example where 5 spaces had been removed to get all the words into the line.

ls_rbls

Braniac

I am examining your files right now.

I also found this link for you: https://support.office.com/en-us/article/choose-text-encoding-when-you-open-and-save-files-60d59c21-88b5-4006-831c-d536d42fd861
