Unicode Text Not Encoding Properly

New Here, Sep 20, 2019
Hello,

 

When I convert a Microsoft Word document to PDF (using either the Adobe Printer or Save As > PDF), the Unicode text in the document does not encode properly. The text _appears_ correct, but when I copy and paste it to another program, such as Notepad, there are errors:

 

For example, the text in Microsoft Word is


बूढ़े पिता ने ऐसी भूमिका ...

The text _appears_ correct in the PDF file, but when I copy and paste it into Notepad, it is:

बूढ़े 􀉟पता ने ऐसी भू􀉠मका ...

Note the boxes.

 

This erroneous encoding means it is not possible to search the PDF document properly. For example, if I search for "पिता" (the second word), I will not get a match. I would have to search "􀉟पता".
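
(Editorial sketch, not part of the original post.) One way to confirm what was actually pasted is to list the code points and flag any that fall in a Unicode private-use area. The function name `private_use_positions` and the sample string are hypothetical, and U+10025F is only a stand-in for whichever private-use code point the PDF really emits:

```python
import unicodedata

def private_use_positions(text):
    """Return (index, 'U+XXXX') for every private-use character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" is the private-use category
    ]

# Hypothetical sample: one private-use code point followed by Devanagari letters.
pasted = "\U0010025F\u092a\u0924\u093e"
for index, code_point in private_use_positions(pasted):
    print(index, code_point)
```

Private-use code points have no standard meaning, which would be consistent with both the boxes and the failed search described above.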

 

Embedding the fonts makes no difference.

 

I have attached a Word and PDF file for reference.

 

Please advise how I can create a PDF properly, so that the encoding is exactly like it is in any other program, i.e. without any of these boxes or other issues.

 

Thank you

 

Sim

TOPICS
Edit and convert PDFs , General troubleshooting

Adobe Community Professional, Sep 20, 2019

See if you can download a better editor like Notepad2, gEdit or UltraEdit (paid for).

 

See some answers offered in this link:

https://community.adobe.com/t5/Acrobat/Specials-characters-in-app-response/m-p/10621527#M150868

New Here, Sep 20, 2019

Hi,

 

I'm not sure how a different editor would help. The encoding is wrong in the PDF document, so it doesn't matter which program I paste the text to.
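
(Editorial sketch, not part of the original post.) The point that the receiving program is irrelevant can be illustrated in Python: if the clipboard text already contains a wrong code point, any Unicode encoding preserves it faithfully, so changing the editor or its encoding setting cannot repair it. The function name and the sample string are hypothetical:

```python
def survives_round_trip(text, encodings=("utf-8", "utf-16", "utf-32")):
    """Report, per encoding, whether encode followed by decode returns text unchanged."""
    return {enc: text.encode(enc).decode(enc) == text for enc in encodings}

# Hypothetical pasted text: a private-use code point followed by Devanagari letters.
pasted = "\U0010025F\u092a\u0924\u093e"
print(survives_round_trip(pasted))
```

Every encoding reports `True`, meaning the unwanted code point is carried through intact rather than corrected.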

Adobe Community Professional, Sep 20, 2019

In the link above, one user was asking a similar question. He was using BBEdit, which runs on macOS (I believe), and he was advised to change the encoding in the editor. Some text editors cannot display an XML file, for example (if that is the case with yours). Test Screen Name suggested changing the encoding, and the other user showed him how that editor can find and replace characters like those. Worst case, could it be that there is no font support for that particular language? Can you share a screenshot of the steps you are following? I would like to understand the problem better before I give you wrong answers.

New Here, Sep 20, 2019

Hi, I am really not sure what you mean here. I need to be able to convert a Word document to PDF. During that process, something happens to the way the text is encoded in the PDF. As such, the text in the PDF appears correct, but if I try to copy and paste it to another program, there are errors. I can't convert from a different application because the documents are Word documents.

Adobe Community Professional, Sep 20, 2019

I am examining your files right now.

 

I also found this link for you: https://support.office.com/en-us/article/choose-text-encoding-when-you-open-and-save-files-60d59c21-...

Adobe Community Professional, Sep 20, 2019

The problem is like one that happened to another user I was trying to help yesterday.

There is embedded or merged formatting in those characters, wherever you are copying and pasting from.

The solution is very easy. ALL YOU HAVE TO DO WITH WORD, OR ANY OTHER TEXT EDITOR, IS SAVE THE FILE AS PLAIN TEXT.

 

Look at the slides:

 

IN HERE THE SOLUTION IS FROM THE LAST LINK I SENT YOU. I AM DOING IT FROM MS WORD

 

change encoding.png

 

 

IN HERE I AM USING A MORE ADVANCED TEXT EDITOR (NOTEPAD++)... SAME PROCEDURE : SAVE AS PLAIN TEXT.. YOU CAN SEE THE ENCRYPTED OR SCRAMBLED XML ENCODING. WILL OPEN FINE IN NOTEPAD AFTER YOU SAVE AS PLAIN TEXT

saving as normal text.png

 

 

THIS IS THE SAME FILE FROM ABOVE OPENED IN MS WORD AFTER IT WAS SAVED AS PLAIN TEXT

saving as normal text2.png

New Here, Sep 22, 2019

This is Peter Copeland: My problem is that these crunched texts are occurring in Acrobat Pro DC, on documents that are saved as .pdf files. Yes, I am sending them to YOU in Word files. But if I copy the section of 'crushed text' in Acrobat Pro DC and paste it in a clear section of the document, it 'uncrushes' it. If I try to retype it in the place it came from, it recrushes it MORE, because of 'codes?' within the edit box. The text also 'Left adjusts it' not allowing the line to go to edg

New Here, Sep 22, 2019

Sorry! Something sent the message before I was ready :backspace. The text also 'Left adjusts the line', not allowing the text to finish at the CR or line edge of the edit box. Notice the 'the prophet of Zerubbabel' text above: see how, in the PDF, 'theprophet' [e and p] have displaced over each other when I tried to reinsert the section of text. It doesn't recognise spaces. Thank you for your trouble. 2. I have also noted that, sometimes, if the edit block doesn't correctly count the number of characters in a line, it removes spaces. I think I showed an example where 5 spaces had been removed to get all the words into the line.

Adobe Community Professional, Sep 20, 2019

Sir,

 

I forgot to add one step to your question.

 

When you copy and paste to Notepad, you have to change the encoding to UTF-8.

 

saveas plain text then change encoding to UTF-8.png

New Here, Sep 20, 2019

Okay. This is all fine, but how does this convert the file to a PDF?

New Here, Sep 20, 2019

I uploaded this video here just to make sure you understand the issue: https://youtu.be/JgRINuI36vg

Adobe Community Professional, Sep 22, 2019

OK, so here is what worked for me (which has not been clear so far in this thread). You have to follow this sequence:

(1) In MS Word, first save the file as plain text. This brings up a dialog box. In this dialog box, select "Unicode" as the encoding type, then save the text file. Make a note of where you saved this text file, as you will need it again.
(2) Do not open the plain text file you just saved with MS Word. Even though Word gives you the option to change the encoding when you open it, the file will not be saved correctly when you convert it to PDF.
(3) Instead, right-click the text file and select Convert to PDF, or open it in Adobe Acrobat, which opens it with no problem.
(4) Now try the copy and paste. It works!
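
(Editorial sketch, not part of the original reply.) For readers who want to produce the plain text file from step (1) outside of Word, the following Python sketch writes and re-reads a text file as UTF-16 with a byte order mark. It assumes that the "Unicode" option in Word's dialog corresponds to UTF-16, which the thread does not confirm; the function names and the file name `sample.txt` are hypothetical:

```python
import os
import tempfile

def save_as_unicode_text(text, path):
    # Python's "utf-16" codec writes a byte order mark, then UTF-16 in native byte order.
    with open(path, "w", encoding="utf-16", newline="") as handle:
        handle.write(text)

def read_unicode_text(path):
    # The same codec consumes the byte order mark when reading.
    with open(path, "r", encoding="utf-16", newline="") as handle:
        return handle.read()

sample = "\u092a\u093f\u0924\u093e"  # the word searched for earlier in the thread
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
save_as_unicode_text(sample, path)
print(read_unicode_text(path) == sample)
```

This only covers producing the text file; converting it to PDF still happens in Acrobat as described in step (3).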

Adobe Community Professional, Sep 22, 2019

If you see that it is missing characters, change to the Nirmala UI font, as it is the only one I was able to get a workaround with. As Chagnon suggested below, you are going to have to download and install an OpenType/Unicode font to make it accessible in MS Word, Windows, and your Adobe products.

Adobe Community Professional, Sep 20, 2019

To convert the file, just open it with Adobe Acrobat and then save it as PDF. This method won't work with Acrobat Reader.

The other ways of converting the file to PDF are also very easy. See slides:

 

RIGHT-CLICKCONVERTTOPDF.png

 

 

 

OPENEDINACROBAT.png

 

 

New Here, Sep 21, 2019

Yes, but then try copying that text from the PDF. You will see that the first word is not correct. It pastes as "बूढ़" rather than "बूढ़े".

Adobe Community Professional, Sep 21, 2019

This is my last reply on this topic.

 

Here is how it looks on my end:

 

 

copy and paste from PDF.png

 

 

copy and paste from PDF to Notepad.png

 

 

 

 

New Here, Sep 21, 2019

Yes, and as I stated, the first word "बूढ़े" is wrong. It reads "बूढ़" - there is a character missing because of the incorrect coding.
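
(Editorial sketch, not part of the original post.) The missing character can be identified by comparing the expected and pasted strings code point by code point. The Python sketch below uses the standard `difflib` and `unicodedata` modules; the function name `dropped_characters` is hypothetical, and the two strings are written as escapes under the assumption that the word is spelled with a separate nukta (U+093C):

```python
import difflib
import unicodedata

def dropped_characters(expected, pasted):
    """List the code points present in expected but lost in pasted."""
    expected = unicodedata.normalize("NFC", expected)
    pasted = unicodedata.normalize("NFC", pasted)
    matcher = difflib.SequenceMatcher(None, expected, pasted, autojunk=False)
    dropped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            dropped.extend(expected[i1:i2])
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}" for ch in dropped]

expected = "\u092c\u0942\u0922\u093c\u0947"  # the word as typed in Word
pasted = "\u092c\u0942\u0922\u093c"          # the word as pasted from the PDF
print(dropped_characters(expected, pasted))
# prints ['U+0947 DEVANAGARI VOWEL SIGN E']
```

Under that assumption, the character lost on paste is the trailing vowel sign.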

Adobe Community Professional, Sep 21, 2019

I see what you are saying. In my case, I mentioned the fonts earlier because when you choose Edit Text and Images in the PDF, you can copy and paste in there and you will see the pasted text incomplete. But when you change the font to "Nirmala UI", for example, the whole string becomes visible.

 

I tried different things: going to Preferences, Export, From MS Word, Edit, Fonts, Embedded Fonts and changing some settings in there.

In MS Word I did the same thing, plus played around with different encodings.

Changing Fontadobe devangari.png

 

Changing Fontadobe Nirmala UI.png

Adobe Community Professional, Sep 21, 2019

Just make sure that when you save the edited document in MS Word, you are using "Nirmala UI" for the whole document. This will allow you to copy and paste text into the "Search" field, and it will find the words without displaying those Wingdings-like symbols.

Adobe Community Professional, Sep 21, 2019

Have you checked to confirm that the version of the font is OpenType/Unicode? (Nirmala UI, I believe, is what you're using.) There are many knock-off versions of this font on the web that could be causing the encoding problems. The real font is copyrighted by Microsoft, installs with legitimate versions of Windows since version 8, and is OpenType/Unicode.

Bevi Chagnon | PubCom | Designer & Technologist for Accessible Documents
| Books & Classes | Accessible InDesign | Accessible PDFs | Accessible MS Office |

Adobe Community Professional, Sep 22, 2019

Nirmala UI is the one I was able to find that was recognized in Adobe. The Word document he sent originally used Noto Sans Devanagari.

Adobe Community Professional, Sep 22, 2019

Nirmala UI does not seem to drop the characters that Peter was pointing out earlier in the thread. In my case, I recreated various scenarios, and it was the only font that allowed me to copy and paste into a search within Adobe Acrobat and actually land on the right documents or words in a document.

Adobe Community Professional, Sep 22, 2019

Yup! It worked, thanks!! By the way, when I did the conversion to Plain Text, the encoding was just "Unicode". Thanks again!

Adobe Community Professional, Sep 22, 2019

FYI, Noto Sans Devanagari is an open source (free) font from Google: https://fonts.google.com/. It's an excellent Unicode/OpenType font. Best practice = use the same font the original author used.

 

