Known Participant
June 11, 2021
Question

Multilanguage/Multifonts/FULL UNICODE NOT SUPPORTED(?)

  • June 11, 2021
  • 2 replies
  • 2521 views

Cannot export PDFs to DOCX, RTF, TXT, etc. correctly if they contain different fonts, encodings, and languages written in different Unicode scripts.
E.g. various combinations of Hebrew + English and other special characters.
The exported document is practically broken.

2 replies

Joel Cherney
Community Expert
June 13, 2021

It is possible to make a PDF that exports successfully to .docx and results in a pretty much undamaged document. That's certainly not true of all of the ways to make a bilingual Hebrew/English document, or even many of them. Usually, when I'm asked to extract multilingual text from a PDF, something comes out broken. However, there are lots of tricks to try that occasionally work. I have in the past flattened PDFs with Hebrew text in order to rasterize the text, and then OCRed the result. This is a dirty trick and I don't necessarily recommend it, but it's worked in the past. Opening PDFs with Illustrator has worked for me in some cases. Saving as HTML worked once when nothing else did.

 

But sometimes, the target formats exported by Acrobat are just broken. Much depends on the tools & methods used to create the PDF in the first place. 
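When comparing several export attempts, it can help to measure how damaged each result is rather than eyeballing it. The following is a minimal Python sketch, assuming the export was saved as plain text and read into a string. It counts Hebrew letters, Latin letters, Unicode replacement characters (U+FFFD), and Private Use Area characters, which often show up when a font's character mapping did not survive. The function names and the 5% threshold are illustrative assumptions, not part of Acrobat or any Adobe tool.

```python
import unicodedata
from collections import Counter


def classify(ch: str) -> str:
    """Put one character into a coarse bucket (hypothetical helper)."""
    if ch == "\ufffd":
        # U+FFFD marks a character the converter could not decode.
        return "replacement"
    category = unicodedata.category(ch)
    if category == "Co":
        # Private Use Area: often a sign of a missing font-to-Unicode mapping.
        return "private_use"
    if category.startswith("L"):
        name = unicodedata.name(ch, "")
        if name.startswith("HEBREW"):
            return "hebrew"
        if name.startswith("LATIN"):
            return "latin"
        return "other_letter"
    return "other"


def summarize(text: str) -> Counter:
    """Count how many characters fall into each bucket."""
    return Counter(classify(ch) for ch in text)


def looks_damaged(text: str, threshold: float = 0.05) -> bool:
    """Flag text that is empty or has too many undecodable characters.

    The 5% threshold is an arbitrary assumption; tune it for your files.
    """
    counts = summarize(text)
    total = sum(counts.values())
    if total == 0:
        return True  # an export with no text at all counts as damaged
    bad = counts["replacement"] + counts["private_use"]
    return bad / total > threshold
```

To use it, read each exported file with `open(path, encoding="utf-8")` and pass the contents to `summarize` and `looks_damaged`. An export of a Hebrew/English PDF that reports zero Hebrew letters, or a high share of replacement characters, is probably one of the broken ones. Note that this only detects lost or undecodable characters; it does not check whether right-to-left text came out in the correct reading order.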

 

 

Kamil5C2DAuthor
Known Participant
June 14, 2021

PDF is a very bad format!

Why is there a PDF format if it is not fully compatible with other formats?
I have yet to find any tools in the 21st century, in 2021, that would always correctly convert PDF formats (1:1).

Joel Cherney
Community Expert
June 16, 2021
quote

PDF is a very bad format!

Why is there a PDF format if it is not fully compatible with other formats?
I have yet to find any tools in the 21st century, in 2021, that would always correctly convert PDF formats (1:1).


By @Kamil5C2D

 

Well, I feel weird defending the PDF format (not really my job!) but I'd say that I've been making multilingual PDFs since the 20th century, so I'm well aware that nothing will always successfully convert PDF formats. Ever heard the saying "You can't unscramble an egg"?  

 

Some conversions work fine; if you start with a well-made .docx file and turn it into a .pdf, you can usually extract a decent .docx from the .pdf and expect reasonable fidelity. However, if someone gives me a PDF that was made twenty years ago by printing PostScript from PageMaker and running the resultant .ps through Distiller, I don't expect Acrobat to be able to extract a .docx from it that displays any degree of fidelity to the source PageMaker file. That makes sense, right? How about architectural models spat out of AutoCAD? Urdu newspapers laid out in InPage? Would you expect those to convert perfectly to Word file format?

 

If you post one of your PDFs, I'd be happy to try three or four different conversion techniques to see which one produces the most usable export, and I'll share the tricks I used to pull it off. 

 

 

ls_rbls
Community Expert
June 12, 2021

I am not sure if there is full Unicode support for Hebrew in Adobe Acrobat... I may be wrong.

 

 

 

Kamil5C2DAuthor
Known Participant
June 12, 2021

You mean it will never convert a PDF? Because all converted files are damaged or empty (no text!)!
Is Adobe then the most useless tool in this case?!

ls_rbls
Community Expert
June 12, 2021

I didn't mean anything in particular, and I didn't imply or say in any way that Acrobat will never convert a file to PDF.

 

Acrobat Pro does convert files to PDF, but if you're having problems with the Unicode character maps of different languages, it's possible that the Hebrew language (in particular) is not supported.

 

If that is the case, then you can still produce a PDF by saving the combined documents as an image file and then converting the image to PDF.

 

The images won't have searchable text, but at least you'll be able to produce a PDF document out of all those combined image files.