Multilanguage/Multifonts/FULL UNICODE NOT SUPPORTED(?)

June 11, 2021
2 replies
2526 views

Cannot export PDFs to DOCX, RTF, TXT, etc. correctly if they contain different fonts, encodings, and different Unicode languages.
E.g., various mixes of Hebrew + English and other special characters.
The exported document is practically broken.

Joel Cherney

Community Expert

It is possible to make a PDF that could successfully export to .docx and result in a pretty much undamaged document. That's certainly not true of all of the ways to make a bilingual Hebrew/English document, or even many of them. Usually, when I'm asked to extract multilingual text from a PDF, there's something broken. However, there are lots of tricks to try that occasionally work. I have in the past flattened PDFs with Hebrew text in order to rasterize the text, and then OCRed the result. This is a dirty trick and I don't necessarily recommend it, but it's worked in the past. Opening PDFs with Illustrator has worked for me in some cases. Saving as HTML worked once when nothing else did.

But sometimes, the target formats exported by Acrobat are just broken. Much depends on the tools & methods used to create the PDF in the first place.
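
When comparing several conversion tricks, it can help to measure how damaged each export is rather than eyeballing it. Below is a minimal Python sketch, assuming you have already read the exported text into a string. The function names and the 5% threshold are illustrative choices of mine, not anything Acrobat provides. It counts Hebrew and Latin characters and flags private-use or U+FFFD replacement characters, which often show up when a font's glyphs could not be mapped back to Unicode.

```python
import unicodedata
from collections import Counter


def classify_char(ch: str) -> str:
    """Bucket one character into a coarse diagnostic category."""
    if ch == "\ufffd":
        return "replacement"          # U+FFFD: decoder gave up on this character
    cat = unicodedata.category(ch)
    if cat == "Co":
        return "private_use"          # no standard meaning outside the source font
    if cat.startswith("Z") or ch in "\t\r\n":
        return "whitespace"
    if cat == "Cc":
        return "control"
    name = unicodedata.name(ch, "")
    if name.startswith("HEBREW"):
        return "hebrew"
    if name.startswith("LATIN"):
        return "latin"
    return "other"                    # digits, punctuation, other scripts


def summarize(text: str) -> Counter:
    """Count how many characters fall into each category."""
    return Counter(classify_char(ch) for ch in text)


def looks_damaged(text: str, threshold: float = 0.05) -> bool:
    """Heuristic only: True if the text is empty or has many suspicious characters.

    The 5% default threshold is an arbitrary assumption; tune it for your files.
    """
    counts = summarize(text)
    visible = sum(n for kind, n in counts.items() if kind != "whitespace")
    if visible == 0:
        return True                   # export produced no text at all
    suspicious = counts["replacement"] + counts["private_use"] + counts["control"]
    return suspicious / visible > threshold


if __name__ == "__main__":
    sample = "Hello \u05e9\u05dc\u05d5\u05dd"   # "Hello" plus a Hebrew word
    print(summarize(sample))
    print(looks_damaged(sample))
```

Running each export through `looks_damaged` gives a rough way to rank them. It will not catch every problem (for example, Hebrew that comes out in reversed order still counts as valid Hebrew), so it is a first filter, not a verdict.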

Kamil5C2D (Author)

Known Participant

PDF very bad format!

Why is there a PDF format if it is not fully compatible with other formats?
I have yet to find any tools in the 21st century, in 2021, that would always correctly convert PDF formats (1:1)

Joel Cherney

Community Expert

PDF very bad format!
Why is there a PDF format if it is not fully compatible with other formats?
I have yet to find any tools in the 21st century, in 2021, that would always correctly convert PDF formats (1:1)

By @Kamil5C2D

Well, I feel weird defending the PDF format (not really my job!) but I'd say that I've been making multilingual PDFs since the 20th century, so I'm well aware that nothing will always successfully convert PDF formats. Ever heard the saying "You can't unscramble an egg"?

Some conversions work fine; if you start with a well-made .docx file and turn it into a .pdf, you can usually extract a decent .docx from the .pdf and expect reasonable fidelity. However, if someone gives me a PDF that was made twenty years ago by printing PostScript from PageMaker and running the resultant .ps through Distiller, I don't expect Acrobat to be able to extract a .docx from it that displays any degree of fidelity to the source PageMaker file. That makes sense, right? How about architectural models spat out of AutoCAD? Urdu newspapers laid out in InPage? Would you expect those to convert perfectly to Word file format?

If you post one of your PDFs, I'd be happy to try three or four different conversion techniques to see which one produces the most usable export, and I'll share the tricks I used to pull it off.

ls_rbls

Community Expert

I am not sure if there is full Unicode support for Hebrew in Adobe Acrobat... I may be wrong.

Kamil5C2D (Author)

Known Participant

You mean it will never convert a PDF? Because all converted files are damaged or empty (no text!)!
Then Adobe is the most useless tool in this case?!

ls_rbls

Community Expert

I didn't mean anything in particular, nor did I imply or say in any way that Acrobat will never convert a file to PDF.

Acrobat Pro does convert files to PDF, but if you're having problems with the Unicode character maps of different languages, it is possible that Hebrew (in particular) is not supported.

If that is the case, then you can still produce a PDF by saving the combined documents as image files and then converting the images to PDF.

The images won't have searchable text, but at least you'll be able to produce a PDF document out of all those combined image files.
