Ligatures in PDF from PostScript

October 24, 2019

When creating a PDF from PostScript, ligatures like fi, ffi, fl, etc. are mapped in a special way. If you copy text from the resulting PDF and paste it elsewhere, you end up with missing glyphs, special characters, or extra spaces where the ligatures are. You can’t search the text in Acrobat either.

Is there a way to create PostScript files that correctly embed and map the ligatures, so that users can extract text from or search the resulting PDF?

This topic has been closed for replies.

ZBalling

Inspiring

Why does Chrome support "U+FB03 : LATIN SMALL LIGATURE FFI" when Adobe Acrobat does not? The same goes for PDF-XChange. See "suﬃciently" (ﬃ is a single character here). Also, it somehow gets copied as "U+000E : <control> SHIFT OUT [SO]". Why?
The example is here: https://sites.math.rutgers.edu/~zeilberg/mamarim/mamarimPDF/pimeas.pdf. Also look at https://www.babelstone.co.uk/Unicode/whatisit.html

ZBalling

Inspiring

Meanwhile, PDF-XChange has fixed all its issues with "U+FB03 : LATIN SMALL LIGATURE FFI".

Test Screen Name

Legend

My take is this: when you distill a PDF from PostScript, there is no font or glyph remapping at all. If the PostScript contains a reference to the glyph called /fi, then the PDF has a reference to the glyph called /fi. It displays fine, but extraction is an interesting problem. Essentially, software has two choices:

1. Export as the single glyph /fi. In Unicode this is U+FB01. This is entirely legal and correct, except that many fonts do not have a glyph for this Unicode character, so there will be substitution or a missing character. However, with pro fonts the glyph may be there. It is still confusing for a person who thinks (wrongly) that there are two glyphs. On Mac, Unicode isn't needed, because fi is in the default character set. On Windows, an app may place both Unicode and non-Unicode text on the clipboard, and could follow choice 2 for the non-Unicode text.

2. Map to the two glyphs "f" and "i". This is arguably wrong, but it is likely to match user expectation more often.
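The two choices above can be illustrated with a short Python sketch. This is not how Acrobat or any other viewer does it; the function names are made up for the example, and it only assumes that the extracted text already contains the Unicode ligature characters (U+FB00 to U+FB06). Choice 2 relies on Unicode compatibility decomposition from the standard `unicodedata` module.

```python
import unicodedata

# Latin ligatures in the Alphabetic Presentation Forms block:
# ff, fi, fl, ffi, ffl, long s-t, st
LATIN_LIGATURES = range(0xFB00, 0xFB07)


def export_as_single_glyph(text: str) -> str:
    """Choice 1: leave ligature characters such as U+FB01 untouched."""
    return text


def export_as_letters(text: str) -> str:
    """Choice 2: expand each Latin ligature into its component letters."""
    return "".join(
        # NFKD applies the compatibility decomposition, e.g. U+FB01 -> "f" + "i"
        unicodedata.normalize("NFKD", ch) if ord(ch) in LATIN_LIGATURES else ch
        for ch in text
    )


extracted = "su\ufb03ciently \ufb01ne"  # contains U+FB03 (ffi) and U+FB01 (fi)
print("ffi" in export_as_single_glyph(extracted))  # False: a search for "ffi" misses
print(export_as_letters(extracted))                # sufficiently fine
```

With choice 1 a plain search for "ffi" fails even though the text is technically correct; with choice 2 the search succeeds, which is why it tends to match user expectation.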

jctremblay

Author

Community Expert

Mapping, encoding, or decoding, whatever! Extracting or searching text in a PDF generated from PostScript is definitely an issue.

Test Screen Name

Legend

There's an entirely separate issue that PostScript has no rule that the text is marked with recognised codes. I can write a PostScript file where the letters of the alphabet, instead of being called /a /b /c are called /fred /barney /wilma. This will show and print beautifully, but no text can be extracted. But this is not a PostScript issue; many other PDF generators will use arbitrary codes.
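To make the glyph-name problem concrete, here is a rough Python sketch of how an extractor might turn glyph names into text. The lookup table is a tiny hand-written subset in the spirit of the Adobe Glyph List, and the `uniXXXX` handling follows the common glyph-naming convention; the function name and the table are assumptions for illustration, not any real tool's implementation.

```python
import re
from typing import Optional

# Tiny illustrative subset of well-known glyph names (not a complete list).
KNOWN_GLYPH_NAMES = {
    "a": "a", "b": "b", "c": "c", "f": "f", "i": "i", "l": "l",
    "ff": "\ufb00", "fi": "\ufb01", "fl": "\ufb02",
    "ffi": "\ufb03", "ffl": "\ufb04",
}

# Names like "uniFB01" or "uni00660069": "uni" plus groups of 4 uppercase hex digits.
UNI_NAME = re.compile(r"uni((?:[0-9A-F]{4})+)")


def glyph_name_to_text(name: str) -> Optional[str]:
    """Return the text for a glyph name, or None if the name means nothing to us."""
    if name in KNOWN_GLYPH_NAMES:
        return KNOWN_GLYPH_NAMES[name]
    match = UNI_NAME.fullmatch(name)
    if match:
        digits = match.group(1)
        return "".join(
            chr(int(digits[i:i + 4], 16)) for i in range(0, len(digits), 4)
        )
    return None  # e.g. /fred, /barney, /wilma: nothing to extract


for name in ["fi", "uniFB03", "fred"]:
    print(name, repr(glyph_name_to_text(name)))
```

A name like /fred falls through every rule, so the glyph still draws correctly but contributes nothing to copy or search.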

PDF includes a concept called the "ToUnicode CMap". This is extra information that gives the Unicode value for every glyph. It works well, but most apps don't include it.
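As a rough idea of what that extra information looks like, the Python sketch below builds only the `beginbfchar` ... `endbfchar` part of a ToUnicode CMap, which pairs a character code with its Unicode text in UTF-16BE hex. The surrounding CMap boilerplate is left out, the function name is made up, and the codes 0x0C and 0x0E are hypothetical single-byte codes chosen for the example. Note that one code may map to several characters, so a ligature can be declared as "f" + "i" directly.

```python
from typing import Dict


def bfchar_section(code_to_text: Dict[int, str]) -> str:
    """Build the bfchar block of a ToUnicode CMap for single-byte codes."""
    if len(code_to_text) > 100:
        # CMap blocks are limited to 100 entries each; splitting into
        # several blocks is left out of this sketch.
        raise ValueError("split the mapping into blocks of at most 100 entries")
    lines = [f"{len(code_to_text)} beginbfchar"]
    for code, text in sorted(code_to_text.items()):
        utf16_hex = text.encode("utf-16-be").hex().upper()
        lines.append(f"<{code:02X}> <{utf16_hex}>")
    lines.append("endbfchar")
    return "\n".join(lines)


# Hypothetical font encoding: code 0x0C draws the fi ligature, 0x0E draws ffi.
print(bfchar_section({0x0C: "fi", 0x0E: "ffi"}))
```

With a mapping like this in the PDF, a viewer can show the single ligature glyph and still hand back the letters f, f, i for copy and search.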
