jctremblay
Community Expert
October 24, 2019
Question

Ligatures in PDF from Postscript


When creating a PDF from PostScript, ligatures like fi, ffi, fl, etc. are mapped in a special way. If you copy text from the resulting PDF and paste it elsewhere, you end up with missing glyphs, special characters, or extra spaces where the ligatures are. You can’t search the text in Acrobat either.

Is there a way to create a PostScript file that correctly embeds and maps the ligatures, so that users can extract or search the text in the PDF?


2 replies

Inspiring
June 30, 2020

Why does Chrome support "U+FB03 : LATIN SMALL LIGATURE FFI" when Adobe Acrobat does not? The same goes for PDF-XChange. See the word "sufficiently" (ffi is a single symbol there) in https://sites.math.rutgers.edu/~zeilberg/mamarim/mamarimPDF/pimeas.pdf. Somehow the ligature also gets copied as U+000E : <control> SHIFT OUT [SO]. Why?
To identify the pasted characters, see https://www.babelstone.co.uk/Unicode/whatisit.html
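One possible explanation, which nothing in this thread confirms: if the PDF was typeset with TeX's classic OT1-encoded Computer Modern fonts (an assumption about the linked file), the ffi ligature sits at character code 0x0E in the font. A viewer that finds no Unicode mapping may pass that raw code through, and it then shows up as U+000E. The Python sketch below only shows how to inspect pasted text; the sample string is made up for illustration.

```python
import unicodedata

def describe(text):
    """Return (code point, general category, name) for each character."""
    rows = []
    for ch in text:
        rows.append((
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            # Control characters have no formal name, so supply a default.
            unicodedata.name(ch, "<unnamed>"),
        ))
    return rows

# Hypothetical clipboard contents: "su" + raw code 0x0E + "ciently"
pasted = "su\x0eciently"
for row in describe(pasted):
    print(row)
```

A control character (category "Cc") in the middle of a word is a strong hint that the viewer copied a raw font code rather than a Unicode value.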

Inspiring
March 31, 2024

In the meantime, PDF-XChange has fixed all its issues with "U+FB03 : LATIN SMALL LIGATURE FFI".

Legend
October 24, 2019

My take is this: when you distill a PDF from PostScript, there is no font or glyph remapping at all. If the PostScript contains a reference to the glyph called /fi, then the PDF has a reference to the glyph called /fi. It displays fine, but extraction is an interesting problem. Essentially, software has two choices:

1. Export as the single glyph /fi. In Unicode this is U+FB01. This is entirely legal and correct, except that many fonts do not have this Unicode glyph, so there will be substitution or a missing character. However, if working with pro fonts, the glyph may be there. It is still confusing for a person who thinks (wrongly) that there are two glyphs. On Mac, Unicode isn't needed, because fi is in the default character set. Coming back to Windows, an app may place both Unicode and non-Unicode text on the clipboard, and could follow choice 2 for the non-Unicode text.

2. Map to the two glyphs "f" and "i". This is arguably wrong, but it is likely to match user expectation more often.
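The two choices can be sketched in Python. This is only an illustration, not how Acrobat or any other viewer actually works: the `LIGATURES` table is a small hand-written subset of the usual glyph-name conventions, and `glyph_to_text` and `extract` are hypothetical helper names. Choice 2 is implemented here with Unicode compatibility normalization (NFKC), which splits the ligature code points back into plain letters.

```python
import unicodedata

# Illustrative subset: standard ligature glyph names -> Unicode code points.
LIGATURES = {
    "ff": "\uFB00",
    "fi": "\uFB01",
    "fl": "\uFB02",
    "ffi": "\uFB03",
    "ffl": "\uFB04",
}

def glyph_to_text(name):
    """Map one glyph name to text, or U+FFFD if the name is not recognised."""
    if name in LIGATURES:
        return LIGATURES[name]
    # Single ASCII letters are named after themselves (/a, /b, ...).
    if len(name) == 1 and name.isascii() and name.isalpha():
        return name
    return "\uFFFD"

def extract(glyph_names, decompose=False):
    """Choice 1 (decompose=False): keep the single ligature code point.
    Choice 2 (decompose=True): split ligatures into their letters."""
    text = "".join(glyph_to_text(n) for n in glyph_names)
    if decompose:
        text = unicodedata.normalize("NFKC", text)
    return text

glyphs = ["o", "ffi", "c", "e"]
print(repr(extract(glyphs)))                  # contains U+FB03
print(repr(extract(glyphs, decompose=True)))  # plain letters only
```

With `decompose=True` the result is searchable as ordinary text, which is why choice 2 tends to match what users expect even though the page really contains one glyph.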

jctremblay
Community Expert
October 24, 2019

Mapping, encoding, or decoding, whatever! Extracting or searching text in PDFs generated from PostScript is definitely an issue.

Legend
October 24, 2019

There's an entirely separate issue: PostScript has no rule that the text is marked with recognised codes. I can write a PostScript file where the letters of the alphabet, instead of being called /a /b /c, are called /fred /barney /wilma. This will show and print beautifully, but no text can be extracted. But this is not specific to PostScript; many other PDF generators will use arbitrary codes.

PDF includes a concept called the "ToUnicode CMap". This is extra information that gives the Unicode value for every glyph. It works well, but most apps don't include it.
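For readers who have not seen one, here is a rough Python sketch that builds the text of a minimal ToUnicode CMap for a simple font with one-byte character codes. The function name `build_tounicode_cmap` and the codes 0x0C and 0x0E are assumptions chosen for illustration; a real PDF producer would also have to store the result as a stream and reference it from the font dictionary's /ToUnicode entry.

```python
def build_tounicode_cmap(mapping):
    """Build ToUnicode CMap text.

    mapping: dict of one-byte character code -> Unicode string.
    A single code may map to several characters, e.g. a ligature to "ffi".
    """
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
    ]
    items = sorted(mapping.items())
    # CMap blocks are conventionally limited to 100 entries each.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            # Destination values are written as UTF-16BE hex strings.
            dst = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:02X}> <{dst}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines)

# Hypothetical font where code 0x0C draws the fi ligature and 0x0E draws ffi.
print(build_tounicode_cmap({0x0C: "fi", 0x0E: "ffi"}))
```

Mapping each ligature code to its component letters, as done here, means that copy and search return plain "fi" and "ffi" regardless of what the glyphs are called in the font.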