Solved

Strange Font Encoding in PDF files

October 26, 2021
5 replies
9325 views

I have received a number of multipage (150+ pages) PDF documents from a client that will require extensive revision. I have discovered that a great deal of the type in these documents is custom encoded, with unusual font names such as MSTT31c750 (Embedded Subset), Type 1, Encoding: Custom. There are a LOT of them, around 80 instances. All the usual tricks to import or extract text, even using the otherwise excellent Markzware PDFMarkz utility, produce "Missing Fonts" for these.

This is where it gets STRANGE. Attempts to replace these with a common font such as Myriad, Arial, or Helvetica produce gibberish text, as does the "default" font. Even copying and pasting the text, or saving as a Word or TXT file, produces gibberish, even when pasting into a text editor. VERY strange, and frustrating. The fairly extreme solution of exporting a page as an image file, creating a new PDF of the page, and running OCR produces copy that would require extensive manual correction.

The originating application seems to be Adobe PageMaker 6.52 / Distiller for Windows 4.0.

My best guess, based on the "font names", is that this is some sort of font encoding DRM/copy protection scheme, or possibly some sort of variable typeface with non-standard encoding. What's really crazy is that these LOOK like fairly common, ordinary fonts... But I need to be able to either edit or extract this copy for the client's revisions. I do realize that having an editor manually retype the entire document may be the eventual (but time-consuming and therefore costly) solution.

Anyone seen anything like this?

Best answer by Brad @ Roaring Mouse

Not a DRM issue.

"Back in the day", TrueType font support in PostScript printers was pretty much non-existent, so fonts like Arial were downloaded by Windows to PostScript printers as a PS-compatible outline. This results in the weird names you are seeing, as the names were created on the fly. Since they were only meant for output, the fonts were also given an abbreviated custom encoding, likewise created on the fly, to handle only the limited subset of characters that would be embedded. Editing a PDF back then was not really a thing (outside of using a program like PitStop), so it didn't matter what the encoding was. Of course, NOW it is an issue, but right now, outside of a few tricks, there's no way to correct/change EXISTING type and edit it the way you want. You should be able to type NEW content in the proper font (e.g. Arial), even in the same line as the old stuff.
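To make the "gibberish" effect concrete, here is a small Python simulation. It is only a sketch: the scheme of handing out codes in order of first use, the starting code 0x21, and all function names are assumptions made for illustration, not the documented behaviour of the Windows driver or Distiller. The point is just that the stored bytes only make sense together with the font's own private table.

```python
# Illustrative simulation only. The sequential code assignment is an
# assumption, not the documented behaviour of any real printer driver.

def build_custom_encoding(text):
    """Assign a byte code to each character in order of first use (hypothetical)."""
    encoding = {}
    next_code = 0x21  # start at '!' so misread codes are still printable
    for ch in text:
        if ch not in encoding:
            encoding[ch] = next_code
            next_code += 1
    return encoding


def encode(text, encoding):
    """Return the bytes a page would store for `text` under this encoding."""
    return bytes(encoding[ch] for ch in text)


def naive_extract(data):
    """Mimic copy/paste with no usable mapping: read the codes as Latin-1."""
    return data.decode("latin-1")


def decode_with_font_table(data, encoding):
    """Recover the text using the font's private code-to-character table."""
    reverse = {code: ch for ch, code in encoding.items()}
    return "".join(reverse[b] for b in data)


original = "Strange Font Encoding"
table = build_custom_encoding(original)
stored = encode(original, table)

print("copy/paste sees:", naive_extract(stored))
print("with the table: ", decode_with_font_table(stored, table))
```

On screen the glyphs still draw correctly because the viewer looks each code up in the embedded subset. Swapping in Arial or Myriad throws that table away, so the same codes land on unrelated characters.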

CosmoStranger

Participant

1. Open *.pdf with Mozilla Thunderbird
2. Print the file from the viewer window to a new *.pdf file

That's it. The new file will be correct.

Tested on Thunderbird Desktop
Version 128.6.0esr | Released January 8, 2025

CosmoStranger

Participant

Also tested in Mozilla Firefox 127.0 (32-bit), released June 11, 2024

SamuraiArtGuy (Author)

Inspiring

Thank you, folks. I had my suspicions as soon as I saw "PageMaker" in the metadata.

The workflow is intended to end up in InDesign. I would be bat guano insane to attempt this editing within Acrobat DC. I discovered the problem using the otherwise excellent Markzware PDFMarkz utility to convert/import the 164-page document. The various options are all varying degrees of tedious, such as bringing in individual pages of the original and pasting the edits over them on a new layer. This approach has its limitations and lacks design flexibility. I can get reasonable OCR from 600 dpi exports (vs. the first attempt at 300 dpi) of individual pages, which I can also use to recover a multitude of individual inline graphics. And bless the Gods of Design, the "Copy with Formatting" feature in Acrobat DC turns out to recover about 90% of the text from individual pages. So we won't need a copy editor to retype all the text from the entire document.

So I think the path of least resistance is to re-create these documents with the full suite of InDesign's layout tools, and re-set the text I can extract, recover, or OCR. Still tedious, but not brutal. And at the end of the day, the client will have a fresh new original document that can be freely revised, which is the right way to do it. Also, Arial can be banished in favor of OpenType versions of Myriad Pro or Helvetica Neue, which offer more typographic flexibility.

Thank you both for your insights and expertise.

Brad @ Roaring Mouse

Community Expert

SamuraiArtGuy:

You may want to consider a different workflow. This sounds like a project far beyond the limited editing abilities of a PDF editor.

Obviously, having the original files wouldn't help much as they are PageMaker (although I could convert them to InDesign for you if they are available), but you could look at placing the existing PDF into a new InDesign document, making your changes to individual pages on overlays, and re-exporting a new PDF. Or you could insert the changed pages back into the existing PDF document, replacing the old ones. In the long run, you will have better flexibility and better control for further changes. You may even try a PDF-to-InDesign converter to attempt to recover something editable.

If you'd care to share a sample document that's particularly troubling, I could suggest some approaches.

Test Screen Name

Legend

This is not at all unusual. It isn't a copy protection scheme, just the accidental fallout from software, fonts and systems older than PDF itself. It's the best part of 20 years since PageMaker itself was discontinued...

There is no way to "repair" the encodings. Indeed, they aren't broken, but just aren't predictable or useful.

SamuraiArtGuy (Author)

Inspiring

I did raise an eyebrow that there is someone out there still using PageMaker 6 in 2017... but hey.

Brad @ Roaring Mouse

Community Expert

I don't think they are. These sound like really old files, especially with the mention of Distiller 4.0. The Document properties should show the original creation date.
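If you would rather check the creation date without opening each file in Acrobat, a rough Python sketch like the one below can pull it from the raw bytes. It assumes the Info dictionary is stored uncompressed and unencrypted with a literal `/CreationDate (D:...)` string, which I would expect for files of this age but have not verified on yours. The function name and the sample bytes are made up for illustration.

```python
import re
from datetime import datetime

# PDF date strings look like D:YYYYMMDDHHmmSS, optionally followed by a
# timezone offset. Everything after the year is optional.
DATE_RE = re.compile(
    rb"/CreationDate\s*\(D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
)


def find_creation_date(pdf_bytes):
    """Return the first /CreationDate found as a datetime, or None."""
    match = DATE_RE.search(pdf_bytes)
    if match is None:
        return None
    year = int(match.group(1))
    # Missing month/day default to 1; missing time fields default to 0.
    defaults = (1, 1, 0, 0, 0)
    rest = [int(g) if g else d for g, d in zip(match.groups()[1:], defaults)]
    return datetime(year, *rest)


# Hypothetical sample standing in for the bytes of a real file.
sample = b"<< /Producer (Acrobat Distiller 4.0 for Windows) /CreationDate (D:20010315093000) >>"
print(find_creation_date(sample))

# For a real file (the filename is a placeholder):
# with open("client.pdf", "rb") as f:
#     print(find_creation_date(f.read()))
```

The timezone suffix is ignored here, so treat the result as local time on whatever machine created the file.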

