Participating Frequently
November 5, 2019
Question

Cannot find/replace text in pdf: some glyphs not recognizable

Hello,

I have a very large modifiable PDF that needs to be updated (replace product reference codes on over 1000 pages...).
I used qpdf to extract the streams and was able to change about one third of the document easily with sed, since some of the text was in ASCII. But the rest of the content is not easily readable, and my last resort is to automate mouse and keyboard actions to find and replace each reference code. Unfortunately, some content is not found by Acrobat Pro DC even though I can select each character, and if I try to copy/paste the content it is not recognized. I was able to see that the PDF was created with InDesign on a Mac; unfortunately I cannot get the original file used to create it, and I am on Windows. The font used is Century Gothic. It is probably encoded differently from the one I have on my computer.
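
For readers following along, the qpdf and sed step described above might look roughly like the sketch below, written in Python instead of sed. It assumes the file was first rewritten in qpdf's editable QDF mode (for example `qpdf --qdf --object-streams=disable in.pdf out.pdf`) so that content streams are uncompressed. The function name `replace_codes`, the sample stream, and the reference codes are hypothetical. The sketch only touches literal strings written as `(...)`; hex strings such as `<0024...>` are left alone, which is consistent with only part of the document being editable this way.

```python
import re
from typing import Mapping

# Matches a PDF literal string such as (Ref AB-1001), allowing backslash
# escapes but not nested unescaped parentheses (a simplification).
LITERAL_STRING = re.compile(rb"\((?:\\.|[^\\()])*\)")


def replace_codes(stream: bytes, replacements: Mapping[bytes, bytes]) -> bytes:
    """Swap old reference codes for new ones inside literal strings only."""

    def fix(match):
        text = match.group(0)
        for old, new in replacements.items():
            text = text.replace(old, new)
        return text

    return LITERAL_STRING.sub(fix, stream)


# Hypothetical content stream: one ASCII string and one hex-encoded string.
sample = b"BT /F1 9 Tf (Ref AB-1001) Tj <002400250010> Tj ET"
result = replace_codes(sample, {b"AB-1001": b"CD-2001"})
print(result.decode("latin-1"))
```

Two caveats: a code that is split across several strings in a kerned `[(AB-1) -20 (001)] TJ` array will not match, and if a replacement changes the length of a stream, the QDF file needs its lengths and offsets repaired afterwards (qpdf ships a `fix-qdf` helper for that).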

 

I tried downloading different versions of the font, but no luck. I also exported the PDF to TIFF (600 ppi resolution) and tried OCR, but even at that resolution it fails to recognize most of the text properly. I am stuck.
Does anyone have a solution?

 

This topic has been closed for replies.

3 replies

Legend
November 6, 2019

If you can get the text out, that's a lot of retyping you don't have to do. I suggest looking at all of the other ways you might get the images out, and collate them all, ready to remake the document.

Another route would be to use InDesign (Word would struggle) to make up all the pages with space for images, then in Acrobat redact all the text out. Now in InDesign, place each page (now containing only graphics) and you have recombined with all the graphics in situ.

Participating Frequently
November 6, 2019

Thanks for your reply. There are so many pictures on each page that replacing each picture would take as much time as replacing all the text 😞 It is a technical document about industrial products, a kind of directory/listing...

Anyways, if the conversion to docx manages to get all the text there should be a way to "fix" this pdf.

Is there any chance of generating /ToUnicode maps for the fonts encoded in Identity-H?
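
As a rough illustration of what such a map contains, here is a minimal Python sketch that builds the text of a ToUnicode CMap for 2-byte codes, which is what Identity-H uses. The function name `build_tounicode_cmap` and the code-to-character pairs are hypothetical: for this document the real pairs would have to be recovered separately for each embedded subset, and the sketch does not show how to attach the resulting stream to a font dictionary.

```python
from typing import Mapping


def build_tounicode_cmap(cid_to_text: Mapping[int, str]) -> str:
    """Return the text of a ToUnicode CMap for 2-byte character codes."""
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <FFFF>",
        "endcodespacerange",
    ]
    entries = sorted(cid_to_text.items())
    # The CMap format allows at most 100 entries per bfchar section.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for cid, text in chunk:
            # Destination values are written as UTF-16BE hex.
            target = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{cid:04X}> <{target}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical mapping, for illustration only.
cmap_text = build_tounicode_cmap({0x0003: " ", 0x0024: "A", 0x0025: "B"})
print(cmap_text)
```

With a map like this attached to each font, viewers have what they need to turn the 2-byte codes back into searchable text; the hard part is working out the correct pairs for each subset.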

Participating Frequently
November 6, 2019

I tried to use other software to modify or convert the PDF and came across something very interesting that explains why I cannot find/replace some of the content: the same font (Century Gothic) is partially embedded multiple times in the PDF, with different encodings!
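
A similar check can be sketched without a dedicated tool. Subset fonts in a PDF carry a six-letter tag before the font name (for example `/ABCDEF+CenturyGothic`), so scanning an uncompressed QDF copy of the file for `/BaseFont` entries gives a rough list of how many subsets of each font are embedded. The function name `group_subsets`, the sample bytes, and the tags below are hypothetical; this is a text scan under those assumptions, not a full PDF parser.

```python
import re
from collections import defaultdict

# A subset font name is a six-uppercase-letter tag, a plus sign, then the
# PostScript name, e.g. /ABCDEF+CenturyGothic.
SUBSET_NAME = re.compile(rb"/BaseFont\s*/([A-Z]{6})\+([^\s/<>\[\]()]+)")


def group_subsets(pdf_bytes: bytes) -> dict:
    """Map each base font name to the sorted subset tags found for it."""
    groups = defaultdict(set)
    for tag, name in SUBSET_NAME.findall(pdf_bytes):
        groups[name.decode("latin-1")].add(tag.decode("latin-1"))
    return {name: sorted(tags) for name, tags in groups.items()}


# Hypothetical fragment of a QDF file with two subsets of the same font.
sample = (
    b"<< /Type /Font /Subtype /Type0 /BaseFont /ABCDEF+CenturyGothic "
    b"/Encoding /Identity-H >>\n"
    b"<< /Type /Font /Subtype /TrueType /BaseFont /GHIJKL+CenturyGothic "
    b"/Encoding /WinAnsiEncoding >>\n"
)
print(group_subsets(sample))
```

Each subset can use its own character codes for the same visible glyph, so a single search string cannot match text set in all of them.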

Using PdfGrabber I was able to visualize the details of each font, here are some examples:

 

Is it possible to extract those fonts from the document and install them on my system? It might allow me to find/replace the content properly.

Participating Frequently
November 6, 2019

Well, for anyone interested: I managed to extract the fonts and install them on my system, and it still isn't working.

 

Does anyone have an idea?

Legend
November 6, 2019

There is no obstacle to using an InDesign file made on a Mac, just because you are on Windows. You can do this. I suggest you do this, and if you can't, prepare for retyping. PDF editing is a desperate last resort, and you have got much further than most people could ever manage, but really, no.

Participating Frequently
November 5, 2019

Here is a simplified page extracted from the document with some reference numbers at the bottom of the page so you can see:

https://drive.google.com/file/d/1DgnNEgfw2H8M3DqIZpkHhCHDwTNdxqq9/view?usp=sharing

Participating Frequently
November 6, 2019

After doing more tests, it turns out that exporting the PDF to a Word document properly converts all the glyphs into editable text. However, the issue now is that all the images in the Word document are low quality.

 

Is there a way to preserve the image quality when exporting to Word? I checked the settings and could not find anything. All the settings are for converting a Word document to PDF, not the other way around.

 

Thanks