Cannot find/replace text in PDF: some glyphs not recognizable
Hello,
I have a very large editable PDF that needs to be updated (replacing product reference codes on over 1,000 pages...).
I used qpdf to extract the streams, and I was able to easily change about a third of the document with sed, since some of the text was stored as plain ASCII. But the rest of the content is not easily readable, and my last resort is to automate mouse and keyboard actions to find and replace each reference code. Unfortunately, some content is not found by Acrobat Pro DC even though I can select each character; if I try to copy/paste that content, it is not recognized. I was able to see that the PDF was created with InDesign on a Mac; unfortunately, I cannot get the original file used to create it, and I am on Windows. The font used is Century Gothic. It is probably encoded differently than the copy I have on my computer.
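The qpdf + sed approach works because a replacement of exactly the same byte length leaves the PDF's cross-reference offsets and stream /Length values intact. A minimal sketch of the same idea in Python, stdlib only; the stream content and reference codes here are invented for illustration:

```python
import zlib

def safe_replace(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace old with new, refusing length changes that would
    corrupt the PDF's xref offsets and stream /Length entries."""
    if len(old) != len(new):
        raise ValueError("replacement must have the same byte length")
    return data.replace(old, new)

# A FlateDecode content stream must first be inflated before its
# text operators are readable; after editing, deflating it again
# usually changes /Length, which is why editing qpdf's --qdf
# (uncompressed) output and running fix-qdf afterwards is safer.
raw = zlib.compress(b"(REF-12345) Tj")
text = zlib.decompress(raw)
text = safe_replace(text, b"REF-12345", b"REF-67890")
print(text)  # b'(REF-67890) Tj'
```

Note that this only works for text drawn with a simple, ASCII-like encoding; strings under Identity-H encoding are sequences of glyph IDs, not character codes, which is why sed cannot find them.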
I tried downloading different versions of the font, but no luck. I also exported the PDF to TIFF (600 ppi) and tried OCR, but even at that resolution it fails to recognize most of the text properly. I am stuck.
Does anyone have a solution?
Here is a simplified page extracted from the document, with some reference numbers at the bottom of the page, so you can see the issue:
https://drive.google.com/file/d/1DgnNEgfw2H8M3DqIZpkHhCHDwTNdxqq9/view?usp=sharing
After doing more tests, it turns out that exporting the PDF to a Word document properly converts all the glyphs into editable text. However, now the issue is that all the images in the Word document are low quality.
Is there a way to preserve the image quality when exporting to Word? I checked the settings and could not find anything; all the settings are for converting Word to PDF, not the other way around.
Thanks
I tried to use other software to modify or convert the PDF, and I came across something very interesting that explains why I cannot find/replace some of the content: the same font (Century Gothic) is partially embedded multiple times in the PDF, with different encodings!
Using PdfGrabber I was able to view the details of each font; here are some examples:
Is it possible to extract those fonts from the document and install them on my system? It might allow me to properly find/replace the content.
Well, for anyone interested: I managed to extract the fonts and install them on my system, and it still isn't working.
Does anyone have an idea?
There is no obstacle to using an InDesign file made on a Mac just because you are on Windows. You can do this, and I suggest you do; if you can't, prepare for retyping. PDF editing is a desperate last resort. You have already got much further than most people could ever manage, but really, no.
Thanks for your answer. The thing is, I do not have access to the InDesign file, which is probably lost by now. I have been asked to update this PDF with pretty much nothing (and no one) else to help.
I have definitely realized by now that mass PDF editing is a nightmare if the content is not easily readable (ASCII, for instance), or at least findable with find/replace in Acrobat. I could then automate mouse and keyboard actions to search and replace each string one by one... but if Acrobat cannot find the string, I am stuck!
The thing is, when I convert the PDF to DOCX, all the text is properly detected and then editable, so there's hope!
I could make all the changes in Word and then convert back to PDF, but the quality of the images in the DOCX is poor compared to the PDF, and I already know that 'they' won't be happy with the end result.
So maybe I should be looking for a way to improve the quality of the images exported to the DOCX. Any suggestions?
Quick follow-up on the images: I changed the .docx extension to .zip and decompressed it, which gave me access to all the images in the DOCX. I then used Acrobat to export all the images contained in the PDF, hoping to replace the ones from the DOCX with the ones from the PDF. Unfortunately, this is not possible: the number of images is completely different, because during the conversion from PDF to DOCX the engine takes some kind of screenshot of sections of each page and cuts out parts of images. So, back to square one. Either I find another way to improve the image quality in the generated DOCX, or I find a way to fix the glyphs / multiple font encodings in the PDF.
I read here https://community.adobe.com/t5/Acrobat/PDF-is-blurry-when-inserted-into-Word-document/td-p/7029461 that the conversion from PDF to DOCX might work better on a Mac, since PDF support is built into the OS.
If that's true, I would need to find someone with a Mac and all the required software to convert the PDF to DOCX for me... or install a virtual machine on my computer running macOS Mojave...
Can anyone confirm this?
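For what it's worth, the .docx-is-a-ZIP trick can be scripted. This stdlib-only sketch lists the media files Word stores under word/media/, so you can at least compare counts against Acrobat's exported images; the tiny stand-in document it builds is invented so the example is self-contained:

```python
import io
import zipfile

def list_docx_media(docx_bytes: bytes) -> list[str]:
    """Return the embedded image files in a .docx (which is a ZIP archive)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return [n for n in zf.namelist() if n.startswith("word/media/")]

# Build a minimal stand-in docx in memory for demonstration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>")
    zf.writestr("word/media/image1.png", b"\x89PNG fake image data")
print(list_docx_media(buf.getvalue()))  # ['word/media/image1.png']
```

The same `zipfile` module can also rewrite entries, but as noted above, the exported images rarely map one-to-one onto the PDF's originals.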
For anyone interested: I installed a virtual machine with macOS Mojave and used the trial version of Acrobat Pro DC for Mac. I converted the PDF to DOCX, and the result is almost identical to the one from Windows 10. The pictures might be slightly better, but it's pretty much the same low quality compared to the PDF...
So if you're in my situation, don't waste your time going through the whole process of installing macOS in a VM...
So some of the characters are encoded with Identity-H, and the /ToUnicode map is missing from the PDF.
Is there any way to generate the tables, either automatically or by hand? Since the conversion to DOCX properly recognizes all the text, I think an automated method should be possible.
If you have a PDF editing library (DO NOT HAND-EDIT PDFS!!!) you might add your own generated ToUnicode, but nothing will do it for you. Identity-H means that the glyph IDs (GIDs) in the font are used directly as character codes; a detailed analysis of the font internals might reveal which glyphs they are. Or it might not. Very probably, each subset is different. Nothing is impossible, but this is likely to take much longer than my suggestions, which will themselves certainly take weeks.
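To illustrate the point: if you did manage to recover a GID-to-character mapping for one subset (by inspecting the font's internal tables), generating the ToUnicode CMap text itself is mechanical. This is a sketch with an invented mapping, not the document's real one, and a PDF library would still have to attach the result to the font object as a stream:

```python
def build_tounicode(gid_to_char: dict[int, str]) -> str:
    """Build a minimal ToUnicode CMap mapping the 2-byte glyph IDs
    used by Identity-H to Unicode code points (BMP only here)."""
    entries = "\n".join(
        f"<{gid:04X}> <{ord(ch):04X}>"
        for gid, ch in sorted(gid_to_char.items())
    )
    return (
        "/CIDInit /ProcSet findresource begin\n"
        "12 dict begin\n"
        "begincmap\n"
        "/CMapName /Adobe-Identity-UCS def\n"
        "/CMapType 2 def\n"
        "1 begincodespacerange\n<0000> <FFFF>\nendcodespacerange\n"
        f"{len(gid_to_char)} beginbfchar\n{entries}\nendbfchar\n"
        "endcmap\n"
        "CMapName currentdict /CMap defineresource pop\n"
        "end\nend"
    )

# Invented subset mapping: suppose GIDs 36, 53, 3 draw 'A', 'R', ' '.
cmap = build_tounicode({36: "A", 53: "R", 3: " "})
print("<0024> <0041>" in cmap)  # True: GID 0x24 maps to U+0041 'A'
```

The hard part, as noted above, is discovering that mapping for every subset; the CMap syntax is the easy half.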
Wow, thank you very much for your help. I guess I am already lucky to have access to all the text. I'm going to try to find a way to get better picture quality in the DOCX.
Do you know if it's possible to change a setting in Acrobat Pro DC for this? I just tried another program (PdfGrabber), and the picture quality of its DOCX is a bit better than the one from Acrobat Pro... there must be something I can do to get a better result from Acrobat.
By the way, I did edit the PDF by hand to start with (!) using sed, but I replaced each string with another string containing the same number of characters... so it worked without any issues... as long as the characters were in ASCII...
If you can get the text out, that's a lot of retyping you don't have to do. I suggest looking at all of the other ways you might get the images out, and collating them all, ready to remake the document.
Another route would be to use InDesign (Word would struggle) to make up all the pages with space left for the images, then in Acrobat redact all the text out. Then, in InDesign, place each page (now containing only graphics) and you have recombined the document with all the graphics in situ.
Thanks for your reply. There are so many pictures on each page that it would take as much time to replace each picture as it would to replace all the text 😞 It is a technical document about industrial products, some kind of directory/listing...
Anyway, if the conversion to DOCX manages to get all the text, there should be a way to "fix" this PDF.
Any chance of generating /ToUnicode maps for the fonts encoded with Identity-H?

