New Participant
December 2, 2021
Question

Searching text in a PDF


When copying and pasting from Adobe Acrobat to Word (or email), some of the words do not transfer correctly.

 

For example, when we search the attached PDF (which is OCR’d and has editable text) for the word ‘compressors’, it does not find it.  We are aware of the different font used by Adobe.  However, it is odd that the word looks fine only in the PDF.

 

This gives us the impression that if we were to search documents using the find function, any particular word may be overlooked in the search, even though the word looks fine in the PDF document.

 

Unfortunately this is not helpful when searching a 150-page contract for items that may then be overlooked.

 

Does anyone know if there is a solution to this, please?


2 replies

gary_sc
Adobe Expert
December 2, 2021

Let me add to Eric's answer: the results of OCR are problematic at best. I've been doing OCR for some 25–30 years and things have gotten better, but the results still often leave much to be desired. Things that can cause poor-quality results include: very tiny text, ligatures (as mentioned by Eric), unclear text (photocopies are a good example), text bleeding through from the other side of the scanned page, and certain letter combinations, such as "ui" being read as "m."

 

People often expect too much of the results, for example when the text curves over near the binding because a page was scanned from a book.

 

The reason it "LOOKS" fine in the OCRed PDF is that Acrobat can take the recognized text and put it on a different layer, and the text you actually select is invisible. So the page looks great, but the actual text could be a mess. If you're looking for "compressors," which you can clearly see in front of you, it might very well be something altogether different in the OCRed text.
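
One rough way to check for this is to copy the text out of the PDF and look for words that are close to, but not exactly, the word you expect. The sketch below uses Python's standard `difflib` module for that. The function name `near_matches`, the 0.8 cutoff, and the sample sentence are all assumptions made for illustration; none of this is a feature of Acrobat.

```python
import difflib
import re

def near_matches(text, word, cutoff=0.8):
    """Return words in `text` that resemble `word` without matching it exactly."""
    # Split the text into lowercase words made of letters only.
    tokens = sorted(set(re.findall(r"[^\W\d_]+", text.lower())))
    candidates = difflib.get_close_matches(word.lower(), tokens, n=5, cutoff=cutoff)
    # Exact hits are not interesting here; we want the near-misses.
    return [c for c in candidates if c != word.lower()]

# Hypothetical text copied out of an OCR'd PDF: it reads as "compressors" on
# screen, but the hidden text holds a misrecognised spelling.
sample = "Supply and install two cornpressors and one spare filter."
print(near_matches(sample, "compressors"))
```

If this prints a near-miss such as a misspelled variant, searching for that variant in Acrobat may locate the passage that the exact search skipped. A lower cutoff catches worse OCR errors but also returns more unrelated words.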

 

One thing that can help is to get as good a copy as possible from the original scan. This can be a challenge because many scanners are now being released with scanning software that gives you no option to fine-tune the quality of the scan. Rather, they give you options such as "Text," "Pictures," "Text and Pictures," and that's about it. I guess learning how to do a good scan was deemed too much of a challenge for companies. If you are fortunate enough to have reasonable-quality scanning software with your scanner, this blog I wrote for Adobe some time back may be of assistance.

 

http://photosbycoyne.com/Gary's_Help/Scanning/clean-scanning.html

 

Good luck!

New Participant
December 3, 2021

Thank you so much Gary - this is extremely helpful

Eric Dumas
Adobe Expert
December 2, 2021

You should be able to highlight the OCR'd text and see if some words were broken into separate text frames. This happens sometimes when creating a PDF or running OCR on a scan. Sadly, we have no control over where the 'text frames' start and stop. Sometimes it is a real mess.

 

In the past, I encountered an issue where ligatures in the font generated characters that were not recognised or simply vanished, like ff linked into a single character.

 

I never found a way around this.
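
For the specific case where a ligature survives as a single Unicode character (for example "ff" stored as U+FB00), a partial workaround outside Acrobat is to normalise the extracted text before searching it. This is only a sketch using Python's standard `unicodedata` module; the helper names and the sample sentence are hypothetical, and it assumes you have already copied or exported the text from the PDF.

```python
import unicodedata

def normalize_for_search(text):
    """Expand compatibility characters such as ligatures, then fold case."""
    return unicodedata.normalize("NFKC", text).casefold()

def contains_word(text, word):
    """Substring search that treats a ligature and its plain letters as equal."""
    return normalize_for_search(word) in normalize_for_search(text)

# Hypothetical extracted text where "ff" was stored as the single ligature
# character U+FB00, so a plain search for "offices" misses it.
extracted = "Head o\ufb00ices are listed in Schedule 2."
print("offices" in extracted)               # False: plain substring search
print(contains_word(extracted, "offices"))  # True: search after NFKC
```

This does not help when the ligature vanished altogether or was mapped to an unrelated character, which is the situation described above.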

New Participant
December 2, 2021

Thank you so much Eric. This is really helpful.

Much appreciated