Participant
January 1, 2019
Answered

Highlight and select text skipping chunks of text

  • January 1, 2019
  • 2 replies
  • 6159 views

I use Acrobat 9 Pro on Windows 7. Some scanned PDFs (downloaded from a database, for example) behave very strangely when I try to select or highlight text after running OCR. The selection might cover only part of a word or part of a line, or a strip of white space beyond the text that runs off the page. It may skip words or lines, or highlight random blocks of text above and below the cursor, including selections on multiple pages. It seems like Acrobat is recognizing an image on top of or underneath the text, and/or that there is an alignment problem with the recognized text. Is there a way to solve this issue?

Correct answer gary_sc

Hi Alanar,

Actually you are absolutely correct. There is an invisible overlay of the text after OCR has processed the page.

However, just to make sure, copy a paragraph or two and paste that into a word-processing program. There you can verify if all of the words were processed and you can see the quality of the OCR.
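If you would rather check the pasted text with a script than by eye, here is a minimal sketch in Python. It assumes you have already copied the text out of the PDF into a string; the function name `ocr_word_ratio` and the idea of counting "word-like" tokens are my own illustration, not anything built into Acrobat. A low ratio only hints at poor OCR, since numbers and symbols also count against it.

```python
import re

# Hypothetical helper (not part of Acrobat): a rough sanity check
# on text that was copied out of an OCR'd PDF.
WORD_PATTERN = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*$")

def ocr_word_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look like plain words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    good = 0
    for token in tokens:
        # Ignore punctuation wrapped around the token, e.g. "page." or "(scan)".
        core = token.strip(".,;:!?\"'()[]")
        # Letters only, with optional internal apostrophes or hyphens.
        if WORD_PATTERN.match(core):
            good += 1
    return good / len(tokens)
```

For example, calling `ocr_word_ratio` on a paragraph pasted from a clean scan should give a value close to 1, while text full of fragments like "Th3 qu1ck" will score much lower.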

Be advised that the quality of the OCR can vary considerably due to factors beyond the quality of the tool (Acrobat). If the scan is a low-resolution scan, that can affect the OCR. If the scanned page has a lot of texture on it (like bleed-through from the opposite side of the page and/or folds, creases, hole punches, etc.), that can affect the OCR. If the text is very small, that can affect the OCR.

Simply put, while it is absolutely amazing how well it does work, it doesn't take much to throw the OCR process off and lead to disappointing results.

One way to get your best shot at this is to follow the guidelines I put forth in the following blog I wrote for Adobe here:

https://forums.adobe.com/community/creativepipeline/blog/2018/01/22/scanning-clean-search-able-pdfs

Let us know if this solves your question.

2 replies

Participant
January 1, 2019

I solved my own issue by "examining the document" under the DOCUMENTS tab and selecting "Remove invisible text". So excited to have this skill in my toolbox. I'll leave this thread up in case others have the same issue.

gary_sc
Community Expert
January 1, 2019

Hi Alanar,

Hmmm, I do not think you want to do that as THAT TEXT is the OCR text.

While the selection looks bizarre when you select text, if you remove that layer you've removed the searchable text.

For what reason do you think you want to (or should) remove this text?

Participant
January 1, 2019

It worked like a charm! The problem is that with some PDF articles downloaded from academic databases, such as JSTOR, the document already has recognizable text, but it doesn't align with the visible text. When I chose to view the hidden text in the editor, it showed me two overlapping versions of the text, one of which was halfway off the page. After I deleted this and then ran OCR, I had text that can be selected normally. I can upload a screenshot later if you're interested, but this will make my work so much faster and less frustrating after an initial 5 minutes invested to basically clear and then rerun OCR.
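For anyone who wants to spot this kind of misaligned layer before deleting anything, here is a rough sketch in Python. It assumes you have already obtained bounding boxes for the hidden text from some PDF tool of your choice (that step is not shown), with coordinates in PDF points and the origin at the bottom-left corner of the page. The names `Box` and `boxes_off_page` are hypothetical.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) in PDF points, origin at the bottom-left of the page.
Box = Tuple[float, float, float, float]

def boxes_off_page(boxes: List[Box], page_width: float, page_height: float) -> List[Box]:
    """Return the text boxes that extend past any edge of the page."""
    off_page = []
    for x0, y0, x1, y1 in boxes:
        # A box is suspicious if any side lies outside the page rectangle.
        if x0 < 0 or y0 < 0 or x1 > page_width or y1 > page_height:
            off_page.append((x0, y0, x1, y1))
    return off_page
```

On a US Letter page (612 x 792 points), a box such as `(500, 700, 800, 712)` would be reported because its right edge runs past the page width. If many boxes are flagged, that is consistent with a hidden text layer that is shifted relative to the scanned image.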
