Make scanned document searchable

Report · Jun 14, 2018

Sometimes when searching for documents online I come across a scanned document, maybe a historic or handwritten document hundreds of years old, where the search has found the text I was looking for and the page it is on. There is often a menu showing how many times the text appears in the document and letting me move quickly backwards and forwards between the occurrences.

It seems the document contains the images of the original pages as well as the OCR'd text, and somehow each OCR'd word knows which part of the original image it came from, because when you search for a word, it is found and highlighted in the original scan.

I now have photocopies of a lot of historic documents (letters written by my great-great-grandfather) that I would like to do this with. I would like the text to be searchable, but I also want the original page images kept, with found words highlighted in them.

Can this be done? How?

What software do I need to do this?

Does PDF have the ability to store the images and the text and the relationship between the two to enable this? If not, what format does allow this?

Many thanks - Rowan

Report · Jun 22, 2018

It's quite hard to recognize words accurately in this kind of document. OCR on handwritten documents is a tedious task.

But Acrobat provides a feature (Suspect Correction) for this kind of thing, where you can correct the text if something is recognized incorrectly.

Run OCR (text recognition) on the document:
- Go to Tools and select the "Enhance Scan" tool.
- Open the "Recognize Text" drop-down menu and click the "In This File" option. Then click "Settings" and select "Searchable Image Exact".
- Click the "Recognize Text" button on the third-level toolbar that appears.
Now perform suspect correction:
- Once all the text has been recognized, go to "Enhance Scan" > "Recognize Text" > "Correct Recognized Text".
- Acrobat shows a red box around every word it has doubts about. You can correct these words from the third-level toolbar.
- There is also a "Review Recognized Text" checkbox, which shows you everything Acrobat has recognized.
- You can even create a new suspect by double-clicking any word.

It might be a tedious job for handwritten documents, as there may be a large number of suspects, but it can do the job you want.

Thanks.
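
On the question of whether a PDF can hold both the images and the text: a searchable-image PDF keeps each scanned page as an image and places the recognized words over it as invisible text, each positioned where that word was read on the page. That is how a viewer can highlight a hit on the original scan and step between occurrences. The Python sketch below only illustrates the idea. The `OcrWord` record, the `find_hits` and `next_hit` helpers, and the sample words and coordinates are all made up for this example; they are not Acrobat's internal format or API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class OcrWord:
    """One recognized word and where it sits on the scanned page (hypothetical record)."""
    text: str
    page: int
    box: Tuple[int, int, int, int]  # left, top, right, bottom in image pixels


def find_hits(words: List[OcrWord], query: str) -> List[OcrWord]:
    """Return every word containing the query, ignoring case, in document order."""
    needle = query.casefold()
    return [w for w in words if needle in w.text.casefold()]


def next_hit(current: int, hit_count: int) -> int:
    """Index of the next hit, wrapping around to the first after the last."""
    return (current + 1) % hit_count


# Made-up OCR output for two pages of a letter.
SAMPLE_WORDS = [
    OcrWord("Dear", 1, (120, 80, 210, 118)),
    OcrWord("brother", 1, (220, 80, 360, 118)),
    OcrWord("letter", 1, (140, 400, 250, 436)),
    OcrWord("The", 2, (100, 90, 160, 126)),
    OcrWord("Letter", 2, (170, 90, 290, 126)),
    OcrWord("arrived", 2, (300, 90, 430, 126)),
]

hits = find_hits(SAMPLE_WORDS, "letter")
for hit in hits:
    # A viewer would draw a highlight rectangle at hit.box on the image of hit.page.
    print(f"page {hit.page}: highlight {hit.box}")
```

The search runs over the text, but each hit carries its page number and box, so the highlight is drawn on the untouched page image.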

Report · Jan 31, 2023

In theory, your suggestion works; however, due to the book size (1,280 pages), I get thousands of results, making it too cumbersome to go that route. Additionally, it usually does not recognize the full word but only a few letters (presumably because the OCR does not read the red text well, given the small font size and/or slight variations in the red color as the ink has aged since the 1901 printing). This makes me wonder whether anything can be done about OCR correction.

Additionally, do you or does Adobe have professionals that can fix this issue once I complete the scans? I expect the FINAL TOTAL pdf file to be about 1280 pages and about 715 MB (my constraint is less than 2GB with encoding pdf/a, v1.7, Acrobat v8 to upload on archive.org). I plan to divide this scanning into 6 pdfs (about 213 pages and 119 MB each) and then combine them into one. The objective is to have the pdf readable and searchable, including the red text that may be about 1/10th of the total text. Attached is just a sample screenshot.
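
The split described above can be checked with a few lines of Python. This is a rough sketch that assumes the pages are divided as evenly as possible and that file size scales roughly with page count (an assumption; real sizes depend on what is on each page). The helper name `split_pages` is hypothetical.

```python
from typing import List


def split_pages(total_pages: int, parts: int) -> List[int]:
    """Divide pages into parts as evenly as possible; early parts take the remainder."""
    base, extra = divmod(total_pages, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


TOTAL_PAGES = 1280    # expected page count of the final PDF
TOTAL_MB = 715        # expected size of the final PDF
PARTS = 6             # number of separate scanning sessions
LIMIT_MB = 2 * 1024   # the 2 GB upload constraint, in MB

pages_per_part = split_pages(TOTAL_PAGES, PARTS)
mb_per_part = TOTAL_MB / PARTS  # assumes size is proportional to page count

print(pages_per_part)           # [214, 214, 213, 213, 213, 213]
print(round(mb_per_part))       # 119
print(TOTAL_MB < LIMIT_MB)      # True
```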

Report · Feb 01, 2023

Can "Suspect Correction" be used based on text color (red)? I don't see the option for "Suspect Correction." I am using Adobe Acrobat Pro v. 2022.003.20314 (64-bit), installed from CCDesktop on Windows 10.

Report · Feb 01, 2023

Okay, I have found my problem: my imaging camera was not focused well. I had the object too close to the camera, so it was not capturing the red text well. Fixed.

Report · Jan 31, 2023

I have scanned a document that has both black and red text. The black text is searchable, but the red text is not searchable. Is there a way to fix this in the PDF? I tried to attach my file to this; however, it is above the maximum of 47MB allowed. The file is 118MB, and I am willing to send it or make it available to any who can help with this question. Please help.
