Sometimes when searching for documents online I come across a scanned document, perhaps a historic or handwritten document hundreds of years old, where the search has found the text I was looking for and the page it is on. There is often a menu showing how many times the text appears in the document and letting me move quickly backwards and forwards between the occurrences.
It seems like the document contains the images of the original pages, but also the OCR'd text, and somehow each word of OCR'd text knows which part of the original image it came from, because when you search for a word it finds it, and highlights it in the original scanned document.
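The behaviour described here can be sketched in a few lines. This is only an illustration of the idea, not how any particular viewer implements it: the `Word` class, the `find_hits` function, and the sample coordinates are all hypothetical. The assumption is that the OCR step records a bounding box for each recognized word, so a search hit can be mapped back to a rectangle on the page image.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR'd word plus where it sits on the scanned page (hypothetical layout)."""
    text: str    # recognized text
    page: int    # 1-based page number
    box: tuple   # (left, top, right, bottom) in image pixels


def find_hits(words, query):
    """Return every Word whose text equals the query, ignoring case."""
    q = query.lower()
    return [w for w in words if w.text.lower() == q]


# Made-up OCR output for two pages of a letter.
words = [
    Word("Dear", 1, (40, 30, 95, 52)),
    Word("Rowan", 1, (102, 30, 170, 52)),
    Word("harvest", 1, (40, 60, 120, 82)),
    Word("Harvest", 2, (200, 410, 282, 433)),
]

hits = find_hits(words, "harvest")
for n, w in enumerate(hits, start=1):
    # A viewer would draw a highlight rectangle at w.box on page w.page.
    print(f"hit {n} of {len(hits)}: page {w.page}, highlight box {w.box}")
```

The list of hits is what would drive the "how many times this text appears" counter and the back/forward navigation, and each box is the region to highlight on the original image.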
I now have photocopies of a lot of historic documents (letters written by my great-great-grandfather) that I would like to do this with. I would like the text to be searchable, with the original page images kept and the found words highlighted in those images.
Can this be done? How?
What software do I need to do this?
Does PDF have the ability to store the images and the text and the relationship between the two to enable this? If not, what format does allow this?
Many thanks - Rowan
It's quite hard to recognize words accurately in this kind of document. Running OCR on handwritten documents is a tedious task.
But Acrobat provides a feature (Suspect Correction) for this kind of thing, which lets you correct any text that was recognized incorrectly.
It might be a tedious job for handwritten documents, as there may be a large number of suspects, but it can do the job you want.
Thanks.
In theory, your suggestion works; however, because of the book's size (1,280 pages), I get thousands of results, which makes that route too cumbersome. Additionally, it usually does not recognize the full word but only a few letters (presumably because the OCR does not read the red text well, given the small font size and/or the ever so slight variations in the red ink, which has aged since the 1901 printing). This makes me wonder whether anything can be done to correct the OCR.
Additionally, do you or does Adobe have professionals who can fix this issue once I complete the scans? I expect the final combined PDF file to be about 1,280 pages and about 715 MB (my constraint is less than 2 GB with encoding PDF/A, v1.7, Acrobat v8 to upload on archive.org). I plan to divide the scanning into 6 PDFs (about 213 pages and 119 MB each) and then combine them into one. The objective is to have the PDF readable and searchable, including the red text, which may be about 1/10th of the total text. Attached is just a sample screenshot.
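The split figures above can be checked with a quick calculation. This is a rough sketch under two assumptions that are not from the post: file size scales evenly with page count, and 2 GB is taken as 2048 MB. The `split_pages` helper is a hypothetical name used for illustration.

```python
def split_pages(total_pages, parts):
    """Divide pages as evenly as possible; earlier parts take any remainder."""
    base, extra = divmod(total_pages, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


TOTAL_PAGES = 1280
TOTAL_MB = 715
PARTS = 6
LIMIT_MB = 2 * 1024  # assumed meaning of the 2 GB upload limit

page_counts = split_pages(TOTAL_PAGES, PARTS)
mb_per_part = TOTAL_MB / PARTS  # assumes size is proportional to page count

print(page_counts)            # pages in each of the 6 parts
print(round(mb_per_part, 1))  # approximate size of each part in MB
print(TOTAL_MB < LIMIT_MB)    # whether the combined file fits the limit
```

Because 1,280 does not divide evenly by 6, two of the parts end up one page longer than the "about 213" estimate.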
Can "Suspect Correction" be used based on text color (red)? I don't see the option for "Suspect Correction." I am using Adobe Acrobat Pro v. 2022.003.20314 (64-bit), installed from CCDesktop on Windows 10.
Okay, I have found my problem: my imaging camera was not focused well. I had the object too close to the camera, so it was not capturing the red text well. Fixed.
I have scanned a document that has both black and red text. The black text is searchable, but the red text is not. Is there a way to fix this in the PDF? I tried to attach my file to this post; however, it is above the 47 MB maximum allowed. The file is 118 MB, and I am willing to send it or make it available to anyone who can help with this question. Please help.