Inspiring
January 16, 2017
Answered

I wish to do OCR on a 345pp scanned document. Edit PDF not doing anything.

It's been a while since I've done any OCR in Acrobat, and the (still lamentable) UI has changed a bit since then. I watched an Adobe tutorial on Acrobat that said to use the "Edit PDF" tool, and though it's in a different place in my Acrobat Pro, I found it easily enough.

The problem is when I click on that tool all that happens is a blue box outlining the perimeter of each scanned page appears, unlike in the tutorial where the text areas are surrounded with boxes and you can interact with the text as strings.

This is the document I wish to attempt OCR on. It's not a great scan, with the reverse pages showing through in the image, but the text is important so I want to attempt it.

The Great Chronicle Of Buddhas - download it here  (each chapter is about 100MB DL)

I'd appreciate any help getting up to speed with OCR in Acrobat these days.

Correct answer wideEyedPupil

I did draft a complete answer to this, but it seems an application freeze wiped it out before I hit Add Reply.

Even Adobe telephone support said you can't edit a scanned document. I pointed out that it was possible over ten years ago; he checked his notes, and it still is.

In Tools you need to select the "Enhance Scan" tool. Then select any item in the "Recognise Text" dropdown menu in the second-level toolbar ("This File", for example), then click the "Recognise Text" button on the third-level toolbar that appears.

It did a pretty good job on that document, but the last glyphs of some words were occasionally left off. I imagine about 99% of words are recognised. As this is a 345pp document, and only one of eight such, it would be very burdensome to complete by hand. The facility to upload a Pali dictionary, or an English + Pali one, to handle all the words with diacritics (unusual to English speakers) like “Āloka” might also help. I'm not sure that's possible, even by hacking Acrobat's dictionary files. Will ask separately.

3 replies

Inspiring
February 7, 2017

Thank you, Lovekesh. The document is too long for me to do all these corrections by hand. Maybe when I retire from climate campaigning (like never, it will always be more urgent than the year before).

Karl Heinz  Kremer
Community Expert
January 16, 2017

It does not look like I can download anything without signing up for some service. To be blunt: If you want help with your problem, don't make it hard to actually help you. Most of the people here are doing this in their spare time. If you need help in how you can share a page or two so that we can just download the file, I wrote up some information about how to use Adobe's Document Cloud to share files: Share Documents via Adobe's Document Cloud - KHKonsulting LLC

You may want to take a look at these recent questions, which may answer your question as well:

Can't edit scanned pdf - doesn't work like in the tutorials

Goading Acrobat Pro DC (Mac) into OCR in Edit mode

Inspiring
January 17, 2017

Hi Karl, not sure what you are seeing, but I and other people I know have downloaded from that page without any of the issues you describe. Just worked™. It didn't occur to me to upload the file because it seemed to be readily available to anybody, even people on slow connections in Burma :-).

Thanks for the links. I did get it working, it wasn't perfect, but neither are the source scans.

Lovekesh Garg
Adobe Employee
January 16, 2017

The steps you followed to run OCR are correct.

Can you please share two things so we can better understand the issue?

- The Acrobat version you are using

- Can you please extract a single page from any PDF you are using and share the exact issue you are facing? It will help us to concentrate on the exact issue you have.

You can use https://cloud.acrobat.com/send  to share the file.

Thanks.

Inspiring
January 17, 2017

This question has been answered (by myself, sorry if it took a while for mods to post the answer).

In response to the requests for a small file uploaded to Adobe cloud, here's pp1-4.


You'll see that ~99% of the glyphs on these pages are correctly recognised as text strings, yet a few (usually at the end of longer words) get omitted. It happens on both English and Pali language words, so I don't think it's a dictionary issue, though maybe it is and I'm not thinking it through carefully enough.
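One way to hunt for those dropped end glyphs without proofreading every page might be the rough sketch below. It assumes the recognised text has been exported to plain text and split into words, and that some word list covering both the English and Pali vocabulary is available; neither is an Acrobat feature, and the names `normalise`, `flag_truncated`, and `max_missing` are made up for illustration.

```python
import unicodedata


def normalise(word):
    """Lower-case and NFC-normalise so 'Āloka' compares consistently."""
    return unicodedata.normalize("NFC", word).lower()


def flag_truncated(words, vocabulary, max_missing=1):
    """Return (word, candidates) pairs for words that look cut short.

    A word is flagged when it is not in the vocabulary but is a prefix
    of a vocabulary entry at most `max_missing` characters longer.
    """
    vocab = {normalise(v) for v in vocabulary}
    flagged = []
    for raw in words:
        w = normalise(raw)
        if not w or w in vocab:
            continue  # empty or already a known word
        candidates = sorted(
            v for v in vocab
            if v.startswith(w) and 0 < len(v) - len(w) <= max_missing
        )
        if candidates:
            flagged.append((raw, candidates))
    return flagged


if __name__ == "__main__":
    # Hypothetical sample: two words have lost their final glyph.
    ocr_words = ["Great", "Chronicl", "Ālok", "Buddhas"]
    word_list = ["great", "chronicle", "āloka", "buddhas"]
    for word, guesses in flag_truncated(ocr_words, word_list):
        print(word, "->", ", ".join(guesses))
```

This only lists suspects for manual checking. It does not repair the PDF's text layer, and a short word that happens to be a prefix of a longer one will produce false positives.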