Improving Boundaries, Editing on Photo or in Acrobat?

Mar 30, 2017

I have recently been trying to OCR some Japanese PDFs directly from high-quality (600 DPI) scans that have been tilt-corrected. Because I had to redo the OCR, I've seen that the results differ from run to run, and I'm wondering whether there are ways to keep the OCR from bleeding across lines as often.

I can think of two approaches that might help:

1. Edit the image directly. Stripping out all color so that only the characters remain, and drawing actual boxes around the text blocks, might both help define the areas of the page. The boxes would take some time, but removing the color can be done quickly, though I don't know how much this would ultimately help.

2. Define the bounding boxes within Acrobat itself. I haven't seen any good tutorials on this, so I may be missing something obvious. So far I have seen entire vertical lines of text defined as a single bounding box, larger rectangles defined as one, and individual characters each becoming their own bounding box (which shifts the orientation to horizontal).
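
For the box-drawing idea in option 1, one common way to find where text columns start and end is a vertical projection profile: count the dark pixels in each pixel column and treat sufficiently wide blank runs as separators. Below is a minimal sketch in plain Python, assuming the page has already been reduced to a 0/1 bitmap. The names `find_text_columns` and `min_gap` are hypothetical, and none of this is an Acrobat feature; it only illustrates how the box edges might be located before editing the image.

```python
def find_text_columns(bitmap, min_gap=2):
    """Return (start, end) x-ranges of pixel columns that contain ink.

    bitmap:  list of rows, each a list of 0 (background) or 1 (ink).
    min_gap: hypothetical tuning knob; blank runs narrower than this
             are treated as part of the surrounding text column.
    """
    if not bitmap:
        return []
    width = len(bitmap[0])

    # Vertical projection: number of ink pixels in each pixel column.
    profile = [sum(row[x] for row in bitmap) for x in range(width)]

    columns = []
    start = None  # x where the current text column began
    gap = 0       # length of the current blank run inside a column
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The column ended at the last x that had ink.
                columns.append((start, x - gap))
                start = None
                gap = 0

    if start is not None:
        # Close a column that runs to (or near) the right edge.
        columns.append((start, width - 1 - gap))
    return columns
```

For vertical Japanese text, each returned range would roughly correspond to one line of text; for horizontal text the same idea applies to rows instead of columns. A real 600 DPI scan would need a much larger `min_gap` than the default used here.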

Any assistance on this would be appreciated.

Thanks!

Apr 07, 2017

Please share the build information, the platform on which you are running Acrobat, and a sample file so that we can replicate the behavior you are seeing and understand the issue better.

Also, please mention the steps and settings you used to perform the OCR operation.

Note: Editing the image to leave only the textual content can significantly change the OCR behavior. Please try it and let us know the results.
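
As a rough illustration of what "leaving only the textual content" can mean in practice, the sketch below converts grayscale pixel values to pure black and white using Otsu's method, a standard way of choosing a global threshold. It is written in plain Python on nested lists so that it stays self-contained; the function names are illustrative, the input is assumed to be 8-bit grayscale (0 = black, 255 = white), and an image editor or imaging library would normally do this step for you.

```python
def otsu_threshold(gray):
    """Choose a global threshold for 8-bit grayscale values (Otsu's method)."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1

    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0    # number of pixels with value <= t
    sum_low = 0  # sum of those pixel values
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue
        w_high = total - w_low
        if w_high == 0:
            break
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        # Between-class variance, up to a constant factor.
        var_between = w_low * w_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, threshold=None):
    """Return a 0/1 bitmap where 1 marks dark (ink) pixels."""
    if threshold is None:
        threshold = otsu_threshold(gray)
    return [[1 if v <= threshold else 0 for v in row] for row in gray]
```

A global threshold like this assumes fairly even lighting across the page. Scans with shadows or colored backgrounds may need a locally adaptive threshold instead.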

Thanks

Rishabh Sharma