OCR from the same areas on several pages

Report · Oct 03, 2020

I am hoping to solve the following problem using only the Action Wizard. I am pretty new to Acrobat, so I am sure there are many options I have not yet considered for my task.

I am working with a large series of scanned typewritten forms. These forms contain information that should be read semi-automatically. The same information is in the same area on every page, and every page has 6 "areas of interest" that contain it. The rest of the page differs from page to page, so OCRing entire pages would create varying amounts of noise. That is why I want to OCR only those areas of interest and get the output as plain text. (The data is ultimately meant to end up in Excel, so I will try to get the output there as directly as possible via VBA, although reading from exported files in VBA is possible, too.)

I was able to create an action that lets the user crop every page down to one of the aforementioned areas and then runs OCR and outputs the text automatically. However, the user has to wait while a single area is processed and then repeat the whole process for every other area of the form.

Ideally, I would want the user to select all fields on one page, have this pattern applied to every page, and then have the OCR data exported separately for each of those fields.

Alternatively, getting the coordinates of a user's selection would also work, as I could use them in VBA to automate the cropping process.
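One possible way to get at such coordinates is to read back the crop box after the user has cropped a page by hand. The following is only a minimal sketch: the helper name `readCropBox` is made up for illustration, and `doc` is assumed to be an Acrobat `Doc` object (for example `this` in the JavaScript console). It relies on `getPageBox`, which returns the box as `[left, top, right, bottom]` in points.

```javascript
// Hypothetical helper: report the crop box of one page.
// Assumption: `doc` is an Acrobat Doc object (e.g. `this` in the console).
function readCropBox(doc, pageIndex) {
  // getPageBox returns [left, top, right, bottom] in rotated user space.
  var r = doc.getPageBox("Crop", pageIndex);
  return {
    rect: [r[0], r[1], r[2], r[3]],
    width: r[2] - r[0],   // right minus left
    height: r[1] - r[3]   // top minus bottom
  };
}

// Possible use inside Acrobat, after cropping page 1 by hand:
//   console.println(readCropBox(this, 0).rect.join(", "));
```

The printed numbers could then be noted down once per area and reused for every page, whether from JavaScript or from VBA.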

For these two strategies, I haven't found appropriate commands in Acrobat yet.

Does anyone have an idea about what I can do?

Report · Oct 03, 2020

You can't import or export OCR data from one PDF file to another, if that's what you're planning to do. The only way to do that is to replace the non-OCRed page with one that has undergone it.

Report · Oct 04, 2020

I don't want to transfer OCR data from one page to another. I only want to reuse the information about which areas need to be OCR'd, so that OCR runs on just those areas and only their text is exported.

Report · Oct 04, 2020

The only way to do that is how you already did it, by cropping the page.

Report · Oct 04, 2020

You can create 5 copies of every page. Then crop the pages at the different coordinates. After this OCR the whole document.

Or redact unwanted areas.
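The page-copy idea could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested Acrobat workflow: it assumes the pages have already been duplicated so that the copies of each form page sit next to each other, one copy per area (so 6 pages per form page for 6 areas), and that each area is given as `[left, top, right, bottom]` in the same order `getPageBox` returns. The function name `cropCopies` is hypothetical; `setPageBoxes` and `numPages` are the Acrobat `Doc` members it relies on.

```javascript
// Hypothetical helper: crop the n-th copy of every form page to the n-th area.
// Assumption: the document holds areas.length consecutive copies of each page.
function cropCopies(doc, areas) {
  var k = areas.length;
  if (k === 0 || doc.numPages % k !== 0) {
    throw new Error("page count must be a multiple of the number of areas");
  }
  for (var p = 0; p < doc.numPages; p++) {
    // Page p is copy number (p % k) of its form page, so it gets area (p % k).
    doc.setPageBoxes("Crop", p, p, areas[p % k]);
  }
  return doc.numPages / k; // number of original form pages
}
```

After this, running OCR on the whole document would only see the cropped regions, and the text of page `p` belongs to form page `Math.floor(p / k)`, area `p % k`.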

Report · Oct 04, 2020

And how would you combine those pages back to be a single page?

Report · Oct 04, 2020

"And how would you combine those pages back to be a single page?"

Why a single page?

Report · Oct 04, 2020

I thought that was the intention... If they're just interested in extracting the text, then your suggestion is fine. Another option is to OCR the entire page and then redact the areas you don't want.

Report · Oct 04, 2020

I am indeed interested in only the text.

OCRing the entire page would generate too much noise. Efficiency is essential for the resulting app/workflow that I'm working on.

Report · Oct 04, 2020

This does sound interesting, although to do this with several pages, I think I would need to create 5 copies of the whole PDF, let the user crop a different field in each document, and export each one with a special filename indicating its content. This might be possible with the help of JavaScript.
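The naming part of that idea could look something like the sketch below. It only builds the file names and does not save anything. The function name `planAreaExports`, the example path, and the area names are all hypothetical.

```javascript
// Hypothetical helper: derive one output path per area from the source path.
function planAreaExports(basePath, areaNames) {
  // Drop a trailing ".pdf" (any letter case) so the area name can be appended.
  var stem = basePath.replace(/\.pdf$/i, "");
  return areaNames.map(function (name) {
    // Keep letters, digits, underscore and hyphen; replace anything else.
    var safe = name.replace(/[^A-Za-z0-9_-]+/g, "_");
    return { area: name, path: stem + "_" + safe + ".pdf" };
  });
}

// Example with made-up names:
//   planAreaExports("/c/forms/batch1.pdf", ["Name", "Date of birth"])
```

A VBA routine on the Excel side could then pick the target column from the suffix of each exported file name.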

Adobe Community
