Participating Frequently
October 3, 2020
Question

OCR from the same areas on several pages


I am hoping for a solution to the following problem that uses only the Action Wizard. I am fairly new to Acrobat, so I am sure there are many options I have not yet considered for my task.

 

I am working with a large series of scanned, typewritten forms. These forms contain information that should be read semi-automatically. The same information sits in the same area on every page, and every page has 6 "areas of interest" that contain it. The rest of the page differs from page to page, so OCRing entire pages would produce varying amounts of noise depending on the page. That is why I want to OCR only those areas of interest and get the output as plain text. (The data is ultimately destined for Excel, so I will try to get the output there as directly as possible via VBA, although reading from exported files in VBA is possible, too.)


I was able to create an action that lets the user crop every page down to one of the aforementioned areas and then runs OCR and outputs the text automatically. However, this forces the user to wait while a single area is processed and then repeat the whole procedure once for every remaining area of the form.

 

Ideally, the user would select all fields on one page, that pattern would be applied to every page, and the OCR data would then be exported for each field separately.

Alternatively, getting coordinate data from a user's selection would also work as I could use them in VBA to automate the cropping process.
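For the coordinate route, the conversion step could be sketched in Python roughly as follows. This is only a sketch under assumptions that do not come from Acrobat itself: the selection is assumed to arrive as a pixel rectangle measured from the top-left corner of the scan at a known DPI, and it is converted to PDF points (1/72 inch) measured from the bottom-left corner, as in PDF's default user space. The names `Box` and `selection_to_crop_box` are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A rectangle in PDF points, origin at the bottom-left of the page."""
    left: float
    bottom: float
    right: float
    top: float


def selection_to_crop_box(x: float, y: float, width: float, height: float,
                          page_height_pt: float, dpi: float = 300.0) -> Box:
    """Convert a top-left-origin pixel rectangle to a bottom-left-origin box in points.

    Assumption: (x, y) is the top-left corner of the selection in pixels,
    and the scan resolution `dpi` is known.
    """
    if dpi <= 0 or width <= 0 or height <= 0:
        raise ValueError("dpi, width and height must be positive")

    def to_pt(px: float) -> float:
        return px * 72.0 / dpi  # 72 points per inch

    left = to_pt(x)
    right = to_pt(x + width)
    # Flip the vertical axis: pixel y grows downward, PDF y grows upward.
    top = page_height_pt - to_pt(y)
    bottom = page_height_pt - to_pt(y + height)
    return Box(left, bottom, right, top)
```

The resulting four numbers could then be handed to whatever does the cropping, for example the VBA automation mentioned above.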

I haven't yet found appropriate commands in Acrobat for either of these two strategies.

 

Does anyone have an idea about what I can do?


Bernd Alheit
Community Expert
October 4, 2020

You can create 5 copies of every page, so that each page exists 6 times, once per area. Then crop each copy to a different area's coordinates. After that, OCR the whole document.
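The bookkeeping behind this suggestion can be sketched in Python as follows. It is an illustration only, not an Acrobat API: it assumes the copies of each page are kept next to each other in the expanded document and that the areas are given as `(left, bottom, right, top)` boxes. `CropJob` and `plan_crops` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple

# (left, bottom, right, top); units are whatever the cropping tool expects.
Box = Tuple[float, float, float, float]


class CropJob(NamedTuple):
    output_page: int  # 0-based page index in the expanded document
    source_page: int  # 0-based page index in the original scan
    area: int         # 0-based index into the list of areas
    box: Box          # crop rectangle to apply to output_page


def plan_crops(page_count: int, areas: List[Box]) -> List[CropJob]:
    """List one crop job per (page, area) pair, grouped by source page."""
    if page_count < 0 or not areas:
        raise ValueError("need a non-negative page count and at least one area")
    jobs: List[CropJob] = []
    for source_page in range(page_count):
        for area, box in enumerate(areas):
            # len(jobs) is the position this copy takes in the expanded document.
            jobs.append(CropJob(len(jobs), source_page, area, box))
    return jobs
```

With 6 areas, output page `k` then corresponds to source page `k // 6` and area `k % 6`, which makes it possible to tell later which piece of text came from where.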

Or redact unwanted areas.

try67
Community Expert
October 4, 2020

And how would you combine those pages back to be a single page?

Smogshaik
Author
Participating Frequently
October 4, 2020

I thought that was the intention... If they're just interested in extracting the text, then your suggestion is fine. Another option is to OCR the entire page and then redact the areas you don't want.


I am indeed interested in only the text. 

 

OCRing the entire page would generate too much noise. Efficiency is essential for the resulting app/workflow that I'm working on.
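Since only the text matters, the cropped copies never need to be merged back into single pages; the extracted text just has to be regrouped. A rough Python sketch of that step, assuming the per-page text has already been exported in document order and that the copies of each form are adjacent (both are assumptions, as are the names `regroup` and `rows_to_csv`):

```python
import csv
import io
from typing import List


def regroup(page_texts: List[str], areas_per_page: int) -> List[List[str]]:
    """Turn a flat list of per-copy texts into one row per original form."""
    if areas_per_page <= 0 or len(page_texts) % areas_per_page != 0:
        raise ValueError("page count must be a multiple of areas_per_page")
    rows = []
    for start in range(0, len(page_texts), areas_per_page):
        chunk = page_texts[start:start + areas_per_page]
        # Collapse line breaks and repeated spaces left over from OCR.
        rows.append([" ".join(text.split()) for text in chunk])
    return rows


def rows_to_csv(rows: List[List[str]]) -> str:
    """Serialize the rows as CSV text that Excel can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Each row then holds the 6 values of one form, in the order in which the areas were cropped.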

try67
Community Expert
October 3, 2020

You can't import or export OCR data from one PDF file to another, if that's what you're planning to do. The only way to do that is to replace the non-OCRed page with one that has already been OCRed.

Smogshaik
Author
Participating Frequently
October 4, 2020

I don't mean transferring OCR data from one page to another. I only want to define which areas need to be OCR'd, so that OCR runs just there and the text is exported from just those areas.

try67
Community Expert
October 4, 2020

The only way to do that is how you already did it, by cropping the page.