I am hoping to solve the following problem using only the Action Wizard. I am fairly new to Acrobat, so there are probably many options I have not yet considered.
I am working with a large series of scanned, typewritten forms. These forms contain information that should be read semi-automatically. The same information is in the same area on every page, and every page has 6 "areas of interest" that contain it. The rest of the page differs from page to page, so OCRing entire pages would produce varying amounts of noise depending on the page. That is why I want to OCR only those areas of interest and get the output as plain text. (The data is ultimately destined for Excel, so I will try to get the output there as directly as possible via VBA, although reading from exported files in VBA is possible, too.)
I was able to create an action that lets the user crop every page down to one of the aforementioned areas, then runs OCR and outputs the text automatically. The drawback is that the user has to wait while a single area is processed and then repeat the whole procedure for each remaining area of the form.
Ideally, the user would select all fields on one page, that pattern would be applied to every page, and the OCR text would then be exported for each field separately.
Alternatively, getting coordinate data from a user's selection would also work as I could use them in VBA to automate the cropping process.
For these two strategies, I haven't found appropriate commands in Acrobat yet.
Does anyone have an idea about what I can do?
You can't import or export OCR data from one PDF file to another, if that's what you're planning to do. The only way to do that is to replace the non-OCRed page with one that has undergone it.
I don't want to transfer OCR data from one page to another. I only want to define which areas need to be OCRed, run OCR on just those areas, and export the text from them.
The only way to do that is how you already did it, by cropping the page.
You can create 5 copies of every page, then crop each copy to a different area. After that, OCR the whole document.
Or redact unwanted areas.
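A rough sketch of the cropping step in Acrobat JavaScript, in case it helps. It assumes the copies already exist and sit directly after their original page, so the page order is form 1 (area 1 to area N), form 2 (area 1 to area N), and so on. The rectangles in `AREAS` are made-up placeholders, and `cropCopiesToAreas` is just a name I picked. The only Acrobat call used is `setPageBoxes`; check the order of the four rectangle values against your own pages before relying on it.

```javascript
// Hypothetical crop rectangles, one per area of interest, in PDF points.
// Placeholder numbers only: measure the real areas on your own forms, and
// make sure the order of the four values matches what setPageBoxes expects.
var AREAS = [
  [50, 750, 300, 700],
  [320, 750, 560, 700],
  [50, 680, 300, 630],
  [320, 680, 560, 630],
  [50, 610, 300, 560],
  [320, 610, 560, 560]
];

// Crop every page of `doc` to one of the areas, cycling through the list.
// Assumes the page order is: form 1 copy 1..N, form 2 copy 1..N, and so on.
// Returns the number of forms that were processed.
function cropCopiesToAreas(doc, areas) {
  if (doc.numPages % areas.length !== 0) {
    throw new Error("Page count is not a multiple of the number of areas");
  }
  for (var p = 0; p < doc.numPages; p++) {
    // Set the crop box of page p only (start page and end page are the same).
    doc.setPageBoxes("Crop", p, p, areas[p % areas.length]);
  }
  return doc.numPages / areas.length;
}
```

In an "Execute JavaScript" step of an action, or in the console, this would be called as `cropCopiesToAreas(this, AREAS);` before the OCR step.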
And how would you combine those pages back to be a single page?
"And how would you combine those pages back to be a single page?"
Why a single page?
I thought that was the intention... If they're just interested in extracting the text, then your suggestion is fine. Another option is to OCR the entire page and then redact the areas you don't want.
I am indeed interested in only the text.
OCRing the entire page would generate too much noise. Efficiency is essential for the resulting app/workflow that I'm working on.
This does sound interesting, although to do this with several pages, I think I would need to create 5 copies of the whole PDF, let the user crop a different field in each copy, and export each one under a filename that indicates its content. This might be possible with the help of JavaScript.
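Thinking it through, labelling the output per field might look roughly like the sketch below. It assumes the pages have already been cropped and OCRed, that each page's text layer only covers its cropped area (as in my current action), and that the pages cycle through the fields in a fixed order. The field names and the function names are placeholders of my own. The Acrobat calls used are `getPageNumWords` and `getPageNthWord`. How the resulting text gets out of Acrobat and into Excel is left open here.

```javascript
// Hypothetical labels, one per area of interest, in the order the pages cycle.
var FIELD_NAMES = ["field1", "field2", "field3", "field4", "field5", "field6"];

// Collect the words Acrobat reports for one page and join them with spaces.
// Requires a text layer on the page, for example after OCR.
function pageText(doc, pageIndex) {
  var words = [];
  var count = doc.getPageNumWords(pageIndex);
  for (var w = 0; w < count; w++) {
    // bStrip = false keeps punctuation; surrounding whitespace is trimmed here.
    var word = String(doc.getPageNthWord(pageIndex, w, false)).replace(/^\s+|\s+$/g, "");
    if (word !== "") {
      words.push(word);
    }
  }
  return words.join(" ");
}

// Build one record per page: which form it belongs to, which field, and the text.
function collectFields(doc, fieldNames) {
  var rows = [];
  for (var p = 0; p < doc.numPages; p++) {
    rows.push({
      form: Math.floor(p / fieldNames.length) + 1,
      field: fieldNames[p % fieldNames.length],
      text: pageText(doc, p)
    });
  }
  return rows;
}

// Turn the records into tab-delimited lines that VBA could split and paste.
function toTabDelimited(rows) {
  var lines = [];
  for (var i = 0; i < rows.length; i++) {
    var clean = rows[i].text.replace(/[\t\r\n]+/g, " ");
    lines.push([rows[i].form, rows[i].field, clean].join("\t"));
  }
  return lines.join("\n");
}
```

Keeping the form number and field name next to each value would make the filename trick unnecessary, since a single export could then hold everything.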