Requesting help with OCR and file naming issues

Report · Jul 01, 2021

Hello,

My workplace uses Adobe Acrobat Pro DC to process some really basic data entry stuff. My job is essentially to analyse a scanned PDF copy of a printed form plus extra attachments. The first 5 pages are generally in the same format, but with about 100 (or so) variations on the exact layout depending on user-based options. It's generally a receipt (Page 1), followed by one or two checklists (Pages 2-3, depending), followed by a section for user information (Page 4), followed by a proof of ID check (Page 5). It's intended to be filled out on a PC but we currently accept hand-written answers in the user info section.

The rest of the pages in the document are extremely haphazard - we request users send us a list of documents based on their earlier options, and some documents in those lists can be submitted in a myriad of ways that we'll accept. There are set requirements for some documents, but we tend to be more lenient on what we will and won't accept for submission types. Unfortunately, for infosec reasons, I can't share much more detail than this.

The information in the document goes into various places in my company's CRM implementation, with pretty set options. Realistically, if the information were provided in a computer-friendly way, my entire role could be automated (but this is not what I'm after - I like this job). What I would like is to know whether I can do the following, and how:

I know about, and have tried, an OCR solution to automatically extract information straight from the document. OCR promptly threw a fit and cried when I asked it to scan the user information page, which is definitely the best defined page of any given application. Is there some way I can refine the OCR settings such that it will be able to tell at least that there are form boxes the information is meant to be in? It kept finding the boxes of the form to be words to be turned into text, which is definitely not what I wanted. I can create templates of all of the forms - but even pulling up an entirely blank, template form sometimes causes the "Create Form" function to miss radio buttons. And I'm not sure how to associate a template form with OCR so that it can use the template to know where it should and shouldn't be searching for info.

The next step after that is to get that information into Excel, or some other format where we can easily read and input the data to CRM, but that's definitely a secondary issue to the fact that OCR flat-out refuses to read this document properly.

The second question is, is there a way to automatically rename Adobe files based on their contents as they are extracted from a .zip file? I did some brief research into this which led me into the nightmare that is indexing, and decided that that was too far for me. This doesn't need to be anything fancy, just LASTNAME Firstname and then the form options they selected in the first 5 pages.

Thanks for any help!

Report · Jul 01, 2021

> Is there some way I can refine the OCR settings such that it will be able to tell at least that there are form boxes the information is meant to be in?

No. OCR (in Acrobat, at least) is an all-or-nothing process. It can scan an entire page only, not parts of it. And creating form fields over the text will not do you any good. The fields need to exist before the file is filled in, and they need to be used for that purpose. If you do that then OCR is not needed at all. You could just extract the information directly.

This is really the best, and most efficient option of doing it.

> Is there a way to automatically rename Adobe files based on their contents as they are extracted from a .zip file?

That depends. Where is the information to rename them based on coming from? If it's in form fields then it's much easier. If it's static text in specific locations on a page then it's possible, but more complicated. If it's hand-written text or text that has been OCRed it's practically impossible.

Also, it can't be done as the files are extracted from a zip file (at least not if you use Acrobat). You would have to open them and then run a script on them, either via a menu item or by using the Action Wizard.
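For what it's worth, if the extract-and-rename step were done outside Acrobat, it could look roughly like the Python sketch below. This is only an illustration under assumptions: the field names (`last_name`, `first_name`), the `option_keys` list, and the `read_fields` callback are all hypothetical, and the sketch says nothing about how the values are obtained from each PDF. That part is only easy when the data sits in real form fields, as noted above.

```python
import re
import zipfile
from pathlib import Path
from typing import Callable, Mapping, Sequence


def build_name(fields: Mapping[str, str], option_keys: Sequence[str]) -> str:
    """Return 'LASTNAME Firstname <options...>' with unsafe characters removed."""
    # 'last_name' and 'first_name' are hypothetical field names.
    last = fields.get("last_name", "").strip().upper()
    first = fields.get("first_name", "").strip().capitalize()
    options = [fields[k].strip() for k in option_keys if fields.get(k, "").strip()]
    raw = " ".join(part for part in [last, first, *options] if part)
    # Drop characters that are not allowed in Windows file names.
    return re.sub(r'[<>:"/\\|?*]', "", raw) or "UNNAMED"


def extract_and_rename(
    zip_path: Path,
    out_dir: Path,
    read_fields: Callable[[Path], Mapping[str, str]],
    option_keys: Sequence[str],
) -> list[Path]:
    """Extract every PDF in the archive and rename it from its field values.

    read_fields is a caller-supplied (hypothetical) function that returns the
    field values for one extracted PDF.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    renamed = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".pdf"):
                continue
            # Keep only the base name so archive paths cannot escape out_dir.
            extracted = out / Path(info.filename).name
            extracted.write_bytes(zf.read(info))
            stem = build_name(read_fields(extracted), option_keys)
            target = out / f"{stem}.pdf"
            counter = 1
            # Avoid overwriting an earlier file that produced the same name.
            while target.exists() and target != extracted:
                counter += 1
                target = out / f"{stem} ({counter}).pdf"
            if target != extracted:
                extracted.rename(target)
            renamed.append(target)
    return renamed
```

Two applicants whose fields produce the same name end up as `NAME.pdf` and `NAME (2).pdf` rather than overwriting each other. For scanned or hand-written pages, `read_fields` would have to rely on OCR output, which runs into the same reliability problem described above.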

Report · Jul 01, 2021

Unfortunately, as far as I know, the current system of printing a PDF, then scanning it, then processing it is the only method we have available to us for contract reasons. While I can suggest we change that part of the process, it's not something I have any direct or even really indirect control over. So while I agree that the pre-filled Forms method would be best, I don't really see that being used going forward.

As for the whole partial OCR thing, I fortunately didn't intend that anyway. I know OCR is an all-or-nothing process, I was more wondering if I could...target it, in a way. Give it guidelines for where text *should be*, so that it stops seeing side by side boxes as double lls. It sounds a bit like what I want is too far out of reach for what Adobe is capable of, given the OCR results I currently have.

Realistically, if I was ever going to upgrade our system, I would likely change it so that we primarily accepted form-filled PDFs...but that comes back to contractual issues we can't get past for the foreseeable future. Such is life in business sometimes. Cheers anyway.

Report · Jul 01, 2021

Yeah, I don't think it's possible.
