Let's say I have a huge batch of 50K+ PDF files that need to be exported to DOCX format, with OCR applied to the PDFs. While I can easily export them using the SDK, I would like to know if there is a reliable way to automatically assess the accuracy of the export operation, because it's not feasible to manually check 50K outputs. Any suggestions?
Thanks everyone!
~Veslav
How would that work, exactly? How can a tool assess its own work? If it didn't get it right the first time, why should it the second time around?
The only way I can think of doing it is by using a dictionary to scan the words in the file. If many of them are nonsense (i.e. don't appear in the dictionary), then it might be safe to assume the OCR didn't work very well...
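Something along these lines, for example (a rough Python sketch; it assumes the DOCX text has already been extracted to a string, e.g. with python-docx, and the word-list path and 10% threshold are assumptions you'd tune):

```python
# Rough sketch of the dictionary idea. Assumes a word list such as
# /usr/share/dict/words is available; the threshold is a placeholder.
import re

def unknown_word_ratio(text, dictionary):
    """Fraction of words in `text` that do not appear in `dictionary`."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return 1.0  # no recognisable words at all is itself suspicious
    unknown = sum(1 for w in words if w not in dictionary)
    return unknown / len(words)

with open("/usr/share/dict/words") as f:
    dictionary = {line.strip().lower() for line in f}

sample = "Thc quick brown fox jumpcd ovcr the lazy dog"
if unknown_word_ratio(sample, dictionary) > 0.10:
    print("flag for manual review")
```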
I didn't mean a second run of the export process. Rather, as I answered below, the goal is to narrow down the number of output files that need manual review.
Your suggestion of a dictionary scan might work for relatively simple text documents, but it may not be enough for files with complex structure, e.g. tables, diagrams, etc. I was thinking of something like converting the corresponding input and output files to images and then comparing them against some acceptable degree of similarity: if the difference exceeds, say, 5%, the output file is marked as dubious and set aside for manual review.
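To sketch that idea (assuming the DOCX output has first been rendered back to PDF, e.g. with LibreOffice headless, so both sides can be rasterised; the package names are real, but the file names and 5% threshold are placeholders):

```python
# Rasterise both files and compare page images with structural similarity.
# Requires the pdf2image and scikit-image packages (plus poppler for pdf2image).
import numpy as np
from pdf2image import convert_from_path
from skimage.metrics import structural_similarity

def worst_page_similarity(pdf_a, pdf_b, dpi=150):
    """Lowest per-page structural similarity (0..1) between two PDFs."""
    pages_a = convert_from_path(pdf_a, dpi=dpi)
    pages_b = convert_from_path(pdf_b, dpi=dpi)
    if len(pages_a) != len(pages_b):
        return 0.0  # page-count mismatch: definitely flag it
    scores = []
    for img_a, img_b in zip(pages_a, pages_b):
        img_b = img_b.resize(img_a.size)    # normalise size
        a = np.asarray(img_a.convert("L"))  # greyscale arrays
        b = np.asarray(img_b.convert("L"))
        scores.append(structural_similarity(a, b))
    return min(scores) if scores else 0.0

# Flag when the worst page differs by more than ~5%.
if worst_page_similarity("input.pdf", "output_rendered.pdf") < 0.95:
    print("flag for manual review")
```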
Sounds like you need an AI. I'd be happy to set this up for you if you've got a few million to budget towards the project.
Manually checking OCR is part of the cost of very accurate OCR, and that cost is high. Only humans can do it well (and only some humans).
Exactly. But the goal is to somehow narrow down the number of questionable output files for manual review. It's not feasible to check all 50K output files. If an automatic post-export review highlighted only the dubious files that need to be manually reviewed and corrected, it would significantly reduce the human work.
I think you don't realise how ambitious your wishes are. But here are some thoughts:
- running a spell checker may seem like a good start, but OCR engines already use a dictionary to "correct" suspect text, so you won't catch errors where the wrong (but valid) word was chosen.
- rasterising and comparing seems attractive, but you will not get an exact match on fonts (the original font wasn't even known), and printing and scanning imperfections are going to mess up comparisons.
That said, here are some areas to look at.
1. Try all available OCR apps. You certainly won't want to use Acrobat for 50K files! Something industrial-strength is what's needed, not an interactive tool with very basic automation. Look at
(a) accuracy for YOUR particular set of files
(b) whether accuracy can be increased through tuning
(c) speed and convenience
(d) whether the OCR system reports a confidence level for its results (see the sketch after this list)
2. If you're starting with paper → scan → PDF and you want Word, then involving PDF just adds complication and limits your choices. Look at OCR straight to Word as well.
3. If you stay with PDF, be prepared to separate PDF-to-Word conversion from PDF OCR, probably with entirely different products. Still not Acrobat, for sure! Maybe AEM (Adobe Experience Manager).
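On point 1(d): many engines expose per-word confidence scores. For example, with the open-source Tesseract engine via pytesseract (a sketch only; the 60-point threshold is an assumption to tune against your files):

```python
# Sketch for point 1(d): read Tesseract's per-word confidence scores via
# pytesseract and flag scanned pages whose average confidence is low.
# Requires the pytesseract and Pillow packages plus a Tesseract install.
import pytesseract
from PIL import Image

def mean_confidence(image_path):
    """Average Tesseract confidence (0-100) over recognised words."""
    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    # Tesseract reports -1 for non-word boxes; keep real word scores only.
    scores = [float(c) for c in data["conf"] if float(c) >= 0]
    return sum(scores) / len(scores) if scores else 0.0

if mean_confidence("page_001.png") < 60:
    print("flag page for manual review")
```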