I have hundreds of PDFs and I want to OCR all of them. Some of them have already been OCR'd, but I can't figure out which ones. I want them all OCR'd.
I tried selecting all of them and running the Recognize Text tool on the whole batch at once, but I got an error for about half of them: "cannot convert file type to pdf." But they are ALREADY PDFs and don't need to be converted. Why is this happening, and how can I reach my goal of having 370 OCR'd documents?
Hi, @B Porter. You do not state what your OS is (always handy for us folks) or HOW the files were OCRed (do you know what application was used?).
This can be helpful because SOME OCR software packages add the OCRed text on top of the scanned image, leaving the file ginormously large. If you scan a full 8.5 x 11 inch page as a TIFF document (at 300 ppi), it is typically around 8 MB. A JPG can be around 3-5 MB (at 300 ppi), but this varies greatly with how much compression was used. (Hint: no compression is best for later OCR processing.)
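As a rough check on the 8 MB figure, here is the arithmetic as a small Python sketch. It assumes an uncompressed 8-bit grayscale scan (one byte per pixel); that assumption is not stated in the post, and a 24-bit color scan would be about three times larger. The function name is made up for illustration.

```python
# Uncompressed size of a scanned page: pixels wide x pixels high x bytes per pixel.
# Assumption: no compression and no file-format overhead.
def raw_scan_bytes(width_in, height_in, ppi, bytes_per_pixel):
    width_px = round(width_in * ppi)    # 8.5 in * 300 ppi = 2550 px
    height_px = round(height_in * ppi)  # 11 in * 300 ppi = 3300 px
    return width_px * height_px * bytes_per_pixel

gray = raw_scan_bytes(8.5, 11, 300, 1)  # 8-bit grayscale (assumed)
rgb = raw_scan_bytes(8.5, 11, 300, 3)   # 24-bit color (assumed)

print(f"grayscale: {gray / 1_000_000:.1f} MB")  # grayscale: 8.4 MB
print(f"color:     {rgb / 1_000_000:.1f} MB")   # color:     25.2 MB
```

So "around 8 MB" is consistent with a letter-size grayscale page at 300 ppi before any compression.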
After OCR processing, both kinds of scan typically end up between 75 and 150 KB per page. But again, some OCR software produces very large documents. Fortunately, the ones that I'm familiar with can be re-OCRed in Acrobat, which results in a smaller document.
One other thing going against you: if a multi-page document has even one page that has already been OCRed, Acrobat "may" refuse to OCR the entire document. Such documents can be done one page at a time in Acrobat, but not as a whole.
Please let me know what you find, and we can take it from there.
I'm using Windows 10 Enterprise. I don't know how the files were OCR'd. There are hundreds of them. Some of them I did myself in Adobe. Others were already OCR'd when we downloaded them from various databases. A couple of them were scanned from a book by a colleague.
I thought this might be size-related but looking at the results, I don't think it is. Some bigger files were OCR'd just fine and some smaller ones weren't. See the screenshot of a small selection.
Hi, @B Porter, wait. I must have missed something, but none of the files in your screenshot that have issues has an Acrobat PDF icon. There is no icon.
I do not know much about Windows 10 (sorry, I'm a Mac guy), so seeing no icon is strange to me. What kind of files are these?
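One way to answer that question independently of icons or file extensions is to look at the first bytes of each file, since a PDF file starts with the marker `%PDF-`. The following is a minimal Python sketch, not anything Acrobat itself provides: the function names and the folder path are hypothetical, and checking the first 1024 bytes rather than only the very first five is a leniency chosen for this sketch.

```python
from pathlib import Path

def looks_like_pdf(path):
    """Return True if the PDF header marker appears near the start of the file."""
    with open(path, "rb") as f:
        head = f.read(1024)  # assumption: tolerate a little junk before the header
    return b"%PDF-" in head

def report(folder):
    """Print each file's name, extension, and whether it has a PDF header."""
    for p in sorted(Path(folder).iterdir()):
        if p.is_file():
            print(f"{p.name!r}: extension={p.suffix!r} pdf_header={looks_like_pdf(p)}")

# Example call with a hypothetical folder path:
# report(r"C:\path\to\pdf-folder")
```

A file that reports `pdf_header=True` but shows an empty or unexpected extension would be a PDF whose name Windows does not associate with Acrobat, which could explain a missing icon. This check says nothing about whether a file has already been OCRed.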
BTW, I failed to add one important, obvious piece of information about file size: A ten-page OCRed document can be larger than a single-page non-OCRed document. So, when comparing document size, it's important to compare apples to apples. Very sorry for that omission.
Like I explained in my first post, they are all PDFs. Every single file is a PDF file that I have opened in Adobe at some point. Look on the left side of the screen-- that is the window showing the files that I selected to open. They all have PDF icons in that window because they are all PDF files. There are no files that are not PDFs. Every file is already a PDF.
I don't really understand your comment about the page counts. I am simply going off the file sizes, which you can see listed in the window. Some of the larger files were OCR'd (see the first one listed, Louisiana Admin, which is 5.35 MB) and some of the smaller ones weren't (see Menzie, only 360 KB). There seems to be no rhyme or reason to this.