OCR: what can be searchable, what can't be, and more.
I want to fully understand OCR. I am running Acrobat 10. I have a massive number of documents that must all be searchable, and I have plenty of time to get it done. What I want to be sure of is under what conditions a PDF is or is not searchable. This is roughly half a petabyte of data. Yes, that's right. Some of these files were created as "Searchable Image (Exact)" PDFs, others were generated from old microfiche, and still others are JPG and JPEG 2000 images.
I also need to know whether there is any way to find out, across this amount of data, which files are searchable and which are not. Can I run OCR on all of the files and have it put the OCR'ed files in a designated folder (which I realize will take forever)? If I do this, will it also process files that are already searchable, or will it skip those and leave them out of the output folder? And once the files are OCR'ed, will they still be searchable after being uploaded to a site, or is that another can of worms?
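To make the "what is searchable and what is not" part concrete, this is roughly the kind of triage I am picturing. It is only a sketch in Python, not something Acrobat provides, and it rests on an assumption of mine: that a PDF with a text layer will reference a `/Font` somewhere, while a pure scan will not. The names `has_font_marker` and `triage` are made up for illustration. The guess will be wrong for some files, for example a scanned PDF that only has a stamped page number, so treat the result as a first pass rather than a verdict.

```python
import re
import zlib
from pathlib import Path

# Image formats mentioned above; these always need OCR.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".jp2", ".jpx"}

# Matches the raw bytes between "stream" and "endstream" in a PDF.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def has_font_marker(data: bytes) -> bool:
    """Crude guess (assumption): a PDF with a text layer references a /Font."""
    if b"/Font" in data:
        return True
    # Newer PDFs may pack objects into Flate-compressed streams,
    # so try to inflate each stream and look again.
    for match in STREAM_RE.finditer(data):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (e.g. a JPEG image stream)
        if b"/Font" in inflated:
            return True
    return False


def triage(root: str) -> dict:
    """Walk a folder tree and sort files into two rough buckets."""
    report = {"probably_searchable": [], "needs_ocr": []}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix == ".pdf":
            # Reads the whole file into memory; fine for a sketch only.
            found = has_font_marker(path.read_bytes())
            key = "probably_searchable" if found else "needs_ocr"
        elif suffix in IMAGE_SUFFIXES:
            key = "needs_ocr"
        else:
            continue  # ignore anything that is not a PDF or an image
        report[key].append(str(path))
    return report
```

Something like `triage("/path/to/archive")` would then give me two lists, and only the `needs_ocr` list would have to go through the OCR batch. Whether this heuristic is good enough at this scale is part of what I am asking.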
I know I am leaving a question or two out, but they will come to me. Thanks very much in advance!