Participating Frequently
June 2, 2020
Question

OCR- What can be, what can't be and more.


I want to fully understand OCR. I am running Acrobat 10. I have a massive number of documents that must all be searchable, and I have plenty of time to do it. What I want to be sure of is: under what conditions is a PDF searchable, and under what conditions is it not? This is roughly half a petabyte of data. Yes, that's right. Some of these were created as searchable PDFs ("exact"), others were generated from old microfiche, and still others are in both JPG and JPEG 2000.

 

I also need to know whether there is any way to find out, across this amount of data, which files are searchable and which are not. Can I run OCR on all files and have it put the OCR'ed files in a designated folder, even though that will take forever? If I do this, will it also process files that are already searchable, or will it skip those and omit them from the output folder? And if I OCR them, will they be searchable once uploaded to a site, or is that another can of worms?

 

I know I am leaving a question or two out but it will come to me. Thanks in advance very very much!


3 replies

Legend
June 2, 2020

It's generally accepted that to get good results you need a human to proofread the output. That may take man-decades for your collection. Otherwise there will be errors, from minor to ridiculous.

62chuckAuthor
Participating Frequently
June 5, 2020

I have an add-on question here, about the post gary_sc posted. My Acrobat X follows it, but only up to a point. At Step 3: Create a Batch Sequence, mine doesn't match. First off, Acrobat X Pro doesn't have the word "Sequence" anywhere that I can find. However, I checked my Acrobat X PDF Bible by Ted Padova and it says the following.

But when I go into my Actions, if I am following this page correctly, I get to the same Step 3: Create a Batch Sequence, at item 5, "The Select Commands window opens":
A) From the list at left, choose Preflight
B) Click the Add> button

 

Mine diverges again and the action I imported doesn't show up, although it is in Preflight, where I imported it. Either I am not following it right or I am just not seeing it. I have looked thoroughly... I think.

 

 

I do want to take a time out here, though, and thank you all for the help so far. I am learning more about Acrobat than I thought I ever wanted to know. But I greatly enjoy the learning process.

 

SchweineKarl
Inspiring
June 2, 2020

Hello Chuck,

I've been scanning books and magazines since the DOS days, so you can get in touch with me if you like and I'll provide you with some dos and don'ts for converting text into electronic form.

You can reach me at: scansite@planet.nl.

 

Regards,

Carel

gary_sc
Community Expert
June 2, 2020

Hi 62chuck,

 

Sorry but this is a long one.

 

When everything is perfect, OCR-ing your documents will give you great content. When things are not all that good to begin with, your quality will suffer.

 

The vast majority of documents are OCRed after being scanned. The general rule of thumb is to scan at no less than 300 ppi. 600 ppi should give you better results, but I have observed times when it doesn't, and I've no clue as to why that is.

 

Generally, OCR snafus happen when letters that sit too close together are mistaken for a different letter. For example, "ri" is read as "n." You get the idea.

 

A lot of folks think of scanning as walking up to the scanner, placing the document inside, tapping the button on the lid that says "Scan," and being done. The problem is that a scan made that way is unlikely to give you the best OCR. More on that in a minute.

 

The other thing you mention is the source:

 

Microfiche: These are positive images (like a photographic slide) that were not taken at high resolution and are likely to have a number of horizontal scratches across the surface, both of which will degrade the quality of the OCR process.

 

JPEG 2000: I mention this because these should be good quality but are of unknown resolution. If the resolution is high enough, you have a chance of good-quality OCR.

 

JPG: A standard JPG is a compression format. That makes it what's called a "lossy" image, in that it loses information. Technically, it works on 8×8-pixel blocks and decides what detail to drop within each block. The greater the compression, the more it drops. At some point you get JPG degradation. You can see this in a heavily compressed image when you look closely at a dark object next to a light object (for example, a meeting notice sent via email where the notice was saved as a JPG). If you look closely at the image you'll see pixel spots adjacent to the letters. THAT's JPG degradation, and it plays hell with OCR.

 

What I'm getting at here is that your results will be both good and bad. How does this affect you? Let's say that you are searching for the word "explorer." If one of the documents has that word as "expl0rer," it will not be found. In addition, Acrobat does NOT deal with words hyphenated across a line break. So if "ex-" ends one line and "plorer" starts the next, again that word will not be found.
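
To make those two failure modes concrete, here is a minimal Python sketch. It is not what Acrobat does internally; it only illustrates why an exact search misses "expl0rer" and a word split across lines, and how a post-processing step on extracted text might catch them. The function names, the sample text, and the 0.85 similarity threshold are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def dehyphenate(text: str) -> str:
    """Join a word split by a hyphen at a line break: 'ex-\\nplorer' -> 'explorer'."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def fuzzy_hits(text: str, term: str, threshold: float = 0.85) -> list[str]:
    """Return the words in `text` whose similarity to `term` meets the threshold."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words
            if SequenceMatcher(None, w, term.lower()).ratio() >= threshold]


# Hypothetical OCR output containing both problems described above.
page = "The expl0rer set out at dawn. A second ex-\nplorer followed."

print("explorer" in page)                          # False: exact search misses both
print(fuzzy_hits(dehyphenate(page), "explorer"))   # ['expl0rer', 'explorer']
```

Note that the de-hyphenation rule is blunt: it would also turn a genuine compound such as "well-" / "known" into "wellknown", so treat it as a search aid rather than a correction of the text.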

 

Is there anything you can do about this? Yes, to a degree, but it will take time. 

Within Acrobat, search for potentially wrong words. Similar to a spell check, it will go through the document looking for strange words. Then search and fix, search and fix.

 

Earlier I mentioned a quality scan. I wrote a blog about this for Adobe some time back, here's a link:

https://community.adobe.com/t5/adobe-community-professionals/scanning-clean-searchable-pdfs/m-p/4785435?page=1#M89

 

Also, please note that you cannot scan from Acrobat itself. Acrobat on Windows provides a portal to your scanning software. On the Mac it's much worse, as they do not allow a TWAIN connection to your scanning software but rather provide a link to their own Image Capture software, which does dreadful scanning. Because of this (I am a Mac user), I just save my scanner software's results into a folder and process the TIF images I create from there.

 

Lastly, there was one time I had thousands of pages to scan and was able to borrow a FujiScan, which did a fantastic job of scanning. But there were two issues: 1) I had to destroy all of the books and magazines so that there were loose pages to scan, and 2) the FujiScan OCR software was dreadful. It left the pages overly bloated in storage size, and the quality (accuracy) of the results was very poor.

 

As far as destroying the books and magazines goes, that was not really a loss; I was just trying to digitize everything. For the latter issue, after scanning a block of items, I would point Acrobat at the folder for that set of scans and tell it to OCR those documents. This both dramatically decreased the storage size of the documents and significantly increased the accuracy of the OCR.
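
On the original question of telling already-searchable files from image-only ones before pointing Acrobat at a folder, a rough triage script is one option. The Python sketch below is only a heuristic built on an assumption: a PDF with a text layer references at least one font, while a pure scan usually does not. It will misjudge some files (for example, encrypted PDFs, or scans that carry only a stamped header or page number), and it reads each file fully into memory, so treat the result as a first sort and not a verdict. The function names and the folder path in the usage note are hypothetical.

```python
import re
import zlib
from pathlib import Path

# Matches the raw bytes between "stream" and "endstream" in a PDF file.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def has_font_reference(data: bytes) -> bool:
    """Heuristic: a PDF that never mentions /Font probably has no text layer."""
    if b"/Font" in data:
        return True
    # PDF 1.5+ files may keep their dictionaries in Flate-compressed object streams.
    for match in STREAM_RE.finditer(data):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data, e.g. a JPEG image stream
        if b"/Font" in inflated:
            return True
    return False


def triage(folder: str) -> dict[str, list[Path]]:
    """Split the PDFs under `folder` into likely-searchable and likely-image-only."""
    result: dict[str, list[Path]] = {"likely_searchable": [], "needs_ocr": []}
    for pdf in sorted(Path(folder).rglob("*.pdf")):
        found = has_font_reference(pdf.read_bytes())
        result["likely_searchable" if found else "needs_ocr"].append(pdf)
    return result
```

Calling `triage("scans")` on a folder of your own would return two lists of paths; only the `needs_ocr` list would then need to go through Acrobat's OCR.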

 

Which leads me to a final final point: You talk of half a petabyte of data. A 300 ppi full-page TIF image is about 8-10 MB. After you OCR that same page it will be about 80-140 KB. Thus, there are several advantages to making a document searchable.
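
As a sanity check on those numbers, the short Python calculation below works out the uncompressed size of one scanned page and the implied reduction. The page size (US Letter), the bit depth (8-bit grayscale), and the absence of compression are assumptions made for illustration; only the 300 ppi and 80-140 KB figures come from the paragraph above.

```python
# Assumptions (not from the post): US Letter page, 8-bit grayscale, no compression.
WIDTH_IN, HEIGHT_IN = 8.5, 11.0
PPI = 300
BYTES_PER_PIXEL = 1

width_px = round(WIDTH_IN * PPI)     # 2550
height_px = round(HEIGHT_IN * PPI)   # 3300
raw_bytes = width_px * height_px * BYTES_PER_PIXEL
raw_mb = raw_bytes / 1_000_000

# Per-page sizes quoted above for the same page after OCR in Acrobat.
ocr_kb_low, ocr_kb_high = 80, 140
ratio_high = raw_bytes / (ocr_kb_low * 1000)
ratio_low = raw_bytes / (ocr_kb_high * 1000)

print(f"{width_px} x {height_px} px, about {raw_mb:.1f} MB uncompressed")
print(f"roughly {ratio_low:.0f}x to {ratio_high:.0f}x smaller after OCR")
```

Under those assumptions the page comes to about 8.4 MB, which lines up with the 8-10 MB figure, and the quoted output sizes work out to a reduction of roughly 60x to 105x per page.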

 

I hope this gives you some insights into making documents searchable.

 

Let me know if you have any other questions.

62chuckAuthor
Participating Frequently
June 2, 2020

I understand everything and much appreciate your time. These files are 50/50, but I cannot do anything about the poor-quality ones and expect them to have defects. To explain: these are magazines and newspapers, many of them from the 1700s to 1940. Needless to say, these are scanned from microfiche. Some came to me on microfiche, or pressed between veneer panels and stained. Others were rolled and placed in cedar boxes, and many were just folded. I am trying to save as many as I can because not even the Library of Congress has many of these. I have a few that are being professionally restored, and others will be put away until I can afford that same restoration.

My biggest concern right now is my scan quality because these will be going online at some point. I cannot afford a professional consultant so I am doing the best I can on my own without causing more damage.

 

The link you provided is showing dead. Is that the correct link?

Bernd Alheit
Community Expert
June 2, 2020

The link is only for ACPs.