Hi. I'm converting pre-digital-layout issues of a magazine into searchable PDFs for our archive. I would like to save some time and NOT run OCR on ads.
My question is: is it possible to run OCR on partial pages, rather than entire pages?
Thank you in advance for any help.
Susan
In a word, no. (But please read all the way to the end; I may have thought of something.)
I do remember that years ago there was an OCR package that let you do that, but that was back in the days when OCR took a LONG time and skipping regions saved real time. Now, blocking out each ad by hand will probably take longer than the time you'd save by having Acrobat skip OCR on it.
However, how are you creating your magazine in the first place? If you are using InDesign or FrameMaker, you can generate PDFs directly out of those applications. Is your company using some proprietary software that doesn't do that?
Otherwise, the ad itself will undergo some level of storage compression during the OCR process, so the final document will be smaller than if the OCR process had not been run on the ads.
Oh, I just had one thought: if you are using (say) InDesign, place the ads on their own layer, and turn that layer off when saving the PDF, you'd have all the text but no ads. That would be one way to do what you want.
We are using InDesign and I've already archived 2002-2018.
We were using computer layout 1992-2002, but that digital archive has been lost. Prior to 1992 is another 22 years in which layout was pasteup. So, I have 32 years (1970-2002) worth of scanned issues.
The earliest pages from our printed archive are on lower quality paper and include stray ink marks, not-so-great printing, and some page damage, so I'm going through each page and correcting the OCR text to reflect the original layout. A lot of the ad copy is tiny and low quality, and OCR turns some of it into gibberish. Since this is for our archive, I'm stuck between correcting all the ad text so that it's as accurate as editorial, or deleting all of it, which involves selecting and deleting each individual word or mark that OCR has picked up. Unfortunately, there does not seem to be a way to select a section of a page and delete the OCR text. Either way, it's awfully time consuming.
I would really, really like to be able to exclude those ad portions of the pages from the OCR process to begin with.
Hi Susan,
Thanks for filling in the details.
Some thoughts: since you are scanning the pages (sheesh, what a job, good luck), you could cover the ads with pieces of post-it notes before scanning. Tedious, yes, but it would do what you want.
Alternatively, just ignore the ads. Since they are not germane to the primary focus, leave their OCR text as it is.
Alternatively, within Acrobat, go into the Edit tool and just delete the copy. (Warning: depending on how much staining there is across the page, you risk deleting things you do not want to delete, so if you do this, do it one item at a time so you can Undo as necessary.)
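For pages that are already scanned, the post-it idea has a digital equivalent: paint over the ad areas in the page image before running OCR, so the OCR engine has nothing to read there. This is not an Acrobat feature, just a rough sketch in plain Python of the masking step itself. The function name `mask_regions`, the pixel-grid representation, and the box coordinates are all made up for illustration; in practice you would load and save real scans with an image library and would still have to mark each ad rectangle by hand.

```python
def mask_regions(pixels, boxes, fill=255):
    """Return a copy of a grayscale page with each box painted a flat colour.

    pixels: list of rows, each a list of 0-255 ints (0 = black, 255 = white).
    boxes:  iterable of (left, top, right, bottom); right/bottom are exclusive.
    fill:   value to paint with (255 = white, like a blank post-it).
    """
    masked = [row[:] for row in pixels]  # copy so the original scan is untouched
    height = len(masked)
    width = len(masked[0]) if height else 0
    for left, top, right, bottom in boxes:
        # Clamp the box to the page so a sloppy rectangle cannot raise IndexError.
        for y in range(max(0, top), min(height, bottom)):
            for x in range(max(0, left), min(width, right)):
                masked[y][x] = fill
    return masked


# A tiny 6-wide, 4-tall all-black "page" with a hypothetical ad in the right half.
page = [[0] * 6 for _ in range(4)]
ad_boxes = [(3, 0, 6, 4)]
clean = mask_regions(page, ad_boxes)
```

Keep the original scans and OCR only the masked copies, since the ads are still part of the archive even if their text is not searchable.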
Lastly, to make sure you are getting good quality (or at least better quality) scans, please read this article I wrote for Adobe some time back.
https://forums.adobe.com/community/creativepipeline/blog/2018/01/22/scanning-clean-search-able-pdfs
One thing I did not mention in that article is scanning resolution: higher is better. The higher the resolution, the better the OCR quality. At lower resolution, letters are more likely to merge together, so that letter combinations such as "ri" might be read as "n" or something else. So scan at a minimum of 300 ppi, or preferably 600 ppi, to get much better results. Be aware that a 600 ppi page will be 4 times the storage size of a 300 ppi page*, but once you OCR it, the page storage size will be considerably smaller.
Let me know how this works out.
* If you double the resolution of a scan, the uncompressed storage size increases four times (twice as many pixels across and twice as many pixels down).
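To put rough numbers on the 300 vs. 600 ppi comparison, here is a small Python sketch of the arithmetic. The page size (8.5 x 11 inches) and the 8-bit grayscale depth are assumptions for illustration only, since the magazine's trim size and scan settings are not stated here, and the figures are for uncompressed pixel data, not the size of the final PDF.

```python
def uncompressed_bytes(width_in, height_in, ppi, bytes_per_pixel=1):
    """Raw size of a scanned page before any compression.

    width_in, height_in: page size in inches (assumed values, see above).
    ppi: scanning resolution in pixels per inch.
    bytes_per_pixel: 1 for 8-bit grayscale (an assumption), 3 for 24-bit colour.
    """
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px * height_px * bytes_per_pixel


size_300 = uncompressed_bytes(8.5, 11, 300)  # 2550 x 3300 pixels
size_600 = uncompressed_bytes(8.5, 11, 600)  # 5100 x 6600 pixels
ratio = size_600 / size_300

print(f"300 ppi: {size_300:,} bytes")
print(f"600 ppi: {size_600:,} bytes")
print(f"ratio:   {ratio}")
```

The ratio comes out to 4 regardless of the page size or bit depth chosen, because both the width and the height in pixels double.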
Fortunately, I'm not having to scan the pages myself. That was done several years ago. The scans are ok, but they are high resolution. Ink spread is a problem on the paper on which the magazine was printed, particularly in the 1970s, which is why I'm making the effort to correct the OCR text to match the printed text. Correcting or deleting the resulting text in the ads is equally time consuming... I was just hoping there was a way to not have to do it, since they're not really consequential to editorial content.
Thank you for your suggestions.