I support a website for a local museum that offers access to scanned newspapers from 1915 to 1929. There are 600 searchable issues, which can be accessed through the reference materials section of the main WordPress site. Link: Reference Materials – The Patagonia Museum
What I would like to do is create a way for a user to search for keywords across all the issues at once.
Does Adobe offer a tool that will allow me to do that?
Thanks in advance for your help.
Linda S. for
The Patagonia Museum
Hi Linda,
That's more of a database issue than a PDF issue. You need to find out how the online search feature will work under the hood.
You can search any PDF for any word, and assuming that the spelling matches (e.g., color versus colour), you're good. No tool is necessary; a find is a find.
However, if each article is compared against a pre-made database that only has those 600 words, then you're stuck. One other layer is partial words. In my example above, both words could be found by typing just "col", as that fits both. The other advantage of this type of entry is that it will catch all of "color," "colors," "colour," and "colours." (On the other hand, getting too many search results has other issues, but I'll ignore that for now.)
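To make the partial-word idea concrete, here is a minimal sketch in Python. It assumes the text of each issue has already been extracted from the PDFs into plain strings; the file names and sample sentences are made up for illustration, and `build_index` and `prefix_search` are hypothetical helper names, not part of any Adobe tool.

```python
import re
from collections import defaultdict


def build_index(issues):
    """Map each lowercase word to the set of issue names that contain it."""
    index = defaultdict(set)
    for name, text in issues.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(name)
    return index


def prefix_search(index, prefix):
    """Return {matched word: sorted issue names} for words starting with prefix."""
    prefix = prefix.lower()
    return {
        word: sorted(names)
        for word, names in index.items()
        if word.startswith(prefix)
    }


# Hypothetical sample text standing in for OCR output from two issues.
issues = {
    "1915-03-05.pdf": "The colour of the new school house was discussed.",
    "1921-07-15.pdf": "Bright colors decorated Main Street for the parade.",
}

index = build_index(issues)
hits = prefix_search(index, "col")
for word, names in sorted(hits.items()):
    print(word, names)
```

Typing "col" here matches both "colour" and "colors" in one query, which is the behavior described above. A real site would more likely keep the index in a database than in memory.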
One last issue is the quality of the scan and the resulting OCR. You will get better results with larger text and higher-resolution scanning. That is, a 24-point font will be recognized more reliably than a 10-point font, and a 300 ppi scan will do better than a 150 ppi scan. The smaller the font and the lower the resolution, the more likely errors become. For example, the letter pair "ri" might easily be read as "n". In addition, since you are using rather old material, the quality of the paper can also affect the quality of the OCR.
I wrote a blog post for Adobe that covers this issue:
https://forums.adobe.com/community/creativepipeline/blog/2018/01/22/scanning-clean-search-able-pdfs
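One way a search feature could tolerate the "ri" read as "n" kind of OCR error is to also look for the misread spelling of the search word. The following Python sketch is only an illustration under that assumption: the confusion list contains just the one pair mentioned above, and `ocr_variants` and `find_with_variants` are hypothetical names.

```python
import re


def ocr_variants(word, confusions=(("ri", "n"),)):
    """Return the word plus variants with one confusable pair swapped at a time."""
    variants = {word}
    for good, bad in confusions:
        start = word.find(good)
        while start != -1:
            variants.add(word[:start] + bad + word[start + len(good):])
            start = word.find(good, start + 1)
    return variants


def find_with_variants(text, word):
    """Return the spellings of word (correct or misread) that occur in text."""
    words_in_text = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(ocr_variants(word.lower()) & words_in_text)


# Hypothetical OCR output in which "American" was misread.
sample = "The Amencan flag was raised over the school."

variants = ocr_variants("american")
found = find_with_variants(sample, "American")
print(variants, found)
```

This only covers misreadings you list in advance, so it complements a clean, high-resolution scan rather than replacing it.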
If these are newspaper-sized documents, you might also consider photographing the pages rather than scanning them. If you want/need more information on this, let me know.
Linda, we had a similar problem with our local museum except we have thousands of images (instead of newspapers). We never did find any features in the Adobe products to help present keyworded resources onto the web.
I notice that Patagonia offers the scanned newspapers without HTML support; they expose the file system, which is fine and very simple to implement, but can be awkward to navigate, as you mentioned. Fortunately, this means you can start from scratch without baggage or an existing system that you have to match.
We ended up writing our own web server software to display Lightroom images that are tagged with hierarchical keywords. It's a general solution in that it reads all keywords and the hierarchy directly from the set of uploaded images. All of our image management is done in Adobe Lightroom Classic. Web visitors can search by keyword and can view the entire list of keywords to find topics of interest.
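For readers curious what "reading the keywords and the hierarchy" can look like in general, here is a small Python sketch. It is not the code from our project; it assumes each hierarchical keyword arrives as a single string with levels separated by "|", and the sample keywords and the names `build_tree` and `list_keywords` are invented for illustration.

```python
def build_tree(paths, sep="|"):
    """Turn keyword paths like 'Buildings|Schools' into a nested dict."""
    tree = {}
    for path in paths:
        node = tree
        for part in path.split(sep):
            node = node.setdefault(part.strip(), {})
    return tree


def list_keywords(tree, depth=0):
    """Return the keyword hierarchy as indented lines, sorted at each level."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * depth + name)
        lines.extend(list_keywords(tree[name], depth + 1))
    return lines


# Hypothetical keyword paths collected from a set of tagged images.
paths = ["Buildings|Schools", "Buildings|Churches", "People|Families"]

tree = build_tree(paths)
lines = list_keywords(tree)
print("\n".join(lines))
```

The indented listing is the sort of "entire list of keywords" a visitor could browse to find topics of interest.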
Our web software is open source: https://github.com/barry-ha/Lightroom-nested-keywords
A small working demo: http://nestedkeywords.com/
Our large public website: https://photos.shorelinehistoricalmuseum.org/photo-gallery.html
Perhaps you have a web developer who can look at adapting this to your scanned newspapers. I realize you wrote that five years ago, and our approach would be a stretch and probably a lot of work. Good luck. ~Barry~