In other words, if I create a catalog file, can I read that catalog file from a program written in Python, C#, Java, or another language and return a list of the files that contain the person or place name the user searched for? This would allow us to make our archive of PDF files searchable on a web site. We do not want Google to search these files, because they are not publicly available. Users need to log into the web site before they are allowed to search for and download PDF files matching their criteria. Also, it would be great if the search returned not just the file name but some context for each match. Is there something like this? If even just the indexes of the PDFs are accessible via an API, it might be possible to use a library like BeautifulSoup to satisfy this need.
There is no documented API for this.
OK. That seems fair. Let's say that a user has a number of PDF files, all of which have had OCR run on them, some with Adobe Acrobat and some with ABBYY FineReader. Is there an accepted way for a Python or C# program to open such a file and read ONLY the OCR text? Is the OCR text stored on a separate layer? I am confident that Acrobat can use the OCR text, but can a standalone program open the file and extract just the text so it can build indexes?
Frank, you should look for open source text extraction apps for PDF to use with your own web indexing system (e.g. Apache Lucene).
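To make the "your own indexing system" idea concrete, here is a minimal sketch in Python using only the standard library. It assumes the text of each PDF has already been extracted into a plain string; the file names, sample text, and the function names `build_index` and `search` are all made up for illustration. It is a toy word index, not a substitute for Lucene, but it shows how a search can return a context snippet along with each file name.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of file names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search(docs, index, term, width=30):
    """Return (file name, context snippet) pairs for a single-word term."""
    term = term.lower()
    pattern = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)
    results = []
    # The index narrows the search to files known to contain the word.
    for name in sorted(index.get(term, ())):
        text = docs[name]
        for match in pattern.finditer(text):
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            results.append((name, text[start:end].strip()))
    return results


# Hypothetical extracted text, keyed by (hypothetical) file name.
docs = {
    "letters_1901.pdf": "Mr. Whitfield travelled to Boston in the spring.",
    "ledger_1902.pdf": "Payments received from the Chicago office.",
}
index = build_index(docs)
hits = search(docs, index, "boston")
```

Each entry in `hits` pairs a file name with up to `width` characters of text on either side of the match, which is the piece a web page would show next to the download link. Multi-word queries, stemming, and ranking are left out on purpose; a real search library handles those.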
Who is Frank? I will check out Lucene. It appears to be written in Java. I like Java, but our web server does not have Java available. Maybe something similar can be found. Thanks!
OCR text is still normal text. In certain cases it is marked as hidden (do not display), but in all other ways (position, font, etc.) it is regular text. Text extraction apps and libraries exist, though I've never used any.
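As one possible illustration of such a library, the sketch below uses the open source `pypdf` package for Python (a third-party install, picked here only as an example and not something recommended earlier in this thread). `PdfReader` and `page.extract_text()` are real pypdf calls; the helper names `extract_pages` and `tidy` are invented for this example. Whether hidden OCR text comes out cleanly for files produced by Acrobat or FineReader is an assumption you would need to verify on your own files.

```python
import re


def extract_pages(path):
    """Return the text of each page of a PDF as a list of strings."""
    # Imported here so the rest of the module works without pypdf installed.
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    # extract_text() can yield an empty result for pages with no text.
    return [page.extract_text() or "" for page in reader.pages]


def tidy(text):
    """Rejoin words hyphenated across line breaks and collapse whitespace."""
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()


# Example of the cleanup step on a made-up OCR fragment.
sample = tidy("archi-\nval  records\nof 1901")
```

Calling `extract_pages("some_file.pdf")` (a hypothetical path) and passing each page through `tidy` would give plain strings that can be fed to whatever index you build. Note that `tidy` also removes hyphens from genuinely hyphenated words that happen to break across lines, which is usually acceptable for search but not for faithful transcription.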