Participant
July 12, 2019
Question

Is there an API to access Cataloged Index files of PDF files created by Adobe?


In other words, if I create a catalog (index) file, can I read it from a program written in Python, C#, Java, or another language, and return to the user a list of files that contain the person or place name they searched for? This would let us make our archive of PDF files searchable on a web site. We do not want Google to index these files; they are not publicly available. Users must log in to the web site before they can search for and download PDF files matching their criteria. It would also be great if the search returned not just the file name but some context for each match. Is there something like this? Even if only the indexes of the PDFs are accessible via an API, it might be possible to use a library like BeautifulSoup to satisfy this need.


3 replies

Legend
July 15, 2019

OCR text is still normal text. In certain cases it is marked as hidden (do not display), but in all other ways (position, font, etc.) it is regular text. Text extraction apps and libraries exist, though I've never used any.
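
As a rough illustration of the point that OCR text can be read like any other page text, here is a minimal Python sketch. It assumes the third-party pypdf library is installed (it is not mentioned in this thread) and that the OCR layer is stored as ordinary, possibly hidden, page text as described above. The helper names `join_page_text` and `extract_pdf_text` are hypothetical.

```python
def join_page_text(pages):
    """Concatenate the text of each page, separated by blank lines.

    `pages` is any iterable of objects with an extract_text() method,
    such as the pages of a pypdf PdfReader.
    """
    parts = []
    for page in pages:
        # A page without extractable text may yield an empty value.
        text = page.extract_text() or ""
        parts.append(text.strip())
    return "\n\n".join(parts)


def extract_pdf_text(path):
    """Return all extractable text in the PDF at `path`.

    Assumption: pypdf is installed (pip install pypdf). The import is
    kept local so join_page_text can be used without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(path)
    return join_page_text(reader.pages)
```

Note that this returns all text on each page. It does not try to separate OCR text from text that was already in the file, since both are stored the same way.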

Legend
July 15, 2019

Frank, you should look for open source text extraction apps for PDF to use with your own web indexing system (e.g. Apache Lucene).
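
To make the "extract text, then index it yourself" idea concrete, here is a small sketch in plain Python using only the standard library. It is not Lucene and not an Adobe API; it is a toy inverted index that assumes the text of each PDF has already been extracted into a dictionary. The names `build_index` and `search` and the `width` parameter are made up for illustration.

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")


def build_index(texts):
    """Map each lowercased word to the set of file names containing it.

    `texts` is a dict of {file name: extracted text}.
    """
    index = defaultdict(set)
    for name, text in texts.items():
        for word in WORD.findall(text.lower()):
            index[word].add(name)
    return index


def search(index, texts, query, width=30):
    """Return {file name: [snippets]} for files containing every query word.

    Each snippet is the matched word plus up to `width` characters of
    surrounding text on either side.
    """
    words = WORD.findall(query.lower())
    if not words:
        return {}
    # Intersect the per-word file sets so that all words must be present.
    names = set.intersection(*(index.get(w, set()) for w in words))
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    results = {}
    for name in sorted(names):
        text = texts[name]
        snippets = []
        for match in pattern.finditer(text):
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            snippets.append(text[start:end])
        results[name] = snippets
    return results
```

In a web application, `texts` would come from whatever text extraction tool is used, and `search` would only be called after the site's own login check has passed. For a large archive, a dedicated search engine would likely scale better than an in-memory dictionary like this.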

Participant
July 15, 2019

Who is Frank? I will check out Lucene. It appears to be written in Java. I like Java, but our web server does not have Java available; maybe something similar can be found. Thanks!

Bernd Alheit
Community Expert
July 12, 2019

There is no documented API for this.

Participant
July 15, 2019

OK. That seems fair. Let's say that a user has a number of PDF files, all of which have had OCR run on them, some with Adobe Acrobat and some with ABBYY FineReader. Is there an accepted way to open such a file from Python or C# so that the program can read ONLY the OCR text? Is the OCR text stored on a separate layer? I am confident that Acrobat can use the OCR text, but can a standalone program open the file and extract just the text so it can build indexes?