Is there an API to access Cataloged Index files of PDF files created by Adobe?

July 12, 2019

In other words, if I create a catalog (index) file, can I read it from a program written in Python, C#, Java, or another language, and return to the user a list of the files that contain the person or place name they searched for? This would allow us to make our archive of PDF files searchable on a web site. We do not want Google to index these files; they are not publicly available. Users must log in to the web site before they are allowed to search and download PDF files matching their criteria. It would also be great if the search returned not just the file name but some context for each match. Does something like this exist? If even just the indexes of the PDFs were accessible via an API, it might be possible to use a library like BeautifulSoup to satisfy this need.



Test Screen Name

Legend

OCR text is still normal text. In certain cases it is marked as hidden (do not display), but in all other ways (position, font, etc.) it is regular text. Text extraction apps and libraries exist, though I've never used any.
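To illustrate the point that OCR text can be read like any other page text, here is a minimal sketch in Python. It assumes the third-party pypdf package is installed (`pip install pypdf`); the function names `normalize_text` and `extract_page_texts` are made up for this example, and other extraction libraries would work along the same lines.

```python
import re


def normalize_text(raw: str) -> str:
    """Collapse runs of whitespace so extracted text is easier to index."""
    return re.sub(r"\s+", " ", raw).strip()


def extract_page_texts(pdf_path: str) -> list[str]:
    """Return one normalized string per page of the PDF at pdf_path.

    Assumption: the third-party pypdf package is available. It is imported
    inside the function so that normalize_text() works without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() returns whatever text the page contains. Hidden OCR text
    # is ordinary page text, so it should normally be returned as well. A
    # scanned page with no text layer yields an empty string.
    return [normalize_text(page.extract_text() or "") for page in reader.pages]
```

Extraction quality depends on how the OCR layer was written, so it is worth spot-checking the output on a few files produced by each OCR tool before indexing a whole archive.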


Test Screen Name

Legend

Frank, you should look for open source text extraction apps for PDF to use with your own web indexing system (e.g. Apache Lucene).
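As a rough picture of what "your own indexing system" involves, the sketch below builds a tiny inverted index over text that has already been extracted from each PDF and returns file names with a little context around each match. It uses only the Python standard library. The names `build_index` and `search`, the `width` parameter, and the sample file names are all hypothetical, and this is a toy stand-in for a real search engine such as Lucene, not a replacement for one.

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased word to the set of file names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in WORD.findall(text.lower()):
            index[word].add(name)
    return index


def search(docs, index, query, width=30):
    """Return (file name, snippet) pairs for files containing every query word."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    # Intersect the per-word file sets so only files with all words remain.
    names = set.intersection(*(index.get(w, set()) for w in words))
    results = []
    for name in sorted(names):
        text = docs[name]
        # Show context around the first occurrence of the first query word.
        match = re.search(r"\b" + re.escape(words[0]) + r"\b", text, re.IGNORECASE)
        if match:
            start = max(match.start() - width, 0)
            end = min(match.end() + width, len(text))
            snippet = text[start:end]
        else:
            # Fallback for rare case-folding mismatches: show the start of the text.
            snippet = text[: 2 * width]
        results.append((name, snippet))
    return results


# Hypothetical extracted text, keyed by file name.
docs = {
    "a.pdf": "Letter from Frank Smith, dated 1912, sent from Boston.",
    "b.pdf": "Minutes of the Boston council meeting.",
}
index = build_index(docs)
for name, snippet in search(docs, index, "boston"):
    print(name, "...", snippet)
```

Because the index lives on the server and is queried only by the site's own code, it can sit behind the login and never be exposed to public crawlers.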


Prometheus_Unbound (Author)

Participant

Who is Frank? I will check out Lucene. It appears to be Java. I like Java, but our web server does not have Java available; maybe something similar can be found. Thanks!

Bernd Alheit

Community Expert

There is no documented API for this.


Prometheus_Unbound (Author)

Participant

OK. That seems fair. Let's say that a user has a number of PDF files, all of which have had OCR run on them, some with Adobe Acrobat and some with ABBYY FineReader. Is there an accepted way to open such a file from Python or C# so that the program reads ONLY the OCR text? Is the OCR text stored on a separate layer? I am confident that Acrobat can use the OCR text, but can a standalone program open the file and extract just the text so it can build indexes?
