Is there an API to access Cataloged Index files of PDF files created by Adobe?

Community Beginner, Jul 12, 2019

In other words, if I create a catalog file, can I read that catalog file from a program written in Python, C#, Java, or another language, and return to the user a list of the files that contain the person or place name they searched for? This would allow us to make our archive of PDF files searchable on a web site. We do not want Google to search these files; they are not publicly available. Users need to log into the web site before they are allowed to search for and download the PDF files matching their criteria. It would also be great if the search returned not just the file name but some context for each match. Is there something like this? If even just the indexes of the PDFs are accessible via an API, it might be possible to use a library like BeautifulSoup to satisfy this need.

TOPICS
Acrobat SDK and JavaScript

Adobe Community Professional, Jul 12, 2019

There is no documented API for this.

Community Beginner, Jul 15, 2019

OK. That seems fair. Let's say that a user has a number of PDF files, all of which have had OCR run on them, some with Adobe Acrobat and some with ABBYY FineReader. Is there an accepted way for a Python or C# program to open such a file and read ONLY the OCR text? Is the OCR text stored on a separate layer? I am confident that Acrobat can use the OCR text, but can a standalone program open the file and extract just the text so that it can build indexes?

Most Valuable Participant, Jul 15, 2019

Frank, you should look for open source text extraction apps for PDF, to use with your own web indexing system (e.g. Apache Lucene).
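To make the "extract text, then index it yourself" idea concrete, here is a minimal pure-Python sketch of an inverted index that returns file names together with a little context around the first match. It is only an illustration, not how Lucene or the Acrobat catalog works; the function names and the `{file_name: text}` input shape are assumptions made for the example, and the text is assumed to have been extracted from the PDFs already.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\w+")


def build_index(texts):
    """Map each lowercase word to the set of file names that contain it.

    `texts` is a dict of {file_name: extracted_text} (hypothetical shape).
    """
    index = defaultdict(set)
    for name, text in texts.items():
        for match in TOKEN.finditer(text.lower()):
            index[match.group()].add(name)
    return index


def search(index, texts, query, width=30):
    """Return (file_name, context) pairs for files containing every query word."""
    words = TOKEN.findall(query.lower())
    if not words:
        return []
    # Intersect the per-word file sets so that all words must be present.
    names = set.intersection(*(index.get(w, set()) for w in words))
    results = []
    for name in sorted(names):
        text = texts[name]
        # Show some text around the first whole-word hit of the first query word.
        hit = re.search(r"\b" + re.escape(words[0]) + r"\b", text, re.IGNORECASE)
        if hit is None:
            context = ""
        else:
            start = max(0, hit.start() - width)
            context = text[start:hit.end() + width].strip()
        results.append((name, context))
    return results
```

A real deployment would probably want ranking, phrase search, and an index stored on disk, which is what a dedicated search library provides.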

Community Beginner, Jul 15, 2019

Who is Frank? I will check out Lucene. It appears to be written in Java. I like Java, but the web server does not have Java available, so maybe something similar can be found. Thanks!
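One candidate for "something similar" without Java is SQLite's FTS5 full-text extension, reachable from Python's built-in sqlite3 module. The sketch below assumes the SQLite library on the server was compiled with FTS5 (common, but worth checking on your host) and that the PDF text has already been extracted; the table name `docs` and the function names are made up for the example.

```python
import sqlite3


def make_index(texts, db_path=":memory:"):
    """Create an FTS5 table and load {file_name: text} into it."""
    conn = sqlite3.connect(db_path)
    # `name` is stored but not searched; only `body` is full-text indexed.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(name UNINDEXED, body)"
    )
    conn.executemany("INSERT INTO docs (name, body) VALUES (?, ?)", texts.items())
    conn.commit()
    return conn


def fts_search(conn, phrase):
    """Return (file_name, snippet) rows for documents containing the phrase."""
    if not phrase.strip():
        return []
    # Quote the input so FTS5 treats it as a phrase rather than query syntax.
    quoted = '"' + phrase.replace('"', '""') + '"'
    sql = (
        "SELECT name, snippet(docs, 1, '[', ']', '...', 8) "
        "FROM docs WHERE docs MATCH ? ORDER BY rank"
    )
    return conn.execute(sql, (quoted,)).fetchall()
```

The `snippet()` call is what supplies the context around each match, with the matched words wrapped in square brackets. Passing a file path instead of `":memory:"` would keep the index on disk between requests.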

Most Valuable Participant, Jul 15, 2019

OCR text is still normal text. In certain cases it is marked as hidden (do not display), but in all other ways (position, font, etc.) it is regular text. Text extraction apps and libraries exist, though I've never used any.
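Since the OCR layer is ordinary page text, a general-purpose extraction library should be able to read it. As one hedged illustration, the sketch below uses the open source pypdf package (a third-party library that would need to be installed; it is not something named in this thread) to pull the text out of every PDF in a folder. The helper names are made up for the example, and how cleanly the hidden OCR text comes out will depend on how each file was produced.

```python
from pathlib import Path


def normalize(text):
    """Collapse runs of whitespace so line breaks do not split names or phrases."""
    return " ".join(text.split())


def extract_pdf_text(path):
    """Return the text of every page in one PDF as a single string."""
    from pypdf import PdfReader  # third-party; imported here so it is only needed for extraction

    reader = PdfReader(str(path))
    # extract_text() may yield nothing for a page with no text layer.
    pages = [page.extract_text() or "" for page in reader.pages]
    return normalize("\n".join(pages))


def extract_folder(folder):
    """Map each PDF file name in `folder` to its extracted text."""
    return {p.name: extract_pdf_text(p) for p in sorted(Path(folder).glob("*.pdf"))}
```

The resulting `{file_name: text}` dictionary is the kind of input a separate indexing step could consume.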
