
Is there an API to access Cataloged Index files of PDF files created by Adobe?

Community Beginner, Jul 12, 2019

In other words, if I create a catalog file, can I read it from a program written in Python, C#, Java, or another language and return to the user a list of the files that contain the person or place name they searched for? This would let us make our archive of PDF files searchable on a web site. We do not want Google to index these files; they are not publicly available. Users must log in to the web site before they can search for and download PDF files matching their criteria. It would also be great if the search returned not just the file name but some context for each match. Is there something like this? If even just the indexes of the PDFs are accessible via an API, it might be possible to use a library like BeautifulSoup to satisfy this need.

TOPICS
Acrobat SDK and JavaScript

Community Expert, Jul 12, 2019

There is no documented API for this.

Community Beginner, Jul 15, 2019

OK. That seems fair. Let's say that a user has a number of PDF files, all of which have had OCR run on them, some with Adobe Acrobat and some with ABBYY FineReader. Is there an accepted way for a Python or C# program to open such a file and read ONLY the OCR text? Is the OCR text stored on a separate layer? I am confident that Acrobat can use the OCR text, but can a standalone program open the file and extract just the text so that it can build indexes?

LEGEND, Jul 15, 2019

Frank, you should look for open source PDF text extraction apps and feed their output into your own search index for the web site (e.g. one built with Apache Lucene).
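To make the "extract the text, then index it yourself" idea concrete, here is a minimal sketch in plain Python. It assumes the text of each PDF has already been extracted by some other tool; the function names `build_index` and `search` and the sample file names are made up for illustration. It is not Lucene and not anything Acrobat provides, just a toy inverted index that returns file names with a little context around each match.

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")

def build_index(texts):
    """Map each lowercased word to the set of file names containing it.

    `texts` is a dict of {file_name: extracted_text}. How the text is
    pulled out of each PDF is outside this sketch.
    """
    index = defaultdict(set)
    for name, text in texts.items():
        for match in WORD.finditer(text):
            index[match.group().lower()].add(name)
    return index

def search(index, texts, query, width=30):
    """Return (file_name, snippet) pairs for files containing every query word."""
    words = [w.lower() for w in WORD.findall(query)]
    if not words:
        return []
    # Intersect the per-word file sets so only files with all words remain.
    names = set.intersection(*(index.get(w, set()) for w in words))
    results = []
    for name in sorted(names):
        text = texts[name]
        # Take the context around the first occurrence of the first query word.
        hit = re.search(r"\b" + re.escape(words[0]) + r"\b", text, re.IGNORECASE)
        if hit is None:
            continue
        start = max(hit.start() - width, 0)
        end = min(hit.end() + width, len(text))
        results.append((name, text[start:end].strip()))
    return results

if __name__ == "__main__":
    # Hypothetical extracted text, standing in for real PDF content.
    sample = {
        "minutes-1952.pdf": "Minutes of the meeting held in Springfield on 3 May.",
        "letter-0017.pdf": "Letter from Jane Doe regarding the harbour lease.",
    }
    idx = build_index(sample)
    for file_name, snippet in search(idx, sample, "springfield"):
        print(file_name, "->", snippet)
```

On a real site the search would run behind the login check and the index would be stored on disk rather than rebuilt in memory, but the overall shape (extract once, index, query, return snippets) stays the same.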

Community Beginner, Jul 15, 2019

Who is Frank? I will check out Lucene. It appears to be written in Java. I like Java, but our web server does not have Java available, so maybe something similar can be found. Thanks!

LEGEND, Jul 15, 2019


OCR text is still normal text. In certain cases it is marked as hidden (do not display), but in all other ways (position, font, etc.) it is regular text. Text extraction apps and libraries exist, but I've never used any.
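To illustrate what "marked as hidden" can look like inside the file: the PDF format has a text rendering mode, set by the `Tr` operator, and mode 3 means the text is drawn invisibly. The sketch below assumes the hidden OCR text in question uses that mode, and that you already have a decompressed page content stream as a string. The function name `split_text_by_visibility` is made up for illustration, and the parser is deliberately tiny: it ignores hex strings, nested parentheses, font encodings, and most escapes, so treat it as a demonstration of the idea rather than a working extractor. For real files, use a proper PDF library.

```python
import re

# Literal strings, array brackets, or any other run of non-delimiter characters.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|[\[\]]|[^\s()\[\]]+")
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")
SHOW_OPERATORS = {"Tj", "TJ", "'", '"'}

def split_text_by_visibility(content):
    """Return (visible, hidden) lists of strings shown in a decoded content stream."""
    visible, hidden = [], []
    mode = 0           # current text rendering mode; 3 means invisible
    saved_modes = []   # modes saved by the q operator, restored by Q
    strings = []       # string operands seen since the last operator
    last_number = None
    for token in TOKEN.findall(content):
        if token.startswith("("):
            # Strip the outer parentheses and undo only the simplest escapes.
            strings.append(re.sub(r"\\([()\\])", r"\1", token[1:-1]))
            continue
        if token in ("[", "]") or token.startswith("/"):
            continue  # array brackets and names are operands, not operators
        if NUMBER.fullmatch(token):
            last_number = float(token)
            continue
        # Anything else is treated as an operator.
        if token == "Tr" and last_number is not None:
            mode = int(last_number)
        elif token == "q":
            saved_modes.append(mode)
        elif token == "Q" and saved_modes:
            mode = saved_modes.pop()
        elif token in SHOW_OPERATORS and strings:
            (hidden if mode == 3 else visible).append("".join(strings))
        strings = []
        last_number = None
    return visible, hidden

if __name__ == "__main__":
    # Hand-written sample stream: a scanned image, invisible OCR text over it,
    # and a visible page label.
    stream = """
    q 200 0 0 300 0 0 cm /Im0 Do Q
    BT /F1 12 Tf 3 Tr 72 700 Td (Springfield Gazette) Tj ET
    BT /F1 12 Tf 0 Tr 72 40 Td [(Page) -250 (1)] TJ ET
    """
    print(split_text_by_visibility(stream))
```

The practical takeaway matches the reply above: a general text extractor reads hidden and visible text through the same operators, so it does not need special access to an "OCR layer" to see the recognized words.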
