
PDF Extract API - Two column pdfs

Community Beginner, Oct 02, 2023

Is there a way to extract data from two-column PDFs? I've tried the PDF Extract API, and it doesn't work on PDFs with two columns.

TOPICS
PDF Extract API, Python SDK

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
Community Expert, Oct 03, 2023

I've found the exact opposite. Every file I've tried that has multiple columns works very well even if the gutter is sometimes thinner than some of the character spacing in justified columns. Can you supply the PDF in question?

Community Beginner, Oct 03, 2023

Hi Joel, thank you so much for letting me know. I've attached the file below. When I use the PDF extraction API, it combines the text from the left column and the right column for each row. 

What I'm doing at the end of my code:

for element in data["elements"]:
    # Print the text of every element whose path ends in "/body"
    if element["Path"].endswith("/body"):
        print(element["Text"])

 

Granted, I have not tried other types of PDFs, but please let me know if there would be anything I can try to modify the code. Thank you so much!

Community Expert, Oct 03, 2023

OK, I think you'd have to admit this is a uniquely complex layout. It's actually not a two-column layout; it's a three-column layout where the numbers going down the center are column 2 and the paragraphs on the right are column 3. Unfortunately, Extract tries to interpret the layout as two-column and variously merges the middle column into column 1 or column 3, even though the text in column 2 is unrelated to either. If you were to turn on the character position option, you could detect these extra numbers by their position and exclude them from the results.
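
A minimal sketch of the position-based filtering described above, not a tested recipe for this particular file. It assumes each element in the Extract JSON carries a `Bounds` list of `[left, bottom, right, top]` in page coordinates; the function name and the `x_min`/`x_max` band for the middle column are hypothetical and would have to be measured from the actual page. The same idea should carry over to per-character positions if that option is enabled.

```python
def drop_center_column(elements, x_min, x_max):
    """Return elements whose horizontal center lies outside [x_min, x_max].

    Assumes each element may carry "Bounds" as [left, bottom, right, top].
    Elements without usable bounds are kept unchanged.
    """
    kept = []
    for element in elements:
        bounds = element.get("Bounds")
        if bounds and len(bounds) == 4:
            center_x = (bounds[0] + bounds[2]) / 2
            if x_min <= center_x <= x_max:
                continue  # falls inside the middle-column band, skip it
        kept.append(element)
    return kept


# Hypothetical data: a left paragraph, a center line number, a right paragraph.
sample = [
    {"Path": "//Document/P", "Text": "Left text", "Bounds": [72, 700, 280, 712]},
    {"Path": "//Document/P[2]", "Text": "15", "Bounds": [300, 700, 312, 712]},
    {"Path": "//Document/P[3]", "Text": "Right text", "Bounds": [330, 700, 540, 712]},
]
for element in drop_center_column(sample, x_min=295, x_max=320):
    print(element["Text"])
```

Filtering on the horizontal center rather than the left edge is a design choice here: it tolerates small shifts in where each number starts.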

 

All that said, do you mind if I send this to our engineering team to train the AI on? This is a really interesting layout for it to try to understand.

 

2023-10-03_08-44-26.png

Community Beginner, Oct 03, 2023

Thank you for the analysis, and yes, please feel free to share the file with the engineering team. I'd really appreciate hearing back if there are any updates! 

Community Beginner, Oct 18, 2023

Hi Joel,

 

Going off on a tangent from the original question: is there a way to search a PDF document that Adobe has split into elements (like the screenshot you posted above)? Please let me know if there's an API suitable for searching for a sentence in a PDF document.

Adobe Employee, Oct 19, 2023

Once you have the text from a PDF, you could simply feed it into a search engine, something like Algolia for example. We don't have an API for search, as it wouldn't really make sense.
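
As a rough sketch of that first step, the snippet below flattens Extract output into one record per text element, which is the kind of shape a search engine could index. It assumes the JSON has an `elements` list whose items may carry `Path`, `Page`, and `Text`; the record fields, the helper names `build_records` and `find_sentence`, and the sample data are all hypothetical. The naive substring match only stands in for a real search engine.

```python
import json


def build_records(extract_json):
    """Flatten Extract output into one record per element that has text."""
    records = []
    for i, element in enumerate(extract_json.get("elements", [])):
        text = element.get("Text")
        if not text:
            continue  # elements such as figures may have no Text
        records.append({
            "id": i,  # position of the element in the original list
            "path": element.get("Path", ""),
            "page": element.get("Page"),
            "text": text.strip(),
        })
    return records


def find_sentence(records, sentence):
    """Case-insensitive, whitespace-normalized substring match."""
    needle = " ".join(sentence.lower().split())
    return [r for r in records if needle in " ".join(r["text"].lower().split())]


# Hypothetical sample standing in for the JSON produced by Extract.
sample = json.loads("""
{"elements": [
  {"Path": "//Document/H1", "Page": 0, "Text": "Lease Agreement "},
  {"Path": "//Document/Figure", "Page": 0},
  {"Path": "//Document/P", "Page": 0, "Text": "The tenant shall pay rent monthly. "}
]}
""")
records = build_records(sample)
for hit in find_sentence(records, "shall pay  rent"):
    print(hit["path"], "->", hit["text"])
```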

Community Beginner, Oct 19, 2023

Thank you for the advice! Rather than a raw text search, though, I'd like to know whether there's a way to look up a section of a document by the section number given in the section's path (e.g. //Document/Sect/P[8]).

Adobe Employee, Oct 20, 2023

Well... sure. Anything's possible. 😉 I mean, we give you the data, right? You could parse the JSON and search against that. So yes, you could build this. No, we don't have an API for it. But yes, you could do this.
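
For illustration only, here is one way such a do-it-yourself lookup might look. It assumes each element of the parsed JSON has a `Path` string and, for text elements, a `Text` string; the helper names `index_by_path` and `text_under` and the sample data are hypothetical and not part of any Adobe API.

```python
def index_by_path(elements):
    """Map each element's Path to its Text (empty string if it has none)."""
    return {e["Path"]: e.get("Text", "") for e in elements if "Path" in e}


def text_under(elements, prefix):
    """Collect text from the element at `prefix` and anything nested below it."""
    return [
        e["Text"]
        for e in elements
        if "Text" in e
        and (e.get("Path") == prefix or e.get("Path", "").startswith(prefix + "/"))
    ]


# Hypothetical elements, shaped like data["elements"] from the parsed JSON.
sample = [
    {"Path": "//Document/Sect/H1", "Text": "1. Scope"},
    {"Path": "//Document/Sect/P[8]", "Text": "Eighth paragraph of section one."},
    {"Path": "//Document/Sect[2]/P", "Text": "First paragraph of section two."},
]

by_path = index_by_path(sample)
print(by_path.get("//Document/Sect/P[8]"))     # exact lookup of one element
print(text_under(sample, "//Document/Sect"))   # everything inside the first Sect
```

Matching on `prefix + "/"` rather than on the bare prefix keeps `//Document/Sect` from also matching `//Document/Sect[2]`.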
