
PDF Extract API - Two column pdfs

Community Beginner, Oct 02, 2023

Is there a way to extract data from two-column pdfs? I've tried the PDF Extract API, and it doesn't work on PDFs with two colu

Topics: PDF Extract API, Python SDK
Community Expert, Oct 03, 2023

I've found the exact opposite. Every multi-column file I've tried works very well, even when the gutter is thinner than some of the character spacing in justified columns. Can you supply the PDF in question?

Community Beginner, Oct 03, 2023

Hi Joel, thank you so much for letting me know. I've attached the file below. When I use the PDF extraction API, it combines the text from the left column and the right column for each row. 

What I'm doing at the end of my code:

for element in data["elements"]:
    # Print only body elements; skip elements that have no "Text" key
    if element["Path"].endswith("/body") and "Text" in element:
        print(element["Text"])

Granted, I have not tried other types of PDFs, but please let me know if there is anything I can change in the code. Thank you so much!

Community Expert, Oct 03, 2023

OK, I think you'd have to admit this is a uniquely complex layout. It's actually not a two-column layout; it's a three-column layout in which the numbers running down the center are column 2 and the paragraphs on the right are column 3. Unfortunately, Extract tries to interpret the layout as two columns and variously folds the middle column into column 1 or column 3, even though the text in column 2 is unrelated to either. If you were to turn on the character position option, you could detect these extra numbers by their position and exclude them from the results.

All that said, do you mind if I send this to our engineering team to train the AI on? This is a really interesting layout for it to try to understand.

 

2023-10-03_08-44-26.png
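
A minimal sketch of the position-based filtering suggested above, not a definitive implementation. It assumes that, with the character position option on, each element in the Extract JSON carries a "CharBounds" list holding one [left, bottom, right, top] box per character of "Text"; check that against your own output. The helper name strip_band and the band limits x_min and x_max are hypothetical, and the limits have to be measured from your own document.

```python
def strip_band(element, x_min, x_max):
    """Drop characters whose horizontal center lies inside [x_min, x_max].

    Assumes element["CharBounds"] holds one [left, bottom, right, top] box
    per character of element["Text"] (an assumption about the JSON layout).
    """
    text = element.get("Text", "")
    boxes = element.get("CharBounds")
    # If the boxes are missing or do not line up with the text, change nothing.
    if not boxes or len(boxes) != len(text):
        return text
    kept = [
        ch
        for ch, (left, _bottom, right, _top) in zip(text, boxes)
        if not (x_min <= (left + right) / 2 <= x_max)
    ]
    return "".join(kept)


# Made-up sample: "12" sits in the middle band between x=290 and x=320.
sample = {
    "Text": "ab12cd",
    "CharBounds": [
        [100, 0, 105, 10], [105, 0, 110, 10],
        [300, 0, 305, 10], [305, 0, 310, 10],
        [400, 0, 405, 10], [405, 0, 410, 10],
    ],
}
print(strip_band(sample, 290, 320))
```

Filtering per character rather than per element matters here because the stray numbers end up merged into the text of the neighboring columns, so there is no separate element to drop.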

Community Beginner, Oct 03, 2023

Thank you for the analysis, and yes, please feel free to share the file with the engineering team. I'd really appreciate hearing back if there are any updates! 

Community Beginner, Oct 18, 2023

Hi Joel,

Going off on a tangent from the original question: is there a way to search a PDF document that Adobe has split into elements (like the screenshot you posted above)? Please let me know if there's an API suitable for finding a sentence in a PDF document.

Adobe Employee, Oct 19, 2023

Once you have the text from a PDF, you could simply feed it into a search engine, something like Algolia, for example. We don't have an API for search, as it wouldn't really make sense.

Community Beginner, Oct 19, 2023

Thank you for the advice! Rather than a raw text search, though, is there a way to look up a section of a document by the section number given in its path (e.g. //Document/Sect/P[8])?

Adobe Employee, Oct 20, 2023

Well... sure. Anything's possible. 😉 I mean, we give you the data, right? You could parse the JSON and search against that. So yes, you could build this. No, we don't have an API for it. But yes, you could do this.
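
As a rough sketch of "parse the JSON and search against that": the snippet below looks elements up by their exact path and also does a simple case-insensitive phrase search. It assumes the Extract output is a dict with an "elements" list whose entries have "Path", "Page", and (usually) "Text" keys, as in the loop earlier in the thread. The function names find_by_path and find_text are hypothetical helpers, not part of any Adobe SDK, and the sample data is made up.

```python
def normalize(s):
    """Lowercase and collapse runs of whitespace so line breaks don't block a match."""
    return " ".join(s.lower().split())


def find_by_path(elements, path):
    """Return every element whose "Path" equals the given path exactly."""
    return [e for e in elements if e.get("Path") == path]


def find_text(elements, phrase):
    """Return (path, page) for every element whose text contains the phrase."""
    needle = normalize(phrase)
    return [
        (e.get("Path"), e.get("Page"))
        for e in elements
        if needle in normalize(e.get("Text", ""))
    ]


# Made-up sample standing in for json.load(...) on the Extract result.
data = {
    "elements": [
        {"Path": "//Document/Sect/P[7]", "Page": 1, "Text": "Definitions apply throughout."},
        {"Path": "//Document/Sect/P[8]", "Page": 2, "Text": "This Agreement is subject to the\ngoverning law."},
        {"Path": "//Document/Sect/Figure", "Page": 2},  # no "Text" key
    ]
}

for element in find_by_path(data["elements"], "//Document/Sect/P[8]"):
    print(element["Text"])
print(find_text(data["elements"], "governing  LAW"))
```

A substring match like this only finds sentences that sit inside a single element; a sentence split across two elements would need the neighboring texts joined first.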
