Participant
October 3, 2023
Question

PDF Extract API - Two-column PDFs

Is there a way to extract data from two-column PDFs? I've tried the PDF Extract API, and it doesn't work on PDFs with two columns.

Joel Geraci
Community Expert
October 3, 2023

I've found the exact opposite. Every file I've tried that has multiple columns works very well even if the gutter is sometimes thinner than some of the character spacing in justified columns. Can you supply the PDF in question?

Participant
October 3, 2023

Hi Joel, thank you so much for letting me know. I've attached the file below. When I use the PDF extraction API, it combines the text from the left column and the right column for each row. 

What I'm doing at the end of my code:

for element in data["elements"]:
    # Print only the elements whose path ends in "/body"
    if element["Path"].endswith("/body"):
        print(element["Text"])

 

Granted, I have not tried other types of PDFs, but please let me know if there is anything I can try changing in the code. Thank you so much!

Joel Geraci
Community Expert
October 3, 2023

Ok - I think you'd have to admit this is a uniquely complex layout. It's actually not a two-column layout, it's a three-column layout where the numbers going down the center are column 2 and the paragraphs on the right are column 3. Unfortunately, Extract tries to interpret the layout as two-column and variously folds the middle column into column 1 or column 3, even though the text in column 2 is unrelated to either of them. If you were to turn on the character position option, you could detect these extra numbers by their position and exclude them from the results.
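
A minimal sketch of that position-based filtering, under several assumptions that should be checked against your own output: that each text element carries a "CharBounds" list with one [left, bottom, right, top] box per character of "Text" once character positions are enabled, and that the center column sits inside a known horizontal band. The function name, the sample element, and the band values (290 to 320) are hypothetical and only illustrate the idea.

```python
def strip_chars_in_band(element, x_min, x_max):
    """Return the element's text without characters centered in [x_min, x_max]."""
    text = element.get("Text", "")
    char_bounds = element.get("CharBounds")
    # Assumption: one bounding box per character. If that does not hold,
    # leave the text untouched rather than guess at the alignment.
    if not char_bounds or len(char_bounds) != len(text):
        return text
    kept = []
    for ch, (left, _bottom, right, _top) in zip(text, char_bounds):
        center = (left + right) / 2
        # Keep only characters whose horizontal center is outside the band.
        if not (x_min <= center <= x_max):
            kept.append(ch)
    return "".join(kept)


# Hypothetical element: "ab" sits in the left column, "12" in the center band.
sample = {
    "Path": "//Document/P",
    "Text": "ab12",
    "CharBounds": [
        [50, 700, 56, 710],
        [56, 700, 62, 710],
        [300, 700, 306, 710],
        [306, 700, 312, 710],
    ],
}
print(strip_chars_in_band(sample, x_min=290, x_max=320))
```

In practice you would measure the band from the page itself (for example, from where the center numbers fall on a few sample lines) and apply the function to each element before printing its text.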

 

All that said, do you mind if I send this to our engineering team to train the AI on? This is a really interesting layout for it to try to understand.