PDF Extract API - Two column pdfs

October 3, 2023

Is there a way to extract data from two-column pdfs? I've tried the PDF Extract API, and it doesn't work on PDFs with two colu


Joel Geraci

Community Expert

I've found the exact opposite. Every file I've tried that has multiple columns works very well even if the gutter is sometimes thinner than some of the character spacing in justified columns. Can you supply the PDF in question?

Jerrod326727148pse (Author)

Participant

Hi Joel, thank you so much for letting me know. I've attached the file below. When I use the PDF extraction API, it combines the text from the left column and the right column for each row.

What I'm doing at the end of my code:

for element in data["elements"]:
    # Print the text of every element whose path ends in "/body"
    if element["Path"].endswith("/body"):
        print(element["Text"])

Granted, I have not tried other types of PDFs, but please let me know if there would be anything I can try to modify the code. Thank you so much!

US10899855.pdf

Joel Geraci

Community Expert

OK, I think you'd have to admit this is a uniquely complex layout. It's actually not a two-column layout; it's a three-column layout where the numbers going down the center are column 2 and the paragraphs on the right are column 3. Unfortunately, Extract tries to interpret the layout as two columns and variously folds the middle column into column 1 or column 3, even though the text in column 2 is unrelated to either. If you turn on the character position option, you could detect these extra numbers by their position and exclude them from the results.
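As a rough sketch of that position-based filtering (not an official recipe): the helper below drops every character whose horizontal midpoint falls inside a narrow strip around the page center. It assumes you already have, for each element, its text plus one `[left, bottom, right, top]` box per character. The function name, `center_x`, and `half_width` are all hypothetical, and the exact field that carries the per-character boxes in the Extract output (and whether it lines up one-to-one with the text) is something to check against your own JSON.

```python
from typing import List, Sequence


def strip_center_chars(
    text: str,
    char_bounds: Sequence[Sequence[float]],
    center_x: float,
    half_width: float,
) -> str:
    """Remove characters that sit inside the central strip of the page.

    Assumption: char_bounds[i] is the [left, bottom, right, top] box of
    text[i]. If the two sequences differ in length, zip() stops at the
    shorter one, so verify the alignment on real data first.
    """
    kept: List[str] = []
    for ch, (left, _bottom, right, _top) in zip(text, char_bounds):
        midpoint = (left + right) / 2
        # Keep only characters whose midpoint lies outside the strip.
        if abs(midpoint - center_x) > half_width:
            kept.append(ch)
    return "".join(kept)


# Hypothetical values: a 612 pt wide page with line numbers near x = 306.
sample_text = "ab5cd"
sample_bounds = [
    [100, 700, 106, 710],  # a (left column)
    [106, 700, 112, 710],  # b (left column)
    [300, 700, 306, 710],  # 5 (center line number)
    [400, 700, 406, 710],  # c (right column)
    [406, 700, 412, 710],  # d (right column)
]
cleaned = strip_center_chars(sample_text, sample_bounds, center_x=306, half_width=15)
print(cleaned)
```

You would need to tune the strip width per document; if it is too wide, it will start eating characters from the ends of the real columns.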

All that said, do you mind if I send this to our engineering team to train the AI on? This is a really interesting layout for it to try to understand.
