Is there a way to extract data from two-column pdfs? I've tried the PDF Extract API, and it doesn't work on PDFs with two colu
I've found the exact opposite. Every file I've tried that has multiple columns works very well even if the gutter is sometimes thinner than some of the character spacing in justified columns. Can you supply the PDF in question?
Hi Joel, thank you so much for letting me know. I've attached the file below. When I use the PDF extraction API, it combines the text from the left column and the right column for each row.
What I'm doing at the end of my code:
for element in data["elements"]:
    if element["Path"].endswith("/body"):
        print(element["Text"])
Granted, I have not tried other types of PDFs, but please let me know if there's anything I can change in the code. Thank you so much!
OK, I think you'd have to admit this is a uniquely complex layout. It's actually not a two-column layout; it's a three-column layout where the numbers running down the center are column 2 and the paragraphs on the right are column 3. Unfortunately, Extract tries to interpret the layout as two columns and variously merges the middle column into column 1 or column 3, even though the text in column 2 is unrelated to either. If you were to turn on the character position option, you could detect these extra numbers by their position and exclude them from the results.
All that said, do you mind if I send this to our engineering team to train the AI on? This is a really interesting layout for it to try to understand.
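For anyone wanting to try the position-based filtering, here is a rough Python sketch. It assumes each element in the Extract JSON carries a "Bounds" list in the order [left, bottom, right, top]; check that against your own output before relying on it. The function name split_columns and the gutter_band coordinates are made up for illustration, and you would need to measure the real band from your PDF.

```python
def split_columns(elements, gutter_band):
    """Sort text elements into left and right columns by horizontal position.

    Elements whose horizontal center falls inside gutter_band are dropped,
    which is how the center-column numbers get excluded.
    Assumes el["Bounds"] is [left, bottom, right, top] (an assumption).
    """
    band_left, band_right = gutter_band
    left, right = [], []
    for el in elements:
        bounds = el.get("Bounds")
        text = el.get("Text")
        if not bounds or text is None:
            continue  # skip elements without text or position info
        center_x = (bounds[0] + bounds[2]) / 2
        if band_left <= center_x <= band_right:
            continue  # inside the center band: treat as a stray number
        if center_x < band_left:
            left.append(text)
        else:
            right.append(text)
    return left, right


# Hypothetical sample data shaped like the "elements" array.
elements = [
    {"Path": "//Document/P", "Text": "Left paragraph", "Bounds": [72, 600, 280, 700]},
    {"Path": "//Document/P[2]", "Text": "12", "Bounds": [296, 650, 316, 662]},
    {"Path": "//Document/P[3]", "Text": "Right paragraph", "Bounds": [332, 600, 540, 700]},
]

left, right = split_columns(elements, gutter_band=(290, 322))
print(left)
print(right)
```

This only helps when an element sits entirely in one column. If Extract has already merged a number into a paragraph element, you would need the per-character positions instead, and the sketch does not handle multi-page documents separately.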
Thank you for the analysis, and yes, please feel free to share the file with the engineering team. I'd really appreciate hearing back if there are any updates!
Hi Joel,
Going off on a tangent from the original question: is there a way to search a PDF document that Adobe has split into elements (like the screenshot you posted above)? Please let me know if there's an API suitable for searching for a sentence in a PDF document.
Once you have the text from a PDF, you could simply feed it into a search engine, something like Algolia for example. We don't have an API for search, as it wouldn't really make sense.
Thank you for the advice!! But instead of a raw text search, I'd appreciate hearing from you if there's a way to search a section of a document based on the section number specified in the section's path (e.g. //Document/Sect/P[8]).
Well... sure. Anything's possible. 😉 I mean, we give you the data, right? You could parse the JSON and search against that. So yes, you could build this. No, we don't have an API for it. But yes, you could do this.
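A minimal sketch of what "parse the JSON and search against that" could look like in Python. It assumes the same "elements" array with "Path" and "Text" keys used in the snippet earlier in the thread; find_by_path and search_text are hypothetical helper names, not part of any Adobe API, and the sample JSON is invented.

```python
import json


def find_by_path(data, path):
    """Return every element whose Path matches exactly."""
    return [el for el in data.get("elements", []) if el.get("Path") == path]


def search_text(data, needle, path_prefix=""):
    """Case-insensitive substring search, optionally limited to a path prefix."""
    needle = needle.lower()
    return [
        (el["Path"], el["Text"])
        for el in data.get("elements", [])
        if "Text" in el
        and el.get("Path", "").startswith(path_prefix)
        and needle in el["Text"].lower()
    ]


# Invented sample standing in for the extracted JSON file.
data = json.loads("""
{
  "elements": [
    {"Path": "//Document/H1", "Text": "Annual Report"},
    {"Path": "//Document/Sect/P[7]", "Text": "Revenue grew during the year."},
    {"Path": "//Document/Sect/P[8]", "Text": "Operating costs were flat."}
  ]
}
""")

section = find_by_path(data, "//Document/Sect/P[8]")
hits = search_text(data, "revenue", path_prefix="//Document/Sect")
print(section[0]["Text"])
print(hits)
```

Exact path matching covers the "give me P[8]" case; the prefix filter is one way to restrict a sentence search to a single section.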