
Is there a way to easily change the SDK such that the table can be returned in polars/pandas format?

Explorer, Jan 22, 2024

Hi,

I am working on a flow in which we upload PDFs to S3 and then trigger a Lambda function that uses the Extract API to extract the tables from each PDF.

It works well, but my request is: can the SDK be changed so that it returns polars/pandas dataframes instead of a ZIP of CSVs?

Or could you at least give us some control over it, for example by letting us submit pull requests?

I would love this possibility, because right now I need to run the extraction, unzip the result, read all the CSVs into dataframes, and only then do the processing.

The whole thing seems fragile to me.

TOPICS
Feature request , PDF Extract API , PDF Services API , Python SDK

Views

997

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Explorer, Jan 22, 2024

Another question: why does it take about 18-30 seconds for the Adobe API to send a response to us (in the UK)? We would appreciate it if the processing time could be lowered.

Adobe Employee, Jan 22, 2024

Explorer, Jan 23, 2024

I changed it to EU.

We now run it in AWS Lambda and I get a processing time of 43 seconds, which is very weird.

Adobe Employee, Jan 23, 2024

That seems rather high. For me it's normally < 10 seconds.

Explorer, Jan 23, 2024

@Raymond Camden we use the Python SDK.

Yes, it's very weird. Any idea why? We just use AWS Lambda (3 GB) and the Python SDK, and we do not run any of our own code. We just hit your API and wait for the response, and it takes 42-43 seconds.

Explorer, Jan 23, 2024

@Raymond Camden now we get 18-22 seconds. So it fluctuates a lot ...

Explorer, Jan 23, 2024

@Raymond Camden another thing we tested:

- From my local machine it takes 17-20 seconds. The same multi-page PDF takes 38-43 seconds on a colleague's local machine. On Lambda the same file takes ~40 seconds.

We only time the execute() function, so almost nothing is done locally.

We also found that there is not much of a difference between US and EU, which is weird: Ireland is about 200 miles from us and the US is 3000+.
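
When the timed call waits on a remote service, a single measurement can be dominated by network or service-side variance. Repeating the call and looking at the spread makes the numbers above easier to compare across machines and regions. The following is a minimal sketch of such a harness; `time_call`, `summarize`, and `fake_extract` are hypothetical names, and `fake_extract` only stands in for the real blocking SDK call, which is not shown here.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Call fn() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to min / median / max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


def fake_extract():
    """Hypothetical stand-in for the real, blocking SDK call."""
    time.sleep(0.01)


if __name__ == "__main__":
    print(summarize(time_call(fake_extract, repeats=3)))
```

Running the same harness with the same PDF from each environment (local machine, Lambda, each region) gives a like-for-like comparison of the median rather than of one-off readings.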

Adobe Employee, Jan 22, 2024

No, this is not possible. In my opinion, we have to settle on a few output options that are flexible enough to cover the most use cases, but we won't ever be able to cover every use case.

As for 'fragile', I'm not sure what you mean. After you get the result from Extract, your code to process the results is... well, your code. Build it rock solid and it won't be fragile. 😉

Explorer, Jan 23, 2024

Hi. Okay, thank you for the quick response!

@Raymond Camden, my last question:

Can I use the zipped response from the API directly in memory, without writing it to a file? I tried write_to_stream but could not get it to work.

The only way I found is to modify download_and_save_file() so that it does not create a local file from the FileRef, but instead unzips the stream directly, opens the ZIP in memory, and uses the JSON.

Adobe Employee, Jan 23, 2024

I don't know which SDK you are using, but in Node there is a writeToStream option. Temporary file storage _is_ used, though. If you absolutely need to avoid that, you will need to switch to the REST APIs, which are relatively easy to use.

Explorer, Jan 23, 2024

I use the Python SDK. I managed to do it with write_to_stream, by inheriting from the PDFExtraction class to create a new object that doesn't write the ZIP file to storage.

Then I get the byte stream, unzip it, and load the JSON directly in memory. It works fine on AWS Lambda.


Thanks a lot!

BTW: can we send you (via email) some examples of anonymized PDFs where the Sensei AI makes mistakes?
