Hi,
I am working on a flow in which we upload PDFs to S3 and then trigger a Lambda function that uses the Extract API to pull the tables out of each PDF.
It works well, but my request is: could the SDK be changed so that it returns Polars/Pandas DataFrames instead of a ZIP of CSVs?
Or at least give us some control over it? For example, by accepting pull requests?
I would really love this, because right now I need to run the extraction, unzip the result, read all the CSVs into DataFrames, and only then do the processing.
The whole pipeline feels fragile to me.
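In the meantime, the unzip-and-read step can at least be done entirely in memory instead of via temporary files. A rough sketch, assuming the result ZIP is already available as bytes (the `.csv` filter is an assumption about the archive layout; adjust it to whatever your Extract results actually contain):

```python
import io
import zipfile

import pandas as pd

def zip_to_dataframes(zip_bytes: bytes) -> dict:
    """Read every CSV inside an in-memory ZIP archive into a pandas DataFrame,
    keyed by its path inside the archive. No temporary files are written."""
    frames = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".csv"):
                with zf.open(name) as f:
                    frames[name] = pd.read_csv(f)
    return frames
```

With Polars the same shape works by swapping `pd.read_csv` for `pl.read_csv`; the unzipping part stays identical.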
Another question: why does it take about 18-30 seconds for the Adobe API to send us (in the UK) a response? We would appreciate any way to lower the processing time.
Part of it could be network. Have you tried switching regions? https://developer.adobe.com/document-services/docs/overview/pdf-services-api/howtos/service-region-c...
I changed it to EU.
We now run it in AWS Lambda and I get 43 seconds of processing time, which is very weird.
That seems rather high - for me it's normally < 10 seconds.
@Raymond Camden we use the Python SDK.
Yes, it's very weird. Any idea why? We just use AWS Lambda (3 GB memory) and the Python SDK, and none of our own code runs: we simply hit your API and wait for the response, and that takes 42-43 seconds.
@Raymond Camden now we get 18-22 seconds, so it fluctuates a lot...
@Raymond Camden another thing we tested:
- From my local machine it takes 17-20 seconds. The same multi-page PDF takes 38-43 seconds on a colleague's machine, and ~40 seconds on Lambda.
We only time the execute() call, so almost nothing is done locally.
We also found that there is not much difference between the US and EU regions, which is weird: Ireland is about 200 miles from us, while the US is 3,000+.
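For anyone reproducing these measurements, timing just the execute() call can be isolated with a small helper; the `extract_pdf_operation` and `execution_context` names in the comment are hypothetical stand-ins for whatever your SDK setup uses:

```python
import time

def timed_call(fn, *args, **kwargs):
    """Call fn with the given arguments and return (result, elapsed_seconds),
    measured with a monotonic wall clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Against the SDK this would look roughly like (names are assumptions):
# result, elapsed = timed_call(extract_pdf_operation.execute, execution_context)
# print(f"Extract round trip: {elapsed:.1f}s")
```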
No, this is not possible. In my opinion, we have to settle on a few output options that are flexible enough to cover the most use cases, but we won't ever be able to cover every use case.
As for 'fragile', I'm not sure what you mean. After you get the result from Extract, the code that processes it is... well, your code. Build it rock solid and it won't be fragile. 😉
Hi. Okay, thank you for the quick response!
My last question, @Raymond Camden:
Can I use the zipped response from the API directly in memory, without writing it to a file? I tried write_to_stream but couldn't get it to work.
The only way I found is to modify download_and_save_file() so that it does not create a local file from the FileRef, but instead unzips the stream directly, opens the ZIP in memory, and uses the JSON.
Not knowing which SDK you are using: in Node there is a writeToStream option, although temporary file storage _is_ still used. If you absolutely need to avoid that, you would have to switch to the REST APIs, which are relatively easy to use.
I use the Python SDK. I managed to do it with write_to_stream and then inherited from the PDFExtraction class to create a new object that doesn't write the ZIP file to storage.
Then I take the byte stream, unzip it, and keep the JSON directly in memory, and it works fine on AWS Lambda.
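The in-memory step looks roughly like this (a sketch, assuming the result ZIP has already been written into a BytesIO buffer, e.g. via the write_to_stream call mentioned above, and that the archive contains a structuredData.json at its root):

```python
import io
import json
import zipfile

def json_from_zip_stream(stream: io.BytesIO) -> dict:
    """Pull structuredData.json out of an Extract result ZIP held in a
    BytesIO buffer, without ever touching the filesystem."""
    stream.seek(0)  # rewind in case the stream was just written to
    with zipfile.ZipFile(stream) as zf:
        with zf.open("structuredData.json") as f:
            return json.load(f)
```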
Thanks a lot!
BTW: Can we send you (via email) some anonymized example PDFs where the Sensei AI makes mistakes?