Hi,
I am working on a flow in which we upload PDFs to S3 and then trigger a Lambda function that uses the Extract API to pull the tables out of each PDF.
It works well, but my request is: could the SDK be changed so that it returns Polars/Pandas DataFrames instead of a ZIP of CSVs?
Or at least give us some control over it? For example, by accepting pull requests?
I would really love this, because right now I need to run the extraction, unzip the result, read all the CSVs into DataFrames, and only then do the processing.
The whole pipeline feels fragile to me.
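In the meantime, the unzip-and-read step can at least be done entirely in memory instead of via temporary files. A rough sketch, assuming the result ZIP is already available as bytes (the `.csv` filter is an assumption about the archive layout; adjust it to whatever your Extract results actually contain):

```python
import io
import zipfile

import pandas as pd

def zip_to_dataframes(zip_bytes: bytes) -> dict:
    """Read every CSV inside an in-memory ZIP archive into a pandas DataFrame,
    keyed by its path inside the archive. No temporary files are written."""
    frames = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".csv"):
                with zf.open(name) as f:
                    frames[name] = pd.read_csv(f)
    return frames
```

With Polars the same shape works by swapping `pd.read_csv` for `pl.read_csv`; the unzipping part stays identical.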
Another question: why does it take about 18-30 seconds for the Adobe API to send us (in the UK) a response? We would appreciate any way to lower the processing time.
Part of it could be network. Have you tried switching regions? https://developer.adobe.com/document-services/docs/overview/pdf-services-api/howtos/service-region-c...
I changed it to EU.
We now run it in AWS Lambda and I get 43 seconds of processing time, which is very weird.
That seems rather high - for me it's normally < 10 seconds.
@Raymond Camden we use the Python SDK.
Yes, it's very weird. Any idea why? We just use AWS Lambda (3 GB memory) and the Python SDK, and none of our own code runs: we simply hit your API and wait for the response, and that takes 42-43 seconds.
@Raymond Camden now we get 18-22 seconds, so it fluctuates a lot...
@Raymond Camden another thing we tested:
- From my local machine it takes 17-20 seconds. The same multi-page PDF takes 38-43 seconds on a colleague's machine, and ~40 seconds on Lambda.
We only time the execute() call, so almost nothing is done locally.
We also found that there is not much difference between the US and EU regions, which is weird: Ireland is about 200 miles from us, while the US is 3,000+.
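For anyone reproducing these measurements, timing just the execute() call can be isolated with a small helper; the `extract_pdf_operation` and `execution_context` names in the comment are hypothetical stand-ins for whatever your SDK setup uses:

```python
import time

def timed_call(fn, *args, **kwargs):
    """Call fn with the given arguments and return (result, elapsed_seconds),
    measured with a monotonic wall clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Against the SDK this would look roughly like (names are assumptions):
# result, elapsed = timed_call(extract_pdf_operation.execute, execution_context)
# print(f"Extract round trip: {elapsed:.1f}s")
```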
No, this is not possible. In my opinion, we have to settle on a few output options that are flexible enough to cover the most use cases, but we won't ever be able to cover every use case.
As for 'fragile', I'm not sure what you mean. After you get the result from Extract, the code that processes it is... well, your code. Build it rock solid and it won't be fragile. 😉
Hi. Okay, thank you for the quick response!
My last question, @Raymond Camden:
Can I use the zipped response from the API directly in memory, without writing it to a file? I tried write_to_stream but couldn't get it to work.
The only way I found is to modify download_and_save_file() so that it does not create a local file from the FileRef, but instead unzips the stream directly, opens the ZIP in memory, and uses the JSON.
Not knowing which SDK you are using: in Node there is a writeToStream option, although temporary file storage _is_ still used. If you absolutely need to avoid that, you would have to switch to the REST APIs, which are relatively easy to use.
I use the Python SDK. I managed to do it with write_to_stream and then inherited from the PDFExtraction class to create a new object that doesn't write the ZIP file to storage.
Then I take the byte stream, unzip it, and keep the JSON directly in memory, and it works fine on AWS Lambda.
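The in-memory step looks roughly like this (a sketch, assuming the result ZIP has already been written into a BytesIO buffer, e.g. via the write_to_stream call mentioned above, and that the archive contains a structuredData.json at its root):

```python
import io
import json
import zipfile

def json_from_zip_stream(stream: io.BytesIO) -> dict:
    """Pull structuredData.json out of an Extract result ZIP held in a
    BytesIO buffer, without ever touching the filesystem."""
    stream.seek(0)  # rewind in case the stream was just written to
    with zipfile.ZipFile(stream) as zf:
        with zf.open("structuredData.json") as f:
            return json.load(f)
```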
Thanks a lot!
BTW: Can we send you (via email) some anonymized example PDFs where the Sensei AI makes mistakes?