I am using the PDF Extract API in one of my applications. The PDF files range from 300 to 1,000 pages each. The documentation states that non-scanned PDFs are limited to 200 pages and scanned PDFs must be 100 pages or less, and that limits may be lower for files with a large number of tables.
To work around these limits, a PDF file has to be split into multiple chunks, but how can I determine the optimal number of pages per chunk so that I can automate the process?
Questions?
What is the max no. of pages that I can hit the API with a PDF which contains only
Thank you!
If you don't know ahead of time whether the file is image-only, I'd limit the number of pages submitted to 100, but also pay attention to what is returned, in case the tables are too long and the call fails.
That said, you can use the new PDF Properties API to detect whether the file contains only images: if it does, send 100 pages at a time; if not, send 200.
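The chunking rule described above can be sketched in plain Python, with no SDK calls. This is only an illustration: the `is_scanned` flag is assumed to come from whatever check you run beforehand (for example, the PDF Properties API result), and the name `page_chunks` and the 1-based inclusive ranges are hypothetical choices, not part of the Adobe SDK.

```python
from typing import List, Tuple

# Limits quoted from the documentation, as cited in the question above.
MAX_PAGES_SCANNED = 100
MAX_PAGES_NON_SCANNED = 200


def page_chunks(total_pages: int, is_scanned: bool) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) page ranges that fit the limit."""
    if total_pages < 1:
        return []
    size = MAX_PAGES_SCANNED if is_scanned else MAX_PAGES_NON_SCANNED
    return [
        (start, min(start + size - 1, total_pages))
        for start in range(1, total_pages + 1, size)
    ]
```

Each returned range can then be passed to whatever tool you use to split the PDF. Since limits may be lower for table-heavy files, a smaller chunk size may still be needed in practice.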
Thanks for responding. I split my file into multiple chunks of 50 pages each, but one chunk contains 24 pages of scanned tables. For that chunk I get the following error:
ERROR:root:Exception encountered while executing operation
Traceback (most recent call last):
  File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\pdfops\extract_pdf_operation.py", line 134, in execute
    ExtractPDFAPI.download_and_save(location=location, context=execution_context, file_location=file_location)
  File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\internal\service\extract_pdf_api.py", line 48, in download_and_save
    response = CPFApi.cpf_status_api(location, context)
  File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\internal\api\cpf_api.py", line 92, in cpf_status_api
    timeout=10 * 60
  File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\polling2.py", line 191, in poll
    val = target(*args, **kwargs)
  File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\internal\api\cpf_api.py", line 89, in <lambda>
    error_response_handler=CPFApi.handle_error_response),
  File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\internal\http\http_client.py", line 43, in process_request
    error_response_handler, not http_request.authenticator) and http_request.retryable:
  File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\internal\http\http_client.py", line 110, in _handle_response_and_retry
    report_error_code=report_error_code)
adobe.pdfservices.operation.internal.exceptions.OperationException

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\***\***\extract_txt_from_pdf.py", line 60, in extractAdobePdfTextApi
    result: FileRef = extract_pdf_operation.execute(execution_context)
  File "C:\Users\avemula\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\pdfops\extract_pdf_operation.py", line 139, in execute
    request_tracking_id=oex.request_tracking_id, status_code=oex.status_code)
adobe.pdfservices.operation.exception.exceptions.ServiceApiException: description =Unable to process the message even after retries; requestTrackingId=bsE8PLwSoGgjQMVLJCxLZS2u2BzdWfj5; statusCode=500; errorCode=UNKNOWN
Can you please let me know how I can handle this exception? Thank you!
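One possible way to handle a failure like this on the client side, sketched under assumptions: catch the exception for a chunk, split that chunk in half, and retry each half. Here `extract` is a hypothetical callable that you would write yourself to run the Extract operation on a page range and raise on failure; `extract_with_fallback` and `min_pages` are likewise illustrative names, not SDK features. Note that a `statusCode=500` may also be a server-side problem that smaller chunks will not fix.

```python
from typing import Callable, List, Tuple

PageRange = Tuple[int, int]  # 1-based, inclusive (start, end)


def extract_with_fallback(
    page_range: PageRange,
    extract: Callable[[PageRange], object],
    min_pages: int = 1,
) -> List[Tuple[PageRange, object]]:
    """Try a page range; on failure, halve it and retry each half.

    `extract` is a hypothetical user-supplied callable that runs the
    extraction for the given pages and raises an exception on failure.
    """
    start, end = page_range
    try:
        return [(page_range, extract(page_range))]
    except Exception:
        # In real code, catch the SDK's ServiceApiException instead of
        # the broad Exception used in this sketch.
        n_pages = end - start + 1
        if n_pages <= min_pages:
            raise  # cannot split further; surface the original error
        mid = start + n_pages // 2 - 1
        left = extract_with_fallback((start, mid), extract, min_pages)
        right = extract_with_fallback((mid + 1, end), extract, min_pages)
        return left + right
```

The result is a list of (page range, output) pairs in page order, so the pieces can be merged afterwards. If even the smallest allowed range fails, the original exception propagates, which is the point at which to log the `requestTrackingId` and contact support.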
Hi Ashish5C1E,
Thank you for providing the request ID. We noticed an error in one of the components in our pipeline during that session. We can troubleshoot further if you can share a file that results in this error, either by attaching it to this post or by emailing it to extractapi@adobe.com.
Hi Chris,
I've sent the PDF and the error details to the above-mentioned email address.