Participating Frequently
October 8, 2021
Question

What is the optimal PDF page length to hit the PDF Extract API?


I am using the PDF Extract API in one of my applications. My PDF files range from 300 to 1000 pages each. The documentation says that non-scanned PDFs are limited to 200 pages and scanned PDFs must be 100 pages or fewer, and that limits may be lower for files with a large number of tables.

To work around these limits, each PDF has to be split into multiple chunks, but how can I determine the optimal number of pages per chunk so that I can automate the process?

 

Questions:

What is the maximum number of pages I can send to the API in a PDF that contains only

  1. scanned pages?
  2. large tables?
  3. scanned large tables?


Thank you!
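For illustration, the splitting step described above could be planned with a small helper like the one below. This is only a sketch: `page_ranges` and its parameters are hypothetical names, not part of the Adobe SDK, and the chunk size still has to be chosen from the documented limits.

```python
from typing import List, Tuple


def page_ranges(total_pages: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (start, end) page ranges covering the file."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Step through the document chunk_size pages at a time; the last
    # range is clipped so it never runs past the final page.
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


if __name__ == "__main__":
    # A 450-page non-scanned file split at the documented 200-page limit.
    print(page_ranges(450, 200))
```

Each range can then be cut out of the source file with whatever PDF splitting tool is already in use and submitted as its own request.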

 


1 reply

Joel Geraci
Community Expert
October 8, 2021

If you don't know ahead of time whether the file is image-only, I'd limit each submission to 100 pages, but also pay attention to what is returned, in case the tables are too long and the request fails.

 

That said, you can use the new PDF Properties API to detect whether the file contains only images: if it does, send 100 pages at a time; if it doesn't, send 200.
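The rule of thumb above could be written down roughly as follows. This is a sketch under assumptions: `pages_per_request` and the `is_image_only` flag are hypothetical names, and how the flag is obtained from the PDF Properties API response is not shown here.

```python
from typing import Optional

SCANNED_LIMIT = 100      # documented limit for scanned PDFs
NON_SCANNED_LIMIT = 200  # documented limit for non-scanned PDFs


def pages_per_request(is_image_only: Optional[bool]) -> int:
    """Pick a chunk size from the image-only flag.

    None means the file type is unknown, in which case the more
    conservative scanned limit is used.
    """
    if is_image_only is False:
        return NON_SCANNED_LIMIT
    # True or unknown: stay within the scanned-PDF limit.
    return SCANNED_LIMIT


if __name__ == "__main__":
    for flag in (True, False, None):
        print(flag, pages_per_request(flag))
```

Since the limits may be lower for table-heavy files, these values are upper bounds rather than guarantees.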

Participating Frequently
October 20, 2021

Thanks for responding. I actually split my file into multiple chunks of 50 pages each, but one chunk contains 24 pages of scanned tables. For that chunk I am getting the error below.

 

 

ERROR:root:Exception encountered while executing operation
Traceback (most recent call last):
File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\pdfops\extract_pdf_operation.py", line 134, in execute
ExtractPDFAPI.download_and_save(location=location, context=execution_context, file_location=file_location)
File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\internal\service\extract_pdf_api.py", line 48, in download_and_save
response = CPFApi.cpf_status_api(location, context)
File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\internal\api\cpf_api.py", line 92, in cpf_status_api
timeout=10 * 60
File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\polling2.py", line 191, in poll
val = target(*args, **kwargs)
File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\internal\api\cpf_api.py", line 89, in <lambda>
error_response_handler=CPFApi.handle_error_response),
File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\internal\http\http_client.py", line 43, in process_request
error_response_handler, not http_request.authenticator) and http_request.retryable:
File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\internal\http\http_client.py", line 110, in _handle_response_and_retry
report_error_code=report_error_code)
adobe.pdfservices.operation.internal.exceptions.OperationException

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\***\***\extract_txt_from_pdf.py", line 60, in extractAdobePdfTextApi
result: FileRef = extract_pdf_operation.execute(execution_context)
File "C:\Users\avemula\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\pdfops\extract_pdf_operation.py", line 139, in execute
request_tracking_id=oex.request_tracking_id, status_code=oex.status_code)
adobe.pdfservices.operation.exception.exceptions.ServiceApiException: description =Unable to process the message
even after retries; requestTrackingId=bsE8PLwSoGgjQMVLJCxLZS2u2BzdWfj5; statusCode=500; errorCode=UNKNOWN

Participating Frequently
October 20, 2021

Can you please let me know how I can handle this exception? Thank you!
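One possible way to handle a failing chunk, sketched here under assumptions, is to catch the exception and retry that chunk as two smaller page ranges, repeating until the request succeeds or the range cannot be split further. `extract_with_bisect` and `extract_range` are hypothetical names: `extract_range(start, end)` stands for your own wrapper that splits out those pages and runs the extract operation. The traceback suggests `ServiceApiException` is the exception to pass as `retry_on`. Whether a statusCode=500 / UNKNOWN error is actually caused by chunk size is not confirmed, so smaller chunks may not resolve it.

```python
from typing import Any, Callable, List, Tuple, Type

PageRange = Tuple[int, int]


def extract_with_bisect(
    extract_range: Callable[[int, int], Any],
    start: int,
    end: int,
    retry_on: Type[BaseException] = Exception,
    min_pages: int = 1,
) -> List[Tuple[PageRange, Any]]:
    """Run extract_range over [start, end], halving any range that fails.

    Returns ((start, end), result) pairs in page order. If a range of
    min_pages or fewer pages still fails, the exception is re-raised.
    """
    results: List[Tuple[PageRange, Any]] = []
    pending: List[PageRange] = [(start, end)]  # stack of ranges to try
    while pending:
        lo, hi = pending.pop()
        try:
            results.append(((lo, hi), extract_range(lo, hi)))
        except retry_on:
            if hi - lo + 1 <= min_pages:
                raise  # cannot split further; surface the error
            mid = (lo + hi) // 2
            # Push the second half first so the first half is tried next,
            # which keeps the results in page order.
            pending.append((mid + 1, hi))
            pending.append((lo, mid))
    return results
```

With the real SDK, the call might look like `extract_with_bisect(my_extract, 1, 50, retry_on=ServiceApiException)`, where `my_extract` is your own function. Logging the `requestTrackingId` from each failure is also worthwhile, since it can be quoted when reporting the problem.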