
What is the optimal PDF page length to hit the PDF Extract API?

Community Beginner, Oct 08, 2021


I am using the PDF Extract API in one of my applications. The PDFs range from 300 to 1,000 pages each. The documentation says that non-scanned PDFs are limited to 200 pages, that scanned PDFs must be 100 pages or fewer, and that limits may be lower for files with a large number of tables.

To work within these limits, each PDF has to be split into multiple chunks. How can I determine the optimal number of pages per chunk so that I can automate the process?

 

Questions:

What is the maximum number of pages I can send to the API in a single PDF that contains only

  1. scanned pages?
  2. large tables?
  3. scanned large tables?


Thank you!

 

TOPICS
PDF Extract API

Views

908

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
Community Expert, Oct 08, 2021


If you don't know ahead of time whether the file is image-only, I'd limit each submission to 100 pages and still check what is returned, in case the tables are too long and the request fails.

 

That said, you can use the new PDF Properties API to detect whether the file contains only images. If it does, send 100 pages at a time; if it does not, send 200.
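
The rule of thumb in this reply can be sketched as a small planner. This is only a sketch under stated assumptions: `plan_chunks`, `is_scanned`, and `max_pages` are hypothetical names, not part of any Adobe SDK, and the 100/200 limits are the ones quoted in this thread. Detecting image-only files (for example with the PDF Properties API) and actually splitting the PDF are left to whatever tool you already use.

```python
from typing import List, Optional, Tuple

# Page limits quoted in this thread (assumption: verify against current docs).
SCANNED_LIMIT = 100
NON_SCANNED_LIMIT = 200


def plan_chunks(total_pages: int, is_scanned: Optional[bool] = None,
                max_pages: Optional[int] = None) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) ranges.

    is_scanned=None means "unknown" and falls back to the stricter limit.
    max_pages lets the caller go lower, e.g. for table-heavy files.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    # Only a file known to be non-scanned gets the larger limit.
    limit = NON_SCANNED_LIMIT if is_scanned is False else SCANNED_LIMIT
    if max_pages is not None:
        if max_pages < 1:
            raise ValueError("max_pages must be at least 1")
        limit = min(limit, max_pages)
    return [(start, min(start + limit - 1, total_pages))
            for start in range(1, total_pages + 1, limit)]


print(plan_chunks(450, is_scanned=False))  # [(1, 200), (201, 400), (401, 450)]
print(plan_chunks(250))                    # [(1, 100), (101, 200), (201, 250)]
```

Treating "unknown" like "scanned" keeps the planner on the safe side, and `max_pages` gives a single knob to turn down when table-heavy chunks still fail.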

Community Beginner, Oct 20, 2021


Thanks for responding. I split my file into chunks of 50 pages each, but one chunk contains 24 pages of scanned tables. For that chunk I get the error below.

 

 

ERROR:root:Exception encountered while executing operation
Traceback (most recent call last):
  File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\pdfops\extract_pdf_operation.py", line 134, in execute
    ExtractPDFAPI.download_and_save(location=location, context=execution_context, file_location=file_location)
  File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\internal\service\extract_pdf_api.py", line 48, in download_and_save
    response = CPFApi.cpf_status_api(location, context)
  File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\internal\api\cpf_api.py", line 92, in cpf_status_api
    timeout=10 * 60
  File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\polling2.py", line 191, in poll
    val = target(*args, **kwargs)
  File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\internal\api\cpf_api.py", line 89, in <lambda>
    error_response_handler=CPFApi.handle_error_response),
  File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\internal\http\http_client.py", line 43, in process_request
    error_response_handler, not http_request.authenticator) and http_request.retryable:
  File "C:\Users\***\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\internal\http\http_client.py", line 110, in _handle_response_and_retry
    report_error_code=report_error_code)
adobe.pdfservices.operation.internal.exceptions.OperationException

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\***\***\extract_txt_from_pdf.py", line 60, in extractAdobePdfTextApi
    result: FileRef = extract_pdf_operation.execute(execution_context)
  File "C:\Users\avemula\AppData\Local\Programs\Python\Python36\lib\site-packages\adobe\pdfservices\operation\pdfops\extract_pdf_operation.py", line 139, in execute
    request_tracking_id=oex.request_tracking_id, status_code=oex.status_code)
adobe.pdfservices.operation.exception.exceptions.ServiceApiException: description =Unable to process the message even after retries; requestTrackingId=bsE8PLwSoGgjQMVLJCxLZS2u2BzdWfj5; statusCode=500; errorCode=UNKNOWN

Community Beginner, Oct 20, 2021


Could you please let me know how I can handle this exception? Thank you!
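
The thread does not settle what caused the 500, so the following is only one defensive pattern, not a confirmed fix: catch the failure for a chunk, split that chunk in half, and retry each half, recording any ranges that still fail. `extract_with_fallback` and `extract_range` are hypothetical names; `extract_range(first, last)` stands for your own function that splits out those pages and calls the Extract operation. With the SDK you would likely narrow the `except` to the `ServiceApiException` class that appears in the traceback above.

```python
from typing import Callable, List, Tuple

PageRange = Tuple[int, int]


def extract_with_fallback(first: int, last: int,
                          extract_range: Callable[[int, int], object],
                          min_pages: int = 1):
    """Try one page range; on failure, halve it and retry each half.

    Returns (results, failed). results holds (range, value) pairs in page
    order; failed holds ranges of min_pages pages or fewer that still failed.
    """
    results: List[Tuple[PageRange, object]] = []
    failed: List[PageRange] = []
    pending: List[PageRange] = [(first, last)]
    while pending:
        lo, hi = pending.pop()
        try:
            results.append(((lo, hi), extract_range(lo, hi)))
        except Exception:  # assumption: narrow this to the SDK's exception type
            if hi - lo + 1 <= min_pages:
                failed.append((lo, hi))
                continue
            mid = (lo + hi) // 2
            # Push the upper half first so the lower half is tried next.
            pending.append((mid + 1, hi))
            pending.append((lo, mid))
    return results, failed


def simulated_extract(lo: int, hi: int) -> str:
    # Stand-in for a real call: fails when page 30 sits in a range over 4 pages.
    if lo <= 30 <= hi and hi - lo + 1 > 4:
        raise RuntimeError("simulated statusCode=500")
    return f"pages {lo}-{hi}"


results, failed = extract_with_fallback(1, 50, simulated_extract)
print([page_range for page_range, _ in results], failed)
```

Each retry is a separate API call, so a sensible `min_pages` floor keeps a persistently failing chunk from turning into dozens of requests. Ranges left in `failed` are the ones worth reporting to support along with their request IDs.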

 

Adobe Employee, Oct 21, 2021


Hi Ashish5C1E,

Thank you for providing the request ID. We noticed an error in one of the components of our pipeline during that session. We can troubleshoot further if you share a file that triggers this error, either by attaching it to this post or by emailing it to extractapi@adobe.com
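
Since support works from the request ID, it can help to pull it out of failures automatically. The sketch below parses the `key=value; key=value` text seen in the error message earlier in this thread. `parse_service_error` is a hypothetical helper, and the message layout is assumed from that single log line, so treat it as best-effort; if the exception object exposes these values as attributes, reading them directly is preferable.

```python
import re
from typing import Dict

# Matches "key=value" or "key =value" pairs separated by semicolons.
_FIELD = re.compile(r"(\w+)\s*=\s*([^;]*)")


def parse_service_error(message: str) -> Dict[str, str]:
    """Split a logged service error message into its key/value fields."""
    # Collapse line breaks first, since logged messages may wrap mid-sentence.
    flat = " ".join(message.split())
    return {key: value.strip() for key, value in _FIELD.findall(flat)}


logged = ("description =Unable to process the message\n"
          "even after retries; requestTrackingId=bsE8PLwSoGgjQMVLJCxLZS2u2BzdWfj5; "
          "statusCode=500; errorCode=UNKNOWN")
fields = parse_service_error(logged)
print(fields["requestTrackingId"], fields["statusCode"], fields["errorCode"])
```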

Community Beginner, Oct 28, 2021


Hi Chris,

I've sent the PDF and the error details to the email address mentioned above.
