Participant
January 24, 2024
Answered

File not suitable for content extraction: File contents are too complex for content extraction


Dear Community,

I am going crazy. I am trying to extract content from a few PDF files, and for some of them I get the error "File not suitable for content extraction: File contents are too complex for content extraction". Unfortunately, this error doesn't help me at all. The files are not large (9 MB), have only 4 pages, and look exactly the same as all the others I am working on. Unfortunately, I am not allowed to share the PDF files. Could someone please give me a hint as to WHAT is too complex?

Best regards
Tommy

My code:

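import logging

# Imports below follow the module layout used in the PDF Services Python SDK 2.x samples.
from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions

try:
    # Initial setup: create the credentials instance.
    credentials = Credentials.service_principal_credentials_builder().with_client_id('XXX').with_client_secret('XXX').build()

    # Create an ExecutionContext using the credentials and create a new operation instance.
    execution_context = ExecutionContext.create(credentials)
    extract_pdf_operation = ExtractPDFOperation.create_new()

    # Set the operation input from a source file.
    source = FileRef.create_from_local_file("XXXX6bfcd6f.pdf")
    extract_pdf_operation.set_input(source)

    # Build the ExtractPDF options (text elements plus styling info) and set them on the operation.
    extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
        .with_element_to_extract(ExtractElementType.TEXT) \
        .with_include_styling_info(True) \
        .build()
    extract_pdf_operation.set_options(extract_pdf_options)

    # Execute the operation.
    result: FileRef = extract_pdf_operation.execute(execution_context)

    # Save the result to the specified location.
    result.save_as("ExtractTextInfoWithStylingInfoFromPDF.zip")
except (ServiceApiException, ServiceUsageException, SdkException):
    # Log the full traceback, including the service's error message.
    logging.exception("Exception encountered while executing operation")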

 


    3 replies

    Tomas35001055w90y (Author, Correct answer)
    Participant
    January 26, 2024

    Many thanks for the answers. Based on them, I was able to find the cause: some of the PDF files contain complex vector drawings. If I remove these, the extraction works. I have therefore found a workaround:

    I remove all vector drawings using Ghostscript (gs):
    gs -o noimage.pdf -sDEVICE=pdfwrite -dFILTERVECTOR input.pdf

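The workaround above could be automated roughly as follows. This is only a sketch under several assumptions: `gs` is on the PATH, `extract` is a hypothetical callable wrapping the extraction code from the question, the service's error text shows up in `str(exc)`, and the `_novector.pdf` suffix is an arbitrary naming choice. None of this comes from the SDK itself.

```python
import subprocess
from pathlib import Path
from typing import Callable, List

# Assumption: this fragment of the service's message appears in str(exc).
TOO_COMPLEX = "too complex for content extraction"


def build_gs_command(src: str, dst: str) -> List[str]:
    """Ghostscript call from the post: rewrite src to dst without vector drawings."""
    return ["gs", "-o", dst, "-sDEVICE=pdfwrite", "-dFILTERVECTOR", src]


def extract_with_fallback(src: str, extract: Callable[[str], None]) -> str:
    """Try src as-is; on the "too complex" error, retry on a vector-free copy.

    Returns the path of the file that was actually extracted.
    `extract` is a hypothetical wrapper around the extraction call.
    """
    try:
        extract(src)
        return src
    except Exception as exc:
        # Any other failure is not something this workaround addresses.
        if TOO_COMPLEX not in str(exc):
            raise

    stripped = str(Path(src).with_name(Path(src).stem + "_novector.pdf"))
    subprocess.run(build_gs_command(src, stripped), check=True)
    extract(stripped)
    return stripped
```

Only the files that fail are rewritten, so PDFs that already extract cleanly keep their drawings.
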
    Joel Geraci
    Community Expert
    January 24, 2024

    Generally, the "too complex" error pops up when the AI can't break down a table into its component rows, columns, and cells, but it can also be when there is diagonal text, a watermark, no logical reading order, or simply the kind of layout that it's not been trained for. It's that last one that's really frustrating.

    If you can, share the files privately with Ray, or even just one that works and one that doesn't.
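
When the files cannot be shared, one rough way to narrow down the cause locally is to remove one class of page content at a time and see which variant stops triggering the error. The sketch below only builds the Ghostscript commands for that, using the `-dFILTERVECTOR`, `-dFILTERIMAGE` and `-dFILTERTEXT` switches. The output file names are made up for illustration. It cannot isolate table structure or reading-order problems; it only tells you which content class is involved.

```python
from pathlib import Path
from typing import Dict, List

# Ghostscript switches that each drop one class of page content.
FILTERS = {
    "novector": "-dFILTERVECTOR",
    "noimage": "-dFILTERIMAGE",
    "notext": "-dFILTERTEXT",
}


def diagnostic_commands(src: str) -> Dict[str, List[str]]:
    """Map each variant's output file name to the gs command that creates it."""
    stem = Path(src).stem
    commands = {}
    for label, flag in FILTERS.items():
        dst = f"{stem}_{label}.pdf"  # hypothetical naming scheme
        commands[dst] = ["gs", "-o", dst, "-sDEVICE=pdfwrite", flag, src]
    return commands
```

Each command can be run with `subprocess.run(cmd, check=True)` and the resulting file passed through the same extraction call. If, for example, only the `_novector.pdf` variant succeeds, vector drawings are the likely trigger.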

    Raymond Camden
    Community Manager
    January 24, 2024

    I understand you can't share them publicly, but could you share them with me directly? (jedimaster@adobe.com)