File not suitable for content extraction: File contents are too complex for content extraction

Question

Dear Community,

I am going crazy. I am trying to extract a few PDF files, and for some of them I get the error: "File not suitable for content extraction: File contents are too complex for content extraction". Unfortunately, this error doesn't help me at all. The files are not large (9 MB), they have only 4 pages, and they look exactly the same as all the others I am working on. Unfortunately, I am not allowed to share the PDF files.

Could someone please give me a hint as to WHAT is too complex?

Best regards,
Tommy

My code:

try:

    # Initial setup, create credentials instance.
    credentials = Credentials.service_principal_credentials_builder().with_client_id('XXX').with_client_secret('XXX').build()

    # Create an ExecutionContext using credentials and create a new operation instance.
    execution_context = ExecutionContext.create(credentials)
    extract_pdf_operation = ExtractPDFOperation.create_new()

    # Set operation input from a source file.
    source = FileRef.create_from_local_file("XXXX6bfcd6f.pdf")
    extract_pdf_operation.set_input(source)

    # Build ExtractPDF options and set them into the operation.
    extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
        .with_element_to_extract(ExtractElementType.TEXT) \
        .with_include_styling_info(True) \
        .build()
    extract_pdf_operation.set_options(extract_pdf_options)

    # Execute the operation.
    result: FileRef = extract_pdf_operation.execute(execution_context)

    # Save the result to the specified location.
    result.save_as("ExtractTextInfoWithStylingInfoFromPDF.zip")
except (ServiceApiException, ServiceUsageException, SdkException):
    logging.exception("Exception encountered while executing operation")

Tomas35001055w90y · Accepted Answer

Many thanks for the answers. Based on them I was able to find the cause. Some of my PDF files contain complex vector drawings. If I remove them, the extraction works. I have therefore found a workaround:

I remove all vector drawings using Ghostscript (gs):
gs -o noimage.pdf -sDEVICE=pdfwrite -dFILTERVECTOR input.pdf
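
For anyone who wants to run this workaround from the same Python script before calling the extraction, here is a minimal sketch. It assumes Ghostscript is installed and reachable on the PATH as `gs` (on Windows the console binary is usually `gswin64c`). The helper names `build_gs_command` and `strip_vector_drawings` are hypothetical and not part of any SDK. The arguments are the same ones as in the command above.

```python
import shutil
import subprocess
from pathlib import Path


def build_gs_command(src, dst, gs_binary="gs"):
    """Return the Ghostscript argument list that rewrites src without vector drawings."""
    return [
        gs_binary,
        "-o", str(dst),          # output file
        "-sDEVICE=pdfwrite",     # write a PDF again
        "-dFILTERVECTOR",        # drop vector drawings
        str(src),
    ]


def strip_vector_drawings(src, dst, gs_binary="gs"):
    """Write a copy of src to dst with vector drawings removed and return dst as a Path."""
    # Fail early with a clear message if Ghostscript is not installed.
    if shutil.which(gs_binary) is None:
        raise FileNotFoundError(f"Ghostscript executable not found: {gs_binary}")
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(build_gs_command(src, dst, gs_binary), check=True)
    return Path(dst)
```

The path returned by `strip_vector_drawings("input.pdf", "noimage.pdf")` could then be used as the input file for the extraction instead of the original. Keep in mind that the drawings are gone from the filtered copy, so this only helps if you need the text and not the graphics.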

Joel Geraci · Answer

Generally, the "too complex" error pops up when the AI can't break down a table into its component rows, columns, and cells, but it can also be when there is diagonal text, a watermark, no logical reading order, or simply the kind of layout that it's not been trained for. It's that last one that's really frustrating.

If you can, share the files privately with Ray, or even just one that works and one that doesn't.
