Unable to Extract Any Data from PDFs

Report · Nov 16, 2021

I'm getting exceptions from Adobe services when running the Python SDK for the Adobe PDF Extract API.

The confusing part is that the exception occurs only when I use one of my own PDF data sets. The sample PDF that ships with the SDK, "extractPdfInput.pdf", works successfully, and with it I'm able to generate the JSON structure for all the .py files inside src.

1) .py script with my PDF data set (AnalogDialogue.pdf):

import logging
import os.path
import zipfile

from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation

logging.basicConfig(level=os.environ.get("LOGLEVEL", "INFO"))

try:
    # get base path.
    base_path = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

    # Initial setup, create credentials instance.
    credentials = Credentials.service_account_credentials_builder() \
        .from_file(base_path + "/pdfservices-api-credentials.json") \
        .build()

    # Create an ExecutionContext using credentials and create a new operation instance.
    execution_context = ExecutionContext.create(credentials)
    extract_pdf_operation = ExtractPDFOperation.create_new()

    # Set operation input from a source file.
    source = FileRef.create_from_local_file(base_path + "/resources/AnalogDialogue.pdf")

    extract_pdf_operation.set_input(source)

    # Build ExtractPDF options and set them into the operation
    extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
        .with_element_to_extract(ExtractElementType.TEXT) \
        .build()
    extract_pdf_operation.set_options(extract_pdf_options)

    # Execute the operation.
    result: FileRef = extract_pdf_operation.execute(execution_context)

    # Save the result to the specified location.
    result.save_as(base_path + "/output/ExtractTextInfoFromPDF.zip")
    file_to_extract = "structuredData.json"

    # Extract the JSON.
    with zipfile.ZipFile(base_path + "/output/ExtractTextInfoFromPDF.zip") as z:
        with open(file_to_extract, 'wb') as f:
            f.write(z.read(file_to_extract))
            print("Extracted", file_to_extract)
            # os.remove(base_path + "/output/ExtractTextInfoFromPDF.zip")

except (ServiceApiException, ServiceUsageException, SdkException):
    logging.exception("Exception encountered while executing operation")

- Terminal log while running "adobe-pdf-extract/src/extractpdf/extract_txt_from_pdf.py"

python3 src/extractpdf/extract_txt_table_info_with_figure_tables_rendition_from_pdf.py
INFO:adobe.pdfservices.operation.pdfops.extract_pdf_operation:All validations successfully done. Beginning ExtractPDF operation execution
INFO:adobe.pdfservices.operation.pdfops.extract_pdf_operation:Extract Operation Successful - Transaction ID: lUFDE1p1OC3oxgDtCeIdW6HeWmVc14Ry
INFO:adobe.pdfservices.operation.internal.io.file_ref_impl:Moving file at /var/folders/z_/hrr9wxg135x30vrj32b868100000gp/T/extractSdkResult/b22cc67a46ab11ec9955b88d120e91a8.zip to target /Users/achal/Downloads/PDFServices/adobe-pdf-extract/output/ExtractTextTableWithFigureTableRendition.zip
admins-MacBook-Air-3:adobe-pdf-extract achal$ python3 src/extractpdf/extract_txt_table_info_with_figure_tables_rendition_from_pdf.py
INFO:adobe.pdfservices.operation.pdfops.extract_pdf_operation:All validations successfully done. Beginning ExtractPDF operation execution
ERROR:root:Exception encountered while executing operation
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 394, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connection.py", line 234, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1262, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1308, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1257, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1067, in _send_output
self.send(chunk)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 989, in send
self.sock.sendall(data)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 1034, in sendall
v = self.send(byte_view[count:])
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 1003, in send
return self._sslobj.write(data)
socket.timeout: The write operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/util/retry.py", line 531, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/packages/six.py", line 734, in reraise
raise value.with_traceback(tb)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 394, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connection.py", line 234, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1262, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1308, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1257, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1067, in _send_output
self.send(chunk)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 989, in send
self.sock.sendall(data)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 1034, in sendall
v = self.send(byte_view[count:])
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 1003, in send
return self._sslobj.write(data)
urllib3.exceptions.ProtocolError: ('Connection aborted.', timeout('The write operation timed out'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/adobe/pdfservices/operation/internal/http/http_client.py", line 73, in _execute_request
timeout=timeout)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/api.py", line 119, in post
return request('post', url, data=data, json=json, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', timeout('The write operation timed out'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "src/extractpdf/extract_txt_table_info_with_figure_tables_rendition_from_pdf.py", line 53, in <module>
result: FileRef = extract_pdf_operation.execute(execution_context)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/adobe/pdfservices/operation/pdfops/extract_pdf_operation.py", line 131, in execute
location = ExtractPDFAPI.extract_pdf(execution_context, self._source_file_ref, self._extract_pdf_options)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/adobe/pdfservices/operation/internal/service/extract_pdf_api.py", line 43, in extract_pdf
ServiceConstants.EXTRACT_OPERATION_NAME)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/adobe/pdfservices/operation/internal/api/cpf_api.py", line 65, in cpf_create_ops_api
error_response_handler=CPFApi.handle_error_response)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/adobe/pdfservices/operation/internal/http/http_client.py", line 41, in process_request
response = _execute_request(http_request)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/adobe/pdfservices/operation/internal/http/http_client.py", line 81, in _execute_request
raise SdkException("Request could not be completed. Possible cause attached!", sys.exc_info())
adobe.pdfservices.operation.exception.exceptions.SdkException: description =Request could not be completed. Possible cause attached!, requestTrackingId=(<class 'requests.exceptions.ConnectionError'>, ConnectionError(ProtocolError('Connection aborted.', timeout('The write operation timed out'))), <traceback object at 0x10ff36500>)

Report · Nov 19, 2021

Hi! First off, thank you for sharing so much information, and especially the sample PDFs. I tried the first two and was not able to replicate the issue. I do see that the error relates to a timeout. Is it possible your system (where you tested the code) has a (possibly) bad connection, or slow connection, to our APIs? Also, could you try increasing the timeout? I'm a Python newbie, but according to our docs, it's configured here:

https://opensource.adobe.com/pdfservices-python-sdk-samples/apidocs/latest/reference/index.html#clie...

Report · Nov 23, 2021

Thanks for your response! We have already extracted all the data using your sample PDF, i.e. extractPdfInput.pdf, and with that sample there is no connection exception from your API.

- The error occurs only when we use our own PDF data sets.

- About the timeout issue, we will check that, but for your reference there is already a built-in timeout, so there should be no need to provide one separately. The timeout configured for the API calls should be sufficient.

Kindly assign your technical team to look at our sample PDFs so that we can get this issue resolved and integrate the API with our application.

Report · Nov 23, 2021

To your point about a timeout already existing: I know that. My point, though, was to see whether _increasing_ the timeout helps.

As I said, I'm unable to replicate the issue with your PDFs - they work fine for me.

Can you try increasing your timeout to see if it helps?
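
A minimal sketch of what raising the timeouts could look like. It assumes the SDK's `ClientConfig` builder exposes `with_connect_timeout` and `with_read_timeout` taking milliseconds, as described in the API docs linked earlier in the thread, and that `ExecutionContext.create` accepts the resulting config as a second argument. The helper names `to_millis` and `create_context_with_timeouts` and the default values are illustrative, not part of the SDK.

```python
def to_millis(seconds):
    """Convert seconds to whole milliseconds."""
    return int(round(seconds * 1000))


def create_context_with_timeouts(credentials, connect_seconds=30, read_seconds=120):
    """Build an ExecutionContext with longer timeouts (assumed SDK API, see lead-in)."""
    # Imported lazily so to_millis can be used without the SDK installed.
    from adobe.pdfservices.operation.client_config import ClientConfig
    from adobe.pdfservices.operation.execution_context import ExecutionContext

    # Assumption: both builder methods take a timeout in milliseconds.
    client_config = ClientConfig.builder() \
        .with_connect_timeout(to_millis(connect_seconds)) \
        .with_read_timeout(to_millis(read_seconds)) \
        .build()
    return ExecutionContext.create(credentials, client_config)
```

In the script from the first post, this would stand in for the `ExecutionContext.create(credentials)` call, for example `execution_context = create_context_with_timeouts(credentials)`. The traceback shows the timeout firing while the request body was being written, so a large PDF on a slow uplink is the case where longer timeouts are most likely to matter.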

Report · Jan 22, 2023

I really appreciate your help, but I don't understand anything you're wanting me to do. The file I was trying to send was a PDF file, and maybe that's the problem. I'm going to see what I can do on my end. I made these files years ago with an illustrator I bought and owned; now that new versions have come out, I can't make these designs anymore. I have probably hundreds of them in files. I print them out on a black and white printer, then etch them. I wish you could have seen them, I just can't make them anymore...Linda

Lindas Lovely Loot

Report · Jan 22, 2023

The answer is the name at the end of your email. Type in Lindas Lovely Loot. Some of my work is on that page....

Lindas Lovely Loot

Report · Jan 23, 2023

I'm not sure how your question relates to the Acrobat Services APIs.

Report · Jul 05, 2023

Hi,

I am facing a similar situation, even with the Adobe Extract API Sample.pdf. Here's the error I'm receiving:

Exception has occurred: ValueError

Invalid Credentials provided as argument

File "C:\Users\dawsonsc\OneDrive - Organon\desktop\PDF Extract API\main.py", line 29, in <module>
    execution_context = ExecutionContext.create(credentials)
ValueError: Invalid Credentials provided as argument

Here's the code I'm using (Straight from the Adobe PDF Extract Quickstarts):

import logging

from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType

import os.path
import zipfile
import json

zip_file = "./Adobe Extract API Sample.pdf"

if os.path.isfile(zip_file):
    os.remove(zip_file)

input_pdf = "./Adobe Extract API Sample.pdf"

try:
    # Initial setup, create credentials instance.
    credentials = Credentials.service_principal_credentials_builder()
    credentials.with_client_id('PDF_SERVICES_CLIENT_ID')
    credentials.with_client_secret('PDF_SERVICES_CLIENT_SECRET')
    credentials.build();

    # Create an ExecutionContext using credentials and create a new operation instance.
    execution_context = ExecutionContext.create(credentials)
    extract_pdf_operation = ExtractPDFOperation.create_new()

    # Set operation input from a source file.
    source = FileRef.create_from_local_file(input_pdf)
    extract_pdf_operation.set_input(source)

    # Build ExtractPDF options and set them into the operation
    extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
        .with_element_to_extract(ExtractElementType.TEXT) \
        .build()
    extract_pdf_operation.set_options(extract_pdf_options)

    # Execute the operation.
    result: FileRef = extract_pdf_operation.execute(execution_context)

    # Save the result to the specified location.
    result.save_as(zip_file)

    print("Successfully extracted information from PDF. Printing H1 Headers:\n");

    archive = zipfile.ZipFile(zip_file, 'r')
    jsonentry = archive.open('structuredData.json')
    jsondata = jsonentry.read()
    data = json.loads(jsondata)
    for element in data["elements"]:
        if(element["Path"].endswith("/H1")):
            print(element["Text"])

except (ServiceApiException, ServiceUsageException, SdkException):
    logging.exception("Exception encountered while executing operation")

Report · Jul 10, 2023

If you are using the _very_ latest SDK, be sure to go into the developer console and generate new OAuth credentials.

Report · Jul 10, 2023

I did that and same error. I even deleted all projects, credentials, everything and started from scratch. Same error.

Report · Jul 10, 2023

Did you change these lines?

credentials.with_client_id('PDF_SERVICES_CLIENT_ID')

credentials.with_client_secret('PDF_SERVICES_CLIENT_SECRET')

Report · Jul 10, 2023

So I tried both ways: replacing those lines with the generated credentials and leaving them in original format.

Honestly, I'm about to give up on Adobe altogether and use Python's pdfminer module, which is not nearly as time-consuming or as ridiculously difficult to implement, yet achieves the exact same goal: a JSON output with all PDF elements (text, tables, multi-page paragraphs, attributes, et cetera), literally everything that this API offers. That output can then, of course, be used in a Power Automate flow to parse the JSON file. And Adobe is zero help: they want me to create deprecated JWT credentials, and that came from an engineer on this specific API, no less.

Report · Jul 10, 2023

Who asked you to make deprecated JWT credentials? I know I didn't. Earlier I confirmed with you that you were using the new OAuth creds.

Secondly, the right thing to do is replace those static values with your credentials. You said you did that, but now I'm concerned that you were using the wrong ones. So you did use the client ID and secret from the OAuth credentials, right?
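
One more thing that may be worth checking in the snippet posted on Jul 05, independent of which values are used: the result of `build()` is never assigned, so the `credentials` variable still refers to the builder when it is passed to `ExecutionContext.create`. The sketch below illustrates that pitfall with hypothetical stand-in classes (`DemoCredentials`, `DemoCredentialsBuilder`, `create_context`); they are not the Adobe SDK classes, and whether the SDK rejects a builder with exactly this message is an assumption based on the error text reported above.

```python
class DemoCredentials:
    """Hypothetical stand-in for the object a credentials builder produces."""

    def __init__(self, client_id, client_secret):
        self.client_id = client_id
        self.client_secret = client_secret


class DemoCredentialsBuilder:
    """Hypothetical fluent builder; not the Adobe SDK class."""

    def __init__(self):
        self._client_id = None
        self._client_secret = None

    def with_client_id(self, client_id):
        self._client_id = client_id
        return self  # returning self is what makes chaining possible

    def with_client_secret(self, client_secret):
        self._client_secret = client_secret
        return self

    def build(self):
        return DemoCredentials(self._client_id, self._client_secret)


def create_context(credentials):
    """Mimics a consumer that accepts only built credentials."""
    if not isinstance(credentials, DemoCredentials):
        raise ValueError("Invalid Credentials provided as argument")
    return {"client_id": credentials.client_id}


# Pattern from the posted snippet: the return value of build() is discarded,
# so `unbuilt` is still the builder object.
unbuilt = DemoCredentialsBuilder()
unbuilt.with_client_id("my-id")
unbuilt.with_client_secret("my-secret")
unbuilt.build()

try:
    create_context(unbuilt)
    discarded_build_error = None
except ValueError as exc:
    discarded_build_error = str(exc)

# Chained form: the variable holds what build() returned.
built = (
    DemoCredentialsBuilder()
    .with_client_id("my-id")
    .with_client_secret("my-secret")
    .build()
)
context = create_context(built)
```

If the same thing is happening with the real SDK, assigning the result (`credentials = builder.build()`) or chaining the calls into one expression would be the thing to try before regenerating credentials again.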

Report · Jul 10, 2023

Cosmin provided links in an email which when I clicked on them took me to a page that instructed me to create JWT credentials.

Report · Jul 11, 2023

I don't see Cosmin on this thread, but if it came from an Adobe employee, please feel free to forward it to me at jedimaster@adobe.com.

So, to be clear, you should use the OAuth creds, and the quick start should be updated to reflect that. Can you confirm you are doing that and still getting an error? Also, can you email me a copy of your file with the creds in there so I can verify?

Report · Jul 10, 2023

I'm facing a problem with the error "base64 data appears to be truncated".

Report · Jul 10, 2023

Um, are you using the same code as the original user? If not, can you open a new thread, please?