Participating Frequently
January 2, 2023
Question

Adobe PDF table extract using Databricks


Hi All,

I'm trying to use the Adobe PDF Services API to extract table data from PDF files. I'm using the sample code for the Python SDK, but I'm running into an issue when executing it in a Databricks environment. Please advise on how to fix it.

import logging
import os.path

from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_renditions_element_type import \
    ExtractRenditionsElementType
from adobe.pdfservices.operation.pdfops.options.extractpdf.table_structure_type import TableStructureType
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation

#logging.basicConfig(level=os.environ.get("LOGLEVEL", "INFO"))


credentials = Credentials.service_account_credentials_builder() \
    .from_file("/dbfs/FileStore/pdfservices_api_credentials.json") \
    .build()

execution_context = ExecutionContext.create(credentials)
extract_pdf_operation = ExtractPDFOperation.create_new()

source = FileRef.create_from_local_file("/dbfs/FileStore/form.pdf")
extract_pdf_operation.set_input(source)

# Build ExtractPDF options and set them into the operation
extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
    .with_elements_to_extract([ExtractElementType.TEXT, ExtractElementType.TABLES]) \
    .with_element_to_extract_renditions(ExtractRenditionsElementType.TABLES) \
    .with_table_structure_format(TableStructureType.CSV) \
    .build()
extract_pdf_operation.set_options(extract_pdf_options)

# Execute the operation.
result: FileRef = extract_pdf_operation.execute(execution_context)

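# base_path was not defined in the original snippet; the DBFS folder used for
# the inputs above is assumed here.
base_path = "/dbfs/FileStore"
result.save_as(base_path + "/output/ExtractTextInfoFromPDF.zip")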

Here are the error details:

INFO:adobe.pdfservices.operation.pdfops.extract_pdf_operation:All validations successfully done. Beginning ExtractPDF operation execution
INFO:py4j.java_gateway:Received command c on object id p0
INFO:py4j.java_gateway:Received command c on object id p0
INFO:py4j.java_gateway:Received command c on object id p0
INFO:py4j.java_gateway:Received command c on object id p0
SdkException: description =Exception in fetching access token, requestTrackingId=(<class 'AttributeError'>, AttributeError("'str' object has no attribute 'get'"), <traceback object at 0x7f7572a3fd00>)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/local_disk0/.ephemeral_nfs/envs/pythonEnv-150d96ca-003d-4671-a6d9-ab8e566616d1/lib/python3.8/site-packages/adobe/pdfservices/operation/internal/auth/jwt_authenticator.py in refresh_token(self)
     62                                        data=access_token_request_payload, headers={})
---> 63             response = http_client.process_request(http_request=http_request, success_status_codes=[HTTPStatus.OK],
     64                                                    error_response_handler=self.handle_ims_failure)

/local_disk0/.ephemeral_nfs/envs/pythonEnv-150d96ca-003d-4671-a6d9-ab8e566616d1/lib/python3.8/site-packages/adobe/pdfservices/operation/internal/http/http_client.py in process_request(http_request, success_status_codes, error_response_handler)
     37         response = _execute_request(http_request)
---> 38         if _handle_response_and_retry(response, success_status_codes,
     39                                       error_response_handler, not http_request.authenticator, http_request.request_key) and http_request.retryable:

/local_disk0/.ephemeral_nfs/envs/pythonEnv-150d96ca-003d-4671-a6d9-ab8e566616d1/lib/python3.8/site-packages/adobe/pdfservices/operation/internal/http/http_client.py in _handle_response_and_retry(response, success_status_codes, error_response_handler, is_ims_api, request_key)
     94             "Failure response code {error_code} encountered from backend".format(error_code=response.status_code))
---> 95         should_retry = ResponseUtil.handle_api_failures(response, request_key, is_ims_api)
     96         return should_retry if should_retry else error_response_handler(response)

I'd appreciate any help with this. Thanks in advance!


Raymond Camden
Community Manager
January 3, 2023

Can you confirm the environment has the right version of Python required for the SDK? Can you confirm the read operation on the credentials worked right?

Participating Frequently
January 9, 2023

This is the Python version currently used in the Azure Databricks environment:

3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]

How do I confirm the read operation on the credentials? Please let me know the steps.

Raymond Camden
Community Manager
January 9, 2023

Try to read /dbfs/FileStore/pdfservices_api_credentials.json and ensure you can log the results. That's what I'd try first.
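
The check suggested above could look something like the sketch below. It uses only the Python standard library. The function name load_credentials and the logger name are made up for illustration and are not part of the Adobe SDK, and the check that the file's top level is a JSON object is my own assumption about what a usable credentials file looks like. Only the top-level key names are logged, so the secrets themselves stay out of the notebook output.

```python
import json
import logging
import os

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("credentials_check")  # hypothetical logger name


def load_credentials(path):
    """Read a JSON credentials file and return the parsed content.

    Hypothetical helper for this thread, not an Adobe SDK function.
    """
    # Confirm the file is visible from the Python process at all.
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Credentials file not found: {path}")

    # Confirm the file can be opened and parses as JSON.
    with open(path, "r", encoding="utf-8") as handle:
        parsed = json.load(handle)

    # Assumption: a credentials file should be a JSON object at the top level.
    if not isinstance(parsed, dict):
        raise ValueError(
            f"Expected a JSON object in {path}, got {type(parsed).__name__}"
        )

    # Log key names only, never the values, to avoid leaking secrets.
    logger.info("Read %s; top-level keys: %s", path, sorted(parsed))
    return parsed
```

In a notebook cell you would then call load_credentials("/dbfs/FileStore/pdfservices_api_credentials.json"), using the path from the original post. If that call raises, the problem is with reading or parsing the file rather than with the extract operation itself.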