
Adobe PDF table extract using Databricks

New Here, Jan 01, 2023

Hi All,

I'm trying to use the Adobe PDF API service to extract table data from PDF files. I'm using the sample code for the Python SDK, but I'm facing an issue when executing it in a Databricks environment. Please advise on how to fix the issue.

import logging
import os.path

from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_renditions_element_type import \
    ExtractRenditionsElementType
from adobe.pdfservices.operation.pdfops.options.extractpdf.table_structure_type import TableStructureType
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation

#logging.basicConfig(level=os.environ.get("LOGLEVEL", "INFO"))


credentials = Credentials.service_account_credentials_builder() \
    .from_file("/dbfs/FileStore/pdfservices_api_credentials.json") \
    .build()

execution_context = ExecutionContext.create(credentials)
extract_pdf_operation = ExtractPDFOperation.create_new()

source = FileRef.create_from_local_file("/dbfs/FileStore/form.pdf")
extract_pdf_operation.set_input(source)

# Build ExtractPDF options and set them into the operation
extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
    .with_elements_to_extract([ExtractElementType.TEXT, ExtractElementType.TABLES]) \
    .with_element_to_extract_renditions(ExtractRenditionsElementType.TABLES) \
    .with_table_structure_format(TableStructureType.CSV) \
    .build()
extract_pdf_operation.set_options(extract_pdf_options)

# Execute the operation.
result: FileRef = extract_pdf_operation.execute(execution_context)

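# base_path was undefined in the original snippet; the DBFS FileStore root is assumed here.
base_path = "/dbfs/FileStore"
result.save_as(base_path + "/output/ExtractTextInfoFromPDF.zip")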

 

Here are the error details:

 

INFO:adobe.pdfservices.operation.pdfops.extract_pdf_operation:All validations successfully done. Beginning ExtractPDF operation execution
INFO:py4j.java_gateway:Received command c on object id p0
INFO:py4j.java_gateway:Received command c on object id p0
INFO:py4j.java_gateway:Received command c on object id p0
INFO:py4j.java_gateway:Received command c on object id p0
SdkException: description =Exception in fetching access token, requestTrackingId=(<class 'AttributeError'>, AttributeError("'str' object has no attribute 'get'"), <traceback object at 0x7f7572a3fd00>)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/local_disk0/.ephemeral_nfs/envs/pythonEnv-150d96ca-003d-4671-a6d9-ab8e566616d1/lib/python3.8/site-packages/adobe/pdfservices/operation/internal/auth/jwt_authenticator.py in refresh_token(self)
     62                                        data=access_token_request_payload, headers={})
---> 63             response = http_client.process_request(http_request=http_request, success_status_codes=[HTTPStatus.OK],
     64                                                    error_response_handler=self.handle_ims_failure)

/local_disk0/.ephemeral_nfs/envs/pythonEnv-150d96ca-003d-4671-a6d9-ab8e566616d1/lib/python3.8/site-packages/adobe/pdfservices/operation/internal/http/http_client.py in process_request(http_request, success_status_codes, error_response_handler)
     37         response = _execute_request(http_request)
---> 38         if _handle_response_and_retry(response, success_status_codes,
     39                                       error_response_handler, not http_request.authenticator, http_request.request_key) and http_request.retryable:

/local_disk0/.ephemeral_nfs/envs/pythonEnv-150d96ca-003d-4671-a6d9-ab8e566616d1/lib/python3.8/site-packages/adobe/pdfservices/operation/internal/http/http_client.py in _handle_response_and_retry(response, success_status_codes, error_response_handler, is_ims_api, request_key)
     94             "Failure response code {error_code} encountered from backend".format(error_code=response.status_code))
---> 95         should_retry = ResponseUtil.handle_api_failures(response, request_key, is_ims_api)
     96         return should_retry if should_retry else error_response_handler(response)

I'd appreciate any help with this. Thanks in advance!

TOPICS
PDF Extract API, Python SDK, REST APIs

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines
Adobe Employee, Jan 03, 2023

Can you confirm that the environment has the Python version required by the SDK? Can you also confirm that the read operation on the credentials file worked correctly?

New Here, Jan 09, 2023

This is the Python version currently used in the Azure Databricks environment:

 

3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]

How do I confirm the read operation on the credentials? Please let me know the steps.

Adobe Employee, Jan 09, 2023

Try to read /dbfs/FileStore/pdfservices_api_credentials.json and ensure you can log the results. That's what I'd try first.

New Here, Jan 09, 2023

Can you please share some sample code? When I try to read the file with spark.read.json, the resulting schema shows only a corrupt record column:  |-- _corrupt_record: string (nullable = true)

Adobe Employee, Jan 09, 2023

Err, I don't know what spark.read.json is - I had meant the basic Python file open command. 

New Here, Jan 09, 2023

 

import json


with open('/dbfs/FileStore/pdfservices_api_credentials.json', 'r') as f:
    data = f.read()
    jsonObject = json.loads(data)
jsonObject

 

Using the above code, I can read the required file. Please let me know the next step.

Adobe Employee, Jan 09, 2023

Unfortunately I'm not sure what to suggest. The error is not, as far as I know, related to bad credentials. But you could test by using Python locally, with your credentials, to do a quick Extract call and confirm it works.

Adobe Employee, Jan 09, 2023

Ah something occurs to me. Your credentials should include a private key as well. Can you confirm that is available? The JSON file will point to the path of the key.
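
One way to check this from the notebook is sketched below. It assumes the credentials JSON stores the key location under a field named `private_key_file`, and that a relative path is resolved against the directory containing the JSON file. Both are assumptions about the file layout, not something confirmed in this thread, and the helper name `find_private_key_path` is hypothetical.

```python
import json
import os


def find_private_key_path(credentials_path, key_field="private_key_file"):
    """Return the private key path referenced by the credentials JSON, or None.

    Assumption: the key location is stored under `key_field` somewhere in the
    JSON, and a relative path is relative to the credentials file's directory.
    """
    with open(credentials_path, "r") as f:
        config = json.load(f)

    def search(node):
        # Walk nested dicts/lists until a string value for key_field is found.
        if isinstance(node, dict):
            value = node.get(key_field)
            if isinstance(value, str):
                return value
            children = node.values()
        elif isinstance(node, list):
            children = node
        else:
            return None
        for child in children:
            found = search(child)
            if found is not None:
                return found
        return None

    relative = search(config)
    if relative is None:
        return None
    # os.path.join keeps `relative` unchanged if it is already absolute.
    base_dir = os.path.dirname(os.path.abspath(credentials_path))
    return os.path.join(base_dir, relative)
```

Calling `find_private_key_path("/dbfs/FileStore/pdfservices_api_credentials.json")` and passing a non-None result to `os.path.exists` would show whether the key file was uploaded alongside the JSON.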

New Here, Jan 09, 2023

The existing Python SDK sample code works when deployed on a standalone system. I'm trying to create a scalable solution with Databricks (because we have a very large number of PDF files to process), and this is a proof of concept to evaluate its ability to handle a variety of PDF file patterns.

If you can suggest any scalable solution besides Databricks, let me know. We can try to recreate that kind of environment and then process multiple PDF files in parallel.

I'm looking for input on this particular use case.

 

Adobe Employee, Jan 09, 2023

Um, honestly no. I like Python and use it a bit, but I'm far from an expert. Generally speaking, if I had N PDFs to extract where N was a large number, I'd write a process to handle X% of that, or some constant X, and run it at an interval. I.e., if I had 1 million, I'd process maybe 10-20K a day or some such.
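
The fixed-size batch idea can be sketched roughly as follows. Everything here is illustrative: `next_batch`, `run_batch`, the ledger file, and the `process_pdf` callback are hypothetical names, and the actual Extract call is left to the caller.

```python
import os


def next_batch(input_dir, ledger_path, batch_size):
    """Return up to batch_size PDF paths in input_dir not yet listed in the ledger."""
    done = set()
    if os.path.exists(ledger_path):
        with open(ledger_path, "r") as f:
            done = {line.strip() for line in f if line.strip()}
    pending = sorted(
        name for name in os.listdir(input_dir)
        if name.lower().endswith(".pdf") and name not in done
    )
    return [os.path.join(input_dir, name) for name in pending[:batch_size]]


def run_batch(input_dir, ledger_path, batch_size, process_pdf):
    """Process one batch and record each success so the next run skips it."""
    processed = []
    for path in next_batch(input_dir, ledger_path, batch_size):
        process_pdf(path)  # hypothetical callback wrapping the Extract call
        with open(ledger_path, "a") as f:
            f.write(os.path.basename(path) + "\n")
        processed.append(path)
    return processed
```

Scheduling `run_batch` at a fixed interval (for example as a daily job) gives the pattern described above. A file whose processing raises an exception is not written to the ledger, so it is picked up again on the next run.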

New Here, Jan 09, 2023

Ok, thanks for your input.

New Here, Jan 09, 2023

df = spark.read.json("dbfs:/FileStore/pdfservices_api_credentials.json")
df.printSchema()
df.show()

 

Here is the code that I'm using. Please advise.
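
A `_corrupt_record` column is what Spark's JSON reader typically produces for a pretty-printed file: by default it expects one JSON object per line, while a credentials file usually spans several lines. The sketch below illustrates the difference in plain Python and then shows the `multiLine` reader option. It assumes the file is multi-line JSON; `read_multiline_json` is a hypothetical helper, and `spark` is the session a Databricks notebook provides.

```python
import json

# A pretty-printed document, similar in shape to a multi-line credentials file.
# The field names are placeholders, not the real credentials schema.
pretty = '{\n  "outer": {\n    "inner": "value"\n  }\n}'


def parse_line_by_line(text):
    """Mimic a line-delimited reader: parse each line as its own JSON document."""
    results = []
    for line in text.splitlines():
        try:
            results.append(json.loads(line))
        except json.JSONDecodeError:
            results.append(None)  # comparable to a corrupt record
    return results


line_results = parse_line_by_line(pretty)  # no single line is valid JSON
whole_result = json.loads(pretty)          # the document as a whole parses


def read_multiline_json(spark, path):
    """Read a JSON file that spans multiple lines into a DataFrame."""
    return spark.read.option("multiLine", True).json(path)
```

In the notebook this would be called as `read_multiline_json(spark, "dbfs:/FileStore/pdfservices_api_credentials.json")`. The SDK itself reads the credentials through the local `/dbfs/...` path, so a Spark read is only a sanity check on the file contents.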
