Hi,
I'm trying to extract data from a PDF, using the code in extract_txt_table_info_with_figure_tables_rendition_from_pdf.py as a template.
It works fine - and manages to download the file to /tmp/sdk_result/ but then breaks on
Just to add that the key piece of information in the error is here:
48 os.rename(self._file_path, abs_path)
49 return
50 raise SdkException("Output file {file} exists".format(file=destination_file_path))
OSError: [Errno 18] Invalid cross-device link
It shows that the Adobe library is using os.rename to move the file from the temporary location to the specified output folder. If I specify a subdirectory of /tmp/sdk_result/ then all is fine - there's no error - but if I specify another location, then I get the above error. This is a known problem, and the solution is to use shutil.move (https://stackoverflow.com/questions/42392600/oserror-errno-18-invalid-cross-device-link)
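To illustrate the difference, here is a minimal sketch of a move that survives a cross-device destination. It is not the SDK's code: `move_file` is a hypothetical helper name, and the fallback relies only on the standard library. On Linux, errno 18 is `errno.EXDEV`, which is what `os.rename` raises when source and destination are on different filesystems; `shutil.move` copies and then deletes in that case.

```python
import errno
import os
import shutil
import tempfile


def move_file(src: str, dst: str) -> str:
    """Move src to dst, falling back to copy-and-delete across filesystems.

    Hypothetical helper for illustration; not part of the Adobe SDK.
    """
    try:
        # Fast path: works only when src and dst are on the same filesystem.
        os.rename(src, dst)
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise  # Some other failure (permissions, missing directory, ...).
        # Cross-device case: shutil.move copies the file, then removes src.
        shutil.move(src, dst)
    return dst


if __name__ == "__main__":
    # Small demonstration inside a throwaway directory.
    with tempfile.TemporaryDirectory() as work_dir:
        source = os.path.join(work_dir, "result.zip")
        with open(source, "wb") as handle:
            handle.write(b"example")
        target = os.path.join(work_dir, "moved.zip")
        print(move_file(source, target))
```

In practice, calling `shutil.move` directly would be enough, since it already tries a rename first and only copies when that fails.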
Please can someone in the dev team look into this? Thanks!
I want to make sure I understand - are you saying the issue occurs if /tmp is not on the same physical drive? Or that it occurs if you try to save the _result_ to a different physical drive?
Hi Raymond,
I'm not sure where /tmp actually is, but it is likely to be on a different physical drive from the location that I specify to save the data. I'll paste all my code below. N.B. this code works, but contains the workaround to avoid the above error.
import logging
import os.path
from pathlib import Path
import shutil

from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_renditions_element_type import \
    ExtractRenditionsElementType
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation

logging.basicConfig(level=os.environ.get("LOGLEVEL", "INFO"))

try:
    # get base path.
    #base_path = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) #example base_path changed to root_path
    root_path = os.getcwd() #to edit based on location
    input_path = root_path + '/still_missing' #to edit based on location
    output_path = root_path + '/still_missing/output/' #to edit based on location

    credentials = Credentials.service_account_credentials_builder() \
        .from_file(root_path + "/pdfservices-api-credentials-jt.json") \
        .build()

    # Create an ExecutionContext using credentials and create a new operation instance.
    execution_context = ExecutionContext.create(credentials)

    filenames = os.listdir(input_path)
    for name in filenames:
        if not name.endswith('.pdf'):
            continue
        print(name)

        # Initial setup, create credentials instance.
        extract_pdf_operation = ExtractPDFOperation.create_new()

        # Set operation input from a source file.
        # source = FileRef.create_from_local_file(input_path + "/Abroms_2014.pdf")
        # extract_pdf_operation.set_input(source)
        source = FileRef.create_from_local_file(input_path + "/" + name)
        extract_pdf_operation.set_input(source)

        # Build ExtractPDF options and set them into the operation
        extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
            .with_elements_to_extract([ExtractElementType.TEXT, ExtractElementType.TABLES]) \
            .with_elements_to_extract_renditions([ExtractRenditionsElementType.TABLES,
                                                  ExtractRenditionsElementType.FIGURES]) \
            .build()
        extract_pdf_operation.set_options(extract_pdf_options)

        # Execute the operation.
        result: FileRef = extract_pdf_operation.execute(execution_context)

        # Save the result to the specified location.
        #giving same name as input but saving as zip files
        output_file_name = name[:-3] + "zip"
        # getting around a bug in the adobe library. (The /tmp/ folder is on a different mount, and throws an error, as the library tries to rename, rather than move the file)
        output_file_path = '/tmp/sdk_result/' + output_file_name #output_path + output_file_name
        result.save_as(output_file_path)
        shutil.copyfile(output_file_path, output_path + output_file_name)
except (ServiceApiException, ServiceUsageException, SdkException):
    logging.exception("Exception encountered while executing operation")
Ok, I don't think I can test this myself locally, but it feels like you have a good handle on the issue and enough detail for an error report. Query, the script itself, is it on the same drive as /tmp, or the same drive as where you _wanted_ to save it?
The script is on the same drive where I'd like to save the results. The location of /tmp is unknown. I've dug around in the drive on the compute resource I'm using (Azure Machine Learning notebooks), but can't see it there, so I assume it's elsewhere.
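One way to confirm this without locating /tmp physically is to compare device IDs. The following is a rough sketch under my own assumptions: `same_device` is a hypothetical helper, and `tempfile.gettempdir()` stands in for the /tmp location the SDK writes to. If the two paths report different `st_dev` values, `os.rename` between them would be expected to fail with the cross-device error.

```python
import os
import tempfile


def same_device(path_a: str, path_b: str) -> bool:
    """Return True if both existing paths are on the same filesystem (device).

    Hypothetical helper for illustration; compares the st_dev field of os.stat.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    temp_dir = tempfile.gettempdir()  # Usually /tmp on Linux.
    working_dir = os.getcwd()         # Where the script and output live.
    print(temp_dir, working_dir, same_device(temp_dir, working_dir))
```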
I filed a bug report for this today.
Thanks for your quick follow-up Raymond.
James.