
BUG in adobe.pdfservices.operation.internal.io.file_ref_impl

New Here, Feb 13, 2023

Hi,

I'm trying to extract data from a PDF using the code in extract_txt_table_info_with_figure_tables_rendition_from_pdf.py as a template.

 

It works fine and manages to download the file to /tmp/sdk_result/, but then breaks on result.save_as(output_file_path) with the following error:
 
INFO:adobe.pdfservices.operation.internal.io.file_ref_impl:Moving file at /tmp/sdk_result/dbe74a20abb911ed8aaf57ef4efd14e7.zip to target /mnt/batch/tasks/shared/LS_root/mounts/clusters/<pathinfo>/1990965948.zip
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Input In [19], in <cell line: 28>()
     79 output_file_name = name[:-3] + "zip"
     80 output_file_path = output_path + output_file_name
---> 81 result.save_as(output_file_path)
     83 except (ServiceApiException, ServiceUsageException, SdkException):
     84     logging.exception("Exception encountered while executing operation")

File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/adobe/pdfservices/operation/internal/io/file_ref_impl.py:48, in FileRefImpl.save_as(self, destination_file_path)
     46     os.mkdir(dir)
     47 if not os.path.exists(abs_path):
---> 48     os.rename(self._file_path, abs_path)
     49     return
     50 raise SdkException("Output file {file} exists".format(file=destination_file_path))

OSError: [Errno 18] Invalid cross-device link: '/tmp/sdk_result/dbe74a20abb911ed8aaf57ef4efd14e7.zip' -> '/mnt/batch/tasks/shared/LS_root/mounts/clusters/<pathinfo>/1990965948.zip'
 
It looks to me as though the /tmp/sdk_result/ folder is not on the same drive as the user folders, so os.rename is throwing an error. If you search the forum there are several other people who have encountered the same problem in different environments but, other than running everything on local C: (not an option!), there appears to be no resolution.
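One way to check this diagnosis is to compare the device IDs of the two locations. The sketch below is illustrative only: the helper name same_device is hypothetical, and it assumes both paths already exist. If it returns False for the temporary folder and the output folder, os.rename between them would be expected to fail with Errno 18.

```python
import os


def same_device(path_a: str, path_b: str) -> bool:
    """Return True if both existing paths live on the same filesystem.

    os.rename can only move a file within one filesystem, so two paths
    with different st_dev values cannot be renamed into each other.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Hypothetical usage: compare the temp folder with the intended output folder.
# print(same_device("/tmp", os.getcwd()))
```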
 
My workaround has been to set the output path to '/tmp/sdk_result/filename.zip' and then use shutil.copyfile to copy the file to where I want it. This is obviously not ideal, though, and it's not unusual for temporary storage to sit on a different drive from the data.
 
Please can you suggest a better workaround (or change the library so that it works in cloud environments)? Simply being able to specify where the temporary data are stored would be sufficient (or avoiding using os.rename in the library at all).
 
Thanks, James.
Topics: Bug, Python SDK

Views

860


Correct answer

Adobe Employee, Feb 15, 2023

I filed a bug report for this today.

New Here, Feb 14, 2023

Just to add that the key piece of information in the error is here:

     48     os.rename(self._file_path, abs_path)
     49     return
     50 raise SdkException("Output file {file} exists".format(file=destination_file_path))

OSError: [Errno 18] Invalid cross-device link

 

It shows that the Adobe library is using os.rename to move the file from the temporary location to the specified output folder. If I specify a subdirectory of /tmp/sdk_result/ then all is fine - there's no error - but if I specify another location, then I get the above error. This is a known problem, and the solution is to use shutil.move (https://stackoverflow.com/questions/42392600/oserror-errno-18-invalid-cross-device-link)
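To illustrate the difference, here is a minimal sketch of a move that survives a cross-device destination. It is not the library's code, and the function name move_across_devices is hypothetical. In practice shutil.move alone is enough, because it already tries a rename first and falls back to copying; the explicit try/except is only there to show where Errno 18 (EXDEV) comes from.

```python
import errno
import os
import shutil


def move_across_devices(src: str, dst: str) -> None:
    """Move src to dst, falling back to copy-and-delete across filesystems."""
    try:
        # Fast path: works only when src and dst are on the same filesystem.
        os.rename(src, dst)
    except OSError as exc:
        # EXDEV is the "Invalid cross-device link" error; re-raise anything else.
        if exc.errno != errno.EXDEV:
            raise
        # shutil.move copies the data to dst and then removes src.
        shutil.move(src, dst)
```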

 

Please can someone in the dev team look into this? Thanks!

Adobe Employee, Feb 14, 2023

I want to make sure I understand: are you saying the issue occurs if /tmp is not on the same physical drive, or if you try to save the _result_ to a different physical drive?

New Here, Feb 14, 2023

Hi Raymond,

I'm not sure where /tmp actually is, but it is likely to be on a different physical drive from the location I specify for saving the data. I'll paste all my code below. N.B. this code works, but it contains the workaround to avoid the above error.

import logging
import os.path
from pathlib import Path
import shutil

from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_renditions_element_type import \
    ExtractRenditionsElementType
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation


logging.basicConfig(level=os.environ.get("LOGLEVEL", "INFO"))

try:
    # get base path.
    #base_path = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) #example base_path changed to root_path
    root_path = os.getcwd() #to edit based on location
    input_path = root_path + '/still_missing' #to edit based on location
    output_path = root_path + '/still_missing/output/' #to edit based on location
    
    credentials = Credentials.service_account_credentials_builder() \
        .from_file(root_path + "/pdfservices-api-credentials-jt.json") \
        .build()
    
    # Create an ExecutionContext using credentials and create a new operation instance.
    execution_context = ExecutionContext.create(credentials)
    
    filenames = os.listdir(input_path)
    for name in filenames:
        if not name.endswith('.pdf'):
            continue

        print(name)
        
        # Create a new operation instance for this file.
        extract_pdf_operation = ExtractPDFOperation.create_new()

        # Set operation input from the current source file.
        source = FileRef.create_from_local_file(input_path + "/" + name)
        extract_pdf_operation.set_input(source)

        # Build ExtractPDF options and set them into the operation
        extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
            .with_elements_to_extract([ExtractElementType.TEXT, ExtractElementType.TABLES]) \
            .with_elements_to_extract_renditions([ExtractRenditionsElementType.TABLES,
                                                  ExtractRenditionsElementType.FIGURES]) \
            .build()
        extract_pdf_operation.set_options(extract_pdf_options)
    
        # Execute the operation.
        result: FileRef = extract_pdf_operation.execute(execution_context)
    
        # Save the result, using the input file's name with a .zip extension.
        output_file_name = name[:-3] + "zip"

        # Work around a bug in the Adobe library: /tmp/ is on a different mount, and the
        # library renames the file rather than moving it, which fails across mounts.
        output_file_path = '/tmp/sdk_result/' + output_file_name  # output_path + output_file_name
        result.save_as(output_file_path)
        shutil.copyfile(output_file_path, output_path + output_file_name)
        
except (ServiceApiException, ServiceUsageException, SdkException):
    logging.exception("Exception encountered while executing operation")

 

 
You can see where the input file is read from in this line: source = FileRef.create_from_local_file(input_path + "/" + name)
Further down, you can see that I'm setting output_file_path so that it's in the same physical location as the temporary file. The commented-out code is what I'd like to use, i.e. to save the result where I want it saved, but result.save_as fails if I use a location under os.getcwd().
 
This may be a lengthy way of answering your question: if I try to save the result to a different physical drive (strictly, a different mount) from the one /tmp is on, it fails. If you google 'OSError: [Errno 18] Invalid cross-device link' you'll see that this is a known issue when using os.rename, which appears to be what the library uses to move the temporary file to the specified location (line 48 of adobe/pdfservices/operation/internal/io/file_ref_impl.py).
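The workaround in the script above can be wrapped in a small helper so the calling code can keep using the real output path. This is only a sketch: save_result_to and its staging_dir parameter are hypothetical names, and it assumes (as the log output suggests) that the SDK writes its temporary file under /tmp/sdk_result. It uses shutil.move rather than shutil.copyfile so that no copy is left behind in the temporary folder.

```python
import os
import shutil


def save_result_to(result, destination_path: str,
                   staging_dir: str = "/tmp/sdk_result") -> str:
    """Save via the SDK into staging_dir, then move to destination_path.

    `result` is any object with a save_as(path) method, such as the
    FileRef returned by execute(). staging_dir is assumed to be on the
    same filesystem as the SDK's temporary file, so the rename inside
    save_as succeeds.
    """
    staged = os.path.join(staging_dir, os.path.basename(destination_path))
    result.save_as(staged)
    # Make sure the target folder exists before moving the file into it.
    target_dir = os.path.dirname(os.path.abspath(destination_path))
    os.makedirs(target_dir, exist_ok=True)
    # shutil.move copies and deletes when the destination is on another device.
    shutil.move(staged, destination_path)
    return destination_path
```

In the loop above, this would replace the last three lines with a single call such as save_result_to(result, output_path + output_file_name).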
 
Hope this helps, but happy to have another go at describing what I'm seeing if not.

Adobe Employee, Feb 14, 2023

OK, I don't think I can test this myself locally, but it feels like you have a good handle on the issue and enough detail for an error report. One question: is the script itself on the same drive as /tmp, or on the drive where you _wanted_ to save the result?

New Here, Feb 14, 2023

The script is on the same drive where I'd like to save the results. The location of /tmp is unknown. I've dug around in the drive on the compute resource I'm using (Azure Machine Learning notebooks) but can't see it there, so I assume it's elsewhere.
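For anyone trying to work out where /tmp lives relative to their working folder, walking up the directory tree until a mount point is reached can help. This is a minimal sketch with a hypothetical helper name, mount_point; it relies only on os.path.ismount and says nothing about which physical disk backs each mount.

```python
import os


def mount_point(path: str) -> str:
    """Walk up from path until a mount point is reached and return it."""
    current = os.path.realpath(path)
    # The filesystem root is always a mount point, so the loop terminates.
    while not os.path.ismount(current):
        current = os.path.dirname(current)
    return current


# Hypothetical usage: different results here mean os.rename between the
# two locations would cross a mount boundary.
# print(mount_point("/tmp"), mount_point(os.getcwd()))
```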

Adobe Employee, Feb 15, 2023

I filed a bug report for this today.

New Here, Feb 16, 2023

Thanks for your quick follow-up Raymond.

James.
