Unable to correctly extract tables from a PDF document using the PDF Extract API

Question

Hello everyone,

Use case: I am using the PDF Extract API service to extract the tables within a PDF.

Tech stack: .NET. The NuGet package is Adobe.PDFServicesSDK, version 3.0.0.

Problem: In a given table, if all the cells of a particular column are empty, that column gets merged with the next column. (Both the PDF file and the output file are attached.)

Expected output: CSV files

Here is the sample code:

Adobe.PDFServicesSDK.ExecutionContext executionContext = Adobe.PDFServicesSDK.ExecutionContext.Create(credentials);

ExtractPDFOperation extractPdfOperation = ExtractPDFOperation.CreateNew();
FileRef sourceFileRef = FileRef.CreateFromStream(pdfFileStream, "application/pdf");
extractPdfOperation.SetInputFile(sourceFileRef);

// Build ExtractPDF options and set them into the operation.
ExtractPDFOptions extractPdfOptions = ExtractPDFOptions.ExtractPDFOptionsBuilder()
    .AddElementsToExtract(new List<ExtractElementType>(new[] { ExtractElementType.TABLES }))
    .AddTableStructureFormat(TableStructureType.CSV)
    .Build();

extractPdfOperation.SetOptions(extractPdfOptions);

// Lock & Execute the operation.
FileRef resultZipFile = extractPdfOperation.Execute(executionContext);

Error in the CSV: 6 columns are expected, but only 5 appear in the CSV.

PDF file being parsed: (attached)

Please help. Thanks,
AD

Anil B Dugar · Answer

Try parsing the attached file and you will be able to reproduce the issue. Also note that I tried parsing it with the Amazon Textract service, and it works!
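
One way to confirm the missing column without opening each file by hand is to count the fields per row in every CSV inside the result ZIP. The following is only a minimal sketch under some assumptions: the result has already been saved to disk (for example via resultZipFile.SaveAs), the CSVs are comma-delimited, and no quoted field spans multiple lines. CsvColumnCheck, CountFields and MaxColumnsPerCsv are hypothetical helper names, not part of the Adobe SDK.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class CsvColumnCheck
{
    // Counts the fields in one CSV record. Commas inside double-quoted
    // fields are ignored; an escaped quote ("") toggles the state twice,
    // so it leaves the quoting state unchanged.
    public static int CountFields(string line)
    {
        int count = 1;
        bool inQuotes = false;
        foreach (char c in line)
        {
            if (c == '"') inQuotes = !inQuotes;
            else if (c == ',' && !inQuotes) count++;
        }
        return count;
    }

    // Returns the widest row (in fields) for every .csv entry in the ZIP.
    // Assumption: zipPath points to the saved Extract API result archive.
    public static Dictionary<string, int> MaxColumnsPerCsv(string zipPath)
    {
        var result = new Dictionary<string, int>();
        using (ZipArchive archive = ZipFile.OpenRead(zipPath))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                if (!entry.FullName.EndsWith(".csv", StringComparison.OrdinalIgnoreCase))
                    continue;

                using (var reader = new StreamReader(entry.Open()))
                {
                    int max = 0;
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        // Skip blank lines so they are not counted as one field.
                        if (line.Length > 0) max = Math.Max(max, CountFields(line));
                    }
                    result[entry.FullName] = max;
                }
            }
        }
        return result;
    }
}
```

Calling CsvColumnCheck.MaxColumnsPerCsv on the saved archive and comparing each value with the expected width (6 for the table in question) would flag any CSV where an empty column was merged away. On .NET Framework, ZipFile additionally needs a reference to the System.IO.Compression.FileSystem assembly.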
