Watermark detection in PDF Extract API

Apr 04, 2022

I'm having issues where watermarking interferes with extracting tables from some documents. I made an example document in which the Extract API fails to detect the existence of a table (see attached). I have other documents I'm trying to extract data from where the watermark also interferes with the Extract API. Unfortunately, I can't share those documents. Hopefully the example document is illustrative enough.

The logging I get is as follows:

INFO:adobe.pdfservices.operation.pdfops.extract_pdf_operation:All validations successfully done. Beginning ExtractPDF operation execution
INFO:adobe.pdfservices.operation.pdfops.extract_pdf_operation:Extract Operation Successful - Transaction ID: 3I6Y4FDws6sqf1xhkgvtFStaOQeVolt6
INFO:adobe.pdfservices.operation.internal.io.file_ref_impl:Moving file at /tmp/extractSdkResult/42afb4a4b44911ec952900155d058d1f.zip to target /home/lei/output/example.zip

I attached the PDF and the service output.
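
For anyone who wants to check the output themselves, here is a minimal sketch of how the result zip could be inspected for table elements. It assumes the zip contains a structuredData.json file with a top-level "elements" list, and that table elements carry a "Path" value containing "/Table"; that is my understanding of the Extract output and is not confirmed in this thread. The function name count_table_elements is my own.

```python
import json
import zipfile


def count_table_elements(zip_path):
    """Count elements in the Extract result whose Path points into a table."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the structured result is stored under this file name.
        with archive.open("structuredData.json") as handle:
            data = json.load(handle)
    # Assumption: each element is a dict with a "Path" string such as
    # "//Document/Table/TR/TD/P" for content inside a table.
    elements = data.get("elements", [])
    return sum(1 for element in elements if "/Table" in element.get("Path", ""))
```

Calling count_table_elements("/home/lei/output/example.zip") on the file from the log above would return 0 if no table was detected, under the assumptions stated.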

Apr 04, 2022

Is there any way to remove the watermark or get the Extract API to ignore the watermark?

Apr 12, 2022

Unfortunately, there is probably no way to make the PDF Extract API ignore the watermark.

It looks like the watermark in your PDF isn't a properly generated watermark. Normally I would suggest going into Acrobat and selecting "Remove Watermark", but that doesn't work here because the watermark isn't written into the file the way a watermark should be; it is just an ordinary object.

You can remove the watermark manually in Adobe Acrobat DC by clicking Edit PDF, then selecting and deleting each of the watermark's characters.

Adobe Community
