Extract and replace text in PDF

Jun 10, 2021

Hi all,

I have a PDF generated in Apple Pages. The PDF has several data tables / text boxes per page.

Wondering if the following is possible:

Detect all text content on page, whether in tables or text boxes, etc.

Extract text and replace w/ other text, thus generating the same doc w/ entirely new text.

I've looked at some of the OCR and text extraction samples/docs. Extracting the data seems pretty straightforward. But can I replace it? The use case is translating the doc to another language.

Jun 10, 2021

Adobe doesn't have a Document Services API to do this. That said, it's not really something you'll want to do. Text in a PDF is generally laid down at precise coordinates, so it can't be replaced with text of a different length without causing overlaps.

If you want to create new PDF files with different text, I suggest looking at the Document Generation API which will allow you to start with a Word template plus some JSON and output a PDF where the JSON is merged into tagged "fields" in the document.

Jun 15, 2021

Hi Joel, thanks for the feedback.

That makes sense. Looks like using a template is the way to go. We don't
use Word though, and manually tagging our complex documents would be
really time-consuming.

I wonder if something like this would work:
> Extract text using the Extract API, thus getting content and coordinates
> Populate a new blank PDF with the data, using the coordinate info to insert text in the blank document
(I would use another PDF JS library for this part, if needed)

I downloaded the sample JSON output from the Extract API, and see a bunch
of coordinate data.
Could I use that to position text in a new document?
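
To make the idea above concrete, here is a minimal sketch of turning Extract-style elements into positioned drawing instructions. The field names (`Text`, `Page`, `Bounds`, `TextSize`) and the `[left, bottom, right, top]` order of `Bounds` are assumptions based on the sample JSON and should be checked against your own output. `DrawInstruction` and `toDrawInstructions` are hypothetical names, not part of any Adobe SDK.

```typescript
// Shape of one text element. ASSUMED from the Extract API sample JSON;
// verify the field names and units against your own output.
interface ExtractElement {
  Text?: string;
  Page: number; // page index (assumption)
  Bounds?: [number, number, number, number]; // [left, bottom, right, top] (assumption)
  TextSize?: number; // font size (assumption)
}

// Hypothetical, library-neutral instruction for whatever PDF library
// ends up drawing the text into the new document.
interface DrawInstruction {
  page: number;
  x: number;
  y: number;
  size: number;
  maxWidth: number;
  text: string;
}

function toDrawInstructions(elements: ExtractElement[]): DrawInstruction[] {
  const out: DrawInstruction[] = [];
  for (const el of elements) {
    // Skip elements with no text or no position (figures, containers, etc.).
    if (!el.Text || !el.Bounds) continue;
    const [left, bottom, right] = el.Bounds;
    out.push({
      page: el.Page,
      x: left,
      y: bottom,
      size: el.TextSize ?? 12, // fall back to 12 when no size is reported
      maxWidth: right - left,
      text: el.Text.trim(),
    });
  }
  return out;
}
```

Each instruction could then be handed to a PDF library's text-drawing call. This only carries over position and size; fonts, colors, table borders, and images would need separate handling.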

Jun 15, 2021

Do you control the authoring process of these documents before they get converted to PDF, or is this a situation where you need to take what you are given?

Jun 16, 2021

Yes, I create them in Apple Pages.

Jun 16, 2021

So you have the source files? Why do you need the Extract service then? I'm missing something.

Jun 16, 2021

The PDFs change depending on our clients' needs: we use tables and formulas to show/hide different options, and we are always making customizations to them on the fly (adding notes, etc.). They also change over time. I want to completely automate translating new features in the file.

The end goal is to translate the PDFs. The translation part I have covered (Google Translate API).
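
Because translated strings are often longer than the source, the overlap problem mentioned earlier in the thread still applies when re-placing text at the original coordinates. A rough sketch of one workaround is to shrink the font size until the translated text fits the original box width. Everything here is hypothetical: `TextBox`, `estimateWidth`, `fitTranslated`, and `translateBoxes` are illustrative names, the `translate` callback stands in for whatever translation client is used, and the 0.5 width factor is a crude assumption rather than real font metrics.

```typescript
// Hypothetical text box: the original text, its font size, and the
// width available for it on the page.
interface TextBox {
  text: string;
  size: number;
  maxWidth: number;
}

// Crude width estimate: assumes an average glyph is about half the font
// size wide. Real code should measure with the actual font's metrics.
function estimateWidth(text: string, size: number): number {
  return text.length * size * 0.5;
}

// Swap in the translated text, scaling the font down (never below
// minSize) when the estimate says it would overflow the box.
function fitTranslated(box: TextBox, translated: string, minSize = 6): TextBox {
  const width = estimateWidth(translated, box.size);
  let size = box.size;
  if (width > box.maxWidth) {
    size = Math.max(minSize, box.size * (box.maxWidth / width));
  }
  return { ...box, text: translated, size };
}

// Translate every box with an injected async translate function
// (a stand-in for the real translation API call).
async function translateBoxes(
  boxes: TextBox[],
  translate: (s: string) => Promise<string>,
): Promise<TextBox[]> {
  return Promise.all(
    boxes.map(async (b) => fitTranslated(b, await translate(b.text))),
  );
}
```

Text clamped at `minSize` may still overflow, so wrapping or flagging those boxes for manual review would be a sensible next step.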

Adobe Community