I have a PDF generated in Apple Pages. The PDF has several data tables and text boxes per page.
Wondering if the following is possible:
Detect all text content on page, whether in tables or text boxes, etc.
Extract text and replace w/ other text, thus generating the same doc w/ entirely new text.
I've looked at some of the OCR and text extraction samples/docs. Extracting the data seems pretty straightforward. But can I replace it? The use case is translating the doc to another language.
Adobe doesn't have a Document Services API to do this. That said, it's not really something you'll want to do. Text in a PDF is generally laid down at precise coordinates, so replacing it with text of a different length (as a translation usually is) tends to cause overlaps with neighboring content.
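To illustrate the overlap problem, here is a minimal sketch. It assumes a crude model where every character is about half the font size wide (real PDF fonts have per-glyph widths), and the names `TextRun`, `estimated_width`, and `overflows` are hypothetical, not part of any Adobe API.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """A piece of text placed at a fixed position (hypothetical model)."""
    text: str
    x: float            # left edge, in points
    max_width: float    # horizontal space before the next element, in points
    font_size: float = 12.0


def estimated_width(text: str, font_size: float) -> float:
    # Assumption: the average glyph is roughly half the font size wide.
    return len(text) * font_size * 0.5


def overflows(run: TextRun, replacement: str) -> bool:
    """Return True if the replacement would spill past the original's space."""
    return estimated_width(replacement, run.font_size) > run.max_width


# A table cell sized for the English label.
run = TextRun(text="Price", x=72.0, max_width=40.0)

print(overflows(run, "Price"))         # the original label fits
print(overflows(run, "Precio total"))  # a longer translated label does not
```

Nothing in the PDF reflows to make room, so a longer replacement simply runs into whatever sits next to it.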
If you want to create new PDF files with different text, I suggest looking at the Document Generation API which will allow you to start with a Word template plus some JSON and output a PDF where the JSON is merged into tagged "fields" in the document.
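For the translation use case, one way this could look is to translate the JSON data before it is merged into the template. The sketch below is only an illustration under assumptions: `translate_json` is a hypothetical helper, `fake_translate` is a stand-in for whatever translation call you already have, and the field names in `payload` are made up. It does not call the Document Generation API itself.

```python
import json
from typing import Any, Callable


def translate_json(data: Any, translate: Callable[[str], str]) -> Any:
    """Return a copy of the JSON-like data with every string value translated.

    Keys are left untouched so they still match the template's tagged fields.
    """
    if isinstance(data, str):
        return translate(data)
    if isinstance(data, list):
        return [translate_json(item, translate) for item in data]
    if isinstance(data, dict):
        return {key: translate_json(value, translate) for key, value in data.items()}
    # Numbers, booleans, and None pass through unchanged.
    return data


def fake_translate(text: str) -> str:
    # Placeholder only: swap in a real translation call here.
    return f"[es] {text}"


# Hypothetical merge data; real field names depend on your template.
payload = {
    "title": "Quote",
    "options": [{"name": "Basic", "price": 100}],
    "notes": ["Valid 30 days"],
}

translated = translate_json(payload, fake_translate)
print(json.dumps(translated, indent=2))
```

Because only the values change, the same template can be reused for every language, and the layout is regenerated around the new text instead of being patched in place.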
Do you control the authoring process of these documents before they get converted to PDF, or is this a situation where you need to take what you are given?
Yes, I create them in Apple Pages.
So you have the source files? Why do you need the Extract service then? I'm missing something.
The PDFs change depending on our clients' needs. We use tables and formulas to show/hide different options, and we are always making customizations to them on the fly (adding notes, etc.). They also change over time. I want to completely automate translating new features in the file.
The end goal is to translate the PDFs. The translation part I have covered (Google Translate API).