Questions Regarding Extraction & Conversion to Image Consistency

Question

To whomever sees this: Hi! First time messaging here, and looking forward to chatting with everyone!

That said, I'm integrating the Document Services API so that the PDF submissions I receive are broken down into their components (images and text) for further processing. I've gotten that to work, but my program requires the images and text pulled from a PDF to be 100% consistent in the *long term*, such that a PDF submitted now, or at any time in the future, produces the same images and text when extracted. So my question is basically: will I extract the exact same data despite the changes that are expected to happen to the API in the future?

My follow-up question: if the format of the extracted data *does* change in the future, would a PDF conversion to JPG/PNG (excluding the metadata) be 100% consistent between now and any time in the future, such that the two image files hold the *exact* same bits?

Thanks so much for the help with this,

Robert

Joel Geraci · Answer

The Extract API uses AI/ML to deconstruct the page into reading order "elements". The AI is constantly being trained and updated. I can't imagine the extraction would be 100% identical over time, maybe not even from one API call to another.

But I'm curious: once you have the output for a particular PDF, why not just keep it around so you don't need to reprocess it? You're guaranteed 100% consistency then.
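
For what it's worth, here is a minimal sketch of that "keep it around" idea in Python: store the extraction result under the SHA-256 hash of the PDF's bytes, and only call the API when that hash hasn't been seen before. The names `pdf_digest`, `extract_once`, and the `extract` callable are all hypothetical. `extract` stands in for whatever code you already use to call the API, and the sketch assumes it returns something JSON-serializable.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def pdf_digest(pdf_bytes: bytes) -> str:
    """Return a stable identifier derived only from the PDF's bytes."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def extract_once(
    pdf_bytes: bytes,
    extract: Callable[[bytes], dict],  # hypothetical wrapper around your API call
    cache_dir: Path,
) -> dict:
    """Return the stored result for this PDF, calling `extract` only on first sight."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{pdf_digest(pdf_bytes)}.json"

    # Same bytes seen before: reuse the stored output instead of reprocessing.
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))

    # First time: run the extraction and persist the result.
    result = extract(pdf_bytes)
    cache_file.write_text(json.dumps(result, sort_keys=True), encoding="utf-8")
    return result
```

Calling `extract_once(pdf_bytes, my_extract, Path("cache"))` twice with the same bytes should hit the API only once. Extracted images could be stored next to the JSON file in the same way, keyed by the same hash.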
