Participant
February 14, 2022
Question

Questions Regarding Extraction & Conversion to Image Consistency

To whoever sees this: hi! This is my first time posting here, and I'm looking forward to chatting with everyone!

 

That said, I'm integrating the Document Services API so that the PDF submissions I receive are broken down into their components (images and text) for further processing. I've gotten that to work, but my program requires the images and text pulled from a PDF to be 100% consistent in the *long-term*: a PDF submitted now, or at any time in the future, must produce the same images and text when extracted. So my question is basically: will I extract the exact same data despite the changes to the API that are expected to happen in the future?

 

My follow-up question is: if the format of the extracted data *does/would* change in the future, would a PDF conversion to JPG/PNG (excluding the metadata) be 100% consistent between now and any time in the future, such that the two image files would hold the *exact* same bits?
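
A minimal sketch of how "same bits, excluding the metadata" could be checked for PNG output. It hashes only the header, palette, and decompressed image data, so text or timestamp chunks and the compression level do not affect the result. The function names are hypothetical, and this is only one possible definition of "metadata"; two encoders that pick different scanline filters would still produce different digests, so a full pixel decode would be needed to rule that out.

```python
import hashlib
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Chunks that define the image itself; all other chunks are treated as metadata here.
CRITICAL_CHUNKS = {b"IHDR", b"PLTE", b"IDAT"}


def iter_chunks(png_bytes):
    """Yield (chunk_type, chunk_data) pairs from a PNG byte string."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(png_bytes):
        (length,) = struct.unpack(">I", png_bytes[pos:pos + 4])
        chunk_type = png_bytes[pos + 4:pos + 8]
        yield chunk_type, png_bytes[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
        if chunk_type == b"IEND":
            break


def image_digest(png_bytes):
    """SHA-256 over the header, palette, and decompressed image data only."""
    digest = hashlib.sha256()
    compressed = b""
    for chunk_type, data in iter_chunks(png_bytes):
        if chunk_type == b"IDAT":
            compressed += data  # all IDAT chunks together form one zlib stream
        elif chunk_type in CRITICAL_CHUNKS:
            digest.update(chunk_type + data)
    digest.update(zlib.decompress(compressed))
    return digest.hexdigest()
```

Comparing `image_digest(old_png) == image_digest(new_png)` for the same PDF across two API calls would show whether the image data itself changed, independent of any metadata chunks.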

 

Thanks so much for the help with this,

Robert

Joel Geraci
Community Expert
February 28, 2022

The Extract API uses AI/ML to deconstruct the page into reading order "elements". The AI is constantly being trained and updated. I can't imagine the extraction would be 100% identical over time, maybe not even from one API call to another.

 

But I'm curious: once you have the output for a particular PDF, why not just keep it around so you don't need to reprocess it? You're guaranteed 100% consistency then.
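
The "keep it around" idea could look roughly like the sketch below: the stored output is keyed by a SHA-256 hash of the PDF bytes, so the same PDF always maps back to the same saved result. Here `extract` is a hypothetical callable standing in for whatever code calls the Extract API, and the on-disk JSON layout is an assumption made for illustration.

```python
import hashlib
import json
from pathlib import Path


def cached_extract(pdf_bytes, extract, cache_dir):
    """Return the stored extraction for this PDF, calling `extract` only the first time.

    `extract` is a hypothetical callable (bytes -> JSON-serializable dict).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(pdf_bytes).hexdigest()  # identical PDFs get identical keys
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = extract(pdf_bytes)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

With this approach the API is called once per distinct PDF, and later API updates cannot change a result that has already been stored.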

Participant
March 7, 2022

Thanks for the reply, that's really helpful to know! To answer your question, while I definitely *could* hold on to the output, I'm worried that it would be susceptible to tampering or breach (considering the potentially sensitive data in the PDFs). Additionally, I'm developing this program to solve those types of issues, or at least minimize them further, so keeping the output would be counterproductive.
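
For the tampering half of that concern, one common option is to store a keyed hash (HMAC) alongside any retained output and verify it before use. The sketch below uses only the Python standard library; the function names and the way the key is supplied are assumptions. It detects modification only and does nothing against a breach, which would need encryption or not storing the data at all.

```python
import hashlib
import hmac


def sign_output(output_bytes, secret_key):
    """Return a hex HMAC-SHA256 tag for the stored output."""
    return hmac.new(secret_key, output_bytes, hashlib.sha256).hexdigest()


def is_untampered(output_bytes, secret_key, expected_tag):
    """Check the output against its tag using a constant-time comparison."""
    return hmac.compare_digest(sign_output(output_bytes, secret_key), expected_tag)
```

The tag would be computed once when the output is first produced; any later change to the stored bytes makes `is_untampered` return `False`, provided the key itself is kept somewhere an attacker cannot reach.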

 

That all said, I've realized that I can adapt my code to any changes in the resulting JSON and still get the text from the document in order (assuming the JSON continues to house the text). However, I'm not sure how to approach getting the image elements consistently, since there are many potential cases where the same image would result in non-identical image files between two or more API calls, updates, etc. (regarding the actual data, not the metadata). My only other idea is to convert the entire document into a PNG (again, assuming that any conversion API is consistent as well).
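
A minimal sketch of text collection that tolerates changes to the surrounding JSON structure: it walks the whole document and gathers every string stored under a given key, in the order it appears. The key name `Text`, the `elements` and `Path` fields, and the sample data are assumptions made for illustration, not a confirmed description of the Extract API output.

```python
import json


def collect_text(node, key="Text"):
    """Return every string stored under `key`, in document order, at any nesting depth."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_text(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_text(item, key))
    return found


# Made-up sample in the assumed shape: a list of elements, each carrying its text.
sample = json.loads(
    '{"elements": [{"Path": "//Document/H1", "Text": "Title"},'
    ' {"Path": "//Document/P", "Text": "Body text."}]}'
)
print(collect_text(sample))  # ['Title', 'Body text.']
```

Because the walk does not depend on where the text sits in the hierarchy, it would keep working if elements were regrouped or nested differently, as long as the text stays under the same key. It does not help if the API splits or merges the text elements differently between versions.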