PDF Extract form fields observations

Question

Can someone confirm my observations about using PDF Extract on forms? This is based on my experimentation and reviewing other posts:

The extended metadata has two boolean flags for forms: "has_acroform" and "is_XFA"
PDF Extract does process Acroforms and populates the "has_acroform" flag
If there is data in the Acroform that data becomes an element in the resulting JSON. For example if the firstname field has "John" in it and the last name field has "Doe" in it, two elements will be created with the text values of John and Doe
There is nothing in the JSON to indicate the presence of a form field or to indicate that text was form data. In other words no way to know there was a form field called firstname containing John. John could have come from the pdf content.
Static XFA forms cannot be processed at all and result in an error: "DISQUALIFIED - File not suitable for content extraction: File contains XFA form(s). Not supported for content extraction"
The is_XFA flag will never be populated because an XFA file will never be processed.
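
A minimal sketch of reading the two flags from an Extract result, in case it helps reproduce the observations above. It assumes the output JSON has already been loaded and that the flags sit in a top-level "extended_metadata" object; that key name, the helper name form_flags, and the inline sample document are illustrative assumptions, not confirmed API details.

```python
import json


def form_flags(structured_data: dict) -> dict:
    """Return the two form-related flags, or None where a flag is absent."""
    # Assumption: the flags live under a top-level "extended_metadata" key.
    meta = structured_data.get("extended_metadata", {})
    return {
        "has_acroform": meta.get("has_acroform"),
        "is_XFA": meta.get("is_XFA"),
    }


# Hypothetical sample standing in for a real Extract result.
sample = json.loads(
    '{"extended_metadata": {"has_acroform": true, "is_XFA": false}, "elements": []}'
)
flags = form_flags(sample)
print(flags)
```

A value of None from this helper means the flag was missing from the JSON, which is different from the flag being present and false.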

Are these observations correct?

Joel Geraci · Accepted Answer

A few corrections.

"If there is data in the Acroform that data becomes an element in the resulting JSON. For example if the firstname field has "John" in it and the last name field has "Doe" in it, two elements will be created with the text values of John and Doe"

This is actually not the case. Extract does not read values from AcroForm fields, or from XFA fields for that matter. What you are likely seeing is an AcroForm that had been filled and then "flattened". Flattening a PDF form removes the form fields and flattens the values down onto the page as content. It is this content that Extract is able to read. The document metadata will still report an AcroForm because the AcroForm dictionary is still present in the Catalog. It's just empty.

"There is nothing in the JSON to indicate the presence of a form field or to indicate that text was form data. In other words no way to know there was a form field called firstname containing John. John could have come from the pdf content."

This is because after flattening, there are no fields. It's just page content. If you had multiple copies of the same form but with different data, it would be fairly trivial to identify which parts were consistent; the parts that differ are the data fields.
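
The comparison of multiple filled copies could be sketched roughly as follows. This is only an illustration under several assumptions: that each Extract result has an "elements" list whose entries carry "Path" and "Text" keys, and that the same "Path" refers to the same position on the form in every copy (which may not hold if filled values change the detected structure). The helper names and the sample data are hypothetical.

```python
def text_by_path(structured_data: dict) -> dict:
    """Map each element's Path to its stripped Text, skipping non-text elements."""
    return {
        el["Path"]: el["Text"].strip()
        for el in structured_data.get("elements", [])
        if "Path" in el and "Text" in el
    }


def split_static_and_variable(doc_a: dict, doc_b: dict):
    """Separate text shared by both copies from text that differs between them."""
    a, b = text_by_path(doc_a), text_by_path(doc_b)
    static, variable = {}, {}
    # Only paths present in both copies can be compared.
    for path in sorted(a.keys() & b.keys()):
        if a[path] == b[path]:
            static[path] = a[path]  # likely a label or boilerplate
        else:
            variable[path] = (a[path], b[path])  # likely flattened field data
    return static, variable


# Hypothetical results from two flattened copies of the same form.
copy_1 = {"elements": [
    {"Path": "//Document/P", "Text": "First name "},
    {"Path": "//Document/P[2]", "Text": "John "},
]}
copy_2 = {"elements": [
    {"Path": "//Document/P", "Text": "First name "},
    {"Path": "//Document/P[2]", "Text": "Jane "},
]}

static, variable = split_static_and_variable(copy_1, copy_2)
print("static:", static)
print("variable:", variable)
```

Matching on element position (for example page and bounds) instead of "Path" would be an alternative if the paths turn out not to line up between copies.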
