PDF Extract form fields observations

Report · Apr 21, 2022

Can someone confirm my observations about using PDF Extract on forms? This is based on my experimentation and reviewing other posts:

The extended metadata has two boolean flags for forms: "has_acroform" and "is_XFA"
PDF Extract does process Acroforms and populates the "has_acroform" flag
If there is data in the Acroform that data becomes an element in the resulting JSON. For example if the firstname field has "John" in it and the last name field has "Doe" in it, two elements will be created with the text values of John and Doe
There is nothing in the JSON to indicate the presence of a form field or to indicate that text was form data. In other words no way to know there was a form field called firstname containing John. John could have come from the pdf content.
Static XFA forms cannot be processed at all and result in an error: "DISQUALIFIED - File not suitable for content extraction: File contains XFA form(s). Not supported for content extraction"
The is_XFA flag will never be populated because an XFA file will never be processed.

Are these observations correct?

Report · Apr 21, 2022

A few corrections

If there is data in the Acroform that data becomes an element in the resulting JSON. For example if the firstname field has "John" in it and the last name field has "Doe" in it, two elements will be created with the text values of John and Doe

This is actually not the case. Extract does not read values from AcroForm fields... or XFA fields for that matter. What you are likely seeing is an AcroForm that was filled in and then "flattened". Flattening a PDF form removes the form fields and flattens the values down onto the page as content. It is this content that Extract is able to read. The document metadata will still report an AcroForm because the AcroForm dictionary is still present in the Catalog. It's just empty.

There is nothing in the JSON to indicate the presence of a form field or to indicate that text was form data. In other words no way to know there was a form field called firstname containing John. John could have come from the pdf content.

This is because after flattening, there are no fields. It's just page content. If you had multiple copies of the same form but with different data, it would be fairly trivial to identify which parts are the same in every copy; the parts that differ are the data fields.
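
To make the comparison idea concrete, here is a rough Python sketch. It assumes you have already pulled the text of each flattened copy out of the Extract JSON into a plain list of strings; the function name and the sample data are made up for illustration and are not part of any Adobe API.

```python
from typing import Dict, List


def split_constant_and_variable(docs: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return, per document, the text runs that do not appear in every copy."""
    # Text present in every copy is treated as the form's fixed labels.
    constant = set.intersection(*(set(texts) for texts in docs.values()))
    # Whatever is left in each copy is a candidate for filled-in data.
    return {name: [t for t in texts if t not in constant] for name, texts in docs.items()}


# Hypothetical text runs from two flattened copies of the same form.
copies = {
    "form_a": ["First name", "John", "Last name", "Doe"],
    "form_b": ["First name", "Jane", "Last name", "Smith"],
}
candidates = split_constant_and_variable(copies)
print(candidates)  # {'form_a': ['John', 'Doe'], 'form_b': ['Jane', 'Smith']}
```

Note that a value that happens to be identical in every copy (everyone has the same last name, say) would be mistaken for a label, so more copies with more varied data give a better result. Comparing by position on the page rather than by text alone would probably be more robust.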

Report · Apr 22, 2022

Thanks Joel. You are quite right. I thought I had an Acroform but it had in fact been flattened. I re-ran my tests with an Acroform and there is no data. So in summary:

XFA Form = error and no processing; the is_XFA flag is never populated because no JSON is produced.
Acroform = no error, file is processed, has_acroform is set to true, but no fields or field data are converted to JSON.

Report · Apr 22, 2022

Correct. I'll just add that I think the "is_XFA" property is a side effect of using the same PDF profiling code that is used by the PDF Properties API. In the case of Extract, it's never going to be true. In the case of PDF Properties, it will be true if the document is either static or dynamic XFA. So the following table can be used to tell what kind of form you have ahead of time using the PDF Properties API.

has_acroform: true, is_XFA: false = AcroForm

has_acroform: true, is_XFA: true = Static XFA

has_acroform: false, is_XFA: true = Dynamic XFA

Again with the caveat that the AcroForm dictionary in the PDF will remain even after the form has been flattened or all fields have been deleted. Once it's there, it's there, at least when using Adobe tools.
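
If it helps, the table above could be written as a small Python helper. This is only a sketch: it takes the two flags as plain booleans, so how you read them out of the PDF Properties response is left to you, and the function name and the "No form" result for the false/false case are my own assumptions rather than anything stated in the table.

```python
def classify_form(has_acroform: bool, is_xfa: bool) -> str:
    """Map the two metadata flags to a form type, following the table above."""
    if is_xfa:
        # XFA with an AcroForm dictionary is static XFA; without one, dynamic XFA.
        return "Static XFA" if has_acroform else "Dynamic XFA"
    if has_acroform:
        # Caveat: a flattened form can still carry an empty AcroForm dictionary.
        return "AcroForm"
    # Assumption: neither flag set is taken to mean there is no form at all.
    return "No form"


print(classify_form(True, False))   # AcroForm
print(classify_form(True, True))    # Static XFA
print(classify_form(False, True))   # Dynamic XFA
```

Because of the caveat about leftover AcroForm dictionaries, an "AcroForm" result only tells you the dictionary is present, not that the document still has live fields.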
