PDF Extract form fields observations
Can someone confirm my observations about using PDF Extract on forms? They are based on my own experiments and on reading other posts:
- The extended metadata has two boolean flags for forms: "has_acroform" and "is_XFA"
- PDF Extract does process AcroForms and populates the "has_acroform" flag
- If there is data in the AcroForm, that data becomes an element in the resulting JSON. For example, if the firstname field contains "John" and the lastname field contains "Doe", two elements will be created with the text values "John" and "Doe"
- There is nothing in the JSON to indicate the presence of a form field or to indicate that a piece of text was form data. In other words, there is no way to know that there was a form field called firstname containing "John"; "John" could just as well have come from the regular PDF content.
- Static XFA forms cannot be processed at all and result in an error: "DISQUALIFIED - File not suitable for content extraction: File contains XFA form(s). Not supported for content extraction"
- The is_XFA flag will never be populated because an XFA file will never be processed.
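In case it helps anyone reproduce this, here is a rough Python sketch of how I check the output JSON. It assumes the flags sit under a top-level "extended_metadata" key and that text elements sit in an "elements" array with "Text" and "Path" keys. The sample JSON is a made-up, trimmed-down stand-in for real output, and the helper names form_flags and elements_with_text are my own, not part of any SDK.

```python
import json

def form_flags(structured: dict) -> dict:
    """Return the two form-related flags, or None where a flag is absent."""
    # Assumption: the flags live under a top-level "extended_metadata" object.
    meta = structured.get("extended_metadata", {})
    return {
        "has_acroform": meta.get("has_acroform"),
        "is_XFA": meta.get("is_XFA"),
    }

def elements_with_text(structured: dict, value: str) -> list:
    """Return every element whose text equals `value` (ignoring surrounding whitespace)."""
    # Assumption: extracted text is stored in each element's "Text" key.
    return [
        el for el in structured.get("elements", [])
        if el.get("Text", "").strip() == value
    ]

# Hypothetical, heavily trimmed output for a PDF whose AcroForm holds John / Doe.
sample = json.loads("""
{
  "extended_metadata": {"has_acroform": true, "is_XFA": false},
  "elements": [
    {"Path": "//Document/P", "Text": "John"},
    {"Path": "//Document/P[2]", "Text": "Doe"}
  ]
}
""")

print(form_flags(sample))
for el in elements_with_text(sample, "John"):
    # Only generic keys are present; nothing names the firstname field.
    print(sorted(el.keys()), el["Path"])
```

With real output, the point is the same: the matching element looks like any other paragraph, so the field name cannot be recovered from it.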
Are these observations correct?
