Can someone confirm my observations about using PDF Extract on forms? This is based on my experimentation and reviewing other posts:
Are these observations correct?
A few corrections
This is actually not the case. Extract does not read values from AcroForm fields, or from XFA fields for that matter. What you are likely seeing is an AcroForm that was filled in and then "flattened". Flattening a PDF form removes the form fields and flattens the values down onto the page as content. It is this content that Extract is able to read. The document metadata will still report an AcroForm because the AcroForm dictionary is still present in the Catalog. It's just empty.
This is because after flattening, there are no fields. It's just page content. If you had multiple copies of the same form but with different data, it would be fairly trivial to identify which parts were consistent and then the parts that are different are the data fields.
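To illustrate the comparison idea, here is a minimal sketch in Python. It assumes you have already pulled the text elements out of each filled copy into a plain list of strings, and that every copy yields the same number of elements in the same order. Both are assumptions that real Extract output may not satisfy. The function name and the sample data are hypothetical.

```python
from typing import List, Tuple


def split_constant_and_variable(copies: List[List[str]]) -> Tuple[List[int], List[int]]:
    """Compare text elements position by position across filled copies.

    Assumption: each inner list holds the text elements of one flattened
    copy, in the same order and with the same length as the others.
    Returns (constant_indexes, variable_indexes).
    """
    if len(copies) < 2:
        raise ValueError("need at least two filled copies to compare")
    length = len(copies[0])
    if any(len(copy) != length for copy in copies):
        raise ValueError("copies do not have the same number of text elements")

    constant: List[int] = []
    variable: List[int] = []
    for index, values in enumerate(zip(*copies)):
        # Identical text in every copy is probably a label or static text.
        if len(set(values)) == 1:
            constant.append(index)
        else:
            # Text that differs between copies is probably filled-in data.
            variable.append(index)
    return constant, variable


# Hypothetical sample data standing in for two filled copies of one form.
copy_a = ["Name:", "Alice", "City:", "Paris"]
copy_b = ["Name:", "Bob", "City:", "Paris"]
constant, variable = split_constant_and_variable([copy_a, copy_b])
print(constant, variable)
```

Note that "Paris" lands in the constant group here because both people entered the same city. The more copies with different data you compare, the less likely that kind of false match becomes.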
Thanks Joel. You are quite right. I thought I had an AcroForm, but it had in fact been flattened. I re-ran my tests with an unflattened AcroForm and no field data came back. So in summary:
XFA form = error and no processing; the is_XFA flag is never populated because no JSON is produced.
AcroForm = no error, the file is processed, and has_acroform is set to true, but no fields or field data are converted to JSON.
Correct. I'll just add that I think the "is_XFA" property is a side effect of using the same PDF profiling code that is used by the PDF Properties API. In the case of Extract, it's never going to be true. In the case of PDF Properties, it will be true if the document is either static or dynamic XFA. So the following table can be used to tell what kind of form you have ahead of time using the PDF Properties API.
has_acroform: true, is_XFA: false = AcroForm
has_acroform: true, is_XFA: true = Static XFA
has_acroform: false, is_XFA: true = Dynamic XFA
Again with the caveat that the AcroForm dictionary in the PDF will remain even after the form has been flattened or all fields have been deleted. Once it's there, it's there, at least when using Adobe tools.
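If you want to apply that table in code, a minimal sketch in Python could look like the following. It takes the two flags as plain booleans; how you read them out of the PDF Properties JSON is left out, because the exact response layout is not covered in this thread. The function name is hypothetical, and the "no form" result for the false/false case is my assumption since the table does not list it.

```python
def classify_form(has_acroform: bool, is_xfa: bool) -> str:
    """Map the has_acroform / is_XFA flags to a form type.

    The first three cases follow the table in the post above.
    The false/false case is an assumption; the table does not cover it.
    """
    if has_acroform and not is_xfa:
        # Caveat: the AcroForm dictionary can remain after flattening or
        # after all fields are deleted, so there may be no live fields.
        return "AcroForm"
    if has_acroform and is_xfa:
        return "Static XFA"
    if is_xfa:
        return "Dynamic XFA"
    return "No form detected"


print(classify_form(True, False))
print(classify_form(True, True))
print(classify_form(False, True))
```

An "AcroForm" result only tells you the dictionary exists, so a flattened file will still land in that bucket.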