How do you collect text data from filled forms in local PDF files?

Report · Oct 04, 2020

I have about 80 local PDF files containing forms that have been filled in by students. I would like to extract the text data from them so that I can easily score their answers. How can I do that with the latest Acrobat Pro? I need to do it on local files.

Report · Oct 05, 2020

Hi there,

We are sorry for the trouble. As described, you want to extract data from the filled PDF form.

Please try the following steps and see if they help:

1. In Acrobat, open the response file and select the data to export.
2. In the left navigation panel, click Export, and then choose Export Selected.
3. In the Select Folder To Save File dialog box, specify a name, location, and file format (CSV or XML) for the form data, and click Save.

For more information please look at the help page https://helpx.adobe.com/in/acrobat/using/collecting-pdf-form-data.html#export_user_data_from_a_respo...

Regards

Amal

Report · Oct 05, 2020

The PDF files were collected via a web form as file attachments, so the individual users have not submitted the form. In this case, how do I create and initialize the response file you mentioned? Thank you very much for your help.

Report · Oct 05, 2020

You didn't mention your version of Acrobat but it can be done using the Merge Data Files into Spreadsheet command, which is under Tools - Prepare Form (and then under More Form Options, in some versions).

Report · Oct 05, 2020

Thank you very much. It is what I was looking for and it worked, but all the Japanese characters in the form fields are broken after exporting to a CSV file.

Report · Oct 05, 2020

The encoding of the file created is UTF8, which might not cover Japanese characters. In order to do that you would need to use some other tool, I'm afraid. Maybe try exporting files as TXT or FDF files, and then merge them using a different utility. Another option is to use a script to do it, instead of the built-in Merge Data Files command.
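One way to narrow this down is to check whether an exported file is valid UTF-8 and whether any non-ASCII characters survived the export. The sketch below is Node.js rather than Acrobat JavaScript, and `inspectEncoding` is a hypothetical helper name, not part of any Acrobat API. It only illustrates the check; it does not fix the export itself.

```javascript
// Sketch (Node.js, not Acrobat JavaScript): report whether a byte array is
// valid UTF-8 and whether it contains any bytes outside the ASCII range.
// `inspectEncoding` is a hypothetical helper name.
function inspectEncoding(bytes) {
  // Any byte above 0x7F means the data is not plain ASCII.
  const hasNonAscii = bytes.some((b) => b > 0x7f);
  let validUtf8 = true;
  try {
    // fatal: true makes decode() throw on byte sequences that are not valid UTF-8.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (e) {
    validUtf8 = false;
  }
  return { validUtf8, hasNonAscii };
}

// UTF-8 can represent Japanese text: encoding and decoding returns the same string.
const sample = "日本語 Japanese 日本語";
const utf8Bytes = new TextEncoder().encode(sample);
const roundTrip = new TextDecoder("utf-8").decode(utf8Bytes);
console.log(roundTrip === sample, inspectEncoding(utf8Bytes));
```

For a file on disk, `inspectEncoding(require("fs").readFileSync("merged.csv"))` would apply the same check (the file name is a placeholder). A result with `validUtf8` true but `hasNonAscii` false suggests the Japanese text was already replaced before the file was written.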

Report · Oct 05, 2020

Thank you again. The text encoding appears to be UTF-8, because I could extract the field text using PyPDF2, a Python module that can read PDF forms. For the moment, PyPDF2 is good enough for my purpose, but your suggestion to use the native Acrobat functionality was much easier, except for the Japanese character problem.

If I find a fix for my problem, I will post it in this thread for someone else.

Report · Oct 05, 2020

Can you share a sample file with fields that have Japanese text in them?

Report · Oct 05, 2020

Here is a sample file.

https://www.dropbox.com/s/faupq7447hb84b9/sample.pdf?dl=0

"Answer1" and "Answer2" should be "日本語 Japanese 日本語" but they are converted to "... Japanese ...".

Report · Oct 05, 2020

When exporting it in UTF-8 explicitly it does seem to work correctly. I guess the default encoding is just plain ANSI, then. You can use this code I wrote to export it properly (you can run it from the JS Console, or from an Action, or something like that):

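// Collect the name and value of every fillable field in the open document.
var names = [];
var values = [];
for (var i=0; i<this.numFields; i++) {
	var f = this.getField(this.getNthFieldName(i));
	if (f==null) continue;
	// Buttons and signature fields hold no answer text, so skip them.
	if (f.type=="button" || f.type=="signature") continue;
	names.push(f.name);
	// Replace tabs and line breaks so each value stays in its own column.
	values.push(f.valueAsString.replace(/[\t\r\n]+/g, " "));
}

// Create a temporary attachment named after the PDF, e.g. sample_data.txt.
var doName = this.documentFileName.replace(/\.pdf$/i, "_data.txt");
this.createDataObject(doName, "");
// One tab-delimited row of field names, then one row of values, encoded as UTF-8.
var s = names.join("\t") + "\r\n" + values.join("\t");
this.setDataObjectContents(doName, util.streamFromString(s, "utf-8"));
// Save the attachment to disk, then remove it from the PDF again.
this.exportDataObject(doName);
this.removeDataObject(doName);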