I wrote a script to merge the data of different forms into a .csv file attached to a document.
Everything works fine, except for the accented characters, which don't appear correctly in the .csv file.
To export the data, I use util.streamFromString with the "utf-8" setting.
I tried all the other settings, but none of them is correct.
Is there a way to export the accented characters correctly?
FYI, merging data with the Acrobat tool works fine.
Thanks for your answer.
UTF-8 refers to the ANSI character set. So it won't properly translate Unicode Characters. Basically, you can't put Unicode (16-bit) into a plain-text (8-bit) document.
Thank you for your answer Tom!
"Use "utf-16"" -> That doesn't work neither.
"However, why are you using a stream? The "createDataObject()" function takes a string as input." -> Because I haven't thought about that! But the result is the same...
Actually it does work, but you have to specify the correct Mime Type, since UTF-8 is the default.
createDataObject("Tst2.Txt", "Some ascii text then, ©™Σ","text/html; charset=utf-16")
By pre-attaching a file that is already UTF-16, you are pre-setting the mime type.
Your first solution would have worked if the file had been created with a UTF-16 mimetype.
It's all about being consistent with the typing all the way through the process.
Great! That works very well...
The "cMIMEType" parameter is not very well documented in the api reference!
Thanks again Tom.
I'm coming back to this post because I've run into a problem.
When the characters are written in quotes, as in your example, that works fine.
When the characters are placed into a variable, that works fine too:
var myVariable="Some ascii text then, ©™Σ";
createDataObject("Tst2.Txt", myVariable,"text/html; charset=utf-16");
In the script I'm writing, the variable is built up throughout the script and retrieved at the end to fill the .txt file.
In the attached screenshot, you can see that the variable (lesDonnees) displays correctly when recalled in the console, but the special characters are not displayed correctly in the .txt file, even though the cMIMEType parameter seems to be correctly set!
Do you have any idea on what's happening?
Examine the actual contents of the TXT file to see what encoding is used. I mean look at the hex codes, not open in a text editor. Know what WinAnsi, UTF-8 and UTF-16BE will look like. This is a lot of learning but really vital in solving problems like this. Otherwise you are forever trying to deduce what the problem is from side effects and you don't know whether it is the writing software or reading software doing something unwanted.
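If you'd rather script it than use a hex editor, here is a quick Node.js sketch (an assumption on my part; Node's `Buffer` API is not available inside Acrobat) showing what the same character looks like in each of those encodings:

```javascript
// Dump a string's bytes in hex for a given encoding, to see what each
// encoding actually looks like on disk.
function hexDump(text, encoding) {
  return Array.from(Buffer.from(text, encoding))
    .map(b => b.toString(16).padStart(2, '0'))
    .join(' ');
}

console.log(hexDump('é', 'latin1'));  // "e9": single byte, WinAnsi-style
console.log(hexDump('é', 'utf8'));    // "c3 a9": two-byte UTF-8 sequence
console.log(hexDump('é', 'utf16le')); // "e9 00": UTF-16 pair with a 00 byte
```

(Node only exposes UTF-16LE; UTF-16BE would be the same pair with the bytes swapped, "00 e9".)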
Thank you for your answer.
Opening the .txt file with Excel, the Windows (ANSI) setting displays the characters correctly.
As a first step, I know which setting to use, but I don't understand what's happening with the text stored in a variable.
Is there a way to set the encoding of the created .txt attachment correctly?
" I mean look at the hex codes, not open in a text editor."
...Sorry, but I don't understand how I can look at the hex codes!
After a lot of testing, here is what I found:
Example #1 with Tom's script -> works fine.
Example #2, where I add some accented characters to the special characters of the first script -> createDataObject("Example2.Txt", "Some ascii text then, ©™Σ plus éèàôùïîÉÈÁÀ","text/html; charset=utf-16") -> works fine too.
Example #3, where I keep only the accented characters -> createDataObject("Example2.Txt", "Some ascii text then, éèàôùïîÉÈÁÀ","text/html; charset=utf-16") -> doesn't work anymore.
So I decided to do an Example #4, where I add these special characters to my variable at the end of the script, before creating the .txt data object, and that works fine (why????).
After several other tests, I found I only need to add one of these characters (©™Σ) at the beginning or at the end of my variable for it to work.
So, for my script, the solution I found is to create the .txt data object with these symbols (or just one of them), then fill it with my variable using streamFromString and setDataObjectContents.
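A possible explanation for this behaviour (purely a guess on my part, not documented Acrobat behaviour): perhaps a single-byte encoding gets picked whenever every code point in the string fits in one byte, regardless of the declared charset, and a wider character forces real Unicode output. Note that © (U+00A9) is itself a Latin-1 character, so under this guess only ™ or Σ would actually trigger the change. A quick check of which strings contain such a character, sketched in plain Node.js (not Acrobat JS):

```javascript
// Guess: strings whose code points all fit in one byte (<= 0xFF) might be
// serialized as 8-bit text; a wider character forces a Unicode encoding.
function fitsInOneByte(text) {
  return [...text].every(ch => ch.codePointAt(0) <= 0xff);
}

console.log(fitsInOneByte('éèàôùïîÉÈÁÀ')); // true: all accented Latin-1 letters
console.log(fitsInOneByte('©™Σ'));         // false: ™ (U+2122) and Σ (U+03A3) are not Latin-1
```

That would match the tests above: the string with only accented letters fails, while any string containing ™ or Σ comes out correctly.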
Thanks for reading and providing comments on this post.
"UTF-8 refers to the ANSI character set. So it won't properly translate Unicode Characters." I have to completely disagree. UTF-8 is designed to both include readable low ASCII and all Unicode characters too. It is the best and most recommended Unicode format for most purposes, UTF-16 is not nearly so flexible.
There is a module for Notepad++ to show Hex codes https://appuals.com/how-to-install-notepad-hex-editor-plugin/
Really I find guesswork terrifying! This is all extremely well defined once you know how the different encodings are used and represented.
You seem to be a specialist and must certainly be right. For my part, I don't understand a lot about these types of formats and I therefore try to find a solution for my script.
Do you have a good cMIMEType parameter to set directly when creating the data object?
Do you know how to read the hex codes on Mac?
On a Mac I would just use the command line for short files. If you are happy with the command line, you can use
od -xa filename
which shows hex on one line, and ASCII characters on the next.
...and the result for a short file is:
What can you deduce?
We can see it is NOT UTF-16 because that would have 00 in most character pairs.
Let's look at the line containing Pr??vu
We can see the hex codes 69 6d a9 c3 75 76. The interesting part, the ? ? is a9 c3. Annoyingly these od options reverse each pair of bytes so what we actually want is c3a9.
Here we come into guesswork, but I'll assume these are recognisable European words.
Do you mean "U+FEFF" is the charset to indicate for the cMIMEType parameter?
No, it needs to get directly into the file as the first three bytes.
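To illustrate the point in Node.js (not Acrobat JS): U+FEFF is a character, not a charset name, and once encoded it becomes the file's first bytes.

```javascript
// The byte-order mark U+FEFF, encoded as UTF-8, is the three-byte
// sequence EF BB BF; as UTF-16LE it is the two bytes FF FE.
const bom8 = Buffer.from('\uFEFF', 'utf8');
console.log([...bom8].map(b => b.toString(16)));  // [ 'ef', 'bb', 'bf' ]
const bom16 = Buffer.from('\uFEFF', 'utf16le');
console.log([...bom16].map(b => b.toString(16))); // [ 'ff', 'fe' ]
```

Readers look at those first bytes to decide which encoding the rest of the file uses.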
Thanks for your help.
I have the same issue as bebarth.
When created from a variable, the attachment encoding is an issue.
According to my tests it depends on the computer used:
Acrobat Mac = Western MacOS Roman
Acrobat Windows = Western Windows Latin 1
So far so good: until now, in the process my documents use, the users do not switch computers between creating the attachment and opening it. But it is still not correct.
But we cannot use Thom's tip (updating a previously created attachment) when both the PDF and its attachment are created on the fly.
If you want a true sample of this issue, install my (free) FormReport utility and use it on a Mac and on a Windows computer: the attachment encoding is not the same … but the script is the same.