I wrote a script to merge the data from different forms into a CSV file attached to a document.
Everything works fine, except for the accented characters, which don't appear correctly in the .csv file.
To export the data, I use util.streamFromString with the utf-8 setting.
I tried all the other settings, but none of them is correct.
Is there a way to export the accented characters correctly?
FYI, merging data with the Acrobat tool works fine.
Thanks for your answer.
UTF-8 refers to the ANSI character set. So it won't properly translate Unicode Characters. Basically, you can't put Unicode (16 bit) into a plain text ( 8 bit) document.
Use "utf-16".
However, why are you using a stream? The "createDataObject()" function takes a string as input. You'll save yourself some trouble if you use this function, since JavaScript is native Unicode, so all strings are Unicode.
Thank you for your answer Tom!
"Use "utf-16"" -> That doesn't work either.
"However, why are you using a stream? The "createDataObject()" function takes a string as input." -> Because I haven't thought about that! But the result is the same...
Actually it does work, but you have to specify the correct MIME type, since UTF-8 is the default.
This works:
createDataObject("Tst2.Txt", "Some ascii text then, ©™Σ","text/html; charset=utf-16")
By pre-attaching a file that is already UTF-16, you are pre-setting the MIME type.
Your first solution would have worked if the file had been created with a UTF-16 MIME type.
It's all about being consistent with the typing all the way through the process.
Great! That works very well...
The "cMIMEType" parameter is not very well documented in the API reference!
Thanks again Tom.
@+
Hi,
I'm coming back to this post because I have a problem.
When the characters are written in quotes, as in your example, that works fine.
When the characters are placed into a variable, that works fine too:
var myVariable="Some ascii text then, ©™Σ";
createDataObject("Tst2.Txt", myVariable,"text/html; charset=utf-16");
In the script I'm writing, the variable is built up throughout the script and used at the end to fill the .txt file.
In the attached screenshot, you can see that the variable (lesDonnees) is displayed correctly when recalled in the console, but the special characters are not displayed correctly in the .txt file, even though the cMIMEType parameter seems to be set correctly!
Do you have any idea on what's happening?
Thanks
Examine the actual contents of the TXT file to see what encoding is used. I mean look at the hex codes, not open in a text editor. Know what WinAnsi, UTF-8 and UTF-16BE will look like. This is a lot of learning but really vital in solving problems like this. Otherwise you are forever trying to deduce what the problem is from side effects and you don't know whether it is the writing software or reading software doing something unwanted.
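As a rough illustration of what to look for, here is a minimal sketch in plain JavaScript (not Acrobat-specific) that prints the bytes of the same text in the three encodings mentioned. The helper names toHex, winAnsiBytes and utf16beBytes are made up for this example, and winAnsiBytes only handles ASCII and the range U+00A0 to U+00FF, where WinAnsi (Windows-1252) and Latin-1 agree. It assumes an environment that provides TextEncoder, such as a browser or Node.js.

```javascript
// Format byte values as two-digit hex, e.g. [195, 169] -> "c3 a9".
function toHex(bytes) {
  return Array.from(bytes, function (b) {
    return b.toString(16).padStart(2, "0");
  }).join(" ");
}

// WinAnsi: one byte per character. This sketch only covers ASCII and
// U+00A0..U+00FF, where the byte value equals the code point.
function winAnsiBytes(text) {
  return Array.from(text, function (ch) {
    var code = ch.charCodeAt(0);
    if (code > 0xff || (code >= 0x80 && code < 0xa0)) {
      throw new RangeError("not handled by this sketch: " + ch);
    }
    return code;
  });
}

// UTF-8: ASCII stays one byte, accented characters become two or more bytes.
function utf8Bytes(text) {
  return Array.from(new TextEncoder().encode(text));
}

// UTF-16BE: two bytes per UTF-16 code unit, high byte first.
function utf16beBytes(text) {
  var bytes = [];
  for (var i = 0; i < text.length; i++) {
    var unit = text.charCodeAt(i);
    bytes.push(unit >> 8, unit & 0xff);
  }
  return bytes;
}

var sample = "Pr\u00E9vu"; // "Prévu"
console.log("WinAnsi :", toHex(winAnsiBytes(sample)));
console.log("UTF-8   :", toHex(utf8Bytes(sample)));
console.log("UTF-16BE:", toHex(utf16beBytes(sample)));
```

The thing to notice in a hex dump is the accented character: a single byte e9 suggests WinAnsi, the pair c3 a9 suggests UTF-8, and 00 bytes between the letters suggest UTF-16.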
Thank you for your answer.
When I open the txt file in Excel, the Windows (ANSI) setting displays the characters correctly.
So as a first step I know which setting to use, but I don't understand what's happening with the text stored in a variable.
Is there a way to set the encoding of the attached txt file correctly when it is created?
Thanks
" I mean look at the hex codes, not open in a text editor."
...Sorry, but I don't understand how I can look at the hex codes!
After a lot of testing, here is what I found:
Example #1, with Tom's script -> works fine.
Example #2, where I add some accented characters to the special characters of the first script -> createDataObject("Example2.Txt", "Some ascii text then, ©™Σ plus éèàôùïîÉÈÁÀ","text/html; charset=utf-16") -> works fine too.
Example #3, where I only include the accented characters -> createDataObject("Example2.Txt", "Some ascii text then, éèàôùïîÉÈÁÀ","text/html; charset=utf-16") -> doesn't work anymore.
So I tried an Example #4, where I add these special characters to my variable at the end of the script, just before creating the txt data object, and that works fine (why????).
After several other tests, I found that adding just one of these characters (©™Σ) at the beginning or at the end of my variable is enough to make it work.
Any explanation?
So, for my script, the solution I found is to create the txt data object with these symbols (or just one of them), then to fill it with my variable using streamFromString and setDataObjectContents.
Thanks for reading and providing comments on this post.
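The workaround described above could be sketched roughly as follows. This is only an outline under some assumptions: the function name writeAttachmentViaStream is made up, doc stands for the Acrobat document object and acroUtil for Acrobat's global util object (passed in as parameters so the sketch does not depend on globals), and the "utf-16" charset given to streamFromString is a guess, since the post does not say which charset was used.

```javascript
// Sketch of the three-step workaround: seed the attachment with a symbol,
// then overwrite its contents with the real text through a stream.
// `doc` is assumed to be an Acrobat Doc object and `acroUtil` Acrobat's `util`.
function writeAttachmentViaStream(doc, acroUtil, fileName, text) {
  // Step 1: create the attachment with one of the symbols (here the copyright
  // sign) and the UTF-16 MIME type from the earlier answer.
  doc.createDataObject(fileName, "\u00A9", "text/html; charset=utf-16");

  // Step 2: turn the real text into a stream. The "utf-16" charset is an
  // assumption; the thread does not state which one was used.
  var stream = acroUtil.streamFromString(text, "utf-16");

  // Step 3: replace the seed contents with the real data.
  doc.setDataObjectContents(fileName, stream);
}
```

Inside Acrobat this would presumably be called as writeAttachmentViaStream(this, util, "Example.txt", lesDonnees). Why the seed symbol makes a difference is not explained anywhere in this thread.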
"UTF-8 refers to the ANSI character set. So it won't properly translate Unicode Characters." I have to completely disagree. UTF-8 is designed to include both readable low ASCII and all Unicode characters. It is the best and most recommended Unicode format for most purposes; UTF-16 is not nearly so flexible.
There is a plugin for Notepad++ that shows hex codes: https://appuals.com/how-to-install-notepad-hex-editor-plugin/
Really I find guesswork terrifying! This is all extremely well defined once you know how the different encodings are used and represented.
Hi,
You seem to be a specialist and must certainly be right. For my part, I don't understand a lot about these types of formats and I therefore try to find a solution for my script.
Is there a cMIMEType value I can set directly when creating the data object?
Thanks
Do you know how to read the hex codes on Mac?
On a Mac I would just use the command line for short files. If you are happy with the command line you can use
od -xa filename
which shows hex on one line, and ASCII characters on the next.
...and the result for a short file is:
What can you deduce?
We can see it is NOT UTF-16 because that would have 00 in most character pairs.
Let's look at the line containing Pr??vu
We can see the hex codes 69 6d a9 c3 75 76. The interesting part, the ? ?, is a9 c3. Annoyingly these od options reverse each pair of bytes, so what we actually want is c3a9, which is the UTF-8 encoding of é.
Here we come into guesswork, but I'll assume these are recognisable European words.
So we have what looks like a UTF-8 file. I'd hope an app would accept this as Unicode. If it does not, and there is no UTF-8 encoding setting, the data might need a BOM (a special marker) at the start. This is the three bytes EF BB BF, which may look like ï»¿ if an app doesn't understand UTF-8. But you cannot write these bytes directly from JavaScript. According to info I've found but not tested, to write a BOM you can write "U+FEFF" (the character "\uFEFF" in a JavaScript string) as the first character of the text.
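The reading of the od output above can be double-checked with a few lines of plain JavaScript. This is a sketch only: the helper names odWordToBytes, decodeUtf8 and decodeLatin1 are invented for the example, it assumes od -x printed each 16-bit word with its two bytes swapped (as described above), and it assumes an environment with TextDecoder, such as a browser or Node.js.

```javascript
// Turn one `od -x` word such as "a9c3" back into file order: [0xc3, 0xa9].
// Assumes od printed each 16-bit word with its two bytes swapped.
function odWordToBytes(word) {
  var first = parseInt(word.slice(0, 2), 16);
  var second = parseInt(word.slice(2, 4), 16);
  return [second, first];
}

// Interpret the bytes as UTF-8.
function decodeUtf8(bytes) {
  return new TextDecoder("utf-8").decode(new Uint8Array(bytes));
}

// Interpret each byte as one Latin-1 character (byte value = code point).
function decodeLatin1(bytes) {
  return String.fromCharCode.apply(null, bytes);
}

var fileBytes = odWordToBytes("a9c3");
console.log(decodeUtf8(fileBytes));   // the single character é
console.log(decodeLatin1(fileBytes)); // the two characters Ã©
```

Seeing Ã© where é was expected is the usual sign of a UTF-8 file being read with a one-byte-per-character encoding.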
Do you mean "U+FEFF" is the charset to indicate for the cMIMEType parameter?
No, it needs to get directly into the file as the first three bytes.
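To make that concrete, here is a small sketch in plain JavaScript showing how the character U+FEFF at the start of a string turns into the bytes EF BB BF once the text is encoded as UTF-8. The function name utf8BytesWithBom is made up, and the sketch uses TextEncoder (available in browsers and Node.js); whether Acrobat keeps this character when it writes the attachment is untested here.

```javascript
// Prepend U+FEFF to the text and encode the result as UTF-8.
// The first three bytes of the output are then the UTF-8 BOM.
function utf8BytesWithBom(text) {
  return Array.from(new TextEncoder().encode("\uFEFF" + text));
}

var bytes = utf8BytesWithBom("Pr\u00E9vu"); // "Prévu"
var hex = bytes.map(function (b) {
  return b.toString(16).padStart(2, "0");
}).join(" ");
console.log(hex); // ef bb bf 50 72 c3 a9 76 75
```

So "U+FEFF" is not a charset name to put in cMIMEType; it is a character placed at the start of the data itself.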
Thanks for your help.
I have the same issue as bebarth.
When the attachment is created from a variable, its encoding is a problem.
According to my tests it depends on the computer used:
Acrobat Mac = Western MacOS Roman
Acrobat Windows = Western Windows Latin 1
So far this has been acceptable, since in the process my documents are used in, users do not change computers to open the attachment they just created, but it is not correct.
But we cannot use Thom's tip (updating a previously created attachment) when both the PDF and its attachment are created on the fly.
If you want a real sample of this issue, install my (free) FormReport utility and use it on a Mac and on a Windows computer: the attachment encoding is not the same, but the script is the same.
(I can share the unminified, fully commented JavaScript of FormReport if needed.)
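For anyone who wants to check which of these encodings an attachment actually ended up in, one quick test is to look at the bytes written for a known é in a hex dump. The sketch below is plain JavaScript with an invented function name, guessEncodingOfEAcute; it only recognises the encodings discussed in this thread and only for that one character, so treat it as a diagnostic aid rather than a real detector.

```javascript
// Given the bytes found in the file where an "é" was expected,
// name the encoding that would have produced them.
function guessEncodingOfEAcute(bytes) {
  var hex = bytes.map(function (b) {
    return b.toString(16).padStart(2, "0");
  }).join(" ");
  switch (hex) {
    case "c3 a9":
      return "UTF-8";
    case "00 e9":
      return "UTF-16BE";
    case "e9":
      return "Windows-1252 (or Latin-1)";
    case "8e":
      return "Mac OS Roman";
    default:
      return "unknown";
  }
}

console.log(guessEncodingOfEAcute([0x8e])); // Mac OS Roman
console.log(guessEncodingOfEAcute([0xe9])); // Windows-1252 (or Latin-1)
```

If the same script produces 8e on a Mac and e9 on Windows, that would match the platform-dependent behaviour described above.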