Accented characters exported in a .csv file

Participant ,
Apr 16, 2020

I wrote a script to merge the data from several forms into a CSV file attached to a document.
Everything works fine, except for the accented characters, which don't appear correctly in the .csv file.

To export the data, I use util.streamFromString with the utf-8 setting.
I tried all the other settings, but none of them is correct.
Is there a way to export the accented characters correctly?

FYI, merging data with the Acrobat tool works fine.

Thanks for your answer.

[Attached screenshots: Capture d’écran 2020-04-16 à 19.01.14.png, Capture d’écran 2020-04-16 à 19.26.07.png]

Correct answer by Thom Parker | Adobe Community Professional ,
Apr 16, 2020

UTF-8 refers to the ANSI character set. So it won't properly translate Unicode Characters. Basically, you can't put Unicode (16 bit) into a plain text ( 8 bit) document. 

Use "utf-16".   

 

However, why are you using a stream? The "createDataObject()" function takes a string as input. You'll save yourself some trouble if you use this function, since JavaScript is native Unicode, so all strings are Unicode. 

Participant ,
Apr 17, 2020

Thank you for your answer, Thom!

"Use "utf-16"" -> That doesn't work either.

"However, why are you using a stream? The "createDataObject()" function takes a string as input." -> Because I hadn't thought of that! But the result is the same...

However, I found a solution. I attach a .txt file that is already UTF-16 encoded, then I fill that file. That works fine...
See you later

[Attached screenshot: Capture_d’écran_2020-04-17_à_18_44_08.png]

Adobe Community Professional ,
Apr 17, 2020

Actually it does work, but you have to specify the correct MIME type, since UTF-8 is the default.

This works:

createDataObject("Tst2.Txt", "Some ascii text then, ©™Σ", "text/html; charset=utf-16");

By pre-attaching a file that is already UTF-16, you are pre-setting the MIME type.

Your first solution would have worked if the file had been created with a UTF-16 MIME type.

It's all about keeping the encoding consistent all the way through the process.
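
A minimal sketch of the call above wrapped in a helper, so that the charset is written once and reused for every attachment. The name `attachUtf16Text` and the constant `UTF16_MIME` are hypothetical; `doc` stands for the Acrobat document object (`this` in a document-level script). Only `createDataObject` and its third argument (cMIMEType) come from the thread.

```javascript
// Hypothetical helper: always pass an explicit UTF-16 MIME type
// when creating a text attachment, as suggested in the post above.
var UTF16_MIME = "text/html; charset=utf-16";

function attachUtf16Text(doc, name, text) {
  // Third argument is cMIMEType. According to the post above,
  // UTF-8 is assumed when it is left out.
  doc.createDataObject(name, text, UTF16_MIME);
  return UTF16_MIME;
}

// In Acrobat (assumed usage):
// attachUtf16Text(this, "Tst2.Txt", "Some ascii text then, ©™Σ");
```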

 

Participant ,
Apr 18, 2020

Great! That works very well...
The "cMIMEType" parameter is not very well documented in the API reference!
Thanks again, Thom.

See you later

Participant ,
Apr 20, 2020

Hi,

I'm coming back to this post because I have run into a problem.

When the characters are written in quotes, as in your example, that works fine.

When the characters are placed in a variable, that works fine too:

var myVariable="Some ascii text then, ©™Σ";

createDataObject("Tst2.Txt", myVariable,"text/html; charset=utf-16");

In the script I'm writing, the variable is built up throughout the script and used at the end to fill the .txt file.

In the attached screenshot, you can see that the variable (lesDonnees) is displayed correctly when evaluated in the console, but the special characters are not displayed correctly in the .txt file, even though the cMIMEType parameter seems to be set correctly!

Do you have any idea what's happening?

Thanks

[Attached screenshot: Capture_d’écran_2020-04-20_à_18_24_32.png]

Most Valuable Participant ,
Apr 20, 2020

Examine the actual contents of the TXT file to see what encoding is used. I mean look at the hex codes, not open in a text editor. Know what WinAnsi, UTF-8 and UTF-16BE will look like. This is a lot of learning but really vital in solving problems like this. Otherwise you are forever trying to deduce what the problem is from side effects and you don't know whether it is the writing software or reading software doing something unwanted.
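
For readers unsure what to look for in a hex dump, here is a small sketch that prints the bytes of "é" in three encodings. It runs in Node.js rather than inside Acrobat. `hexBytes` and `utf16be` are hypothetical helper names, and Node's `latin1` stands in for WinAnsi here (the two agree for this character, though not for every byte value).

```javascript
// Show the bytes of one string in three encodings (Node.js, not Acrobat).
function hexBytes(buf) {
  return Array.from(buf, function (b) {
    return b.toString(16).padStart(2, "0");
  }).join(" ");
}

function utf16be(text) {
  // Node only offers little-endian UTF-16, so swap each byte pair.
  return Buffer.from(text, "utf16le").swap16();
}

var sample = "é";
console.log("Latin-1 :", hexBytes(Buffer.from(sample, "latin1"))); // e9
console.log("UTF-8   :", hexBytes(Buffer.from(sample, "utf8")));   // c3 a9
console.log("UTF-16BE:", hexBytes(utf16be(sample)));               // 00 e9
```

One byte per accented letter suggests a single-byte encoding such as WinAnsi, a `c3 xx` pair suggests UTF-8, and a `00` before every Latin letter suggests UTF-16BE.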

Participant ,
Apr 21, 2020

Thank you for your answer.

When I open the .txt file in Excel, the Windows (ANSI) setting displays the characters correctly.

So for now I know which setting to use, but I don't understand what's happening with the text stored in a variable.

Is there a way to set the encoding of the attached .txt file correctly when it is created?

Thanks

[Attached screenshot: Capture_d’écran_2020-04-21_à_10_34_23.png]

Most Valuable Participant ,
Apr 21, 2020

" I mean look at the hex codes, not open in a text editor."

Participant ,
Apr 21, 2020

...Sorry, but I don't understand how I can look at the hex codes!

After a lot of testing, here is what I found:

Example #1, with Thom's script -> works fine.

Example #2, where I add some accented characters to the special characters of the first script -> createDataObject("Example2.Txt", "Some ascii text then, ©™Σ plus éèàôùïîÉÈÁÀ","text/html; charset=utf-16") -> works fine too.

Example #3, where I include only the accented characters -> createDataObject("Example2.Txt", "Some ascii text then, éèàôùïîÉÈÁÀ","text/html; charset=utf-16") -> doesn't work anymore.

So I tried an Example #4, where I add these special characters to my variable at the end of the script, just before creating the .txt data object, and that works fine (why?).

After several other tests, I found that adding just one of these characters (©™Σ) at the beginning or at the end of my variable is enough to make it work.

Any explanation?

So, for my script, the solution I found is to create the .txt data object with these symbols (or just one of them), then fill it with my variable using streamFromString and setDataObjectContents.
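
The workaround described above could be sketched roughly as follows. The function name `writeAttachment` is hypothetical, and the "utf-16" charset passed to `streamFromString` is an assumption, since the post does not say which setting was used. `doc` stands for the Acrobat document object and `util` for Acrobat's `util` object; both are passed in as arguments.

```javascript
// Sketch of the workaround: create the attachment with a symbol,
// then overwrite its contents through a stream.
function writeAttachment(doc, util, name, text) {
  var mime = "text/html; charset=utf-16";
  // Step 1: create the attachment with a single symbol as placeholder content.
  doc.createDataObject(name, "\u00A9", mime);
  // Step 2: turn the real text into a stream ("utf-16" is an assumption here).
  var stream = util.streamFromString(text, "utf-16");
  // Step 3: replace the placeholder with the real contents.
  doc.setDataObjectContents(name, stream);
}

// In Acrobat (assumed usage):
// writeAttachment(this, util, "Example.txt", lesDonnees);
```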

 

Thanks for reading and providing comments on this post.

 

[Attached screenshots: Capture_d’écran_2020-04-21_à_12_24_35.png, Capture_d’écran_2020-04-21_à_12_24_00.png, Capture_d’écran_2020-04-21_à_12_23_21.png]

Most Valuable Participant ,
Apr 21, 2020

"UTF-8 refers to the ANSI character set. So it won't properly translate Unicode Characters."  I have to completely disagree. UTF-8 is designed to both include readable low ASCII and all Unicode characters too. It is the best and most recommended Unicode format for most purposes, UTF-16 is not nearly so flexible.
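
A short Node.js sketch (run outside Acrobat) illustrating the point about UTF-8: the ASCII part of a string is stored byte for byte as plain ASCII, while the other characters take more than one byte each, and the whole string decodes back unchanged. The sample string is the one used earlier in the thread; the variable names are only for illustration.

```javascript
// UTF-8 keeps ASCII readable and still covers the other characters.
var text = "Some ascii text then, ©™Σ plus éèàôùïîÉÈÁÀ";
var bytes = Buffer.from(text, "utf8");

// The ASCII prefix is stored one byte per character, unchanged.
var asciiPart = "Some ascii text then, ";
var prefix = bytes.subarray(0, asciiPart.length).toString("latin1");
console.log(prefix === asciiPart); // true

// Non-ASCII characters take more than one byte.
console.log(Buffer.byteLength("é", "utf8")); // 2
console.log(Buffer.byteLength("™", "utf8")); // 3

// Decoding gives back the original string.
console.log(bytes.toString("utf8") === text); // true
```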

Most Valuable Participant ,
Apr 21, 2020

There is a module for Notepad++ to show Hex codes https://appuals.com/how-to-install-notepad-hex-editor-plugin/

Really I find guesswork terrifying! This is all extremely well defined once you know how the different encodings are used and represented. 

Participant ,
Apr 21, 2020

Hi,

You seem to be a specialist and are certainly right. For my part, I don't understand much about these formats, so I'm just trying to find a solution for my script.

Is there a cMIMEType value I can set directly when creating the data object?

Thanks

Participant ,
Apr 21, 2020

Do you know how to read the hex codes on Mac?

Most Valuable Participant ,
Apr 21, 2020

On a Mac I would just use the command line for short files. If you are comfortable with the command line, you can use

od -xa filename

which shows hex on one line and ASCII characters on the next.

Participant ,
Apr 21, 2020

...and the result for a short file is:

[Attached screenshot: Capture_d’écran_2020-04-21_à_16_07_45.png]

What can you deduce?

Most Valuable Participant ,
Apr 21, 2020

We can see it is NOT UTF-16 because that would have 00 in most character pairs.

Let's look at the line containing Pr??vu

We can see the hex codes 69 6d a9 c3 75 76. The interesting part, the ? ? is a9 c3. Annoyingly these od options reverse each pair of bytes so what we actually want is c3a9. 
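
As an illustration of that byte-pair reversal, here is a small Node.js sketch (run outside Acrobat) that takes 16-bit words as `od -x` prints them on a little-endian machine, puts the bytes back in file order, and decodes them as UTF-8. The helper name `odWordsToBuffer` is hypothetical.

```javascript
// Undo the byte-pair reversal of `od -x` output, then decode as UTF-8.
function odWordsToBuffer(words) {
  var out = [];
  words.forEach(function (w) {
    // "a9c3" is printed high byte first, but the file holds c3 then a9.
    out.push(parseInt(w.slice(2, 4), 16), parseInt(w.slice(0, 2), 16));
  });
  return Buffer.from(out);
}

var buf = odWordsToBuffer(["a9c3"]);
console.log(buf.toString("hex"));  // c3a9
console.log(buf.toString("utf8")); // é
```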

Here we come into guesswork, but I'll assume these are recognisable European words.

So we have what looks like a UTF-8 file. I'd hope an app would accept this as Unicode. If it does not, and there is no UTF-8 encoding setting, the data might need a BOM (special marker) at the start. This is the three bytes EF BB BF. Which may look like ï»¿ if an app doesn't understand UTF-8. But you cannot write these bytes directly from JavaScript. According to info I've found but not tested, to write a BOM you can write "U+FEFF".
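
In JavaScript source, the character U+FEFF is written with the escape "\uFEFF". The Node.js sketch below (run outside Acrobat) shows that a string starting with that character yields the bytes EF BB BF once it is encoded as UTF-8. The helper name `withBom` is hypothetical, and whether Acrobat keeps this character when it writes the attachment is not tested here.

```javascript
// Prepend U+FEFF unless the text already starts with it.
function withBom(text) {
  return text.charAt(0) === "\uFEFF" ? text : "\uFEFF" + text;
}

var encoded = Buffer.from(withBom("Prévu"), "utf8");
console.log(encoded.subarray(0, 3).toString("hex")); // efbbbf
```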

Participant ,
Apr 21, 2020

Do you mean "U+FEFF" is the charset to indicate for the cMIMEType parameter?

Most Valuable Participant ,
Apr 21, 2020

No, it needs to get directly into the file as the first three bytes.

Participant ,
Apr 21, 2020

Thanks for your help.

Adobe Community Professional ,
Apr 21, 2020

I have the same issue as bebarth.

When the attachment is created from a variable, its encoding is a problem.

According to my tests, it depends on the computer used:

Acrobat Mac = Western MacOS Roman

Acrobat Windows = Western Windows Latin 1

This has been acceptable so far because, in the process my documents use, users do not change computers to open the attachment they just created, but it is still not correct.

Also, we cannot use Thom's tip (updating a previously created attachment) when both the PDF and its attachment are created on the fly.

If you want a real sample of this issue, install my (free) FormReport utility and use it on a Mac and on a Windows computer: the attachment encoding is not the same, although the script is the same.

(I can share the unminified, fully commented JavaScript of FormReport if needed.)
