
Accented characters exported in a .csv file

Community Expert, Apr 16, 2020

I wrote a script that merges the data from several forms into a .csv file attached to a document.
Everything works fine, except for the accented characters, which don't appear correctly in the .csv file.

To export the data, I use util.streamFromString with the utf-8 setting.
I tried all the other settings, but none of them is correct.
Is there a way to export the accented characters correctly?

FYI, merging the data with the Acrobat tool works fine.

Thanks for your answer.

[Attached screenshots: Capture d’écran 2020-04-16 à 19.01.14.png, Capture d’écran 2020-04-16 à 19.26.07.png]

TOPICS
Acrobat SDK and JavaScript


Community Expert, Apr 16, 2020 (Correct answer)

UTF-8 refers to the ANSI character set. So it won't properly translate Unicode Characters. Basically, you can't put Unicode (16 bit) into a plain text (8 bit) document.

Use "utf-16".   

 

However, why are you using a stream? The "createDataObject()" function takes a string as input. You'll save yourself some trouble if you use this function, since JavaScript is native Unicode, so all strings are Unicode. 

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Community Expert, Apr 17, 2020

Thank you for your answer, Thom!

"Use "utf-16"" -> That doesn't work either.

"However, why are you using a stream? The "createDataObject()" function takes a string as input." -> Because I hadn't thought of that! But the result is the same...

However, I found a solution: I attach a .txt file that is already UTF-16 encoded, then I fill that file. That works fine...
See you later.

Capture_d’écran_2020-04-17_à_18_44_08.png

 

Community Expert, Apr 17, 2020

Actually it does work, but you have to specify the correct MIME type, since UTF-8 is the default.

This works:

createDataObject("Tst2.Txt", "Some ascii text then, ©™Σ", "text/html; charset=utf-16");

By pre-attaching a file that is already UTF-16, you are pre-setting the MIME type.

Your first solution would have worked if the file had been created with a UTF-16 MIME type.

It's all about keeping the typing consistent all the way through the process.

 

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Community Expert, Apr 18, 2020

Great! That works very well...
The "cMIMEType" parameter is not very well documented in the API reference!
Thanks again, Thom.

See you later.

Community Expert, Apr 20, 2020

Hi,

I'm coming back to this post because I have a problem.

When the characters are written as a quoted literal, as in your example, it works fine.

When the characters are placed in a variable, it works fine too:

var myVariable = "Some ascii text then, ©™Σ";

createDataObject("Tst2.Txt", myVariable, "text/html; charset=utf-16");

In the script I'm writing, the variable is built up throughout the script and used at the end to fill the .txt file.

In the attached screenshot, you can see that the variable (lesDonnees) is displayed correctly when echoed in the console, but the special characters are not displayed correctly in the .txt file, even though the cMIMEType parameter seems to be set correctly!

Do you have any idea what's happening?

Thanks

Capture_d’écran_2020-04-20_à_18_24_32.png

LEGEND, Apr 20, 2020

Examine the actual contents of the TXT file to see what encoding is used. I mean look at the hex codes, not open in a text editor. Know what WinAnsi, UTF-8 and UTF-16BE will look like. This is a lot of learning but really vital in solving problems like this. Otherwise you are forever trying to deduce what the problem is from side effects and you don't know whether it is the writing software or reading software doing something unwanted.
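
To make "know what they look like" concrete, here is a minimal sketch in plain JavaScript (run in Node or a browser, not in Acrobat) that prints the bytes of the same word under three encodings. The helper names are made up for illustration. Latin-1 stands in for WinAnsi here, which is an assumption that only holds for characters such as é where the two agree.

```javascript
// Hex dump helper: [0xc3, 0xa9] -> "c3 a9"
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

// Latin-1: one byte per character, only valid for code units up to 0xFF.
function encodeLatin1(text) {
  const bytes = [];
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit > 0xff) throw new RangeError("not representable in Latin-1");
    bytes.push(unit);
  }
  return bytes;
}

// UTF-8: the standard TextEncoder always produces UTF-8.
function encodeUtf8(text) {
  return Array.from(new TextEncoder().encode(text));
}

// UTF-16BE: two bytes per code unit, high byte first.
function encodeUtf16be(text) {
  const bytes = [];
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    bytes.push(unit >> 8, unit & 0xff);
  }
  return bytes;
}

const sample = "Pr\u00e9vu"; // "Prévu"
console.log(toHex(encodeLatin1(sample)));  // 50 72 e9 76 75
console.log(toHex(encodeUtf8(sample)));    // 50 72 c3 a9 76 75
console.log(toHex(encodeUtf16be(sample))); // 00 50 00 72 00 e9 00 76 00 75
```

The single byte e9, the pair c3 a9, and the 00 before every ASCII letter are the patterns to look for in a hex dump.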

Community Expert, Apr 21, 2020

Thank you for your answer.

When I open the .txt file in Excel, the Windows (ANSI) setting displays the characters correctly.

So, as a first step, I know which setting to use, but I don't understand what's happening with the text stored in a variable.

Is there a way to create the attached .txt file with the correct encoding?

Thanks

Capture_d’écran_2020-04-21_à_10_34_23.png

LEGEND, Apr 21, 2020

" I mean look at the hex codes, not open in a text editor."

Community Expert, Apr 21, 2020

...Sorry, but I don't understand how I can look at the hex codes!

After a lot of testing, here is what I found:

Example #1, with Thom's script -> works fine.

Example #2, where I add some accented characters to the special characters of the first script -> createDataObject("Example2.Txt", "Some ascii text then, ©™Σ plus éèàôùïîÉÈÁÀ","text/html; charset=utf-16") -> works fine too.

Example #3, where I keep only the accented characters -> createDataObject("Example2.Txt", "Some ascii text then, éèàôùïîÉÈÁÀ","text/html; charset=utf-16") -> doesn't work anymore.

So I decided to try an Example #4, where I add these special characters to my variable at the end of the script, before creating the .txt data object, and that works fine (why????).

After several other tests, I found that adding just one of these characters (©™Σ) at the beginning or at the end of my variable is enough to make it work.

Any explanation?

So, for my script, the solution I found is to create the .txt data object with these symbols (or just one of them), then fill it with my variable using streamFromString and setDataObjectContents.
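
One possible reading of examples #2 and #3, offered as an untested hypothesis and not something confirmed in this thread: all the accented characters have code points at or below U+00FF, while ™ (U+2122) and Σ (U+03A3) do not, so a string containing only accents could be written one byte per character in a platform 8-bit encoding. The sketch below is plain JavaScript (Node or browser), and `fitsInOneByte` is a made-up helper name. It only checks the code points; it says nothing about what Acrobat really does internally.

```javascript
// True when every UTF-16 code unit of the string is at most 0xFF,
// i.e. the whole string could be stored one byte per character.
function fitsInOneByte(text) {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0xff) return false;
  }
  return true;
}

const accents = "\u00e9\u00e8\u00e0\u00f4\u00f9\u00ef\u00ee\u00c9\u00c8\u00c1\u00c0"; // éèàôùïîÉÈÁÀ
const symbols = "\u00a9\u2122\u03a3"; // ©™Σ

console.log(fitsInOneByte("Some ascii text then, " + accents));           // true  (example #3)
console.log(fitsInOneByte("Some ascii text then, " + symbols + accents)); // false (example #2)
console.log(fitsInOneByte("\u00a9")); // true:  © is U+00A9
console.log(fitsInOneByte("\u2122")); // false: ™ is U+2122
```

If the hypothesis holds, adding © alone should not be enough, whereas ™ or Σ alone should be. That is easy to check by testing each symbol separately.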

 

Thanks for reading and providing comments on this post.

 

[Attached screenshots: Capture_d’écran_2020-04-21_à_12_24_35.png, Capture_d’écran_2020-04-21_à_12_24_00.png, Capture_d’écran_2020-04-21_à_12_23_21.png]

LEGEND, Apr 21, 2020

"UTF-8 refers to the ANSI character set. So it won't properly translate Unicode Characters." I have to completely disagree. UTF-8 is designed to include both readable low ASCII and all Unicode characters. It is the best and most recommended Unicode format for most purposes; UTF-16 is not nearly so flexible.

LEGEND, Apr 21, 2020

There is a plugin for Notepad++ that shows hex codes: https://appuals.com/how-to-install-notepad-hex-editor-plugin/

Really I find guesswork terrifying! This is all extremely well defined once you know how the different encodings are used and represented. 

Community Expert, Apr 21, 2020

Hi,

You seem to be a specialist and must certainly be right. For my part, I don't understand much about these formats, so I am just trying to find a solution for my script.

Do you know a cMIMEType value that sets the encoding correctly when the data object is created?

Thanks

Community Expert, Apr 21, 2020

Do you know how to read the hex codes on Mac?

LEGEND, Apr 21, 2020

On a Mac I would just use the command line for short files. If you are happy with the command line, you can use

od -xa filename

which shows hex on one line, and ASCII characters on the next.

Community Expert, Apr 21, 2020

...and the result for a short file is:

Capture_d’écran_2020-04-21_à_16_07_45.png

What can you deduce?

LEGEND, Apr 21, 2020

We can see it is NOT UTF-16 because that would have 00 in most character pairs.

Let's look at the line containing Pr??vu

We can see the hex codes 69 6d a9 c3 75 76. The interesting part, the ??, is a9 c3. Annoyingly, these od options print each pair of bytes reversed, so what we actually want is c3 a9, which is the UTF-8 encoding of é (U+00E9).
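
The byte swapping can be undone mechanically. This is a small sketch in plain JavaScript (Node or a browser, not Acrobat), assuming od was run on a little-endian machine. The three input words are a hypothetical illustration for the word "Prévu", not copied from the screenshot, and `odWordsToBytes` is a made-up helper name.

```javascript
// Turn the 16-bit words printed by `od -x` back into file byte order.
// On a little-endian machine the low byte of each word comes first in the file.
function odWordsToBytes(words) {
  const bytes = [];
  for (const word of words) {
    const value = parseInt(word, 16);
    bytes.push(value & 0xff, value >> 8);
  }
  return Uint8Array.from(bytes);
}

const words = ["7250", "a9c3", "7576"]; // hypothetical od -x output
const bytes = odWordsToBytes(words);    // 50 72 c3 a9 76 75
const text = new TextDecoder("utf-8").decode(bytes);
console.log(text); // Prévu
```

A file with an odd number of bytes gets a padding byte in the last word printed by od, which this sketch does not handle.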

Here we come into guesswork, but I'll assume these are recognisable European words.

So we have what looks like a UTF-8 file. I'd hope an app would accept this as Unicode. If it does not, and there is no UTF-8 encoding setting, the data might need a BOM (a special marker) at the start. This is the three bytes EF BB BF, which may look like ï»¿ if an app doesn't understand UTF-8. But you cannot write these bytes directly from JavaScript. According to info I've found but not tested, to write a BOM you can write "U+FEFF", that is, the character U+FEFF as the first character of the text.
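
As a sketch of the BOM idea in plain JavaScript (Node or a browser): the character U+FEFF is written "\uFEFF" in a string literal, and encoding it as UTF-8 gives exactly EF BB BF. `withBom` is a made-up helper name, and TextEncoder is used here only to show the bytes; whether Acrobat's createDataObject keeps this leading character when it writes the attachment is an assumption that this thread does not confirm.

```javascript
const BOM = "\uFEFF";

// Prepend the BOM character unless the text already starts with it.
function withBom(text) {
  return text.startsWith(BOM) ? text : BOM + text;
}

// Encode as UTF-8 to see which bytes would end up at the start of the file.
const encoded = new TextEncoder().encode(withBom("Pr\u00e9vu"));
const firstThree = Array.from(encoded.slice(0, 3), (b) =>
  b.toString(16).padStart(2, "0")
).join(" ");
console.log(firstThree); // ef bb bf
```

The BOM goes into the string itself, as its first character. It is not a charset name for the cMIMEType parameter.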

Community Expert, Apr 21, 2020

Do you mean "U+FEFF" is the charset to indicate for the cMIMEType parameter?

LEGEND, Apr 21, 2020

No, it needs to get directly into the file as the first three bytes.

Community Expert, Apr 21, 2020

Thanks for your help.

Community Expert, Apr 21, 2020

I have the same issue as bebarth.

When the attachment is created from a variable, its encoding is a problem.

According to my tests, it depends on the computer used:

Acrobat Mac = Western MacOS Roman

Acrobat Windows = Western Windows Latin 1

So far this has not caused trouble, because in the process my documents use, users do not change computers to open the attachment they have just created. But it is not correct.

And we cannot use Thom's tip (updating a previously created attachment) when both the PDF and its attachment are created on the fly.

If you want a real sample of this issue, install my (free) FormReport utility and use it on a Mac and on a Windows computer: the attachment encoding is not the same, but the script is the same.

(I can share the non-minified, fully commented JavaScript of FormReport if needed.)
