Loic.Aigon
Legend
January 4, 2013
Question

CSV & Encoding issues

Hi Guys,

I'm having trouble with two CSV files. The first one reads fine as long as I don't specify any encoding, but the second one can't be read unless I specify UTF-8.

On the other hand, if I specify UTF-8, the first file displays weird characters. I guess it's related to the software that generated the CSV (Excel vs. OpenOffice?). Anyway, I have no control over the exports.

I tried to figure out how to detect the encoding of a file, but it doesn't seem possible.

Any hints?

Loic

PS: Happy New Year 2013 to everybody.

2 replies

Jongware
Community Expert
January 5, 2013

Open the file in *binary* mode. Then you can safely check for the various Unicode BOM markers: FF FE or FE FF for UTF-16 (little-endian and big-endian respectively), or EF BB BF, the UTF-8 encoded version of the same marker. If present, these will tell you right away how the file is encoded.
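The BOM check can be sketched roughly as follows. This is only an illustration: `detectBom` and `toBytes` are hypothetical helper names, and the sketch assumes the file was read in binary mode into a string holding one character per byte (in ExtendScript that would typically mean setting the File object's `encoding` to "BINARY" before opening it). It is written in plain ES3-style JavaScript so that it should also run in ExtendScript.

```javascript
// Hypothetical helper: turn a binary-mode string (one char per byte)
// into an array of byte values 0..255.
function toBytes(binaryString) {
    var out = [];
    for (var i = 0; i < binaryString.length; i++) {
        out.push(binaryString.charCodeAt(i) & 0xFF);
    }
    return out;
}

// Hypothetical helper: inspect the first bytes of a file for a BOM.
// Returns "UTF-8", "UTF-16LE", "UTF-16BE", or null when no BOM is found.
// UTF-32 BOMs are deliberately ignored in this sketch.
function detectBom(bytes) {
    if (bytes.length >= 3 &&
            bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
        return "UTF-8";
    }
    if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) {
        return "UTF-16LE";
    }
    if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
        return "UTF-16BE";
    }
    return null;
}
```

A missing BOM proves nothing by itself, since many programs write UTF-8 without one, so a `null` result just means the content has to be inspected.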

If not, you will have to read the entire file (still in binary mode) and check if there is a byte with a value larger than 127 (126, the tilde, is the last printable ASCII character). If not, well, then it's a "plain" ASCII file. If you *do* find bytes larger than 127, all of them *must* be part of a valid UTF-8 multi-byte sequence (two to four bytes long, see Wikipedia) for the file to be UTF-8.

If any of them fail this test, it's a regular high-ASCII (8-bit) encoded file. The only thing left to answer is then "with what encoding": Windows Western, Cyrillic, Greek, or MacRoman, or MacGreek, or any other of the dozens and dozens of possible encodings. If this concerns you, you might be able to run a statistical test to determine it, and then either manually convert to Unicode or re-read the file with the correct encoding.
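The no-BOM test described above could look something like this. It is a sketch under assumptions, not a definitive implementation: `isValidUtf8` and `classify` are hypothetical names, the input is assumed to be an array of byte values (0..255) covering the whole file, and the code is kept ES3-style so it should also run in ExtendScript.

```javascript
// Hypothetical helper: true if `bytes` is entirely well-formed UTF-8.
function isValidUtf8(bytes) {
    var i = 0;
    while (i < bytes.length) {
        var b = bytes[i];
        var extra; // continuation bytes expected after the lead byte
        var min;   // smallest code point allowed (rejects overlong forms)
        var cp;    // code point being assembled
        if (b <= 0x7F) { i++; continue; }                    // plain ASCII
        else if (b >= 0xC2 && b <= 0xDF) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if (b >= 0xE0 && b <= 0xEF) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if (b >= 0xF0 && b <= 0xF4) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else { return false; }                               // invalid lead byte
        if (i + extra >= bytes.length) { return false; }     // sequence cut short
        for (var k = 1; k <= extra; k++) {
            var c = bytes[i + k];
            if ((c & 0xC0) !== 0x80) { return false; }       // not a 10xxxxxx byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF) { return false; }     // overlong or too large
        if (cp >= 0xD800 && cp <= 0xDFFF) { return false; }  // UTF-16 surrogate
        i += extra + 1;
    }
    return true;
}

// Hypothetical helper: "ASCII", "UTF-8", or "legacy" (some 8-bit encoding).
function classify(bytes) {
    var hasHigh = false;
    for (var i = 0; i < bytes.length; i++) {
        if (bytes[i] > 0x7F) { hasHigh = true; break; }
    }
    if (!hasHigh) { return "ASCII"; }
    return isValidUtf8(bytes) ? "UTF-8" : "legacy";
}
```

A "legacy" result only says the file is not UTF-8; it does not say which 8-bit encoding was used. A short file in an 8-bit encoding can also pass the UTF-8 check by coincidence, although that becomes unlikely as the text gets longer.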

From memory: I think I used this method in my Markdown script.

Loic.Aigon
Legend
January 5, 2013

Hi Jongware,

Thanks for the tip. I will investigate next week.

best to you,

Loic

Bob Stucky
Adobe Employee
January 4, 2013

Hi Loic,

Excel for Mac is notoriously bad with UTF-8.

Loic.Aigon
Legend
January 4, 2013

Hi Bob,

I guess I could use a conditional encoding depending on the creator, but I can't see how to get it.

Thanks anyway,

Loic

Bob Stucky
Adobe Employee
January 4, 2013

Here's a thought: open the file with an ExtendScript File object, check the encoding it reports (which may or may not be correct), and of course look for the UTF BOM in the first few bytes...

http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark

That might get you there.