
UTF-16 representation in a ByteArray

Community Beginner, Jan 17, 2010

Hello all,

Is there a way to write a UTF-16 string into a ByteArray in Flash/AS3? Basically I have a string (var test:String="allan"; for example) and I would like to write that into a ByteArray with UTF-16LE encoding. In this case it would be "61 00 6C 00 6C 00 61 00 6E 00".

I've tried using utf16le.writeMultiByte( clipText, "utf-16" ); but it just comes out as what appears to be UTF-8 (or just straight ASCII, given the test string).

The use case is to save a UTF-16LE file using FileReference.save(), which I understand I can do by passing it a ByteArray with the correct character encoding in it. Passing just a string saves as UTF-8. Hence the need to convert and store into a UTF-16LE representation in a ByteArray.
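A minimal sketch of the two save paths in question, assuming Flash Player 10+ for FileReference.save() ("output.txt" is just a placeholder file name, and save() has to run from a user-initiated event such as a click):

import flash.events.MouseEvent;
import flash.net.FileReference;
import flash.utils.ByteArray;

var fileRef:FileReference = new FileReference();

stage.addEventListener( MouseEvent.CLICK, function( e:MouseEvent ):void
{
     // Passing a String: Flash encodes it as UTF-8 on disk.
     //fileRef.save( "allan", "output.txt" );

     // Passing a ByteArray: the bytes go to disk exactly as stored,
     // so the file's encoding is whatever was written into the array.
     var bytes:ByteArray = new ByteArray();
     // ... fill 'bytes' with a UTF-16LE representation here ...
     fileRef.save( bytes, "output.txt" );
} );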

Regards,
Allan

TOPICS
ActionScript
LEGEND, Jan 17, 2010

Have you tried:

utf16le.writeMultiByte( clipText, "unicode" );

From the list here (http://help.adobe.com/en_US/AS3LCR/Flash_10.0/charset-codes.html), "unicode" is the label and "utf-16" is an alias that points back to it. Stranger things have happened.

Also, I'm not too up on all of this, but does UTF-16LE mean "little endian"? It seems that the default is big endian, so that might make a difference.

I tried this little test with some Hindi unicode text:

import flash.utils.ByteArray;
import flash.utils.Endian;

var m:ByteArray = new ByteArray();
//m.endian = Endian.LITTLE_ENDIAN;
m.writeMultiByte( "हिन्दी", "unicode" );
m.position = 0;
for ( var i:int = 0; i < 6; i++ )
{
     trace( m.readShort() );
}

When I comment out the line my trace is:

14601
16137
10249
19721
9737
16393

When I use that line my trace is:

2361
2367
2344
2381
2342
2368

These are the correct codes for those characters. Of course, if I use

trace( m.readMultiByte( 2, "unicode" ) );

it traces out the proper sequence regardless of whether I have set the endianness of the array:
ि
(The fourth character is a magic character for joining characters together.)
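If you want to see the raw byte order directly, a small helper along these lines dumps the array as hex and sidesteps readShort()'s dependence on the endian setting (hexDump is just an illustrative name, not a built-in):

import flash.utils.ByteArray;

// Dump a ByteArray as space-separated hex bytes,
// e.g. "61 00 6C 00 6C 00 61 00 6E 00" for UTF-16LE "allan".
function hexDump( ba:ByteArray ):String
{
     var out:Array = [];
     var oldPos:uint = ba.position;
     ba.position = 0;
     while ( ba.bytesAvailable > 0 )
     {
          var b:uint = ba.readUnsignedByte();
          out.push( ( b < 0x10 ? "0" : "" ) + b.toString( 16 ).toUpperCase() );
     }
     ba.position = oldPos;   // restore the caller's read position
     return out.join( " " );
}

// trace( hexDump( m ) );   // shows the actual byte order on disk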

Community Beginner, Jan 17, 2010

Hi Rothrock,

Thanks very much for your reply! I did indeed try the 'unicode' option, but unfortunately to no avail. Likewise I've just tried the 'unicodeFFFE' option, just to see what the difference might be - I didn't see any. Basically I'm looking for the string "allan" to be saved in a file with the following hex pattern:

61 00 6C 00 6C 00 61 00 6E 00 

My text editor tells me that this is UTF-16 Little Endian (you were quite right about the acronym!). So this is what I have been trying:

import flash.net.FileReference;
import flash.utils.ByteArray;
import flash.utils.Endian;

var utf16:ByteArray = new ByteArray();
utf16.endian = Endian.LITTLE_ENDIAN;
utf16.writeMultiByte( "allan", "unicode" );

var fileRef:FileReference = new FileReference();
fileRef.save( utf16, fileName );   // fileName is a String defined elsewhere

I've tried various combinations of the endian type and the second parameter for writeMultiByte, but I just can't seem to get it - it always outputs 61 6C 6C 61 6E. I could of course add the zero padding in myself, but I can imagine that would break characters with a value > 255.

regards,

Allan

LEGEND, Jan 17, 2010

I'm totally out of my depth here, but that is how I learn stuff myself. So I'll flail around a bit more.

Using your example if I do this:

utf16.writeMultiByte( "allan", "utf-8" );

trace(utf16.length)

I get 5 which would be expected. But if I do:

utf16.writeMultiByte("allan","unicode");

I get 1, which I did not expect.

I know that Flash really loves UTF-8, so I wonder if somehow all the strings are being converted to UTF-8?
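For comparison, ByteArray does have an explicit UTF-8 writer, writeUTFBytes(), so it is easy to see what the UTF-8 path should produce (a quick sketch; the lengths in the comments are the expected UTF-8 byte counts):

import flash.utils.ByteArray;

var utf8Check:ByteArray = new ByteArray();
utf8Check.writeUTFBytes( "allan" );   // UTF-8, no length prefix
trace( utf8Check.length );            // expected: 5 - one byte per ASCII character

var utf8Check2:ByteArray = new ByteArray();
utf8Check2.writeUTFBytes( "हिन्दी" );
trace( utf8Check2.length );           // expected: 18 - three UTF-8 bytes per Devanagari code unit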

I went to Wikipedia and found the string "水z  " (water, z, G clef). When I tried to write it using writeMultiByte it broke at the "z".

In UTF-16LE it should be 34 6C, 7A 00, 34 D8, 1E DD, but it just breaks at 7A.

utf16.writeMultiByte("水z  ", "unicode");

trace(utf16.length) // returns 3

utf16.writeMultiByte( "水  z", "unicode" );

trace(utf16.length) // returns 7
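As an aside, AS3 strings are sequences of UTF-16 code units, so the G clef (U+1D11E) is already stored as a surrogate pair before writeMultiByte ever sees it; a quick check (the comments show the expected values):

var s:String = String.fromCharCode( 0xD834, 0xDD1E );   // G clef U+1D11E as a surrogate pair
trace( s.length );                                       // expected: 2 - length counts UTF-16 code units
trace( s.charCodeAt( 0 ).toString( 16 ) );               // expected: d834
trace( s.charCodeAt( 1 ).toString( 16 ) );               // expected: dd1e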

Also, with your example I'm only getting a length of 1. Are you actually getting the 61 6C 6C 61 6E back out?

So to my mind it looks like there is some bug in mixing ASCII-encodable characters with those that need higher code points, but I don't understand the standard well enough to be sure. Might be worth opening a bug for this...

LEGEND, Jan 17, 2010

The G clef doesn't show up here in the forums and causes all kinds of trouble in the ActionScript editor, but it did "work" as a UTF-16LE character that needs 4 bytes to encode...

Community Beginner, Jan 17, 2010

Hi Rothrock,

Thanks for the replies - I'm completely learning as I go here as well. I know what I want the final output file to look like, but I just can't figure out how to get it into that form...

According to Wikipdia ( http://en.wikipedia.org/wiki/ActionScript ), AS3 uses UTF-16 natively, and one would need to 'convert' to UTF-8. From the FileReference documentation, this would appear to be done automatically when saving a String, but the data is left as is when writing a byte array. The fact that the Adobe documentation called UTF-16 "unicode" and UTF-8 just "UTF-8", would appear to support that UTF-16 is native.

Regarding your length traces - I tried this:

var utf16:ByteArray = new ByteArray();
utf16.writeMultiByte( "allan", "unicode" );
trace( utf16.length );

and got a byte length of 5. It might be expected for this to be 10, if ByteArray counts in bytes rather than characters - presumably it does since it's 'raw data'.
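For what it's worth, ByteArray.length definitely counts bytes rather than characters; a tiny check:

import flash.utils.ByteArray;

var check:ByteArray = new ByteArray();
check.writeByte( 0x61 );     // one byte
check.writeShort( 0x006C );  // two more bytes
trace( check.length );       // expected: 3 - length is always a byte count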
So I'm all at sea, and getting more confused! I think you are right - bug time, as this just isn't looking right.
Regards,
Allan

*edit* Bug FP-3693 opened.

LEGEND, Jan 17, 2010

It is bizarre that you get 5. Yes, it should be 10, and it's even odder that I get only 1. So I'm going to go with the conclusion that they have some problems there.

I'm using CS4, publishing AS3 targeting Flash Player 10 (and I also tried 9). I'm on a Mac running OS X 10.6.2. I'm guessing you are on Windows?

I'll try it tomorrow on my work machine.

Community Beginner, Jan 17, 2010

Hi Rothrock,

I'm using a Mac for the run time, but the swf is being compiled in CS4 on Windows Vista.

Here is another odd thing:

On the character encoding page ( http://livedocs.adobe.com/flash/9.0/ActionScriptLangRefV3/charset-codes.html ) Adobe lists:

Character set: "Unicode" - Label: "unicode"

Character set: "Unicode (Big endian)" - Label: "unicodeFFFE"

The obvious inference from this is that "unicode" is little endian. However, looking at the Unicode web site ( http://unicode.org/faq/utf_bom.html#bom4 ), the BOM character is U+FEFF, so:

Bytes FF FE at the start of a file: little endian

Bytes FE FF at the start of a file: big endian!

Oops... Bug FP-3695 added.

Regards,

Allan

LEGEND, Jan 18, 2010

Yeah, on my Windows machine it also returns a length of 5, so there is something very wonky all around. Sorry we didn't get it working, but at least we figured out some stuff. Good luck.

Community Beginner, Jan 18, 2010

Hi Rothrock,

Thanks for the info - and for following up on this. We'll see how the bugs progress through Adobe, hopefully they will be resolved fairly easily!

Regards,

Allan

Community Beginner, Jan 23, 2010

I think I've worked out what is going on. AS3 uses UTF-16LE internally (which is documented), including surrogates etc. However, if the character code is less than U+FF then only one byte is used for the character rather than two! I can see that this is a good optimisation to make, given how common ASCII is. Having said that, my understanding is that this is not "true" UTF-16, where each character must be represented by at least two bytes. I'm sure this character encoding has a name, but I can't see it on a quick scan of the Unicode documentation.

For anyone interested, I've bashed together a function which will put a string into a true UTF-16 byte array:

// (needs import flash.utils.ByteArray in the containing class)
private function strToUTF16( str:String ):ByteArray
{
     var utf16:ByteArray = new ByteArray();
     var iChar:uint;
     var i:uint = 0, iLen:uint = str.length;

     /* BOM first: FF FE marks the stream as little endian */
     utf16.writeByte( 0xFF );
     utf16.writeByte( 0xFE );

     while ( i < iLen )
     {
          iChar = str.charCodeAt( i );
          trace( iChar );   // debug output

          if ( iChar < 0xFF )
          {
               /* one byte char: low byte then a zero high byte */
               utf16.writeByte( iChar );
               utf16.writeByte( 0 );
          }
          else
          {
               /* two byte char: low byte first (little endian) */
               utf16.writeByte( iChar & 0x00FF );
               utf16.writeByte( iChar >> 8 );
          }

          i++;
     }

     return utf16;
}
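A minimal usage sketch to go with it - assuming strToUTF16() is reachable from wherever the user click is handled, and that fileName holds the target file name - would be:

import flash.events.MouseEvent;
import flash.net.FileReference;

var fileRef:FileReference = new FileReference();

stage.addEventListener( MouseEvent.CLICK, function( e:MouseEvent ):void
{
     // The returned ByteArray already starts with the FF FE little-endian BOM,
     // so the saved file should open as UTF-16LE in a text editor.
     fileRef.save( strToUTF16( "allan" ), fileName );
} );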

Phew...

Regards,

Allan

LEGEND, Jan 23, 2010

Sweet. Thanks for sharing that.

Guest, Aug 11, 2010

Hi Allan,

Thanks a lot for sharing this. This just saved me several days of pounding my head against the wall. 😃

Guest, Jan 17, 2012

First of all, thanks for the solution. What I am trying to do is export DataGrid data into Excel using as3 excel. Since you pointed out the problem with how AS3 writes bytes for UTF-16, I applied this patch to as3 excel where it writes the bytes. The issue is that as3 excel also reads these bytes back, as it maintains a ByteArray stream for all DataGrid rows and then writes the .xls file in one go. I think the reading logic also needs to be fixed, but I'm not sure how. Any help will be much appreciated.
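If it helps, a sketch of the reading side - essentially the inverse of strToUTF16() above, assuming the bytes are plain UTF-16LE with the FF FE BOM that function writes - could look like this:

import flash.utils.ByteArray;

function utf16LEToStr( ba:ByteArray ):String
{
     var out:String = "";
     ba.position = 0;

     // Skip the FF FE BOM (little endian) if present.
     if ( ba.bytesAvailable >= 2 )
     {
          var b0:uint = ba.readUnsignedByte();
          var b1:uint = ba.readUnsignedByte();
          if ( !( b0 == 0xFF && b1 == 0xFE ) )
          {
               ba.position = 0;   // no BOM: rewind and treat everything as data
          }
     }

     while ( ba.bytesAvailable >= 2 )
     {
          var lo:uint = ba.readUnsignedByte();
          var hi:uint = ba.readUnsignedByte();
          // Surrogate pairs pass straight through, since AS3 strings
          // are themselves sequences of UTF-16 code units.
          out += String.fromCharCode( ( hi << 8 ) | lo );
     }
     return out;
}

// trace( utf16LEToStr( strToUTF16( "allan" ) ) );   // expected: "allan"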
