UTF-8 and converting to HTML entities

Report · Jun 14, 2013

Hoping someone here can shed some light. While handling an export in my plugin, some of the metadata has characters encoded with UTF-8. I'd like to convert those to HTML entities. Anyone have sample code for this? I found this function http://lua-users.org/files/wiki_insecure/users/WalterCruz/htmlentities.lua and tweaked the bottom few lines to this:

return string.gsub( str, "[^a-zA-Z0-9 _]",
    function (v)
        if entities[v] then return entities[v] else return v end
    end)

What I get for a result is the original byte stream passed through without change. Am I missing something with basic Lua syntax or is there a subtlety with Lightroom itself?

Thanks,

db

Report · Jun 14, 2013

There are a number of issues with that sample code. I suggest that you first read this post about how LR Lua handles Unicode characters:

http://forums.adobe.com/message/3251706#3251706

And here's more general information about Lua and Unicode:

http://lua-users.org/wiki/LuaUnicode

Next, make sure your text editor is saving any code file in UTF-8 format. Otherwise, string literals may not get loaded properly.

The expression:

string.gsub( str, "[^a-zA-Z0-9 _]", function (...

won't work. The pattern [^a-zA-Z0-9 _] is matching a single 8-bit character. But the Unicode characters that are the keys of the "entities" table are in fact represented as multiple 8-bit characters in a Lua string. For example, the string '£' is actually a Lua string of length 2 (2 8-bit characters):

string.len ('£') => 2

string.byte ('£', 1, 1) => 194

string.byte ('£', 2, 2) => 163
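To check these values yourself, here is a small sketch. The helper name byte_list is made up for illustration, and the string is written with decimal escapes so the result doesn't depend on how your editor saves the file.

```lua
-- Hypothetical helper: list the byte values of a string, space-separated.
local function byte_list(s)
  local out = {}
  for i = 1, #s do
    out[#out + 1] = tostring(string.byte(s, i))
  end
  return table.concat(out, " ")
end

local pound = "\194\163"  -- UTF-8 encoding of U+00A3 (pound sign)
print(#pound)             -- 2
print(byte_list(pound))   -- 194 163
```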

I think you'll need to write two calls to string.gsub(), one that replaces Unicode characters whose UTF-8 encoding is 1 byte, and one for those that are multibyte. This paragraph from the above link suggests how to write those patterns:

Happily UTF-8 is designed so that it is relatively easy to count the number of unicode symbols in a string: simply count the number of octets that are in the ranges 0x00 to 0x7f (inclusive) or 0xC2 to 0xF4 (inclusive). (In decimal, 0-127 and 194-244.) These are the codes which can start a UTF-8 character code. Octets 0xC0, 0xC1 and 0xF5 to 0xFF (192, 193 and 245-255) cannot appear in a conforming UTF-8 sequence; octets in the range 0x80 to 0xBF (128-191) can only appear in the second and subsequent octets of a multi-octet encoding. Remember that you cannot use \0 in a Lua pattern.
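A rough sketch of the two-pass idea, assuming plain Lua 5.1 string functions only. The function name to_entities and both tables are hypothetical, and the tables hold only a few sample entries. The multibyte pattern matches one lead byte (194-244) followed by any continuation bytes (128-191), so it assumes well-formed input.

```lua
-- Hypothetical lookup tables; a real one would carry many more entries.
local ascii_entities = {
  ["&"] = "&amp;", ["<"] = "&lt;", [">"] = "&gt;", ['"'] = "&quot;",
}
local multibyte_entities = {
  ["\194\163"]     = "&pound;",  -- U+00A3
  ["\194\169"]     = "&copy;",   -- U+00A9
  ["\226\130\172"] = "&euro;",   -- U+20AC
}

local function to_entities(str)
  -- Pass 1: single-byte (ASCII) characters that need escaping.
  -- Done first so the "&" of entities added in pass 2 is not re-escaped.
  str = string.gsub(str, '[&<>"]', ascii_entities)
  -- Pass 2: whole multibyte sequences. Returning nil from the callback
  -- leaves an unknown character unchanged.
  str = string.gsub(str, "[\194-\244][\128-\191]*", function (ch)
    return multibyte_entities[ch]
  end)
  return str
end

print(to_entities("\194\163 5 & \194\169"))  -- &pound; 5 &amp; &copy;
```

The point is that the second pattern hands the callback a complete character, so the table lookup sees the same multibyte key it was built with.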

Report · Jun 17, 2013

Thanks, John.

I've done a lot of transcoding work in the past with UTF8 - just wasn't sure what Lightroom's capabilities were. Apparently nothing more than what Lua provides (i.e. zero). The background links you gave were helpful and led to a few code examples. This one seemed best, being the most succinct and clear: https://github.com/alexander-yakushev/awesompd/blob/master/utf8.lua. After including that file in my project and making a small change to my own code, things now work as expected (I use IntelliJ so yes, my Lua files were saving in UTF8). It remains to be seen though if it's better for maintainability to rely on an editor embedding UTF8 characters directly or if I should write them all as decimal byte escapes to prevent future issues, i.e. instead of the copyright symbol right in my code, using '\194\169' instead.
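For characters with no named entity in the table, a numeric character reference is one possible fallback. The sketch below is not taken from utf8.lua. The names codepoint and to_numeric_entities are hypothetical, and it assumes the input is already well-formed UTF-8. It also uses decimal escapes rather than literal characters, which sidesteps the editor-encoding question.

```lua
-- Decode one well-formed UTF-8 sequence (1 to 4 bytes) into its code point.
local function codepoint(ch)
  local b1 = string.byte(ch, 1)
  if b1 < 128 then return b1 end
  local cp
  if b1 >= 240 then cp = b1 - 240      -- 4-byte lead: 11110xxx
  elseif b1 >= 224 then cp = b1 - 224  -- 3-byte lead: 1110xxxx
  else cp = b1 - 192 end               -- 2-byte lead: 110xxxxx
  for i = 2, #ch do
    -- Each continuation byte contributes its low 6 bits.
    cp = cp * 64 + (string.byte(ch, i) - 128)
  end
  return cp
end

-- Replace every multibyte character with a decimal reference like &#169;
local function to_numeric_entities(str)
  return (string.gsub(str, "[\194-\244][\128-\191]*", function (ch)
    return string.format("&#%d;", codepoint(ch))
  end))
end

print(to_numeric_entities("\194\169 2013"))  -- &#169; 2013
```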

It's too bad Lr doesn't include UTF8 functions right in the LrUtils library. Seems useful and essential.

db

Report · Jun 17, 2013

Isn't the simplest solution just to export your HTML as UTF-8 and just add meta charset=UTF-8 tag?

Report · Jun 17, 2013

Simple but not standards-compliant. The problem is that most if not all modern browsers will properly render UTF-8 documents without problems but this is misleading/incorrect since the standard for most Latin-1 characters and special characters like ampersands, copyright, etc. requires the use of entity transcription, e.g. &copy;. If it's easy to "do the right thing," then I'll try to do that.

Report · Jun 17, 2013

Great. That code utf8.lua does indeed look simple, clear, and very useful.

I agree, it would be better for LR to include more UTF8 functions in LrStringUtils.

Report · Jun 17, 2013

Ah thank you, I meant LrStringUtils.

FWIW, I'd like to hear someone comment on the small function in that code called utf8len(). The code refers to a variable named "chars". I can't tell if that is actually uninitialized (and possibly a bug) or if there's a Lua idiom going on that I don't know about.

Report · Jun 17, 2013

I'd like to hear someone comment on the small function in that code called utf8len(). The code refers to a variable named "chars". I can't tell if that is actually uninitialized (and possibly a bug) or if there's a Lua idiom going on that I don't know about.

"chars" is definitely uninitialized within that file and never assigned. But as long as its value remains nil, utf8len() looks correct.

Perhaps the code involving "chars" is debugging. If "chars" is a non-nil number, and if the number of UTF-8 characters in the string is greater or equal to "chars", the result is the number of string bytes representing the first "chars" UTF-8 characters of the string. Can't see why that would be useful as written.
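For comparison, a character count with no "chars" logic at all can be sketched in a few lines using the counting rule quoted earlier in the thread. This is not the utf8len() from utf8.lua. The name utf8_length is made up, and it assumes well-formed UTF-8.

```lua
-- Count UTF-8 characters by counting every byte that is NOT a
-- continuation byte (128-191). Assumes the input is well-formed UTF-8.
local function utf8_length(s)
  local _, count = string.gsub(s, "[^\128-\191]", "")
  return count
end

print(utf8_length("\194\163"))         -- 1 (one character, two bytes)
print(utf8_length("a\226\130\172b"))   -- 3
```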

Report · Jun 17, 2013

Yeah, I can't figure it out. I think this is vestigial and I'm not going to go digging through the history to figure it out. I've stripped that out in my local copy. I also notice that utf8charbytes() assumes a well-formed utf8 string and isn't robust to bogus byte sequences. There's a potential array bounds problem too. Not hard to fix. Thanks for taking a look.

db
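A defensive version of that kind of helper might look roughly like this. It is a sketch, not the fix actually made in the local copy mentioned above. The name safe_char_bytes is hypothetical, i is assumed to be a positive index, and the check covers only lead-byte ranges, continuation bytes and truncation (not overlong forms or surrogates).

```lua
-- Return the byte length (1-4) of the UTF-8 character starting at index i,
-- or nil if i is past the end or the bytes there are not a plausible sequence.
local function safe_char_bytes(s, i)
  local b = string.byte(s, i)          -- nil when i is past the end of s
  if not b then return nil end
  local n
  if b < 128 then n = 1
  elseif b >= 194 and b <= 223 then n = 2
  elseif b >= 224 and b <= 239 then n = 3
  elseif b >= 240 and b <= 244 then n = 4
  else return nil end                  -- stray continuation or invalid lead byte
  for k = 1, n - 1 do
    local c = string.byte(s, i + k)    -- nil if the string is truncated
    if not c or c < 128 or c > 191 then return nil end
  end
  return n
end

print(safe_char_bytes("\194\163", 1))  -- 2
print(safe_char_bytes("\194", 1))      -- nil (truncated sequence)
```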
