à and XML

Report · Aug 10, 2009

I am having an issue with accent grave.

If I replace à with &agrave;, it inserts in the sql table properly, but the xml will show &amp;agrave;. So if I apply a style sheet to the xml, the resulting html will actually read &agrave; instead of à.

Now if I don't replace à with anything, the html will be fine but the xml will show a �.

Is there a way to force the xml NOT to escape the & when it is part of another escape (rather than just & on its own)?

-Robert

Message was edited by: robs67

Testing some more revealed that the issue seems to occur when the offending character is grabbed from a sql table and put into an xml doc. If I do this:

<CFXML variable="MyXML" caseSensitive="yes">
  <TheXml>
    <testNode1>
      <cfoutput>#GET_oD.remarks#</cfoutput>
    </testNode1>
  </TheXml>
</CFXML>

then the xml will have a white question mark on a black diamond.

If I do this:

<CFXML variable="MyXML" caseSensitive="yes">
  <TheXml>
    <testNode1>
      <cfoutput>à</cfoutput>
    </testNode1>
  </TheXml>
</CFXML>

then the xml will be just fine.

I have "String Format: Enable High ASCII characters and Unicode for data sources configured for non-Latin characters" enabled in the CF Administrator, and the data type of the column in the ms sql table is nvarchar.

Any help would be appreciated as this is driving me nuts.

Report · Aug 12, 2009

In general, I would recommend persisting the data to the database in its original format whenever possible.

When you output the value, you can use the xmlFormat() function in CF, which might do the trick for you.

#xmlformat(GET_oD.remarks)#

Byron Mann

Software Architect

hosting.com | hostmysite.com

Report · Aug 13, 2009

You need to decide when you are going to do your encoding, and thereafter do it consistently.

Typically what I have done is:

1. Filter out any unwanted material from the user's input that might, for example, be part of an attack vector when the results are redisplayed.
2. Leave special characters as they are.
3. Use <cfqueryparam> to allow character strings to be safely inserted into the database no matter what they contain.
4. Upon display, use HTMLEditFormat() or its equivalent to translate special characters into their corresponding HTML escapes.

I have not been pleased with the various global tags that are available to perform this sort of escaping over large blocks of code. Maybe I haven't used them enough...
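
The workflow above (store the raw text, bind it as a parameter, escape only on output) can be sketched outside ColdFusion. This is a minimal Python illustration, not CFML. The in-memory SQLite table and the names orders, remarks and render_html are assumptions made up for the example; the ? placeholder stands in for <cfqueryparam> and html.escape stands in for HTMLEditFormat().

```python
import html
import sqlite3


def render_html(value: str) -> str:
    """Escape at display time only (hypothetical helper for this sketch)."""
    return html.escape(value, quote=True)


# Hypothetical table, used only to show the round trip.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (remarks TEXT)")

raw = 'voilà <b>"R&D"</b>'

# Bound parameter: the driver handles quoting, so the text is stored unchanged.
conn.execute("INSERT INTO orders (remarks) VALUES (?)", (raw,))

stored = conn.execute("SELECT remarks FROM orders").fetchone()[0]
shown = render_html(stored)

print(stored)  # the database still holds the original characters
print(shown)   # markup characters are escaped once, for display only
```

The accented character passes through untouched in this sketch; only the markup-significant characters are escaped, and only once, at the point of output.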

Report · Aug 13, 2009

Thank you for the responses. Unfortunately, when I use htmlEditFormat() or xmlFormat() and transform the xml into html via xsl, the ampersands are escaped again. So, for example, a " will be &amp;quot;.

I don't know xsl at all and the xsl was written by someone who has since died. Perhaps that's my issue. Could it be the xsl is escaping the ampersands when they shouldn't be?

-Robert
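
The double escaping described in this post can be reproduced in isolation. The following is a small Python sketch, not CFML and not the stylesheet in question. It only shows the general mechanism: escaping text that has already been escaped turns every & into &amp;, and an XML parser undoes exactly one level. The helper name text_after_parse and the sample string are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = 'He said "à"'

# First pass: the escaping an XML-formatting function is meant to apply once.
once = escape(raw, {'"': "&quot;"})
# Second pass: escaping the already-escaped text rewrites each & as &amp;.
twice = escape(once, {'"': "&quot;"})


def text_after_parse(escaped: str) -> str:
    """Wrap escaped text in an element and return what a parser reads back."""
    return ET.fromstring("<testNode1>" + escaped + "</testNode1>").text


print(once)
print(twice)
print(text_after_parse(once))   # one level of escaping is fully undone
print(text_after_parse(twice))  # a literal &quot; survives into the output
```

If the result of a transform shows literal entity text, one place to look is whether two steps in the chain each escape the same value.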

Report · Aug 15, 2009

What happens when you use UTF-8 encoding throughout, including at your database, and not replace any character entities?

Report · Aug 17, 2009

BKBK:

When I do that, the database and html are correctly displaying the character but the xml has the black/white diamond question mark.

Report · Aug 17, 2009

I should have also pointed out that when I do use xmlformat(), the character is then represented in the xml document in its hex format (and incorrectly in the html, where the & is escaped, as I said in my original post). I don't know if this matters though.

Report · Aug 17, 2009

"When I do that, the database and html are correctly displaying the character but the xml has the black/white diamond question mark."

The XML declaration at the top of the document should also contain encoding="UTF-8".

Report · Aug 18, 2009

The black and white diamond question mark is still in the xml document even with the header,

<?xml version="1.0" encoding="UTF-8"?>.

Some more info I noticed:

In Firefox, the xml document will display (with the black/white diamond),
but in IE, it won't display in the browser and generates the error,

"An invalid character was found in text content. Error processing resource", and yet when viewing the source in IE, the character is properly displayed.
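
One situation that produces this kind of parser error is a document whose declaration says UTF-8 while the bytes were actually written in a single-byte encoding. Whether that is what happens here is an assumption; the Python sketch below (not CFML) only shows that the same bytes are rejected under one declaration and accepted under another. The helper name parse_with_declared is made up for the example.

```python
import xml.etree.ElementTree as ET

# "à" written as a single ISO-8859-1 byte (0xE0), as a misconfigured writer might emit it.
LATIN1_A_GRAVE = "à".encode("iso-8859-1")


def parse_with_declared(encoding: str) -> str:
    """Parse the same bytes under a given declared encoding and return the text."""
    doc = (
        b'<?xml version="1.0" encoding="' + encoding.encode("ascii") + b'"?>'
        + b"<testNode1>" + LATIN1_A_GRAVE + b"</testNode1>"
    )
    return ET.fromstring(doc).text


try:
    parse_with_declared("UTF-8")
    utf8_ok = True
except ET.ParseError as err:
    # A lone 0xE0 byte is not a valid UTF-8 sequence, so a strict parser stops here.
    utf8_ok = False
    print("UTF-8 declaration rejected:", err)

latin1_text = parse_with_declared("ISO-8859-1")
print("ISO-8859-1 declaration parsed:", latin1_text)
```

A lenient viewer may show a replacement character where a strict parser raises an error, which would be consistent with the two browsers behaving differently.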

Report · Aug 18, 2009

I changed the encoding to ISO-8859-1 and it worked. The xml displays the character properly in both IE and Firefox.

Are there any ramifications to leaving the encoding ISO-8859-1?
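
As a rough illustration of what is at stake in that question, the Python sketch below (not CFML) compares how the two encodings represent à and shows one limitation of ISO-8859-1: it has no byte for characters outside its 256-character repertoire. The euro sign is used only as an example of such a character.

```python
a_grave = "à"

latin1_bytes = a_grave.encode("iso-8859-1")  # one byte
utf8_bytes = a_grave.encode("utf-8")         # two bytes

print(latin1_bytes.hex(), utf8_bytes.hex())

# Reading the ISO-8859-1 byte as if it were UTF-8 yields U+FFFD,
# the replacement character drawn as a question mark on a black diamond.
misread = latin1_bytes.decode("utf-8", errors="replace")
print(repr(misread))

# ISO-8859-1 cannot encode characters outside its repertoire.
try:
    "€".encode("iso-8859-1")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False
print("euro sign fits in ISO-8859-1:", euro_fits)
```

So declaring ISO-8859-1 works as long as every character in the data fits that encoding and the bytes really are written in it; text outside that range would need UTF-8 or numeric character references.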

Report · Aug 18, 2009

I would maintain the header, <?xml version="1.0" encoding="UTF-8"?>. I suspect this now boils down to a display issue. The final link in the chain might be the encoding in the view-menu of your browser. Change it to Unicode.

Report · Aug 19, 2009

There's some good material out there on this:

http://www.cs.tut.fi/~jkorpela/chars.html -- A Tutorial on Character-Code Issues

http://www.cs.tut.fi/~jkorpela/html/chars.html -- Using National and Special Characters in HTML

Although these documents are not fully up to date with regard to current implementations, they do give a readable explanation of the issues involved. The first document (Tutorial...) is particularly informative because it presents a list of "several things that you might see" and "what might have actually happened."

In my very limited experience, I've observed that it really depends ... not only on the character encoding ("UTF-8" is normally adequate since it can encode every Unicode character, ASCII included) ... but also on the font that has been selected and in some cases the optional configuration parameters of the user's own browser. And sometimes you've got to get down-and-dirty and look at the actual byte sequence that's coming across. Apparently "there's more than one way to do it."
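
Looking at the actual byte sequence can be as simple as the following Python sketch (not CFML). The function name first_invalid_utf8 and the sample documents are assumptions for illustration; in practice the bytes would come from the saved XML file or the HTTP response.

```python
from typing import Optional


def first_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# The same markup written with two different encodings.
good = "<testNode1>à</testNode1>".encode("utf-8")
bad = "<testNode1>à</testNode1>".encode("iso-8859-1")

for label, data in (("utf-8", good), ("iso-8859-1", bad)):
    offset = first_invalid_utf8(data)
    print(label, "first invalid offset:", offset)
    print(label, "bytes:", data.hex(" "))
```

In the hex output, à appears as c3 a0 when the document really is UTF-8 and as a single e0 when it is not, which makes a mismatch with the declared encoding easy to spot.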