May 1, 2009
Question

invalid character in XML

  • 3 replies
  • 7497 views

Hi,

I'm trying to parse this public comment feed, which had been working until recently -

http://www.ntia.doc.gov/broadbandgrants/btopcomments.xml

- when I started getting this error -

"An error occured while Parsing an XML document. An invalid XML character (Unicode: 0x14) was found in the element content of the document."

I'm trying to strip out the character using this replace command, but it doesn't seem to be working.

XMLText = rereplace(XMLText, chr(14),"","ALL");

Any ideas?

thanks!

This topic has been closed for replies.

BKBK
Community Expert
May 4, 2009
"An error occured while Parsing an XML document. An invalid XML character (Unicode: 0x14) was found in the element content of the document."
I'm trying to strip out the character using this replace command, but it doesn't seem to be working.
XMLText = rereplace(XMLText, chr(14),"","ALL");

The error reports the character in hexadecimal. 0x14 corresponds to decimal 20, so the character to strip is chr(20), not chr(14).
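A minimal sketch of the corrected call, assuming the goal is only to drop that one character. The sample string is made up for illustration; in the real case it would be the feed text in XMLText. Since the target is a single literal character, plain replace() is enough and no regular expression is needed.

```cfml
<cfscript>
// The error message gives the code in hex; inputBaseN converts "14" (base 16) to decimal 20.
badCode = inputBaseN("14", 16);

// Hypothetical sample text standing in for the downloaded feed.
sample = "before" & chr(20) & "after";

// Remove every occurrence of the offending character.
cleaned = replace(sample, chr(badCode), "", "ALL");

writeOutput(cleaned);
</cfscript>
```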

Inspiring
May 5, 2009

It looks like you have a UTF-8 encoded XML file that includes some characters from a different character set, specifically curly quotes. CF does not do a good job of converting text from non-Unicode to Unicode forms.

See this blog entry from Ray Camden for info and a possible fix.
http://www.coldfusionjedi.com/index.cfm/2006/11/2/xmlFormat-and-Microsofts-Funky-Characters
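Separately from the approach in that blog entry, here is a rough sketch that strips every control character XML 1.0 does not allow (everything below 0x20 except tab, line feed and carriage return) before parsing. It assumes the feed has already been saved locally as /mypath/btopcomments.xml and that the regex engine accepts \x hex escapes inside a character class. The variable name feedXml is made up for illustration.

```cfml
<!--- Read the locally saved copy of the feed as UTF-8 text. --->
<cffile action="read" file="/mypath/btopcomments.xml" variable="XMLText" charset="utf-8">

<!--- Drop control characters that XML 1.0 forbids.
      Tab (09), line feed (0A) and carriage return (0D) are legal and are kept. --->
<cfset XMLText = reReplace(XMLText, "[\x00-\x08\x0B\x0C\x0E-\x1F]", "", "ALL")>

<!--- Parsing should no longer fail on 0x14 if that was the only problem. --->
<cfset feedXml = xmlParse(XMLText)>
```

This only removes the offending characters. It does not try to guess what the original character (for example a curly quote) was meant to be.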

Inspiring
May 2, 2009

Hi, Wingo,

I was perusing the XML file this morning after a bit of Googling and running some failed tests yesterday afternoon (I was able to get the same error you got on both CF 8.0.1 and Railo 3.1). I think I might see the problem.

I stripped out all of the Base64 code from the <content> elements in the document (I first downloaded the XML file via CFHTTP and saved the contents to a local file) and it worked. I was able to read and parse the XML and then output the XML to the browser. No errors or glitches.

I spent a little time trying to determine if it was a particular Base64 section but it seemed like I needed to remove all of them. Admittedly, I kind of got lost in the XML when trying to remove the Base64 sections and add them back in to test various combinations and such!

The only thing that bugs me about this is that I wonder if you can really replace the offending character(s). If these "problematic" characters are in a section that's Base64, can you change them (i.e., find/replace) and still get the right content when the Base64 data is converted to its "real" format?

Might be worth spending some more time with the static XML version to see if there really are just one (or two) sections of Base64 content causing the issue (and not all of the Base64 sections).
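One way to narrow this down, sketched under the assumption that the offending character is chr(20) (0x14 from the error message) and that XMLText already holds the file contents: list where each occurrence sits, with some surrounding text, to see whether it falls inside a Base64 block or in ordinary element content. The variable names and the cap of 50 hits are arbitrary.

```cfml
<cfset pos = find(chr(20), XMLText)>
<cfset hits = 0>

<!--- Print up to 50 occurrences, each with about 40 characters of context on either side. --->
<cfloop condition="pos GT 0 AND hits LT 50">
    <cfset hits = hits + 1>
    <cfset ctxStart = max(pos - 40, 1)>
    <cfoutput>#pos#: #htmlEditFormat(mid(XMLText, ctxStart, 80))#<br /></cfoutput>

    <!--- Continue the search just past the current hit, unless we are at the end. --->
    <cfif pos LT len(XMLText)>
        <cfset pos = find(chr(20), XMLText, pos + 1)>
    <cfelse>
        <cfset pos = 0>
    </cfif>
</cfloop>
```

The Base64 alphabet has no control characters in it, so a 0x14 that shows up in the middle of a Base64 run is not valid Base64 data in the first place.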

May 3, 2009

Hmmm. Thanks for the thoughtful insight/exploration.

So if I removed the base64 sections, everything should work properly? Seems all of the base64 sections begin with "Content-Transfer-Encoding: base64" and end with the "</content>" closing tag. I'll try removing them and see if I can't get the feed working again - and report back.

Many thanks!

May 3, 2009

I seem to have gotten it, though admittedly all of the attachments are being removed... which is less than desirable.

Here is the code I'm using.

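<!--- Download the feed into /mypath/ and read it back as UTF-8 text. --->
<cfhttp url="http://www.ntia.doc.gov/broadbandgrants/btopcomments.xml" resolveurl="no" path="/mypath/" />

<cffile action="read" file="/mypath/btopcomments.xml" variable="XMLText" charset="utf-8">

<!--- Remove each attachment: everything from its Content-Type header up to,
      but not including, the closing </content> tag. The cap of 2000 passes
      guards against an endless loop. --->
<cfloop from="1" to="2000" index="i">
    <cfset start64 = find("Content-Type: application/", XMLText, 1)>
    <cfif start64 IS 0>
        <cfbreak>
    </cfif>

    <cfset end64 = find("</content>", XMLText, start64)>
    <!--- Stop if an attachment has no closing tag; the length would be negative,
          which is not a valid count for RemoveChars. --->
    <cfif end64 IS 0>
        <cfbreak>
    </cfif>
    <cfset length64 = end64 - start64>

    <!--- Debug output: position and size of the section being removed. --->
    <cfoutput>
    #start64# - #end64# -- #length64#<br />
    </cfoutput>

    <cfset XMLText = RemoveChars(XMLText, start64, length64)>
</cfloop>

<cffile action="write" file="/mypath/btopcomments-clean.xml" output="#XMLText#" charset="utf-8">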

Inspiring
May 1, 2009

Was that a dummy URL you posted?  Because when I pull up the document you linked to, I get the following HTML:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=windows-1252"></HEAD>
<BODY></BODY></HTML>

May 1, 2009

That's the real feed - though it seems to have stopped responding. Hmmmm. Thankfully I've got a copy of it locally - I'll attach a copy of it to this post.

May 1, 2009

A link to the feed can be found in the top right corner of this page -

http://www.ntia.doc.gov/broadbandgrants/comments.cfm