Anonymous

Question

invalid character in XML

May 1, 2009

Hi,

I'm trying to parse this public comment feed, which had been working until recently -

http://www.ntia.doc.gov/broadbandgrants/btopcomments.xml

- when I started getting this error -

"An error occured while Parsing an XML document. An invalid XML character (Unicode: 0x14) was found in the element content of the document."

I'm trying to strip out the character using this replace command, but it doesn't seem to be working.

XMLText = rereplace(XMLText, chr(14),"","ALL");

Any ideas?

thanks!

BKBK

Community Expert

"An error occured while Parsing an XML document. An invalid XML character (Unicode: 0x14) was found in the element content of the document."
I'm trying to strip out the character using this replace command, but it doesn't seem to be working.
XMLText = rereplace(XMLText, chr(14),"","ALL");

The hexadecimal 0x14 corresponds to decimal 20, hence to chr(20).
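A minimal sketch of the corrected call, assuming XMLText already holds the feed contents as a string, as in the original post. Matching a single character does not need a regular expression, so plain replace() is used here in place of rereplace(). The second line is an alternative that lets inputBaseN() do the hex-to-decimal conversion.

```cfml
<!--- 0x14 is hexadecimal for decimal 20, and chr() expects a decimal code. --->
<cfset XMLText = replace(XMLText, chr(20), "", "ALL")>

<!--- Alternative: convert the hex digits from the error message directly. --->
<cfset XMLText = replace(XMLText, chr(inputBaseN("14", 16)), "", "ALL")>
```

Either line on its own should remove the character named in the error message. If the parser then reports a different code, the same pattern applies with that value.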

JR__Bob__Dobbs-qSBHQ2

Inspiring

It looks like you have a UTF-8 encoded XML file that is trying to include some characters from a different character set, specifically curly quotes. CF does not do a good job of converting text from non-Unicode to Unicode forms.

See this blog entry from Ray Camden for info and a possible fix.
http://www.coldfusionjedi.com/index.cfm/2006/11/2/xmlFormat-and-Microsofts-Funky-Characters

craigkaminsky

Inspiring

Hi, Wingo,

I was perusing the XML file this morning after a bit of Googling and running some failed tests yesterday afternoon (I was able to get the same error you got on both CF 8.0.1 and Railo 3.1). I think I might see the problem.

I stripped out all of the Base64 code from the <content> elements in the document (I first downloaded the XML file via CFHTTP and saved the contents to a local file) and it worked. I was able to read and parse the XML and then output the XML to the browser. No errors or glitches.

I spent a little time trying to determine if it was a particular Base64 section but it seemed like I needed to remove all of them. Admittedly, I kind of got lost in the XML when trying to remove the Base64 sections and add them back in to test various combinations and such!

The only thing that bugs me about this is that I wonder if you can really replace the offending character(s). If these "problematic" characters are in a section that's Base64, can you change them (i.e., find/replace) and still get the right content when the Base64 data is converted to its "real" format?

Might be worth spending some more time with the static XML version to see if there really is just one (or two) sections of Base64 content that are causing the issue (and not all of the Base64 sections).
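On the find/replace question: the Base64 alphabet contains only A-Z, a-z, 0-9, "+", "/" and "=" (plus line breaks), so a control character such as 0x14 is never part of validly encoded data, and removing it should not change what the remaining Base64 text decodes to. The sketch below takes that approach and strips every control character below 0x20 that XML 1.0 disallows, rather than removing the attachments. It assumes XMLText holds the feed as a string and relies on the CFML engine passing it to Java's String.replaceAll; the names cleanXML and feed are hypothetical.

```cfml
<!--- XML 1.0 allows only tab (0x09), LF (0x0A) and CR (0x0D) below 0x20. --->
<!--- javaCast() ensures a java.lang.String so that replaceAll() is available. --->
<cfset cleanXML = javaCast("string", XMLText).replaceAll("[\x00-\x08\x0B\x0C\x0E-\x1F]", "")>

<!--- Parse the cleaned text; the <content> sections are left in place. --->
<cfset feed = xmlParse(cleanXML)>
```

This only addresses control characters. If the feed also contains bytes from another character set, as suggested earlier in the thread, that would need separate handling.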

Anonymous

Hmmm. Thanks for the thoughtful insight/exploration.

So if I removed the base64 sections, everything should work properly? Seems all of the base64 sections begin with "Content-Transfer-Encoding: base64" and end with the "</content>" closing tag. I'll try removing them and see if I can't get the feed working again - and report back.

Many thanks!

Anonymous

I seem to have gotten it, though admittedly all of the attachments are being removed... which is less than desirable.

Here is the code I'm using.

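<!--- Remove each attachment, from its Content-Type header up to (but not including) </content>. --->
<cfloop from="1" to="2000" index="i">
    <cfset start64 = find("Content-Type: application/", XMLText, 1)>
    <!--- Stop once no attachments remain. --->
    <cfif start64 IS 0>
        <cfbreak>
    </cfif>

    <cfset end64 = find("</content>", XMLText, start64)>
    <!--- Stop if the closing tag is missing; length64 would otherwise be negative. --->
    <cfif end64 IS 0>
        <cfbreak>
    </cfif>
    <cfset length64 = end64 - start64>

    <!--- Debug output: position and length of the section being removed. --->
    <cfoutput>
    #start64# - #end64# -- #length64#<br />
    </cfoutput>

    <cfset XMLText = RemoveChars(XMLText, start64, length64)>
</cfloop>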

insuractive

Inspiring

Was that a dummy URL you posted? Because when I pull up the document you linked to, I get the following HTML:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=windows-1252"></HEAD>
<BODY></BODY></HTML>

Anonymous

That's the real feed, though it seems to have stopped responding. Hmmmm. Thankfully I've got a copy of it locally. I'll attach a copy of it to this post.

btopcomments.zip

Anonymous

A link to the feed can be found in the top right corner of this page -

http://www.ntia.doc.gov/broadbandgrants/comments.cfm
