Known Participant
January 14, 2009
Question

Parsing XML

  • January 14, 2009
  • 11 replies
  • 2287 views
I'm a bit of a noob at parsing XML with ColdFusion and I could use some help with an issue.

I'm trying to parse the following XML file;
http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml

and I get the following error;

Character conversion error: "Illegal ASCII character, 0xe9" (line number may be too low).
The error occurred on line 9.

I can post my code if that would help, but it works fine when parsing a different XML file.

Any ideas?

11 replies

Sam_HamAuthor
Known Participant
January 15, 2009
I've done a bit of research and it would make sense to parse the XML file using SAX rather than ColdFusion's DOM parser, because of the size of the file and the processing speed.

I've had a go at doing this, but with little success. Because of my inexperience with XML I feel like I could be going down the wrong route.

There is no real documentation online about using ColdFusion with SAX.

Does anybody have any knowledge on this?
Participating Frequently
January 15, 2009
I borrowed some code from here:
http://www.javacommerce.com/displaypage.jsp?name=saxparser1.sql&id=18232

Here's your ColdFusion code:
<cfset myHandler = CreateObject("Java","MyHandler")>
<cfset myHandler.init()>
<cfset xmlcontent = myHandler.parseXmlToString("http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml")>
<cfset xmldoc = xmlparse(xmlcontent)>

And here's the MyHandler.java source. I have no idea what version of Java you're on, still being on ColdFusion 6, so I have no idea if this is going to compile for you or not. It runs fine for me on ColdFusion 8 with Java 1.5.
Inspiring
January 14, 2009
Kronin555 wrote:
>
> Note that it takes quite awhile. My guess is that CF uses a DOM parser versus
> a SAX parser. If you wanted to speed this up, you could probably use a Java SAX
> XML parser.
>

I wonder if one used a Java SAX XML parser, if one could just use the
DTD directly and not need to pull them down and concatenate them as
apparently one has to for ColdFusion.

Participating Frequently
January 14, 2009
Sam, don't just remove the DTD declaration. All of those entity declarations in the ___.ent file are needed to make any sense out of the file. If you remove the DOCTYPE line from the XML file, none of those entities will be resolved.
Sam_HamAuthor
Known Participant
January 14, 2009
Thanks Kronin,

I posted this before I noticed your last post... thanks very much for all the help.

I'm going to have to give this a go tomorrow, it's getting late.

I'll keep this topic updated :)
Sam_HamAuthor
Known Participant
January 14, 2009
Now we're getting somewhere...

I saved the XML file locally and removed line 2

<!DOCTYPE tpeg_document PUBLIC "-//EBU/tpegML/EN" "http://www.bbc.co.uk/travelnews/xml/tpegml_en/tpegML.dtd"[

The XML file now parses!

Now I need to work out how to handle the DTD from the live feed?
Participating Frequently
January 14, 2009
Oh, and I tested this all on ColdFusion 8.
Participating Frequently
January 14, 2009
When trying to do this:
<cfset myXML = XmlParse("http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml",true,"http://www.bbc.co.uk/travelnews/xml/tpegml_en/tpegML.dtd")>
I got this error:
Recursive entity reference "%tpegMLDataTypes". (Reference path: %tpegMLDataTypes -> %tpegMLDataTypes -> %tpegMLDataTypes)

So CF doesn't like this at all. To simplify the DTD, I pulled it all down and put it into one file, replacing the ENTITY lines that pull in the other files with the contents of those files.
For example, I changed this:
<!ENTITY % tpegMLDataTypes PUBLIC "-//EBU//DTD tpegML data types//EN" "tpegMLDataTypes.dtd">
%tpegMLDataTypes;
to this
<!-- ENTITY % tpegMLDataTypes PUBLIC "-//EBU//DTD tpegML data types//EN" "tpegMLDataTypes.dtd" -->
<!--============================================================-->
<!-- tpegML TPEG Traffic and Travel Information Common Data Types DTD release version -->
<!-- PUBLIC "-//EBU//DTD tpegML data types//EN" -->
<!--============================================================-->
<!-- time: Time in UTC, should be in the format of "YYYY-MM-DDThh:mm:ssZ". -->
<!ENTITY % time "CDATA">
<!-- intunti: Integer Unsigned Tiny, range 0..255 -->
<!ENTITY % intunti "CDATA">
<!-- intsiti: Integer Signed Tiny, range -128..127 -->
<!ENTITY % intsiti "CDATA">
<!-- intunli: Integer Unsigned Little, range 0..65535 -->
<!ENTITY % intunli "CDATA">
<!-- intsili: Integer Signed Little, range -32768..32767 -->
<!ENTITY % intsili "CDATA">
<!-- intunlo: Integer Unsigned Long, range 0..4294967295 -->
<!ENTITY % intunlo "CDATA">
<!-- intsilo: Integer Signed Long, range -2147483648..2147483647 -->
<!ENTITY % intsilo "CDATA">
<!-- numag: Integer from 0 to 3000000 (limited subset of these numbers as defined in TPEG Part 2 - SSF) -->
<!ENTITY % numag "CDATA">
<!-- short_string: String of up to 255 characters. -->
<!ENTITY % short_string "CDATA">
<!-- long_string: String of up to 65535 characters. -->
<!ENTITY % long_string "CDATA">
<!-- day_mask:Can select one or more days of the week to indicate repetition.
if (selector = 00000000) : no day selected
if (selector = 0xxxxxx1) : every Sunday
if (selector = 0xxxxx1x) : every Monday
if (selector = 0xxxx1xx) : every Tuesday
if (selector = 0xxx1xxx) : every Wednesday
if (selector = 0xx1xxxx) : every Thursday
if (selector = 0x1xxxxx) : every Friday
if (selector = 01xxxxxx) : every Saturday
-->
<!ENTITY % day_mask "CDATA">

You can get that file here: http://www.hubbach.com/tpegML.dtd
I will delete this file at some point, so don't write your code to use my file. Pull it down onto your system and use it locally. You might have to update this file if the BBC ever changes their DTD or entities.

Once you do that, this will work:
<cfset myXML = XmlParse("http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml",true,"http://www.myserver.com/tpegML.dtd")>
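
For anyone doing this from the Java side instead, the same "serve a local copy of the DTD" idea can be sketched with an EntityResolver. This is only a rough sketch and not what XmlParse does internally: the class name LocalDtdDemo, the example.invalid URL and the one-line DTD are all made up for illustration, and it assumes the JDK's default parser settings, which load external DTDs.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class LocalDtdDemo {
    // Hypothetical stand-in for a DTD saved locally.
    static final String LOCAL_DTD = "<!ENTITY greeting \"hello\">";

    // Hypothetical document that points at a remote DTD and uses one of its entities.
    static final String XML =
        "<!DOCTYPE doc SYSTEM \"http://example.invalid/doc.dtd\">"
        + "<doc>&greeting;</doc>";

    // Parses xml, answering any request for a ".dtd" system ID with localDtd
    // instead of fetching it over the network.
    static String parseWithLocalDtd(String xml, String localDtd) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Returning null tells the parser to fall back to its default resolution.
            builder.setEntityResolver((publicId, systemId) ->
                systemId != null && systemId.endsWith(".dtd")
                    ? new InputSource(new StringReader(localDtd))
                    : null);
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseWithLocalDtd(XML, LOCAL_DTD));
    }
}
```

The resolver keeps the DOCTYPE line in the feed untouched, so the entities still resolve, but the DTD itself comes from your own copy.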

Note that it takes quite a while. My guess is that CF uses a DOM parser rather than a SAX parser. If you wanted to speed this up, you could probably use a Java SAX XML parser.
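
To give an idea of what the SAX approach looks like, here is a minimal sketch using the SAX parser that ships with the JDK. It is an assumption-laden illustration, not tested against the BBC feed: the class name SaxElementCounter and the element names in the sample are hypothetical. The point is that the handler sees elements one at a time as the parser streams through the input, so no DOM tree is built in memory.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts how often each element name occurs.
class SaxElementCounter extends DefaultHandler {
    final Map<String, Integer> counts = new LinkedHashMap<>();

    // Called by the parser once for every opening tag it encounters.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        counts.merge(qName, 1, Integer::sum);
    }

    // Streams through xml and returns the element counts in first-seen order.
    static Map<String, Integer> count(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            SaxElementCounter handler = new SaxElementCounter();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.counts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample; a real feed would use its own element names.
        System.out.println(count("<feed><message/><message/></feed>"));
    }
}
```

From ColdFusion, a compiled class like this could in principle be loaded with CreateObject("Java", ...), provided it is on the server's classpath.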
Inspiring
January 14, 2009
Sam_Ham wrote:
>
> I'm still trying to find a workaround.
>
> There must be a way?
>

Well, it is just a text file. There is nothing preventing you from
processing it as plain text. You can either use regex or similar
string manipulation techniques to extract the desired information, or use
the string techniques to repair the XML and then process it as an XML file.
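
As a rough illustration of the plain-text route, this sketch pulls the names of named entity references (things of the form "&name;") out of a string with a regular expression. The class name EntityRefScanner and the sample markup are made up, and the pattern is a simplification that only covers ASCII entity names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityRefScanner {
    // Matches "&name;" where name starts with a letter or underscore.
    // Numeric references such as "&#160;" are deliberately not matched.
    private static final Pattern ENTITY_REF =
        Pattern.compile("&([A-Za-z_][A-Za-z0-9_.-]*);");

    // Returns the entity names found in text, in order of appearance.
    static List<String> entityNames(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = ENTITY_REF.matcher(text);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical fragment, not copied from the real feed.
        String sample = "<a>&rtm31_4;</a><b>&loc41_30; &#160;</b>";
        System.out.println(entityNames(sample));
    }
}
```

A list like this could then drive a lookup table or a search-and-replace pass before handing the repaired text to an XML parser.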
Sam_HamAuthor
Known Participant
January 14, 2009
Some good points there.

I understand that the XML document is not well-formed. I have already e-mailed the BBC about this, hoping they will address the issues.

Despite the issues, the XML file should still be usable, as there are web apps already using it to mash up data into Google Maps etc.

I'm still trying to find a workaround.

There must be a way?
Inspiring
January 14, 2009
Sam_Ham wrote:
> I'm a bit of a noob with parsing XML with coldfusion and I could use some help
> with an issue.
>
> I'm trying to parse the following XML file;
> http://www.bbc.co.uk/travelnews/tpeg/en/local/rtm/rtm_tpeg.xml
>

Firefox is complaining about undefined entities in the file. Looking at
the code I see things like: "&rtm31_4;" and "&loc41_30;". These look
like custom entities and there is no entity definition section to the
XML file needed to define them. I believe that would usually look
something like:

<!ENTITY nbsp "&#160;">
<!ENTITY copy "&#169;">
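
To show the effect of such declarations, here is a small Java sketch that parses the same element twice, once with the entity declared in an internal DTD subset and once without. The class name InternalEntityDemo and the replacement text are hypothetical (the entity name is borrowed from the feed purely as an example); it assumes the JDK's default DOM parser, which expands declared entities and rejects undeclared ones.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class InternalEntityDemo {
    // Declares the entity inline; the replacement text is a made-up placeholder.
    static final String DECLARED =
        "<!DOCTYPE doc [<!ENTITY rtm31_4 \"placeholder text\">]>"
        + "<doc>&rtm31_4;</doc>";

    // Same reference, but nothing declares the entity.
    static final String UNDECLARED = "<doc>&rtm31_4;</doc>";

    // Returns the root element's text, or empty if the document cannot be parsed.
    static Optional<String> textOf(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return Optional.of(doc.getDocumentElement().getTextContent());
        } catch (ParserConfigurationException | SAXException | IOException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println("declared:   " + textOf(DECLARED));
        System.out.println("undeclared: " + textOf(UNDECLARED));
    }
}
```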
Inspiring
January 14, 2009
> Character conversion error: "Illegal ASCII character, 0xe9" (line number may
> be too low).
> The error occurred on line 9.

The error message pretty much tells you what's wrong. There's a 0xE9
byte in the doc, which is not a legal ASCII character, so the parser can't
decode it and can't treat the doc as well-formed XML.
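
For what it's worth, the byte 0xE9 is "é" in ISO-8859-1 but falls outside the 7-bit ASCII range, which is one plausible reading of the error. The sketch below shows the difference using Java's strict charset decoders. The class name ByteE9Demo and the sample bytes are made up, and which encoding the feed actually uses is not something this sketch can tell you.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

class ByteE9Demo {
    // Decodes bytes strictly: any byte the charset cannot represent yields empty
    // instead of being silently replaced.
    static Optional<String> tryDecode(byte[] bytes, Charset charset) {
        try {
            return Optional.of(
                charset.newDecoder()
                       .onMalformedInput(CodingErrorAction.REPORT)
                       .onUnmappableCharacter(CodingErrorAction.REPORT)
                       .decode(ByteBuffer.wrap(bytes))
                       .toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: "caf" followed by the single byte 0xE9.
        byte[] sample = {'c', 'a', 'f', (byte) 0xE9};
        System.out.println("ISO-8859-1: " + tryDecode(sample, StandardCharsets.ISO_8859_1));
        System.out.println("US-ASCII:   " + tryDecode(sample, StandardCharsets.US_ASCII));
    }
}
```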

You should probably do two things:
1) if the parse fails for reasons like this, catch the exception in the UI
(or wherever appropriate) and put a warning message in along the lines of
"sorry, the traffic service is not currently available".
2) get in touch with the Beeb and tell them their developers need a lesson
in creating XML docs.
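
Point 1 could look something like the following in Java. This is just a sketch of the catch-and-fall-back idea: the class name TrafficFeedGuard, the method name and the sample inputs are hypothetical, and a real page would render the feed rather than return the root element's name.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class TrafficFeedGuard {
    static final String FALLBACK =
        "Sorry, the traffic service is not currently available.";

    // Returns the root element name if xml parses, otherwise the fallback message.
    static String rootNameOrFallback(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // The feed is broken or unreachable: show a friendly message instead of an error page.
            return FALLBACK;
        }
    }

    public static void main(String[] args) {
        System.out.println(rootNameOrFallback("<feed/>"));
        System.out.println(rootNameOrFallback("<feed>")); // unclosed tag, not well-formed
    }
}
```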

--
Adam