
XMLParse - Need to remove a Hex Character

New Here, Jan 16, 2018

Hello all,

I am running into a problem where I am parsing XML and running into an error:

An error occured while Parsing an XML document.An invalid XML character (Unicode: 0x17) was found in the element content of the document.

Therefore my approach is that I obviously have to remove some kind of odd character but I have no idea what it is. I would like to take the XML and perform something like the following:

<cfset XML = Replace(XML, "&","","All")>

Where "&" is the special character (0x17). Can anyone tell me what this character is, or has anyone run into this problem before? I wish I could post the XML, but I cannot get at it because the page is breaking.

Any help would be greatly appreciated!!!!

Thanks

Views

2.1K

LEGEND, Jan 16, 2018

CFDUMP the XML before parsing it.  You can display it, or email it to yourself, and look for the 0x17 (which is decimal "23").  Then you can decide how best to fix it.

V/r,

^ _ ^

LEGEND, Jan 16, 2018

OR.. another thought.  If the XML is in a string format before it becomes an XML object, and IF you are using CF v10 or later, you can run the string through canonicalize(), set both flags to false, and then parse the XML.  That should get rid of any and all hex characters.

HTH,

^ _ ^

New Here, Jan 16, 2018

Thanks HTH for the assistance!!!

So you think something like this would be the best approach:

<cfset XML = #canonicalize(getHTTPRequestData().content)#>

Thanks!!!!

LEGEND, Jan 16, 2018

  HTH = Hope This Helps.  My pseudonym is WolfShade.

And, yes, if the XML is in string format, you can shove it through canonicalize() to remove all hex encoding (and other encoding) before parsing it or turning it into an XML object.  Precisely as you have coded it (minus the hashtags #, as those are not necessary unless used within a string or as display.)

V/r,

^ _ ^

New Here, Jan 16, 2018

Ha sorry about that. Thank you WolfShade for your help. Truly appreciate it.

Tried putting this in: <cfset XML = canonicalize(getHTTPRequestData().content, true, true)>

But it then gave me the following error:

An error occured while Parsing an XML document.The entity name must immediately follow the '&' in the entity reference.

Any ideas?

Thanks!!!!

LEGEND, Jan 17, 2018

According to this SO thread, all ampersands need to be replaced with either &amp; or &#38; (I did not know that.)

I don't work with XML, much.

So.. this confuses me.  If your CF server is going to balk at 0x17 hex encoding, but it also balks at a plain ampersand, not sure what to do.  You could try using REPLACE() after canonicalize(), but I'm not sure that would work.  (shrug)  Give it a shot.

<cfset XML = canonicalize(getHTTPRequestData().content,true,true) />

<cfset XML = replace(XML,'&','&##38;','all') />

HTH,

^ _ ^

Community Expert, Jan 17, 2018

In XML, the ampersand is a metacharacter. It's used to introduce an entity reference. XML entities are pretty similar to HTML character entities, except only five of them are predefined (&amp;, &lt;, &gt;, &apos; and &quot;). Read all about them here:

List of XML and HTML character entity references - Wikipedia

My guess is that what you're getting is actually not well-formed XML, so CF isn't going to be able to parse it unless you manually strip or replace the problematic parts. The character mentioned by the original poster is U+0017, which is an "end of transmission block" character: http://www.fileformat.info/info/unicode/char/17/index.htm

So, maybe this character is at the end of the file and can be removed prior to XML parsing? Maybe not. I think it would be useful to actually provide a sanitized version of the file in question here for people to look at.
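For what it's worth, removing that one character before parsing could look something like the sketch below. It assumes the request body arrives as a string rather than binary data, and the variable names `rawXml` and `cleanXml` are only illustrative.

```cfml
<!--- Assumption: the request body is already a string, not binary data. --->
<cfset rawXml = getHTTPRequestData().content>

<!--- chr(23) is U+0017 (hex 0x17), the character named in the error message. --->
<cfset cleanXml = replace(rawXml, chr(23), "", "all")>

<cfset xmlDoc = xmlParse(cleanXml)>
```

This only hides the symptom; if the character carries meaning for the sender, dropping it silently may not be what you want.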

Dave Watts, CTO, Fig Leaf Software

Advocate, Jan 17, 2018

Please show the code you are using to get the XML data.

It will help remove the guesswork being used to try to help you.

Cheers

Eddie

New Here, Jan 17, 2018

Eddie,

Thank you for helping!!!

This is basically what we're doing:

<cfset XML = #getHTTPRequestData().content#>

<cfset xmlDoc = XmlParse(XML)>

It's breaking when XmlParse is called. I am trying to get the XML that is being posted, but it is proving to be a challenge.

Would it be beneficial to perform the following:

<cfset XML = Replace(XML, "&##23;","","All")>

<cfset XML = Replace(XML, "&##x17;","","All")>

Thank you so much for your assistance!!!!

Community Expert, Jan 17, 2018

If those are the only things that need to be replaced, it would be beneficial to replace them, as they're not allowed XML metacharacters and XML doesn't have HTML character entities. But (a) they might not be the only things that need to be replaced, and (b) they're presumably there for some reason. So be careful!
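One thing to keep in mind: a Replace() on the text "&#23;" only matches that literal five-character reference, while the parser error refers to the character itself. A rough sketch that handles both, and also widens the net to the other control characters XML 1.0 does not allow (everything below 0x20 except tab, line feed and carriage return), might look like this. The variable names are illustrative, and it assumes the request body is a string and that reReplace() accepts `\x` hex escapes.

```cfml
<cfset rawXml = getHTTPRequestData().content>

<!--- Remove raw control characters that XML 1.0 does not allow:
      0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F (tab, LF and CR are kept). --->
<cfset cleanXml = reReplace(rawXml, "[\x00-\x08\x0B\x0C\x0E-\x1F]", "", "all")>

<!--- Remove the same character written as a numeric reference (decimal 23 or hex 17).
      The pound sign is doubled because a single one starts a CFML expression. --->
<cfset cleanXml = reReplaceNoCase(cleanXml, "&##(x0*17|0*23);", "", "all")>

<cfset xmlDoc = xmlParse(cleanXml)>
```

Whether silently dropping these characters is acceptable depends on why the sender is emitting them in the first place.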

Dave Watts, CTO, Fig Leaf Software

Advocate, Jan 17, 2018

sdettling222  wrote

It's breaking when XmlParse is called.

Write getHTTPRequestData().content to a file before you try to parse it as XML. Then open that file in a text editor and review it for XML correctness.
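A minimal sketch of that step, assuming the temp directory is writable; the file name `request-body.xml` and the variable `dumpPath` are made up for illustration:

```cfml
<!--- Hypothetical dump location; any writable path would do. --->
<cfset dumpPath = getTempDirectory() & "request-body.xml">

<!--- Save the body as received, before any parsing is attempted. --->
<cfset fileWrite(dumpPath, getHTTPRequestData().content)>
```

If the body is a string rather than binary, the bytes on disk depend on the character set used for writing, so a hex viewer is more trustworthy than a plain text editor for spotting control characters.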

You can also run the file through an XML validator. There are plenty online.

Cheers

Eddie

New Here, Jan 18, 2018

Eddie/Dave/WolfShade,

Thank you all so much for your assistance. I was able to finally get the XML. Here it is:

<?xml version="1.0" ?>

<Event>

<ProductClass>O</ProductClass>

<Action>CON</Action>

<EventNumber>1</EventNumber>

<Significance>W</Significance>

<Phenomena>HZ</Phenomena>

<EventType>HZ</EventType>

<EventAction>W</EventAction>

<Sent>0001/01/01T0000Z</Sent>

<Expires>2018/01/18T1800Z</Expires>

<WFO>KJAN</WFO>

<LatLon></LatLon>

<CountyCodes>Morehouse, LA|West Carroll, LA|East Carroll, LA|Richland, LA|Madison, LA|Franklin, LA|Catahoula, LA|Tensas, LA|Concordia, LA</CountyCodes>

<FIPSCodes>LAZ007|LAZ008|LAZ009|LAZ015|LAZ016|LAZ023|LAZ024|LAZ025|LAZ026</FIPSCodes>

<Text>WWUS74 KJAN 181531

NPWJAN

URGENT - WEATHER MESSAGE

National Weather Service Jackson MS

931 AM CST Thu Jan 18 2018

ARZ074-075-LAZ007&gt;009-015-016-023&gt;026-MSZ018-019-025&gt;066-072&gt;074-

181800-

/O.CON.KJAN.HZ.W.0001.000000T0000Z-180118T1800Z/

Ashley-Chicot-Morehouse-West Carroll-East Carroll-Richland-

Madison LA-Franklin LA-Catahoula-Tensas-Concordia-Bolivar-

Sunflower-Leflore-Grenada-Carroll-Montgomery-Webster-Clay-Lowndes-

Choctaw-Oktibbeha-Washington-Humphreys-Holmes-Attala-Winston-

Noxubee-Issaquena-Sharkey-Yazoo-Madison MS-Leake-Neshoba-Kemper-

Warren-Hinds-Rankin-Scott-Newton-Lauderdale-Claiborne-Copiah-

Simpson-Smith-Jasper-Clarke-Jefferson-Adams-Franklin MS-Lincoln-

Lawrence-Jefferson Davis-Covington-Jones-Marion-Lamar-Forrest-

Including the cities of Crossett, North Crossett, Hamburg,

West Crossett, Dermott, Lake Village, Eudora, Bastrop, Oak Grove,

Epps, Lake Providence, Rayville, Delhi, Tallulah, Winnsboro,

Jonesville, Harrisonburg, Newellton, St. Joseph, Waterproof,

Vidalia, Ferriday, West Ferriday, Cleveland, Indianola,

Ruleville, Greenwood, Grenada, Vaiden, North Carrollton,

Carrollton, Winona, Eupora, Maben, Mathiston, West Point,

Columbus, Ackerman, Weir, Starkville, Greenville, Belzoni, Isola,

Durant, Tchula, Lexington, Pickens, Goodman, Kosciusko,

Louisville, Macon, Brooksville, Mayersville, Rolling Fork,

Anguilla, Yazoo City, Ridgeland, Madison, Canton, Carthage,

Philadelphia, Pearl River, De Kalb, Scooba, Vicksburg, Jackson,

Pearl, Brandon, Richland, Forest, Morton, Newton, Union, Decatur,

Conehatta, Meridian, Port Gibson, Crystal Springs, Hazlehurst,

Wesson, Magee, Mendenhall, Taylorsville, Raleigh, Bay Springs,

Heidelberg, Quitman, Stonewall, Shubuta, Fayette, Natchez, Bude,

Roxie, Meadville, Brookhaven, Monticello, New Hebron, Prentiss,

Bassfield, Collins, Mount Olive, Laurel, Columbia,

West Hattiesburg, Lumberton, Purvis, and Hattiesburg

931 AM CST Thu Jan 18 2018

...HARD FREEZE WARNING REMAINS IN EFFECT UNTIL NOON CST TODAY...

* TEMPERATURE...Temperatures will gradually come above freezing by

  noon today. High temperatures will range between 40-45 degrees.

* IMPACTS...Prolonged exposure could lead to hypothermia and may

  harm pets and livestock. Exposed plumbing is in danger of

  being damaged.

PRECAUTIONARY/PREPAREDNESS ACTIONS...

A Hard Freeze Warning means a prolonged period of sub-freezing

temperatures is ongoing. These conditions will be dangerous to

people and pets without adequate shelter and could damage exposed

pipes.

&&

$$

SKH</Text>

<SourceFile>NPWJAN  0118181800</SourceFile>

</Event>

Do you think it's breaking because there's no "</xml>" tag?

Thanks!!!

Advocate, Jan 18, 2018

No, the first line is the XML declaration (it looks like a processing instruction), not an element, so it doesn't require a closing tag.

The text you posted appears to be valid XML, but copying and pasting the text probably stripped the problem character(s).

You need to write the bytes received from the request to a file and then interrogate that file for problem characters. If you find any then you will need to let the source of the XML file know that they are producing invalid XML.
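As a rough illustration of interrogating the saved file, the loop below reports every character below 0x20 that XML 1.0 does not allow, together with its position. The path is hypothetical (wherever the request body was saved), and a character-by-character loop like this is only sensible for reasonably small documents.

```cfml
<!--- Hypothetical path: wherever the request body was saved. --->
<cfset dumpPath = getTempDirectory() & "request-body.xml">
<cfset rawXml = fileRead(dumpPath)>

<cfoutput>
<cfloop from="1" to="#len(rawXml)#" index="i">
    <cfset code = asc(mid(rawXml, i, 1))>
    <!--- Below 0x20, XML 1.0 only allows tab (9), line feed (10) and carriage return (13). --->
    <cfif code LT 32 AND NOT listFind("9,10,13", code)>
        Position #i#: 0x#uCase(formatBaseN(code, 16))#<br>
    </cfif>
</cfloop>
</cfoutput>
```

The reported positions can then be passed along to whoever produces the feed.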

Cheers

Eddie

Community Expert, Jan 18, 2018

Probably. It's also generally a good idea to use CDATA blocks within XML elements that contain large amounts of text, like so:

<Text><![CDATA[

... text goes here ...

]]></Text>

But I understand you may not have any control over what you get from someone else. Also, you may have lost the problem element during the copy and paste operation. I don't really see anything that's an obvious problem.
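To illustrate what a CDATA block buys you, here is a small sketch with a made-up inline document (not the poster's feed). Inside CDATA, a bare ampersand such as the "&&" separator does not need escaping. It does not help with U+0017, though: control characters that XML 1.0 disallows are still disallowed inside CDATA.

```cfml
<!--- Inline sample document, made up for illustration. --->
<cfset sample = "<Event><Text><![CDATA[Line one && $$ line two]]></Text></Event>">

<cfset xmlDoc = xmlParse(sample)>

<!--- xmlText holds the element's text content, CDATA included. --->
<cfoutput>#htmlEditFormat(xmlDoc.Event.Text.xmlText)#</cfoutput>
```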

Dave Watts, CTO, Fig Leaf Software

LEGEND, Jan 18, 2018

CDATA has been deprecated.  It's not dead, yet, but MDN warns that it could stop working at any time.

V/r,

^ _ ^

Community Expert, Jan 18, 2018

I don't know if that applies to XML, or just to the parsing DOM. In any event, as long as the original poster is consuming this XML within CF and not within JavaScript, I suspect it'll work out fine. https://en.wikipedia.org/wiki/Talk%3ACDATA#CDATA_Deprecated_in_DOM4? (Talk:CDATA - Wikipedia)

Dave Watts, CTO, Fig Leaf Software
