XML parsing huge file

Report · Nov 11, 2010

Hi,

I have a 36MB XML file I need to parse, and I'm new to XML.

I usually get a 200K file in CSV format from most of my clients, which they transfer into their account; I then simply update the MSSQL database from the CSV file at midnight on my server. But now I have 74 clients that have been grouped together, and they send me one XML file.

When I run it using the sample they gave me it works fine, but on the 36MB file I get a JRun error. Then I found out that:

Doesn't work on big files because it runs out of memory.

I need a way to parse that file using Java. I downloaded xmlsax.js, but I don't know how to use it to parse the file and then get my parsed variable back from it. Can anyone help me, please?

I got the file here: http://xmljs.sourceforge.net/website/sampleApplications-sax.html

Thank you

Report · Nov 11, 2010

<cffile> shouldn't struggle with a 36MB file, because that's not really a terribly big file. It's a swagload of XML, sure, but it's not "big" as far as files go.

If you're running out of memory just reading the file, I'd be looking at your jvm memory settings, in case they were "suboptimal".
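For anyone wanting to check what the JVM actually has to play with, here is a minimal sketch using only the standard `Runtime` API. The class name `HeapReport` is made up for illustration. The ceiling it reports is the one set by the `-Xmx` JVM argument; on a ColdFusion install that is usually found on the `java.args` line of `jvm.config`, but treat that location as an assumption and check your own setup.

```java
class HeapReport {

    // Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes).
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory(): the most heap the JVM will attempt to use (the -Xmx ceiling).
        System.out.println("max heap   (MB): " + toMegabytes(rt.maxMemory()));
        // totalMemory(): heap currently reserved by the JVM.
        System.out.println("total heap (MB): " + toMegabytes(rt.totalMemory()));
        // freeMemory(): unused portion of the currently reserved heap.
        System.out.println("free heap  (MB): " + toMegabytes(rt.freeMemory()));
    }
}
```

The numbers are for whichever JVM runs the class, so to see ColdFusion's own limits the same three calls would have to be made from inside the CF instance.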

The other thing I note is that you say you need to parse this with Java, but then go on to talk about a JavaSCRIPT parser. Which is it? Java and JavaScript are two completely unrelated things. Well they're related in they're both programming languages, but they're more different than the same other than that.

I think you better clarify your requirement here.

--
Adam

Report · Nov 11, 2010

I must admit my initial thought was that you're clearly doing something stupid here as there's no way CF can't parse a file that size. So I tried it myself.

50MB XML file - fileRead(), xmlParse(). No CFDUMP, just parse. CF shoots up from 500MB to 1100MB then explodes with an out of memory error.

That is truly tragic performance, I must say. Java libraries for the win, I'd suspect.

Report · Nov 11, 2010

50MB XML file - fileRead(), xmlParse(). No CFDUMP, just parse. CF shoots up from 500MB to 1100MB then explodes with an out of memory error.

But was it the fileRead() or the xmlParse() that did that? I mean... the former is not much use without the latter, but I suspect it's the xmlParse() that's doing it (and - wow - is it doing it), not the read.

--
Adam

Report · Nov 11, 2010

It is indeed the xmlParse() that's causing the issue. I'd imagine Adobe are trying to do something unnecessarily clever; you'll be able to get around this in Java. Probably not in JavaScript though.

Report · Nov 11, 2010

No, Adobe's not doing anything "clever" here. This is expected behavior. CF uses a DOM parser for its functionality. This provides a lot more functionality than a SAX parser, as it builds a traversable tree of nodes in memory and lets you manipulate them using random access instead of sequential access. But DOM parsers aren't designed to handle really large files, which is why we also have SAX parsers. SAX provides sequential access and much less functionality in general than DOM, but can handle really large files.
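To make the difference concrete, here is a small sketch that runs both kinds of parser from the standard JDK over the same XML. The class name, the method names and the `<clients>`/`<client>` element names are invented for illustration; nothing here is what CF does internally. The DOM half loads everything and then jumps straight to the second element; the SAX half just counts elements as they stream past and keeps nothing.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class DomVersusSax {

    // DOM: the whole document becomes a tree in memory, so any node can be
    // reached directly, in any order (random access).
    static String domTextOfNth(byte[] xml, String tag, int index) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
        return doc.getElementsByTagName(tag).item(index).getTextContent();
    }

    // SAX: the parser calls the handler once per element as it reads the
    // stream (sequential access). Only the counter is kept in memory.
    static int saxCount(byte[] xml, final String tag) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals(tag)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data, not the OP's real file.
        byte[] xml = "<clients><client>Acme</client><client>Globex</client></clients>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(domTextOfNth(xml, "client", 1)); // Globex
        System.out.println(saxCount(xml, "client"));        // 2
    }
}
```

For a real file on disk, both parsers also accept a `java.io.File` in place of the input stream.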

Dave Watts, CTO, Fig Leaf Software

http://www.figleaf.com/

http://training.figleaf.com/

Fig Leaf Software is a Veteran-Owned Small Business (VOSB) on

GSA Schedule, and provides the highest caliber vendor-authorized

instruction at our training centers, online, or onsite.

Read this before you post:

http://forums.adobe.com/thread/607238

Dave Watts, Eidolon LLC

Report · Nov 11, 2010

cffile isn't the problem; it is the parsing. I need another way to parse the file. I looked at xmlsax.js and it looks like it could solve my problem, but I don't know how to use it even after reading the documentation. So if anyone else uses it, please let me know how I can get the variable out of it and use it in ColdFusion.

If you have any other way I could parse this file, I would appreciate the information.

Report · Nov 11, 2010

The program you mentioned is a JavaScript program, not a Java one. You need a Java SAX parser. Java comes with a SAX parser. Google "Java SAX parser" for more information.

Dave Watts, CTO, Fig Leaf Software


Report · Nov 11, 2010

You might consider moving the XML processing and data import tasks off of ColdFusion and into MS SQL Server.

Some MS SQL options for importing data from XML:

http://msdn.microsoft.com/en-us/library/ms190936(SQL.90).aspx

Report · Nov 11, 2010

You might consider moving the XML processing and data import tasks off of ColdFusion and into MS SQL Server.

You want me to insert the entire 36M file into a field on my MSSQL server ?

Report · Nov 11, 2010

No that's not what he was getting at - doing that would get you nowhere, other than getting yourself slapped upside the head by any web developers in the vicinity.

Bob was implying that you use one of Microsoft's tools for importing directly from XML into MSSQL, which would be infinitely quicker:

http://msdn.microsoft.com/en-us/library/ms191184%28v=SQL.90%29.aspx

Report · Nov 11, 2010

Owainnorth wrote:
No that's not what he was getting at - doing that would get you nowhere, other than getting yourself slapped upside the head by any web developers in the vicinity.
Bob was implying that you use one of Microsoft's tools for importing directly from XML into MSSQL, which would be infinitely quicker:
http://msdn.microsoft.com/en-us/library/ms191184%28v=SQL.90%29.aspx

Agreed. I would investigate moving the import processes into SQL server.

Report · Nov 11, 2010

Have you tried to have xmlParse read in the file and not use cffile? I have processed XML files with CF that were over 100MB so I am not convinced that your file size is the issue.

--Dave

Report · Nov 11, 2010

Have you tried to have xmlparse read in the file and not use cffile?

I don't know how; that's what I want to know 🙂

Report · Nov 11, 2010

Did you start by reading the docs for xmlParse()?

http://help.adobe.com/en_US/ColdFusion/9.0/CFMLRef/WSc3ff6d0ea77859461172e0811cbec22c24-6e90.html

XmlParse(xmlText [, caseSensitive [, validator ]])

xmlText

Any of the following:

A string containing XML text.
The name of an XML file.
The URL of an XML file; valid protocol identifiers include http, https, ftp, and file.

[etc]

--

Adam

Report · Nov 11, 2010

As mentioned above, ColdFusion's XmlParse is the problem: it can't parse big files.

Report · Nov 11, 2010

I don't know what the actual size limitation is for XmlParse, or whether it's a factor of available memory. In any case, you should try those suggestions of loading the file directly in XmlParse without CFFILE first, before spending the time invoking a SAX parser.

Dave Watts, CTO, Fig Leaf Software


Report · Nov 11, 2010

Well, having eventually found the right link back to this thread having waded through all the links in Dave's signature (only joking Dave, it's a really reasonable length) I thought I'd post up my findings.

I don't really understand the differences between a DOM and a SAX parser, so I'll leave that for now. I tried the idea someone mentioned earlier:

1. f = fileRead(filepath)

x = xmlParse(f)

= 1.1GB RAM, 20.8s on a clean CF instance

2. x = xmlParse(filepath)

= 1.0GB RAM, 9.75s on a clean CF instance

So it looks like using xmlParse directly is slightly more efficient, but I still maintain it's really poor that CF can't process a 34MB file without exploding. Contrary to what the title may suggest 34MB is *not* a "huge" file, and there's no reason why CF shouldn't be able to handle it.

As the OP made the classic "Java = JavaScript" faux pas I'd guess no-one's going to be implementing any custom xml parsing solutions here, least of all me.

64bit CF & 64GB RAM...?

Report · Nov 12, 2010

Well, after that crack about my sig, I'm tempted to add more links to it. But I digress.

A DOM parser creates a memory structure that is significantly larger than the actual XML file. Here's a link describing the memory consumed by parsing an XML file with DOM.

http://www.cafeconleche.org/books/xmljava/chapters/ch09s05.html

A SAX parser doesn't create this memory structure, and as a result can be used to process very large files.

Of course, the question of "very large" is a relative one, so moving to a 64-bit environment may provide enough memory to process larger files.

Dave Watts, CTO, Fig Leaf Software


Report · Nov 12, 2010

Well, after that crack about my sig, I'm tempted to add more links to it. But I digress.

I think he does have a point, actually. Fair enough on emails, but is it really necessary for forum posts? I guess you want to get your identity out there.

But anyway.

Of course, the question of "very large" is a relative one, so moving to a 64-bit environment may provide enough memory to process larger files.

It's hardly a production server environment, but on my mighty Win Vista Home Prem 64-bit laptop running CF8.0.1 64-bit, with latest JDK, and giving the VM 1.5GB RAM (about as much as I can spare of the 4GB the machine has)... I could not parse a 50MB XML file. The JVM just ate all the RAM it had allocated, and then gave either an "out-of-memory" or "garbage collection overhead exceeded" exception. The XML was basically this:

<aaa>

<bbb>

</fff>

</ccc>

</bbb>

[repeat the <bbb> stuff until there's 50MB of it]

</aaa>

Not too complex.

So that ain't great.

--

Adam

Report · Nov 12, 2010

As for my sig, yeah, that's how I roll.

I'm not surprised about the XML though. If a DOM parser has to multiply the file size by ten in RAM, which is what that previous link indicated as a minimum, your 50 MB file is 500 MB in RAM.
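As a back-of-envelope check, the arithmetic can be written out. This is only a sketch: the class and method names are made up, the factor of ten is the rule of thumb quoted above rather than a measurement, and the "observed" figures are the rough 500MB-to-1100MB numbers reported earlier in this thread for a 50MB file (a run that ended in an out-of-memory error, so the real requirement was higher still).

```java
class DomMemoryEstimate {

    // Lower-bound estimate of DOM heap use: file size times an assumed
    // expansion factor. The factor is a rule of thumb, not a guarantee.
    static long estimatedDomMb(long fileMb, long factor) {
        return fileMb * factor;
    }

    // Expansion factor implied by an observed heap growth during a parse.
    static double observedFactor(long heapBeforeMb, long heapAfterMb, long fileMb) {
        return (heapAfterMb - heapBeforeMb) / (double) fileMb;
    }

    public static void main(String[] args) {
        System.out.println(estimatedDomMb(50, 10));        // 500
        System.out.println(estimatedDomMb(36, 10));        // 360
        // Heap went from about 500MB to about 1100MB before the parse failed.
        System.out.println(observedFactor(500, 1100, 50)); // 12.0
    }
}
```

So the one measurement we have already sits above the ten-times minimum, and that growth also includes the file contents held as a string, not just the tree.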

Dave Watts, CTO, Fig Leaf Software


Report · Nov 12, 2010

I'm not surprised about the XML though. If a DOM parser has to multiply the file size by ten in RAM, which is what that previous link indicated as a minimum, your 50 MB file is 500 MB in RAM.

Right. Last time I checked 500MB should fit into 1.5GB quite nicely 😉 The CF instance wasn't doing anything else except running that test script, so a lot of that 1.5GB should have been available.

If CF can't deal with a 50MB file - even if this is predictable given the parser they're using - perhaps they should roll-in a SAX parser option too. Because simply not being able to handle it doesn't seem like an appropriate state of affairs to me.

Yeah, sure, we can "roll our own" with Java; but the raison d'etre for CF is that with CF one doesn't have to do this.

I had a quick look-see at the SAX parse docs & examples last night, and it didn't seem completely obvious to me on first look (whilst watching TV, replying to other posts here, and testing the CFML code... so it's not like it had a large chunk of my attention), but it looks like one's approach to dealing with the XML that way might be quite different than with a DOM parser, so perhaps just having an optional argument on xmlParse() might not be all that would need to be done though.

EG:

public xml xmlParse(String xmlData, Boolean caseSensitive, String validator, String parser=[DOM|SAX])

But if it is that straight fwd (from the perspective of us using the CFML, not the Adobe bods developing the code under the hood ;-), it might be quite handy..?
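To show why the approach differs, here is a rough Java sketch of what SAX-style code tends to look like. Everything in it is an assumption for illustration: the class name `RecordHandler`, the helper `parseAll`, and the `<client>`, `<name>` and `<balance>` elements are invented, and this is not a proposal for how Adobe would implement it. The point is that the handler has to carry its own state between callbacks and hand each finished record on, because there is never a tree to query afterwards.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class RecordHandler extends DefaultHandler {
    private final String recordTag;                   // element that wraps one record
    private final Consumer<Map<String, String>> sink; // receives each finished record
    private Map<String, String> current;              // record being built, or null
    private final StringBuilder text = new StringBuilder();

    RecordHandler(String recordTag, Consumer<Map<String, String>> sink) {
        this.recordTag = recordTag;
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(recordTag)) {
            current = new LinkedHashMap<>();
        }
        text.setLength(0); // discard whitespace seen between elements
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text node, so accumulate.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals(recordTag)) {
            sink.accept(current); // hand the record on, then forget it
            current = null;
        } else if (current != null) {
            current.put(qName, text.toString().trim());
        }
        text.setLength(0);
    }

    // Convenience for small inputs: collects every record into a list.
    static List<Map<String, String>> parseAll(byte[] xml, String recordTag) throws Exception {
        List<Map<String, String>> out = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml), new RecordHandler(recordTag, out::add));
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data.
        byte[] xml = ("<clients>"
                + "<client><name>Acme</name><balance>12.50</balance></client>"
                + "<client><name>Globex</name><balance>7</balance></client>"
                + "</clients>").getBytes(StandardCharsets.UTF_8);
        for (Map<String, String> record : parseAll(xml, "client")) {
            System.out.println(record);
        }
    }
}
```

Collecting into a list as `parseAll` does is only sensible for small inputs. For a genuinely large file the sink would do something with each record straight away (a database insert, say) so that memory use stays flat however big the file gets.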

--
Adam

Report · Nov 12, 2010

Couldn't agree more, and to be honest I can't believe this hasn't come up before. To me, the thought that something like CF should have to be bypassed when you get to files of a few megs is utterly ridiculous. I haven't looked into the different methods of parsing XML as it's really not my thing, but are we saying that DOM parsing is necessary for CF to be able to perform the functions it does on the resulting XML object? Or does one create the same result, just through a different method?

Owain North

Code Monkey

Titan Internet Ltd

http://www.titaninternet.co.uk

Owain North is a mildly overweight computer programmer who likes to sit in the corner of a darkened room tapping away on his keyboard whilst wearing a massive set of headphones to avoid human contact where possible. He particularly likes to avoid natural light and salad.

In his spare time he likes to pet his dog and work on his track car: http://www.306gti6.com/forum/showthread.php?id=124722&page=1

The other day he went up to the toilets upstairs and there were no hand towels left! Bad times.

It's Filthy Friday, so we all got Dominos for lunch. Large (obviously) half and half Mighty Meaty and American Hot. Good it was, especially as one of the other guys didn't want his garlic & herb dip = win.

At the moment, he's having to look into WCF for a new project on server monitoring. He doesn't know anything about it yet but after a quick session on Amazon with the company credit card and some extortionate delivery fees he's well on his way to writing his first WCF service.

In case you're interested - in the end, he just had to dry his hands on his jeans.

Report · Nov 12, 2010

Owain North
Code Monkey
Titan Internet Ltd
http://www.titaninternet.co.uk
Owain North is a mildly overweight computer programmer who likes to sit in the corner of a darkened room tapping away on his keyboard whilst wearing a massive set of headphones to avoid human contact where possible. He particularly likes to avoid natural light and salad.
In his spare time he likes to pet his dog and work on his track car: http://www.306gti6.com/forum/showthread.php?id=124722&page=1
The other day he went up to the toilets upstairs and there were no hand towels left! Bad times.
It's Filthy Friday, so we all got Dominos for lunch. Large (obviously) half and half Mighty Meaty and American Hot. Good it was, especially as one of the other guys didn't want his garlic & herb dip = win.
At the moment, he's having to look into WCF for a new project on server monitoring. He doesn't know anything about it yet but after a quick session on Amazon with the company credit card and some extortionate delivery fees he's well on his way to writing his first WCF service.
In case you're interested - in the end, he just had to dry his hands on his jeans.

Hahahahahahahaha.

Nice. And I thought I liked to take the piss...

--
Adam

Report · Nov 12, 2010

Now that's a sig! A manly sig if I do say so myself. But you have to include it with every post, you know, or it doesn't count. And you might want to leave out the part about the toilets, because it reflects badly on Titan Internet Ltd if there are no clean hand towels.

Dave Watts, CTO, Fig Leaf Software

