Problem Parsing Large File

New Here, Mar 18, 2009
Hi all,

I'm trying to parse a large file (300 MB, approx. 1 million records) for a project and insert the data into a database. I tried reading it using CFFile and it bombed out immediately because there wasn't enough heap memory (I think?). So, I tried using some Java I found to read the file one line at a time. I can get maybe halfway through it before the JVM memory maxes out (according to Server Monitor) and I get the following error:

java.lang.OutOfMemoryError: GC overhead limit exceeded

I know it has something to do with the way Java manages memory, but it's beyond me for the moment. There's really nothing too fancy in what I'm doing: just reading through the file line by line, assigning the data I need to variables based on a record ID code, and inserting/updating it into a database. I tried tweaking it to figure out what's eating up the memory, and I think it has something to do with the database inserts/updates. Is there another or better way I should be doing this? Do I need to adjust the memory somehow?

The Java I'm using to read the file is:

FileName = "D:/XXX";
FileIOClass = createObject("java", "java.io.FileReader");
FileIO = FileIOClass.init(FileName);
LineIOClass = createObject("java", "java.io.BufferedReader");
LineIO = LineIOClass.init(FileIO);
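For reference, one common shape for the rest of that read loop (a sketch only; the loop body and variable names are assumptions, not taken from the post) also closes the reader when done so the file handle is released:

```cfml
<cfscript>
    fileName = "D:/XXX";
    fileIO  = createObject("java", "java.io.FileReader").init(fileName);
    lineIO  = createObject("java", "java.io.BufferedReader").init(fileIO);
    line = lineIO.readLine();
    // readLine() returns Java null at end-of-file; CF turns a null
    // assignment into an undefined variable, hence the isDefined() test
    while (isDefined("line")) {
        // ... parse the line and do the db work here ...
        line = lineIO.readLine();
    }
    lineIO.close();
</cfscript>
```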

The server is a Quad Core Xeon, 4 GB of RAM, Windows Server 2003 SP2, CF 8 Enterprise.

Current Settings in Admin:
Max JVM Heap Size: 1280 MB
Max Perm Size: 256m
Max Size of Post Data: 512 MB
Request Throttle Memory: 1200 MB

I hope that's enough info to start.

Thank you
Guest, Mar 18, 2009
Hi,

This may not be the most efficient, but I've found this a pretty stable solution when I parse files.

<cfset x = selectedfile>

<cfloop file="#x#" index="fileLine">
Insert parsing logic here...
</cfloop>

You could also try a MOD check in your Java read loop, so that every 25 lines or so it updates the db. This way you're not maxing out your string/memory.
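A rough sketch of that MOD idea (the variable names and the batch size of 25 are assumptions, not from the post) could look like:

```cfml
<cfset counter = 0>
<cfset batch = []>
<cfloop file="#selectedfile#" index="fileLine">
    <cfset counter = counter + 1>
    <cfset arrayAppend(batch, fileLine)>
    <!--- flush to the db every 25 lines so the buffer stays small --->
    <cfif counter MOD 25 EQ 0>
        <!--- insert/update the 25 buffered records here --->
        <cfset batch = []>
    </cfif>
</cfloop>
<!--- remember to flush any leftover records after the loop --->
```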

cfwild
LEGEND, Mar 18, 2009
If you are always 100% sure your file contains valid data that does not require any parsing/massaging to make it conform to db requirements, you are better off using your db's built-in function(s) to load the data into the table.

CF8 added the ability to loop over a file one line at a time, and the docs claim it does not open the whole file in memory.

Since you are on CF8 Enterprise you can do:
<cfloop file="abs-path-to-file" index="line">
to iterate over your file line by line.

Make sure you set a large enough request timeout for the page doing the looping!
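Putting those two suggestions together, the page might look something like this (the file path and the timeout value are arbitrary placeholders):

```cfml
<!--- give the request plenty of time; 3600 seconds is an arbitrary choice --->
<cfsetting requesttimeout="3600">
<cfloop file="d:\data\bigfile.txt" index="line">
    <!--- CF8 reads one line per iteration, so the whole file
          is never held in memory at once --->
    <!--- parse #line# and run the inserts/updates here --->
</cfloop>
```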

Azadi Saryev
Sabai-dee.com
http://www.sabai-dee.com/
New Here, Mar 18, 2009
Thanks guys, I'll give that a try. I would import it directly, but unfortunately much of the data needs to be formatted and the records are variable length.

One thing I did try is disabling the database transactions, and it ran just fine. This leads me to believe that the memory is getting held up by those somehow. I'm only doing inserts and updates, no selects. I guess it's my lack of understanding of how the internals of this version of CF work, but I'm kind of stumped why simple inserts or updates would eat up all the memory like a leak. Any ideas?
LEGEND, Mar 19, 2009
> somehow. I'm only doing inserts or updates, but no selects. I guess its my
> lack of understanding how the internals of this version of CF work, but I'm
> kind of stumped why simple inserts or updates would eat up all the memory like
> a leak. Any ideas?

It's difficult to tell without seeing your code.

--
Adam
New Here, Mar 19, 2009
quote:

Originally posted by: Newsgroup User

It's difficult to tell without seeing your code.
I'm not sure which portion you need, but here are some examples of the SQL Inserts and Updates. They're pretty simple:

<cfquery name="PutRec" datasource="XXX">
INSERT INTO XXXLoad
( CRef,Amount,CName,CommRate,IntRate )
VALUES
( '#fileno#',#amt#,'#name1#','#thecommrate#','#theintrate#' )
</cfquery>

<cfquery name="PutXXX" datasource="XXXData">
INSERT INTO XXXData
( FileNo,FFile,MFile,FID,FmID,CType,FDate,CName,Comm,IntRate )
VALUES
( '#fileno#','#ffile#','#mfile#','#fid#','#fmid#','#ctype#','#xfdate#','#name1#','#comm#','#intrate#' )
</cfquery>

<cfquery name="UpdRec2" datasource="XXX">
UPDATE xxxload
SET dname = '#dname#', address1 = '#addr1#', city = '#dcity#', state = '#dstate#',
zip = '#dzip#', dphone = '#dphone#', ssn = '#Trim(dssn)#', dob = '#thedob#',
status = '#astatus#'
WHERE cref = '#fileno#'
</cfquery>

<cfquery name="UpdXXX6" datasource="XXXData">
UPDATE xxxdata
SET dno = '#dno#', batchid = '#batchid#', batchno = '#batchno#', cid = '#cid#',
pid = '#pid#', gp1 = '#gp1#', gp2 = '#gp2#', gp3 = '#gp3#', gp4 = '#gp4#',
gp5 = '#gp5#', gp6 = '#gp6#', officenum = '#officenum#'
WHERE fileno = '#fileno#'
</cfquery>
LEGEND, Mar 19, 2009
On Thu, 19 Mar 2009 21:05:04 +0000 (UTC), BrianTech3 wrote:

> quote:
> Originally posted by: Newsgroup User
> It's difficult to tell without seeing your code.
>
> I'm not sure which portion you need, but here are some examples of the SQL
> Inserts and Updates. They're pretty simple:
>
> <cfquery name="PutRec" datasource="XXX">
> INSERT INTO XXXLoad
> ( CRef,Amount,CName,CommRate,IntRate )
> VALUES
> ( '#fileno#',#amt#,'#name1#','#thecommrate#','#theintrate#' )

[etc]

That's what I thought.

Read up on <cfqueryparam>, and replace all those static values with bind
parameters.

Also... queries that don't return anything don't need name attributes. I'm
not sure whether that will have any impact, but every variable you set (and
then overwrite) will still be taking up heap space until a GC is called,
and running out of heap space seems to be your problem.

It's probably also worth using the server monitor to see if that reports
anything untoward. I have to concede I've never used it (the Server
Monitor), so can't be much help in that regard.

Get those <cfqueryparam> tags in first off, anyhow.
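For instance, the first insert quoted above, rewritten with bind parameters and without the unneeded name attribute (the cfsqltype values are guesses at the column types and would need checking against the actual schema):

```cfml
<cfquery datasource="XXX">
    INSERT INTO XXXLoad
        ( CRef, Amount, CName, CommRate, IntRate )
    VALUES
        ( <cfqueryparam value="#fileno#" cfsqltype="cf_sql_varchar">,
          <cfqueryparam value="#amt#" cfsqltype="cf_sql_numeric">,
          <cfqueryparam value="#name1#" cfsqltype="cf_sql_varchar">,
          <cfqueryparam value="#thecommrate#" cfsqltype="cf_sql_varchar">,
          <cfqueryparam value="#theintrate#" cfsqltype="cf_sql_varchar"> )
</cfquery>
```

Besides letting the db reuse one prepared statement for all million rows, this also closes the SQL-injection hole that the raw `'#fileno#'` interpolation leaves open.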

--
Adam
New Here, Mar 19, 2009
Thanks Adam, I'll give the cfqueryparam tags a try.

As a workaround in the interim, I had it read through the file without the db entries and write out 4 smaller files that ColdFusion could digest.