Participating Frequently
March 19, 2009
Question

Problem Parsing Large File

  • 5 replies
  • 861 views
Hi all,

I'm trying to parse a large file (300 MB, approx. 1 million records) for a project and insert the data into a database. I tried reading it using cffile and it bombed out immediately because there wasn't enough heap memory (I think?). So I tried using some Java I found to read the file one line at a time. I can get maybe halfway through it before the JVM memory maxes out (according to Server Monitor) and I get the following error:

java.lang.OutOfMemoryError: GC overhead limit exceeded

I know it has something to do with the way Java manages memory, but it's beyond me for the moment. There's really nothing too fancy in what I'm doing: just reading through the file line by line, assigning the data I need to variables based on a record ID code, and inserting/updating it in a database. I tried tweaking it to figure out what's eating up the memory, and I think it has something to do with the database inserts/updates. Is there another or better way I should be doing this? Do I need to adjust the memory somehow?

The code I'm using to read the file (Java I/O classes via createObject) is:

FileName = "D:/XXX";
FileIO = createObject("java","java.io.FileReader").init(FileName);
LineIO = createObject("java","java.io.BufferedReader").init(FileIO);
// readLine() returns Java null at EOF, which leaves "line" undefined in CF
line = LineIO.readLine();
while (structKeyExists(variables, "line")) {
    // ...process the line here...
    line = LineIO.readLine();
}
LineIO.close();

The server is a Quad Core Xeon, 4 GB of RAM, Windows Server 2003 SP2, CF 8 Enterprise.

Current Settings in Admin:
Max JVM Heap Size: 1280 MB
Max Perm Size: 256 MB
Max Size of Post Data: 512 MB
Request Throttle Memory: 1200 MB

I hope that's enough info to start.

Thank you


    Participating Frequently
    March 20, 2009
    Thanks Adam, I'll give the cfqueryparam tags a try.

    As a workaround in the interim, I had it read through the file without the db inserts and write out 4 smaller files that ColdFusion could digest.
    Inspiring
    March 19, 2009
    > somehow. I'm only doing inserts or updates, but no selects. I guess its my
    > lack of understanding how the internals of this version of CF work, but I'm
    > kind of stumped why simple inserts or updates would eat up all the memory like
    > a leak. Any ideas?

    It's difficult to tell without seeing your code.

    --
    Adam
    Participating Frequently
    March 19, 2009
    quote:

    Originally posted by: Newsgroup User

    It's difficult to tell without seeing your code.

    I'm not sure which portion you need, but here are some examples of the SQL Inserts and Updates. They're pretty simple:

    <cfquery name="PutRec" datasource="XXX">
    INSERT INTO XXXLoad
    ( CRef,Amount,CName,CommRate,IntRate )
    VALUES
    ( '#fileno#',#amt#,'#name1#','#thecommrate#','#theintrate#' )
    </cfquery>

    <cfquery name="PutXXX" datasource="XXXData">
    INSERT INTO XXXData
    ( FileNo,FFile,MFile,FID,FmID,CType,FDate,CName,Comm,IntRate )
    VALUES
    ( '#fileno#','#ffile#','#mfile#','#fid#','#fmid#','#ctype#','#xfdate#','#name1#','#comm#','#intrate#' )
    </cfquery>

    <cfquery name="UpdRec2" datasource="XXX">
    UPDATE xxxload
    SET dname = '#dname#', address1 = '#addr1#', city = '#dcity#', state = '#dstate#',
    zip = '#dzip#', dphone = '#dphone#', ssn = '#Trim(dssn)#', dob = '#thedob#',
    status = '#astatus#'
    WHERE cref = '#fileno#'
    </cfquery>

    <cfquery name="UpdXXX6" datasource="XXXData">
    UPDATE xxxdata
    SET dno = '#dno#', batchid = '#batchid#', batchno = '#batchno#', cid = '#cid#',
    pid = '#pid#', gp1 = '#gp1#', gp2 = '#gp2#', gp3 = '#gp3#', gp4 = '#gp4#',
    gp5 = '#gp5#', gp6 = '#gp6#', officenum = '#officenum#'
    WHERE fileno = '#fileno#'
    </cfquery>
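    For reference, a sketch of the first insert above rewritten with cfqueryparam tags (as suggested elsewhere in the thread); the cfsqltype values are guesses at the column types:

    ```cfml
    <cfquery datasource="XXX">
    INSERT INTO XXXLoad ( CRef, Amount, CName, CommRate, IntRate )
    VALUES (
        <cfqueryparam value="#fileno#" cfsqltype="cf_sql_varchar">,
        <cfqueryparam value="#amt#" cfsqltype="cf_sql_numeric">,
        <cfqueryparam value="#name1#" cfsqltype="cf_sql_varchar">,
        <cfqueryparam value="#thecommrate#" cfsqltype="cf_sql_varchar">,
        <cfqueryparam value="#theintrate#" cfsqltype="cf_sql_varchar">
    )
    </cfquery>
    ```

    Besides guarding against SQL injection, parameterized queries let the db reuse the statement plan across the million inserts.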
    Participating Frequently
    March 19, 2009
    Thanks guys, I'll give that a try. I would import it directly, but unfortunately much of the data needs to be formatted, and the records are variable length.

    One thing I did try is disabling the database transactions, and it ran just fine. This leads me to believe the memory is getting held up with those somehow. I'm only doing inserts or updates, no selects. I guess it's my lack of understanding of how the internals of this version of CF work, but I'm kind of stumped why simple inserts or updates would eat up all the memory like a leak. Any ideas?
    Inspiring
    March 19, 2009
    if you are always 100% sure your file contains valid data that does not
    require any parsing/massaging to make it conform to db requirements, you
    are better off using your db's built-in function(s) to load the data
    into the table.

    cf8 has added the ability to loop over a file one line at a time, and the
    docs claim it does not load the whole file into memory.

    since you are on cf8 enterprise you can do:
    <cfloop file="abs-path-to-file" index="line">
    to iterate over your file line-by-line.

    make sure you set a large enough request timeout for the page doing the
    looping!
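
    Putting that together, a minimal sketch (the timeout value is an arbitrary placeholder; the path is the one from the original post):

    ```cfml
    <!--- give the request plenty of time for ~1 million lines --->
    <cfsetting requesttimeout="3600">
    <cfloop file="D:/XXX" index="line">
        <!--- parse #line# and run the insert/update here --->
    </cfloop>
    ```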

    Azadi Saryev
    Sabai-dee.com
    http://www.sabai-dee.com/
    March 19, 2009
    Hi,

    This may not be the most efficient, but I've found this a pretty stable solution when I parse files.

    <cfset x = selectedfile>

    <cfloop file="#x#" index="fileLine">
        <!--- insert parsing logic here --->
    </cfloop>

    You could also add a MOD check to your loop so that every 25 lines or so it flushes the accumulated rows to the db in one batch. That way you're not maxing out your strings/memory.
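
    A minimal sketch of that batching idea, reusing the selectedfile variable from above (the 25-line batch size is arbitrary):

    ```cfml
    <cfset batch = arrayNew(1)>
    <cfloop file="#selectedfile#" index="fileLine">
        <cfset arrayAppend(batch, fileLine)>
        <!--- flush the buffered lines to the db every 25 lines --->
        <cfif arrayLen(batch) EQ 25>
            <cftransaction>
                <cfloop array="#batch#" index="rec">
                    <!--- parse #rec# and run the insert/update here --->
                </cfloop>
            </cftransaction>
            <cfset batch = arrayNew(1)>
        </cfif>
    </cfloop>
    <!--- don't forget to flush any leftover lines after the loop --->
    ```

    Grouping each batch in one cftransaction means one commit per 25 rows instead of one per row.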

    cfwild