Question

How to read files larger than 100mb

  • July 13, 2010
  • 2 replies
  • 4453 views

Hi,

neither LrFileUtils.readFile nor io:read can read a file larger than 100 MB in one chunk. Any idea how to read it in one piece?

rgds - wilko

This topic has been closed for replies.

2 replies

areohbee
Legend
July 18, 2010

Here's something I wrote recently to copy a big file - could be adapted just for reading:

local LrFileUtils = import 'LrFileUtils' -- for fileAttributes
local LrTasks = import 'LrTasks' -- for yield

local __copyBigFile = function( sourcePath, destPath, progressScope )

    local fileSize = LrFileUtils.fileAttributes( sourcePath ).fileSize

    local g -- pcall status
    local s -- source file handle
    local t -- target file handle
    local e -- open-error message
    -- local blkSize = 32768 -- typical cluster size on large system or primary data drive.
    local blkSize = 10000000 -- 10MB at a time - lua is fine with big chunks.
    local nBlks = math.ceil( fileSize / blkSize )
    local b -- block just read
    local x -- write/flush result
    -- note: io.open does not raise on failure - it returns nil plus a message,
    -- so check the returned handle as well as the pcall status.
    g, s, e = pcall( io.open, sourcePath, 'rb' )
    if not g then return false, s end
    if not s then return false, e end
    g, t, e = pcall( io.open, destPath, 'wb' )
    if not g then
        pcall( s.close, s )
        return false, t
    end
    if not t then
        pcall( s.close, s )
        return false, e
    end
    local done = false
    local m = 'unknown error'
    local i = 0
    repeat -- forever - until break
        g, b = pcall( s.read, s, blkSize )
        if not g then
            m = b
            break
        end
        if b then
            g, x = pcall( t.write, t, b )
            if not g then
                m = x
                break
            end
            i = i + 1
            if progressScope then
                progressScope:setPortionComplete( i, nBlks )
            end
            LrTasks.yield() -- let other tasks run between blocks
        else -- nil block means end of file
            g, x = pcall( t.flush, t ) -- close also flushes, but I feel more comfortable pre-flushing and checking -
                -- that way I know if any error is due to writing or closing after written / flushed.
            if not g then
                m = x
                break
            end
            m = '' -- completed sans incident.
            done = true
            break
        end
    until false
    pcall( s.close, s )
    pcall( t.close, t )
    if done then
        return true
    else
        return false, m
    end

end
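
For the original question, the routine above can be stripped down to a read-only loop. Here's a sketch in plain Lua (no Lightroom imports, so it runs outside the SDK; `consume` is a hypothetical callback applied to each chunk, e.g. a checksum update):

```lua
-- Sketch: chunked reader, adapted from the copy routine above.
-- Reads 'path' in blkSize pieces and hands each piece to 'consume',
-- so the whole file is never in memory at once.
local function readBigFile( path, blkSize, consume )
    blkSize = blkSize or 10000000 -- 10MB default, matching the copy routine
    local f, err = io.open( path, 'rb' )
    if not f then return false, err end
    while true do
        local b = f:read( blkSize )
        if not b then break end -- nil means end of file
        consume( b )
    end
    f:close()
    return true
end
```

Inside Lightroom, an LrTasks.yield() could go in the loop as in the copy routine.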

Vladimir Vinogradsky
Inspiring
July 13, 2010

Wilko,

I think trying to read this much in one chunk is a bad idea. Could you explain why you even need to do this?

Inspiring
July 15, 2010

To be honest - 100 MB is a small value. I want to calculate checksums from e.g. psd and tiff files. Of course reading files in smaller chunks is possible, but this is just a workaround.

rgds - wilko
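
A checksum can in fact be computed chunk by chunk, without ever holding the whole file. Here's a sketch in plain Lua - the byte-sum "checksum" is a stand-in for illustration only (a real digest such as MD5 would come from a C-backed library; the SDK's LrMD5 digests a string in one call, which is exactly what the size limit prevents here):

```lua
-- Sketch: incremental checksum over fixed-size chunks.
-- The checksum is a simple byte sum mod 2^32 - NOT cryptographic,
-- just a placeholder for a real incremental digest.
local function fileChecksum( path, blkSize )
    blkSize = blkSize or 65536
    local f, err = io.open( path, 'rb' )
    if not f then return nil, err end
    local sum = 0
    while true do
        local chunk = f:read( blkSize )
        if not chunk then break end -- end of file
        for i = 1, #chunk do
            sum = ( sum + chunk:byte( i ) ) % 4294967296
        end
    end
    f:close()
    return sum
end
```

Note the byte-at-a-time inner loop would be slow on multi-GB files in pure Lua; the point is only that the state carried between chunks is tiny.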

johnrellis
Legend
July 20, 2010

I tested the above code with block sizes of 8K, 32K (my cluster size), and 10MB, and they were all about the same (local disk). I used the "thousand-one, thousand-two, ..., finger-in-the-air" method for comparing the differences - bottom line: not much difference that I could tell.

PS - with the smaller block sizes I only yielded every 1000th time through the loop.
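
That throttled yield can be sketched as a small helper (the yield function is a parameter so this runs outside the SDK - inside Lightroom it would be LrTasks.yield; 1000 is the interval from the post):

```lua
-- Sketch: call yieldFn only every 'interval'-th invocation, so tiny
-- block sizes don't pay a per-iteration yield cost.
local function makeThrottledYield( interval, yieldFn )
    local i = 0
    return function()
        i = i + 1
        if i % interval == 0 then
            yieldFn()
        end
    end
end
```

The read loop then just calls the returned function once per block.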

Granted, the above code only tested local-disk/lua-code transfer speed, since I wasn't holding the whole file in memory - that's a separate matter.

The proof is in the pudding...

Rob


The original poster was interested in computing checksums of large files, which doesn't require the entire file in memory.

Out of curiosity, I timed reading a very large file with various chunk sizes, ranging from 8K to 128M.  This confirms Rob's quickie timings -- there is little difference in performance using chunk sizes from 8K through 2M.   But larger than 2M and performance starts degrading seriously, as I suspected:

Chunk size   Seconds   Ratio
8K           93        1.00
32K          90        0.97
128K         95        1.02
512K         91        0.98
2M           98        1.05
8M           103       1.11
32M          119       1.28
128M         176       1.89

These times are an average of 3 runs, each run reading a 6 GB file, ensuring that Windows Vista 64 (with 6 GB of memory) wouldn't be able to cache the file in memory.
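
The measurement loop can be sketched in plain Lua (os.clock measures CPU time, so for I/O-bound runs an external wall-clock timer would be closer to the figures above; the file name and chunk sizes below are placeholders):

```lua
-- Sketch: time one full sequential read of 'path' at a given chunk size.
local function timeRead( path, blkSize )
    local f = assert( io.open( path, 'rb' ) )
    local t0 = os.clock()
    repeat
        local b = f:read( blkSize )
    until not b -- nil at end of file
    local elapsed = os.clock() - t0
    f:close()
    return elapsed
end

-- Sweep the chunk sizes from the table above against a large test file.
for _, size in ipairs{ 8*1024, 32*1024, 128*1024, 512*1024,
                       2*2^20, 8*2^20, 32*2^20, 128*2^20 } do
    -- print( size, timeRead( 'big.bin', size ) ) -- 'big.bin' is a placeholder
end
```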

What I suspect is going on: At the OS level, Windows is reading from the disk into its cache in a uniformly large chunk size, regardless of the size passed to Lua's file:read().  But at the larger chunk sizes, the program incurs higher overhead allocating strings of the given chunk size, most likely because Lua memory allocation is optimized for small objects.