Hi,
neither LrFileUtils.readFile nor io:read can read a file larger than 100 MB in one chunk. Any idea how to read it in one piece?
rgds - wilko
Wilko,
I think trying to read this much in one chunk is a bad idea. Could you explain why you even need to do this?
To be honest, 100 MB is a small value. I want to calculate checksums of e.g. PSD and TIFF files. Of course reading files in smaller chunks is possible, but that is just a workaround.
rgds - wilko
In terms of performance, you'll most likely get the best performance reading in chunks much smaller than 100 MB, e.g. 1 MB or even smaller. You might see severely degraded performance trying to read an entire 500 MB file into memory.
In terms of programming convenience, of course, it's nice to read an entire file all at once into a string.
I spent several years working on hard-drive and disk-subsystem performance at IBM. Until you run out of memory, big is better, especially if you want to read a whole image in to manipulate it, in which case you'd better have the memory. I have one 600-ish MB image (uncompressed TIFF).
Chuck
Until you run out of memory, big is better.
I strongly suspect that isn't true of Lua in Lightroom, though running actual tests would be the best way to determine that. Allocating huge contiguous chunks of private memory can strain both the OS's virtual memory and the allocator of the garbage collector. Here's what Programming in Lua, 2nd Ed by Roberto Ierusalimschy says:
Usually, in Lua, it is faster to read a file as a whole than to read it line by line. However, sometimes we must face a big file (say, tens or hundreds of megabytes) for which it is not reasonable to read it all at once. If you want to handle such big files with maximum performance, the fastest way is to read them in reasonably large chunks (e.g., 8 Kbytes each).
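The read loop the book has in mind is roughly this (an illustrative sketch only; the buffer size and file name are placeholders):

local BUFSIZE = 2^13 -- 8 KB, as the book suggests
local f = assert( io.open( "big-image.tif", "rb" ) ) -- placeholder file name
while true do
    local block = f:read( BUFSIZE ) -- returns nil at end of file
    if not block then break end
    -- process the block here, e.g. feed it into a checksum
end
f:close()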
In this day of 64-bit OSs and 4 GB+ machines, 8K is electron-microscopic.
And as I said, if you are going to do image manipulation, you probably want the image in contiguous storage anyway, so you might as well read it in all at once.
And little tiny block sizes are REALLY bad if the data is on a network drive, because there will only be one little block in flight at a time. If you use a big block size, the network file system (CIFS/SMB, NFS, etc.) will have multiple packets in flight at a time, and it will go a lot faster. That's less of a problem with local disks, because the disks will do speculative read-ahead and the data will already be in the drive cache. I recently finished an analysis of a large product install process that worked reasonably well on local disks but abysmally over the network, even with a Gigabit Ethernet connection on the same subnet and a bad-boy file server. Turns out they were reading small blocks to get large files (and sometimes reading the file more than once).
I tested the above code with block sizes of 8K, 32K (my cluster size), and 10 MB, and they were all about the same (local disk). I used the "thousand-one, thousand-two, ..., finger-in-the-air" method for comparing the differences - bottom line: not much difference that I could tell.
PS - with the smaller block sizes I only yielded every 1000th time through the loop.
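In other words, the loop yields on a counter instead of on every block, along these lines (a sketch only; names are arbitrary):

local LrTasks = import 'LrTasks'

-- Sketch: read in small blocks but only yield every 1000th block.
local function drainFile( path, blkSize )
    local f = assert( io.open( path, 'rb' ) )
    local i = 0
    while true do
        local block = f:read( blkSize )
        if not block then break end
        i = i + 1
        if i % 1000 == 0 then
            LrTasks.yield() -- yielding every block would dominate the cost of tiny reads
        end
    end
    f:close()
    return i -- number of blocks read
end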
Granted, the above code only tested local-disk/lua-code transfer speed, since I wasn't holding the whole file in memory - that's a separate matter.
The proof is in the pudding...
Rob
The original poster was interested in computing checksums of large files, which doesn't require the entire file in memory.
Out of curiosity, I timed reading a very large file with various chunk sizes, ranging from 8K to 128M. This confirms Rob's quickie timings -- there is little difference in performance using chunk sizes from 8K through 2M. But larger than 2M and performance starts degrading seriously, as I suspected:
Chunk size | Seconds | Ratio |
---|---|---|
8K | 93 | 1.00 |
32K | 90 | 0.97 |
128K | 95 | 1.02 |
512K | 91 | 0.98 |
2M | 98 | 1.05 |
8M | 103 | 1.11 |
32M | 119 | 1.28 |
128M | 176 | 1.89 |
These times are an average of 3 runs, each run reading a 6 GB file, ensuring that Windows Vista 64 (with 6 GB of memory) wouldn't be able to cache the file in memory.
What I suspect is going on: At the OS level, Windows is reading from the disk into its cache in a uniformly large chunk size, regardless of the size passed to Lua's file:read(). But at the larger chunk sizes, the program incurs higher overhead allocating strings of the given chunk size, most likely because Lua memory allocation is optimized for small objects.
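For anyone who wants to reproduce this, the measurement amounts to something like the following (a rough sketch; the file path and chunk sizes are placeholders, and os.time() only gives one-second resolution, which is fine for runs of 90+ seconds):

-- Hypothetical timing harness: read the same big file with different chunk sizes.
local path = "D:/test/huge-file.bin" -- placeholder; use a file bigger than RAM
local chunkSizes = { 8*1024, 32*1024, 128*1024, 512*1024, 2^21, 2^23, 2^25, 2^27 }

for _, size in ipairs( chunkSizes ) do
    local f = assert( io.open( path, "rb" ) )
    local start = os.time()
    repeat
        local block = f:read( size ) -- discard the data; we only care about read time
    until not block
    f:close()
    print( string.format( "chunk %d bytes: %d seconds", size, os.difftime( os.time(), start ) ) )
end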
Thank you John.
Very useful information.
I don't know enough about Lua to comment about the performance hit at the largest block sizes, but I think you are right that modern OSs tend to be very smart and cache-y - so most reads at the Lua level are coming from the cache and not the disk. Certainly that's true for local disk access.
Next tests: string allocation performance benchmarks? - local-disk versus network files?
Rob
Hi,
please excuse that I was away for so long, but I had / have a few family problems. I haven't done much on my code, but I have some results.
johnrellis wrote:
The original poster was interested in computing checksums of large files, which doesn't require the entire file in memory.
Hmmm - my original question was how to read more than 100 MB and how to get rid of this Lua limitation. I would prefer to calculate just one checksum and not dozens.
However, currently my code reads a chunk of the file, renders a checksum, yields, and repeats those steps until the file is read completely. The code is not optimized for tiny chunks (string concatenations, fewer Task.yields, and so on). Here are my results:
So it seems you've got your answer, right? - read in chunks and accumulate checksum as you go.
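Something along these lines, I mean (just an untested sketch, to be run from inside an async task; note that LrMD5.digest hashes a complete string, so the "accumulate" part here combines per-chunk digests into one final digest - a made-up scheme, not a true MD5 of the whole file):

local LrMD5 = import 'LrMD5'
local LrTasks = import 'LrTasks'

-- Hypothetical: one checksum per file, built from per-chunk digests.
local function fileChecksum( path, blkSize )
    blkSize = blkSize or 32768
    local f, err = io.open( path, 'rb' )
    if not f then return nil, err end
    local digests = {}
    while true do
        local block = f:read( blkSize )
        if not block then break end
        digests[#digests + 1] = LrMD5.digest( block ) -- digest of this chunk only
        LrTasks.yield()
    end
    f:close()
    return LrMD5.digest( table.concat( digests ) ) -- digest of the concatenated chunk digests
end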
Regarding the spinoff issue - if you did want to hold it all in memory, say if it were an image that you wanted to manipulate - anybody know if there's a limit on string size?
-R
Hi,
areohbee wrote:
So it seems you've got your answer, right? - read in chunks and accumulate checksum as you go.
unfortunately not at all. Reading in chunks was what I did all the time (without concatenation of strings). We did a lot of performance testing (which I hate, because I have done it so often in the past) with no real result.
Currently I will go for chunks and lots of checksums, and I will do optimization somehow later.
I'm confused.
Can't you read in chunks but maintain only one checksum?
Rob
Here's something I wrote recently to copy a big file - could be adapted just for reading:
local LrFileUtils = import 'LrFileUtils'
local LrTasks = import 'LrTasks'

local __copyBigFile = function( sourcePath, destPath, progressScope )

    local fileSize = LrFileUtils.fileAttributes( sourcePath ).fileSize

    local g  -- pcall guard
    local s  -- source file handle
    local t  -- target file handle
    -- local blkSize = 32768 -- typical cluster size on large system or primary data drive.
    local blkSize = 10000000 -- 10 MB at a time - lua is fine with big chunks.
    local nBlks = math.ceil( fileSize / blkSize )
    local b  -- block just read
    local x  -- write/flush result or error message
    local e  -- open error message

    -- Note: io.open does not raise an error on failure, it returns nil plus a message,
    -- so check the returned handle as well as the pcall status.
    g, s, e = pcall( io.open, sourcePath, 'rb' )
    if not g then return false, s end
    if not s then return false, e end
    g, t, e = pcall( io.open, destPath, 'wb' )
    if not g then
        pcall( io.close, s )
        return false, t
    end
    if not t then
        pcall( io.close, s )
        return false, e
    end

    local done = false
    local m = 'unknown error'
    local i = 0
    repeat -- forever - until break
        g, b = pcall( s.read, s, blkSize )
        if not g then
            m = b
            break
        end
        if b then
            g, x = pcall( t.write, t, b )
            if not g then
                m = x
                break
            end
            i = i + 1
            if progressScope then
                progressScope:setPortionComplete( i, nBlks )
            end
            LrTasks.yield()
        else
            -- end of file reached: flush before closing.
            g, x = pcall( t.flush, t ) -- close also flushes, but I feel more comfortable pre-flushing and checking -
                -- that way I know if any error is due to writing or closing after written / flushed.
            if not g then
                m = x
                break
            end
            m = '' -- completed sans incident.
            done = true
            break
        end
    until false
    pcall( s.close, s )
    pcall( t.close, t )
    if done then
        return true
    else
        return false, m
    end
end
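To call it from a plugin you'd wrap it in an async task, something like this (an untested sketch; the paths and the progress-scope title are just placeholders):

local LrTasks = import 'LrTasks'
local LrFunctionContext = import 'LrFunctionContext'
local LrProgressScope = import 'LrProgressScope'

LrTasks.startAsyncTask( function()
    LrFunctionContext.callWithContext( 'copyBigFile', function( context )
        local scope = LrProgressScope { title = "Copying big file...", functionContext = context }
        local ok, msg = __copyBigFile( "/path/to/source.tif", "/path/to/dest.tif", scope )
        scope:done()
        if not ok then
            -- handle msg here, e.g. log it or show it to the user
        end
    end )
end )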