Re: Need to process 3 billion (yes billion!) rows

From: Andre Majorel (amajorel_at_teaser.fr)
Date: Sat, 17 Apr 2004 14:12:45 +0000 (UTC)

    On 2004-04-17, C Stabri <kyrkarsa@yahoo.com> wrote:

    > I have a flat text file which is 3 billion rows deep (220GB in size).
    > I need to process this file in the following way:
    >
    > 1. Sort it
    > 2. Perform some calculations on it by taking every row and calculating
    > a value based on values of the row above and the row below the current
    > row.

    Unless you have a machine with 220GB of memory, you'll need to
    split the file into manageable chunks, sort each chunk, and then
    merge the sorted chunks back together. Most sort(1) implementations
    (GNU sort included) will in fact do this on their own, spilling to
    temporary files; -T lets you point the temporaries at a filesystem
    with enough free space, and -m merges already-sorted files.
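
    A hedged sketch of the split/sort/merge approach, in case your
    sort(1) doesn't cope with the whole file at once. The file names
    and the 20-million-line chunk size are placeholders; setting
    LC_ALL=C avoids locale-aware collation and usually speeds sort up
    considerably.

        # split into ~20M-line chunks named chunk.aa, chunk.ab, ...
        split -l 20000000 data.txt chunk.
        for f in chunk.??; do
            # sort each chunk on its own, then drop the unsorted copy
            LC_ALL=C sort -o "$f.sorted" "$f" && rm "$f"
        done
        # merge the already-sorted chunks into one sorted file
        LC_ALL=C sort -m chunk.??.sorted > data.sorted
        rm chunk.??.sorted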

    > Please can you advise me what the best unix tools are for this job. A
    > korn script calling the read function or use awk or a combination.

    For such a large data set, the performance hit of using a shell
    script will be huge. On my machine, piping 3e9 lines into (while
    read n; do :; done) would take about 14 days. Awk might be
    workable, though: 3e9 lines into awk '{x = y; y = $0}' would
    take about 5 hours. But, depending on the sort of calculations
    you do, it may be faster overall to write a C program.
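
    For the row-above/row-below pass, the usual awk idiom is to keep
    the two previous lines in variables. A minimal sketch, assuming the
    sorted file is data.sorted, one numeric value per row, and that the
    "calculation" is just the mean of the two neighbouring rows (all of
    which are stand-ins for whatever you actually need):

        # prints each interior row followed by the mean of the rows
        # immediately above and below it; the first and last rows have
        # no complete pair of neighbours and are skipped
        awk 'NR >= 3 { printf "%s %g\n", prev1, (prev2 + $0) / 2 }
             { prev2 = prev1; prev1 = $0 }' data.sorted > results.txt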

    -- 
    André Majorel <URL:http://www.teaser.fr/~amajorel/>
    "Finally I am becoming stupider no more." -- Paul Erdös' epitaph
    
