Re: Need to process 3 billion (yes billion!) rows

From: Andre Majorel (amajorel_at_teezer.fr)
Date: 04/17/04

  • Next message: kissdadog: "!!!! A chance of a lifetime !!!"
    Date: Sat, 17 Apr 2004 14:12:45 +0000 (UTC)
    
    

    On 2004-04-17, C Stabri <kyrkarsa@yahoo.com> wrote:

    > I have a flat text file which is 3 billion rows deep (220GB in size).
    > I need to process this file in the following way:
    >
    > 1. Sort it
    > 2. Perform some calculations on it by taking every row and calculating
    > a value based on values of the row above and the row below the current
    > row.

    Unless you have a machine with 220GB of memory, you'll need to
    split the file into manageable chunks, sort each chunk, and then
    merge the sorted chunks. I don't think sort(1) can do that
    directly.
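
As a rough illustration of the split, sort and recombine approach described above, here is a minimal sketch in Python (not a tool mentioned in this thread). The function name `external_sort` and the `lines_per_chunk` parameter are made up for the example, and rows are compared as plain strings, which may not be the ordering the real data needs.

```python
import heapq
import itertools
import os
import tempfile


def external_sort(in_path, out_path, lines_per_chunk=1_000_000):
    """Sort the lines of in_path into out_path using bounded memory.

    lines_per_chunk is a hypothetical tuning knob: pick it so that one
    chunk fits comfortably in RAM.
    """
    chunk_paths = []
    try:
        # Pass 1: read one chunk at a time, sort it in memory,
        # and write it out to its own temporary file.
        with open(in_path, "r") as src:
            while True:
                chunk = list(itertools.islice(src, lines_per_chunk))
                if not chunk:
                    break
                # The last line of the input may lack a trailing newline.
                if not chunk[-1].endswith("\n"):
                    chunk[-1] += "\n"
                chunk.sort()
                fd, path = tempfile.mkstemp(suffix=".chunk", text=True)
                chunk_paths.append(path)
                with os.fdopen(fd, "w") as tmp:
                    tmp.writelines(chunk)

        # Pass 2: k-way merge of the sorted chunk files. heapq.merge
        # reads lazily, so only one line per chunk is held in memory.
        files = [open(p, "r") for p in chunk_paths]
        try:
            with open(out_path, "w") as dst:
                dst.writelines(heapq.merge(*files))
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the temporary chunk files whether or not the sort succeeded.
        for p in chunk_paths:
            os.remove(p)
```

Note that the merge pass opens every chunk file at once. With billions of rows that could mean thousands of open files, so a real run would probably need larger chunks or several merge passes to stay under the per-process file limit.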

    > Please can you advise me what the best unix tools are for this job. A
    > korn script calling the read function or use awk or a combination.

    For such a large data set, the performance hit of using a shell
    script will be huge. On my machine, piping 3e9 lines into (while
    read n; do :; done) would take about 14 days. Awk might be
    workable, though: 3e9 lines into awk '{x = y; y = $0}' would
    take about 5 hours. But, depending on the sort of calculations
    you do, it may be faster overall to write a C program.
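
The awk fragment above only remembers the two most recent lines. A calculation that uses both the row above and the row below needs a window of three. Here is a minimal sketch of that window in Python (again, not a tool from this thread). The names `neighbours`, `apply_to_file` and `combine` are hypothetical, since the original question never says what the calculation is.

```python
def neighbours(rows):
    """Yield (previous, current, following) for every row that has both.

    The first and last rows are skipped because they lack a neighbour
    on one side; how to treat them is left to the caller.
    """
    it = iter(rows)
    try:
        prev = next(it)
        cur = next(it)
    except StopIteration:
        return  # fewer than two rows: nothing to yield
    for nxt in it:
        yield prev, cur, nxt
        prev, cur = cur, nxt


def apply_to_file(path, combine):
    """Stream a sorted file and yield combine(prev, cur, nxt) per row.

    combine is a placeholder for whatever calculation is actually needed.
    """
    with open(path) as f:
        stripped = (line.rstrip("\n") for line in f)
        for prev, cur, nxt in neighbours(stripped):
            yield combine(prev, cur, nxt)
```

Only three rows are held in memory at any time, so the size of the file does not matter for this step. The per-row interpreter overhead is the real cost, which is the same trade-off as between awk and a C program.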

    -- 
    André Majorel <URL:http://www.teaser.fr/~amajorel/>
    "Finally I am becoming stupider no more." -- Paul Erdös' epitaph
    
