Re: Need to process 3 billion (yes billion!) rows
From: Andre Majorel (amajorel_at_teezer.fr)
Date: 04/17/04
- Previous message: Andre Majorel: "Re: FTP & Recursive Directory"
- In reply to: C Stabri: "Need to process 3 billion (yes billion!) rows"
- Next in thread: Keith Thompson: "Re: Need to process 3 billion (yes billion!) rows"
- Reply: Keith Thompson: "Re: Need to process 3 billion (yes billion!) rows"
- Reply: Carl Lowenstein: "Re: Need to process 3 billion (yes billion!) rows"
Date: Sat, 17 Apr 2004 14:12:45 +0000 (UTC)
On 2004-04-17, C Stabri <kyrkarsa@yahoo.com> wrote:
> I have a flat text file which is 3 billion rows deep (220GB in size).
> I need to process this file in the following way:
>
> 1. Sort it
> 2. Perform some calculations on it by taking every row and calculating
> a value based on values of the row above and the row below the current
> row.
Unless you have a machine with 220GB of memory, you'll need to
split the file into manageable chunks, sort each chunk, and then
merge the sorted chunks back together. I don't think sort(1) can
do all of that directly.
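For illustration, the split/sort/merge approach described above might look
something like the sketch below. It assumes a POSIX-like shell with split(1),
sort(1) and mktemp(1) available; the function name chunk_sort and the default
chunk size are made up for the example. sort -m is used only for the last
step, merging pieces that are already sorted.

```shell
# Sketch: sort a file too large for memory by sorting chunks and merging them.
# Usage: chunk_sort INPUT OUTPUT [LINES_PER_CHUNK]   (hypothetical helper)
chunk_sort() {
    input=$1
    output=$2
    lines=${3:-10000000}    # chunk size is a guess; tune it to available RAM
    tmpdir=$(mktemp -d) || return 1

    # 1. Cut the input into pieces of at most $lines lines each.
    split -l "$lines" "$input" "$tmpdir/chunk." || return 1

    # 2. Sort each piece in place; each one is small enough to sort by itself.
    for f in "$tmpdir"/chunk.*; do
        LC_ALL=C sort -o "$f" "$f" || return 1
    done

    # 3. Merge the already-sorted pieces without sorting them again.
    LC_ALL=C sort -m -o "$output" "$tmpdir"/chunk.* || return 1

    rm -rf "$tmpdir"
}
```

A call would be something like chunk_sort big.txt big.sorted 50000000, where
both file names are placeholders. LC_ALL=C forces plain byte-order comparison,
which is usually faster than locale-aware collation but may not be the
ordering you want. The temporary directory needs roughly as much free disk
space as the input itself.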
> Please can you advise me what the best unix tools are for this job. A
> korn script calling the read function or use awk or a combination.
For such a large data set, the performance hit of using a shell
script will be huge. On my machine, piping 3e9 lines into (while
read n; do :; done) would take about 14 days. Awk might be
workable, though: 3e9 lines into awk '{x = y; y = $0}' would
take about 5 hours. But, depending on the sort of calculations
you do, it may be faster overall to write a C program.
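As a sketch of the awk route, here is one way to carry the previous and
current rows along so that each row can be combined with its two neighbours
in a single pass. The original question does not say what the calculation
is, so the one used here (first field of the row below minus first field of
the row above) is only a placeholder, and the function name neighbour_calc
is invented for the example.

```shell
# Sketch: one-pass calculation over each row and its two neighbours.
# Usage: neighbour_calc [FILE...]   (hypothetical helper; reads stdin if no file)
neighbour_calc() {
    awk '
        # From the third line on, the middle row of the window is complete:
        # prev1 holds field 1 of the row above it, $1 is field 1 of the row below.
        NR > 2 { print cur, $1 - prev1 }    # placeholder calculation

        # Slide the window down by one row.
        { prev1 = cur1; cur1 = $1; cur = $0 }
    ' "$@"
}
```

The first and last rows have only one neighbour each, so this sketch prints
nothing for them; whether that is right depends on the real calculation.
Only a few values are kept in memory at any time, so the size of the input
does not matter beyond the time needed to read it.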
--
André Majorel <URL:http://www.teaser.fr/~amajorel/>
"Finally I am becoming stupider no more." -- Paul Erdös' epitaph