Re: Sorting speedy



Vassilis wrote:
Hi,

news.t-online.de wrote:

Now we have to compare these two sets of files
and confirm that the functionality with regard
to the X and pre-Y related code creates the same data.
We tried it in SQL, but this is too slow,
so we export the data. The data is presumably
identical, but because the rows of a file in one set
are not in the same order as they are in the file
of the other set, we cannot diff or cmp.

When we sort a file and the corresponding file in
the other set, we can.
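
A rough sketch of that sort-then-compare step, assuming plain
line-oriented text files and a system that provides mktemp.
The function name compare_unordered is made up for illustration;
it is not from the original post.

```sh
#!/bin/sh
# Sketch only: returns 0 when both files contain the same lines
# (same multiset of rows), 1 when they differ, 2 on setup failure.
# compare_unordered is a hypothetical helper name.
compare_unordered() {
    a_sorted=$(mktemp) || return 2
    b_sorted=$(mktemp) || { rm -f "$a_sorted"; return 2; }

    # LC_ALL=C gives plain byte-wise ordering, so both files are
    # sorted by exactly the same rule regardless of the locale.
    LC_ALL=C sort "$1" > "$a_sorted"
    LC_ALL=C sort "$2" > "$b_sorted"

    # cmp -s prints nothing and only sets the exit status.
    rc=0
    cmp -s "$a_sorted" "$b_sorted" || rc=$?

    rm -f "$a_sorted" "$b_sorted"
    return "$rc"
}
```

Usage would be something like
compare_unordered set1/file && echo same, with the second file name
as the second argument. Duplicate rows count here: a row that occurs
twice in one file and once in the other makes the files differ.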


From what I gather, your problem is the different order of data
in the rows. Why don't you try remedying this?
If you know, for the two versions, the order of the rows
you can, on the fly, transform the rows of the files in set
X and compare them against the files in set Y.
This can be easily done with a tool like awk or perl.
I would be reluctant to sort up front.
Hope this helps.

Exactly this is the problem, it's the order.
For an 8 GB file with 40,000,000 records there
would be two arrays in memory of more than 10 GB
each.
In the example program below,
one could also read a line from file 1,
then a line from file 2. If they match, we ignore them.
If a line is already in the other set, we delete it
from the other array instead of inserting it into our own.
This would save memory, but only when there is
already a kind of order in the files.


BEGIN {
    # Read file1 into FILE1, keyed by the whole line.
    while ((getline < "file1") > 0) {
        file1++
        if ($0 in FILE1) {
            print "Duplicate " $0
        }
        FILE1[$0] = 1   # or line number or whatsoever
    }
    close("file1")

    # Same for file2.
    while ((getline < "file2") > 0) {
        file2++
        if ($0 in FILE2) {
            print "Duplicate " $0
        }
        FILE2[$0] = 1   # or line number or whatsoever
    }
    close("file2")

    if (file1 != file2) {
        print "numbers of read lines not identical"
    }

    # Report lines that occur in only one of the two files.
    for (i in FILE1) {
        if (!(i in FILE2)) {
            print i " not found in file2"
        }
    }
    for (i in FILE2) {
        if (!(i in FILE1)) {
            print i " not found in file1"
        }
    }
}
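
The read-one-line-from-each-file variant described above could look
roughly like this. It is only a sketch: the wrapper name
diff_interleaved and the array names only1/only2 are made up, and
unlike the program above it counts occurrences, so a duplicate row
is not reported as such but has to be matched by a duplicate in the
other file.

```sh
#!/bin/sh
# Sketch only: diff_interleaved is a hypothetical helper name.
# Prints the lines left unmatched; exit status 0 means no differences.
diff_interleaved() {
    awk '
    BEGIN {
        f1 = ARGV[1]; f2 = ARGV[2]
        more1 = more2 = 1
        while (more1 || more2) {
            got1 = got2 = 0
            if (more1) {
                if ((getline a < f1) > 0) got1 = 1
                else more1 = 0
            }
            if (more2) {
                if ((getline b < f2) > 0) got2 = 1
                else more2 = 0
            }

            # Same line at the same position: nothing has to be stored.
            if (got1 && got2 && a == b) continue

            # A line cancels a pending line from the other file,
            # otherwise it is kept until its partner shows up.
            if (got1) {
                if (a in only2) { if (--only2[a] == 0) delete only2[a] }
                else only1[a]++
            }
            if (got2) {
                if (b in only1) { if (--only1[b] == 0) delete only1[b] }
                else only2[b]++
            }
        }

        status = 0
        for (line in only1) {
            print "only in " f1 " (" only1[line] "x): " line
            status = 1
        }
        for (line in only2) {
            print "only in " f2 " (" only2[line] "x): " line
            status = 1
        }
        exit status
    }' "$1" "$2"
}
```

The arrays only hold lines that have not found their partner yet,
so memory use depends on how far apart matching rows are. With files
in completely unrelated order it can still grow to roughly the size
of one file.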