Re: Comparing files

From: xyz (persson_at_katamail.com)
Date: 03/23/04


Date: 22 Mar 2004 23:35:15 -0800


"John L" <jl@lammtarra.fslife.co.uk> wrote in message news:<c3nbmv$405$1@newsg4.svr.pol.co.uk>...

> You can use the join command for this.
> (It performs a relational join, as understood by
> the database people.)
>
> join -1 1 -2 1 -a 1 -a 2 file1 file2
>
> This will output all the things you want, as well as
> rows which are the same in both files -- but it is
> simple to use awk or perl to remove these, since
> the last and last-but-one fields will be equal:
>
> join .... | awk '$NF != $(NF-1)'

Actually, this is almost exactly what I need. The only downside is
that the output from the above won't let me tell whether a given
unpairable line comes from file1 or file2, so I think I will use a
trick like this:

awk '{ print "* " $0 }' file2 | join -1 1 -2 2 -a 1 -a 2 file1 -

This way I know that, in the output, unpairable lines with 3 fields
(with "*" as the second field, since join prints the key first) come
from file2, and lines with 2 fields come from file1. The awk for the
subsequent remove will thus be

awk '$NF != $(NF - 2)'

since all the remaining (paired) output lines will have 4 fields.
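
A small worked run of the trick above may make the field counts easier to see. This is only a sketch: it assumes two-field, whitespace-separated files already sorted on the key, and the sample data and the temporary directory are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample data: "key value" lines, sorted on the key.
tmp=$(mktemp -d)
printf '%s\n' 'a 1' 'b 2' 'c 3' > "$tmp/file1"
printf '%s\n' 'a 1' 'b 9' 'd 4' > "$tmp/file2"

# Prefix each file2 line with "* " so that its key moves to field 2,
# join on the key while keeping unpairable lines from both files,
# then drop the paired lines whose two values are equal.
result=$(awk '{ print "* " $0 }' "$tmp/file2" |
    join -1 1 -2 2 -a 1 -a 2 "$tmp/file1" - |
    awk '$NF != $(NF - 2)')

printf '%s\n' "$result"
rm -rf "$tmp"
```

With this data the surviving lines are "b 2 * 9" (4 fields, values differ), "c 3" (2 fields, only in file1) and "d * 4" (3 fields, only in file2). A 3-field line whose value happened to equal its key would be dropped by the awk filter, so the test is only safe when keys and values cannot coincide.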

Even better, since I create file1 and file2 myself with a script, I
can modify the script to write file2 with the "* " prefix from the
start, so it is already in the right format for the join, and of
course the "* " can always be stripped out later. The -o option could
also be useful (I need to read the man page carefully).
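
For what it's worth, here is one way -o might be used for this, as a sketch rather than a tested recipe for the real files. It assumes two-field sorted input, and the placeholder string MISSING is an arbitrary choice for illustration; it must not occur as a real value.

```shell
#!/bin/sh
# Hypothetical sample data: "key value" lines, sorted on the key.
tmp=$(mktemp -d)
printf '%s\n' 'a 1' 'b 2' 'c 3' > "$tmp/file1"
printf '%s\n' 'a 1' 'b 9' 'd 4' > "$tmp/file2"

# -o 0,1.2,2.2 prints the join field, then field 2 of file1,
# then field 2 of file2, so every output line has 3 fields.
# -e supplies a placeholder for fields that an unpairable line lacks.
fixed=$(join -a 1 -a 2 -e MISSING -o 0,1.2,2.2 "$tmp/file1" "$tmp/file2" |
    awk '$2 != $3')

printf '%s\n' "$fixed"
rm -rf "$tmp"
```

Every line then has the same shape (key, file1 value, file2 value), so the source of an unpairable line is given by which column holds the placeholder, and no "* " prefix is needed.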

> One slight complication is that join needs its input
> to have been sorted, but if this is a problem for you
> then there are ways round it.

As I said before, this is not a problem since I create file1 and file2
myself.

> Have a quick look at the diff command. It might be useful
> *if* your input files have constrained formats.
> Likewise the comm command.

These were the alternatives I was considering before posting, but none
of them really does what I need (mainly because the output they
produce is difficult to parse easily - to me at least).

> The other way to approach it is to use a scripting
> language like awk, perl or python to process the two
> files in turn, building an associative array (or hash)
> of the first file and then using this to compare with
> the second. This is close to what you are trying to
> avoid (above) but is probably quick enough for most
> purposes.

Maybe I can try this alternative if the other way turns out to be
*very* inefficient (something I don't think will happen), otherwise I
think I'll stay with the first option you proposed.
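
For reference, the associative-array alternative could look roughly like the following in awk. It is a minimal sketch assuming two-field "key value" lines and a non-empty first file; the sample data, the message wording and the array names val and seen are invented for illustration.

```shell
#!/bin/sh
# Hypothetical sample data; this approach does not need sorted input.
tmp=$(mktemp -d)
printf '%s\n' 'b 2' 'a 1' 'c 3' > "$tmp/file1"
printf '%s\n' 'd 4' 'a 1' 'b 9' > "$tmp/file2"

report=$(awk '
    # First file: remember each value under its key.
    NR == FNR { val[$1] = $2; next }
    # Second file: report keys that are new, then values that changed.
    !($1 in val)  { print "only in file2:", $0; next }
    val[$1] != $2 { print "differs:", $1, val[$1], $2 }
                  { seen[$1] = 1 }
    # Keys never met in the second file exist only in the first.
    END { for (k in val) if (!(k in seen)) print "only in file1:", k, val[k] }
' "$tmp/file1" "$tmp/file2" | LC_ALL=C sort)

printf '%s\n' "$report"
rm -rf "$tmp"
```

The final sort is there only because awk does not define the order of a "for (k in val)" loop. The whole first file is held in memory, which is the cost compared with join on pre-sorted input.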

Many thanks for now.


