Re: freebsd-update's install_verify routine excessive stating



Oliver,

You are making a good point but connecting two different subjects.

The algorithm with awk is still the fastest in theory.
It uses a hash table which runs at O(c) for each comparison
ASSUMING you have a good hash function that yields such result.
The total number of comparison is O(n) or O(C1 * n) where C1
is some constant number based on the hash function performance.

On the other hand, comparison sorts are in O(n * log(n))
comparison. The 2nd comparison of two files are at O(n).
So, total comparison is still O(n * log(n)) or O(C2 * n * log(n))
where C2 is some constant based on the algorithm and implementation.

That is in theory. In reality or practice, it may not be true because
you could have poor C1 such that the comparison solution becomes better
or because n is small enough so that it doesn't matter which solution to
take. In fact, awk implementation was poor and resulted large C1 as well
as slower than sort(1).

Experiment is yet important so that you can determine the best
algorithm in your TEST environment. This is where you are making a point.
If you have control over these conditions, you can not to choose the
theoretically best algorithm. I agree and had indeed expected your
outcome as awk is actually known to inefficiency in its implementations.

In that regard, in the practical point of view, the original implementation
by cperciva@ was good enough such that searching for faster algorithms
was not necessary for all people except Bert.

At the end, it is up to the implementer to pick up the solution.

Regards,
Hiro


On Fri, 23 Jan 2009 12:09:03 +0100 (CET)
Oliver Fromme <olli@xxxxxxxxxxxxxxxxx> wrote:


Yoshihiro Ota wrote:
> Oliver Fromme wrote:
> > It would be much better to generate two lists:
> > - The list of hashes, as already done ("filelist")
> > - A list of gzipped files present, stripped to the hash:
> >
> > (cd files; echo *.gz) |
> > tr ' ' '\n' |
> > sed 's/\.gz$//' > filespresent
> >
> > Note we use "echo" instead of "ls", in order to avoid the
> > kern.argmax limit. 64000 files would certainly exceed that
> > limit. Also note that the output is already sorted because
> > the shell sorts wildcard expansions.
> >
> > Now that we have those two files, we can use comm(1) to
> > find out whether there are any hashes in filelist that are
> > not in filespresent:
> >
> > if [ -n "$(comm -23 filelist filespresent)" ]; then
> > echo -n "Update files missing -- "
> > ...
> > fi
> >
> > That solution scales much better because no shell loop is
> > required at all.
>
> This will probably be the fastest.

Are you sure? I'm not.
Only a benchmark can answer that. See below.

> awk -F "|" '
> $2 ~ /^f/{required[$7]=$7; count++}
> END{FS="[/.]";
> while("find files -name *.gz" | getline>0)
> if($2 in required)
> if(--count<=0)
> exit(0);
> exit(count)}' "$@"

I think this awk solution is more difficult to read and
understand, which means that it is also more prone to
introduce errors. In fact, there are several problems
already:

First, you didn't quote the *.gz wildcard, so it will
fail if there happens to be a file "*.gz" in the current
directory.

Second, the script makes assumptions about the format of
the hash strings, e.g. that they don't contain dots.
(Currently they only contain hex digits, AFAICT, but what
if the format changes in the future.)

Third, exit(count) is a bad idea, because exit codes are
truncated to 8 bits. If 256 files happen to be missing,
the script will seem to exit successfully.

All these flaws could be fixed, of course, but it will
introduce more complexity. The shell code using sort(1)
and comm(1) doesn't have those flaws and appears more
robust.

> It verifies entries using hashtable instead of sort.
> Therefore, it is O(n+m) instead of O(n*log(n))+O(m*log(m)) in theory.
> I am not well aware how good awk's associate array is, though.

It is pretty simple. It's a hash list that starts with
50 buckets. The number of buckets is doubled (and all
entries re-hashed!) when the average number of elements
per bucket exceeds 2. The hash function is very simple,
it's just "hash = hash * 31 + c" for every character c
in the string, the result is modulo the current number
of buckets. Note that freebsd-update uses SHA256 hashes
which are fairly long (64 characters).

BTW, you can easily make benchmarks. The following command
will create 64000 files of the format "%64x.gz". Be sure
to have enough free inodes on your file system ("df -i").

jot -rw "%08x" 64000 0 2000000000 | sed 's/.*/&&&&&&&&.gz/' | xargs touch

This took 2 minutes on my notebook (3 years old). I also
noticed that the first 47000 inodes were created very
quickly (about 5 seconds), but then the speed dropped down
suddenly to about 150 inodes per second for the rest of
the two minutes. CPU was 100% system according to top.
Removing the files is equally slow.

So there seems to be a limit of about 47000 inodes per
directory -- any more than that and it gets horribly
inefficient. Therefore it would probably be a good idea
to change freebsd-update to create subdirectories and
distribute the files among them. For example, make 16
subdirectories [0-9a-f] and put the files into them
according to the first digit of the hash. This will
probably boost performance noticeably.

Sorting those 64000 files is really *not* an issue. The
difference between "ls" and "ls -f" is only 0.2 seconds on
my notebook. When using "ls -f | sort", sort(1) uses only
0.12 seconds of CPU time. This is negligible.

Next I made a simple test with awk within that directory:

awk 'BEGIN{while("find . -name \\*.gz" | getline>0)required[$1]=$1}'

That script (which does only half of the required work)
takes 4 seconds, reproducible. So I didn't bother to
write and test the full awk solution.

Bottom line: The awk solution is less robust, less readable,
and much slower.

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd

"The most important decision in [programming] language design
concerns what is to be left out." -- Niklaus Wirth
_______________________________________________
freebsd-hackers@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@xxxxxxxxxxx"



Relevant Pages

  • Re: SHA-1 vs. triple-DES for password encryption?
    ... be better to use a standard algorithm rather than a home-grown one. ... SHA-1 and 3DES have been reviewed for some time. ... This is where a hash comes in nicely. ... Longer passwords and hashes aid in making the hash much harder to work with. ...
    (SecProg)
  • Re: sort unique
    ... given that a hash table is not ... IMO if the vendor's algorithm does something "obvious", ... function to eliminate keys that hash to the same bucket per some ... strings of random lengths, and two strings are ...
    (comp.lang.lisp)
  • Re: out of memory
    ... read only the smaller file into a hash. ... the smaller file will fit into RAM. ... Depending upon the sorting algorithm this would be Ologor ... put your relevant data into a database and use ...
    (comp.lang.perl.misc)
  • Re: Probabalistic algorithms.
    ... >Hashing is typically just an optimisation. ... all the hash does is guarantee that given some ... >hard to factor the composite into its two prime factors. ... >algorithm that's dfaster than brute force factorisation, ...
    (comp.lang.pascal.delphi.misc)
  • Re: practical guarantee on iteration order?
    ... traverses a hash table... ... myarr) res2 = res2 myarr ... I could get away with the same "for " because the hash ... I still find people will not mention awk in their ...
    (comp.lang.awk)

Quantcast