Re: find -exec surprisingly slow

From: Erik Trulsson (ertr1013_at_student.uu.se)
Date: 08/15/04

    Date: Sun, 15 Aug 2004 01:32:35 +0200
    To: "Paul A. Hoadley" <paulh@logicsquad.net>
    
    

    On Sun, Aug 15, 2004 at 08:31:43AM +0930, Paul A. Hoadley wrote:
    > Hello,
    >
    > I'm in the process of cleaning a Maildir full of spam. It has
    > somewhere in the vicinity of 400K files in it. I started running
    > this yesterday:
    >
    > find . -atime +1 -exec mv {} /home/paulh/tmp/spam/sne/ \;
    >
    > It's been running for well over 12 hours. It certainly is
    > working---the spams are slowly moving to their new home---but it is
    > taking a long time. It's a very modest system, running 4.8-R on a
    > P2-350. I assume this is all overhead for spawning a shell and
    > running mv 400K times.

    I wouldn't make that assumption. The overhead for starting new
    processes is probably only a relatively small part of the time.

    You seem to have missed the fact that operations on very large
    directories (which a directory with 400K files in it certainly
    qualifies as) simply are slow.
    A directory is essentially just a list of the names of all the files in
    it and their i-nodes. To find a given file in a directory (e.g. in
    order to create, delete or rename it) the system needs to do a linear
    search through all the entries in the directory. For directories
    containing a large number of files this can take some time.

    If you have the UFS_DIRHASH kernel option enabled (which I believe is
    the default since 4.5-R) then the system will keep a bunch of hash tables
    in memory to avoid having to search through the whole directory every
    time. There is however an upper limit to how much memory will be used
    for such hash tables (2MB by default) and if this limit is exceeded
    (which it probably is in your case) things will slow down again.
    In effect, UFS_DIRHASH means that instead of directory operations
    starting to slow down after a few thousand files in the same directory,
    you can have a few tens of thousands of files before operations start
    to become noticeably slower.
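
    To see where that memory limit stands on a given machine, something
    like the following might do. It is only a sketch: it assumes the
    dirhash limit and current usage are exposed through the sysctl names
    vfs.ufs.dirhash_maxmem and vfs.ufs.dirhash_mem, and the function name
    dirhash_report is made up for illustration. On a kernel without
    UFS_DIRHASH it just reports that the values are unavailable.

```shell
#!/bin/sh
# dirhash_report: print the dirhash memory limit and current usage.
# Assumption: the kernel exposes these two sysctl names (FreeBSD UFS
# dirhash). If a name is missing, say so instead of failing.
dirhash_report() {
    for name in vfs.ufs.dirhash_maxmem vfs.ufs.dirhash_mem; do
        if value=$(sysctl -n "$name" 2>/dev/null); then
            echo "$name = $value bytes"
        else
            echo "$name: not available on this system"
        fi
    done
}

dirhash_report
```

    If the current usage sits at or near the limit, the hash tables are
    probably being evicted and lookups fall back to the linear search.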

    I am quite certain that if those 400K files had been divided into 40
    directories, each with 10K files in it, things would have been much
    faster.
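
    As a rough illustration of that kind of split, the sketch below deals
    the regular files of one directory out into a fixed number of
    subdirectories, round robin. The function name spread_files and the
    numbered bucket layout are assumptions made up for the example, and
    it still runs one mv per file, so it addresses directory size only,
    not process overhead.

```shell
#!/bin/sh
# spread_files SRC DEST NBUCKETS
# Move the regular files in SRC into DEST/0 .. DEST/(NBUCKETS-1),
# round robin, so that no single directory holds all of them.
# Hypothetical helper for illustration; dotfiles are skipped by the glob.
spread_files() {
    src=$1; dest=$2; buckets=$3

    # Create the bucket directories up front.
    i=0
    while [ "$i" -lt "$buckets" ]; do
        mkdir -p "$dest/$i"
        i=$((i + 1))
    done

    # The glob is expanded inside the shell, so it is not subject to
    # the argument-length limit that applies to external commands.
    i=0
    for f in "$src"/*; do
        [ -f "$f" ] || continue          # skip directories and the like
        mv -- "$f" "$dest/$((i % buckets))/"
        i=$((i + 1))
    done
}
```

    For the numbers above this would be called along the lines of
    "spread_files . /some/other/place 40", with the destination path
    being whatever suits the system.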

    > Is there a better way to move all files based
    > on some characteristic of their date stamp? Maybe separating the find
    > and the move, piping it through xargs? It's mostly done now, but I
    > will know better for next time.

    Reducing the number of processes spawned will certainly help some, but
    a better idea is to not have so many files in a single directory - that
    is just asking for trouble.
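
    For the process-count part, one possible shape is sketched below. It
    assumes a find that supports the "-exec ... {} +" form, which the one
    shipped with 4.8-R may well not, and the function name move_old is
    invented for the example. Each mv is handed a whole batch of files
    instead of a single one.

```shell
#!/bin/sh
# move_old SRC DEST DAYS
# Move regular files under SRC whose access time is more than DAYS
# days old into DEST, running mv once per batch rather than per file.
# Assumptions: find supports "-exec ... {} +", and DEST lies outside
# SRC (otherwise find would revisit files that were already moved).
move_old() {
    src=$1; dest=$2; days=$3
    find "$src" -type f -atime +"$days" -exec sh -c '
        dest=$1; shift            # first argument is the destination
        mv -- "$@" "$dest"/       # the rest are the files in this batch
    ' sh "$dest" {} +
}
```

    The small sh wrapper is there only because mv wants the destination
    last while find appends the file names at the end of the command.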

    -- 
    <Insert your favourite quote here.>
    Erik Trulsson
    ertr1013@student.uu.se
    
