Re: High traffic NFS performance and availability problems

From: David Rice (drice_at_globat.com)
Date: 02/21/05

  • Next message: Eric Anderson: "Re: High traffic NFS performance and availability problems"
    To: Robert Watson <rwatson@FreeBSD.org>
    Date: Mon, 21 Feb 2005 12:34:51 -0800
    
    

    Here are the snapshots of the output you requested. These are from the NFS
    server. We have just upgraded them to 5.3-RELEASE, as so many have recommended.
    Hopefully that makes them more stable. The performance still needs some attention.

    Thank You

    --------------------------------------------------------------------------------------------------
        4 users Load 5.28 19.37 28.00 Feb 21 12:18

    Mem:KB REAL VIRTUAL VN PAGER SWAP PAGER
            Tot Share Tot Share Free in out in out
    Act 19404 2056 90696 3344 45216 count
    All 1020204 4280 4015204 7424 pages
                                                              zfod Interrupts
    Proc:r p d s w Csw Trp Sys Int Sof Flt cow 7226 total
               5128 5 60861 3 14021584 9 152732 wire 4: sio0
                                                        23228 act 6: fdc0
    30.2%Sys 11.8%Intr 0.0%User 0.0%Nice 58.0%Idl 803616 inact 128 8: rtc
    | | | | | | | | | | 43556 cache 13: npx
    ===============++++++ 1660 free 15: ata
                                                              daefr 6358 16: bge
    Namei Name-cache Dir-cache prcfr 1 17: bge
        Calls hits % hits % react 18: mpt
         1704 971 57 11 1 pdwak 19: mpt
                                                         5342 pdpgs 639 24: amr
    Disks amrd0 da0 pass0 pass1 pass2 intrn 100 0: clk
    KB/t 22.41 0.00 0.00 0.00 0.00 114288 buf
    tps 602 0 0 0 0 510 dirtybuf
    MB/s 13.16 0.00 0.00 0.00 0.00 70235 desiredvnodes
    % busy 100 0 0 0 0 20543 numvnodes
                                                         7883 freevnodes
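
    As a quick sanity check of the disk figures in the systat snapshot above,
    the amrd0 column can be cross-checked: transfer size times transfers per
    second should roughly equal the reported throughput. This is only a sketch;
    the three input values are copied from the snapshot, the variable names are
    made up for illustration, and it assumes systat counts 1 MB as 1024 KB.

```python
# Values copied from the amrd0 column of the systat -vmstat snapshot.
kb_per_transfer = 22.41      # KB/t
transfers_per_s = 602        # tps
reported_mb_per_s = 13.16    # MB/s

# Assumption: 1 MB = 1024 KB in systat's accounting.
computed_mb_per_s = kb_per_transfer * transfers_per_s / 1024

print(f"computed {computed_mb_per_s:.2f} MB/s, reported {reported_mb_per_s:.2f} MB/s")
```

    The two figures agree closely, so the snapshot shows amrd0 at 100% busy
    while moving roughly 13 MB/s in transfers of about 22 KB each.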
    -----------------------------------------------------------------------------------------
    last pid: 10330; load averages: 14.69, 11.81, 18.62
    up 0+09:01:13 12:32:57
    226 processes: 5 running, 153 sleeping, 57 waiting, 11 lock
    CPU states: 0.1% user, 0.0% nice, 66.0% system, 24.3% interrupt, 9.6% idle
    Mem: 23M Active, 774M Inact, 150M Wired, 52M Cache, 112M Buf, 1660K Free
    Swap: 1024M Total, 124K Used, 1024M Free

      PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU CPU COMMAND
       63 root -44 -163 0K 12K WAIT 0 147:05 45.07% 45.07% swi1: net
       30 root -68 -187 0K 12K WAIT 0 101:39 32.32% 32.32% irq16: bge0
       12 root 117 0 0K 12K CPU2 2 329:09 19.58% 19.58% idle: cpu2
       11 root 116 0 0K 12K CPU3 3 327:29 19.24% 19.24% idle: cpu3
       13 root 114 0 0K 12K RUN 1 263:39 16.89% 16.89% idle: cpu1
       14 root 109 0 0K 12K CPU0 0 228:50 12.06% 12.06% idle: cpu0
      368 root 4 0 1220K 740K *Giant 3 45:27 7.52% 7.52% nfsd
      366 root 4 0 1220K 740K *Giant 0 48:52 7.28% 7.28% nfsd
      364 root 4 0 1220K 740K *Giant 3 53:01 7.13% 7.13% nfsd
      367 root -8 0 1220K 740K biord 3 41:22 7.08% 7.08% nfsd
      372 root 4 0 1220K 740K *Giant 0 28:54 7.08% 7.08% nfsd
      365 root -1 0 1220K 740K *Giant 3 51:53 6.93% 6.93% nfsd
      370 root -1 0 1220K 740K nfsslp 0 32:49 6.84% 6.84% nfsd
      369 root -8 0 1220K 740K biord 1 36:40 6.49% 6.49% nfsd
      371 root 4 0 1220K 740K *Giant 0 25:14 6.45% 6.45% nfsd
      374 root -1 0 1220K 740K nfsslp 2 22:31 6.45% 6.45% nfsd
      377 root 4 0 1220K 740K *Giant 2 17:21 5.52% 5.52% nfsd
      376 root -4 0 1220K 740K *Giant 2 15:45 5.37% 5.37% nfsd
      373 root -4 0 1220K 740K ufs 3 19:38 5.18% 5.18% nfsd
      378 root 4 0 1220K 740K *Giant 2 13:55 4.54% 4.54% nfsd
      379 root -8 0 1220K 740K biord 3 12:41 4.49% 4.49% nfsd
      380 root 4 0 1220K 740K - 2 11:26 4.20% 4.20% nfsd
        3 root -8 0 0K 12K - 1 21:21 4.05% 4.05% g_up
        4 root -8 0 0K 12K - 0 20:05 3.96% 3.96% g_down
      381 root 4 0 1220K 740K - 3 9:28 3.66% 3.66% nfsd
      382 root 4 0 1220K 740K - 1 10:13 3.47% 3.47% nfsd
      385 root -1 0 1220K 740K nfsslp 3 7:21 3.17% 3.17% nfsd
       38 root -64 -183 0K 12K *Giant 0 14:45 3.12% 3.12% irq24: amr0
      384 root 4 0 1220K 740K - 3 8:40 3.12% 3.12% nfsd
       72 root -24 -143 0K 12K WAIT 2 16:50 2.98% 2.98% swi6:+
      383 root -8 0 1220K 740K biord 2 7:57 2.93% 2.93% nfsd
      389 root 4 0 1220K 740K - 2 5:31 2.64% 2.64% nfsd
      390 root -8 0 1220K 740K biord 3 5:54 2.59% 2.59% nfsd
      387 root -8 0 1220K 740K biord 0 6:40 2.54% 2.54% nfsd
      386 root -8 0 1220K 740K biord 1 6:22 2.44% 2.44% nfsd
      392 root 4 0 1220K 740K - 3 4:27 2.10% 2.10% nfsd
      388 root -4 0 1220K 740K *Giant 2 4:45 2.05% 2.05% nfsd
      395 root 4 0 1220K 740K - 0 3:59 2.05% 2.05% nfsd
      391 root 4 0 1220K 740K - 2 5:10 1.95% 1.95% nfsd
      393 root 4 0 1220K 740K sbwait 1 4:13 1.56% 1.56% nfsd
      398 root 4 0 1220K 740K - 2 3:31 1.56% 1.56% nfsd
      399 root 4 0 1220K 740K - 3 3:12 1.56% 1.56% nfsd
      401 root 4 0 1220K 740K - 1 2:57 1.51% 1.51% nfsd
      403 root 4 0 1220K 740K - 0 3:04 1.42% 1.42% nfsd
      406 root 4 0 1220K 740K - 1 2:27 1.37% 1.37% nfsd
      397 root 4 0 1220K 740K - 3 3:16 1.27% 1.27% nfsd
      396 root 4 0 1220K 740K - 2 3:42 1.22% 1.22% nfsd
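
    To see at a glance what the nfsd processes are waiting on, the STATE column
    of a top -S snapshot like the one above can be tallied. The following is a
    minimal sketch, not a polished tool: count_states and SAMPLE are
    hypothetical names, the sample rows are a handful copied from the snapshot,
    and it assumes the 12-column layout shown in the header line.

```python
from collections import Counter

# A few rows copied from the top -S snapshot above. Columns:
# PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU CPU COMMAND
SAMPLE = """\
368 root 4 0 1220K 740K *Giant 3 45:27 7.52% 7.52% nfsd
367 root -8 0 1220K 740K biord 3 41:22 7.08% 7.08% nfsd
370 root -1 0 1220K 740K nfsslp 0 32:49 6.84% 6.84% nfsd
373 root -4 0 1220K 740K ufs 3 19:38 5.18% 5.18% nfsd
380 root 4 0 1220K 740K - 2 11:26 4.20% 4.20% nfsd
3 root -8 0 0K 12K - 1 21:21 4.05% 4.05% g_up
"""


def count_states(text, command="nfsd"):
    """Count how many rows for `command` are in each STATE (hypothetical helper)."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Keep only well-formed 12-column process rows for the chosen command;
        # this skips the header and any row whose command contains spaces.
        if len(fields) == 12 and fields[0].isdigit() and fields[-1] == command:
            counts[fields[6]] += 1
    return counts


states = count_states(SAMPLE)
for state, n in states.most_common():
    print(f"{state:8} {n}")
```

    Pasting the full snapshot into SAMPLE gives the breakdown for all nfsd
    processes, which makes it easier to compare how many are blocked on *Giant
    versus waiting in biord or nfsslp.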

    On Saturday 19 February 2005 04:23 am, Robert Watson wrote:
    > On Thu, 17 Feb 2005, David Rice wrote:
    > > Typically we have 7 client boxes mounting storage from a single file
    > > server. Each client box serves 1000 web sites and associated email. We
    > > have done the basic NFS tuning (i.e., read/write size optimization and
    > > kernel tuning).
    >
    > How many nfsd's are you running with?
    >
    > If you run systat -vmstat 1 on your server under high load, could you send
    > us the output? In particular, I'm interested in knowing how the system is
    > spending its time, the paging level, and I/O throughput on devices; the
    > systat -vmstat summary screen provides a good summary of this and more. A
    > few snapshots of "gstat" output would also be very helpful, as would a
    > snapshot or two of "top -S" output. This will give us a picture of how
    > the system is spending its time.
    >
    > > 2. Client boxes have high load averages and sometimes crash due to
    > > slow NFS performance.
    >
    > Could you be more specific about the crash failure mode?
    >
    > > 3. File servers that randomly crash with "Fatal trap 12: page fault
    > > while in kernel mode"
    >
    > Could you make sure you're running with at least the latest 5.3 patch
    > level on the server, which includes some NFS server stability fixes, and
    > also look at sliding to the head of 5-STABLE? There are a number of
    > performance and stability improvements that may be relevant there.
    >
    > Could you provide serial console output of the full panic message, trap
    > details, compile the kernel with KDB+DDB, and include a full stack trace?
    > I'm happy to try to help debug these problems.
    >
    > > 4. With soft updates enabled during FSCK the fileserver will freeze with
    > > all NFS processes in the "snaplck" state. We disabled soft updates
    > > because of this.
    >
    > If it's possible to get some more information, it would be quite
    > helpful. In particular, could you compile the server box with
    > DDB+KDB+BREAK_TO_DEBUGGER, break into the serial debugger when it appears
    > wedged, and send the output of "show lockedvnods", "ps", and "trace
    > <pid>" for any processes listed in the "show lockedvnods" output? That
    > would be great. A crash dump would also be very helpful. For some hints on
    > the information that is necessary here, take a look at the handbook chapter
    > on kernel debugging and reporting kernel bugs, and my recent post to
    > current@ diagnosing a similar bug.
    >
    > If you re-enable soft updates but leave bgfsck disabled, does that correct
    > this stability problem?
    >
    > In any case, I'm happy to help try to figure out what's going on -- some
    > of the above information for stability and performance problems would be
    > quite helpful in tracking it down.
    >
    > Robert N M Watson

    _______________________________________________
    freebsd-performance@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-performance
    To unsubscribe, send any mail to "freebsd-performance-unsubscribe@freebsd.org"

