Re: NFS mounts to NetApp

From: Troy Campbell (tlc_at_studioc.org)
Date: 05/18/04


Date: 18 May 2004 12:22:02 -0700

One thing I could suggest, if you have a few megabytes (3-5) of disk space,
is to gather some statistics from your Sun servers to verify throughput
and look for errors. A simple tool I recently posted, called "nsar", might
help. One handy thing about "nsar" is that you do not need root access to
install it and get complete information (e.g., on production boxes),
since it simply uses the netstat command. For example, when we back up our
NetApps through our Sun server we get something like this (outputs truncated):

titan% nsar | more

SunOS titan 5.8 Generic_108528-18 sun4u sparc SUNW,Ultra-Enterprise May 18, 2004
                                  
 time iface in% out% tot% idle%
00:05 ce0 0 0 0 100
00:10 ce0 0 0 0 100
00:15 ce0 0 0 0 100
00:20 ce0 0 0 0 100
00:25 ce0 0 0 0 100
00:30 ce0 0 0 0 100
00:35 ce0 0 0 0 100
00:40 ce0 0 0 0 100
00:45 ce0 3 3 6 94
00:50 ce0 0 0 0 100
00:55 ce0 0 0 0 100
01:00 ce0 0 0 0 100
01:05 ce0 0 0 0 100
01:10 ce0 0 0 0 100
01:15 ce0 0 0 0 100
01:20 ce0 0 0 0 100
01:25 ce0 3 3 6 94
01:30 ce0 3 3 6 94
01:35 ce0 5 5 10 90
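
For a sense of scale, the percentages can be related to raw throughput. The sketch below is only an assumption about how such a figure could be derived, not nsar's documented method: it assumes ce0 is a 1 Gbit/s link and that 1 kbyte is 1024 bytes. The names `utilization_percent` and `LINK_BITS_PER_SECOND` are made up for illustration. The 3172 kbytes/s input is the inbound rate reported for 00:45 in the per-second output later in this post.

```python
# Assumption: ce0 is a 1 Gbit/s interface; the post does not state the link speed.
LINK_BITS_PER_SECOND = 1_000_000_000


def utilization_percent(kbytes_per_second, link_bps=LINK_BITS_PER_SECOND):
    """Return link utilization in percent for a given kbytes/s rate."""
    # Assumption: 1 kbyte = 1024 bytes, 8 bits per byte.
    bits_per_second = kbytes_per_second * 1024 * 8
    return 100.0 * bits_per_second / link_bps


# Inbound rate at 00:45 was 3172 kbytes/s.
print(round(utilization_percent(3172)))  # prints 3
```

Under those assumptions 3172 kbytes/s works out to roughly 2.6% of the link, which rounds to the 3 shown in the in% column for 00:45.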

You can get data values in kbytes with:

titan% nsar -k | more

SunOS titan 5.8 Generic_108528-18 sun4u sparc SUNW,Ultra-Enterprise May 18, 2004
                                         
 time iface iKbyte oKbyte tKbyte
00:05 ce0 8331 12341 20672
00:10 ce0 4570 55083 59653
00:15 ce0 3459 7904 11364
00:20 ce0 8523 8235 16758
00:25 ce0 5222 7367 12590
00:30 ce0 3782 8302 12084
00:35 ce0 3297 7818 11115
00:40 ce0 3477 7884 11361
00:45 ce0 951843 952673 1904517
00:50 ce0 8515 8048 16563
00:55 ce0 5379 7834 13213
01:00 ce0 6168 5936 12105
01:05 ce0 22941 22680 45622
01:10 ce0 46250 46044 92294
01:15 ce0 4842 4441 9283
01:20 ce0 168622 168509 337131
01:25 ce0 1181493 1170559 2352052
01:30 ce0 1240849 1241378 2482228

Then in kbytes per second (instead of per interval, as above):

 time iface iKbyte/s oKbyte/s tKbyte/s
00:05 ce0 27 41 68
00:10 ce0 15 183 198
00:15 ce0 11 26 37
00:20 ce0 28 27 55
00:25 ce0 17 24 41
00:30 ce0 12 27 40
00:35 ce0 10 26 37
00:40 ce0 11 26 37
00:45 ce0 3172 3175 6348
00:50 ce0 28 26 55
00:55 ce0 17 26 44
01:00 ce0 20 19 40
01:05 ce0 76 75 152
01:10 ce0 154 153 307
01:15 ce0 16 14 30
01:20 ce0 562 561 1123
01:25 ce0 3938 3901 7840
01:30 ce0 4136 4137 8274
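
As a cross-check, the per-second figures appear to be the per-interval totals divided by the 300 seconds in each 5-minute sample, with the fraction dropped. This is inferred from the numbers in the two tables, not from nsar itself. A minimal Python sketch, where `per_second` and `INTERVAL_SECONDS` are names made up for illustration:

```python
# Assumption: samples are 5 minutes apart, as the time column suggests.
INTERVAL_SECONDS = 5 * 60


def per_second(kbytes_in_interval, interval_seconds=INTERVAL_SECONDS):
    """Convert a per-interval kbyte total to whole kbytes per second."""
    # Integer division drops the fraction, which matches the tables above.
    return kbytes_in_interval // interval_seconds


# (time, iKbyte, oKbyte, tKbyte) rows copied from the per-interval output.
samples = [
    ("00:05", 8331, 12341, 20672),
    ("00:45", 951843, 952673, 1904517),
    ("01:30", 1240849, 1241378, 2482228),
]

for time, i_kb, o_kb, t_kb in samples:
    print(time, per_second(i_kb), per_second(o_kb), per_second(t_kb))
# prints:
# 00:05 27 41 68
# 00:45 3172 3175 6348
# 01:30 4136 4137 8274
```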

Finally you may want to add interface errors to the output:

titan% nsar -kse | more

SunOS titan 5.8 Generic_108528-18 sun4u sparc SUNW,Ultra-Enterprise May 18, 2004
                                                              
 time iface iKbyte/s oKbyte/s tKbyte/s ierr/s oerr/s coll/s
00:05 ce0 27 41 68 0 0 0
00:10 ce0 15 183 198 0 0 0
00:15 ce0 11 26 37 0 0 0
00:20 ce0 28 27 55 0 0 0
00:25 ce0 17 24 41 0 0 0
00:30 ce0 12 27 40 0 0 0
00:35 ce0 10 26 37 0 0 0
00:40 ce0 11 26 37 0 0 0
00:45 ce0 3172 3175 6348 0 0 0
00:50 ce0 28 26 55 0 0 0
00:55 ce0 17 26 44 0 0 0
01:00 ce0 20 19 40 0 0 0
01:05 ce0 76 75 152 0 0 0
01:10 ce0 154 153 307 0 0 0
01:15 ce0 16 14 30 0 0 0
01:20 ce0 562 561 1123 0 0 0
01:25 ce0 3938 3901 7840 0 0 0
01:30 ce0 4136 4137 8274 0 0 0

There is also an option to view the "high fidelity" data from "netstat -s"
for really in-depth analysis, e.g.:

titan% nsar -w "tcpRetransSegs tcpRetransBytes tcpOutAck" -S TCP | more

SunOS titan 5.8 Generic_108528-18 sun4u sparc SUNW,Ultra-Enterprise May 18, 2004

 time proto tcpRetransSegs tcpRetransBytes tcpOutAck
00:05 TCP 1 0 2015
00:10 TCP 1 0 600
00:15 TCP 0 0 637
00:20 TCP 0 0 2280
00:25 TCP 2 0 1318
00:30 TCP 1 0 605
00:35 TCP 2 0 566
00:40 TCP 1 0 599
00:45 TCP 2 0 311530
00:50 TCP 1 0 2213
00:55 TCP 0 0 1173
01:00 TCP 1 0 1574
01:05 TCP 1 0 7145
01:10 TCP 2 0 14850
01:15 TCP 1 0 974
01:20 TCP 2 0 55644
01:25 TCP 19 13149 384399
01:30 TCP 392 572320 408865
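
In the output above, retransmissions jump at 01:25 and 01:30, the same intervals where throughput peaks. With a longer capture it may help to pick out such intervals automatically. The following is a rough Python sketch that parses rows in the layout shown above; the function names and the threshold of 10 segments per interval are arbitrary choices for illustration, not anything nsar provides.

```python
def parse_rows(text):
    """Parse 'time proto c1 c2 c3' rows, skipping headers and other lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly five fields with three numeric counters.
        if len(fields) != 5 or not all(f.isdigit() for f in fields[2:]):
            continue
        segs, rbytes, acks = (int(f) for f in fields[2:])
        rows.append((fields[0], fields[1], segs, rbytes, acks))
    return rows


def retransmit_spikes(rows, min_segs=10):
    """Return (time, segs, bytes) for intervals with many retransmitted segments."""
    # min_segs is a hypothetical threshold; tune it to your own baseline.
    return [(t, segs, rbytes) for t, _proto, segs, rbytes, _acks in rows
            if segs >= min_segs]


SAMPLE = """\
 time proto tcpRetransSegs tcpRetransBytes tcpOutAck
01:20 TCP 2 0 55644
01:25 TCP 19 13149 384399
01:30 TCP 392 572320 408865
"""

for time, segs, rbytes in retransmit_spikes(parse_rows(SAMPLE)):
    print(time, segs, rbytes)
# prints:
# 01:25 19 13149
# 01:30 392 572320
```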

I'm still a total novice on TCP parameters so I can't help you there,
but having the data might be enough to figure it out.

You can get nsar at: http://www.studioc.org/software/

Hope that helps,

Troy

ebrindle@ciena.com (E) wrote in message news:<96566f88.0405180652.22e01b6e@posting.google.com>...
> I'm looking for ideas on pinpointing the source of an issue we're
> having regarding NFS timeouts to NetApp filers from Sun Solaris boxes.
>
> We have a NetApp filer cluster where each partner has a fiber gig
> connection to a back end switch (8 port Cisco) and a 100MB connection
> to the corporate LAN. There are Sun boxes (6500s, 420s) also
> connected into the backend network, running Solaris 2.6 or 8, using
> NFS to access the shares on the filer, and some other servers (a
> couple of 220s and an Ultra 60) connecting via NFS over the corporate
> network.
>
> We're getting NFS timeout errors. Most often this is between one of
> the filers and one of the 6500s (they're kind of paired up, and
> occurrences are on both pairs - all connecting over the backend
> switch), but NFS timeout errors are seen on all machines at some time
> or another, sometimes all at the same time. The frequency of errors
> listed seems to correlate to how much the Sun box is requesting the
> data off the NFS mount.
>
> Error example:
> May 18 05:32:07 ncc-1701 unix: NFS server ge-maytag not responding still trying
> May 18 05:32:07 ncc-1701 unix: NFS server ge-maytag ok
>
> Sometimes we get the "last message repeated xx times" as well. The
> errors usually occur back and forth (not responding/ok) for maybe 10
> or so occurrences at a time, and more, with repeats, when we're seeing
> serious timeouts. The lesser used machines may only see one or two
> message pairs for their outage.
>
> We see these errors every day, usually when the backups are running
> but sometimes in the middle of the day. If someone decides to copy
> lots of data from one volume to another, that seems to be a big
> trigger and the filer stops responding for about 5 minutes or so and
> the servers wait for the NFS mounts to come back. This might be
> happening in the middle of the night, too, but it's not as impacting
> to the users.
>
> We've had these problems for a while now, but I think they started
> around the time the filers' OnTap version was upgraded and new DS14
> shelves were added.
>
> NetApp says everything is fine with our configuration, but we might
> want to upgrade our Sun patches. Sun says everything is okay, but we
> might want to look at the network. I'm hoping for some field
> experience to help troubleshoot this and kind of get around this
> fingerpointing.
>
> All of our Sun mounts use the same options:
> ex.
> ge-maytag:/vol/vol0/archive - /archive nfs - yes timeo=25,bg,hard,intr,rw,vers=3,proto=tcp
>
> although I tried taking out the timeo value on a less affected
> server and we're still seeing errors. The filers are fairly heavily
> used, but not pegging at 100%... they spike in the 90's, but there
> doesn't seem to be a direct correlation between heavy CPU usage and
> our issues. There does seem to be a connection between heavy disk
> usage (which I only know from knowing when large - 1GB or greater -
> copies are going on) and the NFS timeouts.
>
> The nfs options on the filer are:
>
> ncc-1701# rsh maytag options nfs
> nfs.locking.check_domain on
> nfs.mount_rootonly on
> nfs.mountd.trace off
> nfs.per_client_stats.enable off
> nfs.require_valid_mapped_uid off
> nfs.tcp.enable on
> nfs.udp.xfersize 32768
> nfs.v2.df_2gb_lim off
> nfs.v3.enable on
> nfs.webnfs.enable off
> nfs.webnfs.rootdir XXX
> nfs.webnfs.rootdir.set off
>
> OnTap version: NetApp Release 6.3.3
>
> I haven't tried patching, but the servers are at different, although
> all pretty much older, patch levels, and different OS versions; I
> haven't tried changing network cables. These are production servers
> and I'm trying to avoid trying solutions which would be causing more
> downtime than the actual problem, but this is getting to be high
> profile.
>
> So (gosh, this is long winded), my questions are:
> Is there something incompatible in our NFS options?
> If a filer can't get to one server because of a bad network cable or
> setting, is that going to cause issues with other machines trying to
> reach that filer?
> Basically, seeing these issues, which area (filer, sun, network) could
> likely be the culprit?
>
> I only knew the solaris news group to post to... Does anyone know if
> there is one for netapps or storage that gets attention, too?
>
> Thanks very, very much in advance for any insight on this.
>
> Elizabeth Brindley
> ebrindle@ciena.com


