Re: em0 watchdog timeout (and 3ware problems) 7-stable



Greg,

I have another report of this problem, and I have a patch for you to try
out. I will be sending it out a bit later today.

Jack


On Sun, Apr 26, 2009 at 5:50 AM, Greg Byshenk <freebsd@xxxxxxxxxxx> wrote:

I have one machine that is seeing watchdog timeouts on em0, running
7-STABLE amd64 as of 2009.04.19, and also some other more perverse errors.

Twice now in the last 48 hours, this machine has become unreachable via the
network, and connecting to the console shows an endless string of

[...]
em0: watchdog timeout -- resetting
em0: watchdog timeout -- resetting
em0: watchdog timeout -- resetting

messages. The machine is almost locked up. That is, I can get a login
prompt, but can go no further than typing in a username; after the
username, no password prompt, and nothing further. The only option is
to hard reset the machine or to drop to debugger and reboot.

Now the "perverse" part. After restarting, the system partition is no
more.

Background detail: the machine is a fileserver, with a 3Ware 9650SE-16ML
SATA controller connected to 16 1TB SATA drives, configured as
a 14-drive RAID10 array (+ 2 hot spares), with a 50GB system partition
and 6.5TB data partition. The system partition is configured as da0,
with one slice and more or less standard partitions for / /var /tmp, etc.
(the data partition of the array is partitioned with gpt).

The issue here is that, upon restart, all partition information on da0
seems to have disappeared, and restarting results in a "no operating
system found" message, and a failure to boot (obviously).

But all of the data is still present. If I boot into rescue mode,
recreate da0s1, mark it bootable, and restore the bsdlabel, then
everything works again. I can restart the machine, and it comes back
up normally (it requires an fsck of everything on da0, but after that
everything is back to normal).

I don't know if this is two unrelated problems, or one problem with
two symptoms, or something else. I think that I can safely say that
it is not a problem with the 3Ware controller itself, as I replaced
the controller with a spare (identical model), and the problem
recurred. Additionally, I have an almost-identical configuration on
four other machines, none of which are experiencing any problems.
One thing that is different is that the other machines use
Intel PRO/1000 PF (pci-e) NICs.

Is there some known problem with the Intel 2572 fibre NIC? Or some
potential interaction of it with the 3ware RAID controller?

For the moment, I've set hw.pci.enable_msi=0 (as discussed in the
threads on 7.2/bge), and am building a new kernel/world from sources
csup'd one hour ago, but I'd really like to hear any ideas about this
-- particularly the wiping of the label.

Some information about the system:


# /dev/da0s1:
8 partitions:
#         size    offset    fstype   [fsize bsize bps/cpg]
  a:   2097152         0    4.2BSD        0     0     0
  b:   8388608   2097152      swap
  c: 104856192         0    unused        0     0         # "raw" part, don't edit
  d:   8388608  10485760    4.2BSD        0     0     0
  e:   2097152  18874368    4.2BSD        0     0     0
  f:  41943040  20971520    4.2BSD        0     0     0
  g:  41941632  62914560    4.2BSD        0     0     0
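
As a rough cross-check of the restored label, the sketch below tests that the partitions listed above are contiguous and together fill the "c" (raw) partition. The sizes and offsets are copied from the bsdlabel output. The 512-byte sector size and the helper name check_layout are assumptions made for illustration only.

```python
# Sizes and offsets (in sectors) copied from the bsdlabel output above.
SECTOR_BYTES = 512  # assumption: 512-byte sectors

label = {
    "a": (2097152, 0),
    "b": (8388608, 2097152),
    "d": (8388608, 10485760),
    "e": (2097152, 18874368),
    "f": (41943040, 20971520),
    "g": (41941632, 62914560),
}
whole_slice = 104856192  # size of the "c" (raw) partition


def check_layout(label, total):
    """Return a list of problems: gaps, overlaps, or a mismatch with the slice size."""
    problems = []
    expected = 0
    # Walk the partitions in order of their starting offset.
    for name, (size, offset) in sorted(label.items(), key=lambda kv: kv[1][1]):
        if offset != expected:
            problems.append(f"{name}: starts at {offset}, expected {expected}")
        expected = offset + size
    if expected != total:
        problems.append(f"partitions end at {expected}, slice has {total} sectors")
    return problems


problems = check_layout(label, whole_slice)
slice_gib = whole_slice * SECTOR_BYTES / 2**30
print(problems or "layout is contiguous", f"({slice_gib:.1f} GiB)")
```

With these numbers the list of problems comes back empty, and the slice size works out to roughly the 50GB system partition described earlier.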


em0@pci0:4:1:0: class=0x020000 card=0x10038086 chip=0x10018086 rev=0x02 hdr=0x00
vendor = 'Intel Corporation'
device = '2572 10/100/1000 Ethernet Controller (Fiber)'
class = network
subclass = ethernet
bar [10] = type Memory, range 32, base 0xda000000, size 131072, enabled
bar [14] = type Memory, range 32, base 0xda020000, size 65536, enabled
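
For anyone trying to match the NIC against driver device tables, the chip= and card= values in the pciconf output pack two 16-bit IDs into one 32-bit number: the vendor ID sits in the low 16 bits and the device (or subdevice) ID in the high 16 bits. A minimal sketch of that decoding follows. The helper name split_pci_id is hypothetical, and the input values are the ones reported in this message.

```python
def split_pci_id(value):
    """Split a 32-bit pciconf ID into (vendor, device); the vendor is the low 16 bits."""
    return value & 0xFFFF, value >> 16


# Values taken from the em0 and twa0 entries in this message.
chip_vendor, chip_device = split_pci_id(0x10018086)
card_vendor, card_device = split_pci_id(0x10038086)
twa_vendor, twa_device = split_pci_id(0x100413C1)

print(f"em0 chip: vendor 0x{chip_vendor:04x}, device 0x{chip_device:04x}")
print(f"em0 card: subvendor 0x{card_vendor:04x}, subdevice 0x{card_device:04x}")
print(f"twa0 chip: vendor 0x{twa_vendor:04x}, device 0x{twa_device:04x}")
```

Vendor 0x8086 is Intel, which agrees with the vendor string shown for em0.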

twa0@pci0:9:0:0: class=0x010400 card=0x100413c1 chip=0x100413c1 rev=0x01 hdr=0x00
device = '9650SE Series PCI-Express SATA2 Raid Controller'
class = mass storage
subclass = RAID
bar [10] = type Prefetchable Memory, range 64, base 0xd8000000, size 33554432, enabled
bar [18] = type Memory, range 64, base 0xda300000, size 4096, enabled
bar [20] = type I/O Port, range 32, base 0x3000, size 256, enabled
cap 01[40] = powerspec 2 supports D0 D1 D2 D3 current D0
cap 05[50] = MSI supports 32 messages, 64 bit
cap 10[70] = PCI-Express 1 legacy endpoint

--
greg byshenk - gbyshenk@xxxxxxxxxxx - Leiden, NL
_______________________________________________
freebsd-stable@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@xxxxxxxxxxx"



