Re: [mfi] command timeouts



Bjoern A. Zeeb wrote:
On Mon, 19 Feb 2007, Bjoern A. Zeeb wrote:

Hi,

I am testing mfi on a Dell 2950 with 6 physical disks (PD) and 2 logical
drives (LD): the 1st LD is RAID1, the 2nd LD is RAID5, plus 1 hot spare (HTSP).
(The somewhat sucky) megacli "works".

Most information-gathering commands work fine, as does pulling out
disks hard. Setting a disk offline or running certain other commands, however,
hangs 'something', which might be the controller.

For example:

foo# megacli -PDOffline -PhysDrv'[1:3]' -a0

EnclId-1 SlotId-3 state changed to OffLine.
foo# foo# ls -l
<hangs forever>

It's not only this process that hangs, but every process doing disk I/O.


On the serial console I get:

...
mfi0: COMMAND 0xffffffff80c3c040 TIMEOUT AFTER 732 SECONDS
mfi0: COMMAND 0xffffffff80c3b8d0 TIMEOUT AFTER 732 SECONDS
mfi0: COMMAND 0xffffffff80c3cb68 TIMEOUT AFTER 732 SECONDS
mfi0: COMMAND 0xffffffff80c3bd98 TIMEOUT AFTER 732 SECONDS
mfi0: COMMAND 0xffffffff80c3bc88 TIMEOUT AFTER 732 SECONDS
mfi0: COMMAND 0xffffffff80c3cbf0 TIMEOUT AFTER 732 SECONDS
mfi0: COMMAND 0xffffffff80c3cc78 TIMEOUT AFTER 732 SECONDS
mfi0: COMMAND 0xffffffff80c3cf20 TIMEOUT AFTER 732 SECONDS
mfi0: COMMAND 0xffffffff80c3cd88 TIMEOUT AFTER 732 SECONDS
mfi0: COMMAND 0xffffffff80c3cfa8 TIMEOUT AFTER 732 SECONDS
mfi0: COMMAND 0xffffffff80c3d828 TIMEOUT AFTER 684 SECONDS
mfi0: COMMAND 0xffffffff80c3db58 TIMEOUT AFTER 679 SECONDS
mfi0: COMMAND 0xffffffff80c3de88 TIMEOUT AFTER 44 SECONDS
mfi0: COMMAND 0xffffffff80c3c728 TIMEOUT AFTER 763 SECONDS
mfi0: COMMAND 0xffffffff80c3c040 TIMEOUT AFTER 763 SECONDS
mfi0: COMMAND 0xffffffff80c3b8d0 TIMEOUT AFTER 763 SECONDS
mfi0: COMMAND 0xffffffff80c3cb68 TIMEOUT AFTER 763 SECONDS
mfi0: COMMAND 0xffffffff80c3bd98 TIMEOUT AFTER 763 SECONDS
mfi0: COMMAND 0xffffffff80c3bc88 TIMEOUT AFTER 763 SECONDS
mfi0: COMMAND 0xffffffff80c3cbf0 TIMEOUT AFTER 763 SECONDS
mfi0: COMMAND 0xffffffff80c3cc78 TIMEOUT AFTER 763 SECONDS
mfi0: COMMAND 0xffffffff80c3cf20 TIMEOUT AFTER 763 SECONDS
mfi0: COMMAND 0xffffffff80c3cd88 TIMEOUT AFTER 763 SECONDS
mfi0: COMMAND 0xffffffff80c3cfa8 TIMEOUT AFTER 763 SECONDS
mfi0: COMMAND 0xffffffff80c3d828 TIMEOUT AFTER 715 SECONDS
mfi0: COMMAND 0xffffffff80c3db58 TIMEOUT AFTER 710 SECONDS
mfi0: COMMAND 0xffffffff80c3de88 TIMEOUT AFTER 75 SECONDS
mfi0: COMMAND 0xffffffff80c3c728 TIMEOUT AFTER 793 SECONDS
mfi0: COMMAND 0xffffffff80c3c040 TIMEOUT AFTER 794 SECONDS
mfi0: COMMAND 0xffffffff80c3b8d0 TIMEOUT AFTER 794 SECONDS
mfi0: COMMAND 0xffffffff80c3cb68 TIMEOUT AFTER 794 SECONDS
mfi0: COMMAND 0xffffffff80c3bd98 TIMEOUT AFTER 794 SECONDS
mfi0: COMMAND 0xffffffff80c3bc88 TIMEOUT AFTER 794 SECONDS
mfi0: COMMAND 0xffffffff80c3cbf0 TIMEOUT AFTER 794 SECONDS
mfi0: COMMAND 0xffffffff80c3cc78 TIMEOUT AFTER 794 SECONDS
mfi0: COMMAND 0xffffffff80c3cf20 TIMEOUT AFTER 794 SECONDS
mfi0: COMMAND 0xffffffff80c3cd88 TIMEOUT AFTER 794 SECONDS
mfi0: COMMAND 0xffffffff80c3cfa8 TIMEOUT AFTER 794 SECONDS
mfi0: COMMAND 0xffffffff80c3d828 TIMEOUT AFTER 746 SECONDS
mfi0: COMMAND 0xffffffff80c3db58 TIMEOUT AFTER 741 SECONDS
mfi0: COMMAND 0xffffffff80c3de88 TIMEOUT AFTER 106 SECONDS
mfi0: COMMAND 0xffffffff80c3c728 TIMEOUT AFTER 824 SECONDS
...
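
To make the pattern in the log above easier to see, here is a minimal
sketch that groups the timeout lines by command address. The function
name group_timeouts and the regular expression are illustrative
assumptions, not part of mfi or megacli; the sample lines are copied
from the console output above.

```python
import re
from collections import defaultdict

# Matches the watchdog lines quoted above, e.g.
# "mfi0: COMMAND 0xffffffff80c3c040 TIMEOUT AFTER 732 SECONDS"
TIMEOUT_RE = re.compile(r"^(\w+): COMMAND (0x[0-9a-f]+) TIMEOUT AFTER (\d+) SECONDS$")


def group_timeouts(lines):
    """Map each command address to the ages reported for it, in log order."""
    ages = defaultdict(list)
    for line in lines:
        m = TIMEOUT_RE.match(line.strip())
        if m:
            ages[m.group(2)].append(int(m.group(3)))
    return dict(ages)


# A few lines taken from the console log above.
sample = """\
mfi0: COMMAND 0xffffffff80c3c040 TIMEOUT AFTER 732 SECONDS
mfi0: COMMAND 0xffffffff80c3de88 TIMEOUT AFTER 44 SECONDS
mfi0: COMMAND 0xffffffff80c3c040 TIMEOUT AFTER 763 SECONDS
mfi0: COMMAND 0xffffffff80c3de88 TIMEOUT AFTER 75 SECONDS
mfi0: COMMAND 0xffffffff80c3c040 TIMEOUT AFTER 794 SECONDS
mfi0: COMMAND 0xffffffff80c3de88 TIMEOUT AFTER 106 SECONDS
""".splitlines()

for addr, seen in group_timeouts(sample).items():
    # Difference between consecutive reports for the same command.
    deltas = [b - a for a, b in zip(seen, seen[1:])]
    print(addr, seen, "deltas:", deltas)
```

For these sample lines the same addresses come back on every pass with
an age about 31 seconds higher, which suggests the same commands are
being reported again and again rather than new ones failing.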


I can still break to ddb. Without disk I/O, the only
possible thing I can really do is type reset.

I'll build a debugging kernel so I can do show alllocks, etc.,
but if someone with more experience with this driver/hw could
contact me, I can run further tests.


This time with the debugging kernel:

foo# megacli -PDOffline -PhysDrv'[1:3]' -a0

EnclId-1 SlotId-3 state changed to OffLine.
foo# foo# foo# foo#


I was able to hit <enter> multiple times (the "uh, it still lives" moment),
but then ...

command 0xffffffff80c40000 not in queue, flags = 0x20, bit = 0x80
panic: command not in queue
cpuid = 2
Uptime: 1m17s
Physical memory: 4084 MB
Dumping 199 MB: 184 168 152 136 120 104 88 72 56 40 24 8
Dump complete

telnet> send brk
KDB: enter: Line break on console
[thread pid 15 tid 100009 ]
Stopped at kdb_enter+0x2f: nop
db> where
Tracing pid 15 tid 100009 td 0xffffff012f5c4000
kdb_enter() at kdb_enter+0x2f
siointr1() at siointr1+0x400
siointr() at siointr+0x2e
intr_execute_handlers() at intr_execute_handlers+0x124
Xapic_isr1() at Xapic_isr1+0x7f
--- interrupt, rip = 0xffffffff803c9787, rsp = 0xffffffffac06eb30, rbp = 0xffffffffac06eb60 ---
_mtx_lock_sleep() at _mtx_lock_sleep+0x137
_mtx_lock_flags() at _mtx_lock_flags+0xe1
mfi_timeout() at mfi_timeout+0x32
softclock() at softclock+0x1c8
ithread_loop() at ithread_loop+0xfe
fork_exit() at fork_exit+0xaa
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffffffac06ed40, rbp = 0 ---
db> show alllocks
Process 24 (irq78: mfi0) thread 0xffffff012f5c5000 (100020)
exclusive sleep mutex MFI I/O lock r = 0 (0xffffff012f5cc630) locked @ /u1/src/HEAD/sys/dev/mfi/mfi.c:775


After the reboot, the command does not appear to have been
executed: the disk still seems to be online (at least
it was the last time).


megacli is known to be fragile. Don't Do That (tm). As for the panic,
it's probably a side effect of megacli putting the card and the driver into a chaotic state.

Scott

_______________________________________________
freebsd-current@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@xxxxxxxxxxx"


