Re: CURRENT freezes on Laitude D520
- From: Tai-hwa Liang <avatar@xxxxxxxxxxxxxxxxxxxx>
- Date: Mon, 11 Dec 2006 21:50:14 +0800 (CST)
On Sun, 10 Dec 2006, Robert Watson wrote:
On Sun, 10 Dec 2006, Tai-hwa Liang wrote:
which get a bit more to the heart of most problems. debug.mpsafenet=1 really exists for the purposes of supporting components which are not sufficiently locked to allow the stack to run MPSAFE, rather than as a means of disabling direct dispatch and preemption, which speak to different types of problems. The main reason that I haven't removed the administrator tunable to date is that I suspect it will be quite helpful when KAME IPSEC locking happens, but since that appears not to have happened yet, debug.mpsafenet as an option is likely causing more harm than good by being available as a stand-in sysctl masking other problems, causing people to not get to the point of properly identifying the actual cause (device driver bugs, etc).
Can the aforementioned tricks(1/2/3) being applied to RELENG_6 as well?
WITNESS is available in RELENG_6, and should be used in combination with INVARIANTS, DDB, KDB, and BREAK_TO_DEBUGGER to debug deadlocks.
Would a kernel with WITNESS/[KD]DB/BREAK_TO_DEBUGGER enabled but w/o
INVARIANTS compiled adequate to dump useful information through remote
serial console?
In RELENG_6, net.isr.direct is not enabled by default, so unless you've enabled it yourself (or are using IP fast forwarding, which is functionally similar), that won't apply.
In RELENG_6, PREEMPTION is in GENERIC and hence enabled by default, and it can be disabled by removing it from your kernel configuration. I'd like it if we could add a run-time sysctl to disable preemption even if PREEMPTION is compiled in, as it would make it easier to explore its stability and performance impact. However, this is also just a debugging step to see if that quiesces the problem, and not a fix for the actual bug.
Right now, we're discussing removing the manual debug.mpsafenet configuration flag from 7.x, and not 6.x. I fully recognize the importance of having it in place as a workaround for bugs in production, although it concerns me greatly that we're not getting these problems debugged and fixed, and instead masking them. Architectural changes are on the way that will require these bugs to be fixed properly, not just masked.
We are using RELENG_6 as our production server(postfix, squid, pf firewall/NAT, FAST_IPSEC VPN, ...), which is a dual Athlon MP board with three NICs(two fxp cards and one onboard xl, connected to three different networks).
I haven't try WITNESS, yet; however, I'm very sure that net.isr.direct=0 plus that there is no PREEMPTION in current kernel. The problem is that, with debug.mpsafenet=1, we'll always run into hard freeze w/o having any kdb> prompt on console.
Whilst turning debug.mpsafenet off only masks the real problem, I'm still wondering about if there is any less damaging way to track such problem down in a _production_ environment.
It sounds like you need to follow the instructions for kernel debugging. Depending on your tolerance of performance loss, downtime, etc, a good starting point is to configure the kernel with INVARIANTS and WITNESS. WITNESS is particularly important, if you can tolerate the performance hit, as it warns of potential deadlocks, not just actual deadlocks. Also, compile
With WITNESS enabled(debug.mpsafenet=0), I got another three pf related
warnings in the last 8 hours:
#
# this one looks like lor#046; however, I'm not sure if it is safe to drop
# pf_task_mtx with PF_UNLOCK before invoking pf_test() in pf_check_out().
#
lock order reversal:
1st 0xc06fca2c tcp (tcp) @ netinet/tcp_input.c:625
2nd 0xc69a1060 pf task mtx (pf task mtx) @ /usr/src/sys/modules/pf/../../contrib/pf/net/pf.c:6386
KDB: stack backtrace:
kdb_backtrace(ffffffff,c06bc0d0,c06ba6b8,c06847a4,c06f8b00,...) at kdb_backtrace+0x29
witness_checkorder(c69a1060,9,c699e609,18f2) at witness_checkorder+0x4cd
_mtx_lock_flags(c69a1060,0,c699e609,18f2,0,...) at _mtx_lock_flags+0x1e
pf_test(2,c64c6800,e4f6cae0,0,0,...) at pf_test+0x6b
pf_check_out(0,e4f6cae0,c64c6800,2,0,...) at pf_check_out+0x3d
pfil_run_hooks(c06fc5e0,e4f6cb5c,c64c6800,2,0,...) at pfil_run_hooks+0xb3
ip_output(c64e0300,0,e4f6cb28,0,0,...) at ip_output+0x75e
tcp_respond(0,c650a030,c650a044,c64e0300,eb232182,0,14) at tcp_respond+0x22a
tcp_input(c64e0300,14,8,121245cb,1,...) at tcp_input+0x2559
ip_input(c64e0300) at ip_input+0x729
netisr_processqueue(c06fa298) at netisr_processqueue+0x6e
swi_net(0) at swi_net+0x8c
ithread_execute_handlers(c63ff430,c644a080) at ithread_execute_handlers+0xce
ithread_loop(c63b6b10,e4f6cd38,c06aafc0,0,c06514db,...) at ithread_loop+0x4f
fork_exit(c04dd70c,c63b6b10,e4f6cd38) at fork_exit+0x61
fork_trampoline() at fork_trampoline+0x8
--- trap 0x1, eip = 0, esp = 0xe4f6cd6c, ebp = 0 ---
lock order reversal:
1st 0xc6b3984c inp (tcpinp) @ netinet/tcp_usrreq.c:368
2nd 0xc69a1060 pf task mtx (pf task mtx) @ /usr/src/sys/modules/pf/../../contrib/pf/net/pf.c:6386
KDB: stack backtrace:
kdb_backtrace(ffffffff,c06bc0a8,c06ba6b8,c06847a4,c06f8b38,...) at kdb_backtrace+0x29
witness_checkorder(c69a1060,9,c699e609,18f2) at witness_checkorder+0x4cd
_mtx_lock_flags(c69a1060,0,c699e609,18f2,0,...) at _mtx_lock_flags+0x1e
pf_test(2,c6568800,e8ef7b40,0,c6b397bc,...) at pf_test+0x6b
pf_check_out(0,e8ef7b40,c6568800,2,c6b397bc,...) at pf_check_out+0x3d
pfil_run_hooks(c06fc5e0,e8ef7bbc,c6568800,2,c6b397bc,...) at pfil_run_hooks+0xb3
ip_output(c66d3400,0,e8ef7b88,0,0,c6b397bc) at ip_output+0x75e
tcp_output(c6cf8740) at tcp_output+0xd51
tcp_usr_connect(c6ca242c,c6c25260,c6b48600) at tcp_usr_connect+0xe3
soconnect(c6ca242c,c6c25260,c6b48600) at soconnect+0x4e
kern_connect(c6b48600,4,c6c25260,c6c25260,0,...) at kern_connect+0xbb
connect(c6b48600,e8ef7d04) at connect+0x2f
syscall(3b,3b,3b,804f380,80521a0,...) at syscall+0x22f
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (98, FreeBSD ELF32, connect), eip = 0x28126fbf, esp = 0xbfbfd26c, ebp = 0xbfbfd3d8 ---
lock order reversal:
1st 0xc6d667d0 ipsec request (ipsec request) @ netipsec/xform_esp.c:897
2nd 0xc69a1060 pf task mtx (pf task mtx) @ /usr/src/sys/modules/pf/../../contrib/pf/net/pf.c:6386
KDB: stack backtrace:
kdb_backtrace(ffffffff,c06ba4b0,c06ba6b8,c06847a4,c06f8ac8,...) at kdb_backtrace+0x29
witness_checkorder(c69a1060,9,c699e609,18f2) at witness_checkorder+0x4cd
_mtx_lock_flags(c69a1060,0,c699e609,18f2,0,...) at _mtx_lock_flags+0x1e
pf_test(2,c64c6800,e4f6c7cc,0,0,...) at pf_test+0x6b
pf_check_out(0,e4f6c7cc,c64c6800,2,0,...) at pf_check_out+0x3d
pfil_run_hooks(c06fc5e0,e4f6c848,c64c6800,2,0,...) at pfil_run_hooks+0xb3
ip_output(c64e4e00,0,e4f6c814,2,0,...) at ip_output+0x75e
ipsec_process_done(c64e4e00,c6d66780,c6f59e88,c6f4a240,c0693460,...) at ipsec_process_done+0x120
esp_output_cb(c6f59e88) at esp_output_cb+0x1ed
crypto_done(c6f59e88,2,c6452000,c6f59e88,c6f4a264,...) at crypto_done+0xa0
swcr_process(0,c6f59e88,0,c04e7b7d,c0693460,...) at swcr_process+0x126
crypto_invoke(c6452000,c6f59e88,0) at crypto_invoke+0x9e
crypto_dispatch(c6f59e88,1a,c6f59e88,c6f5be38,c6f0d488,...) at crypto_dispatch+0x4a
esp_output(c64e4e00,c6d66780,0,14,9,...) at esp_output+0x4ee
ipsec4_process_packet(c64e4e00,c6d66780,0,0,c6786400,...) at ipsec4_process_packet+0x206
ip_output(c64e4e00,0,c6de9bb0,0,0,...) at ip_output+0x6df
in_gif_output(c6539000,2,c64e4e00,c6de9b84,c6de9b80,...) at in_gif_output+0x237
gif_output(c6539000,c6b95300,c6d73790,c6ba04a4,c6e1de00,...) at gif_output+0x20f
ip_output(c6b95300,0,e4f6caf8,0,0,...) at ip_output+0x8de
syncache_respond(c6d5eaf0,c6b95300) at syncache_respond+0x2a1
syncache_add(e4f6cbec,e4f6cc3c,c694a86c,e4f6cbe8,c6912b00,e4f6cc3c,c694a880,8,1) at syncache_add+0x37f
tcp_input(c6912b00,14,8,c065b728,0,...) at tcp_input+0x6dc
ip_input(c6912b00) at ip_input+0x729
netisr_processqueue(c06fa298) at netisr_processqueue+0x6e
swi_net(0) at swi_net+0x8c
ithread_execute_handlers(c63ff430,c644a080) at ithread_execute_handlers+0xce
ithread_loop(c63b6b10,e4f6cd38,c06aafc0,0,c06514db,...) at ithread_loop+0x4f
fork_exit(c04dd70c,c63b6b10,e4f6cd38) at fork_exit+0x61
fork_trampoline() at fork_trampoline+0x8
--- trap 0x1, eip = 0, esp = 0xe4f6cd6c, ebp = 0 ---
the kernel with KDB, DDB, and BREAK_TO_DEBUGGER, and user a serial or firewire console. If the hang occurs, see if you can get into the debugger, in which case the logged output from DDB for the following commands would be very useful:
show pcpu
show allpcpu
trace
alltrace
ps
show locks
show alllocks
show lockedvnods
show uma
show malloc
Please open a PR that describes your configuration, includes your kernel config (since it seems quite customized), any loader.conf settings, a detailed description of the problem, and the output. I'd be quite interested
Okay, I'll file a PR once I can collect more information with the serial
console(probably weekend). For now our system administrator is pretty
nervous about my suggestion on turning debug.mpsafenet back to 1. ;)
in know, once the machine is in a hung state, whether the numlock light goes on and off when you hit the numlock key on the keyboard.
The numlock light doesn't respond to the key when the machine is hanging;
hence Ctrl-Alt-Esc wouldn't break to debugger.
--
Thanks,
Tai-hwa Liang
_______________________________________________
freebsd-current@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@xxxxxxxxxxx"
- Follow-Ups:
- Re: CURRENT freezes on Laitude D520
- From: Robert Watson
- Re: CURRENT freezes on Laitude D520
- References:
- CURRENT freezes on Laitude D520
- From: yal
- Re: CURRENT freezes on Laitude D520
- From: Maxim Konovalov
- Re: CURRENT freezes on Laitude D520
- From: Robert Watson
- Re: CURRENT freezes on Laitude D520
- From: Andrew Pantyukhin
- Re: CURRENT freezes on Laitude D520
- From: Robert Watson
- Re: CURRENT freezes on Laitude D520
- From: Tai-hwa Liang
- Re: CURRENT freezes on Laitude D520
- From: Robert Watson
- CURRENT freezes on Laitude D520
- Prev by Date: dcons kldunload panic
- Next by Date: Re: New Challenges ( FreeBSD on PS3 )
- Previous by thread: Re: CURRENT freezes on Laitude D520
- Next by thread: Re: CURRENT freezes on Laitude D520
- Index(es):
Relevant Pages
|