Re: 6-stable locking problem



Darn, I thought that that was already fixed. I'll go dig up my patches and take care of this.

Scott


Pavel Merdin wrote:
Hello.

There's a problem with a very busy server (ad server, CPU is close to
0% idle most of the time).
Configuration: Dual AMD Opteron 252 2.6GHz
Chipset: AMD 8131
Integrated LAN Controller: Broadcom BCM5704 dual-channel GbE Gigabit
Adaptec AIC-7902W Ultra 320 SCSI controller
amr0: <LSILogic MegaRAID 1.53>

We tried both 6.1-RELEASE and 6-STABLE amd64 kernels. (The bge driver is
always from a recent stable with full Broadcom support.)

The server hangs one or more times a day. It even hangs for some time
right after the boot sequence finishes (when the "login:" prompt appears).
During a hang everything stops, even the keyboard (interrupts).

We have already removed PREEMPTION and Linux support.
Sometimes the server panics with:
Sleeping thread (tid 100006, pid 4) owns a non-sleepable lock
panic: sleeping thread
cpuid=0
KDB: enter: panic
and hangs there without even entering the debugger.
pid 4 seems to be [g_down]

Today I compiled a kernel with INVARIANTS and WITNESS.
Right after the boot sequence I got the following:

Aug 10 04:37:09 ad1 kernel: lock order reversal: (Giant after non-sleepable)
Aug 10 04:37:09 ad1 kernel: 1st 0xffffff026c4ebe70 AMR List Lock (AMR List Lock) @ dev/amr/amr.c:403
Aug 10 04:37:09 ad1 kernel: 2nd 0xffffffff8073adc0 Giant (Giant) @ vm/vm_contig.c:579
Aug 10 04:37:09 ad1 kernel: KDB: stack backtrace:
Aug 10 04:37:09 ad1 kernel: kdb_backtrace() at kdb_backtrace+0x37
Aug 10 04:37:09 ad1 kernel: witness_checkorder() at witness_checkorder+0x6fb
Aug 10 04:37:09 ad1 kernel: _mtx_lock_flags() at _mtx_lock_flags+0x9a
Aug 10 04:37:09 ad1 kernel: contigmalloc() at contigmalloc+0x57
Aug 10 04:37:09 ad1 kernel: alloc_bounce_pages() at alloc_bounce_pages+0x75
Aug 10 04:37:09 ad1 kernel: bus_dmamap_create() at bus_dmamap_create+0x149
Aug 10 04:37:09 ad1 kernel: amr_alloccmd_cluster() at amr_alloccmd_cluster+0x102
Aug 10 04:37:09 ad1 kernel: amr_alloccmd() at amr_alloccmd+0x55
Aug 10 04:37:09 ad1 kernel: amr_bio_command() at amr_bio_command+0x27
Aug 10 04:37:09 ad1 kernel: amr_startio() at amr_startio+0x6a
Aug 10 04:37:09 ad1 kernel: amr_submit_bio() at amr_submit_bio+0x51
Aug 10 04:37:09 ad1 kernel: amrd_strategy() at amrd_strategy+0x23
Aug 10 04:37:09 ad1 kernel: g_disk_start() at g_disk_start+0x17d
Aug 10 04:37:09 ad1 kernel: g_io_schedule_down() at g_io_schedule_down+0x189
Aug 10 04:37:09 ad1 kernel: g_down_procbody() at g_down_procbody+0x80
Aug 10 04:37:09 ad1 kernel: fork_exit() at fork_exit+0xdf
Aug 10 04:37:09 ad1 kernel: fork_trampoline() at fork_trampoline+0xe
Aug 10 04:37:09 ad1 kernel: --- trap 0, rip = 0, rsp = 0xffffffffb8e8bd00, rbp = 0 ---

Any advice (other than switching to Linux)?


_______________________________________________
freebsd-stable@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@xxxxxxxxxxx"


