Re: LOR on sleepqueue chain locks, Was: LOR sleepq/scrlock




On 23/04/2008, at 3:34 AM, John Baldwin wrote:

The real problem at the bottom of the screen though is a real issue. It's a
LOR of two different sleepqueue chain locks. The problem is that when
setrunnable() encounters a swapped out thread it tries to wakeup proc0, but
if proc0 is asleep (which is typical) then its thread lock is a sleep queue
chain lock, so waking up a swapped out thread from wakeup() will usually
trigger this LOR.

I think the best fix is to not have setrunnable() kick proc0 directly.
Perhaps setrunnable() should return an int and return true if proc0 needs to
be awakened and false otherwise. Then the sleepq code (b/c only sleeping
threads can be swapped out anyway) can return that value from
sleepq_resume_thread() and can call kick_proc0() directly once it has dropped
all of its own locks.

--
John Baldwin
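
A minimal userland sketch of the fix described above may help make the idea concrete. It is not the kernel code and not John's patch: the model_* functions, struct model_thread, and the counters are hypothetical stand-ins, and a simple counter stands in for the sleepqueue chain locks. It only illustrates the shape of the change, assuming setrunnable() reports "proc0 needs a kick" to its caller and the caller acts on that after dropping its own chain lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userland stand-ins; none of this is real kernel code. */
struct model_thread {
	bool swapped_out;	/* thread's process is swapped out */
	bool runnable;		/* set once the thread has been made runnable */
};

static int chain_locks_held;	/* modeled sleepqueue chain locks currently held */
static int proc0_kicks;		/* how many times the modeled kick_proc0() ran */

static void
chain_lock(void)
{
	chain_locks_held++;
}

static void
chain_unlock(void)
{
	assert(chain_locks_held > 0);
	chain_locks_held--;
}

/*
 * Models kick_proc0(). Waking a sleeping proc0 takes proc0's own chain
 * lock, so the model asserts that no other chain lock is held on entry.
 */
static void
model_kick_proc0(void)
{
	assert(chain_locks_held == 0);
	chain_lock();
	proc0_kicks++;
	chain_unlock();
}

/*
 * Models the proposed setrunnable(): it no longer kicks proc0 itself,
 * it only returns nonzero when the caller needs to do so.
 */
static int
model_setrunnable(struct model_thread *td)
{
	td->runnable = true;
	return (td->swapped_out ? 1 : 0);
}

/* Models sleepq_resume_thread(): called with one chain lock held. */
static int
model_sleepq_resume_thread(struct model_thread *td)
{
	assert(chain_locks_held == 1);
	return (model_setrunnable(td));
}

/* Models wakeup(): the kick is deferred until the chain lock is dropped. */
static void
model_wakeup(struct model_thread *td)
{
	int wakeup_swapper;

	chain_lock();
	wakeup_swapper = model_sleepq_resume_thread(td);
	chain_unlock();
	if (wakeup_swapper)
		model_kick_proc0();
}
```

Calling model_wakeup() on a thread with swapped_out set kicks the modeled proc0 exactly once, and only after the caller's chain lock has been released; for a thread that is not swapped out, no kick happens at all.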

The way you describe it, it almost sounds like this LOR should be
happening for everyone, all the time. To try and eliminate the factors
which trigger it for us, we tried the following: removed PAE from
kernel, disabled PF. Neither of these things made any difference and
the error is fairly quickly reproducible (within a couple of hours
running various things to load the machine). The one thing we did not
test yet is removing ZFS from the picture. Note also that this box ran
for years and years on FreeBSD 4.x without a hiccup (non PAE, ipfw
instead of pf and no ZFS of course).

There are two things. 1) Most people who run witness (that I know of) don't
run it on spinlocks because of the overhead, so LORs of spin locks are less
well-reported than LORs of other locks (mutexes, rwlocks, etc.). 2) You have
to have enough load on the box to swap out active processes to get into this
situation. Between those I think that is why this is not more widely
reported.


Hi John,

Thanks for your efforts so far to track this LOR down. I've been keeping an eye on cvs logs, but haven't seen anything which looks like a patch for this.

* is this still outstanding?
* or will it be addressed soon?
* if not, should I create a PR so that it doesn't get forgotten?
* in our case, although we can trigger it quickly with some load, the problem occurs (and causes a complete machine lockup) even under < 10% load. Not sure if the combination of PAE/ZFS/SCHED_ULE exacerbates that in any way compared to a 'standard' build.


Thank you
Ari Maniatis


-------------------------->
ish
http://www.ish.com.au
Level 1, 30 Wilson Street Newtown 2042 Australia
phone +61 2 9550 5001 fax +61 2 9550 4001
GPG fingerprint CBFB 84B4 738D 4E87 5E5C 5EFA EF6A 7D2E 3E49 102A


_______________________________________________
freebsd-stable@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@xxxxxxxxxxx"


