Re: Race condition in debugger?

From: David Xu (davidxu_at_freebsd.org)
Date: 04/18/05

Date: Mon, 18 Apr 2005 13:26:44 +0800
To: Peter Edwards <peadar.edwards@gmail.com>
    
    

    Peter Edwards wrote:

    >[Very late response: I just experienced the same problem and
    >remembered the issue had been brought up before]
    >
    >On 2/14/05, Greg 'groggy' Lehey <grog@freebsd.org> wrote:
    >
    >
    >>I'm having some problems with userland gdb on recent -CURRENT builds:
    >>at some point it hangs.
    >>
    >>Specifically, I'm setting a conditional breakpoint like this:
    >>
    >> b Minsert_blockletpointer if I->inode_num == 0x1f0bb
    >>
    >>inode_num increments for 1, so I hit this breakpoint about 100,000
    >>times. Or I should. What happens is that the debugger hangs at some
    >>point on the way. ktrace shows multiple copies of:
    >>
    >> 12325 gdb CALL ptrace(12,0x3026,0xbfbfd5e0,0)
    >> 12325 gdb RET ptrace 0
    >> 12325 gdb CALL ptrace(PT_STEP,0x3026,0x1,0)
    >> 12325 gdb RET ptrace 0
    >> 12325 gdb CALL wait4(0xffffffff,0xbfbfd808,0,0) <-- stops here
    >> 12325 gdb RET wait4 12326/0x3026
    >> 12325 gdb CALL kill(0x3026,0)
    >> 12325 gdb RET kill 0
    >> 12325 gdb CALL ptrace(PT_GETREGS,0x3026,0xbfbfd5c0,0)
    >>
    >>When it hangs, it's at the call to wait4, as shown. It looks like the
    >>completion of the ptrace request isn't being reported back.
    >>
    >>
    >
    >I think I know what's going on with this, and I have a feeling that
    >there are a couple of other wait()-related issues left open on the
    >lists that might be explained by the same problem.
    >
    >Here's my hypothesis: kern_wait() checks each child of the current
    >process to see whether it has exited or should otherwise report status
    >to wait/wait3/wait4/waitpid. If it finds that no candidate child has
    >anything to report, it goes to sleep, waiting to be woken by a child
    >reporting status, and then repeats the scan. It looks a bit like
    >this:
    >
    >kern_wait()
    >{
    >loop:
    >        foreach child of self {
    >                if (child has status to report)
    >                        return status;
    >        }
    >        lock self
    >        msleep(on "self")
    >        unlock self
    >        goto loop;
    >}
    >
    >The problem is that no lock guarantees that the conditions checked in
    >the inner loop still hold by the time the current process locks its own
    >"struct proc" and calls msleep(). (The race is most likely on an SMP
    >machine or with PREEMPTION, but acquiring curproc's lock could also
    >expose it if that acquisition has to sleep.) In other words, you can
    >miss the wakeup generated by a particular child between checking that
    >child in the inner loop and going to sleep.
    >
    >I can at least reproduce this for the ptrace/gdb case, but AFAICT it
    >could happen on the standard wait()/exit() path, too. I worked up a
    >patch that fixes the problem: the parts of the kernel that wake the
    >parent record the event in the parent's flags and do the wakeup while
    >holding the parent's process lock, and kern_wait() checks whether this
    >flag has been set before going to sleep. (A simpler solution would be
    >to hold the parent's lock across the bulk of kern_wait(), but from what
    >I can gather this would lead to at least one lock order reversal (LOR).)
    >
    >I've been unable to reproduce the problem on a kernel with this
    >patch, and a liberal sprinkling of printfs shows that whenever GDB
    >hangs, the race has just occurred.
    >
    >Anyone got opinions on this?
    >Cheers,
    >Peadar.
    >
    >
    If the parent has PS_NOCLDSTOP set, no SIGCHLD will be sent to the parent,
    so there is a race in that case; but if PS_NOCLDSTOP is not set, the signal
    will be sent to the parent, and the parent should resume from msleep()
    immediately. I don't know why there is still a race; I am looking at the
    code. I think stop() should be merged into thread_stopped(), since it has
    no other caller.

    David Xu

    _______________________________________________
    freebsd-current@freebsd.org mailing list
    http://lists.freebsd.org/mailman/listinfo/freebsd-current
    To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"

