Re: lock up



Kris Kennaway wrote:
Joao Pedras wrote:
Hi again,

I hope this info sheds some light on the problem. A similar system is
being set up to reproduce it there as well.

Thanks!

Joao

Kris Kennaway wrote:
Joao Pedras wrote:
Greetings all!

A system (Tyan S2932) on which I am testing CURRENT amd64 is experiencing
a strange lock up. There is no panic and not much on the console; the
system simply freezes.

I first noticed the issue while tailing a build in an SSH session over a
VPN connection. On the local network the issue doesn't seem to occur.
I can reproduce the lock up every time.

Last I tried, the system was running CURRENT from a couple hours ago.

I have tried:

- 4BSD and ULE
- switching network cards
- taking the IPMI card out (seems to work locally with freeipmi and
ipmitool remotely)
- enabled/disabled "redirection after post" (BIOS setting)
- without debug
- without IPv6 and friends (see rtfree below)

dmesg and pciconf are attached. The dmesg is from after a lock up.
rtfree pops up a few times before the lock up. I noticed from a recent
post that some action was taken, and the related patch is included
(i.e. today's CURRENT).

The system had a fresh install this past weekend, and the LSI MegaRAID
array doesn't contain any data; it is just mounted. The system boots off
the onboard LSI SAS (a couple of disks in RAID 1).

Thank you for your input.
Break to DDB and obtain process traces, etc. See the Developers'
Handbook.

Kris


Well that's a start, but you need to trace other processes too, e.g. the
running ones. Do an 'alltrace' to trace everything.

Kris
_______________________________________________
freebsd-current@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@xxxxxxxxxxx"

Thank you for your help Kris. I am not very familiar with these tasks.

I got the 'alltrace' and I also saved a crash dump.
The session output is here: http://pedras.webvolution.net/s2932-9.txt

I was able to reproduce the issue on a similar system (a vanilla Tyan
S2932, one SATA disc only) and it exhibits the exact same problem.

Thanks again.

Joao


