Re: Panic in 6.2-PRERELEASE with bge on amd64
- From: Sven Willenberger <sven@xxxxxxx>
- Date: Tue, 09 Jan 2007 09:37:05 -0500
On Tue, 2007-01-09 at 12:50 +1100, Bruce Evans wrote:
On Mon, 8 Jan 2007, Sven Willenberger wrote:
On Mon, 2007-01-08 at 16:06 +1100, Bruce Evans wrote:
On Sun, 7 Jan 2007, Sven Willenberger wrote:
The short and dirty of the dump:
...
--- trap 0xc, rip = 0xffffffff801d5f17, rsp = 0xffffffffb371ab50, rbp = 0xffffffffb371aba0 ---
bge_rxeof() at bge_rxeof+0x3b7
What is the instruction here?
I will do my best to ferret out the information you need. For the
bge_rxeof() at bge_rxeof+0x3b7 line, the instruction is:
0xffffffff801d5f17 <bge_rxeof+951>: mov %r15,0x28(%r14)
...
Looks like a null pointer panic anyway. I guess the instruction is
movl to/from 0x28(%reg) where %reg is a null pointer.
from the above lines, apparently %r14 is null then.
Yes. It's a bit surprising that the access is a write.
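[Editor's note] To illustrate why a store to 0x28(%r14) with %r14 == 0 points at a null structure pointer: the CPU adds the member's offset to the base register, so the faulting address is simply that offset. The sketch below uses a made-up layout (toy_hdr, toy_pkthdr and toy_buf are hypothetical names, not the real struct mbuf) and assumes an amd64-style ABI where 8-byte fields are 8-byte aligned. Which member of the real struct mbuf sits at 0x28 would have to be checked against sys/mbuf.h for this release.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header layout, NOT the real struct mbuf. Pointer-sized
 * fields are written as uint64_t to mimic amd64. */
struct toy_hdr {
    uint64_t next;     /* offset 0x00 */
    uint64_t nextpkt;  /* offset 0x08 */
    uint64_t data;     /* offset 0x10 */
    int32_t  len;      /* offset 0x18 */
    int32_t  flags;    /* offset 0x1c */
    int16_t  type;     /* offset 0x20, then padding up to 0x28 */
};

/* Hypothetical packet header that follows the toy header. */
struct toy_pkthdr {
    uint64_t rcvif;    /* first packet-header field */
    int32_t  len;
};

struct toy_buf {
    struct toy_hdr    hdr;
    struct toy_pkthdr pkthdr;
};

/* Address touched by "base->member": effective address = base + displacement. */
uintptr_t effective_address(uintptr_t base, size_t displacement)
{
    return base + displacement;
}

/* Offset of the first packet-header field inside the toy buffer. */
size_t toy_rcvif_offset(void)
{
    return offsetof(struct toy_buf, pkthdr) + offsetof(struct toy_pkthdr, rcvif);
}
```

With this layout, effective_address(0, toy_rcvif_offset()) is 0x28, which is the kind of small faulting address a trap 0xc on a null structure pointer produces.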
...
#8 0xffffffff801db818 in bge_intr (xsc=0x0) at /usr/src/sys/dev/bge/if_bge.c:2707
What is the statement here? It presumably follows a null pointer and only
the expression for the pointer is interesting. xsc is already null but
that is probably a bug in gdb, or the result of excessive optimization.
Compiling kernels with -O2 has little effect except to break debugging.
the block of code from if_bge.c:
2705 if (ifp->if_drv_flags & IFF_DRV_RUNNING) {
2706 /* Check RX return ring producer/consumer. */
2707 bge_rxeof(sc);
2708
2709 /* Check TX ring producer/consumer. */
2710 bge_txeof(sc);
2711 }
Oops. I should have asked for the statement in bge_rxeof().
#7 0xffffffff801d5f17 in bge_rxeof (sc=0xffffffff8836b000) at /usr/src/sys/dev/bge/if_bge.c:2528
2528 m->m_pkthdr.len = m->m_len = cur_rx->bge_len - ETHER_CRC_LEN;
(where m is defined as:
2449 struct mbuf *m = NULL;
)
By default -O2 is passed to CC (I don't use any custom make flags; the
only thing I define in my /etc/make.conf is CPUTYPE).
-O2 is unfortunately the default for COPTFLAGS for most arches in
sys/conf/kern.pre.mk. All of my machines and most FreeBSD cluster
machines override this default in /etc/make.conf.
With the override overridden for RELENG_6 amd64, gcc inlines bge_rxeof(),
so your environment must be a little different to get even the above
info. I think gdb can show the correct line numbers but not the call
frames (since there is no call). ddb and the kernel stack trace can
only show the call frames for actual calls.
With -O1, I couldn't find any instruction similar to the mov to the
null pointer + 0x28. 0x28 is a popular offset in mbufs.
If you have a suggestion for an /etc/make.conf line, I can recompile the
kernel accordingly assuming it still panics or locks up after the change
of interface noted below.
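[Editor's note] A minimal sketch of such an override, assuming the goal is only to replace the -O2 default with plain -O when building the kernel. COPTFLAGS is the variable named in this thread; the flag values shown are an assumption and were not given by anyone in the thread.

```make
# /etc/make.conf (sketch; flag values are an assumption, not from this thread)
# Build the kernel with -O instead of the -O2 default so that gdb
# line numbers and call frames are easier to trust.
COPTFLAGS= -O -pipe
```

The kernel has to be rebuilt and reinstalled for the change to take effect.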
The short of it is that this interface sees pretty much non-stop traffic
as this is a mailserver (final destination): mail is constantly being
delivered to it (direct disk access) and retrieved from it (remote
machine(s) with NFS-mounted mail spools). If a momentary down of the
interface is enough to completely panic the driver and then the kernel,
this hardly seems "robust" if, in fact, this is what is happening. So
the question arises as to what would be causing the down/up of the
interface; I could start looking at the cable, the switch it's connected
to and ... any other ideas? (I don't have watchdog enabled or anything
like that, for example).
I don't think down/up can occur in normal operation, since it takes ioctls
or a watchdog timeout to do it. Maybe some ioctls other than a full
down/up can cause problems... bge_init() is called for the following
ioctls:
- mtu changes
- some near down/up (possibly only these)
Suspend/resume and of course detach/attach do much the same things as
down/up.
BTW, I added some sysctls and found it annoying to have to do down/up
to make the sysctls take effect. Sysctls in several other NIC drivers
require the same, since doing a full reinitialization is easiest.
Since I am tuning using sysctls, I got used to doing down/up too much.
Similarly for the mtu ioctl. I think a full reinitialization is used
for mtu changes mainly in cases where the change switches on/off support
for jumbo buffers. Then there is a lot of buffer reallocation to be
done, and interfaces have to be stopped to ensure that the buffers
being deallocated are not in use, etc.
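[Editor's note] The stop, reallocate, restart sequence described above can be sketched with a toy model. Everything here (toy_softc, toy_stop, toy_init, toy_rxeof, the ring size) is hypothetical and is not taken from if_bge.c; it only shows why the receive path must not walk the ring while buffers are being replaced.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_RING_SIZE 4

/* Hypothetical driver state; names are illustrative, not from if_bge.c. */
struct toy_softc {
    int    running;                 /* analogous to a "driver running" flag */
    size_t buf_size;                /* size of each receive buffer */
    void  *rx_ring[TOY_RING_SIZE];  /* receive buffers, NULL when unallocated */
};

/* Stop the interface and release every receive buffer. */
void toy_stop(struct toy_softc *sc)
{
    sc->running = 0;                /* stop first so nobody uses the buffers */
    for (size_t i = 0; i < TOY_RING_SIZE; i++) {
        free(sc->rx_ring[i]);
        sc->rx_ring[i] = NULL;
    }
}

/* Full reinitialization: stop, reallocate for the new size, restart.
 * Returns 0 on success, -1 if an allocation fails (interface stays down). */
int toy_init(struct toy_softc *sc, size_t new_buf_size)
{
    toy_stop(sc);
    sc->buf_size = new_buf_size;
    for (size_t i = 0; i < TOY_RING_SIZE; i++) {
        sc->rx_ring[i] = malloc(new_buf_size);
        if (sc->rx_ring[i] == NULL) {
            toy_stop(sc);
            return -1;
        }
    }
    sc->running = 1;
    return 0;
}

/* Receive path: counts usable buffers, but refuses to walk the ring
 * unless the interface is marked running. */
size_t toy_rxeof(const struct toy_softc *sc)
{
    size_t usable = 0;

    if (!sc->running)
        return 0;
    for (size_t i = 0; i < TOY_RING_SIZE; i++)
        if (sc->rx_ring[i] != NULL)
            usable++;
    return usable;
}
```

In this model an mtu change would call toy_init() with the new buffer size. A real driver also needs locking so that an interrupt handler cannot run the receive path between the stop and the restart; the sketch leaves that out.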
Bruce
As this was connected to a gigE switch with mtu left at 1500, I suppose
it is possible that some mtu discovery/change may have been
happening on the switch, but that seems a bit out in left field. For now
I am using the fxp interface connected to the same switch to see if the
issue continues (the change of interface was driven by a hard lockup
yesterday where I could not even type anything on the term).
Sven