Re: nvidia-driver crashing kernel on head



On Saturday 17 July 2010 17:25:27 Christian Zander wrote:
On Sat, Jul 17, 2010 at 07:24:54AM -0700, David Naylor wrote:
(...)

These freezes and panics are due to the driver using a spin mutex
instead of a regular mutex for the per-file-descriptor event_mtx. If
you patch the driver to change it to a regular mutex, I think that
should fix the problems.

Can you give an example? :) I don't mind creating a patch for all of
them if you can illustrate what needs to be changed.

See the attached patch.

To get 195.36.15 working, I needed the patch Rene sent, jhb's earlier
suggestion to remove some locks, and a bit more. The patch that got it
working on HEAD for me (specifically r209633) is attached. With that
patch I could start X and run it for a while, but performance was very
poor, even compared with the stock nv driver, and it crashed a couple
of times (although not nearly as badly as before).

So, based on other suggestions, I tried NVIDIA's newest release,
256.35. It needed some of the same locking changes; a patch for the
port that includes the locking patch is also attached. If you are
running an amd64 system, you'll have to run 'make makesum' after
applying this patch to the port. I'm not sure this patch is complete,
or what Alexey might want to do with the update, but it does create an
accurate plist, which means you can cleanly deinstall/pkg_delete when
you're done.

With 256.35, performance and stability have both been quite good,
comparable even to before the drama started. The only concern I have
at this point is a strange sort of "flash" that periodically appears
on my screen, which I didn't see while running the nv driver recently.
It looks as if the default X background (the tiny gray crosshatch)
shows through for a split second.

I've been getting these messages on the console:

NVRM: Xid (0001:00): 16, Head 00000000 Count 000218d5
NVRM: Xid (0001:00): 8, Channel 00000000
NVRM: Xid (0001:00): 16, Head 00000000 Count 000218d6
NVRM: Xid (0001:00): 8, Channel 00000002

This is preceded by X locking hard. I cannot VT-switch to a normal
console, and sometimes the computer needs a hard reset (i.e. it does
not respond to the power button). It appears to trigger only under
heavy load, e.g.:

make -C /usr/src -j8 buildworld

This also seems to be interfering with interrupts for other
subsystems, as my network drivers have been less than reliable lately
(watchdog timeouts).

The messages indicate that the NVIDIA driver hasn't received
interrupts from the GPU at PCI:1:00.0 over a significant
period of time. If you are seeing similar problems with other
system components, there's a good chance that the above is
a symptom of some larger problem.
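To check whether all of these events come from the same GPU, the Xid lines can be tallied per device. Below is a minimal Python sketch, assuming the message format quoted in this thread (`NVRM: Xid (0001:00): 16, ...`) and assuming the two fields in parentheses are the PCI bus and device numbers printed in hex, which matches the reading of "(0001:00)" as PCI:1:00.0 above. The function names `parse_xid_lines` and `tally_xids` are hypothetical, and the sketch makes no attempt to interpret the Head/Channel/Count fields.

```python
import re
from collections import Counter

# Matches console lines such as:
#   NVRM: Xid (0001:00): 16, Head 00000000 Count 000218d5
# ASSUMPTION: the two fields in parentheses are the PCI bus and device
# numbers in hex; everything after "<code>, " is kept verbatim, uninterpreted.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<bus>[0-9a-fA-F]+):(?P<dev>[0-9a-fA-F]+)\): "
    r"(?P<xid>\d+), (?P<detail>.*)"
)


def parse_xid_lines(lines):
    """Yield ((bus, device), xid_code, detail) for each NVRM Xid line."""
    for line in lines:
        m = XID_RE.search(line)
        if m is None:
            continue  # not an Xid message, skip it
        location = (int(m["bus"], 16), int(m["dev"], 16))
        yield location, int(m["xid"]), m["detail"]


def tally_xids(lines):
    """Count occurrences of each ((bus, device), xid_code) pair."""
    return Counter((loc, xid) for loc, xid, _ in parse_xid_lines(lines))
```

Any iterable of lines works as input, for example an open log file, assuming the console messages end up in /var/log/messages. Non-matching lines are skipped rather than treated as errors, so the whole log can be passed in unfiltered.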

I think you are right. I'm not sure whether this is a hardware problem
or a FreeBSD one. I reverted to a kernel from May 1 and the system has
been solid (~5 days). I'm using the patched 256.35 driver without
problems.

This happens with 195.36.15 unpatched and 256.35 patched.

I have not checked if booting with WITNESS enabled works.

Regards

* David Naylor <naylor.b.david@xxxxxxxxx>
* 0xFF6916B2
