Re: nvidia-driver crashing kernel on head



On Saturday 17 July 2010 17:25:27 Christian Zander wrote:
On Sat, Jul 17, 2010 at 07:24:54AM -0700, David Naylor wrote:
(...)

These freezes and panics are due to the driver using a spin mutex
instead of a regular mutex for the per-file descriptor event_mtx. If
you patch the driver to change it to be a regular mutex I think that
should fix the problems.

Can you give an example? :) I don't mind creating a patch for all of
them if you can illustrate what needs to be changed.

See the attached patch
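
For illustration only (this is not the attached patch, and none of the
names below come from the NVIDIA driver): a minimal userland sketch of
the idea, assuming POSIX threads as a stand-in for the kernel API. A
pthread_mutex_t plays the role of a regular mutex, and the comments
note the mutex(9) calls that such a change would typically touch in
FreeBSD kernel code. The names event_state, event_post, poster and
run_event_demo are hypothetical.

```c
/*
 * Userland analogy only: pthread_mutex_t stands in for a regular
 * (MTX_DEF) FreeBSD kernel mutex. All identifiers here are
 * hypothetical and are not taken from the driver.
 */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4
#define NEVENTS  1000

struct event_state {
    pthread_mutex_t event_mtx;  /* kernel analogue: struct mtx */
    int pending;                /* events queued for this descriptor */
};

/* Queue one event while holding the regular mutex. */
static void event_post(struct event_state *st)
{
    /* kernel analogue: mtx_lock() instead of mtx_lock_spin() */
    pthread_mutex_lock(&st->event_mtx);
    st->pending++;
    /* kernel analogue: mtx_unlock() instead of mtx_unlock_spin() */
    pthread_mutex_unlock(&st->event_mtx);
}

/* Thread body: post NEVENTS events to the shared state. */
static void *poster(void *arg)
{
    struct event_state *st = arg;

    for (int i = 0; i < NEVENTS; i++)
        event_post(st);
    return NULL;
}

/* Post events from several threads and return the final count. */
int run_event_demo(void)
{
    struct event_state st;
    pthread_t tid[NTHREADS];
    int rc;

    st.pending = 0;
    /* kernel analogue: mtx_init(..., MTX_DEF) rather than MTX_SPIN */
    rc = pthread_mutex_init(&st.event_mtx, NULL);
    assert(rc == 0);

    for (int i = 0; i < NTHREADS; i++) {
        rc = pthread_create(&tid[i], NULL, poster, &st);
        assert(rc == 0);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    pthread_mutex_destroy(&st.event_mtx);
    return st.pending;
}
```

In kernel code the initialisation flag and the lock/unlock calls have
to change together, since spin mutexes and regular mutexes use
different lock and unlock functions in mutex(9).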

In order to use 195.36.15 it was necessary to use the patch Rene sent,
the suggestion from jhb previously to remove some locks, plus a bit
more. The patch that got it working on HEAD for me (specifically
r209633) is attached. With that patch I could start X, and run it for a
while, but performance was very poor, even in comparison with the stock
nv driver, and it crashed a couple times (although not nearly as bad as
previously).

So based on other suggestions I tried the newest release version at
nvidia, 256.35. Some of the same locking stuff was needed to patch it;
a patch for the port, which includes the locking patch, is also
attached. If you are running an amd64 system you'll have to type 'make
makesum' after applying this patch to the port. I'm not sure this
patch is complete, or what Alexey might want to do with the update,
but it does create an accurate plist, which means you can cleanly
deinstall/pkg_delete when you're done.

With 256.35 performance and stability have both been quite good,
comparable even to before the drama started. The only concern I
have at this point is that I'm periodically getting a strange sort of
"flash" popping up on my screen that I didn't get while I was running
the nv driver recently. It looks sort of like the default X background
(the tiny gray crosshatch) is popping through for just a split second.

I've been getting these messages on the console:

NVRM: Xid (0001:00): 16, Head 00000000 Count 000218d5
NVRM: Xid (0001:00): 8, Channel 00000000
NVRM: Xid (0001:00): 16, Head 00000000 Count 000218d6
NVRM: Xid (0001:00): 8, Channel 00000002

This is preceded by X locking hard. I cannot VT switch to a normal
console and sometimes the computer needs a hard reset (i.e. does not
respond to power button). It appears to only trigger when under heavy
load, e.g.:
make -C /usr/src -j8 buildworld

This seems to be messing with interrupts for other subsystems, as my
network drivers are less than reliable of late (watchdog timeouts).

The messages indicate that the NVIDIA driver hasn't received
interrupts from the GPU @ PCI:1:00.0 over a significant
period of time. If you are seeing similar problems with other
system components, there's a good chance that the above is
a symptom of some larger problem.

I think you are right. I'm not sure if this is a hardware problem or FreeBSD.
I reverted to a kernel from May 01 and the system is solid (~5 days). I'm
using the patched 256.35 driver without problem.

This happens with 195.36.15 unpatched and 256.35 patched.

I have not checked if booting with WITNESS enabled works.

Regards

* David Naylor <naylor.b.david@xxxxxxxxx>
* 0xFF6916B2

Attachment: signature.asc
Description: This is a digitally signed message part.


