Re: Sol9 x86 4/03 generates a DMA state error during install

From: Bruce Adler (bruce.NxOxSxPxAxMx.adler_at_acm.org)
Date: 08/12/03


Date: Tue, 12 Aug 2003 12:49:46 GMT


"Dennis Clarke" <dclarke@blastwave.org> wrote in message news:Pine.GSO.4.53.0308111840310.3675@blastwave...
> ... From
> there is seems to be running as it prints out the Solaris 9 license message
> and then out comes :
>
> pcplusmp: NMIreceived

NMIs are usually associated with either a Machine Check or a memory bus
error. Does your system perhaps have ECC memory? Or perhaps, L3 cache
chips? If not, then maybe you've got an overheating or buggy CPU.

> WARNING: /pci@0,0/pci1000,1@1 (ncrs0): Unexpected DMA state : ACTIVE.
> dstat=c0<DMA-FIFO-empty,master-data-parity-error>

That's not supposed to happen.

It says that at a DMA buffer boundary (or at the conclusion of the whole
I/O request), the PCI Bus Mastering engine on the NCR chip detected a
parity error on the PCI bus. I don't think you can tell from that
which direction the bus was transferring (to or from system RAM), but
it probably doesn't matter.
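
For anyone who wants to check the decoding by hand, here is a minimal sketch in C of how a DSTAT value such as c0 maps onto flag names. It is not the ncrs driver's actual code. The bit layout is what I recall from the 53C8xx data manuals, and only the two names that appear in the warning above come from the real message; the other strings and the names dstat_bits and decode_dstat are made up for illustration.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/*
 * DSTAT bit layout as recalled from the 53C8xx data manuals (an
 * assumption; check the manual for your chip). Only the first two
 * names are taken from the warning message; the rest are hypothetical.
 */
struct dstat_bit {
    uint8_t mask;
    const char *name;
};

static const struct dstat_bit dstat_bits[] = {
    { 0x80, "DMA-FIFO-empty" },
    { 0x40, "master-data-parity-error" },
    { 0x20, "bus-fault" },
    { 0x10, "aborted" },
    { 0x08, "single-step-interrupt" },
    { 0x04, "SCRIPTS-interrupt" },
    { 0x01, "illegal-instruction" },
};

/*
 * Write a comma-separated list of the flags set in dstat into buf
 * (at most len bytes including the terminating NUL) and return buf.
 */
static char *decode_dstat(uint8_t dstat, char *buf, size_t len)
{
    size_t i;

    if (len == 0)
        return buf;
    buf[0] = '\0';
    for (i = 0; i < sizeof dstat_bits / sizeof dstat_bits[0]; i++) {
        if (dstat & dstat_bits[i].mask) {
            if (buf[0] != '\0')
                strncat(buf, ",", len - strlen(buf) - 1);
            strncat(buf, dstat_bits[i].name, len - strlen(buf) - 1);
        }
    }
    return buf;
}
```

Calling decode_dstat(0xc0, buf, sizeof buf) with a 128-byte buffer yields the same two flags shown in the warning: bit 7 (the FIFO is empty) and bit 6 (the parity error seen while the chip was bus master).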

If it's an add-in card, the first thing I would do is try reseating it,
or moving it to a different slot.

If it's a really ancient BIOS, check to see whether there are any
PCI bus mastering, or chip set, settings that have been tweaked and
untweak them. Or, maybe your battery died and the BIOS settings
saved in the CMOS somehow got set to some random value (if that's
possible).

> ... Then kaboom! black screen and the system reboots.

That's definitely never supposed to happen, no matter what kind of
DMA errors the NCR chip detects. And especially not once you've
got the whole kernel loaded and initialized.

Most of the time spontaneous reboots like that are caused either by
an overheating system (your CPU fans are running aren't they?) or
flaky or bad memory chips.

> ... I could also try my copy of Solaris 8 for intel.
>
> Thoughts?

Try the Solaris 7 version if you have it. I don't know whether you've
got a real hardware problem (bit rot, perhaps), or whether Sun managed
to break the driver support for older HBAs, but I do know that there
was a *MAJOR* change to the driver between the Solaris 7 and Solaris 8
versions. There are some key pieces of code that dropped out of the
driver during those changes (and the driver performance ended up
dropping by half).

BTW, just as a random data point, even though I designed and wrote
the original version of the driver, I no longer use the ncrs driver
on any of my Solaris 8 x86 systems. And a Solaris 2.4 system on which
I installed the 2.5.1 version of the driver ran for nearly 4 years
without any problems. Of course, the driver wasn't totally bug free,
it's just that most people wouldn't run into any of them in normal
operations, especially not during an install.

=============================================

There's a big long story behind why I have reasons to dislike the
Solaris 8 x86 version of the driver. Before the Solaris 8 release,
Sun had a x86 version of ncrs and a SPARC version of pretty much
the same driver (called glm on SPARC). But the two versions supported
different sets of NCR chips. Sun's x86 managers wanted to pick up
support for the newer faster and wider chips and all the bug fixes that
had been already added to the SPARC version. But although both
versions of the driver were based on the driver I originally wrote for
2.4 and were still functionally mostly identical, you really couldn't
diff the two versions and make sense out of the output or use it to do
an automated filemerge. The problem was that the SPARC bigot that
originally ported ncrs from x86 to SPARC had gratuitously changed all
the labels (from "ncrs" to "glm") and rearranged all the source code
so that any sort of merge other than a line by line manual merge was
impossible (which is a major headache on a 9,000 line device driver).
The changes were gratuitous because all he really needed to do was update
the driver to use the new PCI DDI functions. Instead he made pointless
changes to nearly every line of code in the program (including changing
the whitespace everywhere).

So after surveying the damage (and trying various semi-automated
scripts to attempt to make diff-ing possible), I suggested that rather than
suffering through an error-prone and time-consuming line-by-line manual
merge, Sun's x86 drivers group should instead drop the old ncrs,
backport the SPARC glm version to x86, and make certain that it
works okay on both x86 and SPARC systems (and just rename the x86 driver
binary back to ncrs). Once it was correctly backported and tested on
x86 with the newer, faster NCR chips, Sun could then merge back in
the (relatively small amount of) chip-specific code that the SPARC
people had removed from ncrs when they created glm (they removed the
support for the slower narrow-SCSI chips because Sun wasn't going to
sell a board that used those chips and leaving it in would have
slowed the driver a few nanoseconds while booting a SPARC system; they
also removed the code because it was x86 specific and back then
pissing on Solaris x86 whenever possible was an amusing sport for some
of Sun's SPARC bigots; Sun eventually replaced them with their current
crop of Linux/Cobalt bigots).

I estimated that an experienced Solaris driver writer could do the whole
job (including two rounds of light testing) in 2-4 man weeks (depending
mostly on how hard it was going to be to re-locate all the different
adapters the original driver had been tested on over the years; Sun's x86
hardware inventory was total chaos). By stipulating "an experienced
Solaris driver writer" I meant that either I or one of Sun's two x86
engineers (that had prior ncrs bug fixing experience) would have to do
the job.

But the (too smart) Sun x86 manager in charge of IHV drivers decided
that rather than make room in the schedule for one of the existing x86
engineers to do the job (he'd have to fight with the other too smart
Sun x86 managers over schedule priorities), he would hire a
completely inexperienced contractor to do the job instead. The
contractor had written "lots of BSD drivers" but nothing for either
Solaris x86 or Solaris SPARC. The rationalization for this of course
completely ignored the fact that no matter who updated the driver,
Sun's QA department was going to take up to 12 man-months to
bless the finished driver and they were already completely booked
doing other QA work. So the scheduling arguments were unavoidable and
using a contractor with no prior Solaris experience probably ensured
the added QA load would be on the high end. The developer bottle-neck
was temporary and probably very flexible but the QA bottle-neck was
permanent and mostly fixed.

(I don't know how or where Sun found the contractor they hired, but I
did notice that he was some sort of big-snot BSD expert who never missed
an opportunity to rant (to Sun engineers) about how much better BSD was
than Solaris; exactly the kind of contractor one wants to have working
on one of Sun's most important x86 HBA drivers).

Well, for reasons I've never understood, the contractor managed to
stretch the glm-to-ncrs job out to *MORE* than a year (and the
Sun manager in charge let it happen; maybe they were related or
something). And rather than doing the job in stages like I had
recommended, he tried to do all the changes in one big swell foop.

I don't remember all the gory details, but I think that at that
point, even though the contractor claimed both the new and
old adapters were working just fine on his test systems, what Sun
built from the delivered sources just wouldn't work at all on any x86
system any of us tested it on (or even on a SPARC-based telco cPCI box
some group was trying to port the "improved" merged driver to). I think
if I hadn't called attention to how long the project was taking, the
lies about the results, and hadn't insisted on an immediate code
freeze and a full code review of whatever code was available at that
point, the contractor might still be working on that project to
this date. I got the impression the Sun manager running the project
didn't want to call attention to the fact that the thing was nearly a
year late, and just wanted to go on believing that it was going to be
finished "real soon" (and that the Easter bunny is real).

My recollection is that the code review identified at least 5 major
bugs directly due to either a lack of understanding of x86 DMA versus
SPARC DVMA, and/or the faulty merge of the x86 and SPARC versions.
And that a second code review a month later noted that none of the
previous 5 bugs had been fixed and that there were major new bugs (I
think the end result was that the code workspace was rolled back to
a prior less-buggy version). I believe one of the 5 major bugs was that
certain chip initialization functions that were necessary for some of
the older chips might have been lost in the shuffle. My recollection
is that not all of the bugs were fixed before Solaris 8 FCS (which is
the primary reason the performance of the current ncrs sucks). But I
don't remember whether the 53c810 chip inits made it into FCS or not,
or even if Sun ever located an old 53c810 HBA with which to regression
test the new driver. After the second code review, I never again looked
at that driver.


