Re: IBM bets 2.9 Billion on Linux for semiconductor manufacture

From: Bruce Adler (bruce.NxOxSxPxAxMx.adler_at_acm.org)
Date: 06/30/03


Date: Mon, 30 Jun 2003 09:16:19 GMT


"James smyth" <starlink101@yahoo.co.uk> wrote in message news:k2QLa.10947$aB3.79107135@news-text.cableinet.net...
>
> Let's face it, when your dealing with sums of money this large you need the
> reliability that solaris simply can't provide.

What a load of nonsense.

Six or seven years ago, when I was doing consulting work at Sun, because
of a meltdown problem I was asked to work on, I first found out that Intel
used Solaris x86 2.5 to automate and manage all their Pentium production
plants. Intel called that system something like "MCS" (I've forgotten
what that stood for so don't bother asking).

I don't have any of Intel's financial data from that period, but I wouldn't
be surprised if Intel was spending $3 billion per year, for several years
running, building Pentium production lines all over the world. And AFAIK
each of those production lines was managed by a network of Solaris x86
systems. Of course, every individual piece of equipment at every step on
the line had its own embedded controller and/or proprietary computer, but
the whole production line and the whole plant was tied together with Solaris
x86 systems. In other words, Sx86 was the backbone of a Pentium plant.

Put another way, Linux is still seven years and tens of billions of dollars
behind Solaris x86 when it comes to reliability and suitability to very
large business ventures.

I heard that Intel had considered a whole bunch of different options and
decided that Sx86 was the most reliable and most capable system.

I was also told that downtime on a single Pentium production line would
cost Intel more than $1 million per hour in lost revenue. So they clearly
must have done their homework before choosing Solaris x86.

I was also told by Sun that if Intel ever requested any help from Sun
on their Pentium production system, that it was automatically a "meltdown"
that had higher priority than anything else for everyone at Sun (assuming
Sprint or NASA or the US Army didn't also report a meltdown at
the same time with their cellular voicemail, space station, or main battle
tank Sx86 systems).

AFAIK, there was only ONE time that Intel had to ask Sun for help to
resolve (what they thought was) a Solaris x86 problem. When that
happened, I was the person that caught the problem. Of course
a couple of other Sun x86 people were looking over my shoulder every step
of the way. And, the only reason there weren't a ton of Sun people handling
the meltdown is that Intel didn't want any more people than necessary
to know the details about how their production lines worked. Intel
treated their Solaris x86 based production management system like it was
a secret weapon they didn't want their competitors (or maybe Microsoft)
to know anything about. (I think I was the only person at Sun that actually
got to read Intel's paper that explained the technical details of how their
systems actually worked and were interconnected and all the fallback
and backup procedures and policies).

Basically, Intel used redundant systems, and mirrored disks, and clusters,
and watchdogs, and hot spares and every method you can think of to ensure
that a problem should never come up that could ever possibly cause them to
have to halt one of their production lines due to a failure of one of the
Sx86 systems.

After several years of experience with their production systems they were
very happy with Solaris x86 and said that the only problems they ever
saw were hardware related.

And when a problem did happen, they had it designed such that they could
pull a single system or a full set of systems and replace them in less
than a minute without stopping production. And in order to avoid
such surprises, they continuously monitored everything and kept
reams of data on reliability and failures and tracked every single part
of every single system (and all the software) from day one until they'd
throw a dead system on the rubbish heap. They used that data to try to
detect when a system's hardware was getting marginal and would repair
or replace it before it actually failed.

Intel thought they were having a rash of mysterious crashes on a set of
systems at one of their plants. I was given a couple of kernel crash dumps
and various details about the problem (including the fact that most of the
time the systems crashed and rebooted without generating a kernel dump).
Basically, Intel said they were getting random crashes about once a week
and felt certain it wasn't a hardware problem. The crashes weren't
(yet) affecting production but they still wanted the problem diagnosed and
fixed as quickly as possible.

I gave them an immediate response that given all the symptoms they
described and given that both crash dumps they'd given me looked hosed,
it looked to me like bad memory or an overheating motherboard.

Intel responded that they continuously monitored the temperature on all
their hardware and had ruled that out. And they said they had already
considered bad memory. Each time a system failed they'd pull it out
of production, run it through all their tests, and then eventually return
it to production when the set of (about) thirty systems it belonged to
was next due for maintenance (about twice a week). More than once they
had even pulled a full set of thirty systems offline and run all the
memory in each of them through a hardware memory checker, several times
over. But no matter what they did they never found any hardware problems and
always returned the newly verified system to its original set of
30 production systems. They had been seeing failures about twice a week
for six weeks before they had finally convinced themselves that it
couldn't possibly be a hardware problem and that they needed Sun's
help.

They seemed so certain about their hardware and made it sound like
it was happening so frequently on so many systems that there just
had to be a software explanation for the failures (since that was
the only thing that was shared across so many different systems).
That seemed semi-reasonable, except that when I asked about crash dumps
from other locations (Intel had a dozen or more Pentium plants around
the world) Intel said that so far the problem was only showing up at
a single plant. That seemed curious to me, but it still didn't rule
out a possible kernel problem.

So I went back to the dumps and the sources but 10 or 15 hours later
decided it really had to be bad memory. In my mind there wasn't any
other option. But that wasn't going to be sufficient to satisfy Intel.

So I didn't tell Intel why I wanted it, but I then asked Intel for all
the data about all the failing systems and every crash they got from
that point on (they hadn't saved most of the previous ones because, before
returning a failing system to service, they reloaded the root disk
from a master copy). I wanted to know *ALL* the dates and times of each
failure and *ALL* the model and serial numbers of *ALL* the pieces of
*ALL* the failing systems. I also wanted to know everything that they
had ever done to a failing system during its complete life history.
And finally I asked whether they were using ECC memory or non-ECC memory.

I thought there might be some possibility that Intel had a single batch
of bad memory chips which were floating around between different
motherboards. Perhaps they just didn't notice that the reason they were
seeing so many systems fail randomly was that the problem was following
the memory chips.

Intel told me that wasn't possible (they never swapped pieces
of hardware amongst different systems) but they would get me all the
data anyway since there might be something else hidden in it that
might be of use to me.
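
To illustrate the idea behind that request: with the dates and serial
numbers in hand, the first thing to check is whether the failures pile up
on one system or one memory module. The following is only a sketch of that
kind of tally, not anything Intel or Sun actually ran; the record layout,
the serial numbers, and the function names are all made up for the example.

```python
from collections import Counter

# Hypothetical failure log: (day of failure, system serial, memory serial).
# Every value here is invented for illustration.
failures = [
    (1, "SYS-07", "MEM-A"),
    (4, "SYS-07", "MEM-A"),
    (9, "SYS-07", "MEM-A"),
    (11, "SYS-22", "MEM-K"),
    (15, "SYS-07", "MEM-A"),
]


def tally_failures(records):
    """Count failures per system serial and per memory-module serial."""
    by_system = Counter(system for _day, system, _memory in records)
    by_memory = Counter(memory for _day, _system, memory in records)
    return by_system, by_memory


def repeat_offenders(counts, threshold=2):
    """Return serials that failed at least `threshold` times, worst first."""
    return [(serial, n) for serial, n in counts.most_common() if n >= threshold]


by_system, by_memory = tally_failures(failures)
print("systems:", repeat_offenders(by_system))
print("memory: ", repeat_offenders(by_memory))
```

If one serial accounts for nearly all the failures, the "many different
systems are crashing" picture falls apart, and the shared-software
explanation with it.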

Two days later, Intel sent me a message through Sun management that
said basically, "never mind". When I asked how come, Intel was embarrassed
to admit that they hadn't noticed that (A) it was a single bad system
that was causing all their problems and (B) they weren't using ECC
memory, and (C) they could reproduce the failure by warming the system's
memory chips by a few degrees and (D) their hardware memory checker was a
piece of ***.

