Re: 5.0.6 grinds to a complete halt
From: Barry Swane (bswane_at_rogers.com)
Date: 05/25/04
- Previous message: Michael Suddith: "Re: 5.0.6 grinds to a complete halt"
- In reply to: Michael Suddith: "Re: 5.0.6 grinds to a complete halt"
- Next in thread: Bela Lubkin: "Re: 5.0.6 grinds to a complete halt"
Date: Tue, 25 May 2004 16:51:11 GMT
Thanks so much for this, Michael. It's a big help to me, to learn of your
experience.
Of course, I don't know for a fact that this is causing my problem-- but it
does seem like the logical place to start.
I haven't found any messages logged anywhere when our problems occur-- the machine just stops.
Sometimes, it was working the hard drive relatively hard (well, not really,
but compared to other times, anyway) but other times, there wasn't anything
unusual going on at all-- near as I can tell.
Barry
"Michael Suddith" <mike@nbcisoftware.com> wrote in message
news:8uKsc.32935$KE6.8273@newsread3.news.atl.earthlink.net...
> I upgraded one of my arrays to the new firmware a few months ago. I did not
> have the problems that you are describing, however. On my system the RAID
> controller (an Adaptec 2200S) would give me an occasional message of "Abort
> Time-out. Resetting Bus" followed by a message 1 second later saying bus
> restarted. I got that message about once a day until I upgraded the
> firmware. You cannot upgrade the firmware while the drives are in a SCO
> box; Seagate's utility runs best on a Windows machine. They do have a Linux
> version if you can get it to run. Upgrading the firmware did erase the RAID
> configuration off of the drive, so I just set it back up identical to before
> and told it to rebuild, and it did. I did not lose any data on the drive; I
> just had to rebuild the array. I did not notice any real difference on the
> machine from before and after the upgrade other than that the time-out message
> went away. The time-out usually occurred early in the morning when at most
> 2 or 3 people were on the system, and they never reported a problem to me
> about performance during the time-out.
>
>
> "Barry Swane" <bswane@rogers.com> wrote in message
> news:d3e78b1d.0405250715.32460091@posting.google.com...
> > Bela Lubkin <belal@sco.com> wrote in message
> > news:<20040520080340.GS10272@sco.com>...
> >
> > I think we are getting darned close to the root of my problem now.
> > I found the following on Seagate Web site:(these drives are ST336607)
> >
> > -------------------------------------------------------------
> > Ultra 320 Time-Out Firmware Upgrade
> >
> > The Ultra 320 firmware update applies to the following Seagate hard
> > drives:
> > Cheetah 10K.6: ST3146807; ST373307; ST336607
> > Cheetah 15K.3: ST373453; ST336753; ST318453
> > Problem description
> >
> > Some Seagate Cheetah 10K.6 hard drives with OEM firmware up to 0006
> > and Cheetah 15K.3 hard drives with OEM firmware up to 0005 on Ultra
> > 320 SCSI host adapters are experiencing time out issues when running
> > RAID 0, 1 and 5 with some host adapters. This issue has been observed
> > using U320 Adaptec and LSI SCSI controllers, but may not be limited to
> > these host adapter manufacturers.
> >
> > Root cause
> >
> > The Cheetah 10K.6 and 15K.3 drives (models listed above) will
> > sometimes hang due to an issue in the firmware when reading and
> > writing at U320 packetized mode.
> >
> > Corrective Action
> >
> > Seagate has modified the firmware and added an additional register
> > check while in U320 packetized mode thus preventing the time out issue
> > due to system hang.
> >
> > Contact your host adapter manufacturer for the latest BIOS revision
> > for your U320 SCSI controller.
> >
> > To obtain the unique firmware download certificate number for your
> > hard drive, contact Seagate Technical Support via phone or by email at
> > discsupport@seagate.com.
> >
> > Please have your drive part number (Example: 9U8006-001) and the
> > current firmware level available when contacting Technical Support.
> >
> > For the 10K.6 models the new code will be OEM 0007
> > For the 15K.3 models the new code will be OEM 0006
> >
> > Please backup any important files before upgrading the firmware.
> > Seagate is not responsible for any data loss.
> >
> > --------------------------------------------------
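
As a rough illustration, the applicability rules in the advisory above can be written down as a small lookup. This is only a sketch in Python: the function name `needs_update` and the table layout are made up here, and it assumes the drive reports its firmware level as a numeric string such as "0006". It checks model and firmware level only; the advisory also ties the problem to U320 host adapters running RAID 0, 1 or 5, which this does not look at.

```python
# Models and firmware levels transcribed from the Seagate advisory quoted above.
# "last_affected" is the highest OEM firmware level the advisory lists as affected;
# "fixed" is the level the advisory says the new code will carry.
AFFECTED = {
    "Cheetah 10K.6": {
        "models": {"ST3146807", "ST373307", "ST336607"},
        "last_affected": 6,
        "fixed": 7,
    },
    "Cheetah 15K.3": {
        "models": {"ST373453", "ST336753", "ST318453"},
        "last_affected": 5,
        "fixed": 6,
    },
}


def needs_update(model, firmware):
    """Return (family, fixed_level) if the advisory covers this drive, else None.

    Assumption: `firmware` is a numeric string like "0006".
    """
    model = model.strip().upper()
    level = int(firmware)
    for family, info in AFFECTED.items():
        if model in info["models"] and level <= info["last_affected"]:
            return family, "%04d" % info["fixed"]
    return None
```

For the ST336607 drives mentioned in this thread, `needs_update("ST336607", "0006")` would flag the drive, while the same model at level "0007" would not.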
> >
> > That last comment is a little chilling-- the implication being it
> > might blow out the entire RAID 5 hard drive?
> >
> > Has anybody else had to deal with this?
> >
> > Barry
> >
> >
> >
> > Bela Lubkin <belal@sco.com> wrote in message
> > news:<20040520080340.GS10272@sco.com>...
> >
> > > Barry Swane wrote:
> > >
> > > > It appears I declared victory a little too early.
> > >
> > > > > Killing the amirdmon process did indeed have salutary effects on the
> > > > > performance. Customer stopped reporting noticeable slowness in system
> > > > > performance.
> > > > >
> > > > > I'm now inclined to theorize that Bela's suggestion is correct-- that
> > > > the disk (RAID 5) has stopped responding completely. Would that be
> > > > consistent with the reported behavior? i.e., if you are in a shell,
> > > > you can type characters, and they echo, and you can do a carriage
> > > > return-- but nothing is ever executed?
> > >
> > > Perfectly consistent. OpenServer is very conservative about swapping;
> > > > it never pushes process pages out to swap unless it's out of memory. On
> > > > modern systems this generally means that swap is never touched. Thus,
> > > > any active process resides entirely in memory. Also, the kernel itself
> > > > is all hard-loaded in RAM -- none of it is pageable. If the disk
> > > subsystem hangs, the kernel continues to function. Each individual
> > > process continues to function until the first time it tries to access
> > > the disk.
> > >
> > > For instance, the program that provides the login prompt (`getty`, for
> > > console ttys) will continue to accept and echo characters. If you hit
> > > return on a name, it goes to exec `login`, which involves disk access,
> > > so you never get to the password prompt.
> > >
> > > > If you're sitting at a shell prompt, you can type; you can run internal
> > > > commands like "echo foo"; but any attempt to run a binary will hang.
> > > > (Even if the binary is fully cached, its access time needs to be updated
> > > > on disk.)
> > >
> > > > Also, this seems major-league weird-- that the system can perform
> > > > absolutely normally, all the time-- except once in a while it loses
> > > > contact with the disk?
> > >
> > > It isn't particularly weird. What you're describing is a fairly
> > > > standard set of symptoms for a variety of conditions including SCSI bus
> > > timing, parity or signal integrity problems; internal errors in a disk
> > > drive; and so on. You might rightly expect a RAID controller to be a
> > > bit more thorough about error recovery, but apparently this particular
> > > one -- in this particular failure case, whatever it is -- isn't.
> > >
> > > > You also mischaracterize the situation here. It _isn't_ performing
> > > absolutely normally. It's running 6 times slower than older and
> > > presumably much slower machines.
> > >
> > > But I bet the two symptoms are actually unrelated, and you have two
> > > > separate problems to solve. (1) complex application jobs run much more
> > > slowly than expected; (2) the disk subsystem occasionally hangs.
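
One way to catch the kind of hang Bela describes above (processes keep running until their first disk access) is a small watchdog that forces a disk write in a helper thread and waits for it with a time limit. The following is a minimal sketch in Python, assuming a machine where Python 3 is available, which is not a given on OpenServer 5.0.6. `probe_disk` and its return values are hypothetical names chosen for this example, not an existing tool.

```python
import os
import tempfile
import threading


def _touch_disk(directory, result, done):
    """Create, fsync and delete a small file so a write is pushed to the disk subsystem."""
    try:
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            os.write(fd, b"probe")
            os.fsync(fd)  # blocks until the write is acknowledged
        finally:
            os.close(fd)
            os.unlink(path)
        result["status"] = "ok"
    except OSError:
        # Ordinary failure (missing directory, permissions, disk full), not a hang.
        result["status"] = "error"
    finally:
        done.set()


def probe_disk(directory, timeout=10.0):
    """Return "ok", "error", or "hung" if the write did not finish within `timeout` seconds."""
    result = {}
    done = threading.Event()
    # Daemon thread: if the I/O never returns, the caller can still carry on.
    worker = threading.Thread(
        target=_touch_disk, args=(directory, result, done), daemon=True
    )
    worker.start()
    if not done.wait(timeout):
        return "hung"
    return result["status"]
```

A caller might run `probe_disk("/some/dir")` once a minute and note the time whenever it returns "hung". The report should go to the console or over the network rather than to a local log file, since a write to the hung disk would block as well.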
> > >
>
>