Re: 5.0.6 grinds to a complete halt

From: Barry Swane (bswane_at_rogers.com)
Date: 05/25/04

    Date: Tue, 25 May 2004 16:51:11 GMT
    
    

    Thanks so much for this, Michael. It's a big help to me to learn of your
    experience.
    Of course, I don't know for a fact that this is causing my problem-- but it
    does seem like the logical place to start.
    I haven't found any messages anywhere when our problems occur-- the machine
    just stops. Sometimes it was working the hard drive relatively hard (well,
    not really, but compared to other times, anyway), but other times there
    wasn't anything unusual going on at all-- near as I can tell.

    Barry

    "Michael Suddith" <mike@nbcisoftware.com> wrote in message
    news:8uKsc.32935$KE6.8273@newsread3.news.atl.earthlink.net...
    > I upgraded one of my arrays to the new firmware a few months ago. I did
    > not have the problems that you are describing, however. On my system the
    > RAID controller (an Adaptec 2200S) would give me an occasional message of
    > "Abort Time-out. Resetting Bus" followed by a message 1 second later
    > saying the bus had restarted. I got that message about once a day until I
    > upgraded the firmware. You cannot upgrade the firmware while the drives
    > are in a SCO box; Seagate's utility runs best on a Windows machine. They
    > do have a Linux version if you can get it to run. Upgrading the firmware
    > did erase the RAID configuration off of the drive, so I just set it back
    > up identical to before and told it to rebuild, and it did. I did not lose
    > any data on the drive; I just had to rebuild the array. I did not notice
    > any real difference on the machine between before and after the upgrade,
    > other than that the time-out message went away. The time-out usually
    > occurred early in the morning when at most 2 or 3 people were on the
    > system, and they never reported a problem to me about performance during
    > the time-out.
    >
    >
    > "Barry Swane" <bswane@rogers.com> wrote in message
    > news:d3e78b1d.0405250715.32460091@posting.google.com...
    > > Bela Lubkin <belal@sco.com> wrote in message news:<20040520080340.GS10272@sco.com>...
    > >
    > > I think we are getting darned close to the root of my problem now.
    > > I found the following on the Seagate web site (these drives are ST336607):
    > >
    > > -------------------------------------------------------------
    > > Ultra 320 Time-Out Firmware Upgrade
    > >
    > > The Ultra 320 firmware update applies to the following Seagate hard
    > > drives:
    > > Cheetah 10K.6: ST3146807; ST373307; ST336607
    > > Cheetah 15K.3: ST373453; ST336753; ST318453
    > >
    > > Problem description
    > >
    > > Some Seagate Cheetah 10K.6 hard drives with OEM firmware up to 0006
    > > and Cheetah 15K.3 hard drives with OEM firmware up to 0005 on Ultra
    > > 320 SCSI host adapters are experiencing time out issues when running
    > > RAID 0, 1 and 5 with some host adapters. This issue has been observed
    > > using U320 Adaptec and LSI SCSI controllers, but may not be limited to
    > > these host adapter manufacturers.
    > >
    > > Root cause
    > >
    > > The Cheetah 10K.6 and 15K.3 drives (models listed above) will
    > > sometimes hang due to an issue in the firmware when reading and
    > > writing at U320 packetized mode.
    > >
    > > Corrective Action
    > >
    > > Seagate has modified the firmware and added an additional register
    > > check while in U320 packetized mode thus preventing the time out issue
    > > due to system hang.
    > >
    > > Contact your host adapter manufacturer for the latest BIOS revision
    > > for your U320 SCSI controller.
    > >
    > > To obtain the unique firmware download certificate number for your
    > > hard drive, contact Seagate Technical Support via phone or by email at
    > > discsupport@seagate.com.
    > >
    > > Please have your drive part number (Example: 9U8006-001) and the
    > > current firmware level available when contacting Technical Support.
    > >
    > > For the 10K.6 models the new code will be OEM 0007
    > > For the 15K.3 models the new code will be OEM 0006
    > >
    > > Please back up any important files before upgrading the firmware.
    > > Seagate is not responsible for any data loss.
    > >
    > > --------------------------------------------------
    > >
    > > That last comment is a little chilling-- the implication being it
    > > might blow out the entire RAID 5 array?
    > >
    > > Has anybody else had to deal with this?
    > >
    > > Barry
    > >
    > >
    > >
    > > Bela Lubkin <belal@sco.com> wrote in message news:<20040520080340.GS10272@sco.com>...
    > >
    > > > Barry Swane wrote:
    > > >
    > > > > It appears I declared victory a little too early.
    > > >
    > > > > Killing the amirdmon process did indeed have salutary effects on
    > > > > performance. The customer stopped reporting noticeable slowness in
    > > > > system performance.
    > > > >
    > > > > I'm now inclined to theorize that Bela's suggestion is correct--
    > > > > that the disk (RAID 5) has stopped responding completely. Would that
    > > > > be consistent with the reported behavior? i.e., if you are in a
    > > > > shell, you can type characters, and they echo, and you can do a
    > > > > carriage return-- but nothing is ever executed?
    > > >
    > > > Perfectly consistent. OpenServer is very conservative about swapping;
    > > > it never pushes process pages out to swap unless it's out of memory.
    > > > On modern systems this generally means that swap is never touched.
    > > > Thus, any active process resides entirely in memory. Also, the kernel
    > > > itself is all hard-loaded in RAM -- none of it is pageable. If the
    > > > disk subsystem hangs, the kernel continues to function. Each
    > > > individual process continues to function until the first time it
    > > > tries to access the disk.
    > > >
    > > > For instance, the program that provides the login prompt (`getty`, for
    > > > console ttys) will continue to accept and echo characters. If you hit
    > > > return on a name, it goes to exec `login`, which involves disk access,
    > > > so you never get to the password prompt.
    > > >
    > > > If you're sitting at a shell prompt, you can type; you can run
    > > > internal commands like "echo foo"; but any attempt to run a binary
    > > > will hang. (Even if the binary is fully cached, its access time needs
    > > > to be updated on disk.)
    > > >
    > > > > Also, this seems major-league weird-- that the system can perform
    > > > > absolutely normally, all the time-- except once in a while it loses
    > > > > contact with the disk?
    > > >
    > > > It isn't particularly weird. What you're describing is a fairly
    > > > standard set of symptoms for a variety of conditions including SCSI
    > > > bus timing, parity or signal integrity problems; internal errors in a
    > > > disk drive; and so on. You might rightly expect a RAID controller to
    > > > be a bit more thorough about error recovery, but apparently this
    > > > particular one -- in this particular failure case, whatever it is --
    > > > isn't.
    > > >
    > > > You also mischaracterize the situation here. It _isn't_ performing
    > > > absolutely normally. It's running 6 times slower than older and
    > > > presumably much slower machines.
    > > >
    > > > But I bet the two symptoms are actually unrelated, and you have two
    > > > separate problems to solve. (1) complex application jobs run much
    > > > more slowly than expected; (2) the disk subsystem occasionally hangs.
    > > >
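
Bela's description (processes keep running until their first disk access) suggests one way to notice a hang: a monitor that is already loaded in memory can attempt a small synchronous write and treat a missed deadline as a sign that the disk subsystem has stopped answering. The following is a minimal sketch of that idea in Python, not something from the thread. The function name `disk_responds`, the 5-second default, and the availability of Python on the affected machine are all assumptions.

```python
import os
import tempfile
import threading


def disk_responds(directory, timeout_s=5.0):
    """Return True if a tiny write+fsync in `directory` finishes in time.

    Hypothetical helper: a False result means the write either failed
    or did not complete within `timeout_s` seconds.
    """
    done = threading.Event()
    result = {"ok": False}

    def _write_probe():
        try:
            fd, path = tempfile.mkstemp(prefix="probe-", dir=directory)
            try:
                os.write(fd, b"x")
                os.fsync(fd)  # ask the OS to push the data to the device
            finally:
                os.close(fd)
                os.unlink(path)
            result["ok"] = True
        except OSError:
            pass  # an I/O error also counts as "not responding"
        finally:
            done.set()

    # Daemon thread: it is never joined, because on a hung disk it may
    # stay blocked inside the write or fsync indefinitely.
    worker = threading.Thread(target=_write_probe, daemon=True)
    worker.start()
    finished = done.wait(timeout_s)
    return finished and result["ok"]
```

A probe like this would have to be started before the hang, since launching any new binary needs the disk. It could also only report the result somewhere that does not depend on that disk, such as a serial console or a network socket.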
    >
    >
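
The firmware thresholds in the Seagate notice quoted above can be written down as a small lookup. This is only an illustrative sketch: the function name `needs_u320_update` is made up, and it assumes the firmware level is reported as a purely numeric string such as "0006", which the notice implies but does not guarantee.

```python
# Highest affected OEM firmware level per model, as listed in the notice:
# Cheetah 10K.6 up to 0006, Cheetah 15K.3 up to 0005.
LAST_AFFECTED_FIRMWARE = {
    "ST3146807": 6, "ST373307": 6, "ST336607": 6,   # Cheetah 10K.6
    "ST373453": 5, "ST336753": 5, "ST318453": 5,    # Cheetah 15K.3
}


def needs_u320_update(model, firmware):
    """Return True if the notice's criteria mark this drive as affected.

    Assumes `firmware` is a numeric string; int() raises ValueError
    for anything else rather than guessing.
    """
    last_affected = LAST_AFFECTED_FIRMWARE.get(model.strip().upper())
    if last_affected is None:
        return False  # model is not covered by this notice
    return int(firmware) <= last_affected


if __name__ == "__main__":
    # The drives discussed in this thread are ST336607 (Cheetah 10K.6).
    for fw in ("0006", "0007"):
        print("ST336607", fw, needs_u320_update("ST336607", fw))
```

By the notice's numbers, an ST336607 at OEM 0006 is still affected and one at OEM 0007 already carries the fix.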

