Re: 5.0.6 grinds to a complete halt

From: Barry Swane (bswane_at_rogers.com)
Date: 05/25/04


Date: 25 May 2004 08:15:43 -0700

Bela Lubkin <belal@sco.com> wrote in message news:<20040520080340.GS10272@sco.com>...

I think we are getting darned close to the root of my problem now.
I found the following on the Seagate Web site (these drives are ST336607):

-------------------------------------------------------------
Ultra 320 Time-Out Firmware Upgrade
 
  The Ultra 320 firmware update applies to the following Seagate hard
drives:
Cheetah 10K.6: ST3146807; ST373307; ST336607
Cheetah 15K.3: ST373453; ST336753; ST318453

Problem description

Some Seagate Cheetah 10K.6 hard drives with OEM firmware up to 0006
and Cheetah 15K.3 hard drives with OEM firmware up to 0005 on Ultra
320 SCSI host adapters are experiencing time out issues when running
RAID 0, 1 and 5 with some host adapters. This issue has been observed
using U320 Adaptec and LSI SCSI controllers, but may not be limited to
these host adapter manufacturers.

Root cause

The Cheetah 10K.6 and 15K.3 drives (models listed above) will
sometimes hang due to an issue in the firmware when reading and
writing at U320 packetized mode.

Corrective Action

Seagate has modified the firmware and added an additional register
check while in U320 packetized mode thus preventing the time out issue
due to system hang.

 Contact your host adapter manufacturer for the latest BIOS revision
for your U320 SCSI controller.

To obtain the unique firmware download certificate number for your
hard drive, contact Seagate Technical Support via phone or by email at
discsupport@seagate.com.

 Please have your drive part number (Example: 9U8006-001) and the
current firmware level available when contacting Technical Support.

For the 10K.6 models the new code will be OEM 0007
For the 15K.3 models the new code will be OEM 0006

 Please back up any important files before upgrading the firmware.
Seagate is not responsible for any data loss.

 --------------------------------------------------
 
That last comment is a little chilling-- the implication being it
might blow out the entire RAID 5 hard drive?
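
For anyone wanting to check a batch of drives against the advisory, here is a rough sketch. The model lists and firmware levels are copied from the Seagate text above; the function name `needs_update` is my own, and I am assuming the firmware level is reported as a plain numeric string like "0006". How you actually read the level off the drive (controller BIOS, inquiry data, etc.) is not covered here.

```python
# Models and firmware levels transcribed from the Seagate advisory quoted above.
# "last_bad" is the highest affected OEM level, "fixed" is the corrected level.
AFFECTED = {
    "Cheetah 10K.6": {
        "models": {"ST3146807", "ST373307", "ST336607"},
        "last_bad": 6,
        "fixed": 7,
    },
    "Cheetah 15K.3": {
        "models": {"ST373453", "ST336753", "ST318453"},
        "last_bad": 5,
        "fixed": 6,
    },
}


def needs_update(model, firmware):
    """Return the fixed OEM level (e.g. "0007") if the drive is affected, else None.

    Assumes `firmware` is a numeric string such as "0006" (hypothetical input format).
    """
    level = int(firmware)
    for family in AFFECTED.values():
        if model.upper() in family["models"] and level <= family["last_bad"]:
            return "%04d" % family["fixed"]
    return None
```

So an ST336607 at level 0006 would come back as needing 0007, while the same drive already at 0007 would come back as None.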

Has anybody else had to deal with this?

Barry

 Bela Lubkin <belal@sco.com> wrote in message news:<20040520080340.GS10272@sco.com>...

> Barry Swane wrote:
>
> > It appears I declared victory a little too early.
> >
> > Killing the amirdmon process did indeed have salutary effects on the
> > performance. Customer stopped reporting noticeable slowness in system
> > performance.
> >
> > I'm now inclined to theorize that Bela's suggestion is correct-- that
> > the disk (RAID 5) has stopped responding completely. Would that be
> > consistent with the reported behavior? i.e., if you are in a shell,
> > you can type characters, and they echo, and you can do a carriage
> > return-- but nothing is ever executed?
>
> Perfectly consistent. OpenServer is very conservative about swapping;
> it never pushes process pages out to swap unless it's out of memory. On
> modern systems this generally means that swap is never touched. Thus,
> any active process resides entirely in memory. Also, the kernel itself
> is all hard-loaded in RAM -- none of it is pageable. If the disk
> subsystem hangs, the kernel continues to function. Each individual
> process continues to function until the first time it tries to access
> the disk.
>
> For instance, the program that provides the login prompt (`getty`, for
> console ttys) will continue to accept and echo characters. If you hit
> return on a name, it goes to exec `login`, which involves disk access,
> so you never get to the password prompt.
>
> If you're sitting at a shell prompt, you can type; you can run internal
> commands like "echo foo"; but any attempt to run a binary will hang.
> (Even if the binary is fully cached, its access time needs to be updated
> on disk.)
>
> > Also, this seems major-league weird-- that the system can perform
> > absolutely normally, all the time-- except once in a while it loses
> > contact with the disk?
>
> It isn't particularly weird. What you're describing is a fairly
> standard set of symptoms for a variety of conditions including SCSI bus
> timing, parity or signal integrity problems; internal errors in a disk
> drive; and so on. You might rightly expect a RAID controller to be a
> bit more thorough about error recovery, but apparently this particular
> one -- in this particular failure case, whatever it is -- isn't.
>
> You also mischaracterize the situation here. It _isn't_ performing
> absolutely normally. It's running 6 times slower than older and
> presumably much slower machines.
>
> But I bet the two symptoms are actually unrelated, and you have two
> separate problems to solve. (1) complex application jobs run much more
> slowly than expected; (2) the disk subsystem occasionally hangs.
>
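
Following Bela's point that a process keeps running until it first touches the disk, one way to catch hang (2) as it happens is a small probe that is already resident in memory and periodically forces a real disk write with a timeout. This is only a sketch: the function name `disk_responds` and the 10-second default are my own choices, and it assumes a Python interpreter is available on the box, which may well not be true of a stock 5.0.6 install.

```python
import os
import tempfile
import threading


def disk_responds(directory, timeout=10.0):
    """Return True if a small write + fsync in `directory` completes within `timeout` seconds.

    A False result means the write did not finish in time (or failed outright),
    which is what a hung disk subsystem would look like from user space.
    """
    done = threading.Event()

    def probe():
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            os.write(fd, b"probe")
            os.fsync(fd)  # push the data to the device, not just the buffer cache
        finally:
            os.close(fd)
            os.unlink(path)
        done.set()  # only reached if every step above succeeded

    # Run the I/O in a daemon thread so the caller can give up waiting
    # even if the thread itself is stuck inside a blocked system call.
    threading.Thread(target=probe, daemon=True).start()
    return done.wait(timeout)
```

Calling `disk_responds("/tmp")` from a loop that logs to a serial console or a remote host (not to the local disk) would give a timestamp for each hang, which could then be lined up against controller or drive events.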


