Re: 5.0.5 - access to a filesystem hangs occasionally

From: Mike Brown (mike_at_tkg.ca)
Date: 06/05/03


Date: Thu, 05 Jun 2003 14:29:04 GMT


"Stephen M. Dunn" wrote:
>
> One of my clients is having a problem I've never seen before.
> They're running 5.0.5 Enterprise on a Compaq server; it's been
> reliable for a couple of years. I recently installed the latest
> few patches - oss640a, oss646a, oss647a (which really shouldn't
> be necessary, as their disk subsystem is a RAID array rather than
> individual SCSI disks), oss650a - that they didn't already have,
> and the problems have been happening since then.
>
> Twice since then, the system has become unusable. The first
> time, they couldn't run shutdown, so they hit the power switch and
> couldn't provide me a good description, but today it happened again,
> and I was able to go in via ssh. They can also telnet in, and they
> could log in on the console (and they did, screen after screen, until
> every screen was locked up trying to run a frozen app). Every
> application which was accessing the /u filesystem was hung and could
> not be killed, even -9. Each of the following commands hung my
> session; I had to kill my ssh client and connect again:
>
> cd /u
> ls -l / (but ls / worked)
> df
> nuc stop (hung trying to unmount the /.NetWare filesystem)
> any command which tried to access any file under /u
>
> I also noticed that last night's backup was hung and unkillable.
> visionfs stop didn't hang, but it was unable to stop the visionfs
> processes, and they could not be killed; the main share that they
> use is on /u.
>
> Any command I tried that did not have anything to do with /u ran
> without problems. The CPU was basically 100% idle according to cpuhog.
> root, boot, u, and swap are all in the same partition on the array,
> so access to the array itself is not the problem. The data collected
> by sar every 20 minutes looks normal up to the point where the users
> got hung, and after that it looks like a normal system that's idle.
>
> The first time, they let it start up automatically, so fsck would
> have run in quick mode (/dev/u is an HTFS filesystem). A few days
> later, when they told me they were having problems with some of their
> data files (which turned out to contain corrupt data, probably due to
> the previous hang), I had them reboot to single-user mode and run fsck
> -ofull; it reported no problems. Today, fsck -ofull is reporting a
> number of problems that might be expected given that they could not
> shut down cleanly.
>
> Between the first and second hangs, I updated the Compaq EFS to
> the latest version and ran Compaq's array diagnostics, which reported
> that the drive array was working properly. I have been unable to
> find any unusual messages in syslog or /usr/adm/messages.
>
> The server is powered from a UPS, though they tell me that the UPS
> is in need of repair and that the server crashed due to a power failure
> roughly a week before the first hang. This is certainly a possible
> cause, and they are aware that they need to get the UPS fixed ASAP.
>
> I'm thinking of rolling back the patches, particularly oss647a,
> since the only system configuration change that's taken place between
> when the system was reliable and when it first hung was the
> installation of the patches. Both hangs, as far as they can tell,
> happened after the system had been running for a week or so, so I've
> suggested that they might reboot a couple of times a week to see if
> that keeps the problems at bay.
>
> Any other ideas? The system is going to be replaced in a month or
> two with a newer box running 5.0.7, but we need to keep the old box
> stable until then ...
> --
> Stephen M. Dunn <stephen@stevedunn.ca>
> >>>----------------> http://www.stevedunn.ca/ <----------------<<<
> ------------------------------------------------------------------
> Say hi to my cat -- http://www.stevedunn.ca/photos/toby/

Have you used the Compaq online utility to check the hard drives? The RAID
controller will hide as many problems as it can. Also, some
revs of the EFS had problems; I would guess that on an older Compaq you
should be running EFS 5.48.

Mike

-- 
Michael Brown
The Kingsway Group

