5.0.5 - access to a filesystem hangs occasionally

From: Stephen M. Dunn (stephen_at_stevedunn.ca)
Date: 06/03/03


Date: Tue, 3 Jun 2003 17:14:54 GMT


   One of my clients is having a problem I've never seen before.
They're running 5.0.5 Enterprise on a Compaq server; it's been
reliable for a couple of years. I recently installed the latest
few patches - oss640a, oss646a, oss647a (which really shouldn't
be necessary, as their disk subsystem is a RAID array rather than
individual SCSI disks), oss650a - that they didn't already have,
and the problems have been happening since then.

   Twice since then, the system has become unusable. The first
time, they couldn't run shutdown, so they hit the power switch and
couldn't provide me a good description, but today it happened again,
and I was able to go in via ssh. They could also telnet in, and they
could log in on the console (and they did, screen after screen, until
every screen was locked up trying to run a frozen app). Every
application which was accessing the /u filesystem was hung and could
not be killed, even with kill -9. Each of the following commands hung my
session; I had to kill my ssh client and connect again:

cd /u
ls -l / (but ls / worked)
df
nuc stop (hung trying to unmount the /.NetWare filesystem)
any command which tried to access any file under /u
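
   Since every one of those commands took my login session down with
it, a safer way to test a suspect filesystem is to do the access in a
throwaway child process and give up on it after a timeout. What
follows is only a rough sketch, assuming a Python 3 interpreter is
available on the box (nothing above says one is installed); probe()
and the 5-second timeout are names and values made up for
illustration. If the child is stuck inside the kernel, the kill is
best effort only, so the sketch deliberately does not wait for the
child afterwards.

```python
import subprocess
import sys


def probe(path, timeout=5.0):
    """Stat `path` in a child process; return 'ok', 'error', or 'hung'.

    Hypothetical helper: the stat runs in a separate process so that a
    blocked filesystem stalls the child instead of the calling session.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        rc = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked in the kernel may ignore
        # the signal, so do not wait for it again after this.
        child.kill()
        return "hung"
    return "ok" if rc == 0 else "error"


if __name__ == "__main__":
    # Paths taken from the report above; adjust for the system at hand.
    for mount_point in ("/", "/u"):
        print(mount_point, probe(mount_point))
```

   A result of "hung" for /u alongside "ok" for / would match what I
saw by hand, without losing the session each time.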

   I also noticed that last night's backup was hung and unkillable.
visionfs stop didn't hang, but it was unable to stop the visionfs
processes, and they could not be killed; the main share that they
use is on /u.

   Any command I tried that did not have anything to do with /u ran
without problems. The CPU was basically 100% idle according to cpuhog.
root, boot, u, and swap are all in the same partition on the array,
so access to the array itself is not the problem. The data collected
by sar every 20 minutes looks normal up to the point where the users
got hung, and after that it looks like a normal system that's idle.

   The first time, they let it start up automatically, so fsck would
have run in quick mode (/dev/u is an HTFS filesystem). A few days
later, when they told me they were having problems with some of their
data files (which turned out to contain corrupt data, probably due to
the previous hang), I had them reboot to single-user mode and run fsck
-ofull; it reported no problems. Today, fsck -ofull is reporting a
number of problems that might be expected given that they could not
shut down cleanly.

   Between the first and second hangs, I updated the Compaq EFS to
the latest version and ran Compaq's array diagnostics, which reported
that the drive array was working properly. I have been unable to
find any unusual messages in syslog or /usr/adm/messages.

   The server is powered from a UPS, though they tell me that the UPS
is in need of repair and that the server crashed due to a power failure
roughly a week before the first hang. This is certainly a possible
cause, and they are aware that they need to get the UPS fixed ASAP.

   I'm thinking of rolling back the patches, particularly oss647a,
since the only system configuration change that's taken place between
when the system was reliable and when it first hung was the
installation of the patches. Both hangs, as far as they can tell,
happened after the system had been running for a week or so, so I've
suggested that they might reboot a couple of times a week to see if
that keeps the problems at bay.

   Any other ideas? The system is going to be replaced in a month or
two with a newer box running 5.0.7, but we need to keep the old box
stable until then ...

-- 
Stephen M. Dunn                             <stephen@stevedunn.ca>
>>>----------------> http://www.stevedunn.ca/ <----------------<<<
------------------------------------------------------------------
     Say hi to my cat -- http://www.stevedunn.ca/photos/toby/

