5.0.5 - access to a filesystem hangs occasionally
From: Stephen M. Dunn (stephen_at_stevedunn.ca)
Date: 06/03/03
Date: Tue, 3 Jun 2003 17:14:54 GMT
One of my clients is having a problem I've never seen before.
They're running 5.0.5 Enterprise on a Compaq server; it's been
reliable for a couple of years. I recently installed the latest
few patches - oss640a, oss646a, oss647a (which really shouldn't
be necessary, as their disk subsystem is a RAID array rather than
individual SCSI disks), oss650a - that they didn't already have,
and the problems have been happening since then.
Twice since then, the system has become unusable. The first
time, they couldn't run shutdown, so they hit the power switch and
couldn't give me a good description. Today it happened again,
and I was able to get in via ssh. They could also telnet in, and they
could log in on the console (and they did, screen after screen, until
every screen was locked up trying to run a frozen app). Every
application which was accessing the /u filesystem was hung and could
not be killed, even with kill -9. Each of the following commands hung my
session; I had to kill my ssh client and connect again:
    cd /u
    ls -l /     (but ls / worked)
    df
    nuc stop    (hung trying to unmount the /.NetWare filesystem)
    any command which tried to access any file under /u
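
Since each of the commands above took the login session down with it, one way to test a suspect filesystem without losing the session is to run the probe in a child process and stop waiting after a deadline. The following is only a rough sketch under assumptions not in the original report: that a Python 3 interpreter is available on the box, and that `ls -d` on the path is enough to trigger the hang. The names `probe_command` and `probe_path` are made up for illustration.

```python
import subprocess
import sys
import time


def probe_command(argv, timeout=5.0, poll_interval=0.1):
    """Run argv in a child process and classify the outcome.

    Returns "ok" (exit status 0), "error" (non-zero exit status), or
    "hung" (no exit before the deadline).  The parent only polls; it
    never blocks in wait(), so it stays usable even if the child is
    stuck in the kernel and cannot be killed.
    """
    child = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = child.poll()
        if status is not None:
            return "ok" if status == 0 else "error"
        time.sleep(poll_interval)
    # Best effort only: a process blocked on a hung filesystem may
    # ignore SIGKILL, exactly as described above, so we do not wait.
    child.kill()
    return "hung"


def probe_path(path, timeout=5.0):
    """Probe a path with `ls -d`, which forces a lookup of that path."""
    return probe_command(["ls", "-d", path], timeout=timeout)


if __name__ == "__main__":
    # Default targets are taken from the report; pass others as arguments.
    for target in sys.argv[1:] or ["/", "/u"]:
        print(target, probe_path(target))
```

Run from a session that has not yet touched /u, a "hung" result for /u alongside "ok" for / would match the symptoms described here. A stuck child may still linger afterwards, so this narrows down which mount point is affected rather than fixing anything.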
I also noticed that last night's backup was hung and unkillable.
visionfs stop didn't hang, but it was unable to stop the visionfs
processes, and they could not be killed; the main share that they
use is on /u.
Any command I tried that did not have anything to do with /u ran
without problems. The CPU was basically 100% idle according to cpuhog.
root, boot, u, and swap are all in the same partition on the array,
so access to the array itself is not the problem. The data collected
by sar every 20 minutes looks normal up to the point where the users
got hung, and after that it looks like a normal system that's idle.
The first time, they let it start up automatically, so fsck would
have run in quick mode (/dev/u is an HTFS filesystem). A few days
later, when they told me they were having problems with some of their
data files (which turned out to contain corrupt data, probably due to
the previous hang), I had them reboot to single-user mode and run fsck
-ofull; it reported no problems. Today, fsck -ofull is reporting a
number of problems that might be expected given that they could not
shut down cleanly.
Between the first and second hangs, I updated the Compaq EFS to
the latest version and ran Compaq's array diagnostics, which reported
that the drive array was working properly. I have been unable to
find any unusual messages in syslog or /usr/adm/messages.
The server is powered from a UPS, though they tell me that the UPS
is in need of repair and that the server crashed due to a power failure
roughly a week before the first hang. This is certainly a possible
cause, and they are aware that they need to get the UPS fixed ASAP.
I'm thinking of rolling back the patches, particularly oss647a,
since the only system configuration change that's taken place between
when the system was reliable and when it first hung was the
installation of the patches. Both hangs, as far as they can tell,
happened after the system had been running for a week or so, so I've
suggested that they might reboot a couple of times a week to see if
that keeps the problems at bay.
Any other ideas? The system is going to be replaced in a month or
two with a newer box running 5.0.7, but we need to keep the old box
stable until then ...
-- 
Stephen M. Dunn <stephen@stevedunn.ca>
>>>----------------> http://www.stevedunn.ca/ <----------------<<<
------------------------------------------------------------------
Say hi to my cat -- http://www.stevedunn.ca/photos/toby/