Re: Missing Files After Panic/Fsck?
From: Tim Bradshaw (tfb_at_cley.com)
Date: 11/06/03
- Previous message: Stephen Gray: "Sendmail upgrade"
- In reply to: Gavin Maltby: "Re: Missing Files After Panic/Fsck?"
- Next in thread: Thomas H Jones II: "Re: Missing Files After Panic/Fsck?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Thu, 06 Nov 2003 09:32:57 +0000
* Gavin Maltby wrote:
> If this happens then it's a bug and you should complain to the
> cluster vendor. It is possible to knobble some heartbeat designs
> with device interrupts, but you can design around that.
Yes, I'm not disputing it's a bug: I'm just wondering whether it can
happen in current cluster systems. Complaining to the vendor may not
get a fix to something like this very fast, if it happens, because I
suspect it's not something a three-line change will fix.
And just to play devil's advocate: *is* it a bug? We've all dealt
with systems where some horrible thing has happened (runaway memory
use, fork bomb or what have you), and while the system is in theory
still up, it's not actually doing any useful work or likely to do any
time soon, and the kindest thing is to put it out of its misery with
the big red button. OK, *those* problems can, in theory, be worked
around as well, but they still happen. Maybe the right thing in that
case is for the cluster to just decide that machine is gone, and fail
over? Of course it should probably make a conscious decision (`load
average is 903, memory shortfall is 20GB, last user code ran 10mins
ago, time to die') rather than just failing to respond...
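To make the "conscious decision" idea concrete, here is a rough sketch of what such a check might look like. Everything in it is an assumption for illustration: the `NodeHealth` fields, the threshold values and the `decide` function are hypothetical and not taken from any real cluster product. It also assumes the node can still gather these numbers and run the check at all, which is the hard part.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class NodeHealth:
    # Hypothetical health snapshot; a real system would have to collect these.
    load_average: float             # e.g. the 1-minute load average
    memory_shortfall_gb: float      # memory demand minus memory available, in GB
    seconds_since_user_code: float  # time since user code last made progress


# Hypothetical thresholds, chosen only for illustration.
MAX_LOAD_AVERAGE = 500.0
MAX_MEMORY_SHORTFALL_GB = 10.0
MAX_SECONDS_WITHOUT_USER_CODE = 300.0


def decide(health: NodeHealth) -> Tuple[bool, List[str]]:
    """Return (give_up, reasons) for a node deciding whether to declare itself dead."""
    reasons = []
    if health.load_average > MAX_LOAD_AVERAGE:
        reasons.append(f"load average is {health.load_average:g}")
    if health.memory_shortfall_gb > MAX_MEMORY_SHORTFALL_GB:
        reasons.append(f"memory shortfall is {health.memory_shortfall_gb:g}GB")
    if health.seconds_since_user_code > MAX_SECONDS_WITHOUT_USER_CODE:
        minutes = health.seconds_since_user_code / 60
        reasons.append(f"last user code ran {minutes:g}mins ago")
    # Conservative assumption: give up only when all three signals agree,
    # so a single noisy metric does not trigger a failover.
    return len(reasons) == 3, reasons
```

Calling `decide(NodeHealth(903, 20, 600))` would report all three reasons and recommend giving up, at which point the node could log them and ask the cluster to fail over, rather than simply failing to respond to heartbeats.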
--tim