Re: Disaster recovery ?

From: MaXX (bs139412_at_skynet.be)
Date: 09/01/05


Date: Thu, 01 Sep 2005 12:44:13 +0200

Philip Paeps wrote:
> MaXX <bs139412@skynet.be> wrote:
>> Philip Paeps wrote:
>> > [...]
>> > After the second power failure, the machine came up, panicked, then
>> > refused to boot with an LBA error (could have been 16, can't remember).
>> > Makes me think that bgfsck might be breaking things.
>> > Has anyone else noticed anything like this?
>> Got a similar failure on a cheap Maxtor DiamondMax Plus 9 (ATA/133 80GB),
>> but without any power failure: kernel panic on a read error. Luckily my
>> backups were up to date and I have a spare disk, just in case...

> Both my problems were after the power failed, but as I mentioned, in one
> case, the disk came up fine right after the power failure, but died after
> panicking and trying to come back up after that.
I can't recall exactly what the panic message was.
 
>> After an fsck -y (lasted many hours and generated megs of log) I was able
>> to recover almost all my data (lost some files in /usr/src), and
>> PostgreSQL started without any errors.
>
> I wasn't so lucky. I couldn't boot the machine from the disk (LBA error
> in the loader); when I tried to mount the filesystems in another machine,
> I got heaps and heaps of 'uncorrectable' DMA_READ errors. fsck died on me
> with a message that it couldn't read certain sectors [...]
I had enabled reboot on panic, and the machine did not come back up: it
stopped, asking for the boot loader. I booted it from a live CD and had to
run fsck -y many times before I got the "FILESYSTEM IS MARKED CLEAN"
message... It seems that fsck gives up once it hits more than X unreadable
blocks. I'm sure that the ATA controller is healthy.
I was astounded to see Postgres start (it's a bit picky about file
system integrity)...
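
For anyone who has to repeat the "run fsck -y again and again" step, here is
a minimal sketch in Python of how it could be automated. It assumes that an
exit status of 0 means fsck is satisfied and that any other status means
another pass is worth trying; check fsck(8) on your release before relying
on that. The function name fsck_until_clean, the device name /dev/ad0s1a and
the cap of 10 passes are made up for illustration, they are not from my
setup.

```python
import subprocess
from typing import Callable, Sequence


def _run_command(argv: Sequence[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.run(list(argv)).returncode


def fsck_until_clean(
    device: str,
    max_passes: int = 10,
    run: Callable[[Sequence[str]], int] = _run_command,
) -> int:
    """Run `fsck -y <device>` repeatedly until it exits with status 0.

    Returns the number of the pass that came back clean.
    Raises RuntimeError if fsck is still unhappy after max_passes,
    which would suggest the disk needs more than fsck can give it.

    Assumption: exit status 0 means clean, anything else means "try again".
    """
    for attempt in range(1, max_passes + 1):
        status = run(["fsck", "-y", device])
        if status == 0:
            return attempt
    raise RuntimeError(
        f"fsck -y {device} still failing after {max_passes} passes"
    )
```

Run as root from a live CD with the filesystem unmounted, something like
fsck_until_clean("/dev/ad0s1a") would return the number of passes it took.
The cap is there so that a disk with truly unreadable sectors does not keep
the loop going forever.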

> Good thing I keep fairly religious backups. I've learned to back up config
> files too now, though, those were a bitch to retype. :-)
Yep, DVD-RWs are a gift from god (or whoever), but this time I've heavily
modified the setup... Yeah! RTFM night!

>> A funny thing is that smartctl (sysutils/smartmontools) displays errors
>> in the HD log, but fdisk+newfs restores the disk to a usable state.
>> Errors I've seen are DMA_READ (most frequent) and some DMA_WRITE.
>
> Same thing here - the errors in the DMA log are the DMA_READ errors, and
> in the details there's the 'uncorrectable' bit always with addresses in
> the same area, which makes me suspicious of the disk. But as I said,
> after newfsing, the disk was happy to work again and smart still says
> 'passed' on the overall health checks.
>
>> Can that be an ATA driver problem of some kind? I've seen threads about
>> ATA on freebsd-stable, [...]
>
> I'm beginning to wonder. Both my issues were with IDE disks from the same
> manufacturer (though on different controllers). During the power
> failures, a number of SCSI machines went black too, but they came up
> without a hitch, and had no problems surviving fsck. Of course, surviving
> fsck depends a lot on the phase of the moon and other unpredictable things
> like 'luck' but it is an interesting side-note.
SCSI is far more reliable, even when the power fails. I wonder why
(expensive=reliable?)...
My failed (now healthy) drive has an 8MB cache, and according to my crontab
and logs the machine failed during a heavy transfer (copying about 2GB of
files from another machine). Cache is a good thing for performance, but if
the cache was full when the system crashed, I can't believe that there were
no pending soft updates writes in it...

>> [...]
>>
>> The machine runs FreeBSD 5.4-STABLE built from sources as of 25/06/05 and
>> had been up from installworld until the panic this weekend.
>
> My problems were on a 5.4-STABLE from a bit before that and -CURRENT from
> around the same date.
My live CD was older (from around March) and the machine now runs a freshly
built world. If it was a driver problem, I hope it's solved.
I found this thread on freebsd-stable. The problem is not exactly the same,
but the thread started on 19 June 2005... Interesting...
http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2005-06/0619.html
 
> - Philip

-- 
MaXX
