Re: zpool scrub errors on 3ware 9550SXU



Quoting Kip Macy <kmacy@xxxxxxxxxxx>:

Did you answer my question of whether or not this can be reproduced on 7-STABLE?

Yes, I did, but the threading is a little broken. Sorry, that's my fault.

To reiterate: with 7-STABLE circa Jun 25th, scrubs complete okay on the exact same hardware and v6 zpool that fails under 8.0-BETA1.

I'm scrubbing under 7 every time a run under 8 fails.

A reminder of the setup.

3ware 9550SXU-16
16x 1.5TB Seagate. These drives throw bad sectors!

Two 8-disk raidz2 vdevs combined into one pool, 21.8TB.

Test file system with compression on and copies=2.
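
A quick check suggests the 21.8TB figure is the raw size of all 16 drives expressed in binary units. This Python sketch assumes each drive holds exactly 1.5e12 bytes and that the pool size is reported with parity included; neither is stated in the thread, and the function name is made up for illustration.

```python
# Assumptions (not from the thread): 1.5e12 bytes per drive, and the pool
# size is the raw total of all member disks, parity included.
DISK_BYTES = 1_500_000_000_000
TIB = 2 ** 40


def raw_pool_tib(disks: int, disk_bytes: int = DISK_BYTES) -> float:
    """Raw size of all member disks in TiB, before raidz2 parity is deducted."""
    return disks * disk_bytes / TIB


print(f"{raw_pool_tib(16):.1f} TiB raw")
```

If those assumptions hold, the usable space after raidz2 parity would be lower than the 21.8 figure.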

I don't think this is a zfs error as such, it looks like the card gives up, which then spawns a whole series of bogus checksum errors (but what do I know).

It's odd that it seems to take 40m+ to fail. Offsets are always large.

How can I test for/eliminate any LBA error?
What else might cause the card to fail (after 40m)?
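
One way to start ruling out an LBA problem is to see how wide a sector number the failing offsets need. The Python sketch below is only that. It assumes 512-byte sectors and that each drive is exported as a single unit, and the helper names are hypothetical. Under those assumptions a 1.5TB drive has about 2.93 billion sectors, so sector numbers past 2**31 exist on every drive while none reach 2**32. Failures seen only beyond 1 TiB into a drive would be consistent with a signed 32-bit sector number somewhere in the path.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def sector_of(byte_offset: int) -> int:
    """Sector number (LBA) containing the given byte offset."""
    return byte_offset // SECTOR_BYTES


def lba_bits_needed(byte_offset: int) -> int:
    """Number of bits needed to hold the sector number as an unsigned value."""
    return max(1, sector_of(byte_offset).bit_length())


def past_boundary(byte_offset: int, bits: int) -> bool:
    """True if the sector number no longer fits in `bits` unsigned bits."""
    return sector_of(byte_offset) >= 2 ** bits


# Last byte of a drive assumed to hold exactly 1.5e12 bytes.
last_byte = 1_500_000_000_000 - 1
print(lba_bits_needed(last_byte))    # bits needed for the highest sector
print(past_boundary(last_byte, 31))  # overflows a signed 32-bit value?
print(past_boundary(last_byte, 32))  # overflows an unsigned 32-bit value?
```

Feeding the offsets from the real error messages through past_boundary() would show whether they all sit on one side of such a boundary.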

BTW I have to put this into production soon, so I can start testing all the other stuff which might not work (i.e. Samba).

Thanks for your help.



-Kip



On Tue, Jul 7, 2009 at 1:03 PM, Ian J Hart<ianjhart@xxxxxxxxxxxx> wrote:
Quoting ianjhart@xxxxxxxxxxxx:

Quoting ianjhart@xxxxxxxxxxxx:

Quoting Kip Macy <kmacy@xxxxxxxxxxx>:


As usual, it scrubs cleanly on 7.2. It started throwing errors within a few
minutes under 8. Then it panicked, possibly due to scrub -s.

It's sat at the DB prompt if there's anything I can do. I'll need
idiot's guide level instruction. I have a screen dump if someone wants to
step up. Off list?

Highlight seems to be...

Memory modified after free 0xffffff0004da0c00(248) val=3000000 @
0xffffff0004dc00
Panic: most recently used by none

Can you test with recent 7-STABLE? That would tell me whether you're
hitting a general HEAD issue or a problem with the v13 import.

It's doing a scrub under 7.2 following another failed test. I'll pull it
up to stable after that.

I have more data and will post it once I've done a couple of jobs.


Thanks,
Kip

Here's that extra data.

Updated 3ware/AMCC card firmware.

Enable onboard SATA and fit a 300GB SATA disk. Remove the floppy and fit a
second 300GB SATA disk.

Remove the two 500GB disks and replace with 1.5TB units. I can now create
two 8 disk raidz2 giving the same 12 disks worth of storage I had with one
14 disk raidz2.
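
The disk arithmetic above can be checked with a short Python sketch. It assumes only that each raidz2 vdev spends two disks' worth of space on parity; the function name is made up for illustration.

```python
def raidz2_data_disks(vdev_widths):
    """Disks' worth of usable space across a list of raidz2 vdev widths."""
    for width in vdev_widths:
        if width <= 2:
            raise ValueError("a raidz2 vdev needs more than 2 disks")
    # Each raidz2 vdev gives up two disks' worth of space to parity.
    return sum(width - 2 for width in vdev_widths)


old_layout = raidz2_data_disks([14])    # one 14-disk raidz2
new_layout = raidz2_data_disks([8, 8])  # two 8-disk raidz2 vdevs
print(old_layout, new_layout)
```

Both layouts come out at 12 data disks; the new one uses 16 drives rather than 14, with the two extra going to parity for the second vdev.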

Reinstall the two O/S on the 300GB drives.

<slight tangent>
May be of use to someone, so bear with me.

Reset to BIOS defaults. Some issues! Disabling sound helps.

Now suspect motherboard BIOS may be part of the problem. Removed both
cards and tested each version in turn.

ref: http://www.tyan.com.tw/support_download_bios.aspx?model=S.S2895

Started with 1.04, ended up with 1.04. Later versions detect the internal
SATA disks as 150, not 300. Most versions lock the keyboard (KVM) when legacy
USB is enabled. That's a PITA when you've just taken the floppy disk out.
There are no internal SATA disk settings. It would be nice to check the
geometry, as 7 and 8 sysinstall seem to be behaving differently.

With the cards back in.

Add an ATA disk and CDROM while testing. Easyboot order is SATA0 ATA0
SATA1. Fdisk the so far blank ATA disk :)

On board audio clashes with something. BIOS 1.03 and later supports 16
SCSI boot devices. I disabled booting from the RAID card to allow the
onboard SATA drives to boot.

Out of space for option ROM error has gone.

AFAIK the CPUs are late enough to support DDR400. Check anyway. Clock down
to 333MHz. Still fails.

</slight tangent>

There's one last thing: this BIOS (1.04) does not supply the fix for AMD
errata 169. Later BIOS versions incorrectly detect the onboard SATA disks.

Northbridge System Request Queue may stall.

ref:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25759.pdf

We don't seem to have /dev/msr. Could I fix this using (the shiny new)
cpucontrol?

Thanks

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.


_______________________________________________
freebsd-current@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@xxxxxxxxxxx"


FWIW this is still reproducible with 8.0-BETA1.

--
ian j hart





--
When bad men combine, the good must associate; else they will fall one
by one, an unpitied sacrifice in a contemptible struggle.

Edmund Burke




--
ian j hart



