Apparently spurious ZFS CRC errors (was Re: ZFS data error without reasons)



On Mon, 16 Mar 2009, kevin wrote:

> My laptop is a T61. RAM was also tested with memtest86+ and returned no errors.

Same here. Memtest fine.

> "zfs send tank/usr/home/kevin@2009-03-15-16:51:21 | zfs receive backup/kevin" hangs the system and I have to power off the machine. When the system comes back up, I find a file error in snapshot tank/usr/home/kevin@2009-03-15-16:51:21. When I destroy tank/usr/home/kevin@2009-03-15-16:51:21 and then reboot the system, I find more errors.

I've moved a box that has been running FreeBSD 7 with a 7x1TB drive RAIDZ2 array.
I've created the same RAIDZ2 with 8-CURRENT and am restoring data from tape to the new array (I wanted to rejig the zfs setup). All will appear well for a while, i.e. no CRC errors, and I can scrub and rescrub the data whilst the restore is running without problem. I restored the entire 3.5TB from tape without error. All data still scrubbed fine. Then suddenly I get CRC errors on every disk. Repeated scrubs show up different numbers of errors.
I just couldn't stop them. So I've started again, this time checking everything and moving drives onto different controllers to isolate problems. I have a Gigabyte GA-P35-DS4 MB which has 8xSATA: 6xICH9R & 2xJMB363. It also has an Sil3132 in there, which in previous incarnations had the odd drive on it. There's been mention of Sil problems, and even though the ICH9, JMB363 and Sil3132 had been perfect with 7, I moved drives off it:

1. Rebuilt kernel and world from last night; Thu Mar 19 18:27:18 GMT 2009.
2. 6x1TB drives on ICH9R
3. 2x500GB on JMB363, striped into 1TB
4. / is ufs on USB key
5. created RAIDZ2 again
6. recreated zfs filesystems
7. started restore from tape.

Same again. I can restore data and perform a scrub after each tape (LTO2 ~200GB each) is restored. No errors. Get up to ~350GB, still no errors. Then the last scrub I've done throws up:

-----
  pool: pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h51m with 0 errors on Fri Mar 20 10:57:18 2009
config:

        NAME             STATE     READ WRITE CKSUM
        pool             ONLINE       0     0     0
          raidz2         ONLINE       0     0    23
            stripe/str0  ONLINE       0     0   489  12.3M repaired
            ad14         ONLINE       0     0   786  19.7M repaired
            ad16         ONLINE       0     0   804  20.1M repaired
            ad18         ONLINE       0     0   754  18.8M repaired
            ad20         ONLINE       0     0   771  19.3M repaired
            ad22         ONLINE       0     0   808  20.2M repaired
            ad24         ONLINE       0     0   848  21.2M repaired

errors: No known data errors
-----
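
For anyone wanting to compare the counters between scrubs, here is a minimal Python sketch that pulls the CKSUM column out of status text like the above. The names ROW, SAMPLE, parse_cksum and affected_devices are made up for illustration, and the pattern assumes the row layout shown in this post (name, state, three counters, optional "<size> repaired" note); other zpool versions may print extra columns.

```python
import re

# Matches a device row such as "ad14 ONLINE 0 0 786 19.7M repaired".
# The trailing "<size> repaired" note is optional. Assumed layout only.
ROW = re.compile(
    r"^\s*(?P<name>\S+)\s+(?P<state>[A-Z]+)\s+"
    r"(?P<read>\d+)\s+(?P<write>\d+)\s+(?P<cksum>\d+)"
    r"(?:\s+(?P<repaired>\S+)\s+repaired)?\s*$"
)

# Rows copied from the status output in this post.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
pool ONLINE 0 0 0
raidz2 ONLINE 0 0 23
stripe/str0 ONLINE 0 0 489 12.3M repaired
ad14 ONLINE 0 0 786 19.7M repaired
ad16 ONLINE 0 0 804 20.1M repaired
ad18 ONLINE 0 0 754 18.8M repaired
ad20 ONLINE 0 0 771 19.3M repaired
ad22 ONLINE 0 0 808 20.2M repaired
ad24 ONLINE 0 0 848 21.2M repaired
"""


def parse_cksum(status_text):
    """Return {row_name: checksum_error_count} for every config row."""
    counts = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m:  # header and "key: value" lines do not match the pattern
            counts[m.group("name")] = int(m.group("cksum"))
    return counts


def affected_devices(status_text):
    """Return the sorted names of rows with a non-zero CKSUM counter."""
    counts = parse_cksum(status_text)
    return sorted(name for name, n in counts.items() if n > 0)


if __name__ == "__main__":
    for name in affected_devices(SAMPLE):
        print(name, parse_cksum(SAMPLE)[name])
```

Saving the output of each scrub and diffing the parsed counts would show whether the errors favour one controller over the other.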

So it happens on both controllers, on plain drives and the stripe. There just seems no way to get rid of these errors once they appear. As I said, last time I got the whole 3.5TB restored without error, was using it for a few days without error, constantly scrubbing to check reliability, then once the errors appear there's no way to remove them.
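
The scrub-after-each-restore check can be scripted roughly as below. This is only a sketch: run_zpool and scrub_and_report are hypothetical helper names, and it assumes that `zpool status` prints the text "scrub in progress" while a scrub is running, which should be verified against the zpool version in use.

```python
import subprocess
import time


def run_zpool(args):
    """Run `zpool <args>` and return its standard output as text."""
    result = subprocess.run(
        ["zpool", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def scrub_and_report(pool, run=run_zpool, poll_seconds=60, sleep=time.sleep):
    """Start a scrub on `pool`, wait for it to finish, return the final status.

    `run` and `sleep` are injectable so the loop can be exercised without
    a real pool. The "scrub in progress" marker is an assumption about
    the status text.
    """
    run(["scrub", pool])
    while True:
        status = run(["status", pool])
        if "scrub in progress" not in status:
            return status
        sleep(poll_seconds)


if __name__ == "__main__":
    # "pool" is the pool name used in this post.
    print(scrub_and_report("pool"))
```

Running this after each tape and keeping the returned text gives a record of exactly when the first checksum errors show up.
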
As this same hardware worked well with 7 for a long time, and can work perfectly with 8 for several days until the errors strike, this seems like some curious 8 problem?
Any help would be appreciated. I'll be happy to provide any further info to help debug this. I didn't want to unnecessarily make this any longer than it already is.
Cheers.

--
Mark Powell - UNIX System Administrator - The University of Salford
Information & Learning Services, Clifford Whitworth Building,
Salford University, Manchester, M5 4WT, UK.
Tel: +44 161 295 6843 Fax: +44 161 295 5888 www.pgp.com for PGP key
_______________________________________________
freebsd-current@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@xxxxxxxxxxx"


