Re: "New" ZFS crash on FS (pool?) unmount/export



On Jun 20, 2009, at 09:11, Thomas Backman wrote:

I just ran into this tonight. Not sure exactly what triggered it - the box stopped responding to pings at 02:07AM and it has a cron backup job using zfs send/recv at 02:00, so I'm guessing it's related, even though the backup probably should have finished before then... Hmm. Anyway.

r194478.

kernel trap 12 with interrupts disabled

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x288
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff805a4989
stack pointer = 0x28:0xffffff803e8b57e0
frame pointer = 0x28:0xffffff803e8b5840
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 57514 (zpool)
panic: from debugger
cpuid = 0
Uptime: 10h22m13s
Physical memory: 2027 MB

(kgdb) bt
#0 doadump () at pcpu.h:223
#1 0xffffffff8059c409 in boot (howto=260) at /usr/src/sys/kern/ kern_shutdown.c:419
#2 0xffffffff8059c85c in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:575
#3 0xffffffff801f1377 in db_panic (addr=Variable "addr" is not available.
) at /usr/src/sys/ddb/db_command.c:478
#4 0xffffffff801f1781 in db_command (last_cmdp=0xffffffff80c38620, cmd_table=Variable "cmd_table" is not available.
) at /usr/src/sys/ddb/db_command.c:445
#5 0xffffffff801f19d0 in db_command_loop () at /usr/src/sys/ddb/ db_command.c:498
#6 0xffffffff801f3969 in db_trap (type=Variable "type" is not available.
) at /usr/src/sys/ddb/db_main.c:229
#7 0xffffffff805ce465 in kdb_trap (type=12, code=0, tf=0xffffff803e8b5730) at /usr/src/sys/kern/subr_kdb.c:534
#8 0xffffffff8088715d in trap_fatal (frame=0xffffff803e8b5730, eva=Variable "eva" is not available.
) at /usr/src/sys/amd64/amd64/trap.c:847
#9 0xffffffff80887fb2 in trap (frame=0xffffff803e8b5730) at /usr/ src/sys/amd64/amd64/trap.c:345
#10 0xffffffff8086e007 in calltrap () at /usr/src/sys/amd64/amd64/ exception.S:223
#11 0xffffffff805a4989 in _sx_xlock_hard (sx=0xffffff0043557d50, tid=18446742975830720512, opts=Variable "opts" is not available.
)
at /usr/src/sys/kern/kern_sx.c:575
#12 0xffffffff805a52fe in _sx_xlock (sx=Variable "sx" is not available.
) at sx.h:155
#13 0xffffffff80fe2995 in zfs_freebsd_reclaim () from /boot/kernel/ zfs.ko
#14 0xffffffff808cefca in VOP_RECLAIM_APV (vop=0xffffff0043557d38, a=0xffffff0043557d50) at vnode_if.c:1926
#15 0xffffffff80626f6e in vgonel (vp=0xffffff00437a7938) at vnode_if.h:830
#16 0xffffffff8062b528 in vflush (mp=0xffffff0060f2a000, rootrefs=0, flags=0, td=0xffffff0061528000)
at /usr/src/sys/kern/vfs_subr.c:2450
#17 0xffffffff80fdd3a8 in zfs_umount () from /boot/kernel/zfs.ko
#18 0xffffffff8062420a in dounmount (mp=0xffffff0060f2a000, flags=1626513408, td=Variable "td" is not available.
)
at /usr/src/sys/kern/vfs_mount.c:1287
#19 0xffffffff80624975 in unmount (td=0xffffff0061528000, uap=0xffffff803e8b5c00)
at /usr/src/sys/kern/vfs_mount.c:1172
#20 0xffffffff8088783f in syscall (frame=0xffffff803e8b5c90) at /usr/ src/sys/amd64/amd64/trap.c:984
#21 0xffffffff8086e290 in Xfast_syscall () at /usr/src/sys/amd64/ amd64/exception.S:364
#22 0x000000080104e49c in ?? ()
Previous frame inner to this frame (corrupt stack?)

I might have hit the same thing again... only it didn't *crash* this time, just freeze! I got a "GEOM_GATE: Device ggateX destroyed" on my console, and it stopped responding to pings, keyboard input, etc.
(NOTE: The GEOM_GATE message MAY have been an old one. I *think* it was from the night before but can't be sure...)

It obviously happened while running my ugly-hack backup script this time too, since it stopped responding to pings ~02:02AM with the script running at 02:00.
I'm not sure where it crashed, since snapshots were NOT taken on the "local box". It usually crashes during export, long after taking the snapshots.

BTW, current system is BETA1 r195422M (dtrace timestamp patch + libzfs_sendrecv.c patch ( http://lists.freebsd.org/pipermail/freebsd-current/2009-May/006814.html ).

Here's the script in its relevant entirety (I removed the "initial backup" part since it never runs using cron anyway).
I realize it's an ugly hack (and that my bash-fu could be stronger), but what the heck. Essentially, it creates a GEOM provider of a file, containing a zpool, imports the pool and creates a clone to it, and then exports the pool. The export appears to be what causes all the trouble - usually, not this time around. Every time it crashes it seems to be during or very soon after the export - only this time it didn't even take the snapshots.

#!/bin/bash

PATH="$PATH:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/ sbin"

function die() {
echo "$@" 2>&1
zpool export slave 2>&1 > /dev/null
ggatel destroy -u 666 2>&1 > /dev/null
exit 1
}

function mount_unmount {
if [ -z "$1" ]; then
die 'Invalid argument given to mount_unmount'
elif [[ "$1" == "mount" ]]; then
zpool list | grep -q slave
if [ "$?" = "0" ]; then
echo Already mounted
return 0
fi
echo Creating ggate device
ggatel create -u 666 /mnt/backup/chaos/slavefile || die 'Unable to create GEOM provider from file'
echo 'Sleeping for 5 seconds...'
sleep 5
echo Importing pool
zpool import -R /slave slave || die 'Unable to import slave pool'
elif [[ "$1" == "unmount" ]]; then
echo Exporting pool
zpool export slave || die 'Unable to export slave pool'
ggatel destroy -u 666 || die 'Unable to destroy GEOM provider'
fi
}

f [ ! -z "$1" ]; then
case $1 in
mount) mount_unmount mount; exit 0;;
unmount) mount_unmount unmount; exit 0;;
initial) initial; exit 0;;
backup) ;;
*) help;;
esac
else
help
fi

if [ ! -f "/mnt/backup/chaos/slavefile" ]; then
echo 'Backup error! slavefile does not exist!' | mail -s "Backup error" root
echo 'Slavefile does not exist!'
exit 1
fi

mount_unmount mount

CURR=$(date +"backup-%F-%H%M")

echo Taking snapshots
zfs snapshot -r tank@$CURR || die 'Unable to create $CURR snapshot'

echo Starting backup...
LAST=$(cat /root/.last-backup)
zfs send -R -I $LAST tank@$CURR | zfs recv -Fvd slave

echo $CURR > /root/.last-backup

mount_unmount unmount

echo Running rsync
rsync -av --delete /bootdir/boot exscape::backup-freebsd/chaos
rsync -av --delete /root exscape::backup-freebsd/chaos
rsync -av --delete ~serenity exscape::backup-freebsd/chaos

echo 'All done!'

-------------------

So, in *normal* cases, everything runs just fine. This is the case perhaps 90% of the time. In normal *crashes*, it hangs on export with the above backtrace. This time, all I know is that it crashes soon after starting the backup... during import, perhaps?

Regards,
Thomas
_______________________________________________
freebsd-current@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@xxxxxxxxxxx"



Relevant Pages

  • RE: NT Backup inconsistent medium
    ... I would configure the backup with NTBackup and set a schedule from there. ... | directory's to a tape with NT backup it works fine. ... | @ECHO Off ... | REM Script gebaseerd op NT Backup ...
    (microsoft.public.windows.server.sbs)
  • Robocopy fails sporadically
    ... if errorlevel 16 echo FATAL ERROR on BACKUP of %~n0's DATA>> ... if errorlevel 8 echo FAILED COPIES on BACKUP of %~n0's DATA>> ...
    (microsoft.public.windows.server.general)
  • Re: [SLE] Bat files in Linux
    ... > I have a few commands that I would like to run at the command prompt. ... > In Windows I do this with a bat file that looks like the sample below. ... > ECHO Backup Menu ... > ECHO A. Flash Drive Backup J: ...
    (SuSE)
  • Re: Sanity check needed on an idea
    ... such-and-such server all day yesterday, why is it broken, and when is it going ... as well as a backup copy handy if you need 'em. ... # operate on that host. ... echo "Authenticating via SSH key id: ...
    (freebsd-questions)
  • Re: dos batch programmer help!!
    ... REM Set up your two serial numbers here and ensure dir C:\TEMP\ exists ... if errorlevel 1 goto otherdisk ... echo Wrong diskin H: -- swap it to allow backup to continue! ...
    (comp.os.msdos.programmer)

Loading