Crashing E220R(s)
- From: hume.spamfilter@xxxxxxx
- Date: Thu, 5 Jul 2007 12:39:46 +0000 (UTC)
I recently upgraded one of our workgroup servers from an E250 to an
E220R. The 220R had substantially more memory and CPU, so it was a
nice upgrade, even ignoring the fact that at the same time it moved from
a barely-maintained Sol7 installation to Sol10.
However, after a couple of months of service the 220 started crashing
two or three times a day. There was little useful diagnostic output,
but what there was suggested that either a memory module or a CPU
module was bad.
I swapped the disks over to an identical E220R we had about. I haven't
had a chance to run diagnostics on the old 220, although a simple run
through the OpenBoot tests didn't produce anything.
Now, after a few weeks in service, the *NEW* 220 has crashed today. Unlike
with the old one, I actually got some log messages this time:
Jul 5 07:39:17 E220R-2 SUNW,UltraSPARC-II: [ID 677095 kern.warning] WARNING: [AFT1] Uncorrectable Memory Error on CPU2 Data access at TL=0, errID 0x0006613d.f77cf492
Jul 5 07:39:17 E220R-2 AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.bf41e4f8
Jul 5 07:39:17 E220R-2 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1041228
Jul 5 07:39:17 E220R-2 UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Jul 5 07:39:17 E220R-2 UDBL Syndrome 0x3 Memory Module U1001 U1002 U1003 U1004
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 761793 kern.warning] WARNING: [AFT1] errID 0x0006613d.f77cf492 Syndrome 0x3 indicates that this may not be a memory module problem
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 490361 kern.info] [AFT2] errID 0x0006613d.f77cf492 PA=0x00000000.bf41e4f8
Jul 5 07:39:18 E220R-2 E$tag 0x00000000.0fc017e8 E$State: Modified E$parity 0x07
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x08): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x18): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x38): 0x00000040.00000000 *Bad* PSYND=0x00ff
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 301547 kern.warning] WARNING: [AFT1] Additional errors detected during error processing on CPU2 at TL=0, errID 0x0006613d.f77cf492
Jul 5 07:39:18 E220R-2 AFSR 0x00000000.008000ff<WP> AFAR 0x00000000.bf41e4f0
Jul 5 07:39:18 E220R-2 AFSR.PSYND 0x00ff(Score 05) AFSR.ETS 0x00 Fault_PC 0x1041228
Jul 5 07:39:18 E220R-2 UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0003 UDBL.ESYND 0x03
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 940362 kern.warning] WARNING: [AFT1] AFAR was derived from UE report, CP event on CPU0 (caused Data access error on CPU2), errID 0x0006613d.f77cf492
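For anyone wanting to pick the AFSR/AFAR values out of messages like these, here is a rough Python sketch. It only reads back what the kernel already printed (the register value, the flag names in angle brackets, and the fault address). The 64-byte line size is an assumption taken from the eight 8-byte E$Data entries shown above, and the function name and regex are my own, not part of any Solaris tool.

```python
import re

# The two AFSR/AFAR lines from the log above, with the syslog prefix trimmed.
LOG_LINES = [
    "AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.bf41e4f8",
    "AFSR 0x00000000.008000ff<WP> AFAR 0x00000000.bf41e4f0",
]

# "AFSR " followed by a space, so "AFSR.PSYND ..." lines do not match.
AFSR_RE = re.compile(
    r"AFSR 0x([0-9a-f]{8})\.([0-9a-f]{8})(?:<([A-Z,]+)>)?"
    r"\s+AFAR 0x([0-9a-f]{8})\.([0-9a-f]{8})"
)

# Assumption: 64-byte E$ line, as implied by E$Data offsets 0x00..0x38.
LINE_SIZE = 64


def parse_afsr_line(line):
    """Return the AFSR value, printed flag names and AFAR, or None."""
    m = AFSR_RE.search(line)
    if m is None:
        return None
    hi, lo, flags, afar_hi, afar_lo = m.groups()
    afar = int(afar_hi + afar_lo, 16)
    return {
        "afsr": int(hi + lo, 16),
        "flags": flags.split(",") if flags else [],
        "afar": afar,
        # Offset of the fault address within its (assumed) 64-byte line.
        "line_offset": afar % LINE_SIZE,
    }


parsed = [parse_afsr_line(line) for line in LOG_LINES]
for p in parsed:
    print(hex(p["afsr"]), p["flags"], hex(p["afar"]), hex(p["line_offset"]))
```

Under that assumption, the first AFAR falls at offset 0x38 within its line, which is the same offset as the E$Data entry flagged *Bad* above.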
Am I right in thinking that this is a manifestation of the Ultra-II ecache
problems from years ago? These servers - spare equipment we obtained at
no cost, and thus no support contract - are the only ones of their type
we have. We essentially skipped over that entire generation, going from
the E250 straight to T2000s.
And... as I type this... it's just crashed again.
--
Brandon Hume - hume -> BOFH.Ca, http://WWW.BOFH.Ca/