Re: Crashing E220R(s)
- From: "Mr. Johan Andersson" <johan@xxxxxxxxxxxxxx>
- Date: Fri, 6 Jul 2007 11:45:57 +0200
On Thu, 5 Jul 2007, Franco wrote:
On Jul 5, 1:39 pm, hume.spamfil...@xxxxxxx wrote:
I recently upgraded one of our workgroup servers from an E250 to an
E220R. The 220R had substantially more memory and CPU, so it was a
nice upgrade, even ignoring the fact that at the same time it moved from
a barely-maintained Sol7 installation to Sol10.
However, after a couple of months of service the 220 started crashing
two or three times a day. Not much useful diagnostic output was found,
but some indications were that either a memory module or CPU module was
bad.
I swapped the disks over to an identical E220R we had about. I haven't
had a chance to run diagnostics on the old 220, although a simple run
through the OpenBoot tests didn't produce anything.
Now, after a few weeks in service, the *NEW* 220 has crashed today. Better
than the old, I actually got some log messages this time:
Jul 5 07:39:17 E220R-2 SUNW,UltraSPARC-II: [ID 677095 kern.warning] WARNING: [AFT1] Uncorrectable Memory Error on CPU2 Data access at TL=0, errID 0x0006613d.f77cf492
Jul 5 07:39:17 E220R-2 AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.bf41e4f8
Jul 5 07:39:17 E220R-2 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1041228
Jul 5 07:39:17 E220R-2 UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Jul 5 07:39:17 E220R-2 UDBL Syndrome 0x3 Memory Module U1001 U1002 U1003 U1004
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 761793 kern.warning] WARNING: [AFT1] errID 0x0006613d.f77cf492 Syndrome 0x3 indicates that this may not be a memory module problem
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 490361 kern.info] [AFT2] errID 0x0006613d.f77cf492 PA=0x00000000.bf41e4f8
Jul 5 07:39:18 E220R-2 E$tag 0x00000000.0fc017e8 E$State: Modified E$parity 0x07
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x08): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x18): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x38): 0x00000040.00000000 *Bad* PSYND=0x00ff
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 301547 kern.warning] WARNING: [AFT1] Additional errors detected during error processing on CPU2 at TL=0, errID 0x0006613d.f77cf492
Jul 5 07:39:18 E220R-2 AFSR 0x00000000.008000ff<WP> AFAR 0x00000000.bf41e4f0
Jul 5 07:39:18 E220R-2 AFSR.PSYND 0x00ff(Score 05) AFSR.ETS 0x00 Fault_PC 0x1041228
Jul 5 07:39:18 E220R-2 UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0003 UDBL.ESYND 0x03
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 940362 kern.warning] WARNING: [AFT1] AFAR was derived from UE report, CP event on CPU0 (caused Data access error on CPU2), errID 0x0006613d.f77cf492
Am I right in thinking that this is a manifestation of the Ultra-II ecache
problems from years ago? These servers - spare equipment we obtained at
no cost, and thus no support contract - are the only ones of their type
we have. We essentially skipped over that entire generation, going from
the E250 straight to T2000s.
And... as I type this... it's just crashed again.
--
Brandon Hume - hume -> BOFH.Ca,http://WWW.BOFH.Ca/
Looks like a bad/faulty DIMM.
It might look like a DIMM failure, but it probably isn't. The most useful
message of the whole lot is the last one, which basically explains what
happened: a CP event on CPU0 caused the data access error on CPU2.
However, the system is not sure, as it only puts a score of 5 on the message.
My guess would be to replace CPU0. If you can accept another crash, open
the system, swap the modules (CPU0 <-> CPU2), and see if the next crash
points to the same CPU module, wherever it's placed.
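If you want to compare crashes before and after the swap without reading every AFT line by hand, something like the following might help. It is only a rough sketch: the pattern is inferred from the single "event on CPUn (caused ... on CPUm)" line quoted above, not from any Sun documentation, and the names `summarize` and `tally_sources` are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the one log line quoted in this thread, e.g.
#   "... CP event on CPU0 (caused Data access error on CPU2), errID 0x..."
# Other message variants may exist and would not be matched.
SOURCE_RE = re.compile(
    r"(?P<kind>\w+) event on CPU(?P<source>\d+) "
    r"\(caused .*? on CPU(?P<victim>\d+)\), "
    r"errID (?P<errid>0x[0-9a-fA-F]+\.[0-9a-fA-F]+)"
)


def summarize(lines):
    """Map each errID to (event kind, source CPU, CPU that took the error)."""
    events = {}
    for line in lines:
        m = SOURCE_RE.search(line)
        if m:
            events[m.group("errid")] = (
                m.group("kind"),
                int(m.group("source")),
                int(m.group("victim")),
            )
    return events


def tally_sources(events):
    """Count how often each CPU number is named as the source of an event."""
    return Counter(source for _kind, source, _victim in events.values())
```

Feed it the lines of /var/adm/messages (or wherever your syslog lands). The CPU number in the log identifies a position, not a physical module, so if the count moves from CPU0 to CPU2 after the swap, the fault followed the module.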
Reseating/reordering DIMMs can also be a good way to find the SINGLE
faulty one when whole banks are flagged. However, if you can't take
the hit, you're better off just replacing it.
But as previously stated by others, sure looks like those old cache
problems.
/Johan A