Re: Crashing E220R(s)





On Thu, 5 Jul 2007, Franco wrote:

On Jul 5, 1:39 pm, hume.spamfil...@xxxxxxx wrote:
I recently upgraded one of our workgroup servers from an E250 to an
E220R. The 220R had substantially more memory and CPU, so it was a
nice upgrade, even ignoring the fact that at the same time it moved from
a barely-maintained Sol7 installation to Sol10.

However, after a couple of months of service the 220 started crashing
two or three times a day. Not much useful diagnostic output was found,
but some indications were that either a memory module or CPU module was
bad.

I swapped the disks over to an identical E220R we had about. I haven't
had a chance to run diagnostics on the old 220, although a simple run
through the OpenBoot tests didn't produce anything.

Now, after a few weeks in service, the *NEW* 220 has crashed today. Better
than the old, I actually got some log messages this time:

Jul 5 07:39:17 E220R-2 SUNW,UltraSPARC-II: [ID 677095 kern.warning] WARNING: [AFT1] Uncorrectable Memory Error on CPU2 Data access at TL=0, errID 0x0006613d.f77cf492
Jul 5 07:39:17 E220R-2 AFSR 0x00000000.80200000<PRIV,UE> AFAR 0x00000000.bf41e4f8
Jul 5 07:39:17 E220R-2 AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1041228
Jul 5 07:39:17 E220R-2 UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0203<UE> UDBL.ESYND 0x03
Jul 5 07:39:17 E220R-2 UDBL Syndrome 0x3 Memory Module U1001 U1002 U1003 U1004
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 761793 kern.warning] WARNING: [AFT1] errID 0x0006613d.f77cf492 Syndrome 0x3 indicates that this may not be a memory module problem
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 490361 kern.info] [AFT2] errID 0x0006613d.f77cf492 PA=0x00000000.bf41e4f8
Jul 5 07:39:18 E220R-2 E$tag 0x00000000.0fc017e8 E$State: Modified E$parity 0x07
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x08): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x18): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x00000000.00000000
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x38): 0x00000040.00000000 *Bad* PSYND=0x00ff
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 301547 kern.warning] WARNING: [AFT1] Additional errors detected during error processing on CPU2 at TL=0, errID 0x0006613d.f77cf492
Jul 5 07:39:18 E220R-2 AFSR 0x00000000.008000ff<WP> AFAR 0x00000000.bf41e4f0
Jul 5 07:39:18 E220R-2 AFSR.PSYND 0x00ff(Score 05) AFSR.ETS 0x00 Fault_PC 0x1041228
Jul 5 07:39:18 E220R-2 UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0003 UDBL.ESYND 0x03
Jul 5 07:39:18 E220R-2 SUNW,UltraSPARC-II: [ID 940362 kern.warning] WARNING: [AFT1] AFAR was derived from UE report, CP event on CPU0 (caused Data access error on CPU2), errID 0x0006613d.f77cf492

Am I right in thinking that this is a manifestation of the Ultra-II ecache
problems from years ago? These servers - spare equipment we obtained at
no cost, and thus no support contract - are the only ones of their type
we have. We essentially skipped over that entire generation, going from
the E250 straight to T2000s.

And... as I type this... it's just crashed again.

--
Brandon Hume - hume -> BOFH.Ca,http://WWW.BOFH.Ca/

Looks like a bad/faulty DIMM.

It might look like a DIMM failure, but it probably isn't. The most useful
message of the whole lot is the last one, which basically explains what
happened. However, the system is not sure, as it only puts a score of 5 on
the message.
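
For what it's worth, the flag names shown in angle brackets in the log can be reproduced from the raw AFSR values. Below is a minimal Python sketch, assuming only the bit positions that the log's own annotations imply (PRIV = bit 31, WP = bit 23, UE = bit 21, PSYND = low 16 bits). The name decode_afsr is made up for illustration, and the table is deliberately not a complete AFSR layout.

```python
# Bit positions inferred from the <...> annotations in the log above.
# This is an assumption-based partial table, not a full AFSR map.
AFSR_FLAGS = {31: "PRIV", 23: "WP", 21: "UE"}


def decode_afsr(text):
    """Decode a value printed as '0xHHHHHHHH.LLLLLLLL' into (flags, psynd)."""
    if text.startswith("0x"):
        text = text[2:]
    high, low = text.split(".")
    value = (int(high, 16) << 32) | int(low, 16)
    # Report known flags from the highest bit down, as the log does.
    flags = [
        name
        for bit, name in sorted(AFSR_FLAGS.items(), reverse=True)
        if (value >> bit) & 1
    ]
    psynd = value & 0xFFFF  # parity syndrome lives in the low 16 bits
    return flags, psynd


if __name__ == "__main__":
    for raw in ("0x00000000.80200000", "0x00000000.008000ff"):
        print(raw, decode_afsr(raw))
```

Fed the two AFSR values from the log, this gives PRIV and UE with a zero syndrome for the first report, and WP with PSYND 0x00ff for the second, which matches what the kernel printed.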

My guess would be to replace CPU0. If you can accept another crash, open
the system, swap the CPU0 and CPU2 modules, and see whether the next crash
points to the same CPU, wherever it's placed.
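
To compare crashes before and after such a swap, it may help to tally which CPU the kernel names as the source of the event. Below is a rough Python sketch, assuming the messages keep the same wording as the last AFT1 line in the log above. The function name tally_sources is hypothetical, and the pattern has only been checked against that one line.

```python
import re
from collections import Counter

# Matches the wording of the last AFT1 line in the log, e.g.
# "... CP event on CPU0 (caused Data access error on CPU2) ..."
EVENT_RE = re.compile(r"(\w+) event on CPU(\d+) \(caused .*? on CPU(\d+)\)")


def tally_sources(lines):
    """Count (source CPU, reporting CPU) pairs across AFT1 messages."""
    pairs = Counter()
    for line in lines:
        if "[AFT1]" not in line:
            continue
        match = EVENT_RE.search(line)
        if match:
            source = int(match.group(2))
            reporter = int(match.group(3))
            pairs[(source, reporter)] += 1
    return pairs


if __name__ == "__main__":
    import sys

    # Usage: python tally.py < messages-file
    for (source, reporter), count in sorted(tally_sources(sys.stdin).items()):
        print(f"source CPU{source} -> reported on CPU{reporter}: {count}")
```

If the source number changes from CPU0 to CPU2 after the swap, the fault is following the module. If it stays at CPU0, the slot or something else on the board becomes the more likely suspect.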

Reseating or reordering DIMMs can also be a good way to find the single
faulty one when whole banks are flagged. However, if you can't take the
hit, you're better off just replacing it.

But as previously stated by others, sure looks like those old cache
problems.

/Johan A
