V240 ECC errors

From: Chris Cameron (Chris.Cameron_at_NetThruPut.com)
Date: 10/27/04

    To: sunmanagers@sunmanagers.org
    Date: Wed, 27 Oct 2004 11:28:35 -0600
    
    

    Have a V240 that isn't happy. We have a hardware support contract
    through a 3rd party (bad idea), and they're insisting that the error
    messages aren't pointing to any given component in the server. Because
    of this they're dragging their feet on doing anything.

    I'm certain the error is in memory, and I'm 99% sure that the error
    message points to (at least) the bank that the error is coming from.

    Could someone give me their interpretation of this information?

    And just to pilfer more information from this post: have many people
    here experienced RAM going bad on its own? This server has been working
    fine for 8 months now and its older V240 brother hasn't had any
    problems.

    Thanks,
    Chris

    SUNWvts fails on memory with the error:

    10/26/04 17:42:43 prod2 SunVTS5.1ps2: VTSID 6002 pmemtest.ERROR mem: "2
    persistent errors on MB/P1/B0/D0: B0/D0."
    10/26/04 17:42:43 prod2 SunVTS5.1ps2: VTSID 7012 vtsk.INFO : *Failed
    test*
     mem(pmemtest) passes: 56 errors: 1
    10/26/04 17:43:23 prod2 SunVTS5.1ps2: VTSID 7005 vtsk.INFO : *Stop all
    tests*
     System Passes: 30, Cumulative Errors: 1, Elapsed Test Time: 000:47:53
     cpu-unit0(iutest) passes: 3493 errors: 0
     cpu-unit0(iutest).1 passes: 3486 errors: 0
     cpu-unit0(fputest) passes: 134 errors: 0
     cpu-unit0(fputest).1 passes: 134 errors: 0
     cpu-unit1(iutest) passes: 3463 errors: 0
     cpu-unit1(iutest).1 passes: 3468 errors: 0
     cpu-unit1(fputest) passes: 135 errors: 0
     cpu-unit1(fputest).1 passes: 135 errors: 0
     kmem(vmemtest) passes: 31 errors: 0
     kmem(vmemtest).1 passes: 30 errors: 0
     mem(pmemtest) passes: 56 errors: 1
     mem(pmemtest).1 passes: 57 errors: 0
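    To pick the failing test out of a longer SunVTS summary, the per-test lines above can be filtered mechanically. The following is only a minimal sketch in Python; the function name `failing_tests` and the regular expression are assumptions based on the line layout shown here, not on any documented SunVTS format.

```python
import re

# Assumed layout, taken from the summary lines in this post:
#   "<device>(<test>)[.<instance>] passes: <n> errors: <m>"
LINE = re.compile(
    r"^[ \t]*([\w-]+\(\w+\)(?:\.\d+)?) passes: (\d+) errors: (\d+)",
    re.MULTILINE,
)

def failing_tests(summary):
    """Return (test, passes, errors) for every test with errors > 0."""
    return [
        (name, int(passes), int(errors))
        for name, passes, errors in LINE.findall(summary)
        if int(errors) > 0
    ]

# A few lines copied from the SunVTS output in this post.
sample = """
 cpu-unit0(iutest) passes: 3493 errors: 0
 kmem(vmemtest) passes: 31 errors: 0
 mem(pmemtest) passes: 56 errors: 1
 mem(pmemtest).1 passes: 57 errors: 0
"""

failed = failing_tests(sample)
print(failed)
```

    On these sample lines only the first pmemtest instance is reported, which matches the "Failed test" notice in the log.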

    A (much) shortened /var/adm/messages:

    Oct 27 08:23:06 prod2 SUNW,UltraSPARC-IIIi: [ID 147594 kern.info]
    NOTICE: [AFT0] Corrected memory (CE) Event detected by CPU1 at TL=0,
    errID 0x0000d080.af559597
    Oct 27 08:23:06 prod2 AFSR 0x00100002<PRIV,CE>.00000051 AFAR
    0x00000012.36eba6a0
    Oct 27 08:23:06 prod2 Fault_PC 0x1009a608 Esynd 0x0051 MB/P1/B0/D0:
    B0/D0
    Oct 27 08:23:06 prod2 SUNW,UltraSPARC-IIIi: [ID 419725 kern.info] [AFT0]
    errID 0x0000d080.af559597 Corrected Memory Error on MB/P1/B0/D0: B0/D0
    is Persistent
    Oct 27 08:23:06 prod2 SUNW,UltraSPARC-IIIi: [ID 557903 kern.info] [AFT0]
    errID 0x0000d080.af559597 Data Bit 44 was in error and corrected
    Oct 27 08:23:06 prod2 unix: [ID 596940 kern.warning] WARNING: [AFT0] 118
    soft errors in less than 24:00 (hh:mm) detected from Memory Module
    MB/P1/B0/D0: B0/D0
    Oct 27 08:23:06 prod2 SUNW,UltraSPARC-IIIi: [ID 548377 kern.info] [AFT2]
    errID 0x0000d080.af559597 PA=0x00000012.36eba680
    Oct 27 08:23:06 prod2 E$tag 0x00000000.16048dba E$state Exclusive
    E$indx 1.000ba680
    Oct 27 08:23:06 prod2 SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2]
    E$Data (0x00) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
    Oct 27 08:23:06 prod2 SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2]
    E$Data (0x10) 0x00907466.030108c0 0x00000000.00000000 ECC 0x04e
    Oct 27 08:23:06 prod2 SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2]
    E$Data (0x20) 0x00000300.02f1c0a8 0x00000310.04b3b820 ECC 0x008
    Oct 27 08:23:06 prod2 SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2]
    E$Data (0x30) 0x00000310.04aba700 0x00000310.04aba640 ECC 0x1b0
    Oct 27 08:23:06 prod2 SUNW,UltraSPARC-IIIi: [ID 929717 kern.info] [AFT2]
    D$ data not available
    Oct 27 08:23:06 prod2 SUNW,UltraSPARC-IIIi: [ID 335345 kern.info] [AFT2]
    I$ data not available
    Oct 27 08:23:12 prod2 SUNW,UltraSPARC-IIIi: [ID 460234 kern.info]
    NOTICE: [AFT0] Corrected memory (CE) Event detected by CPU1 at TL=0,
    errID 0x0000d082.149b28a7
    Oct 27 08:23:12 prod2 AFSR 0x00100002<PRIV,CE>.00000007 AFAR
    0x00000012.36ebb540
    Oct 27 08:23:12 prod2 Fault_PC <unknown> Esynd 0x0007 MB/P1/B0/D0:
    B0/D0
    Oct 27 08:23:12 prod2 SUNW,UltraSPARC-IIIi: [ID 983987 kern.info] [AFT0]
    errID 0x0000d082.149b28a7 Corrected Memory Error on MB/P1/B0/D0: B0/D0
    is Intermittent
    Oct 27 08:23:12 prod2 SUNW,UltraSPARC-IIIi: [ID 779590 kern.info] [AFT0]
    errID 0x0000d082.149b28a7 Data Bit 47 was in error and corrected
    Oct 27 10:28:06 prod2 SUNW,UltraSPARC-IIIi: [ID 776888 kern.info]
    NOTICE: [AFT0] Corrected memory (FRC) Event detected by CPU1 at TL=0,
    errID 0x0000d752.e47752a2
    Oct 27 10:28:06 prod2 AFSR 0x00000000.10000051<FRC> AFAR
    0x00000012.36ebb540 INVALID
    Oct 27 10:28:06 prod2 Fault_PC 0x100456e8 Esynd 0x0051 J_AID 0
    Oct 27 10:28:06 prod2 SUNW,UltraSPARC-IIIi: [ID 178586 kern.info] [AFT0]
    errID 0x0000d752.e47752a2 Data Bit 44 was in error and corrected
    Oct 27 10:28:06 prod2 SUNW,UltraSPARC-IIIi: [ID 776888 kern.info]
    NOTICE: [AFT0] Corrected memory (FRC) Event detected by CPU1 at TL=0,
    errID 0x0000d752.e47752a2
    Oct 27 10:28:06 prod2 AFSR 0x00000000.10000051<FRC> AFAR
    0x00000012.36ebb540 INVALID
    Oct 27 10:28:06 prod2 Fault_PC 0x100456e8 Esynd 0x0051 J_AID 0
    Oct 27 10:28:06 prod2 SUNW,UltraSPARC-IIIi: [ID 178586 kern.info] [AFT0]
    errID 0x0000d752.e47752a2 Data Bit 44 was in error and corrected
    Oct 27 10:28:06 prod2 SUNW,UltraSPARC-IIIi: [ID 524728 kern.info]
    NOTICE: [AFT0] Corrected remote memory/cache (RCE) Event detected by
    CPU0 at TL=0, errID 0x0000d752.e4776390
    Oct 27 10:28:06 prod2 AFSR 0x00100000<PRIV>.81000000<RCE> AFAR
    0x00000012.36eba6a0
    Oct 27 10:28:06 prod2 Fault_PC 0x1009a608 J_REQ 1
    Oct 27 10:28:06 prod2 MB/P1/B0: B0/D0 B0/D1 (applicable only if
    corresponding FRC Event also logged)
    Oct 27 10:28:06 prod2 SUNW,UltraSPARC-IIIi: [ID 470073 kern.info] [AFT2]
    errID 0x0000d752.e4776390 PA=0x00000012.36eba680
    Oct 27 10:28:06 prod2 E$tag 0x00000000.16048dba E$state Exclusive
    E$indx 0.000ba680
    Oct 27 10:28:06 prod2 SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2]
    E$Data (0x00) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
    Oct 27 10:28:06 prod2 SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2]
    E$Data (0x10) 0x00907466.030108c0 0x00000000.00000000 ECC 0x04e
    Oct 27 10:28:06 prod2 SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2]
    E$Data (0x20) 0x00000300.02f1c0a8 0x00000310.04b3b820 ECC 0x008
    Oct 27 10:28:06 prod2 SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2]
    E$Data (0x30) 0x00000310.04aba700 0x00000310.04aba640 ECC 0x1b0
    Oct 27 10:28:06 prod2 SUNW,UltraSPARC-IIIi: [ID 929717 kern.info] [AFT2]
    D$ data not available
    Oct 27 10:28:06 prod2 SUNW,UltraSPARC-IIIi: [ID 335345 kern.info] [AFT2]
    I$ data not available
    Oct 27 10:28:12 prod2 SUNW,UltraSPARC-IIIi: [ID 416787 kern.info]
    NOTICE: [AFT0] Corrected memory (FRC) Event detected by CPU1 at TL=0,
    errID 0x0000d754.4a13c207
    Oct 27 10:28:12 prod2 AFSR 0x00000000.10000007<FRC> AFAR
    0x00000012.36ebb540 INVALID
    Oct 27 10:28:12 prod2 Fault_PC <unknown> Esynd 0x0007 J_AID 0
    Oct 27 10:28:12 prod2 SUNW,UltraSPARC-IIIi: [ID 696726 kern.info] [AFT0]
    errID 0x0000d754.4a13c207 Data Bit 47 was in error and corrected
    Oct 27 10:28:22 prod2 SUNW,UltraSPARC-IIIi: [ID 717637 kern.info]
    NOTICE: [AFT0] Corrected remote memory/cache (RCE) Event detected by
    CPU0 at TL=0, errID 0x0000d756.b7d82b5c
    Oct 27 10:28:22 prod2 AFSR 0x00100000<PRIV>.81000000<RCE> AFAR
    0x00000012.36ebb540
    Oct 27 10:28:22 prod2 Fault_PC <unknown> J_REQ 1
    Oct 27 10:28:22 prod2 MB/P1/B0: B0/D0 B0/D1 (applicable only if
    corresponding FRC Event also logged)
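    When arguing with a support vendor, it can help to show how often each module label and data bit turns up in the full /var/adm/messages rather than in a shortened excerpt. The following is a minimal sketch in Python under stated assumptions: the function name `tally` is hypothetical, the patterns are taken only from the message wording shown in this post, and whitespace is flattened first because the lines here were wrapped by the mail client.

```python
import re
from collections import Counter

def tally(text):
    """Count module labels and corrected data bits in AFT0 messages."""
    # Undo mail-client line wrapping so each message reads as one run of text.
    flat = " ".join(text.split())
    # Label format "MB/Pn/Bn/Dn" is assumed from the messages in this post.
    dimms = Counter(
        re.findall(r"Corrected Memory Error on (MB/P\d+/B\d+/D\d+)", flat)
    )
    bits = Counter(
        int(b) for b in re.findall(r"Data Bit (\d+) was in error", flat)
    )
    return dimms, bits

# Lines copied from the excerpt in this post.
sample = """
Oct 27 08:23:06 prod2 SUNW,UltraSPARC-IIIi: [ID 419725 kern.info] [AFT0]
errID 0x0000d080.af559597 Corrected Memory Error on MB/P1/B0/D0: B0/D0
is Persistent
Oct 27 08:23:06 prod2 SUNW,UltraSPARC-IIIi: [ID 557903 kern.info] [AFT0]
errID 0x0000d080.af559597 Data Bit 44 was in error and corrected
Oct 27 08:23:12 prod2 SUNW,UltraSPARC-IIIi: [ID 983987 kern.info] [AFT0]
errID 0x0000d082.149b28a7 Corrected Memory Error on MB/P1/B0/D0: B0/D0
is Intermittent
Oct 27 08:23:12 prod2 SUNW,UltraSPARC-IIIi: [ID 779590 kern.info] [AFT0]
errID 0x0000d082.149b28a7 Data Bit 47 was in error and corrected
"""

dimms, bits = tally(sample)
print(dimms)
print(bits)
```

    A tally like this says nothing about root cause by itself; it only summarizes which label the kernel names in the messages and which bits it reports as corrected.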

