UPDATE: Is this OS or Hardware?

From: Michael W (mikew_at_pvbb.net)
Date: 08/23/04

  • Next message: Dave Sill: "Administrivia: Tru64-UNIX-Managers information and policy statement"
    Date: Mon, 23 Aug 2004 09:48:40 -0700
    To: tru64-unix-managers@ornl.gov
    
    

    I have memexer_mp running on SRM on that server right now, with no errors so far:

    ID       Program      Device       Pass   Hard/Soft Bytes Written Bytes Read
    -------- ------------ ------------ ------ --------- ------------- -------------
    00000001         idle system            0    0    0             0             0
    000002d3      memtest memory         1527    0    0   12801015808   12801015808
    000002dd      memtest memory         1373    0    0   11509170176   11509170176
    000002e7      memtest memory         1370    0    0   11484004352   11484004352
    000002f1      memtest memory         1373    0    0   11509170176   11509170176

    Testing the CPU resulted in this, though:
    EV6 Correctable Dcache ECC Error on CPU 0
    EV6 Correctable Memory Fill ECC Error on CPU 0
    C_ADDR:         0000000028809E80
    C_SYNDROME_1:   0000000000000000
    C_SYNDROME_0:   00000000000000D3
    Bad CPU?
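
The syndrome D3 reported by the CPU test above also appears in the Pchip 1 Error Register values in the quoted log below. As a rough cross-check, the sketch below splits those two raw register values into the fields the log prints next to them. The function name `split_pchip_error` is hypothetical, and the field positions are an assumption inferred from the decode lines in the log itself, not taken from the chipset documentation.

```python
def split_pchip_error(value):
    """Split a 64-bit Pchip Error Register value into the fields the log prints.

    ASSUMPTION: bit positions are inferred from the decode shown in the
    quoted machine-check log, not from the chipset manual.
    """
    return {
        "lost_error": bool(value & (1 << 0)),          # log: "Bit 0: Lost Error"
        "correctable_ecc": bool(value & (1 << 11)),    # log: "Bit 11: Correctable ECC Error"
        "system_address": (value >> 16) & 0xFFFFFFFF,  # log: "System Address"
        "ecc_syndrome": (value >> 56) & 0xFF,          # log: "ECC Syndrome"
    }


# The two Pchip 1 Error Register values from the quoted log.
for raw in (0xD300BD54F6200801, 0xD300BD54FD200801):
    fields = split_pchip_error(raw)
    print(
        f"{raw:016x}: "
        f"lost={fields['lost_error']} "
        f"ecc={fields['correctable_ecc']} "
        f"address={fields['system_address']:016x} "
        f"syndrome={fields['ecc_syndrome']:02x}"
    )
```

Under that assumed layout, both values give syndrome d3 and the same addresses the log prints (00000000bd54f620 and 00000000bd54fd20).
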
    > We just put this ES40 into prod on Saturday night and now it has shut itself
    > down 3 times since then.  Does this look like software or hardware?
    >
    >
    >
    > WARNING: too many Processor corrected errors detected on cpu 0. Reporting suspended.
    > WARNING: too many Processor corrected errors detected on cpu 1. Reporting suspended.
    > WARNING: too many Processor corrected errors detected on cpu 2. Reporting suspended.
    > WARNING: too many Processor corrected errors detected on cpu 3. Reporting suspended.
    > Machine Check Processor Fatal Abort
    > Machine check code = 0x100000098
    >         Ibox Status                             = 0000000000000000
    >         Dcache Status                           = 000000000000001c
    >         Cbox Address                            = 000000002112b580
    >         Fill Syndrome 1                         = 0000000000000000
    >         Fill Syndrome 0                         = 00000000000000d3
    >         Cbox Status                             = 0000000000000003
    >         EV6 captured status of Bcache mode      = 000000000000000d
    >         EV6 Exception Address                   = fffffc000066a298
    >         EV6 Interrupt Enablement and Current Processor mode = 0000007ee0000000
    >         EV6 Interrupt Summary Register          = 0000000080000000
    >         EV6 TBmiss or Fault status              = 0000000000000290
    >         EV6 PAL Base Address                    = 0000000000018000
    >         EV6 Ibox control                        = fffffe0007304396
    >         EV6 Ibox Process_context                = 0000748000000004
    >         O/S Summary flag                        = 0000000000000004
    >         Cchip Base Address (phys)               = 00000f01a0000000
    >         Cchip Device Raw Interrupt Request      = 0000000000000000
    >             DRIR Register Decode:
    >                 Machine Check SYSTEM Fatal Abort
    > Machine check code = 0x100000202
    >         Ibox Status                             = 0000000000000000
    >         Dcache Status                           = 0000000000000000
    >         Cbox Address                            = 0000000000000000
    >         Fill Syndrome 1                         = 0000000000000000
    >         Fill Syndrome 0                         = 0000000000000000
    >         Cbox Status                             = 0000000000000000
    >         EV6 captured status of Bcache mode      = 0000000000000000
    >         EV6 Exception Address                   = fffffc00008cd140
    >         EV6 Interrupt Enablement and Current Processor mode = 00000062e0000000
    >         EV6 Interrupt Summary Register          = 0000000200000000
    >         EV6 TBmiss or Fault status              = 0000000000000000
    >         EV6 PAL Base Address                    = 0000000000018000
    >         EV6 Ibox control                        = fffffe000f304396
    >         EV6 Ibox Process_context                = 0000000000000000
    >         O/S Summary flag                        = 0000000000000006
    >         Cchip Base Address (phys)               = 00000f01a0000000
    >         Cchip Device Raw Interrupt Request      = 2000000000000000
    >             DRIR Register Decode:
    >                 Bit 61: Error from Pchip 1
    >                 PCI Device Interrupt Mask       = 0000000000000000
    >         Cchip Miscellaneous Register            = 0000000800000030
    >             Misc Register Decode:
    >                 Bit 4: Interval Timer Intr Pending to CPU 0
    >                 Bit 5: Interval Timer Intr Pending to CPU 1
    >                 Bit 35: CChip Rev (Bit<35>)
    >                 Cchip Revision: 08
    >                 ID of CPU performing read: 00
    >         Pchip 0 Base Address (phys)             = 00000f0180000000
    >         Pchip 0 Error Register                  = 0000000000000000
    >             Pchip Error Register Decode:
    >                 PCI Xaction Start Address       = 0000000000000000
    >                 PCI Command: Interrupt Acknowledge
    >         Pchip 1 Base Address (phys)             = 00000f0380000000
    >         Pchip 1 Error Register                  = d300bd54f6200801
    >             Pchip Error Register Decode:
    >                 Bit 0: Lost Error
    >                 Bit 11: Correctable ECC Error
    >                 System Address          = 00000000bd54f620
    >                 Command: DMA Read
    >                 ECC Syndrome: d3
    > panic (cpu 0): System Uncorrectable Machine Check
    > Machine Check SYSTEM Fatal Abort
    > Machine check code = 0x100000202
    >         Ibox Status                             = 0000000000000000
    >         Dcache Status                           = 0000000000000000
    >         Cbox Address                            = 0000000000000000
    >         Fill Syndrome 1                         = 0000000000000000
    >         Fill Syndrome 0                         = 0000000000000000
    >         Cbox Status                             = 0000000000000000
    >         EV6 captured status of Bcache mode      = 0000000000000000
    >         EV6 Exception Address                   = fffffc00006ae004
    >         EV6 Interrupt Enablement and Current Processor mode = 00000062e0000000
    >         EV6 Interrupt Summary Register          = 0000000200000000
    >         EV6 TBmiss or Fault status              = 0000000000000000
    >         EV6 PAL Base Address                    = 0000000000018000
    >         EV6 Ibox control                        = fffffe000f304396
    >         EV6 Ibox Process_context                = 0000000000000000
    >         O/S Summary flag                        = 0000000000000006
    >         Cchip Base Address (phys)               = 00000f01a0000000
    >         Cchip Device Raw Interrupt Request      = 2000000000000000
    >             DRIR Register Decode:
    >                 Bit 61: Error from Pchip 1
    >                 PCI Device Interrupt Mask       = 0000000000000000
    >         Cchip Miscellaneous Register            = 0000000800000ff0
    >             Misc Register Decode:
    >                 Bit 4: Interval Timer Intr Pending to CPU 0
    >                 Bit 5: Interval Timer Intr Pending to CPU 1
    >                 Bit 6: Interval Timer Intr Pending to CPU 2
    >                 Bit 7: Interval Timer Intr Pending to CPU 3
    >                 Bit 8: Interprocessor Intr Pending to CPU 0
    >                 Bit 9: Interprocessor Intr Pending to CPU 1
    >                 Bit 10: Interprocessor Intr Pending to CPU 2
    >                 Bit 11: Interprocessor Intr Pending to CPU 3
    >                 Bit 35: CChip Rev (Bit<35>)
    >                 Cchip Revision: 08
    >                 ID of CPU performing read: 00
    >         Pchip 0 Base Address (phys)             = 00000f0180000000
    >         Pchip 0 Error Register                  = 0000000000000000
    >             Pchip Error Register Decode:
    >                 PCI Xaction Start Address       = 0000000000000000
    >                 PCI Command: Interrupt Acknowledge
    >         Pchip 1 Base Address (phys)             = 00000f0380000000
    >         Pchip 1 Error Register                  = d300bd54fd200801
    >             Pchip Error Register Decode:
    >                 Bit 0: Lost Error
    >                 Bit 11: Correctable ECC Error
    >                 System Address          = 00000000bd54fd20
    >                 Command: DMA Read
    >                 ECC Syndrome: d3
    >
    > DUMP: blocks available:  1983962
    > DUMP: blocks wanted:      930642 (partial compressed dump) [OKAY]
    > DUMP: Device     Disk Blocks Available
    > DUMP: ------     ---------------------
    > DUMP: 0x1300013  122678 - 1983959 (of 1983960) [primary swap]
    > DUMP.prom: Open: dev 0x5100001, block 786432: SCSI 1 3 0 3 300 0 0
    > DUMP: Writing header... [1024 bytes at dev 0x1300013, block 1983960]
    > esMP: Writing data..Machine Check Proc
    >   soErV F6 atCoalrr Aecbortt
    > lMea chDicneac chehe EckCC c Eodrre or= 0 x1on00 C00PU00 198
    >
    > ta      Ibox S
    >   tEusV6                 C              or= re00c0t00ab00le00 M00em00or00y
    0
    > l       Dlca chECe C StEarturos r               on      =  C00PU00 100
    > Fi
    > 000000001Cc_
    > cD      DCR:bo  x  A dd re  ss                  00              00=
    > 00000000000000000740e8057
    >  80
    >         FiCll_S SYNynDRdrOomMEe _11     :                         =
    > 00000000000000000000000000000000
    >
    >         Fill SCyn_SdrYNomDRe OM0        E_              0:      =   0
    > 00000000000000000000000000d30
    > Cb
    > D
    > usox Stat
    >                         EV      =6  0Co00r00r0e00c0t00ab00l0e03
    > ac      EcVh6 e caECptC urEedr rostr atonus C oPUf  B3c         D
    > = he mode
    >   0E00V600 C00or00re00ct00a0b00le
    >         MEVe6m Eorxcy epFitillon  EAdCCdr Eesrsr                o       r=
    > ffofnff Cc0PU00 306
    > abf8c
    > Pr      CE_V6AD IDRnt:e rr  up  t   En  ab0l0em0en00t 0a0nd00
    C00ur0r7en48t
    > 0
    >   ocessor Cmo_deS =YN 0DR00OM00E0_621:e0 0 00 000000
    > u       00EV006 00In00te00rr                        00
    >  pt SummaCry_S RYeNgiDRstOMerE_         0=: 0  00 000000000080000000000000
    > 0       EVD6 3TB
    00
    > auss or F
    >   Elt Vst6 atCousrr             e=c 0t0a00bl00e 00Dc00ac00h0e28 E0
    > C        EVE6 rPArLo Bra seo nAd CdrPUes 2s                       C
    > 0               = 000000
    >  00EV0061 80Co00rr
    > ec      tEVa6 blIbe oxMe cmoonrytr oFl  il              l = ECffCf
    > ffEre0ro00r f3on04 C39PU6
    >
    > 2
    >         EV6 Ibox CPr_ocADesDRs_:c on  te  xt            =
    > 0000000000000000000000000074008
    >
    > 0
    > O/S SummCar_yS fYNlaDRg OM              E_= 10:00 0 00 0000000000000000004
    > Ba      C0ch00ip0
    00
    >   se AddreCs_sSY (NDphROysME)   _       0= :00 0 00 0f0010a0000000000000
    >  D      C0ch0Dip3                                                       00
    >   evice Raw Interrupt Request   = 0000000000000000
    > :           DRIR Register Decode
    >
    > E       V       P6C I CoDerrvicee ctInabtelerr uDptc aMchase k  E=C 0C
    > 00Er00ro00r 00o0n00 C00PU00 2
    >         C
    > e        chip Misc
    >  llEVan6 eoCousrr Recegtiastbelr        e       =M e00mo00ry00 F00il000l00
    > E00C0
    >  D      E  r r Moris oc n ReCgPisU te2r
    > C
    >   ecode:
    > C       _       CADchDRip:  R  ev i si  on  : 000000
    > r       0       I00D 00of1 CCPC0U C0pe
    >  forming Cre_SadY:N 0DR0
    > )       EP_1ch: ip   00 0Ba00se0 0Ad00dr0e0s0s 0(0ph00ys0
    >                 = 00000C_f0SY18ND00RO00ME00_00
    > r         Pc0h0ip00 000 E00rr00or00 R00egDi3ste
    >                         = 0000000000000000
    >             Pchip Error Register Decode:
    >                 PCI Xaction Start Address       = 0000000000000000
    >                 PCI Command: Interrupt Acknowledge
    >         Pchip 1 Base Address (phys)             = 00000f0380000000
    > 00      Pchip 1 Error Register                  = 000000
    >   E00V600 C00or00re
    > c       t  a b lPceh ipDc Earcroher  EReCCgi Estrerr orDe ocon deCP:
    > 3                                                                   U
    > ioI Xact
    >   En V6St Carort reAdctdrabeslse        = M 0em00o0r00y 00F0i00ll00 E00CC0
    > E               rPCroI r Coomnma CndPU:  I3nt
    > errupt ACck_AnoDDwlR:ed  ge
    >
    > D UM  P:0 0fi00rs00t 0c0ra00sh00 d76um8p0 f
    > 00led: atCt_emSYptNDinROg MmEem_1or: y   du00mp00..00.
    > 00000000
    > C_SYNDROME_0:   00000000000000D3
    >
    > EV6 Correctable Dcache ECC Error on CPU 2
    >
    > EV6 CorDrUMeP:ct caobmplere Msseminorg y9 30Fi64ll0K BE iCCnt Eo r76ro30r
    > 73on5K CB PUme 2mo
    > ry...
    > CDU_AMPDD: R S: ta r ti  ng   A d dr00es00s 00  00  00 E00nd7in4g80 A
    > Edress  C S_SizYNe(DRMBOM)
    > D1:UMP :   --00--00--00--00--00--00--0-0--00-
    >   -------C--_S--YN--DR--O-M--E_ -0:-- - -- 0--00
    > D00UM00P:0 00x00ff00ffD3fc
    > 00081f1c0
    > o - E0xV6ff Cffofrc0re03ctffabfflfeef D 8ca94c.h0 e (iECndC icEratroorr )
    > D UCMPP:U 0 3xf
    > f5ffc01f
    >   cE00V600  C- o0rxfreffctffabc0le1f Mffeem3foerf y10 F.1il (li ndECicaCto
    > Er)rr
    > owc om0n:  LCPinU k 3d
    >   n
    > C_ADDR:         00000000000070C0
    > C_SYNDROME_1:   0000000000000000
    > C_SYNDROME_0
    >
    >
    >
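
The "Bit N" decode lines in the quoted log can be checked against the raw register values by listing which bits are set. This is only a minimal sketch: the helper name `set_bits` is hypothetical, and it reports bit positions only; the meaning of each bit comes from the log's own decode lines.

```python
def set_bits(value, width=64):
    """Return the positions of the bits set in value, lowest first."""
    return [bit for bit in range(width) if (value >> bit) & 1]


# Raw register values copied from the quoted machine-check log.
registers = {
    "Cchip Device Raw Interrupt Request": 0x2000000000000000,
    "Cchip Miscellaneous Register (first system check)": 0x0000000800000030,
    "Cchip Miscellaneous Register (at panic)": 0x0000000800000FF0,
}

for name, raw in registers.items():
    print(f"{name}: {raw:016x} -> bits {set_bits(raw)}")
```

The bit lists should line up with the decode in the log: bit 61 for the interrupt request register, bits 4, 5 and 35 for the first Miscellaneous Register value, and bits 4 through 11 plus 35 for the value at panic.
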
    
