Alphaserver ES40 crash

From: Vimal Upreti (vimal_at_iitcoman.com)
Date: 11/22/03

    Date: Sat, 22 Nov 2003 11:53:51 +0400
    To: tru64-unix-managers@ornl.gov
    
    

    Hi all,
    One of the AlphaServer ES40 systems is crashing intermittently (once
    every one or two days). I checked all the logs but couldn't pinpoint the
    problem, though it looks like a CPU or motherboard fault. The following
    is from the /var/adm/messages file:

    Nov 12 11:16:22 ldcems01 vmunix: Environmental Monitoring Subsystem Configured.
    Nov 12 11:16:35 ldcems01 vmunix: SuperLAT. Copyright 1994 Meridian Technology Corp. All rights reserved.
    Nov 15 08:34:52 ldcems01 vmunix: Machine Check SYSTEM Fatal Abort
    Nov 15 08:34:52 ldcems01 vmunix: Machine check code = 0x100000202
    Nov 15 08:34:52 ldcems01 vmunix: Ibox Status = 0000000000000000
    Nov 15 08:34:52 ldcems01 vmunix: Dcache Status= 0000000000000000
    Nov 15 08:34:52 ldcems01 vmunix: Cbox Address= 0000000000000000
    Nov 15 08:34:52 ldcems01 vmunix: Fill Syndrome 1= 0000000000000000
    Nov 15 08:34:52 ldcems01 vmunix: Fill Syndrome 0= 0000000000000000
    Nov 15 08:34:52 ldcems01 vmunix: Cbox Status = 0000000000000000
    Nov 15 08:34:52 ldcems01 vmunix: EV6 captured status of Bcache mode = 0000000000000000
    Nov 15 08:34:52 ldcems01 vmunix: EV6 Exception Address= fffffc00002e7f60
    Nov 15 08:34:52 ldcems01 vmunix: EV6 Interrupt Enablement and Current Processor mode = 0000007ee0000000
    Nov 15 08:34:52 ldcems01 vmunix: EV6 Interrupt Summary Register = 0000000200000000
    Nov 15 08:34:52 ldcems01 vmunix: EV6 TBmiss or Fault status = 0000000000000000
    Nov 15 08:34:52 ldcems01 vmunix: EV6 PAL Base Address= 0000000000018000
    Nov 15 08:34:52 ldcems01 vmunix: EV6 Ibox control = fffffe001e304396
    Nov 15 08:34:52 ldcems01 vmunix: EV6 Ibox Process_context= 0000000000000000
    Nov 15 08:34:52 ldcems01 vmunix: O/S Summary flag= 0000000000000004
    Nov 15 08:34:52 ldcems01 vmunix: Cchip Base Address (phys)= 00000f01a0000000
    Nov 15 08:34:52 ldcems01 vmunix: Cchip Device Raw Interrupt Request = 4000000000000000
    Nov 15 08:34:52 ldcems01 vmunix: DRIR Register Decode:
    Nov 15 08:34:52 ldcems01 vmunix: Bit 62: Error from Pchip 0
    Nov 15 08:34:53 ldcems01 vmunix: PCI Device Interrupt Mask= 0000000000000000
    Nov 15 08:34:53 ldcems01 vmunix: Cchip Miscellaneous Register= 00000008000000e0
    Nov 15 08:34:54 ldcems01 vmunix: Misc Register Decode:
    Nov 15 08:34:54 ldcems01 vmunix: Bit 5: Interval Timer Intr Pending to CPU 1
    Nov 15 08:34:54 ldcems01 vmunix: Bit 6: Interval Timer Intr Pending to CPU 2
    Nov 15 08:34:54 ldcems01 vmunix: Bit 7: Interval Timer Intr Pending to CPU 3
    Nov 15 08:34:54 ldcems01 vmunix: Bit 35: CChip Rev (Bit<35>)
    Nov 15 08:34:54 ldcems01 vmunix: Cchip Revision: 08
    Nov 15 08:34:54 ldcems01 vmunix: ID of CPU performing read: 00
    Nov 15 08:34:54 ldcems01 vmunix: Pchip 0 Base Address (phys)= 00000f0180000000
    Nov 15 08:34:54 ldcems01 vmunix: Pchip 0 Error Register= 553003c008200400
    Nov 15 08:34:54 ldcems01 vmunix: Pchip Error Register Decode:
    Nov 15 08:34:54 ldcems01 vmunix: Bit 10: Uncorrectable ECC Error
    Nov 15 08:34:54 ldcems01 vmunix: System Address = 0000000003c00820
    Nov 15 08:34:55 ldcems01 vmunix: Command: SGTE Read
    Nov 15 08:34:55 ldcems01 vmunix: ECC Syndrome: 55
    Nov 15 08:34:55 ldcems01 vmunix: Pchip 1 Base Address (phys)= 00000f0380000000
    Nov 15 08:34:55 ldcems01 vmunix: Pchip 1 Error Register= 0000000000000000
    Nov 15 08:34:55 ldcems01 vmunix: Pchip Error Register Decode:
    Nov 15 08:34:55 ldcems01 vmunix: PCI Xaction Start Address= 0000000000000000
    Nov 15 08:34:55 ldcems01 vmunix: PCI Command: Interrupt Acknowledge
    Nov 15 08:34:55 ldcems01 vmunix: panic (cpu 0): System Uncorrectable Machine Check
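
    For anyone cross-checking the decode lines in the log, here is a rough
    Python sketch that re-derives them from the raw register values. The
    field positions (error flags in the low 16 bits, address starting at
    bit 16, command at bit 52, syndrome in the top byte) are only inferred
    from the values the kernel printed above, not taken from chipset
    documentation, and the 35-bit address width is an assumption. The
    function names are made up for illustration.

```python
# Raw register values copied from the machine check log.
PCHIP0_PERROR = 0x553003C008200400
CCHIP_DRIR = 0x4000000000000000
CCHIP_MISC = 0x00000008000000E0


def set_bits(value):
    """Return the positions of all bits set in a 64-bit register value."""
    return [bit for bit in range(64) if (value >> bit) & 1]


def decode_perror(value):
    """Split a Pchip error register into the fields the kernel printed.

    The layout is an assumption inferred from the log, not a documented one.
    """
    return {
        "error_bits": set_bits(value & 0xFFFF),      # low 16 bits: error flags
        "address": (value >> 16) & ((1 << 35) - 1),  # assumed 35-bit address field
        "command": (value >> 52) & 0xF,              # assumed 4-bit command code
        "syndrome": (value >> 56) & 0xFF,            # top byte: ECC syndrome
    }


if __name__ == "__main__":
    fields = decode_perror(PCHIP0_PERROR)
    print("Pchip 0 error bits:", fields["error_bits"])
    print("System address    : %016x" % fields["address"])
    print("Command code      : %x" % fields["command"])
    print("ECC syndrome      : %02x" % fields["syndrome"])
    print("DRIR bits set     :", set_bits(CCHIP_DRIR))
    print("Misc bits set     :", set_bits(CCHIP_MISC))
```

    Under these assumptions the output agrees with the kernel's own decode:
    bit 10 in the Pchip 0 error register, system address 03c00820,
    syndrome 55, DRIR bit 62, and Misc bits 5, 6, 7 and 35. The sketch only
    extracts a numeric command code; the name "SGTE Read" comes from the
    log itself.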

    Could you please look at this log and suggest a solution?

    Thanks & regards.
    Vimal
     

