SUMMARY: V880 crashes

From: Grzegorz Bakalarski (G.Bakalarski_at_icm.edu.pl)
Date: 11/11/04

    Date: Thu, 11 Nov 2004 12:48:39 +0100
    To: sunmanagers@sunmanagers.org
    
    

    Dear ALL

    Problem solved (I hope).
    I've got 15 answers (most of them during Saturday afternoon and Sunday ;-) ).
    Thanks to all - all your input was very helpful in solving the problem.
    Almost everyone suggested it was a hardware problem. So on Monday I opened
    a ticket with the local reseller's tech support. I provided extended
    logs to them. The next day I got a call from SUN, and SUN's engineer
    came and replaced two memory DIMMs (J2900 & J8100).
    In addition to what is written in the full summary (see below)
    I found out the following:
    * the best way to diagnose such errors is from OBP (Solaris is
      multithreaded and memory is interleaved, so from the OS it is sometimes
      very hard to find the right DIMM)
    * single-bit errors are (usually) no problem and even two-bit errors can
      be cured (I had two-bit errors on the J8100 DIMM). Errors of more than
      two bits are usually fatal (I had four-bit errors on the J2900 DIMM) -
      especially at low addresses (where the kernel is loaded)
    * to get better diagnostics one needs to set these OBP variables:
      setenv diag-switch? true
      setenv diag-level max
      or even turn the key switch to the diagnostic position.
    * it may help to log console messages to a file (from a serial console
      and xterm using script; or with hyperterm with logging to a file).
      It is a good idea to leave console logging on permanently in such cases -
      this may help to catch the right info.
    * in an urgent case one could just take the faulty board out of the machine
      and let it run in a smaller configuration. Another temporary
      workaround may be to use the .asr commands (OBP) in order to disable
      particular dimms or cpus
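
    To illustrate the points about bit counts and DIMM identification, here
    is a minimal Python sketch (not part of the original thread) that groups
    the [AFT1] lines of /var/adm/messages by errID and reports the bit count
    and the DIMM bank named on the nearby Fault_PC line. The function names
    and regular expressions are assumptions derived only from the log excerpt
    quoted further down in this message. As noted in the list, the bank the
    OS reports only narrows the search and may not contain the faulty DIMM.

```python
import re

# Patterns are assumptions based only on the log excerpt in this message.
ERRID_RE = re.compile(r"errID (0x[0-9a-f]+\.[0-9a-f]+)")
BITS_RE = re.compile(r"errID \S+ (.+?) Bits? w(?:ere|as) in error")
SLOT_RE = re.compile(r"Slot [A-Z]: ((?:J\d+\s*)+)")


def summarize_aft(lines):
    """Map each errID to the reported bit count and candidate DIMM bank."""
    events = {}
    current = None  # errID of the most recent event seen
    for line in lines:
        m = ERRID_RE.search(line)
        if m:
            current = m.group(1)
            events.setdefault(current, {"bits": None, "dimms": []})
        b = BITS_RE.search(line)
        if b and current is not None:
            events[current]["bits"] = b.group(1)
        s = SLOT_RE.search(line)
        if s and current is not None:
            # The Fault_PC line carries no errID, so attach it to the last one.
            events[current]["dimms"] = s.group(1).split()
    return events


def rule_of_thumb(bits):
    """Rule of thumb from the list above: up to two bits may be survivable."""
    if bits is None:
        return "unknown"
    # "One" is an assumed wording; only "Two", "Three" and
    # "More than four" appear in the excerpt.
    return "may be recoverable" if bits in ("One", "Two") else "usually fatal"


if __name__ == "__main__":
    # Three lines taken from the second crash in the log excerpt.
    sample = [
        "WARNING: [AFT1] DUE Event detected by CPU0 at TL=0, errID 0x00008286.8959bc20",
        "Fault_PC 0x117bb00 Esynd 0x00e2 Slot A: J8100 J8101 J8201 J8200",
        "[AFT1] errID 0x00008286.8959bc20 Two Bits were in error",
    ]
    for err_id, info in summarize_aft(sample).items():
        print(err_id, info["bits"], info["dimms"], rule_of_thumb(info["bits"]))
```

    In practice one would feed it the lines of a saved messages file or a
    console log; the sketch does no more than collect what the log already
    says.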

    Again BIG thanks to all. All the best!

    Grzegorz

    P.S. Original query and full summary follows....

    --------------------- Original Query --------------------------
    Dear Gurus

    Our production server, a Sun Fire V880 (6x900MHz, 12GB, Solaris 9),
    crashed twice during the last 48 hours. The first time it panicked and
    successfully rebooted itself. The second time it panicked and died
    (I had to power the machine off and on).

    Could anyone tell what the problem is? Is it hardware or software?
    Might the recommended patches help somehow?

    On the other hand, I started the machine in diagnostic mode and there
    were no errors. Also, prtdiag does not show any failures.

    The machine is 2 years old, so it is still under hardware warranty ...
    What is strange is that the events occurred when the load was low (less
    than 1; during the daytime the load can be up to 40).

    Thanks for any help

    Grzegorz

    >info from /var/adm/messages
    ====================================== 1 ===============================
    Nov 4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 360866 kern.warning] WARNING: [AFT1] EDU:ST Event detected by CPU0 at TL=0, errID 0x0030e2ee.7ce2a5e4
    Nov 4 21:00:01 v880_sol9 AFSR 0x00000008<EDU>.00000152 AFAR 0x000000a0.3db88550
    Nov 4 21:00:01 v880_sol9 Fault_PC 0x1177184 Esynd 0x0152
    Nov 4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 360866 kern.warning] WARNING: [AFT1] EDU:ST Event detected by CPU0 at TL=0, errID 0x0030e2ee.7ce2a5e4
    Nov 4 21:00:01 v880_sol9 AFSR 0x00000008<EDU>.00000152 AFAR 0x000000a0.3db88550
    Nov 4 21:00:01 v880_sol9 Fault_PC 0x1177184 Esynd 0x0152
    Nov 4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 606810 kern.notice] [AFT1] errID 0x0030e2ee.7ce2a5e4 More than four Bits were in error
    Nov 4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 465517 kern.info] [AFT2] errID 0x0030e2ee.7ce2a5e4 PA=0x000000a0.3db88540
    Nov 4 21:00:01 v880_sol9 E$tag 0x00000280.f6020000 E$state_5 Modified
    Nov 4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0x00000300.08f1f440 0x00000000.00000000 ECC 0x0a3
    Nov 4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 819380 kern.info] [AFT2] E$Data (0x10) 0x07000000.00000000 0xf0ff0fff.ffffffff ECC 0x100 *Bad* Esynd=0x152
    Nov 4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0x80000002.00000000 0x00000000.00000000 ECC 0x099
    Nov 4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 819380 kern.info] [AFT2] E$Data (0x30) 0xffffffff.00000000 0x01002000.00000000 ECC 0x1d5 *Bad* Esynd=0x071
    Nov 4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 929717 kern.info] [AFT2] D$ data not available
    Nov 4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 335345 kern.info] [AFT2] I$ data not available
    Nov 4 21:00:01 v880_sol9 unix: [ID 321153 kern.notice] NOTICE: Scheduling clearing of error on page 0x000000a0.3db88000
    Nov 4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 209006 kern.warning] WARNING: [AFT1] DUE Event detected by CPU0 at TL=0, errID 0x0030e2ee.7ce1b288
    Nov 4 21:00:01 v880_sol9 AFSR 0x00500000<DUE,PRIV>.00000152 AFAR 0x000000a0.3db88550
    Nov 4 21:00:01 v880_sol9 Fault_PC 0x1035ec4 Esynd 0x0152 Slot A: J7900 J7901 J8001 J8000
    Nov 4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 673850 kern.notice] [AFT1] errID 0x0030e2ee.7ce1b288 More than four Bits were in error
    Nov 4 21:00:01 v880_sol9 SUNW,UltraSPARC-III+: [ID 630565 kern.warning] WARNING: [AFT1] Uncorrectable system bus (UE) Event detected by CPU3 Privileged Data Access at TL=0, errID 0x0030e2ee.7ce38f18
    Nov 4 21:00:01 v880_sol9 AFSR 0x00100004<PRIV,UE>.000000b6 AFAR 0x000000a0.2e5ea340
    Nov 4 21:00:01 v880_sol9 Fault_PC 0x1090154 Esynd 0x00b6 Slot A: J7900 J7901 J8001 J8000
    Nov 4 21:00:02 v880_sol9 SUNW,UltraSPARC-III+: [ID 196182 kern.notice] [AFT1] errID 0x0030e2ee.7ce38f18 Three Bits were in error
    Nov 4 21:00:02 v880_sol9 SUNW,UltraSPARC-III+: [ID 828748 kern.info] [AFT2] errID 0x0030e2ee.7ce38f18 PA=0x000000a0.2e5ea340
    Nov 4 21:00:02 v880_sol9 E$tag 0x00000280.b9010000 E$state_5 Exclusive
    Nov 4 21:00:02 v880_sol9 SUNW,UltraSPARC-III+: [ID 819380 kern.info] [AFT2] E$Data (0x00) 0x00000318.7da154b0 0x0c007000.00000000 ECC 0x07a *Bad* Esynd=0x0b6
    Nov 4 21:00:02 v880_sol9 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0x00000700.10c6f5d0 0x03007300.3c50a108 ECC 0x074
    Nov 4 21:00:02 v880_sol9 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0x00000300.09542328 0x03007002.baddcafe ECC 0x1b9
    Nov 4 21:00:02 v880_sol9 SUNW,UltraSPARC-III+: [ID 819380 kern.info] [AFT2] E$Data (0x30) 0x00000000.00000000 0x03006332.15542280 ECC 0x0a9 *Bad* Esynd=0x149
    Nov 4 21:00:02 v880_sol9 SUNW,UltraSPARC-III+: [ID 929717 kern.info] [AFT2] D$ data not available
    Nov 4 21:00:02 v880_sol9 unix: [ID 836849 kern.notice]
    Nov 4 21:00:02 v880_sol9 ^Mpanic[cpu3]/thread=30003671520:
    Nov 4 21:00:02 v880_sol9 unix: [ID 640582 kern.notice] [AFT1] errID 0x0030e2ee.7ce38f18 UE Error(s)
    Nov 4 21:00:02 v880_sol9 See previous message(s) for details
    Nov 4 21:00:02 v880_sol9 unix: [ID 100000 kern.notice]
    Nov 4 21:00:02 v880_sol9 genunix: [ID 723222 kern.notice] 000002a1004969c0 SUNW,UltraSPARC-III+:cpu_aflt_log+5c0 (2a100496acb, 1, 2a100496cd8, 10, 117d180, 117d1a8)
    Nov 4 21:00:02 v880_sol9 genunix: [ID 179002 kern.notice] %l0-3: 0000000001222d04 0000000000000010 0000000000000003 000002a100496cd8
    Nov 4 21:00:02 v880_sol9 %l4-7: 000000a02e5ea340 0000000000000000 000002a100496c08 000002a100496a7e
    Nov 4 21:00:02 v880_sol9 genunix: [ID 723222 kern.notice] 000002a100496c10 SUNW,UltraSPARC-III+:cpu_deferred_error+4d4 (0, 1, 40100004032000b6, 40100004, a0, 6bc)
    Nov 4 21:00:02 v880_sol9 genunix: [ID 179002 kern.notice] %l0-3: 000002a100496cd8 0000000400000000 40100004032000b6 000003000367d928
    Nov 4 21:00:02 v880_sol9 %l4-7: 0000000000000001 000002a100497220 0000030000010300 0000000080000000
    Nov 4 21:00:02 v880_sol9 genunix: [ID 723222 kern.notice] 000002a100497170 unix:ktl0+48 (30002f1b298, 0, 20, 0, 7092c300, 0)
    Nov 4 21:00:03 v880_sol9 genunix: [ID 179002 kern.notice] %l0-3: 0000000000000005 0000000000001400 0000000080001604 0000000001171800
    Nov 4 21:00:03 v880_sol9 %l4-7: 0000000001446800 0000000001410478 0000000000000000 000002a100497220
    Nov 4 21:00:03 v880_sol9 genunix: [ID 723222 kern.notice] 000002a1004972c0 genunix:dnlc_purge_vfsp+8c (30002f1b298, 2a100497370, 144f400, 1495000, 2a100497440, 2a100497446)
    Nov 4 21:00:03 v880_sol9 genunix: [ID 179002 kern.notice] %l0-3: 00000300093ae008 0000000000000000 00000301f50e6540 0000000000000000
    Nov 4 21:00:03 v880_sol9 %l4-7: 0000000000000000 0000030002f1b288 0000030008b665b0 0000000001443ee0
    Nov 4 21:00:03 v880_sol9 genunix: [ID 723222 kern.notice] 000002a1004973b0 genunix:dounmount+c (30008b665b0, 0, 300003a5f28, 0, 30003671520, 0)
    Nov 4 21:00:03 v880_sol9 genunix: [ID 179002 kern.notice] %l0-3: 000003003cc565c0 000003000396d5e0 0000000000000000 0000000000000000
    Nov 4 21:00:03 v880_sol9 %l4-7: 000003000b3be100 0000030009387ab0 000003000b3be182 0000030009387b08
    Nov 4 21:00:03 v880_sol9 genunix: [ID 723222 kern.notice] 000002a100497460 namefs:nm_umountall+a8 (781ad4a0, 300003a5f28, 20, 2a1004975bc, 30003671520, 4)
    Nov 4 21:00:03 v880_sol9 genunix: [ID 179002 kern.notice] %l0-3: 00000300045a3308 0000030008b665b0 0000000000000000 00000300038aa8c0
    Nov 4 21:00:03 v880_sol9 %l4-7: 0000000000000000 00000000781ad488 0000000000000088 00000000781ad540
    Nov 4 21:00:04 v880_sol9 genunix: [ID 723222 kern.notice] 000002a100497510 namefs:nm_unmountall+10 (300038aa8c0, 300003a5f28, 20, 7bf, 0, 0)
    Nov 4 21:00:04 v880_sol9 genunix: [ID 179002 kern.notice] %l0-3: 00000300038aa8c0 00000300003a5f28 0000000000000001 0000000001499508
    Nov 4 21:00:04 v880_sol9 %l4-7: 0000000000000001 0000000000000000 0000030003963e38 000002a100497ba0
    Nov 4 21:00:04 v880_sol9 genunix: [ID 723222 kern.notice] 000002a1004975c0 unix:stubs_common_code+70 (300038aa8c0, 300003a5f28, 20, 7bf, 0, 0)
    Nov 4 21:00:04 v880_sol9 genunix: [ID 179002 kern.notice] %l0-3: 0000000000000000 0000000000000000 0000030009386910 0000000000000000
    Nov 4 21:00:04 v880_sol9 %l4-7: 00000000000000b0 0000000001410a10 0000030003963ce0 0000030009387b38
    Nov 4 21:00:04 v880_sol9 genunix: [ID 723222 kern.notice] 000002a100497670 fifofs:fifo_close+2d8 (30003963dd0, 300038aa8ae, 1, 0, 300003a5f28, 3000367151c)
    Nov 4 21:00:04 v880_sol9 genunix: [ID 179002 kern.notice] %l0-3: 00000300038aa8a0 00000300038aa8c0 0000000000000003 0000000000000000
    Nov 4 21:00:04 v880_sol9 %l4-7: 00000300038aa9c0 000003000366f188 00000300038aa9c0 0000000000000000
    Nov 4 21:00:04 v880_sol9 genunix: [ID 723222 kern.notice] 000002a100497720 genunix:closef+54 (3000932d378, 0, 1, 0, 100c6ac, 0)
    Nov 4 21:00:04 v880_sol9 genunix: [ID 179002 kern.notice] %l0-3: 0000000001340550 0000000000000001 00000300038aa9c0 000000000000000f
    Nov 4 21:00:04 v880_sol9 %l4-7: 0000000001495000 0000000000000000 000000000140e000 0000000000000001
    Nov 4 21:00:05 v880_sol9 genunix: [ID 723222 kern.notice] 000002a1004977d0 genunix:closeall+30 (300036d1d10, 30003671520, 20, 0, 7092c300, 0)
    Nov 4 21:00:05 v880_sol9 genunix: [ID 179002 kern.notice] %l0-3: 00000300092faea8 0000000000000004 0000030000010680 0000000000000000
    Nov 4 21:00:05 v880_sol9 %l4-7: 0000030000010558 0000000001410478 0000030003671520 000000000000fffd
    Nov 4 21:00:05 v880_sol9 genunix: [ID 723222 kern.notice] 000002a100497880 genunix:proc_exit+310 (3023596f798, 149c280, 30003671520, 300003a5f28, 0, 0)
    Nov 4 21:00:05 v880_sol9 genunix: [ID 179002 kern.notice] %l0-3: 000003000b3d84a0 00000300036d1440 000003000366f188 0000000000000002
    Nov 4 21:00:05 v880_sol9 %l4-7: 000000000000000f 0000000000000002 000000000000000f 0000000000000000
    Nov 4 21:00:05 v880_sol9 genunix: [ID 723222 kern.notice] 000002a100497930 genunix:exit+8 (2, f, 300036d1554, 0, 30003671520, 0)
    Nov 4 21:00:05 v880_sol9 genunix: [ID 179002 kern.notice] %l0-3: 000000000000000f 0000000000000002 0000000000004000 00000300036d1440
    Nov 4 21:00:05 v880_sol9 %l4-7: 0000000000000000 000000000000000f 0000000000000070 0000000000000000
    Nov 4 21:00:05 v880_sol9 genunix: [ID 723222 kern.notice] 000002a1004979e0 genunix:post_syscall+3e0 (2a100497ba0, 3, 0, 1, 30003671520, 4)
    Nov 4 21:00:05 v880_sol9 genunix: [ID 179002 kern.notice] %l0-3: 0000000000000004 00000300036d1440 000003000366f188 0000000000000000
    Nov 4 21:00:05 v880_sol9 %l4-7: 0000000000000000 0000000000000000 0000000000000004 00000000ffbffdf8
    Nov 4 21:00:06 v880_sol9 unix: [ID 100000 kern.notice]
    Nov 4 21:00:06 v880_sol9 genunix: [ID 672855 kern.notice] syncing file systems...
    Nov 4 21:00:06 v880_sol9 unix: [ID 836849 kern.notice]
    Nov 4 21:00:06 v880_sol9 ^Mpanic[cpu3]/thread=30003671520:
    Nov 4 21:00:06 v880_sol9 unix: [ID 340138 kern.notice] BAD TRAP: type=31 rp=1437f90 addr=a0 mmu_fsr=0 occurred in module "genunix" due to a NULL pointer dereference
    Nov 4 21:00:06 v880_sol9 unix: [ID 100000 kern.notice]
    Nov 4 21:00:06 v880_sol9 genunix: [ID 111219 kern.notice] dumping to /dev/dsk/c1t0d0s1, offset 644022272, content: kernel
    Nov 4 21:01:30 v880_sol9 genunix: [ID 409368 kern.notice] ^M100% done: 160398 pages dumped, compression ratio 2.45,
    Nov 4 21:01:31 v880_sol9 genunix: [ID 851671 kern.notice] dump succeeded
    Nov 4 21:02:16 v880_sol9 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.9 Version Generic_117171-02 64-bit

    ================================== 2 =======================================
    Nov 6 12:53:19 v880_sol9 SUNW,UltraSPARC-III+: [ID 621593 kern.warning] WARNING: [AFT1] DUE Event detected by CPU0 at TL=0, errID 0x00008286.8959bc20
    Nov 6 12:53:19 v880_sol9 AFSR 0x00500000<DUE,PRIV>.000000e2 AFAR 0x000000a0.6c7ec0c0
    Nov 6 12:53:19 v880_sol9 Fault_PC 0x117bb00 Esynd 0x00e2 Slot A: J8100 J8101 J8201 J8200
    Nov 6 12:53:19 v880_sol9 SUNW,UltraSPARC-III+: [ID 300719 kern.notice] [AFT1] errID 0x00008286.8959bc20 Two Bits were in error
    Nov 6 12:53:19 v880_sol9 SUNW,UltraSPARC-III+: [ID 978170 kern.info] [AFT2] errID 0x00008286.8959bc20 PA=0x000000a0.6c7ec0c0
    Nov 6 12:53:19 v880_sol9 E$tag 0x00000281.b1000001 E$state_3 Invalid
    Nov 6 12:53:19 v880_sol9 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0x00000000.00000000 0x00714fb0.00000000 ECC 0x123
    Nov 6 12:53:19 v880_sol9 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0x00000000.00000000 0x00000000.00000000 ECC 0x000
    Nov 6 12:53:19 v880_sol9 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0x00714fb0.00000000 0x00000039.00000000 ECC 0x185
    Nov 6 12:53:19 v880_sol9 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x30) 0x00717188.00718018 0xff2fa7e8.00000000 ECC 0x032
    Nov 6 13:42:53 v880_sol9 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.9 Version Generic_117171-02 64-bit

    ========================== ANSWERS =========================================

    ************** ANSWER 1
    Date: Sat, 06 Nov 2004 10:59:25 -0500
    From: Bill Voight <bvoight at patriot.net>

    Some of the errors appear hardware related. First thing is to patch to
    current levels. Second, call support. They may reseat memory and
    CPUs. That might be tough with a production server, but it's worth a
    try. They also have diagnostic software that might be useful. If you
    don't run explorer, you might install it. It can send Sun info they
    need to diagnose the problem.

    We've had some mysterious 880 problems like yours and in one case,
    patching did the trick. The problem recently resurfaced, but we may
    have traced it to a shaky Oracle table. Let me know how you do.

    BV

    **************** ANSWER 2
    Date: Sat, 6 Nov 2004 17:07:37 +0100
    From: "joe_fletcher" <joe_fletcher at btconnect.com>

    This sort of thing is quite common on V880s. Since it's
    indicating there's a bank of DIMMs with errors you may be
    looking at a replacement system board. Log it with SUN and
    they will either try replacing just the DIMMS or they will
    do the whole board.

    ****************** ANSWER 3
    Date: Sat, 6 Nov 2004 17:48:42 +0100
    From: Stephane Tsacas <stephane.tsacas at gmail.com>

    I could be wrong, but I think that:

    - you have a hardware problem, probably memory related (either the memory
    itself or the bus).

    => remove the cards, push the memory in with your thumbs, put the cards
    back in, power on and see what happens.

    => put some CPUs offline. It's possible that only one CPU is causing
    the problem. However, the machine itself should have removed it if it
    detected a bad CPU, but who knows.
    I'd start by disabling cpu3 and cpu0 (see the error messages).

    => still crashing ? call Sun ASAP.

    Good luck ;)
    Stephane

    *************** ANSWER 4
    Date: Sat, 6 Nov 2004 09:44:30 -0800
    From: Webpro <aielloster at gmail.com>

    Looks similar to a problem I had with some bad memory. I left a
    console connected and sent the output to Sun, who came out and replaced
    a memory module.

    Joe

    -- 
    "Despite the high cost of living, it still remains popular!"

    ********************** ANSWER 5
    Date: Sat, 6 Nov 2004 11:27:16 -0800
    From: "Jon Hudson" <jon.hudson at finisar.com>

    I would say it is a CPU/cache issue. While it complains about memory

    Fault_PC 0x1035ec4 Esynd 0x0152 Slot A: J7900 J7901 J8001 J8000

    it's unlikely that so many parts would actually fail.
    If you want to test it without opening a ticket with Sun, pull the board
    with cpu0 on it and see if the problem returns. If it does, it could be
    something deeper; if not, it's safe to say it's cpu0 and/or cpu0
    components.
    I would just open a case with Sun; they can debug that error dump a lot
    more carefully than any of us can.

    ******************* ANSWER 6
    Date: Sat, 6 Nov 2004 13:10:47 -0800 (PST)
    From: "sunsa_tx at yahoo.com"

    You have to open a case with SUN and give them the
    core dump file or the log you included in this email.
    I looked at your log and it looks like SUN needs to
    replace the DIMMs J7900 J7901 J8001 J8000 as they had
    multiple-bit errors. They may need to replace the
    system board too.

    *************** ANSWER 7
    Date: Sun, 07 Nov 2004 08:27:33 +1100
    From: Tim Tuck <tim.tuck at penrith.net>

    You have faulty memory on the primary system board in locations
    J7900 J7901 J8001 J8000

    Tim

    ************ ANSWER 8
    Date: Sun, 07 Nov 2004 08:58:34 -0200
    From: "Ghassan Qanzu'a" <ghassan at sts.com.ps>

    It seems that you have bad memory at J7900 J7901 J8001 J8000 J8100 J8101 J8201 J8200
    on the CPU0 board (the first one counting from the bottom).
    To be sure of that, you can check it by removing this board, replacing
    it with the third board, and starting the system with just 4 processors
    and 8 GB. Observe the behavior; if it does not crash within two days,
    then the diagnosis above is right.

    Ghassan

    ***************** ANSWER 9
    Date: Sun, 07 Nov 2004 20:44:14 +1100
    From: Jeff Allison <jeff.allison at allygray.2y.net>

    Don't know what the dump means, but we have 2 V880s that have crashed
    due to dodgy memory (Samsung), if I remember correctly. Call them out
    and get it checked.

    Jeff

    **************** ANSWER 10
    Date: Sun, 07 Nov 2004 21:51:36 -0500
    From: Prasanth Mudundi <Prasanth_mudundi at comcast.net>

    It looks like there are memory errors, but they seem to come from
    different memory DIMMs.
    I would start with the most common one, then move on to replacing the
    entire bank; if that does not work, replace the system board.
    Try running VTS for memory/CPU. It will not fail right away (it may run
    for days without an issue), but when the machine crashes while VTS is
    running you have your bad boy. Since you have a warranty, let Sun do the
    analysis for you.

    prasanth

    ******************** ANSWER 11
    From: "Michael Horton" <Michael.Horton at acntv.com>
    Date: Mon, 8 Nov 2004 07:46:36 -0500

    Since the V880 is still under warranty support, call Sun support for
    help.
    At first glance, you have cpu0 reporting memory errors in a specific
    bank of memory slots.

    ****************** ANSWER 12
    Date: Mon, 08 Nov 2004 09:46:19 -0500
    From: Tim Chipman <chipman at ecopiabio.com>

    If the machine is still under Sun warranty/support, get them involved
    ASAP. The type of failure you describe is consistent with "fairly
    serious hardware failure", although it isn't inconceivable that it is a
    software issue. Usually a trivial way to distinguish the two is to
    boot from an installer CD-ROM and leave the machine thrashing (copying
    junk data back and forth between 2 slices in an infinite loop or
    something) for a few hours. If it crashes while booted from the clean OS
    of the installer CD, that would support the "hardware failure"
    hypothesis. However, I expect you already have enough hints in the logs
    below for Sun support to have a strong candidate "smoking gun".
    Tim
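
    The "copy junk data back and forth" idea in Answer 12 can be sketched
    roughly as follows. This is only an illustration of the loop logic in
    Python, not something from the thread: the two directories stand in for
    the two slices, the function names are hypothetical, and an installer CD
    environment is unlikely to have Python at all, so there the same loop
    would have to be written with whatever tools are on the CD. A checksum is
    added so that silent corruption shows up even if the machine does not
    panic.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def thrash(dir_a, dir_b, size_bytes=64 << 20, rounds=10):
    """Copy a junk file back and forth between two directories.

    Returns the number of the first round whose copy fails the checksum,
    or None if every round matched.
    """
    src = os.path.join(dir_a, "junk.bin")
    dst = os.path.join(dir_b, "junk.bin")
    with open(src, "wb") as f:
        f.write(os.urandom(size_bytes))
    expected = sha256_of(src)
    for i in range(1, rounds + 1):
        shutil.copyfile(src, dst)
        os.remove(src)
        if sha256_of(dst) != expected:
            return i
        src, dst = dst, src  # next round copies in the other direction
    return None


if __name__ == "__main__":
    # Small demo in temporary directories; real use would point at two
    # different file systems and run far more rounds.
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        print(thrash(a, b, size_bytes=1 << 20, rounds=4))
```

    A return value of None only means no mismatch was seen in that run; it
    does not prove the memory is good.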
    ************ ANSWER 13
    Date: Mon, 8 Nov 2004 11:35:57 -0500
    From: "Eric Paul" <epaul at profitlogic.com>

    This is a hardware problem. Contact Sun immediately for service.

    ******************** ANSWER 14
    Date: Mon, 8 Nov 2004 10:07:58 -0800 (PST)
    From: David Foster <foster@ncmir.ucsd.edu>
    Reply-To: David Foster <foster at ncmir.ucsd.edu>

    Install the most recent recommended patch cluster from SunSolve,
    in particular the latest kernel updates. Install the latest PROM patch,
    112186-15 (OBP 4.13.2) or later.
    Download Sun VTS, install it and run it to check for hardware
    errors. Look at the /var/adm/messages* files and check for error
    messages. If you have support, download and install the Sun Explorer
    program and run it, then open a hardware case with Sun tech support
    and email them the output (a .tar.gz file); they can check it
    for config and hardware problems.

    Dave Foster

    ****************** ANSWER 15
    From: "Loukinas, Jeremy" <Jeremy.Loukinas at evenflo.com>
    Date: Mon, 8 Nov 2004 13:27:11 -0500

    You probably just need to upgrade your OpenBoot version...
    prtdiag -v | grep OBP
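
    To compare the reported level with the one named in Answer 14
    (OBP 4.13.2, PROM patch 112186-15), a small Python sketch such as the one
    below could be used. It is an illustration only: the function names are
    hypothetical, and it assumes the line printed by the command above
    contains the text "OBP" followed by a dotted version number. The sample
    line in the demo is made up to show the shape of the input.

```python
import re

# Minimum OBP level taken from Answer 14 (PROM patch 112186-15).
MIN_OBP = (4, 13, 2)


def parse_obp_version(text):
    """Return the first 'OBP x.y.z' version in text as a tuple of ints, or None."""
    m = re.search(r"\bOBP\s+(\d+(?:\.\d+)+)", text)
    if m is None:
        return None
    return tuple(int(part) for part in m.group(1).split("."))


def needs_prom_update(text, minimum=MIN_OBP):
    """True if the OBP version found in text is older than minimum."""
    version = parse_obp_version(text)
    if version is None:
        raise ValueError("no OBP version found in input")
    return version < minimum


if __name__ == "__main__":
    # Hypothetical sample line, assumed to resemble the grep output.
    sample = "OBP 4.5.19 2002/09/10 11:32"
    print(parse_obp_version(sample), needs_prom_update(sample))
```

    Versions are compared as tuples of integers, so 4.5.19 sorts before
    4.13.2, which a plain string comparison would get wrong.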
    ----------------------------------- END OF SUMMARY --------------------------
    _______________________________________________
    sunmanagers mailing list
    sunmanagers@sunmanagers.org
    http://www.sunmanagers.org/mailman/listinfo/sunmanagers
    
