SUMMARY: UPDATE/Alpha particles and cosmic rays - Bcache Tag Parity Error

David.Knight_at_clubcorp.com
Date: 09/10/04

    Date: Fri, 10 Sep 2004 10:03:35 -0500
    To: tru64-unix-managers@ornl.gov
    
    

    Managers,
    This is just an update on our ongoing issue with the BCACHE tag parity
    errors that we have been encountering. We recently had a conference call
    with HP about our errors and have now been told by HP Engineering that
    our problem may be due to "Alpha particles and cosmic rays". Has anyone
    else encountered cosmic problems? Any input on the topic would be
    wonderful.

    Thanks in advance,
    David

    ----- Forwarded by David Knight/CLUBCORP/US on 09/10/2004 09:55 AM -----

    David Knight
    08/11/2004 02:24 PM

     
            To: tru64-unix-managers@ornl.gov
            cc:
            Subject: SUMMARY : "What processors are prone to the "Bcache Tag Parity Error"

    Managers,
            Below are the two responses that I received from my post. I also
    got some info back from my TAM @ HP, which states that the Engineering
    advisory issued for this issue covers the following:
    SCOPE
    The following Systems and CPUs may be affected.
    DS20E - 54-30482-01/02
    DS20L - 3X-81BAA, 3X-81AAA kernels
    ES40/SC40 - 54-30362-B3
    ES45/SC45 - 54-30466-03/04
    GS80/160/320 - B4166-AA

    Thanks to Peter Reynolds and Phil Baldwin for your time and knowledge on this topic!

    -David Knight

    _______
    Hi David,

     we've also seen them on
    The alpha EV6.7 (21264A) processor operates at 731 MHz,
      has a cache size of 4194304 bytes

    maybe once or twice per year (at most on a GS320 with 16 cpus). Seems to
    be OK after a boot...

    __________

    Judging by past performance, the problem - which is a hardware problem -
    was most prevalent on processors in the Alphaserver 1000 series. A number
    of years ago, I was involved in an installation project which involved 68
    systems and just over half of them had a failure within the first year.
    However DEC, who were the supplier at the time, eventually admitted that
    there was a bad batch of static RAM chips used for the cache on the CPU
    modules. As for today, we possess 2 GS80 systems each with four 6/731
    processors and we have had no failures in the last two years. I also have
    a number of 4000/4100 systems and these have also been very reliable, with
    only two recorded failures in the last 5 years, both involving 5/600
    processors. There is also an Alphaserver 1200 (totally reliable on the CPU
    front), a DS20e (also very reliable, although it does run very hot), and
    an ES40 (twin 68/833 CPUs and very reliable). I also have an Alphaserver
    1000 and a 1000A, and these have also been pretty reliable, but don't get
    used all that often.
     
    The error in question is not a function of the CPU itself as the B-cache
    in question is made up of a number of static RAM chips on the module, and
    by their nature these run hot. The fix for the problem, if it stops the
    system running, is to replace the offending CPU module. However I have
    also known similar problems to be caused by main memory, as the cache
    entry is a mirror of what is in the particular memory location. If they
    don't agree, for whatever reason, you will get this error. On GS160/320
    systems the problem is compounded by the fact that CPUs can address other
    CPUs' caches, and errors can be caused by the switch module. It should,
    though, be possible to track down the offending item using the registers
    from any error message you get.
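
    For readers unfamiliar with the mechanism, a tag parity check can be
    sketched roughly as below. This is a generic Python illustration, not
    the Alpha hardware's actual scheme: the tag value, the single parity
    bit, and the function names are all assumptions made for the example.
    It shows why one flipped bit in the cache SRAM (for instance from a
    particle strike) is reported as a parity error, while two flipped bits
    in the same tag would go unnoticed.

```python
def parity(value):
    """Return 1 if value has an odd number of set bits, else 0."""
    return bin(value).count("1") % 2


def store_tag(tag):
    """Model writing a tag: keep the tag and its parity bit together."""
    return tag, parity(tag)


def check_tag(tag, stored_parity):
    """Model reading a tag: recompute parity and compare with the stored bit."""
    return parity(tag) == stored_parity


# Hypothetical tag value, chosen only for illustration.
tag, stored_parity = store_tag(0x3A5F1)

# One bit flips in the SRAM after the write.
single_flip = tag ^ (1 << 7)

# A second bit flips as well.
double_flip = single_flip ^ (1 << 12)

print("clean tag passes:      ", check_tag(tag, stored_parity))
print("single flip passes:    ", check_tag(single_flip, stored_parity))
print("double flip passes:    ", check_tag(double_flip, stored_parity))
```

    The single flip fails the check, which is the situation the error
    message describes; the double flip passes, which is a known limit of
    plain parity.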
     

