SUMMARY: Redundant power supplies aren't

From: Skip Hammack (skip_at_hammack.com)
Date: 06/22/04

    Date: Tue, 22 Jun 2004 13:01:30 -0400
    To: sunmanagers@sunmanagers.org
    
    

    First, thanks to Eric Cortes Trujillo, Casper ***, Tim Chapman,
    Chris Hoogendyk, Dan Green, Ryan Krenzischek and Lar Hecking.

    I have my latest SunFire 280R back in service at this time.

    Sun replaced both power supplies, stating that both had failed at the same
    time. I guess it's not going to be redundant when there isn't anything to
    be redundant with. The new power supplies are not the same part number;
    I'm not sure if that's a Sun thing or just a different batch.

    My biggest issues were the lack of error messages or any warning that my
    power supplies had failed or were failing, along with a prtdiag that was
    very incomplete on the 4 failed servers. There were also differing packages
    for power in /var/sadm/pkg. I also found out that many of my other 280R
    servers had incomplete prtdiag output along with differing power packages.
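
    One way to spot the differing packages is to save a listing of
    /var/sadm/pkg from each server and compare it against a known-good one.
    The following is only a minimal sketch, assuming one package name per
    line in each saved listing (for example from "ls /var/sadm/pkg"). The
    function names and the file arguments are hypothetical, not something
    from the original troubleshooting.

```python
import sys


def parse_listing(text):
    """Return the set of package names in a saved /var/sadm/pkg listing.

    Assumes one package name per line; blank lines are ignored.
    """
    return {line.strip() for line in text.splitlines() if line.strip()}


def diff_packages(reference, other):
    """Compare two package sets.

    Returns (missing, extra): packages present only on the reference
    server, and packages present only on the other server, both sorted.
    """
    return sorted(reference - other), sorted(other - reference)


if __name__ == "__main__":
    # Hypothetical usage: python pkgdiff.py good_server.pkgs suspect.pkgs
    with open(sys.argv[1]) as good, open(sys.argv[2]) as suspect:
        missing, extra = diff_packages(parse_listing(good.read()),
                                       parse_listing(suspect.read()))
    for name in missing:
        print("missing on suspect server:", name)
    for name in extra:
        print("only on suspect server:", name)
```

    Running this against each server in turn gives a quick list of which
    machines differ from the good one before looking at patch levels.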

    All servers were built from a jumpstart and are identical except for
    additional disk space layouts. All servers were patched with the Solaris 8
    recommended cluster patches.

    Since I hadn't turned this latest failed server back over to the team yet,
    I went ahead and rebuilt it using the jumpstart and then applied the
    cluster patches again. My results were mixed, with an incomplete prtdiag
    and patch level. I applied the patch cluster again, looking for failures
    or successes, and checked again. Better, but not identical to the good
    servers. My logs showed success but the patches were incomplete. Casper
    recommended not using the cluster install script due to problems seen with
    loading these cluster patches. I went ahead and did another jumpstart and
    a clean build, then applied the patches without the install script.
    Voila! Success.

    I then pulled the primary power supply and got loads of errors, but the
    server stayed up. prtdiag showed the failed power supply. I then put the
    power supply back in; the maintenance light went out and prtdiag showed
    clean, no failures. I then pulled the second power supply and the server
    shut down. I checked the power cords, made sure everything was installed
    correctly and hit the power button. The server came up fine to single user
    and all my errors were that the second power supply was bad. I then put
    the second power supply in again and all errors cleared. It's been up and
    running since early this morning. I've been told that I took the power
    supply out too quickly while testing and that I didn't leave sufficient
    time to allow the system to catch up. I haven't had time to check it. Also
    of note, with the system up and the second power supply pulled, there
    wasn't a maintenance light on the front panel.
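
    Since the front panel light can't be relied on, a periodic check of
    prtdiag's exit status is one option. Below is a rough sketch, assuming
    the exit codes documented in the prtdiag(1M) man page (0, 1, 2) and the
    usual /usr/platform/`uname -i`/sbin location. The function names are
    made up for illustration, and on a box where prtdiag output is
    incomplete the status may not mean much either.

```python
import subprocess

# Exit codes as documented in the prtdiag(1M) man page.
PRTDIAG_STATUS = {
    0: "no failures or errors detected",
    1: "failures or errors detected",
    2: "internal prtdiag error",
}


def describe_exit(code):
    """Translate a prtdiag exit code into a readable status string."""
    return PRTDIAG_STATUS.get(code, "unexpected exit code %d" % code)


def run_prtdiag():
    """Run 'prtdiag -v' for this platform; return (exit code, output)."""
    platform = subprocess.run(["uname", "-i"], capture_output=True,
                              text=True, check=True).stdout.strip()
    cmd = ["/usr/platform/%s/sbin/prtdiag" % platform, "-v"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout


if __name__ == "__main__":
    code, output = run_prtdiag()
    print(describe_exit(code))
    if code != 0:
        # Show the full report so the failed component can be identified.
        print(output)
```

    Something like this could be run from cron, with anything other than
    a zero exit status mailed to the admins.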

    I built another 280R using the 02/02 release and the patch cluster and it
    seems to be working out so far. I don't think it's an issue with the
    jumpstart since some of the servers are fine, but I am leaning towards
    patch cluster issues at this time. On the previous build I tried
    downloading another recommended cluster and running it, but it appeared
    to have identical issues when installed as a cluster.

    Summary: I will be applying patches across the board to all servers and
    ensuring that they are all identical and have the same packages and patch
    level.
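
    To confirm that the patch levels really do match afterwards, the
    output of "showrev -p" from each server can be compared against a good
    one. This is a sketch only, assuming each patch line starts with
    "Patch: NNNNNN-RR" as showrev -p prints it. The function names are
    hypothetical.

```python
import re

# Matches the leading "Patch: 123456-07" field of a showrev -p line.
PATCH_RE = re.compile(r"^Patch:\s+(\d{6})-(\d{2})\b")


def parse_showrev(text):
    """Map each patch base ID to the highest revision seen in the output."""
    patches = {}
    for line in text.splitlines():
        match = PATCH_RE.match(line)
        if match:
            base, rev = match.group(1), int(match.group(2))
            patches[base] = max(rev, patches.get(base, -1))
    return patches


def compare_patch_levels(reference, other):
    """Return {patch_id: (reference_rev, other_rev)} for every mismatch.

    A revision of None means the patch is absent on that server.
    """
    diffs = {}
    for base in sorted(set(reference) | set(other)):
        ref_rev, other_rev = reference.get(base), other.get(base)
        if ref_rev != other_rev:
            diffs[base] = (ref_rev, other_rev)
    return diffs
```

    An empty result from compare_patch_levels means the two servers report
    the same patches at the same revisions.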

    Also, for those using BigBrother: with everything correct on the server,
    BB picked up the failed disk when it was pulled.

    Thanks again to all.
    Skip

