[HPADM] SUMMARY, seeking exercises for replaced hardware

From: Mark Fearer (mfearer_at_cox.net)
Date: 10/05/05

  • Next message: Ben Le: "[HPADM] HP Technology Forum Conference"
    To: <hpux-admin@dutchworks.nl>
    Date: Wed, 5 Oct 2005 06:41:11 -0400
    
    
    

     Hello, list. Thanks to all who responded. Several expressed interest in
    seeing a summary, so here goes. Also, I want to apologize if the wording of
    my original question was vague, as I received many responses about how to
    re-engineer my data center. Unfortunately, I am not in a position to do
    that.

     

    Original question:

     

    Admins. Hello. I am in a data center that has roughly 200 HP servers. They
    are A, D, K, L and N class and are running 11.0 or 11.11. They are mission
    critical and require an uptime of 19x7 with a maintenance window from
    midnight to 5:00am. We are constantly replacing failed hardware.
    I wish to ask of the list some good verification exercises for testing
    replaced hardware. Specifically:

    hard drives - ioscan? diskinfo?
    memory - stm?
    cpu - stm? anything in the BCH?
    tape devices - tar? ioscan? stm?
    mother boards - firmware?
    firmware - GSP? BCH?
    fiber channel cards - ioscan? fcmutil?

    Again, I am asking if anyone recommends some good testing for replaced
    hardware mentioned above

     

     

     

    Summary:

     

    Disks

    Memory

    CPU

    Tape devices

    Network cards (NICs)

    Mother boards

    Firmware (GSP, PDC)

    Fiber channel cards

     

     

     

    DISKS

     

    commands: ioscan, diskinfo, pvdisplay, vgdisplay, lvdisplay, lvlnboot,
    setboot, stm, cstm, dmesg

    files: /etc/fstab, /etc/lvmtab, /etc/lvmconf/vg##.conf, syslog.log, etc.

     

    If a disk is replaced, two good commands for verifying that the system sees
    the new disk are 'diskinfo' and 'ioscan -fnC disk'.

     

    Doing an 'ioscan -fnC disk' command first should display all of the storage
    devices that are attached to a server. You should see the H/W Type as
    'DEVICE' and the S/W State as 'CLAIMED'. You should also see the H/W Path
    and the device path displayed.

     

    Example:

    [root@sparky: ]# ioscan -fnC disk

    Class I H/W Path Driver S/W State H/W Type Description

    ============================================================

    disk 0 0/0/1/1.15.0 sdisk CLAIMED DEVICE HP
    36.4GST336753LC

                                /dev/dsk/c1t15d0 /dev/rdsk/c1t15d0

    disk 1 0/0/2/1.15.0 sdisk CLAIMED DEVICE HP
    36.4GST336753LC

                                /dev/dsk/c3t15d0 /dev/rdsk/c3t15d0

    A 'diskinfo' command followed by the raw character device path should give
    you the basic technical specifications of the disk.

     

    Example:

    [root@sparky: ]# diskinfo /dev/rdsk/c1t15d0

    SCSI describe of /dev/rdsk/c1t15d0:

                 vendor: HP 36.4G

             product id: ST336753LC

                   type: direct access

                   size: 35566480 Kbytes

       bytes per sector: 512

     

    You can test a disk for bad sectors and read errors with the following
    exercise:

     

    dd in=/dev/rdsk/<your_disk> of=/dev/null bs=64k

     

    Example:

    [root@sparky: ]# dd if=/dev/rdsk/c1t15d0 of=/dev/null bs=64k

    555726+1 records in

    555726+1 records out

     

    Another exercise that will test your disk is 'stm'. From the command line,
    enter 'stm'. This will invoke a GUI or a TUI. From this menu, you can
    highlight your device, then choose 'Tools', 'Exercise', 'Run'.

     

    Example:

     

    If you are getting replacement parts from a vendor, I would try and make
    sure you can burn them in somewhere for at least a couple of days. You
    might also look into keeping track of any serial numbers or unique ids on
    the parts.

     

     

     

    MEMORY

     

    commands: stm, cstm, dmesg, top, glance

    files: /var/adm/messages

     

    An exercise that will test your memory is 'stm'. From the command line,
    enter 'stm'. This will invoke a GUI or a TUI. From this menu, you can
    highlight your device, then choose 'Tools', 'Exercise', 'Run'.

     

    Example:

     

    Another exercise that will test your memory is 'top'. From the command
    line, enter 'top'. This will invoke a text output graph of the top processes
    that are using real time resources and shows the current usage of memory and
    cpu.

     

    Example:

    top

    System: sparky Mon Oct 3 15:47:03
    2005

    Load averages: 0.16, 1.03, 0.88

    167 processes: 148 sleeping, 19 running

    Cpu states:

    CPU LOAD USER NICE SYS IDLE BLOCK SWAIT INTR SSYS

     0 0.15 0.2% 0.0% 0.2% 99.6% 0.0% 0.0% 0.0% 0.0%

     1 0.18 0.2% 0.0% 0.0% 99.8% 0.0% 0.0% 0.0% 0.0%

    --- ---- ----- ----- ----- ----- ----- ----- ----- -----

    avg 0.16 0.2% 0.0% 0.2% 99.6% 0.0% 0.0% 0.0% 0.0%

     

    Memory: 147732K (89772K) real, 231180K (143120K) virtual, 957008K free
    Page# 1/

    16

     

    CPU TTY PID USERNAME PRI NI SIZE RES STATE TIME %WCPU %CPU
    COMMAND

     0 ? 37 root 152 20 6464K 6464K run 0:14 0.43 0.43 vxfsd

     1 ? 2111 root 152 20 14804K 4040K run 0:01 0.20 0.20 dmisp

     0 ? 3755 root 152 20 37980K 6944K run 0:13 0.18 0.18
    mhragent

     

    Still another exercise that will test your memory is Measureware's 'glance'.
    From the command line, enter 'glance'. This will invoke a text output graph
    of the top processes that are using real time resources and shows the
    current usage of memory and cpu.

     

    Example:

    B3692A GlancePlus C.03.72.00 16:14:43 sparky 9000/800 Current Avg
    High

    ----------------------------------------------------------------------------

    ----
    CPU  Util   SA                                                 |  3%  3%
    3%
    Disk Util   FV                                                 |  3%  3%
    3%
    Mem  Util   S           SU   UB        B                       | 54%   54%
    54%
    Swap Util   U     URR                                          | 16%   16%
    16%
    ----------------------------------------------------------------------------
    ----
                                      PROCESS LIST                      Users=
    2
                                  User      CPU Util     Cum     Disk
    Thd
    Process Name   PID   PPID Pri Name   (  200% max)    CPU   IO Rate    RSS
    Cnt
    ----------------------------------------------------------------------------
    ----
    glance        8899   8018 168 root      4.0/ 4.6     0.1  6.0/30.9   6.6mb
    1
    swapper          0      0 127 root      0.0/ 0.0     1.3  1.3/ 0.2    32kb
    1
    vxfsd           37      0 138 root      0.0/ 0.1    17.9  1.3/ 9.6   6.3mb
    16
     
     
     
    CPU
     
    commands: stm, cstm, top, glance
     
    You might consider opening a several windows and running 
    recursive jobs such as 'find', 'll -R', 'cp'.  You can generate good cpu
    loads and make sure you are getting all cpu's by monitoring with a 'top' or
    a 'glance'.
     
    Example of a 'top':
    System: sparky                                        Tue Oct  4 13:29:04
    2005
    Load averages: 0.00, 0.02, 0.04
    183 processes: 163 sleeping, 20 running
    Cpu states:
    CPU   LOAD   USER   NICE    SYS   IDLE  BLOCK  SWAIT   INTR   SSYS
     0    0.00   0.0%   0.0%   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%
     1    0.00   0.0%   0.0%   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%
    ---   ----  -----  -----  -----  -----  -----  -----  -----  -----
    avg   0.00   0.0%   0.0%   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%
     
    Memory: 159532K (94984K) real, 244976K (146336K) virtual, 879716K free
    Page# 1/
    17
     
    CPU TTY    PID USERNAME PRI NI   SIZE    RES STATE    TIME %WCPU  %CPU
    COMMAND
     1   ?      37 root     152 20  6464K  6464K run      1:04  0.31  0.31 vxfsd
     
    Example of a 'glance':
    B3692A GlancePlus C.03.72.00    13:31:52   sparky 9000/800    Current  Avg
    High
    ----------------------------------------------------------------------------
    ----
    CPU  Util   S                                                  |  1%    1%
    2%
    Disk Util   F                                                  |  2%    2%
    2%
    Mem  Util   S             SU   UB        B                     | 58%   58%
    58%
    Swap Util   U     URR                                          | 17%   17%
    17%
    ----------------------------------------------------------------------------
    ----
                                      PROCESS LIST                      Users=
    5
                                  User      CPU Util     Cum     Disk
    Thd
    Process Name   PID   PPID Pri Name   (  200% max)    CPU   IO Rate    RSS
    Cnt
    ----------------------------------------------------------------------------
    ----
    glance       15070  14158 154 root      1.1/ 2.3     0.2  0.1/ 9.4   6.8mb
    1
    swapper          0      0 127 root      0.0/ 0.0     2.7  1.3/ 0.2    32kb
    1
    vxfsd           37      0 138 root      0.0/ 0.1    63.8  4.0/ 3.8   6.3mb
    18
     
    Of course, there is 'stm', too, for stress testing a CPU.
     
    Example:
     
    You can disable a CPU by rebooting and pausing at the firmware prompts. You
    can then go to 
    the COnfigure section and enable or disable CPU's from the firmware.
     
    If you are getting replacement parts from a vendor, I would try and make
    sure you can burn them in somewhere for at least a couple of days.  You
    might also look into keeping track of any serial numbers or unique ids on
    the parts.
     
     
     
    TAPE DEVICES
     
    Commands: ioscan, stm, tar
     
    Upon replacement of a tape device, do an 'ioscan -fnC tape' or an 'ioscan
    -fn' to assure that the new device is seen on the bus and that it is
    claimed.
     
    Example:
    [root@smith: ]# ioscan -fnC tape
    Class     I  H/W Path     Driver S/W State   H/W Type     Description
    =====================================================================
    tape      0  0/0/1/0.3.0  stape CLAIMED     DEVICE       HP      C1537A
                             /dev/rmt/0m            /dev/rmt/c0t3d0BESTn
                             /dev/rmt/0mb           /dev/rmt/c0t3d0BESTnb
                             /dev/rmt/0mn           /dev/rmt/c0t3d0DDS
                             /dev/rmt/0mnb          /dev/rmt/c0t3d0DDSb
                             /dev/rmt/c0t3d0BEST    /dev/rmt/c0t3d0DDSn
                             /dev/rmt/c0t3d0BESTb   /dev/rmt/c0t3d0DDSnb
     
    'ioscan' will show you if it's on the bus, but won't tell you about the 
    condition of the heads.  Do some write/read to the tape. 
     
    Do some 'tar' tests. For example, you may want to search the /tmp directory
    for some medium sized file to 'tar' to (cvf) and then extract from (xvf) the
    new tape device. 
     
    Example:
    [root@smith: tmp]# cd /tmp
    [root@smith: tmp]# ll
    -rw-r--r--   1 oracle     dba          42350 May  8 10:05 DB_Check.sql
     
    [root@smith: tmp]# tar cvf /dev/rmt/0m DB_Check.sql
    a DB_Check.sql 83 blocks
     
    Check the 'tar' to tape with a 'tvf' (table of contents).
     
    [root@smith: tmp]# tar tvf /dev/rmt/0m
    rw-r--r-- 103/102  42350 May  8 10:05 2005 DB_Check.sql
     
    Compare the original file with the file that was tarred to and then
    extracted from tape with use of the 'cksum' command.
     
    [root@smith: tmp]# pwd
    /tmp
    [root@smith: tmp]# mkdir test
    [root@smith: tmp]# cd test
    [root@smith: test]# tar xvf /dev/rmt/0m
    x DB_Check.sql, 42350 bytes, 83 tape blocks
     
    [root@smith: test]# cksum DB_Check.sql
    4182244438 42350 DB_Check.sql
    [root@smith: test]# cd ..
    [root@smith: tmp]# cksum DB_Check.sql
    4182244438 42350 DB_Check.sql
     
    And, of course, there is 'stm' which can be used to stress test a tape
    device. Assure that you have a tape loaded that can be written to (with no
    valuable data on it, as it will be overwritten).
     
    Example:
     
    There may be cases of 'tar' failing with an error message 'end of tape
    reached' in a few seconds from starting the backup. It would happen even
    with a fresh tape. Eventually I found out that the tape drive (DDS3) wasn't
    recognizing only 125m tapes (which it has always been recognizing). The same
    works fine with 90 or 120m tapes.
     
    If you are getting replacement parts from a vendor, I would try and make
    sure you can burn them in somewhere for at least a couple of days.  You
    might also look into keeping track of any serial numbers or unique ids on
    the parts.
     
     
     
    NETWORK CARDS (NICS)
     
    Commands: lanadmin, lanscan, netstat -in
     
    Find the driver:
    'ioscan -fkC lan' will display the btlan driver loaded in the kernel.
     
    Example:
    [root@sparky: tmp]# ioscan -fkC lan
    Class     I  H/W Path  Driver S/W State   H/W Type     Description
    ==================================================================
    lan       0  0/0/0/0   btlan CLAIMED     INTERFACE    HP PCI 10/100Base-TX
    Core
    lan       1  0/2/0/0   btlan CLAIMED     INTERFACE    HP A5230A/B5509BA PCI
    10/100Base-TX Addon
    lan       2  0/4/0/0   btlan CLAIMED     INTERFACE    HP A5230A/B5509BA PCI
    10/100Base-TX Addon
     
    'lanadmin'
     
    Example:
    [root@sparky: tmp]# lanadmin -x 1
    Current Config                   = 100 Full-Duplex MANUAL
     
    'lanscan'
     
    Example:
    [root@sparky: tmp]# lanscan
    Hardware Station        Crd  Hdw   Net-Interface    NM   MAC       HP-DLPI
    DLPI
    Path     Address        In#  State NamePPA          ID   Type      Support
    Mjr#
    0/0/0/0  0x00306E1CC054 0    UP    lan0 snap0       1    ETHER       Yes
    119
    LinkAgg0 0x001083F67C58 900  UP    lan900 snap900   5    ETHER       Yes
    119
    LinkAgg1 0x000000000000 901  DOWN  lan901 snap901   6    ETHER       Yes
    119
    LinkAgg2 0x000000000000 902  DOWN  lan902 snap902   7    ETHER       Yes
    119
    LinkAgg3 0x000000000000 903  DOWN  lan903 snap903   8    ETHER       Yes
    119
    LinkAgg4 0x000000000000 904  DOWN  lan904 snap904   9    ETHER       Yes
    119
     
    'netstat -in'
     
    Example:
    [root@sparky: tmp]# netstat -in
    Name      Mtu  Network         Address         Ipkts   Ierrs Opkts   Oerrs
    Coll
    lo0       4136 127.0.0.0       127.0.0.1       2935    0     2935    0     0
    lan900    1500 10.112.151.0    10.112.151.142  43284362 25    16608331 0
    0
     
     
    If you use APA (Auto Port Aggregation), the command to display the output of
    APA is a 'lanscan -q'.
     
    Example:
    [root@sparky: tmp]# lanscan -q
    0
    900        1 2
    901
    902
    903
    904
     
    You can also test the speed of the network connection by choosing a medium
    sized file and doing a secure copy to some remote host.
     
    Example:
    [root@sparky: tmp]# scp sparky-2.log etc-k09:/tmp
    Password:
    sparky-2.log                      100%  203KB   3.2MB/s   00:00
     
    Driver Configuration file. Use this to match the lan found with a 'lanscan'
    or a 'netstat -in' with its proper driver.
     
     name in /etc/rc.config.d/
     Driver  duplex file     Description                     Software
     ----------------------------------------------------------------
     btlan    hpbtlanconf
     btlan0  hpeisabtconf
     btlan1  hpbasetconf
     btlan3  hpbase100conf   Internal (core) LAN card
     btlan4  hpgsc100conf
     btlan5  hppci100conf    "external" PCI card             A5230A
     btlan6  hpsppci100conf  V-class PCI lan card
     gelan   hpgelanconf     GigaBit card                    A4926A
     
    If you are getting replacement parts from a vendor, I would try and make
    sure you can burn them in somewhere for at least a couple of days.  You
    might also look into keeping track of any serial numbers or unique ids on
    the parts.
     
     
     
    MOTHER BOARDS
     
    log files on /var/adm/diag 
     
    Regardless of which vendor might replace a motherboard, HP will most likely
    have to be present to flash proprietary firmware and set the CPU ID.
     
    The cpu tests would probably get most of your mother board cache issues.
     
    If you are getting replacement parts from a vendor, I would try and make
    sure you can burn them in somewhere for at least a couple of days.  You
    might also look into keeping track of any serial numbers or unique ids on
    the parts.
     
     
     
    FIRMWARE (GSP, PDC)
     
    You can cpu firmware at boot and upgrade as needed.  'stm' can often give 
    you firmware on other components.
     
    Here is an example of a GSP firmware upgrade:
     
    The procedure for upgrading GSP Firmware (from B.02.07 to B.02.20):
     
    1) If Firmware is older than B.02.15 then it must be updated in step
    sequence to revision B.02.15 first before it can be updated to B.02.20.
    Install GSP Firmware B.02.15 patch PF_CCANGSPB0215  (best done by HP CE as
    laptop needed to set up private network to talk to GSP card)
     
    2) While HP-UX patches are not technically required they are highly
    recommended to be installed at B.02.15 update (network patch requires a
    reboot).
     
    For 11.00:
      1. PHNE_27393, Cumulative Mux and Pty Patch, superseded by PHNE_27788,
    then by PHNE_29519 
      2. PHCO_27370, mksf cumulative patch 
     
    For 11.11:
      1. PHNE_26326, Cumulative Mux and Pty Patch, superseded by PHNE_32616 
      2. PHCO_27243, mksf cumulative patch, 7/23/2002, superseded by PHCO_32202 
     
    3) Install A6696B GSP Firmware B.02.20 patch PF_CCANGSPB0220  (5/19/2003)
     
    4) Server does not have to be shut down during the GSP Firmware update
    itself, but at the end of the GSP Firmware update the server needs to be
    rebooted in order for PDC Online Diagnostics to communicate again with GSP
    (using new GSP Firmware version).
     
    Here is an example of a PDC firmware upgrade:
     
    Upgrading PDC Firmware (requires reboot):
     
    For the RP5430 (L1500) and RP5470 (L3000):
     
    Install Firmware 44.12 patch PF_CARIW4412, which is a user-installable
    patch:
     
    For 11.00:
      PHSS_30631, Warning, superseded by PHSS_30839 (9/15/2004)
     
    For 11.11:
      PHSS_30632 (4/23/2004)
     
    For the RP7405, RP7410 and RP8400:
     
    Install Firmware 6.4 [17.009] (from current 5.0 [16.011]) patch
    PF_CKEYMAT0604, (created 4/4/2005)
     
     
     
    FIBER CHANNEL CARDS
     
    commands: ioscan, fcmsutil, dmesg
     
    The 'fcmsutil' command is a diagnostic tool to be used for the TACHYON
    A3740A, A3591A, A3404A, A3636A and TACHYON TL A5158A, A6684A, A6685A,
    TACHYON XL2 A6795A, DUAL PORT A6826A and FC/GigE Combo A9782A/A9784A
    Fiber Channel Host Bus Adapters. This command provides the ability to
    perform Fibre Channel Test and Echo functionality, read the card's
    registers, etc. This command requires the use of a device file to indicate
    the interface over which the requested command needs to be performed.
     
    Examples (from the man page):
    Print a short listing of the statistics maintained by the driver, with
    /dev/td1 as the device file.
     
         fcmsutil /dev/td1 stat -s
     
    Send 5 echo packets of 200 bytes each to a remote N_Port with loop_id 4,
    with /dev/td1 as the device file
     
               fcmsutil /dev/td1 echo -l 4 200 5
     
    Print a short listing of the statistics of the device whose
    remote-N-Port-ID is 0x02ae4 and with /dev/td1 as the device file
     
               fcmsutil /dev/td1 devstat 0x02ae4
     
    Clear the device statistics of the device whose wwn is 0x100000e002219f45
    and with /dev/td1 as the device file
     
               fcmsutil  /dev/td1 clear_devstat -w 0x100000e002219f45
     
    Perform a Internal loopback test, sending 1000 packets with /dev/fcd2 as the
    device file
     
               fcmsutil /dev/fcd2 lb int crpat 1000
     
    You can use 'stm' to test a fiber card.
     
    Example:
     
    If you are getting replacement parts from a vendor, I would try and make
    sure you can burn them in somewhere for at least a couple of days.  You
    might also look into keeping track of any serial numbers or unique ids on
    the parts.
     
     
     
    Q4 Procedure for Analyzing System Crashes
     
     
     
    Fault codes
     
                K class
     
    The initial FAULT CBF0 means "High PRiority Machine Check occurred".
     
    FLT 5007 means "Directed Error" on the Processor Memory Bus, with a =
    Master ID of 0 (implies CPU 0) and processor memory bus 0. 
     
    FLT 5408 means "Broad Error", master ID of 4 means on IOA 0 (system =
    board).
     
    FLT 5508 means "Broad Error", master ID of 5 means on IOA 1 (system =
    board).
     
    FLT CBFC means "PDCE HPMC processiung complete, failed to branch to OS =
    HPMC, halt CPU", ACTION: FOllow the action for other HPMC codes beginning
    with numerics
     
    FLT CBF1 means "OS did not replace IVA". ACTION: FOllow the action for =
    other HPMC codes beginning with numerics
     
    
    
    
    
    
    
    

    --
                 ---> Please post QUESTIONS and SUMMARIES only!! <---
            To subscribe/unsubscribe to this list, contact majordomo@dutchworks.nl
           Name: hpux-admin@dutchworks.nl     Owner: owner-hpux-admin@dutchworks.nl
     
     Archives:  ftp.dutchworks.nl:/pub/digests/hpux-admin       (FTP, browse only)
                http://www.dutchworks.nl/htbin/hpsysadmin   (Web, browse & search)
    

    image006.jpg
    image005.jpg
    image004.jpg
    image003.jpg
    image002.jpg
    image001.jpg
  • Next message: Ben Le: "[HPADM] HP Technology Forum Conference"

    Relevant Pages