EMC Clariion CX-400 and Solaris - critical, advice needed...

From: Michael Gleibman (Michael.Gleibman_at_sanmina-sci.com)
Date: 06/26/03

  • Next message: Ratliff, Charlotte: "Summary:Configure Network interfaces for best utilization"
    Date: Thu, 26 Jun 2003 08:04:20 -0500
    To: <sunmanagers@sunmanagers.org>
    
    

    Managers,

    Good day to all. We have an EMC Clariion CX-400 connected to two
    SUN-Fire 480R boxes: one runs an Oracle server and uses LUN 0 on the
    EMC, the other acts as an NFS server and uses LUN 1 on the EMC.
    Some time ago the EMC started doing weird things: it trespassed LUNs
    between SPs, failed disks, restored them, and so on. In the end we
    even lost one of the LUNs and had to restore some data from backups.
    Since then, both SPs in the box have been replaced with the latest h/w
    revision, the firmware has been upgraded to the latest release, and one
    disk has been replaced.
    The box itself hasn't rebooted since, but we still see some weird
    things: once or twice a day, messages like these appear in both
    Solaris servers' messages files:

    <QUOTE>

    Jun 25 18:33:21 server1 lpfc: [ID 803620 kern.info] NOTICE: lpfc0:031:Link Down Event received Data: 6 6 0 20
    Jun 25 18:33:52 server1 emcp: [ID 801593 kern.notice] Error: Path Bus 0 Tgt 1 Lun 1 to APM00023700476 is dead.
    Jun 25 18:33:52 server1 emcp: [ID 801593 kern.notice] Error: Killing bus 0 to CLARiiON APM00023700476 port B0.
    Jun 25 18:33:52 server1 emcp: [ID 801593 kern.notice] Error: Path Bus 0 Tgt 1 Lun 0 to APM00023700476 is dead.
    Jun 25 18:34:22 server1 emcp: [ID 801593 kern.notice] Info: Volume 600601EB540A00009252A4E9D727D711 followed to SPA
    Jun 25 18:35:41 server1 lpfc: [ID 242157 kern.info] NOTICE: lpfc0:031:Link Up Event received Data: 7 7 1 20
    Jun 25 18:38:52 server1 emcp: [ID 801593 kern.notice] Info: Path Bus 0 Tgt 1 Lun 0 to APM00023700476 is alive.
    Jun 25 18:38:52 server1 emcp: [ID 801593 kern.notice] Info: Volume 600601EB540A00009252A4E9D727D711 followed to SPB
    Jun 25 18:38:52 server1 emcp: [ID 801593 kern.notice] Info: Path Bus 0 Tgt 1 Lun 1 to APM00023700476 is alive.

    </QUOTE>
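
    To see how long each outage lasts on the host side, the messages
    above can be paired up (Link Down with Link Up, path dead with path
    alive). What follows is only a rough sketch: the names parse and
    outages are made up for illustration, the sample lines are assumed to
    be unwrapped to one entry per line, the year is assumed (syslog does
    not record it), and the month abbreviation is assumed to be English.

```python
import re
from datetime import datetime

# Entries copied from the messages file above, one per line (unwrapped).
SAMPLE = """\
Jun 25 18:33:21 server1 lpfc: [ID 803620 kern.info] NOTICE: lpfc0:031:Link Down Event received Data: 6 6 0 20
Jun 25 18:33:52 server1 emcp: [ID 801593 kern.notice] Error: Path Bus 0 Tgt 1 Lun 1 to APM00023700476 is dead.
Jun 25 18:33:52 server1 emcp: [ID 801593 kern.notice] Error: Killing bus 0 to CLARiiON APM00023700476 port B0.
Jun 25 18:33:52 server1 emcp: [ID 801593 kern.notice] Error: Path Bus 0 Tgt 1 Lun 0 to APM00023700476 is dead.
Jun 25 18:34:22 server1 emcp: [ID 801593 kern.notice] Info: Volume 600601EB540A00009252A4E9D727D711 followed to SPA
Jun 25 18:35:41 server1 lpfc: [ID 242157 kern.info] NOTICE: lpfc0:031:Link Up Event received Data: 7 7 1 20
Jun 25 18:38:52 server1 emcp: [ID 801593 kern.notice] Info: Path Bus 0 Tgt 1 Lun 0 to APM00023700476 is alive.
Jun 25 18:38:52 server1 emcp: [ID 801593 kern.notice] Info: Volume 600601EB540A00009252A4E9D727D711 followed to SPB
Jun 25 18:38:52 server1 emcp: [ID 801593 kern.notice] Info: Path Bus 0 Tgt 1 Lun 1 to APM00023700476 is alive.
"""

# timestamp, host, tag, "[ID n facility.level]", message
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) (\w+): \[ID \d+ [\w.]+\] (.*)$"
)
LINK_RE = re.compile(r"(lpfc\d+):\d+:Link (Down|Up) Event")
PATH_RE = re.compile(r"(Path Bus \d+ Tgt \d+ Lun \d+) to \S+ is (dead|alive)")


def parse(text, year=2003):
    """Return (datetime, host, tag, message) tuples; the year is an assumption."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a syslog entry
        stamp, host, tag, msg = m.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        events.append((when, host, tag, msg))
    return events


def outages(events):
    """Pair each down/dead event with the next up/alive event for the same object."""
    down_since = {}
    found = []
    for when, host, tag, msg in events:
        m = LINK_RE.search(msg) if tag == "lpfc" else PATH_RE.search(msg)
        if m is None:
            continue
        key = (host, m.group(1))
        if m.group(2) in ("Down", "dead"):
            down_since.setdefault(key, when)  # keep the first failure time
        elif key in down_since:
            seconds = int((when - down_since.pop(key)).total_seconds())
            found.append((key[1], seconds))
    return found


if __name__ == "__main__":
    for name, seconds in outages(parse(SAMPLE)):
        print(f"{name}: down for {seconds} s")
```

    Run over a whole messages file, the same pairing would show whether
    the outages always have the same length or recur at a regular time of
    day.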

    In the Clariion log, the following messages appear:

    <QUOTE>

    06/25/2003 18:28:01 (2580)Storage Array Faulted
    06/25/2003 18:27:35 (71310007)CMID Transport Device 0: 0 gate(s) found.
      00 00 04 00 03 00 4e 00 d3 04 00 00 07 00 31 61 07 00 31 61
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      71 31 00 07 cmid
    06/25/2003 18:27:35 (71170008)Fibre Channel loop down on logical port 3.
      00 00 04 00 02 00 56 00 d3 04 00 00 08 00 17 61 08 00 17 61
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      71 17 00 08 scsitarg
    06/25/2003 18:27:35 (71310007)CMID Transport Device 1: 0 gate(s) found.
      00 00 04 00 03 00 4e 00 d3 04 00 00 07 00 31 61 07 00 31 61
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      71 31 00 07 cmid
    06/25/2003 18:27:35 (71170008)Fibre Channel loop down on logical port 2.
      00 00 04 00 02 00 56 00 d3 04 00 00 08 00 17 61 08 00 17 61
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      71 17 00 08 scsitarg
    06/25/2003 18:27:35 (71180009)CMI Transport Device 0: 0 gate(s) found.
      00 00 04 00 03 00 4c 00 d3 04 00 00 09 00 18 61 09 00 18 61
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      71 18 00 09 cmi
    06/25/2003 18:27:35 (71180009)CMI Transport Device 1: 0 gate(s) found.
      00 00 04 00 03 00 4c 00 d3 04 00 00 09 00 18 61 09 00 18 61
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      71 18 00 09 cmi
    06/25/2003 18:27:35 SP A (908) Fault - Cache Disabling [0x00] 0 0
    06/25/2003 18:27:47 (71100001)Lost contact with 7027208860010650:1 on conduit 3.
      00 00 04 00 04 00 4c 00 d3 04 00 00 01 00 10 61 01 00 10 61
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      71 10 00 01 mps
    06/25/2003 18:27:47 (71100001)Lost contact with 7027208860010650:1 on conduit 14.
      00 00 04 00 04 00 4c 00 d3 04 00 00 01 00 10 61 01 00 10 61
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      71 10 00 01 mps
    06/25/2003 18:27:47 SP A (944) Hard Peer Bus Error [0x02] 0 0
    06/25/2003 18:27:47 SP A (944) Hard Peer Bus Error [0x01] 0 0
    06/25/2003 18:27:47 SP A (654) Cache Dumping [0xdd] 0 0
    06/25/2003 18:27:47 (40004001)#THREADO: Peer died in Run: 1073774611
      40 00 40 01 MessageDispatcher
    06/25/2003 18:27:47 (40000001)Entering Main Alert Handler
      40 00 00 01 MessageDispatcher
    06/25/2003 18:27:47 (40000001)#THREADTL: Processing translog on peer death
      40 00 00 01 MessageDispatcher
    06/25/2003 18:27:48 SP B (a11) SP Removed [0x04] 0 0
    06/25/2003 18:27:48 (40000001)Attempting to reconnect to peer
      40 00 00 01 MessageDispatcher
    06/25/2003 18:27:51 SP A (657) Cache Dump Completed [0xdc] 0 0

    </QUOTE>
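
    The array log above is not in time order (the 18:28:01 entry comes
    first), so sorting it may make the sequence of events easier to
    follow. A small sketch, with assumptions: the function name timeline
    is made up, the sample holds only some of the entries with the hex
    dumps and bracketed fields left out, and the dates are read as
    MM/DD/YYYY.

```python
import re
from datetime import datetime

# A subset of the array log entries, one per line, hex dumps omitted.
SAMPLE = """\
06/25/2003 18:28:01 (2580)Storage Array Faulted
06/25/2003 18:27:35 (71170008)Fibre Channel loop down on logical port 3.
06/25/2003 18:27:35 SP A (908) Fault - Cache Disabling
06/25/2003 18:27:47 SP A (944) Hard Peer Bus Error
06/25/2003 18:27:47 (40004001)#THREADO: Peer died in Run: 1073774611
06/25/2003 18:27:48 SP B (a11) SP Removed
06/25/2003 18:27:51 SP A (657) Cache Dump Completed
"""

# timestamp, optional "SP A"/"SP B", hex event code in parentheses, message
ENTRY_RE = re.compile(
    r"^(\d\d/\d\d/\d{4} \d\d:\d\d:\d\d) (?:(SP [AB]) )?\(([0-9a-f]+)\)\s*(.*)$"
)


def timeline(text):
    """Return (datetime, sp, code, message) rows sorted by time."""
    rows = []
    for line in text.splitlines():
        m = ENTRY_RE.match(line.strip())
        if m is None:
            continue  # hex dump or continuation line
        stamp, sp, code, msg = m.groups()
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
        rows.append((when, sp or "-", code, msg))
    # sorted() is stable, so entries with equal timestamps keep log order
    return sorted(rows, key=lambda row: row[0])


if __name__ == "__main__":
    for when, sp, code, msg in timeline(SAMPLE):
        print(f"{when:%H:%M:%S}  {sp:4}  {code:>8}  {msg}")
```

    The array timestamps and the Solaris timestamps come from different
    clocks, so the two timelines should not be lined up second for second
    without checking the offset first.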

    After that, the SP is found again and initialized, the LUN is
    trespassed back to its default SP, and all looks OK... until the next
    time. It looks like one of the SPs reboots for no particular reason...
    Has anyone encountered a problem like this? What could be the
    cause? EMC support is involved in this, of course, but I'd like to ask
    for other admins' thoughts too...
    The Solaris boxes run Solaris 8, 108528-18; the HBA is Emulex LightPulse FC
    SCSI/IP 5.01b.
    All thoughts and advice are highly appreciated... I will summarize, of
    course.
        Thanks,
            Michael
    _______________________________________________
    sunmanagers mailing list
    sunmanagers@sunmanagers.org
    http://www.sunmanagers.org/mailman/listinfo/sunmanagers

