RE: OT: Sparc not dead yet




-----Original Message-----
From: AlexNOSPAMDaniels@xxxxxxxxxxxxx
[mailto:alexdaniels@xxxxxxxxxxxxx]
Sent: April 24, 2006 5:35 PM
To: Info-VAX@xxxxxxxxxxxx
Subject: Re: OT: Sparc not dead yet

Main, Kerry wrote:

If you have an active-active OpenVMS cluster with a DLM managing access
to cluster files and print/batch jobs, then you can schedule entire
systems to be brought down for proactive maint and/or upgrades - all
with zero impact on application availability.

Simply set a flag at the OS level which causes all new connections to go
to other servers and when all current connections have finished on that
one server, simply shut the server down for the planned fix and/or
upgrade. When the server reboots, it again starts taking its connection
and processing loads. [Batch jobs need to be considered as well]
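
The drain-then-shutdown sequence described above can be sketched roughly
as follows. This is only an illustrative simulation, not how OpenVMS or
any load broker implements it: the Node class, the accepting field
(standing in for the OS-level flag) and both function names are
hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    accepting: bool = True   # hypothetical stand-in for the OS-level flag
    active: int = 0          # current connections on this node
    up: bool = True


def route_new_connection(nodes):
    """Send a new connection to the least-loaded node still accepting work."""
    candidates = [n for n in nodes if n.up and n.accepting]
    if not candidates:
        raise RuntimeError("no node is accepting new connections")
    target = min(candidates, key=lambda n: n.active)
    target.active += 1
    return target


def drain_and_shutdown(node):
    """Stop new connections; shut down only once the node is empty."""
    node.accepting = False
    if node.active == 0:
        node.up = False
        return True
    return False


nodes = [Node("A", active=2), Node("B", active=1), Node("C", active=1)]
first_try = drain_and_shutdown(nodes[0])      # A still has 2 sessions
placed = [route_new_connection(nodes).name for _ in range(4)]
nodes[0].active = 0                           # existing sessions finish
second_try = drain_and_shutdown(nodes[0])     # now A can go down
```

In this toy run the first shutdown attempt is refused because node A
still has sessions, all new connections land on B and C, and the second
attempt succeeds once A has drained.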

I've implemented Load Broker and supplemented it with LAN/IP failover
for more clusters than I can count.

However you still need to wait for all the existing users to log off
and batch jobs to finish. Performance and/or availability is reduced
during this time. Being able to hot-swap a PCI-X card would be, I would
imagine, preferable.


Yes, you do have to wait for users to log off, connections to finish,
and batch jobs to complete, but for most environments, this is not an
issue. If you have long running batch jobs (multiple days), then that
obviously needs to be considered. I know every applic is different, but
imho anything running that long should be designed with some
checkpointing anyway.
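
For what it's worth, here is a minimal sketch of the kind of
checkpointing meant here, so a long batch job can be stopped before a
planned shutdown and resumed afterwards. It assumes the work can be
split into independent items; run_batch, the JSON checkpoint file and
the stop_after parameter are all hypothetical names for illustration.

```python
import json
import os
import tempfile


def run_batch(items, checkpoint_path, process, stop_after=None):
    """Process items in order, recording progress after each one."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    done = 0
    for i in range(start, len(items)):
        if stop_after is not None and done >= stop_after:
            return False          # interrupted, e.g. node being shut down
        process(items[i])
        # Write to a temp file and rename, so an interruption never
        # leaves a half-written checkpoint behind.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_index": i + 1}, f)
        os.replace(tmp, checkpoint_path)
        done += 1
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)    # job complete; no restart point needed
    return True


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "job.ckpt")
seen = []
finished_first = run_batch(list(range(10)), ckpt, seen.append, stop_after=4)
finished_second = run_batch(list(range(10)), ckpt, seen.append)
```

The first call stops after four items and the second picks up at the
fifth. Note that an item is processed before its checkpoint is written,
so a crash between the two steps would repeat that one item on restart;
the processing step needs to tolerate that.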

Would you be willing to pay extra for this hot swap capability vs using
VMS servers on std HW (cheaper) and simply shutting servers down proactively
later that day? Since replacing a card in a set might only take 5-10
minutes and since PCI cards do not fail that often and since notifying
users is not required with this solution (use load balancing to achieve
zero availability impact), would this not be a more effective overall
solution?

On one engagement, I set up a 3-node, very mission-critical ES45
OpenVMS cluster whereby:
- 2 prod nodes designed to carry 100% of peak interactive loads in case
1 was not available. One node was the batch node for background
reporting. Two nodes were production for end users.
- NIC Teaming (used TCPware V5.6-2 mind you on this project) for 3
separate VLANS on each server. Used dual NIC cards with each VLAN port
being on different card and PCI channel than its mate. Provided
transparent fail-over + transmit load balancing.
- dual SAN FC cards (2Gb cards were less than 10% taxed even when
servers were extremely busy with CPU and IO loads).
- all NIC and SAN card loads split across multiple PCI buses.
- dual Cisco routers trunked with one port of each VLAN connected to
each.
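
The sizing rule in the first bullet (the remaining production node must
carry 100% of peak load when its mate is down) comes down to a simple
check. The sketch below is only an illustration: the function name and
the capacity numbers are hypothetical, in arbitrary units, and are not
figures from that project.

```python
def survives_one_node_loss(node_capacities, peak_load):
    """True if peak_load is still covered with the largest node down."""
    if len(node_capacities) < 2:
        return False
    worst_case = sum(node_capacities) - max(node_capacities)
    return worst_case >= peak_load


# Two production nodes, each sized for the full peak load.
ok_sized = survives_one_node_loss([100, 100], peak_load=100)

# Two production nodes that only cover the peak together.
undersized = survives_one_node_loss([60, 60], peak_load=100)
```

Removing the largest node is the worst case, so a cluster that passes
this check can lose any single node at peak without running short.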


If you have this capability, local HW hot-swapping becomes much less of
an issue. Especially if FC/NIC adapters are implemented with teaming
design. If one adapter fails, simply schedule a time for the system to
be taken down and have it replaced. Since you do not need to tell end
users about this server going, they will not care.

Been working with peecee's lately? I didn't know the "teaming" phrase
had made it to VMS. I thought in terms of NICs we said LAN failover and
Failsafe IP?


See above. With that project I worked on, using TCPware, we were able to
telnet to a server, start a monitor system type command to generate
activity, pull the associated NIC cable and the session continued
running after only about a 2-3 second pause. No errors and no loss of
data. With TCPware, we could see that the NIC failed over properly. This
also supported fail-back, i.e. when we re-connected the NIC cable
after a 2-3 second pause, the connection automatically failed back to
the original port.
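
As a rough model of the fail-over and fail-back behaviour seen in that
test, consider the sketch below. It is not TCPware's implementation: the
NicTeam class and the port names are hypothetical, and the 2-3 second
detection pause is not modelled.

```python
class NicTeam:
    """Two-port team: traffic uses the primary while its link is up."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby
        self.link_up = {primary: True, standby: True}

    def set_link(self, port, up):
        self.link_up[port] = up

    def active_port(self):
        if self.link_up[self.primary]:
            return self.primary      # automatic fail-back when link returns
        if self.link_up[self.standby]:
            return self.standby      # fail-over to the mate
        return None                  # both links down


team = NicTeam("nic0", "nic1")
history = [team.active_port()]
team.set_link("nic0", False)         # cable pulled
history.append(team.active_port())
team.set_link("nic0", True)          # cable re-connected
history.append(team.active_port())
```

Here history ends up as nic0, nic1, nic0, which mirrors the cable-pull
test: traffic moves to the mate and returns to the original port once
the link is back.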


As I'm sure you're aware FailSafe IP also gives you added performance
benefits for your outgoing traffic, if they're in the same subnet. Losing
a card drops your performance. And yes the obvious workaround is to
deploy FailSafe on top of LAN Failover, with the added cost.


As you indicated, the loss is in transmit only i.e. perhaps if you had a
big FTP background load, there might be some loss, but remember that the
overall load is also split across the other servers in the cluster as
well. I doubt end users would notice the impact of one NIC failing on
one server in the cluster when its mate takes over.

FC cards same issue, you lose performance, possible path changes across
the cluster et al.


Heck, the 2Gb cards I have used for the last 2 years or so have not been
pushed at all - even with high IO loads. Yes, there might be some path
changes, but that is transparent to applications and end users.


So yes one can wait for the users/batch jobs to finish, then drop a
node, but it's not ideal and customers are left having to purchase more
hardware to mitigate for these times.


That's what mission critical Customers do as part of their base design.
In most Cust environments I have seen, normal users tend to logout at
5-8:00pm or they get disconnected for security reasons i.e. forgot to
logout before heading home.

In another Cust environment, using this same strategy i.e. using std
Alpha servers in a 3 node cluster, they would set the interactive flags
and batch jobs up before going home and then shut that one server down
at 9:00AM the next day for the planned quick maint or upgrade i.e. in
prime time just after all the users have logged on (7-8:30 was that
time). Their reasoning was that this reduced overtime and provided a
better work environment for their IT staff as they did not need them
working late or on weekends all the time.

Perhaps shutting servers down in prime time is not for everyone, but
this Cust really loved not having to tell end users a server was going
down because planned downtime that impacted end users was next to
impossible to schedule.

Regards

Kerry Main
Senior Consultant
HP Services Canada
Voice: 613-592-4660
Fax: 613-591-4477
kerryDOTmainAThpDOTcom
(remove the DOT's and AT)

OpenVMS - the secure, multi-site OS that just works.