Re: High Availability Options

From: Will Hartung (willh_at_msoft.com)
Date: 12/06/03


Date: Fri, 5 Dec 2003 16:47:20 -0800


"Tom Combs" <combs@magnet.fsu.edu> wrote in message
news:bqqq9m$72k$1@news.fsu.edu...
> I have been looking at the www.linux-ha.org web site as
> a possible solution. Though this is geared to Linux, it
> says that the software will run under Solaris. I'm not quite
> ready to switch my email server to Linux, even though it is
> tempting, and would really like to get a Solaris solution
> if it is cost competitive.

Are you talking about an SMTP/POP/IMAP server, or simply incoming SMTP?

I ask because the nature of the problem changes. With an SMTP-only
server, downtime isn't as critical as it is for a POP/IMAP server
that is directly serving users. SMTP handles lapses in service
fairly well, because sending servers queue mail and retry (i.e. you
don't need 10 second failover; several minutes would hardly be noticed).

> Does anyone have any experience with High availability/fail over
> setups? Has anyone used the HA software from linux-ha.org on
> Solaris?

For a pair of application servers, we wrote our own clustering system.

We simply added a new, private network between the two machines using a
crossover cable, and set up a shared IP that whichever machine was
"online" would use.

The secondary machine would monitor the primary machine over both the main
network (that clients used to connect to the machines) and the private
network.

We only used the main network as our indicator of "failure" of the primary
machine (that is, we would not fail over simply because something happened
to the private network). This is because only the main network affects the
actual client.
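
A minimal sketch of that rule in Python, purely for illustration (the
original scripts were Perl and are not shown in this post). The names
ProbeResult and should_fail_over are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of probing the primary over each network (hypothetical type)."""
    main_ok: bool      # primary answered on the client-facing network
    private_ok: bool   # primary answered on the crossover link


def should_fail_over(probe: ProbeResult) -> bool:
    # Only the main network counts as a failure indicator; a dead
    # private link on its own never triggers failover.
    return not probe.main_ok


def can_ask_primary_to_release(probe: ProbeResult) -> bool:
    # The private link is still useful for telling the primary to
    # give up the shared IP when only the main network has failed.
    return probe.private_ok
```

With this split, a broken crossover cable leaves the cluster alone, while a
dead main NIC on the primary triggers failover even though the machine is
still up.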

We felt there were 2 ways for the system to fail:
o Machine failure -- system simply dies (power, cpu, etc)
o Network failure -- the connectivity fails (NIC, cable, intervening
switch/hub/etc)

We weren't concerned with actual application failure, simply host failure.

We used a simple ping test from the secondary to the primary to determine
"liveness".
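
A ping-based liveness check along these lines could look roughly like the
Python sketch below. Several things here are assumptions, not details from
the setup described: the retry count, the helper names run_quiet and
is_alive, and the ping flags, which are Linux-style and differ on other
platforms such as Solaris.

```python
import subprocess
from typing import Callable, List


def run_quiet(argv: List[str]) -> int:
    """Run a command, discard its output, and return its exit status."""
    return subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode


def is_alive(
    host: str,
    attempts: int = 3,
    run: Callable[[List[str]], int] = run_quiet,
) -> bool:
    """Return True if any one of `attempts` single pings to `host` succeeds."""
    for _ in range(attempts):
        # Assumption: "-c 1" sends one echo request (Linux/BSD syntax).
        # Adjust the argument list for the ping on your platform.
        if run(["ping", "-c", "1", host]) == 0:
            return True
    return False
```

Requiring several misses in a row before declaring the primary dead is a
design choice added here, so that a single dropped packet does not start a
failover.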

If the secondary detected that the primary had failed, it would do three
things.

First, it would try to use the private network to rsh a command to the
primary to unplumb the shared IP, since perhaps only part of the network
had failed and not the actual machine.

Second, it would plumb the shared IP and start up the appropriate servers.

Finally, if the original attempt to unplumb the IP on the primary failed, it
would continually try to contact the primary machine with repeated commands
to unplumb its shared IP.

The issue is that perhaps the failure is only intermittent (such as someone
unplugging it from the switch), and we wanted to ensure that any fight for
the shared IP would be minimal.
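
The three steps can be sketched in Python as below. This is an illustration
under stated assumptions, not the original Perl: the function and parameter
names are hypothetical, the actual rsh/ifconfig commands are left to the
callables passed in, and the retry loop is bounded here (max_retries) even
though the scheme described above keeps trying indefinitely.

```python
from typing import Callable


def fail_over(
    unplumb_primary: Callable[[], bool],   # e.g. rsh over the private link
    plumb_shared_ip: Callable[[], None],   # bring the shared IP up locally
    start_services: Callable[[], None],
    sleep: Callable[[float], None],
    max_retries: int = 10,                 # assumption: bounded for the sketch
    retry_delay: float = 5.0,              # assumption: seconds between tries
) -> bool:
    """Run the failover steps; return True once the primary released the IP."""
    # Step 1: ask the primary to drop the shared IP (it may still be up).
    released = unplumb_primary()

    # Step 2: take over the shared IP and start the servers regardless.
    plumb_shared_ip()
    start_services()

    # Step 3: if the primary did not confirm, keep asking so that any
    # fight over the shared IP is as short as possible.
    retries = 0
    while not released and retries < max_retries:
        sleep(retry_delay)
        released = unplumb_primary()
        retries += 1
    return released
```

Passing the commands in as callables keeps the sequence testable without
touching real interfaces; in practice time.sleep would be passed as sleep.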

The other thing we decided was that this failover was one way. Once we
triggered failover, it would not automatically return to the primary. We
felt we didn't need it to be that sophisticated. If the primary machine went
down for a reason, someone needed to figure that out anyway, so they can
easily restart the services and reset the cluster as necessary.
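
The one-way behaviour amounts to a latch on the secondary. A small Python
sketch, with a hypothetical Standby class and action names chosen only for
illustration:

```python
class Standby:
    """State of the secondary node; failover is a one-way latch."""

    def __init__(self) -> None:
        self.active = False  # True once this node has taken over

    def on_probe(self, primary_alive: bool) -> str:
        """Decide what to do after one liveness probe of the primary."""
        if self.active:
            # Never hand back automatically, even if the primary returns.
            return "stay-active"
        if not primary_alive:
            self.active = True
            return "fail-over"
        return "standby"

    def manual_reset(self) -> None:
        """Called by an operator after the primary has been investigated."""
        self.active = False
```

Once on_probe has returned "fail-over", later probes that see the primary
alive again still return "stay-active" until someone calls manual_reset.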

Finally, neither system starts its important services automatically.
The primary machine won't plumb the shared IP except in the
script that starts up its service, and that needs to be done manually.
Again, we felt that since ideally the machines are "never" down, should they
be brought down we can simply manually start the services. This helps
prevent the issue of, say, the primary losing power, then getting it back
and restarting and then having to do the whole negotiating "who's got the
ball" thing.

Now, these machines didn't have any disk sharing issues, as they relied on
the back end database. Getting drive failover (from, say, a SAN) is a
completely different ball game, and that's where things like Veritas and
such come into play.

Our simple scheme has so far worked for us, and didn't cost more than a
couple hours of man page reading, perl scripting, and testing.

The other systems seem very sophisticated, with a lot of features, and those
features are important for a general purpose solution, but most folks only
have one or two scenarios that really affect them.

Regards,

Will Hartung
(willh@msoft.com)


