Re: WVNETcluster uptime reaches 10 years...



Main, Kerry wrote:
-----Original Message-----
From: Bill Todd [mailto:billtodd@xxxxxxxxxxxxx]
Sent: January 8, 2006 11:55 PM
To: Info-VAX@xxxxxxxxxxxx
Subject: Re: WVNETcluster uptime reaches 10 years...


Main, Kerry wrote:

-----Original Message-----
From: bill@xxxxxxxxxxxxxxxxxxxx

...


I have multiple servers with shared file systems so that any update on
any system is universal. I can (and do) do rolling updates so that
system availability is continuous. There are only two things missing.
A "cluster uptime" value and thinking it mattered enough to care.



And could you let us know what happens to the incoming writes when the
system hosting the writes for other systems via the network file sharing
you are talking about has to be rebooted or just plain halts or is
powered off?

Why should he bother? I let you know multiple times several years ago back when NFS was the *only* cluster file system supported in Solaris (the client times out waiting for the server's response, resubmits the write which is of course idempotent, and hits the fail-over partner which has taken over the IP address previously used by the failed server - seeing no problem save for what appears to have been a random connection hiccup which the retry worked around), but you just sailed on obliviously spewing the same incompetent drivel - and are obviously still doing so.
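
For anyone who wants that retry path spelled out, here is a minimal sketch in Python. It is a toy model under stated assumptions, not NFS itself: the Node, Network and client_write names are hypothetical, a raised exception stands in for the client's timeout, and the on_timeout hook stands in for the fail-over partner taking over the service address while the client waits.

```python
# Toy model of "time out, resubmit, land on the fail-over partner".
# Every name here is hypothetical; this is not the NFS protocol or its API.

class ServerDown(Exception):
    """Stands in for the client timing out while waiting for a reply."""


class Node:
    def __init__(self, name, storage):
        self.name = name
        self.storage = storage  # storage shared by both cluster nodes
        self.alive = True

    def write(self, path, offset, data):
        if not self.alive:
            raise ServerDown(self.name)
        # Same bytes at the same offset give the same result however many
        # times the request arrives, so a blind retry is safe (idempotent).
        buf = self.storage.setdefault(path, bytearray())
        end = offset + len(data)
        if len(buf) < end:
            buf.extend(b"\0" * (end - len(buf)))
        buf[offset:end] = data
        return len(data)


class Network:
    """Maps a service IP address to whichever node currently holds it."""

    def __init__(self):
        self.owner = {}

    def send_write(self, ip, path, offset, data):
        return self.owner[ip].write(path, offset, data)


def client_write(net, ip, path, offset, data, retries=3, on_timeout=None):
    """Resubmit the same write until it succeeds; return the attempt count."""
    for attempt in range(1, retries + 1):
        try:
            net.send_write(ip, path, offset, data)
            return attempt
        except ServerDown:
            if on_timeout is not None:
                on_timeout()  # assumption: fail-over completes during the wait
    raise ServerDown(ip)
```

From the client's side the only visible effect is that one request needed a second attempt; the address it talks to never changes.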




Ok, so a backup server takes over the personality of the primary server.

Sort of: this is another part of what I explained to you multiple times years ago but which apparently went completely over (or through) your head.


Now, what happens to the stuff that was running on the backup server?

It keeps running - just like things on a VMS cluster node keep running when that node has to pick up some of the slack from a failed node.


Since it has a new IP address, what happens to all those connections to
stuff running on the backup server?

It doesn't have a *different* new IP address, it has an *additional* new IP address. So the stuff that was already running there continues to run using the 'old' address, and the new stuff fails over using the 'new' one.
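
The "additional, not different" point can be shown with a very small sketch. The Host class and the 192.0.2.x addresses below are made up for illustration and do not correspond to any Solaris interface.

```python
# Hypothetical model: a host answers on a set of addresses, and a take-over
# adds to that set instead of replacing what is already there.

class Host:
    def __init__(self, name, address):
        self.name = name
        self.addresses = {address}  # addresses this host currently answers on

    def take_over(self, address):
        # The failed partner's address is added; the host's own address stays.
        self.addresses.add(address)

    def answers(self, address):
        return address in self.addresses


backup = Host("backup", "192.0.2.20")
backup.take_over("192.0.2.10")  # address previously held by the failed primary
```

After the take-over, backup.answers() is true for both addresses, so clients of the old workload and clients of the failed-over workload are both served.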



In addition, since one server is hosting all the writes,

One server is hosting all the writes for *one* filesystem out of potentially many (perhaps aggregated into a single directory hierarchy via mount points).


what happens if
the primary server gets overloaded, but has not literally "failed"?

Well, to start with, Solaris is a mature enterprise OS that takes graceful degradation under load as seriously as VMS does, so the situation you describe should be comparably rare in any well-managed system.


Since that one server is the master for all the other system writes,

As I just noted above, that premise is false, hence any conclusion you attempt to draw from it is garbage. Many servers in the cluster can each concurrently be hosting many writes, each to one portion of the overall file resource.
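
As a rough illustration of writes being spread over several hosting servers, the sketch below picks the hosting server for a path by finding the deepest mount point that contains it. The mount layout and node names are invented for the example.

```python
import posixpath

# Hypothetical layout: each filesystem in the hierarchy has its own host.
MOUNT_HOSTS = {
    "/": "node1",
    "/home": "node2",
    "/data": "node3",
    "/data/archive": "node1",
}


def hosting_server(path, mount_hosts=MOUNT_HOSTS):
    """Return the server hosting the filesystem that contains `path`."""
    path = posixpath.normpath(path)
    while True:
        if path in mount_hosts:
            return mount_hosts[path]
        parent = posixpath.dirname(path)
        if parent == path:  # walked to the top without finding a mount
            raise KeyError("no filesystem configured for this path")
        path = parent
```

Writes under /home and writes under /data land on different servers, so no single node hosts the writes for the whole hierarchy.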


And that's just with the old NFS approach. My (somewhat vague - it's been a while since I paid any attention to these details) recollection is that the real cluster file system which Sun has had for at least 3 - 4 years now allows direct writes from client servers to shared devices (like Tru64's: a metadata hosting coordinator makes sure the clients don't trip over each other).

....

Most folks think a DLM is required to do direct IO's from each system.

Most folks would be wrong, then: a central lock manager can handle that just fine (and does in quite a few commercially-successful systems - SANergy having been one of the earlier ones).
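
A central lock manager of the kind described can be sketched in a few lines. This is an illustration only, not SANergy's or Sun's design: the class and method names are hypothetical, and a refused request simply returns False where a real coordinator would queue it.

```python
# Hypothetical single coordinator for one filesystem. Clients ask it for a
# shared or exclusive lock, then do their I/O directly to the shared device.

class CentralLockManager:
    def __init__(self):
        self._readers = {}  # resource -> set of nodes holding shared locks
        self._writer = {}   # resource -> node holding the exclusive lock

    def acquire(self, node, resource, exclusive=False):
        writer = self._writer.get(resource)
        readers = self._readers.get(resource, set())
        if writer is not None and writer != node:
            return False  # another node is writing
        if exclusive:
            if readers - {node}:
                return False  # other nodes are still reading
            self._writer[resource] = node
            return True
        self._readers.setdefault(resource, set()).add(node)
        return True

    def release(self, node, resource):
        if self._writer.get(resource) == node:
            del self._writer[resource]
        self._readers.get(resource, set()).discard(node)
```

Only the lock requests go through the coordinator; the data itself never does, which is why the coordinator sees far less traffic than the storage.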




A distributed lock manager spreads the cluster traffic around each
server such that a single node becoming very busy does not negatively
impact the performance on secondary systems.

Tut, tut: that's a completely different statement from your original one (still right up there above). Not to mention being something of a red herring: do you have any idea just how intense file activity (and actual contention) has to get before lock-management scaling actually starts to become a serious issue? And once again don't forget that there's not a single central lock manager for the entire system: each individual (sub-)filesystem has its own, and that hosting load can be distributed around the cluster as necessary.


The VMS DLM is a truly wondrous beastie, but almost any real-world scaling issues it solves are more likely to be for the manifold uses it is put to *other* than file-access coordination.



Or perhaps you could expand on how you would shut one entire site down
without telling the end users in a multi-site config and not impact
application availability?

Applications (and application groups) in Solaris have for several years been able to be bound to fail-overable IP addresses just as file systems have - again something which I've explained to you in the past.




Yes, fail-over which means you have a pile of HW sitting at the
remote site either doing nothing waiting for a problem to occur on the
primary or running some secondary loads and/or connections which get
thrown out when the fail-over happens.

Wrong again: see above. Whether failing over to another local node or to a remote site, the node taking up the load can continue doing the things it was already doing, just as is the case in a VMS cluster when the survivors take up the load from a failed member.


Furthermore, the various services provided by the failed node can be distributed across multiple survivors - though IIRC via predetermined static assignment on a per-service basis (which is somewhat less elegant but appears to meet real-world needs fairly well).
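
Predetermined per-service assignment might look something like the sketch below. The service names, node names and the place_services helper are all hypothetical; the point is only that each service carries its own ordered list of candidate nodes, so one node's services can scatter across several survivors.

```python
# Hypothetical static configuration: each service lists its preferred nodes
# in order, and runs on the first one that is still alive.
FAILOVER_ORDER = {
    "nfs-home": ["node1", "node2", "node3"],
    "nfs-data": ["node1", "node3", "node2"],
    "db": ["node2", "node3", "node1"],
}


def place_services(failover_order, alive):
    """Map each service to its first live candidate (None if none is alive)."""
    placement = {}
    for service, candidates in failover_order.items():
        placement[service] = next((n for n in candidates if n in alive), None)
    return placement
```

With all three nodes up, node1 runs both NFS services; if node1 fails, nfs-home moves to node2 and nfs-data to node3, while db stays where it was.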

Which of course means that systems and sites can back each other up in a fully active/active configuration.


In addition, how does one differentiate between a failed primary and one that is just extremely slow due to the loads on that primary system - what process on the solution makes the call to fail-over the application?

That I'd have to do some actual digging to ascertain, and I see little reason to do your homework for you given how little you've managed to benefit from it when I've done so in the past. Besides, if you didn't fancy the answer you'd likely just ignore it and go on spewing the same kind of false FUD you've been spewing for years.


- bill