SUMMARY: TSM and Sun Cluster, or how to create a resource that is a script in Sun Cluster
- From: Markus Mayer <mymaillists@xxxxxx>
- Date: Tue, 2 Sep 2008 16:42:43 +0200
In the end the only reply I got was from our Sun partner, Martin Pre_laber,
and thankfully through his several further suggestions we found an answer.
To get a script into the cluster framework, specifically in our case one that
starts and stops TSM's dsm scheduler, several steps were needed. The most
critical for me was to stop following the TSM manual where it told me that
all scripts for starting and stopping the TSM scheduler, plus all
configuration files, *must* be on shared storage. This simply doesn't work.
The dsm.opt file for each TSM node (note that a TSM node is different from,
and *not*, a cluster node!) can and generally should be on shared storage,
mainly for consistency. The scripts for starting, stopping and probing the
TSM services, however, need to be local and present on every node at all
times. The cluster framework needs the scripts to be available on every node
in order to add the resource to the cluster. When the script wasn't
available on all nodes at the time I tried to create the resource, the
cluster spat the dummy...
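Since the resource can only be created once the script is in place on every node, it is worth checking that before running clrs. The following is a minimal sketch in plain POSIX sh (so it also runs under ksh), assuming the nodes are reachable with ssh; the function name check_script_on_nodes, the REMOTE variable and the node names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: verify that a local script exists on every cluster
# node and that all copies have the same checksum.
# Assumption: REMOTE is a command taking "<node> <cmd> [args...]"; default ssh.
REMOTE=${REMOTE:-ssh}

check_script_on_nodes() {
    script=$1
    shift
    first=""
    status=0
    for node in "$@"; do
        # cksum prints "<crc> <size> <file>"; keep only crc and size
        sum=$($REMOTE "$node" cksum "$script" 2>/dev/null | awk '{print $1, $2}')
        if [ -z "$sum" ]; then
            echo "$node: $script is missing or unreadable"
            status=1
            continue
        fi
        if [ -z "$first" ]; then
            first=$sum
        elif [ "$sum" != "$first" ]; then
            echo "$node: $script differs from the first node's copy"
            status=1
        fi
    done
    return $status
}
```

For example, `check_script_on_nodes /etc/init.d/dsm.scheduler.cluster.sh node1 node2` prints nothing and returns 0 when both copies exist and match.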
After setting up the scripts and manually testing the TSM client to make sure
the configuration is correct on all nodes, it is possible to add a new
resource of type SUNW.gds (Generic Data Service) to the cluster. To add the
scripts as a GDS resource in the existing resource group, the following
command does the job:
# clrs create -g www-rg -t SUNW.gds \
    -p Start_command="/etc/init.d/dsm.scheduler.cluster.sh /zones/webdata/tsm/dsm.opt start" \
    -p Probe_command="/etc/init.d/dsm.scheduler.cluster.sh webdata probe" \
    -p Stop_command="/etc/init.d/dsm.scheduler.cluster.sh webdata stop" \
    -p Network_aware=false webdata-backup-rs
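The steps around that command can be collected into a small script. This is a sketch, assuming the Sun Cluster 3.2 long command names (clresourcetype, and clresource, of which clrs is the short form) and reusing the values from the example above. Registering SUNW.gds only has to be done once per cluster and may report an error if the type is already registered. The run helper, the DRY_RUN switch and create_backup_resource are hypothetical additions; the commands are only printed until DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical wrapper around the Sun Cluster 3.2 commands used above.
# With DRY_RUN=1 (the default here) the commands are only printed.
DRY_RUN=${DRY_RUN:-1}

run() {
    echo "+ $*"
    if [ "$DRY_RUN" != 1 ]; then
        "$@"
    fi
}

create_backup_resource() {
    rg=$1        # existing resource group, e.g. www-rg
    rs=$2        # new resource name, e.g. webdata-backup-rs
    optfile=$3   # dsm.opt on the group's shared storage
    tag=$4       # string the probe/stop commands pass, e.g. webdata
    script=/etc/init.d/dsm.scheduler.cluster.sh   # local copy on every node

    # Only needed once per cluster; an "already registered" error is ignored.
    run clresourcetype register SUNW.gds || true
    run clresource create -g "$rg" -t SUNW.gds \
        -p Start_command="$script $optfile start" \
        -p Probe_command="$script $tag probe" \
        -p Stop_command="$script $tag stop" \
        -p Network_aware=false "$rs"
    run clresource status "$rs"
}
```

For the example in this post: `create_backup_resource www-rg webdata-backup-rs /zones/webdata/tsm/dsm.opt webdata`.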
So in this example, the script /etc/init.d/dsm.scheduler.cluster.sh is on
local storage on all nodes and is identical across all nodes. The script is
below. The file /zones/webdata/tsm/dsm.opt is on shared storage and switches
between nodes in the event of a failover. The probe and stop commands pass
only "webdata" rather than the full path; the script uses that argument as a
search pattern against the process list, so it only needs to identify the
dsmcad process belonging to this resource group. When the resource group
starts on a different node, the script is run and the resource comes online.
Curiously, the dsmcad daemon process doesn't need to be killed in the event
of a failover; the cluster framework seems to take care of this, killing the
process and allowing a clean failover. Also, making the resource not network
aware (Network_aware=false) removed the need for a logical hostname in the
resource group.
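Failover behaviour like this can be exercised by switching the resource group by hand. A sketch, assuming the Sun Cluster 3.2 clresourcegroup switch and clresource status commands; switch_and_check, DRY_RUN and the node name node2 are hypothetical placeholders.

```shell
#!/bin/sh
# Hypothetical failover check using Sun Cluster 3.2 commands.
# With DRY_RUN=1 (the default here) the commands are only printed.
DRY_RUN=${DRY_RUN:-1}

run() {
    echo "+ $*"
    if [ "$DRY_RUN" != 1 ]; then
        "$@"
    fi
}

# Switch resource group $1 to node $3, then show the state of resource $2.
# Afterwards dsmcad should be running on node $3 only.
switch_and_check() {
    rg=$1
    rs=$2
    target=$3
    run clresourcegroup switch -n "$target" "$rg" || return 1
    run clresource status "$rs"
}
```

Set DRY_RUN=0, run `switch_and_check www-rg webdata-backup-rs node2`, then check on node2 that dsmcad is running with the shared dsm.opt.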
The script to start, stop, and probe the dsm client is below. It could
definitely be done better, but it works. I've also noticed that it may be
possible to start and stop the scheduler process, dsmc, directly from the
script. I haven't tried this, but I expect it would work. Note that I
include this script for informational purposes only; I don't promise that
it will work for you ;-)
#!/bin/ksh
# Generally, we should start up with something like this:
# /opt/tivoli/tsm/client/ba/bin/dsmcad -optfile=/zstorage/build-test/tsm/dsm.opt

# set the necessary environment variables so that TSM doesn't vomit
LC_CTYPE="en_US"
export LC_CTYPE
LANG="en_US"
export LANG
LC_LANG="en_US"
export LC_LANG
LC_ALL="en_US"
export LC_ALL

# work out which argument is the command and which the config file
case "$1" in
'start'|'stop'|'probe')
    COMMAND=$1
    DSM_CONFIG=$2
    ;;
*)
    COMMAND=$2
    DSM_CONFIG=$1
    ;;
esac

# now check what we want to do.
case "$COMMAND" in
'start')
    # echo "starting"
    if test ! -f "$DSM_CONFIG" ; then
        echo "Config file $DSM_CONFIG does not exist, exiting."
        exit 1
    fi
    export DSM_CONFIG
    # Check if there is already a dsmcad process running; if so, ignore
    # the start command
    PS=`ps -ef | grep -v grep | grep -v vi | grep -v probe |
        grep -v zoneadmd | grep -v "dsm.scheduler.cluster.sh" |
        grep -c "$DSM_CONFIG"`
    if test "$PS" -eq 1 ; then
        echo "dsmcad is already started for $DSM_CONFIG, will not start another."
        ps -ef | grep -v grep | grep -v vi | grep -v probe |
            grep -v zoneadmd | grep -v "dsm.scheduler.cluster.sh" |
            grep "$DSM_CONFIG"
        exit 0
    elif test "$PS" -gt 1 ; then
        echo "Seems to be too many dsmcad processes running for $DSM_CONFIG, please check it."
        exit 1
    fi
    /opt/tivoli/tsm/client/ba/bin/dsmcad -optfile="$DSM_CONFIG"
    if test "$?" -ne 0 ; then
        echo "Failed to start the dsm scheduler, exiting"
        exit 1
    fi
    ;;
'stop')
    # echo "stopping"
    # For the most part, we ignore a stop command, as dsmcad should work out
    # for itself that it has to stop its child process when the directory
    # with its password isn't available.
    exit 0
    ;;
'probe')
    # echo "probing"
    # WARNING: The following would produce a bug if "vi" is in the
    # arguments... So make sure you avoid it, OK?
    PS=`ps -ef | grep -v grep | grep -v vi | grep -v probe |
        grep -v zoneadmd | grep -c "$DSM_CONFIG"`
    if test "$PS" -gt 0 ; then
        # echo "Found $PS processes"
        exit 0
    else
        echo "Found no processes"
        exit 1
    fi
    ;;
*)
    # otherwise an invalid command was received, vomit.
    echo "options { start | stop | probe }"
    exit 1
    ;;
esac
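As the author says, the process check could be done better. One possible refinement, sketched here as an assumption rather than something tested on this cluster: count only processes whose command line contains dsmcad followed by an -optfile matching the argument, which removes the need to filter out grep, vi, zoneadmd and the script itself. It uses ps -eo args and the usual [d] trick so grep never matches its own command line, and returns 100, the GDS "complete failure" probe status, when nothing is found. probe_dsmcad is a hypothetical name. Note that Solaris ps reports at most the first 80 characters of a command line, so very long paths may not match.

```shell
#!/bin/sh
# Hypothetical replacement for the 'probe' branch of the script above.
# $1 is whatever Probe_command passes: the full dsm.opt path or a
# fragment of it such as "webdata".
probe_dsmcad() {
    cfg=$1
    # "[d]smcad" matches "dsmcad" but not grep's own command line, so no
    # "grep -v" filters are needed. "|| true" keeps set -e shells alive
    # when grep finds nothing (it still prints 0).
    count=$(ps -eo args | grep -c -- "[d]smcad.*-optfile=[^ ]*$cfg" || true)
    if [ "$count" -gt 0 ]; then
        return 0
    fi
    echo "Found no dsmcad process for $cfg"
    return 100
}
```

In the script, the probe branch would then reduce to `probe_dsmcad "$DSM_CONFIG"; exit $?`.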
So I hope I've written something that is useful. If anyone has questions,
feel free to contact me.
regards
On Thursday 14 August 2008, 17:07 Markus Mayer wrote:
Hi all,
I've been pulling my hair out on this one for a few days now; even with
support from our Sun partner, we have not come up with a solution.
I have a cluster, Sun Cluster 3.2 on two V445s, with five resource groups,
each containing its own zpool and a number of zones. Each zpool and the zones
are configured as a resource within the group, as is necessary for the
cluster. Each resource group is configured for failover operations. From the
cluster's point of view, everything works as it should.
Enter the desire to make a backup with TSM. Backup services will be run
from the global zone. According to the TSM manual (IBM Tivoli Storage
Manager for UNIX and Linux Backup-Archive Clients Installation and User's
Guide, pages 543-549), we need a separate TSM node on the server for each
shared disk resource in order to back up the shared resources. This is
configured. Each TSM client node will back up only the data on the shared
disks within its resource group.
On the client side, that is, the cluster, we need a simple script that runs
as a resource within the resource group. This script meets the cluster's
requirements, providing the functions start, stop and probe, with exit
values of 0, 100 and 201 depending on circumstances. As required by TSM,
this script resides on shared storage that switches between nodes, in our
case a separate ZFS file system on the zpool. When a failover occurs, the
script should be started (backup service/resource brought online) in the
same way as any other resource within the group is started or brought
online.
Therein lies the problem. How can I define a resource that is a simple
shell script or program, which should then be added to an existing resource
group in cluster? It sounds simple enough, but it would seem it's not
so...
Our Sun partner gave me the following link to follow, which I did.
http://docs.sun.com/app/docs/doc/819-2972/gds-25?a=view
In short, it says to enable SUNW.gds (already done), create a resource group
that will contain the resource and the failover service itself, create a
logical hostname, then create the resource. This is where some confusion
comes in for me.
I already have resource groups defined, one being comms-rg, which contains
two resources, comms-storage-rs and commssuite-zone-rs. From my point of
view, the "backup" resource, named for example comms-backup-rs, should then
go into this resource group. If I try to add a logical hostname to this
resource group, I get an error:
# clreslogicalhostname create -g comms-rg commslhname
clreslogicalhostname: commslhname cannot be mapped to an IP address.
So, as suggested by our Sun partner, I tried adding an IP address for the
logical hostname and putting it in the /etc/inet/hosts file on both nodes.
The result was:
# clreslogicalhostname create -g comms-rg commslhname
clreslogicalhostname: specified hostname(s) cannot be hosted by any adapter on wallaby
clreslogicalhostname: Hostname(s): commslhname
getent returned valid information on both nodes.
# getent hosts 172.16.241.54
172.16.241.54 commslhname commslhname.nowhere.nothing.invalid
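The second error most likely means that 172.16.241.54 is not on a subnet served by a public network adapter (an IPMP group) on node wallaby; an entry in /etc/inet/hosts makes the name resolvable but does not change that. This is an assumption, as the thread never pins down the cause. The hypothetical helpers ip_to_int and ip_in_subnet below check whether an IPv4 address falls inside a given network and netmask, which can then be compared with the interface addresses shown by ifconfig -a on each node.

```shell
#!/bin/sh
# Hypothetical helpers for checking IPv4 subnet membership in plain sh/ksh.

# Convert a dotted quad such as 172.16.241.54 into a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split into four octets (positional params are local)
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if address $1 lies in network $2 with netmask $3.
ip_in_subnet() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}
```

For example, `ip_in_subnet 172.16.241.54 172.16.241.0 255.255.255.0` succeeds. Solaris ifconfig -a prints the netmask in hexadecimal, so netmask ffffff00 corresponds to 255.255.255.0.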
OK, so it seems that I have to define a new resource group especially for
this one resource, which contains one simple script. That makes no sense to
me, because I already have a resource group into which the resource should
go. Why can't I add this new script as a resource in an existing resource
group? Another problem is that I would need to define an additional
resource group for every resource group that I have, currently five,
meaning a total of ten resource groups, all of which need affinities in
order to fail over and start the resources correctly. Additionally,
according to the manual, the backup resource needs network resources and a
port list defined, although it only needs to start a shell script.
It seems much more complicated than it should be. I can't find anything
else in the documentation about this, but it has to be simple; I can't
imagine that it could be so complicated....
The alternative, should such a resource definition not be possible, is to
have a TSM client in every zone, and one in the global zone of each node.
This is however not what I'm looking for.
Could it be that I'm barking up the wrong tree here? Does anyone have any
suggestions as to how I can achieve this?
Thanks
Markus
_______________________________________________
sunmanagers mailing list
sunmanagers@xxxxxxxxxxxxxxx
http://www.sunmanagers.org/mailman/listinfo/sunmanagers