1. Do I have to run "hastart" on each node to manually
start up VCS?
Yes. If you need to manually start up the cluster, you have to run "hastart" on
each node. There is no single command that starts VCS on every node. If you only
execute "hastart" on one node, VCS will come up, but it probably won't bring up
your Service Groups. VCS has to probe each machine a Service Group can
online on, and it can't do that if VCS isn't running on one of those nodes.
2. How do I start VCS when one node is down?
Normally, VCS has to seed all the nodes in your cluster before becoming
fully operational. VCS may actually start up, but none of the commands will work.
If one of your nodes is down, and you need to start VCS on the other nodes,
then you must manually seed the other node(s). Run this command on each node
that is up:
/sbin/gabconfig -cx
VCS should then start up. You may have to online some Service Groups
manually:
hagrp -online {Service Group} -sys {hostname}
If the gabconfig command doesn't work, reconfigure GAB and LLT and try again.
Do the following on *both* nodes:
1) Make sure had and hashadow are not in the process table. Check
"ps -ef" and kill them if you have to.
2) /sbin/gabconfig -U
3) /sbin/lltconfig -U (answer yes)
4) /sbin/lltconfig -c
5) /sbin/gabconfig -cx
6) hastart
VCS should then start up on each node that is up.
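To verify that GAB is seeded and VCS has joined, check the GAB port
memberships (port "a" is GAB itself, port "h" is had, the VCS engine):
/sbin/gabconfig -a
Each running node should show up in the membership for both ports.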
3. How can I shut down VCS without shutting down my
applications?
Use the "hastop -force" option.
(1) hastop -all -force (shuts down VCS on all nodes)
(2) hastop -local -force (shuts down VCS on the local node only)
WARNING: Always make the cluster read-only before doing a force shutdown.
haconf -dump -makero
If you force stop a cluster while it is in read-write mode, you will get a
stale configuration error upon VCS restart.
To see if your cluster is in read-only mode, run "haclus -display". The
"ReadOnly" attribute should have a value of 1. If not, then run
"haconf -dump -makero" to make it read-only.
If you start VCS and get a stale configuration error, you have two main
choices.
(1) Run "hastop -all -force", check main.cf on your nodes for any
inconsistencies, remove any .stale files in /etc/VRTSvcs/conf/config/,
and restart VCS.
If you see no .stale files, then your main.cf's might have a syntax error.
Execute this command to see where the syntax errors are:
cd /etc/VRTSvcs/conf/config/
hacf -verify .
(2) Continue to start VCS by running "hasys -force {hostname}". Pick the
hostname of the machine you want VCS to load the main.cf from.
Usually you would choose the second option if the cluster is not in production
or if you're confident the main.cf on the specified machine is good enough.
4. How do I failover a Service Group?
You can manually failover a Service Group two ways:
(1) hagrp -switch {Service Group} -to {target node}
(2) hagrp -offline {Service Group} -sys {current node}
hagrp -online {Service Group} -sys {target node}
The second way simply gives you more control. After you offline the Group,
you can online it anywhere whenever you want. The first way is an
immediate, "hands-off" failover.
VCS will automatically fail over a Group if you do any of the following:
(1) Execute "init 6" or "shutdown -ry 0"
(2) Execute "reboot"
(3) Switch off the machine's power
(4) Pull out all heartbeat cables simultaneously
(5) Cause a "fault", i.e. manually shut down some service or resource in
your Service Group
(6) Panic the machine.
5. Is offlining a Service Group the same thing as failing
it over? What does offline mean?
No. When you offline a Group, you are shutting down all the services in the
group, but you are not onlining them anywhere else. You can online the Group
again at any time.
Offline for a Group means the services in that group are unavailable to any
node in the cluster.
A failover is when a Group offlines from one node and onlines on another.
6. What's the difference between Agents and Resources?
Agents are VCS processes that control and monitor the Resources. Resources
are all those objects in your Service Group, and they all require Agents.
For example, all your filesystems are resources, and they all use the Mount
Agent. Your virtual IP address is a resource, and it uses the IP or
IPMultiNIC Agent. The Veritas Volume Manager Disk Group is a resource, and
it uses the DiskGroup Agent. Some Agents, such as the Oracle Enterprise
Agent, have to be purchased separately.
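As a rough sketch, a filesystem resource in main.cf might look like this
(the resource name, mount point, and volume path below are made up):
Mount export_mnt (
    MountPoint = "/export"
    BlockDevice = "/dev/vx/dsk/datadg/exportvol"
    FSType = vxfs
    FsckOpt = "-y"
    )
The Mount Agent uses these attributes to mount, unmount, and monitor the
filesystem.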
7. Does each Service Group have its own IP and DiskGroup?
Usually, Service Groups have their own IP and DiskGroup resources, but
this is not technically required; it all depends on your applications.
All a Service Group really needs is some resource.
Most resources, however, cannot be shared across Service Groups. That is why
Service Groups usually do have their own IP's, DiskGroup, filesystems, etc.
Groups can share certain resources like NIC and MultiNICA, although the
resource has a unique name in each Group.
8. Should I put everything in one Service Group, or should
I have more than one Service Group?
Usually people try to separate different applications as much as possible.
Service Groups serve as logical divisions for your applications. You
don't want a failure of one application to cause a failover of all your
applications if it's unnecessary. If all your applications are using the
same Group, then a failure in that Group can cause all your applications
to fail.
The goal of high availability is to minimize single points of failure. That
is why separate Service Groups are usually recommended for separate
applications in a cluster.
9. How do I add another Service Group to main.cf?
You can add a Service Group using the VCS GUI, using VCS commands, or editing
the main.cf file.
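For example, to add a Group from the command line (group and node names
here are made up; SystemList defines where the Group can run, and
AutoStartList where it starts by default):
haconf -makerw
hagrp -add websg
hagrp -modify websg SystemList nodeA 0 nodeB 1
hagrp -modify websg AutoStartList nodeA
haconf -dump -makero
You can then add resources to the new Group with "hares -add".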
10. Can I use vi to edit main.cf?
You can edit main.cf only when VCS is shut down. VCS does not read main.cf
while it is running; it only reads from the configuration it has in memory.
If you edit main.cf while VCS is running, VCS will not pick up your updates,
and it will overwrite main.cf with the configuration it has in memory.
You can always edit a copy of main.cf, shut down VCS, move the new
main.cf into /etc/VRTSvcs/conf/config/, and restart VCS.
Here's an example...
1) haconf -dump
2) cp /etc/VRTSvcs/conf/config/main.cf /tmp/main.cf
3) vi /tmp/main.cf
4) haconf -dump -makero
5) hastop -all -force
6) cp /tmp/main.cf /etc/VRTSvcs/conf/config/
7) hastart
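Before restarting VCS in step 7, it's a good idea to verify the edited file:
cd /etc/VRTSvcs/conf/config/
hacf -verify .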
11. Can different Resources have the same name if they
are in different Service Groups?
No, two resources in the cluster cannot have the same name, even if they
are in different Service Groups. Resource names must be unique in the
entire cluster.
12. What does autodisable mean? Why did VCS autodisable
my Service Group?
VCS does not allow failovers or online operation of a Service Group if it is
autodisabled.
VCS has to autodisable a Service Group when VCS on a particular node
shuts down *but* the GAB heartbeat is still running. Once GAB is unloaded,
e.g. when the node actually shuts down to PROM level, reboots, or powers off,
VCS on the other nodes can automatically clear the autodisable flag.
While a Group is autodisabled, VCS won't allow that Group
to fail over or be onlined anywhere within the cluster. This is a safety
feature to protect against "split brain", where more than one machine is
using the same resources, like the same filesystems and virtual IP, at the
same time.
Once a node leaves the cluster, VCS has to assume that someone could still
control that machine before it goes down, i.e. that theoretically someone
could log in to that machine and manually start up services. That is why
VCS autodisables the Group within the existing cluster. But VCS does let you
clear the autodisable flag yourself. Once you're sure that the node that left
the cluster doesn't have any services running, you can clear the autodisable
flag with this command:
hagrp -autoenable {name of Group} -sys {name of node}
Repeat the command for each Group that has been autodisabled.
The Groups that are autodisabled and the nodes they are autodisabled
for can be found with this command:
hastatus -sum
Most of the time VCS autodisables a Group for a short period of time and
then clears the autodisable flag without you knowing it. If the node that
leaves the cluster actually shuts down, the GAB module is also unloaded,
and VCS running on the other nodes will assume that node has shutdown. VCS
will then automatically clear the autodisable flags for you.
There's one catch...by default VCS on the running cluster requires GAB to be
unloaded within 60 seconds after VCS on that node is stopped. After 60 seconds,
if GAB still isn't unloaded, VCS on the existing cluster will assume that node
isn't shutting down, and will keep the autodisable flags until
the administrator clears them.
To increase the 60-second window to 120 seconds for a node, make the
configuration writable and run (note that "hasys -modify" takes the node
name):
haconf -makerw
hasys -modify {hostname} ShutdownTimeout 120
haconf -dump -makero
For large systems that take a long time to shut down, it is a good idea to
increase ShutdownTimeout.
Read the VCS User's Guide for more information on autodisable.
13. Does VCS require license keys to run? Did VCS 1.3
require license keys?
The latest versions of VCS require license keys to run. VCS 1.3 and before
did not.
14. Do I need to create the same VxVM DiskGroup on both
machines?
No, when you create a Volume Manager DiskGroup, just pick one machine
to create the DiskGroup on. You do not create the same DiskGroup on
both nodes.
After you create a DiskGroup, you can add it as a Resource to your VCS
configuration. VCS will then use VxVM commands to import and deport
the DiskGroup between the systems during Service Group online, offline,
or failover.
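As a sketch, the resource in main.cf can be as simple as this (the resource
and diskgroup names are made up):
DiskGroup datadg_res (
    DiskGroup = datadg
    )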
15. Can I run different versions of VCS in the same cluster?
No, absolutely not! Different versions of VCS, and even different patch
levels of VCS, cannot run at the same time in the same cluster.
Therefore, when you install VCS patches, you must install them on *all* nodes
at the same time!
The cluster will have to be partially or completely shut down during upgrades
or patching. Of course, you can shut down VCS without shutting down your
services.
16. Does VCS require "shared disks"?
No, VCS does not require that your nodes be connected to shared disks.
However, most people like to have storage that can be deported and imported
during failover. If your applications do not need this, then you do not need
shared storage. The VCS installation and setup will not ask if you have
shared storage. This is great for people who don't have the shared storage
ready, but still want to try out or test VCS.
17. What is the difference between freezing the system
and freezing a Group? Which is better for maintenance?
Freezing a system prevents VCS from onlining a Service Group onto that
system. This is usually done when a machine in the cluster is unstable
or undergoing maintenance, and you don't want VCS to try to failover a
Group to that machine. However, if a Group is already online on a frozen
system, VCS can still offline that Group.
Freezing a Service Group is the most common practice when maintenance
needs to be done on the nodes while VCS is still running. When you freeze
a Group, VCS will take no action on that Group or its Resources no matter
what happens to the resources. That means you can take down your services,
like IP's, filesystems, databases and applications, and VCS won't do anything.
VCS won't offline the Group, or offline any resources. VCS also won't
online anything in that Group, and it won't online that Group anywhere. This
basically "locks" the Group on a node, or prevents it from onlining until you
unfreeze the Group. One thing that sometimes surprises people is that VCS will
still monitor a frozen Group and its resources. So, during maintenance, VCS
might tell you that your resources have faulted, or that the Group is offline.
If you manually bring everything back up after maintenance, VCS monitoring
should refresh and see that all your resources and the Group are online again.
This is a good thing, since it is best to know if VCS thinks your Group and
its resources are online before you unfreeze the Group.
To freeze a Group:
haconf -makerw
hagrp -freeze {Group name} -persistent
haconf -dump -makero
To unfreeze a Group:
haconf -makerw
hagrp -unfreeze {Group name} -persistent
haconf -dump -makero
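To freeze a system, the pattern is the same:
haconf -makerw
hasys -freeze -persistent {hostname}
haconf -dump -makero
To unfreeze the system:
haconf -makerw
hasys -unfreeze -persistent {hostname}
haconf -dump -makero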
18. I just added a DiskGroup to VCS, and VCS offlined
everything. Why?
The diskgroup you added was probably already imported manually or through
VMSA (the Volume Manager GUI), without the "-t" option:
vxdg import {disk group}
VCS imports diskgroups using "-t", which sets the diskgroup's noautoimport
flag to "on".
vxdg -t import {disk group}
So, when you added the diskgroup to VCS, VCS detected the new diskgroup was
imported outside of VCS because the noautoimport flag was set to "off". This
is considered a violation, and the DiskGroup Agent monitor script will then
offline the entire Service Group. This is a precaution to prevent split brain.
You can see a diskgroup's noautoimport flag by doing:
vxprint -at {disk group}
If you've imported a new diskgroup, and have not yet added it to VCS, you
can deport the diskgroup first, and then add it to VCS. You do not need to
import a diskgroup to add it to VCS.
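For example (with a made-up diskgroup name):
vxdg deport datadg
Then add the DiskGroup resource to VCS and let VCS import the diskgroup with
"-t" when the Service Group onlines.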
19. I need to play with a resource inside a Service Group,
but I don't want to cause the Group to fault. What do I need to do?
You should first make the resource non-critical.
hares -modify {resource name} Critical 0
By making the resource non-critical, VCS will not offline the Group if it
thinks this resource faulted.
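Note that modifying a resource requires the configuration to be in
read-write mode, so the full sequence looks like this:
haconf -makerw
hares -modify {resource name} Critical 0
haconf -dump -makero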
You must also make any Parents of this resource non-critical. Run this to
check if there are any parents for this resource:
hares -dep
If you don't want VCS to monitor your resource, you can disable monitoring by
doing this:
hares -modify {resource name} Enabled 0
This prevents VCS from monitoring the state of this resource, so it won't
fault the Group no matter what you do to the resource, even if it has
Critical=1.
If the Group is in production, you might want to freeze the Group just to
be safe.
20. After someone started up some process on the other node,
VCS reports a "Concurrency Violation", and tries to offline that process.
What is this, and is it bad?
A Concurrency Violation is reported when the Agent for a resource detects
that the same resource or process is running on another node. The Agent will
then try to run the offline script for that resource on the other node. This
is to prevent split brain.
If the Agent cannot offline the process on the other node, then you may
want to manually offline the process or change the Agent's monitoring.
Sometimes a Concurrency Violation is more or less a "false alarm", because
a lot depends on how good your monitoring is. You need to find out exactly
what your Agent is monitoring. If it is an Application Agent resource, look
at the MonitorProgram script, or look at MonitorProcesses. If it looks like
the Agent is only monitoring something very superficial, then change the
monitoring. If you are changing the monitoring in production, you may want
to freeze the Service Group or make the resource non-Critical.
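For example, an Application resource with process-based monitoring might
look like this in main.cf (all paths and names here are made up):
Application myapp (
    StartProgram = "/opt/myapp/bin/start"
    StopProgram = "/opt/myapp/bin/stop"
    MonitorProcesses = { "/opt/myapp/bin/myappd" }
    )
If MonitorProcesses only matches some trivial wrapper process, the Agent can
report a Concurrency Violation (or miss one) when it shouldn't; monitoring
the actual daemon is usually more reliable.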