Linux-HA and Pacemaker have supported managing Xen DomUs as resources for a long time; this allows the cluster to start, stop, monitor, and migrate the guests, providing high-availability through fail-over for arbitrary virtualized services, even including some monitoring hooks into the guest (see XenResource).
Running a cluster within the virtual guests is desirable as well, however. It lets you:
- gain access to clustered filesystems such as OCFS2 within the DomUs,
- provide fail-over capabilities and service monitoring within the guests.
(This is straightforward, but requires very clear terminology to avoid confusion. "DomU cluster" refers to the cluster running within the virtual machines. "Dom0 cluster" refers to the cluster running at the physical layer.)
However, running a Linux-HA or Pacemaker cluster within the DomU faces some special challenges.
While this has long been supported for testing, which greatly reduces the hardware requirements for non-production work, production settings have additional requirements:
- The virtual machines must be distributed across several physical nodes.
- A clustering solution may be running in Dom0, moving the guests around as required for fail-over, load-balancing, and administrative requests.
- The Dom0 cluster must be aware when the DomU cluster fences a DomU node, lest it treat this as a failure and restart the DomU, causing data corruption.
- Reliable STONITH mechanisms are needed for error recovery and to protect data integrity; the guest must be disabled, and this must be acknowledged.
It is the combination of these points that causes some non-obvious issues. If the host of a guest becomes unreachable, the DomU cluster can no longer fence that guest successfully; this split-brain scenario needs to be escalated to the Dom0 layer.
The solution is to integrate the two layers of clusters, in particular with regard to STONITH:
- Instead of calling out to a physical STONITH device, the DomU cluster instructs the Dom0 cluster that it wants a specific node restarted or turned off.
- The Dom0 cluster knows - at all times, even with migration - on which physical node the DomU is currently hosted, and can comply with this request.
- Further, through this integration, Dom0 is made aware of the DomU intents, and its own monitoring and recovery strategy can take this into account.
- In case of a DomU stop failure (due to a crashed hypervisor), split-brain, or physical node failure, the Dom0 cluster can properly interact with the physical STONITH devices to facilitate recovery.
- If you use the Dom0 cluster to stop and start DomUs, the DomU nodes will cleanly sign out of the DomU cluster and not trigger a fencing operation.
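The escalation path described in the points above can be illustrated with a small toy model. This is only a sketch of the decision logic, not how the Dom0 cluster or the STONITH plugin is actually implemented; the function name `handle_fence_request`, the `placement` mapping, and the host and guest names are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def handle_fence_request(
    domu: str,
    placement: Dict[str, str],
    responsive_hosts: Set[str],
) -> Tuple[str, str, List[str]]:
    """Toy model of the Dom0 cluster reacting to a DomU fencing request.

    placement        -- hypothetical map: DomU name -> physical node hosting it
    responsive_hosts -- physical nodes on which the guest can still be stopped
    Returns (action, physical_node, affected_guests).
    """
    # The Dom0 cluster tracks where each guest runs, even after migration.
    host = placement[domu]

    if host in responsive_hosts:
        # Normal case: stop only the requested guest on its current host.
        return ("destroy-guest", host, [domu])

    # Escalation: the guest cannot be stopped (stop failure, split-brain, or
    # node failure), so the physical node itself is fenced. Every guest
    # placed on that node is affected.
    affected = sorted(g for g, h in placement.items() if h == host)
    return ("stonith-host", host, affected)


placement = {"domu-a": "phys1", "domu-b": "phys1", "domu-c": "phys2"}

# Host is healthy: only the requested guest is destroyed.
normal = handle_fence_request("domu-a", placement, {"phys1", "phys2"})

# Host is unreachable: the physical node is fenced, taking domu-b with it.
escalated = handle_fence_request("domu-a", placement, {"phys2"})
```

In the escalated case, the model fences the whole physical node, so guests belonging to other DomU clusters on that node are affected as well.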
Only this integration delivers the maximum reliability.
- Both clusters must be fully configured Linux-HA CRM/Pacemaker clusters.
- The Dom0 cluster must have physical STONITH configured.
- The Dom0 cluster must have redundant communication media.
- Configure the Dom0 cluster like you would configure any other production-ready system, paying particular attention to the STONITH setup.
- Configure the Xen DomUs as XenResource resources.
- Special requirement: The resource id must match the hostname (uname) of the DomU within the DomU cluster!
- You can set the meta_attribute allow-migrate as you prefer.
- For fastest recovery, set shutdown_timeout to 0 on the Xen resource. This forces an immediate destroy; as this is an error escalation, this is likely what you want.
- Configure an IP address resource and remember this (Dom0 Cluster IP address).
- Ensure that the DomU nodes are spread over several physical nodes, otherwise you will have no real redundancy.
- Configure the DomU cluster like any other cluster.
- Virtualized environments are slightly more timing sensitive than physical systems, especially under load or during migration. It is strongly recommended to set the deadtime higher than 5 seconds.
- Use the external/xen0-ha STONITH plugin:
- Set the dom0_cluster_ip to the IP address configured in the Dom0 cluster.
- Set the hostlist to all nodes within the cluster. Again, these hostnames/unames must match the ids of the XenResource objects configured in Dom0!
- Ensure that all DomU nodes can log in to the Dom0 nodes as root via ssh.
- Make sure the clusters do not use the same port numbers and/or mcast address. Otherwise, your logs will be flooded with authentication errors, or worse, if autojoin is enabled, you will have the two layers join into a big cluster and this will utterly and completely fail.
- Several DomU clusters sharing the same Dom0 cluster are not a problem per se. However, in case one of the guests becomes fatally stuck and no longer responds to xm destroy, the other guests on the physical node will also be affected and moved elsewhere. Since this indicates a bug in the hypervisor, this is beneficial, but it means the clusters are not completely independent.
- In case of doubt, please ask on the linux-ha list (or the Linux support vendor of your choice) whether your configuration is fine. The intricacies are not always obvious.
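Several of the configuration rules above are easy to get wrong by hand: the DomU hostnames must match the Xen resource ids in Dom0, the DomU nodes should be spread over several physical nodes, and the two layers must not share their communication settings. The following is a minimal sketch of a sanity check over such settings. It does not read a real cluster configuration; the function name `check_layered_config`, its parameters, and the example values are assumptions made for illustration, and you would have to collect the inputs yourself.

```python
from typing import Dict, List, Tuple


def check_layered_config(
    dom0_resource_ids: List[str],
    domu_hostlist: List[str],
    placement: Dict[str, str],
    dom0_comm: Tuple[str, int],
    domu_comm: Tuple[str, int],
) -> List[str]:
    """Return a list of human-readable problems (empty if none were found).

    dom0_resource_ids -- ids of the Xen resources configured in the Dom0 cluster
    domu_hostlist     -- hostnames/unames of the DomU cluster nodes
    placement         -- hypothetical map: DomU name -> physical node hosting it
    dom0_comm/domu_comm -- (mcast address, port) used by each cluster
    """
    problems = []

    # Each DomU uname must match the id of a Xen resource in Dom0.
    for name in sorted(set(domu_hostlist) - set(dom0_resource_ids)):
        problems.append(f"no Dom0 Xen resource with id {name!r}")

    # Without at least two physical nodes there is no real redundancy.
    hosts = {placement[n] for n in domu_hostlist if n in placement}
    if len(hosts) < 2:
        problems.append("DomU nodes are not spread over several physical nodes")

    # Conservative reading of the rule: flag a shared address OR a shared port.
    if dom0_comm[0] == domu_comm[0]:
        problems.append("both clusters use the same mcast address")
    if dom0_comm[1] == domu_comm[1]:
        problems.append("both clusters use the same port")

    return problems


good = check_layered_config(
    ["vm1", "vm2"], ["vm1", "vm2"],
    {"vm1": "phys1", "vm2": "phys2"},
    ("239.0.0.1", 694), ("239.0.0.2", 695),
)
bad = check_layered_config(
    ["vm1", "vm2"], ["vm1", "vm3"],
    {"vm1": "phys1", "vm3": "phys1"},
    ("239.0.0.1", 694), ("239.0.0.1", 694),
)
for problem in bad:
    print(problem)
```

A check like this only catches mismatches in the values you feed it; it is no substitute for testing fencing and fail-over on the real clusters.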
Document if, and how, mixed P/V environments are supportable.