
Release 2[1] has a number of components which implement its high-availability cluster management capabilities. This page provides an overview of the architecture into which these components fit.
The components are as follows:
heartbeat - strongly authenticated communications module
CRM - Cluster Resource Manager
CCM - Strongly Connected Consensus Cluster Membership
LRM - Local Resource Manager
Stonith Daemon - provides node reset services
logd - non-blocking logging daemon
apphbd - application level watchdog timer service
Recovery Manager - application recovery service
CTS - Cluster Testing System - cluster stress tests
Special Glib notes
Function: startup, shutdown, strongly authenticated communication.
Provides locally-ordered multicast messaging over basically any media - IP-based or not.
Currently talks over the following media plugin types:
unicast UDP[2] ipv4
broadcast UDP[3] ipv4
multicast UDP[4] ipv4
serial[5] links (non-IP) - great for use with firewalls, etc. that play with iptables.
openais[6] - uses OpenAIS[7]' evs communication layer as a communication medium.
ping[8] - pings an individual router - allows us to treat it as a pseudo-member.
ping_group[9] - similar to ping, except that the group is considered up if any member of the group is up
hbaping[10] - "pings" a fiber channel disk for connectivity
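The difference between ping and ping_group can be sketched in a few lines of Python. This is only an illustration of the rule described above, not Heartbeat's plugin code; the function names and the `probe_results` mapping are hypothetical stand-ins for real ping results.

```python
def ping_is_up(reachable, address):
    """A single ping pseudo-member is up only if its one address answers."""
    return reachable.get(address, False)


def ping_group_is_up(reachable, addresses):
    """A ping_group is up if any one of its addresses answers."""
    return any(reachable.get(addr, False) for addr in addresses)


# Hypothetical probe results: only one of the two routers answers.
probe_results = {"10.0.0.1": False, "10.0.0.2": True}
group_up = ping_group_is_up(probe_results, ["10.0.0.1", "10.0.0.2"])
router_up = ping_is_up(probe_results, "10.0.0.1")
```

Here the group counts as up even though the first router, treated on its own, counts as down.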
Heartbeat can detect node failure reliably in less than half a second. With the low-latency patches (and perhaps a bug fix or so), that time could be lowered significantly.
It will register with the system watchdog timer if configured to do so.
The heartbeat layer has an API which provides the following classes of services:
The Cluster Resource Manager[11] (CRM) is the brains of Linux-HA. It maintains the resource configuration, decides what resources should run where, and works out how to move from the current state into the state where they are running where they ought to be. In addition, it supervises the LRM in accomplishing all these things. The CRM interacts with every component in the system:
It uses heartbeat for communication
It receives membership updates from the CCM.
It directs the work of and receives notifications from the LRM.
It tells the Stonith Daemon when and what to reset
In effect, the PE, TE, and CIB can be viewed as components of the CRM.
The primary goal of the policy engine[12] is to compute a transition graph[13] from the current state of the world. This transition graph takes into account the current location and state of resources, availability of nodes and current static configuration (aka the CurrentClusterState[14]).
The Transition Engine effectively executes the transition graph[13] produced by the PE - to make its dreams a reality. That is, it takes a computed next state for the cluster and the list of actions and attempts to reach it by instructing the LRM on remote nodes to start and stop resources.
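As a rough illustration of what executing a transition graph involves, the sketch below runs a set of actions in dependency order and stops at the first failure. It is not the Transition Engine's actual algorithm or data format: the graph layout, the action names and the `run_transition` function are assumptions made for this example, and it relies on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def run_transition(graph, execute):
    """Run every action after all of the actions it depends on.

    graph maps each action to the set of actions that must finish first.
    execute is called once per action; returning False aborts the transition.
    Returns the list of actions that completed.
    """
    done = []
    for action in TopologicalSorter(graph).static_order():
        if not execute(action):
            break
        done.append(action)
    return done


# Hypothetical graph: move a web server and its IP address from node1 to node2.
graph = {
    "stop webserver on node1": set(),
    "stop ip on node1": {"stop webserver on node1"},
    "start ip on node2": {"stop ip on node1"},
    "start webserver on node2": {"start ip on node2"},
}
completed = run_transition(graph, execute=lambda action: True)
```

In a real cluster the `execute` callback would correspond to asking the LRM on the relevant node to perform the action; here it is a placeholder that always succeeds.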
The CIB[15] is an automatically replicated repository for cluster resource and node information as seen by the CRM. This information includes static information such as dependencies, and more dynamic information such as what resources are running where, and what their states are.
All the information in the CIB is represented in XML for maximum usability. We have an annotated DTD[16] which defines all the information we currently manage in the CIB.
Because of this, many of the features of the CRM can be understood by reading the annotated DTD.
Provides strongly connected consensus cluster membership services. Ensures that every node in a computed membership can talk to every other node in this same membership. Implements both an OCF draft membership API, and the SAF AIS membership API. Typically it computes membership in sub-second time.
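The "strongly connected" property can be stated as a simple check: a membership is acceptable only if every pair of its nodes can reach each other. The sketch below tests that property for a proposed membership. It does not show how the CCM computes a membership or reaches consensus; the function name and the `links` set of directed (sender, receiver) pairs are hypothetical.

```python
from itertools import combinations


def is_strongly_connected(members, can_talk):
    """True if every pair of members can talk to each other in both directions.

    can_talk is a set of (sender, receiver) pairs describing working links.
    """
    return all(
        (a, b) in can_talk and (b, a) in can_talk
        for a, b in combinations(members, 2)
    )


# Hypothetical connectivity: node3 can exchange messages with node1 only.
links = {
    ("node1", "node2"), ("node2", "node1"),
    ("node1", "node3"), ("node3", "node1"),
}
all_three_ok = is_strongly_connected(["node1", "node2", "node3"], links)
pair_ok = is_strongly_connected(["node1", "node2"], links)
```

With these links, the three-node set fails the check because node2 and node3 cannot talk, while the pair node1 and node2 passes.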
The Local Resource Manager[17] is basically a resource agent abstraction. It starts, stops and monitors resources as directed by the CRM.
It has a plugin architecture, and can support many classes of resource agents. The types currently supported include:
be used by the StonithDaemon[18].
Other types are readily added for compatibility with other systems.
The STONITH[19] daemon provides cluster-wide node reset facilities using the improved release 2 stonith API.
The Stonith library includes support for around a dozen types of 'C' STONITH plugins and native support for script-based plugins - allowing scripts in any scripting language to be treated as full-fledged STONITH plugins.
The Stonith Daemon locks itself into memory, and 'C' based plugins are pre-loaded by the stonith daemon so that no disk I/O is required for a STONITH operation using 'C' methods.
The Stonith Daemon provides full support for asymmetric STONITH configurations - allowing for the possibility that certain STONITH devices may be accessible from only a subset of the nodes in the cluster.
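To make the asymmetric case concrete, the sketch below picks a live node that can reach a device capable of resetting the failed node. It is a simplified illustration, not the Stonith Daemon's selection logic or API; the function, the two mappings and the device names are all invented for this example.

```python
def pick_stonith_route(victim, device_access, device_targets, live_nodes):
    """Return (node, device) able to reset victim, or None if nobody can.

    device_access:  device name -> set of nodes that can reach the device
    device_targets: device name -> set of nodes the device can reset
    """
    for device in sorted(device_targets):
        if victim not in device_targets[device]:
            continue
        for node in sorted(device_access.get(device, ())):
            # The victim itself cannot be trusted to perform its own reset.
            if node in live_nodes and node != victim:
                return node, device
    return None


# Hypothetical layout: each power switch is reachable from only some nodes.
device_access = {"pdu-a": {"node1"}, "pdu-b": {"node2", "node3"}}
device_targets = {"pdu-a": {"node2", "node3"}, "pdu-b": {"node1"}}

route = pick_stonith_route("node1", device_access, device_targets,
                           live_nodes={"node2", "node3"})
no_route = pick_stonith_route("node3", device_access, device_targets,
                              live_nodes={"node2"})
```

In the second call no route exists, because the only device that can reset node3 is reachable solely from node1, which is not among the live nodes.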
Note that it is not currently intended that the Stonith Daemon be used for providing resource-granular fencing. Current thinking regarding resource-granular fencing calls for such fencing to be done by clone resource agents. The resources which need fencing will be dependent on these other resources. Clone resource agents are notified when their various peer resources start and stop.
Can log to syslog or files or both. logd never blocks.
If output falls too far behind, messages are dropped in preference to blocking. The count of lost messages is printed the next time messages can be output. Queue sizes are controllable per application, and overall.
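The drop-instead-of-block behaviour can be illustrated with a bounded queue that counts what it discards. This is a minimal sketch under assumed names (`DroppingLogQueue`, `max_messages`), not logd's implementation, and the wording of the loss report is invented.

```python
from collections import deque


class DroppingLogQueue:
    """Bounded message queue that drops new messages rather than blocking."""

    def __init__(self, max_messages):
        self.max_messages = max_messages
        self.pending = deque()
        self.lost = 0

    def log(self, message):
        """Queue a message without blocking; drop it if the queue is full."""
        if len(self.pending) >= self.max_messages:
            self.lost += 1
            return False
        self.pending.append(message)
        return True

    def flush(self, write):
        """Report any lost messages first, then write everything queued."""
        if self.lost:
            write(f"{self.lost} messages were dropped")
            self.lost = 0
        while self.pending:
            write(self.pending.popleft())


queue = DroppingLogQueue(max_messages=2)
for text in ("first", "second", "third"):
    queue.log(text)

output = []
queue.flush(output.append)
```

The caller of `log` never waits: when the queue is full the message is counted and discarded, and the count appears in the output on the next flush.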
The application heartbeat daemon is a general service which provides watchdog timer facilities for individual HA-aware applications. When applications fail to check in with it in their prescribed time, interested parties are notified and (presumably) recovery actions taken. This daemon is as simple as we can make it, so it can be the most reliable component in the system. Many Linux-HA system components tie into it, but it is not commonly enabled in the field at this writing. It will register with the system watchdog timer if requested to do so.
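The check-in idea can be sketched as follows. This is not the apphbd API: the class, its method names and the use of explicit timestamps (passed in so the example is deterministic) are assumptions for illustration.

```python
class AppWatchdog:
    """Tracks applications that promise to check in within a fixed interval."""

    def __init__(self):
        # app name -> (allowed interval in seconds, time of last check-in)
        self.clients = {}

    def register(self, app, interval, now):
        self.clients[app] = (interval, now)

    def check_in(self, app, now):
        interval, _ = self.clients[app]
        self.clients[app] = (interval, now)

    def overdue(self, now):
        """Return the sorted names of applications that missed their deadline."""
        return sorted(
            app
            for app, (interval, last) in self.clients.items()
            if now - last > interval
        )


watchdog = AppWatchdog()
watchdog.register("db", interval=2.0, now=0.0)
watchdog.register("web", interval=5.0, now=0.0)
watchdog.check_in("db", now=1.5)

late_at_3 = watchdog.overdue(now=3.0)
late_at_4 = watchdog.overdue(now=4.0)
```

At time 4.0 the hypothetical "db" application is 2.5 seconds past its last check-in, exceeding its 2-second interval, so it would be reported to interested parties.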
The recovery manager daemon is notified by apphbd when a process fails to heartbeat or exits unexpectedly. It then takes actions to (kill and) restart the application.
A key part of our success in implementation comes from having a consistent, flexible, reliable and general infrastructure underneath all the major components.
This infrastructure has several key elements:
Use of Glib mainloop[20] as our uniform dispatching (scheduling) and event processing method
PILS[21] provides a very general plugin loading system which is used extensively in Linux-HA.
This allows for great flexibility and power in the system, while minimizing the size of the running system. This tends to improve the architecture of those subsystems which use plugins. In addition, their power is increased, while minimizing the resource usage of the Linux-HA system on the host servers.
An unexpected benefit of plugins is that almost all the contributions from non-core members come in the area of plugins.
Plugins are currently used in the following areas: communication, authentication, stonith, resource agents, compression, apphbd notification methods.
All interprocess communication is performed using a very general IPC library which provides non-blocking access to IPC using a flexible queueing strategy, and includes integrated flow control. This IPC API does not require sockets, but the currently available implementations use UNIX (Local) Domain sockets.
This API also includes built-in authentication and authorization of peer processes, and is portable to most POSIX-like OSes.
Although use of mainloop with these APIs is not required, we provide simple and convenient integration with mainloop.
The Cluster plumbing library is a collection of very useful functions which provide a variety of services used by many of our main components. A few of the major objects provided by this library include:
Mainloop[20] integration for IPC, plain file descriptors, signals, etc. This means that all these different event sources are managed and dispatched consistently.
There are two kinds of bugs one finds in reports from users - those that one would not expect to find during testing and those which really should have gotten caught during testing. Linux-HA has a very low bug rate overall, and an extremely low number of bugs of the "should have gotten caught" category.
The Cluster testing system (CTS[22]) is the primary cause for these low bug rates.
CTS is an automated cluster testing system which runs random stress tests on the cluster. Although it is in most ways a modest system with what seem like largely straightforward tests, it has proven extremely effective in practice.
Its basic strategy is: beat the software to death. Such testing has sometimes been called Bamm-Bamm[23] testing.
CTS is an example of a system where the whole is more than simply the sum of its parts.
The Linux-HA project makes extensive use of version 2 of the Gnome Glib library, and special use is made of the mainloop[20] event processing structure.
The use of mainloop has made many things much easier and more uniform, and has allowed us (so far) to completely avoid threads and their attendant portability and debugging difficulties.
All communication starts with the heartbeat layer, and every component which communicates with other cluster members does it through the heartbeat layer. In addition, the heartbeat layer provides connectivity information indicating when communication with another node is lost, and when it's restored.
| [1] | http://www.linux-ha.org/NewHeartbeatDesign |
| [2] | http://www.linux-ha.org/ha.cf/UcastDirective |
| [3] | http://www.linux-ha.org/ha.cf/BcastDirective |
| [4] | http://www.linux-ha.org/ha.cf/McastDirective |
| [5] | http://www.linux-ha.org/ha.cf/SerialDirective |
| [6] | http://www.linux-ha.org/ha.cf/OpenaisDirective |
| [7] | http://www.linux-ha.org/OpenAIS |
| [8] | http://www.linux-ha.org/ha.cf/PingDirective |
| [9] | http://www.linux-ha.org/ha.cf/PingGroupDirective |
| [10] | http://www.linux-ha.org/ha.cf/HbapingDirective |
| [11] | http://www.linux-ha.org/ClusterResourceManager |
| [12] | http://www.linux-ha.org/PolicyEngine |
| [13] | http://www.linux-ha.org/TransitionGraph |
| [14] | http://www.linux-ha.org/CurrentClusterState |
| [15] | http://www.linux-ha.org/ClusterInformationBase |
| [16] | http://www.linux-ha.org/ClusterResourceManager/DTD1.0/Annotated |
| [17] | http://www.linux-ha.org/LocalResourceManager |
| [18] | http://www.linux-ha.org/StonithDaemon |
| [19] | http://www.linux-ha.org/STONITH |
| [20] | http://developer.gnome.org/doc/API/2.0/glib/glib-The-Main-Event-Loop.html |
| [21] | http://linux-ha.org/_cache/TechnicalPapers__pils.pdf |
| [22] | http://www.linux-ha.org/CTS |
| [23] | http://en.wikipedia.org/wiki/Bamm_Bamm |
This information provided courtesy of the Linux-HA project at http://linux-ha.org/