Linux-HA Logo

Cluster Resource Management (CRM) Daemon

Overview

The ClusterResourceManagerDaemon co-ordinates the actions of all other ClusterResourceManager[1] subsystems on a given ClusterNode[2]. One of these is elected as the DesignatedCoordinator[3] based on a list of FullyConnected[4] ClusterNodes[5] from the ClusterConcensusMembership[6].

Regular instances of the ClusterResourceManagerDaemon are slaves to the DesignatedCoordinator[3] and effectively end up as glorified routers. Passing messages to and from the DesignatedCoordinator[3] for sub-systems such as the ClusterInformationBase[7] and the LocalResourceManager[8].

The actions of the ClusterResourceManagerDaemon are controlled by a FiniteStateMachine[9] which is described here[10].

Startup

1. Local Startup

Join Protocol

  1. ClusterNode[2] Announces Presence to the DesignatedCoordinator[3]
    This action is initiated upon receiving a DC_Heartbeat[13] from the DesignatedCoordinator[3].

  2. DesignatedCoordinator[3] Verifies Announce Message
    At this step, the DesignatedCoordinator[3] will check if the node has established layer2 membership (Ie. is in the list supplied by the ClusterConcensusMembership[6]) and if not, discards the announcement and aborts the join protocol for that node.

  3. DesignatedCoordinator[3] Issues Invitation to Join the CRM Layer (layer3)
    NOTE: Upon election, the DesignatedCoordinator[3] will also send this invitation to all ClusterNodes[5] with layer2 membership.

  4. ClusterNode[2] Receives Membership Offer

  5. ClusterNode[2] Updates its ClusterInformationBase[7] "status" Entry Locally
    The information being updated is primarily from the LocalResourceManager[8] about what resources are currently running under its control. However, ClusterNode[2]'s copy may contain information about nodes that the DesignatedCoordinator[3] is not aware of yet which much be preserved.

  6. ClusterNode[2] Acknowledges the DesignatedCoordinator[3]
    The ClusterNode[2]'s updated status section is attached to the acknowledgement as data.

  7. DesignatedCoordinator[3] Receives the ClusterNode[2]'s Acknowledgement
    The DesignatedCoordinator[3] uses the status section attached to the acknowedgement to update the ClusterInformationBase[7] but does not yet replicate the changes. Updates are not taken "verbatum" but are completed based on the timestamps of entries contained in the update.

  8. DesignatedCoordinator[3] Updates the ClusterNode[2]'s ClusterInformationBase[7] Entry
    The node's state in the ClusterInformationBase[7] is updated to "active" after which it is able to be considered by the PolicyEngine[14] when allocating resources.

  9. DesignatedCoordinator[3] Distributes the Updated ClusterInformationBase[7]

  10. The PolicyEngine[14] is Invoked by the DesignatedCoordinator[3]
    The PolicyEngine[14] is invoked to make sure existing resources are allocated correctly and determine if new ones are able to be started.

  11. The Join Protocol is Complete

DesignatedCoordinator Election

The election algorithm has been adapted from http://www.cs.indiana.edu/cgi-bin/techreports/TRNNN.cgi?trnum=TR521[15]

Loosely known as the Bully Algorithm, its major points (with some local substitutions) are:

Triggers

  1. The DC Timer
    If any node goes without receiving any DC_Heartbeat[13]s in the interval in which (currently) 4 should have been received, it means that there is no DesignatedCoordinator[3] and an election is required.

  2. Receiving a DC_Heartbeat[13] from another DesignatedCoordinator[3]
    If one DesignatedCoordinator[3] receives a heartbeat from another DesignatedCoordinator[3] (such as after a cluster partition has healed), an election is also triggered.

  3. User Intervention
    Currently there is no facility to manually instigate an election. This may be considered at a later date depending on demand.

Voting

Nodes must satisfy the following criteria in order to be eligible to vote

Votes are sent as broadcast HeartbeatMessages[16] to all other nodes.

Winner Algorithm

The winner of the election in a PerfectWorld[17] is the ClusterNode[2] in the list supplied by the ClusterConcensusMembership[6] with the lowest value of born_on. born_on was chosen as it indicates how long a ClusterNode[2] has been in the cluster for and it is desirable that the DesignatedCoordinator[3] should be placed on a stable ClusterNode[2] as migrating it requires resources and time.

If there are more than one ClusterNode[2] with the same born_on (which is likely), nodes are differentiated by their uuid. uuid was chosen as it is constant to a node and globally unique. The node with the lowest uuid (as determined by strcmp()) is deemed to be the winner.

In a LessThanPerfectWorld[18], the expected winner might be sick or shutting down and thus will not be able to cast their vote and claim the election.

In this case, the winner is the node that, of those that cast their vote, has the lowest born_on and uuid.

Vote Counting

When Node1 counts a vote from Node2, the Winner Algothim is consulted to determine which of the two nodes (the current node or the node casting the vote) would win the election. If Node2 would win, then Node1 has lost the election and returns to a pending state, waiting for contact from the DesignatedCoordinator[3]. However if Node1 would win, then it broadcasts (possibly not for the first time) its own vote. Next, Node1 then checks to see if it is the PerfectWorld[17] election winner. If so, Node1 cancels the election and proceeds to take over the role of DesignatedCoordinator[3]. Otherwise, Node1 must wait for the election expiration timer to go off before concluding it has won the election.

NOTE: At some point the voting process may be augmented to include information sent as part of the voting process (ie. version information or whether the node is currently a DesignatedCoordinator[3]). Should this be the case, then a PerfectWorld[17] election winner cannot be calculated and the winner must always wait for the election expiration timer to go off.

Becoming the DesignatedCoordinator

  1. Start the PolicyEngine[14]

  2. Start the Transitioner[19]

  3. Set Appropriate "Register" Flags
  4. We are now the DesignatedCoordinator[3] Further actions are:

  5. Instigate ClusterNode[2] discovery
    This is achieved by (re)-initiating the join protocol for every node with layer 2 membership. See Join Protocol above for details.

  6. Resource Allocation - Invoking the PolicyEngine[14]
    Once the JoinProtocol[12] has completed for all layer2 nodes, the PolicyEngine[14] is invoked to determine the next stable state of the cluster. We wait for all nodes to join to allow the cluster to stabilize as much as is possible before the expensive process of reallocating resources is initiated. Since not all layer2 nodes may be able to be promoted to layer3, a timeout exists which initiates the resource allocation process.

  7. Reaching the Next Stable State - Invoking the Transitioner

Shutdown

DesignatedCoordinator Only

  1. Stop sending DC_Heartbeats[13]
    This will lead to one of the other nodes triggering an election. Eventually the DC may be able to trigger an election itself. However at the moment it can't do that without voting and if it votes then it is more than likely to win this election for the same reasons it won last time.

  2. Release DC status
  3. Stop the PolicyEngine[14]

  4. Stop the Transitioner[19]

  5. Drop back into the pending state
  6. Wait for a new DesignatedCoordinator[3] to be elected

  7. Complete the join protocol as a slave
  8. Shut down as a slave node (Below)

    NOTE: See Stopping the LastNode for variations on this sequence.

Slave Instances of the ClusterResourceManagerDaemon

  1. Request the DesignatedCoordinator[3] shut us down

  2. Let the DesignatedCoordinator[3] / PolicyEngine[14] / Transitioner[19] do their work

  3. Obey the Transitioner's last command (Ie. shutdown)
  4. Disconnect from the LocalResourceManager[8]

  5. Disconnect from the ClusterConcensusMembership[6]

  6. Disconnect from Heartbeat
  7. Stop the ClusterInformationBase[7] and write it out to disk.

  8. exit(0)

    NOTE: Might require a third expected state "shut_me_down" so that the PE doesnt just STONITH it.

Escalation Process

If any stage of the shutdown process stalls, an escalation process is initiated.
The escalation process is two-phased.

NOTE: Currently only phase 2 is implemented and it is broken as the timers are not working as expected.

Special Corner Cases

Starting the FirstNode

This case is defined as trying to start a ClusterNode[2] when no other ClusterNodes[5] have layer2 membership or better (as determined from consulting the ClusterConcensusMembership[6] list).

Starting the FirstNode[20] will take longer than following nodes for two reasons.

  1. Waiting for ClusterConcensusMembership[6]
    Currently connections to the ClusterConcensusManager[21] will fail (it appears) until it has had a chance to establish a view of the cluster. Ie. timed out all the other nodes. Because of this, it would not be unusual for Heartbeat to respawn the ClusterResourceManagerDaemon a number of times before startup is successful.
    This is not desirable behaviour and is likely to be addressed soon.

  2. Waiting for the DesignatedCoordinator[3]
    The FirstNode[20] currently must wait for the "DesignatedCoordinator[3] timeout" before it realises that it there is no-one in charge. This could be avoided by initiating an election straight away if there is only one node listed in the ClusterConcensusMembership[6] list. However this is currently not a high priority.

Losing an Election

  1. Slave Nodes
    Slave nodes have no tasks to perform except to move to the S_PENDING state and await contact from the DesignatedCoordinator[3].

    1. Previous DesignatedCoordinator[3] Nodes
      Any nodes that, prior to the election were the DesignatedCoordinator[3] but are not anymore have a number of tasks to perform.

    2. Release DesignatedCoordinator[3] Status
      This should be done as soon as possible to avoid having multiple nodes responding DC requests.

    3. Stop the PolicyEngine[14]

    4. Stop the Transitioner
      NOTE: It is unclear to me right now if this completely is safe. It would seem to be since:

      • As a result of step 1, any replies to outstanding requests will only be processed by the new DesignatedCoordinator[3].

      • It is unlikely any further actions should be instigated by an old DesignatedCoordinator[3].

Shutting Down the LastNode

This case is defined as trying to shutdown a ClusterNode[2] when no other ClusterNodes[5] have layer2 membership or better (as determined from consulting the ClusterConcensusMembership[6] list).

If we are the last active node, then there is little point starting the escalation process to and we know ahead of time that there is no-one to take over as DesignatedCoordinator[3]. In this case, once we have reached the S_PENDING state we should continue shutting down by:

  1. Stop local resources
  2. Disconnect from the LocalResourceManager[8]

  3. Disconnect from the ClusterConcensusMembership[6]

  4. Disconnect from Heartbeat
  5. Stop the ClusterInformationBase[7] and write it out to disk

  6. exit(0)

Healing Cluster Partitions

Immediately after two or more cluster partitions heal together there will be more than one DesignatedCoordinator[3]. This will be detected within the period of one DC_Heartbeat[13]. The case where one DesignatedCoordinator[3] receives a DC_Heartbeat[13] from another DesignatedCoordinator[3] is a trigger for an election after which only one ClusterNode[2] will remain the DesignatedCoordinator[3].

See Losing an Election for information on the actions of losing nodes.

InFlight Operations

When a Cluster Enters a new epoch (ClusterNodes[5] joining or leaving, elections, ...) there may be a number of requests in-transit or pending replies.

NOTE: This section is still, more than most sections anyway, a work in progress.

Starting Facts

Inferred Facts

Design

  1. The current epoch is broadcast via DC_Heartbeat[13] messages

  2. Requests sent to the DesignatedCoordinator[3] should be kept for possible resending (because the DesignatedCoordinator[3] may change)

  3. Responses from NodeA with epoch X - 1 will be processed as normal until the first response from NodeA with epoch X is received
  4. Votes from any epoch should be processed.

    Otherwise, it is possible that the old node will crown itself DesignatedCoordinator[3]. Thought: Does this open us up to a possible DoS attack? I.e constantly in an election.

Consequences

The following outlines why these possible replies would not be of interest to a newly elected DesignatedCoordinator[3]:

Questions

  1. Can the the order of messages from NodeA to NodeB where the messages are a mixture of boardcast and directed messages be guarenteed? Its not totally obvious from the Code.

References

[1]http://www.linux-ha.org/ClusterResourceManager
[2]http://www.linux-ha.org/ClusterNode
[3]http://www.linux-ha.org/DesignatedCoordinator
[4]http://www.linux-ha.org/FullyConnected
[5]http://www.linux-ha.org/ClusterNodes
[6]http://www.linux-ha.org/ClusterConcensusMembership
[7]http://www.linux-ha.org/ClusterInformationBase
[8]http://www.linux-ha.org/LocalResourceManager
[9]http://www.linux-ha.org/FiniteStateMachine
[10]http://www.linux-ha.org/ClusterResourceManagerDaemon/FSA
[11]http://www.linux-ha.org/HeartbeatProgram
[12]http://www.linux-ha.org/JoinProtocol
[13]http://www.linux-ha.org/DC_Heartbeat
[14]http://www.linux-ha.org/PolicyEngine
[15]http://www.cs.indiana.edu/cgi-bin/techreports/TRNNN.cgi?trnum=TR521
[16]http://www.linux-ha.org/HeartbeatMessages
[17]http://www.linux-ha.org/PerfectWorld
[18]http://www.linux-ha.org/LessThanPerfectWorld
[19]http://www.linux-ha.org/Transitioner
[20]http://www.linux-ha.org/FirstNode
[21]http://www.linux-ha.org/ClusterConcensusManager


This information provided courtesy of the Linux-HA project at http://linux-ha.org/