This site best when viewed with a modern standards-compliant browser. We recommend Firefox Get Firefox!.

Linux-HA project logo
Providing Open Source High-Availability Software for Linux and other OSes since 1999.

USA Flag UK Flag

Japanese Flag

Homepage

About Us

Contact Us

Legal Info

How To Contribute

Security Issues

21 December 2007 Heartbeat release 2.1.3 is now out Download it and install it!

11 October 2007 NEW educational HA/DR Blog hosted by Alan Robertson

9 April 2007 Check out the Cool Heartbeat Screencasts: Installation, Intro to the GUI Part of the Heartbeat Education project

Last site update:
2008-07-25 12:25:52

Cluster Resource Management (CRM) Daemon

Overview

The ClusterResourceManagerDaemon co-ordinates the actions of all other ClusterResourceManager subsystems on a given ClusterNode. One of these is elected as the DesignatedCoordinator based on a list of FullyConnected ClusterNodes from the ClusterConcensusMembership.

Regular instances of the ClusterResourceManagerDaemon are slaves to the DesignatedCoordinator and effectively end up as glorified routers. Passing messages to and from the DesignatedCoordinator for sub-systems such as the ClusterInformationBase and the LocalResourceManager.

The actions of the ClusterResourceManagerDaemon are controlled by a FiniteStateMachine which is described here.

Startup

1. Local Startup

  • If any startup actions fail, the ClusterResourceManagerDaemon will terminate as it considers itself unhealthy. Therefore the HeartbeatProgram and the LocalResourceManager must all have been started prior to the ClusterResourceManagerDaemon. In particular the ClusterConcensusMembership daemon must not only have been started but also needs to have established itself in the cluster.

    NOTE: For more information, see Starting the FirstNode.

  • Start DC Timer
    The DC Timer tracks how long it has been since this node heard from the DesignatedCoordinator and is used to trigger elections (see below).

  • Wait for an event
    Events include being asked to shutdown, an election being triggered (locally or by another node), or receiving communication from the DesignatedCoordinator.

    At this point the node is in a pending state. Eligible to participate in the resource layer of the cluster but not yet with permission to do so. Such permission can only be granted by the DesignatedCoordinator and is done so as part of the JoinProtocol outlined below.

Join Protocol

  1. ClusterNode Announces Presence to the DesignatedCoordinator
    This action is initiated upon receiving a DC_Heartbeat from the DesignatedCoordinator.

  2. DesignatedCoordinator Verifies Announce Message
    At this step, the DesignatedCoordinator will check if the node has established layer2 membership (Ie. is in the list supplied by the ClusterConcensusMembership) and if not, discards the announcement and aborts the join protocol for that node.

  3. DesignatedCoordinator Issues Invitation to Join the CRM Layer (layer3)
    NOTE: Upon election, the DesignatedCoordinator will also send this invitation to all ClusterNodes with layer2 membership.

  4. ClusterNode Receives Membership Offer

  5. ClusterNode Updates its ClusterInformationBase "status" Entry Locally
    The information being updated is primarily from the LocalResourceManager about what resources are currently running under its control. However, ClusterNode's copy may contain information about nodes that the DesignatedCoordinator is not aware of yet which much be preserved.

  6. ClusterNode Acknowledges the DesignatedCoordinator
    The ClusterNode's updated status section is attached to the acknowledgement as data.

  7. DesignatedCoordinator Receives the ClusterNode's Acknowledgement
    The DesignatedCoordinator uses the status section attached to the acknowedgement to update the ClusterInformationBase but does not yet replicate the changes. Updates are not taken "verbatum" but are completed based on the timestamps of entries contained in the update.

  8. DesignatedCoordinator Updates the ClusterNode's ClusterInformationBase Entry
    The node's state in the ClusterInformationBase is updated to "active" after which it is able to be considered by the PolicyEngine when allocating resources.

  9. DesignatedCoordinator Distributes the Updated ClusterInformationBase

  10. The PolicyEngine is Invoked by the DesignatedCoordinator
    The PolicyEngine is invoked to make sure existing resources are allocated correctly and determine if new ones are able to be started.

  11. The Join Protocol is Complete

DesignatedCoordinator Election

The election algorithm has been adapted from http://www.cs.indiana.edu/cgi-bin/techreports/TRNNN.cgi?trnum=TR521

Loosely known as the Bully Algorithm, its major points (with some local substitutions) are:

  • Election is initiated by any node (N) which notices that the coordinator is no longer responding
  • Concurrent multiple elections are possible
  • Algorithm
    • N sends ELECTION messages to all nodes that occur earlier in the CCM's membership list.
    • If no one responds, N wins and becomes coordinator
    • N sends out COORDINATOR messages to all other nodes in the partition
    • If one of higher-ups answers, it takes over. N is done.

Triggers

  1. The DC Timer
    If any node goes without receiving any DC_Heartbeats in the interval in which (currently) 4 should have been received, it means that there is no DesignatedCoordinator and an election is required.

  2. Receiving a DC_Heartbeat from another DesignatedCoordinator
    If one DesignatedCoordinator receives a heartbeat from another DesignatedCoordinator (such as after a cluster partition has healed), an election is also triggered.

  3. User Intervention
    Currently there is no facility to manually instigate an election. This may be considered at a later date depending on demand.

Voting

Nodes must satisfy the following criteria in order to be eligible to vote

  • healthy (ie. not in a recovery state)
  • not shutting down
  • successfully completed the local startup actions

Votes are sent as broadcast HeartbeatMessages to all other nodes.

Winner Algorithm

The winner of the election in a PerfectWorld is the ClusterNode in the list supplied by the ClusterConcensusMembership with the lowest value of born_on. born_on was chosen as it indicates how long a ClusterNode has been in the cluster for and it is desirable that the DesignatedCoordinator should be placed on a stable ClusterNode as migrating it requires resources and time.

If there are more than one ClusterNode with the same born_on (which is likely), nodes are differentiated by their uuid. uuid was chosen as it is constant to a node and globally unique. The node with the lowest uuid (as determined by strcmp()) is deemed to be the winner.

In a LessThanPerfectWorld, the expected winner might be sick or shutting down and thus will not be able to cast their vote and claim the election.

In this case, the winner is the node that, of those that cast their vote, has the lowest born_on and uuid.

Vote Counting

When Node1 counts a vote from Node2, the Winner Algothim is consulted to determine which of the two nodes (the current node or the node casting the vote) would win the election. If Node2 would win, then Node1 has lost the election and returns to a pending state, waiting for contact from the DesignatedCoordinator. However if Node1 would win, then it broadcasts (possibly not for the first time) its own vote. Next, Node1 then checks to see if it is the PerfectWorld election winner. If so, Node1 cancels the election and proceeds to take over the role of DesignatedCoordinator. Otherwise, Node1 must wait for the election expiration timer to go off before concluding it has won the election.

NOTE: At some point the voting process may be augmented to include information sent as part of the voting process (ie. version information or whether the node is currently a DesignatedCoordinator). Should this be the case, then a PerfectWorld election winner cannot be calculated and the winner must always wait for the election expiration timer to go off.

Becoming the DesignatedCoordinator

  1. Start the PolicyEngine

  2. Start the Transitioner

  3. Set Appropriate "Register" Flags
  4. We are now the DesignatedCoordinator Further actions are:

  5. Instigate ClusterNode discovery
    This is achieved by (re)-initiating the join protocol for every node with layer 2 membership. See Join Protocol above for details.

  6. Resource Allocation - Invoking the PolicyEngine
    Once the JoinProtocol has completed for all layer2 nodes, the PolicyEngine is invoked to determine the next stable state of the cluster. We wait for all nodes to join to allow the cluster to stabilize as much as is possible before the expensive process of reallocating resources is initiated. Since not all layer2 nodes may be able to be promoted to layer3, a timeout exists which initiates the resource allocation process.

  7. Reaching the Next Stable State - Invoking the Transitioner

Shutdown

DesignatedCoordinator Only

  1. Stop sending DC_Heartbeats
    This will lead to one of the other nodes triggering an election. Eventually the DC may be able to trigger an election itself. However at the moment it can't do that without voting and if it votes then it is more than likely to win this election for the same reasons it won last time.

  2. Release DC status
  3. Stop the PolicyEngine

  4. Stop the Transitioner

  5. Drop back into the pending state
  6. Wait for a new DesignatedCoordinator to be elected

  7. Complete the join protocol as a slave
  8. Shut down as a slave node (Below)

    NOTE: See Stopping the LastNode for variations on this sequence.

Slave Instances of the ClusterResourceManagerDaemon

  1. Request the DesignatedCoordinator shut us down

  2. Let the DesignatedCoordinator / PolicyEngine / Transitioner do their work

  3. Obey the Transitioner's last command (Ie. shutdown)
  4. Disconnect from the LocalResourceManager

  5. Disconnect from the ClusterConcensusMembership

  6. Disconnect from Heartbeat
  7. Stop the ClusterInformationBase and write it out to disk.

  8. exit(0)

    NOTE: Might require a third expected state "shut_me_down" so that the PE doesnt just STONITH it.

Escalation Process

If any stage of the shutdown process stalls, an escalation process is initiated.
The escalation process is two-phased.

  • Phase 1 is to re-issue the request to the DesignatedCoordinator.

  • Phase 2 is to shutdown local resources and exit anyway.

NOTE: Currently only phase 2 is implemented and it is broken as the timers are not working as expected.

Special Corner Cases

Starting the FirstNode

This case is defined as trying to start a ClusterNode when no other ClusterNodes have layer2 membership or better (as determined from consulting the ClusterConcensusMembership list).

Starting the FirstNode will take longer than following nodes for two reasons.

  1. Waiting for ClusterConcensusMembership
    Currently connections to the ClusterConcensusManager will fail (it appears) until it has had a chance to establish a view of the cluster. Ie. timed out all the other nodes. Because of this, it would not be unusual for Heartbeat to respawn the ClusterResourceManagerDaemon a number of times before startup is successful.
    This is not desirable behaviour and is likely to be addressed soon.

  2. Waiting for the DesignatedCoordinator
    The FirstNode currently must wait for the "DesignatedCoordinator timeout" before it realises that it there is no-one in charge. This could be avoided by initiating an election straight away if there is only one node listed in the ClusterConcensusMembership list. However this is currently not a high priority.

Losing an Election

  1. Slave Nodes
    Slave nodes have no tasks to perform except to move to the S_PENDING state and await contact from the DesignatedCoordinator.

    1. Previous DesignatedCoordinator Nodes
      Any nodes that, prior to the election were the DesignatedCoordinator but are not anymore have a number of tasks to perform.

    2. Release DesignatedCoordinator Status
      This should be done as soon as possible to avoid having multiple nodes responding DC requests.

    3. Stop the PolicyEngine

    4. Stop the Transitioner
      NOTE: It is unclear to me right now if this completely is safe. It would seem to be since:

Shutting Down the LastNode

This case is defined as trying to shutdown a ClusterNode when no other ClusterNodes have layer2 membership or better (as determined from consulting the ClusterConcensusMembership list).

If we are the last active node, then there is little point starting the escalation process to and we know ahead of time that there is no-one to take over as DesignatedCoordinator. In this case, once we have reached the S_PENDING state we should continue shutting down by:

  1. Stop local resources
  2. Disconnect from the LocalResourceManager

  3. Disconnect from the ClusterConcensusMembership

  4. Disconnect from Heartbeat
  5. Stop the ClusterInformationBase and write it out to disk

  6. exit(0)

Healing Cluster Partitions

Immediately after two or more cluster partitions heal together there will be more than one DesignatedCoordinator. This will be detected within the period of one DC_Heartbeat. The case where one DesignatedCoordinator receives a DC_Heartbeat from another DesignatedCoordinator is a trigger for an election after which only one ClusterNode will remain the DesignatedCoordinator.

See Losing an Election for information on the actions of losing nodes.

InFlight Operations

When a Cluster Enters a new epoch (ClusterNodes joining or leaving, elections, ...) there may be a number of requests in-transit or pending replies.

NOTE: This section is still, more than most sections anyway, a work in progress.

Starting Facts

  • The order of messages from NodeA to NodeB can be guarenteed
  • The order of broadcast messages from NodeA can be guarenteed
  • The order of messages from NodeA to NodeB where the messages are a mixture of boardcast and directed messages can be guarenteed (See questions below)
  • The order of messages from NodeA to NodeB and from NodeB to NodeA can NOT be guarenteed

  • The internal operations in the ClusterResourceManager are "push" rather than "pull"

  • The DesignatedCoordinator in epoch X may not be the DesignatedCoordinator in epoch X + 1

Inferred Facts

  • All requests for epoch X from the DesignatedCoordinator will be received before a new epoch is declared

  • Responses to requests made (to or from the DesignatedCoordinator) during epoch X may be received during epoch X + 1

Design

  1. The current epoch is broadcast via DC_Heartbeat messages

  2. Requests sent to the DesignatedCoordinator should be kept for possible resending (because the DesignatedCoordinator may change)

  3. Responses from NodeA with epoch X - 1 will be processed as normal until the first response from NodeA with epoch X is received
  4. Votes from any epoch should be processed.

    Otherwise, it is possible that the old node will crown itself DesignatedCoordinator. Thought: Does this open us up to a possible DoS attack? I.e constantly in an election.

Consequences

  • Requests that are not from the "current" epoch should be discarded (with the expection of "vote" as mentioned above).
  • "dc_beat" requests from older epochs should also be discarded (just stating it explicitly)
  • The DesignatedCoordinator must keep track of which epoch its slave nodes are up to

  • Replies from newer epochs should be discarded by the DesignatedCoordinator

  • Replies from newer epochs from the DesignatedCoordinator should be processed as normal by slave nodes

  • After an election of a new DesignatedCoordinator (as apposed to a re-election of a DesignatedCoordinator), NodeA's "current" epoch should be 0. Thought: Does this open us up to a possible replay attack?

The following outlines why these possible replies would not be of interest to a newly elected DesignatedCoordinator:

  • lrm_op
    These can safely be ignored as any update to resources will be discovered during the JoinProtocol when the slave node updates it's status section.

  • shutdown
    These can safely be ignored as the node's expected state would already have been updated to "down" and we will receive notice of its successful completion from the CCM. QUESTION: Should/Will there be a "shutdown" that would only shutdown the ClusterResourceManagerDaemon and NOT the HeartbeatProgram as well? If so, perhaps these answers should not be ignored.

  • update,replace,delete
    We're about to overwrite 90% of their copy of the ClusterInformationBase and any problems in the status section will become evident during the JoinProtocol

  • query
    If we are interested in finding something out, we will ask.

Questions

  1. Can the the order of messages from NodeA to NodeB where the messages are a mixture of boardcast and directed messages be guarenteed? Its not totally obvious from the Code.