This site best when viewed with a modern standards-compliant browser. We recommend Firefox Get Firefox!.

Linux-HA project logo
Providing Open Source High-Availability Software for Linux and other OSes since 1999.

USA Flag UK Flag

Japanese Flag

Homepage

About Us

Contact Us

Legal Info

How To Contribute

Security Issues

18 August 2008 Heartbeat release 2.1.4 is now out Download it and install it!

11 October 2007 NEW educational HA/DR Blog hosted by Alan Robertson

9 April 2007 Check out the Cool Heartbeat Screencasts: Installation, Intro to the GUI Part of the Heartbeat Education project

Last site update:
2008-08-28 08:17:29

The algorithm

This is a rough sketch of the algorithm to be used. It is transferred from highly sophisticated tools and rigourous mathematical descriptions (aka drawings on handkerchiefs) to this mediocre format and thus will need a few rounds to settle down.

To recall, the purpose of this algorithm is to take the current cluster state, the configuration, and to determine the next steps.

The algorithm is basically divided into three phases. The first phase can be thought of as normalization of the input data. For the division between the second and third phase, keep in mind that we are solving two aspects: the location of the resources, and the ordering in which these actions are performed. By dividing of this into two distinct phases, the complexity of each is greatly reduced.

Phase 1

In phase one, we read in all the information from the ClusterInformationBase XML description provided to us and convert them into our internal data structures:

  • Resources are converted into a GHashTable by resource id.
  • Nodes are converted into a GHashTable by node id.
  • Constraints are ... guess what.
  • etc

However, during the building of this internal data structure, we also translate certain complex configuration directives into RISC-style primitives. This is actually a good place for PolicyPlugins to be inserted!

Examples of such translations might include, but is not fully covered yet:

  • the splitting of a multi-node resource into several primitive one-node instances, which are then easier to place. This likely also includes proper handling of their dependencies etc.

  • Or the insertion of implicit dependencies into the graph, lets say for STONITH operations.

  • Or taking out a cluster node from the list of candidate nodes for running resources because it is requesting to go down.
  • the translation of the one-way negated dependency Not on the same node as into two one-way dependency (bidirectional), to make sure that regardless of which node in the graph we evaluate first, we arrive at the same conclusions.

  • etc.

Maybe some of these also happen later, who knows ;-)

Phase 2

In phase two, we determine the location of the resources.

We traverse the graph according to the resource placement priority (so that a top-level resource with a weight of 100 will get it's wishes satisfied before a less important resource), following the placement dependencies, ignoring the ordering ones.

  • As AlanRobertson points out, this ordering needs to be stable, both for production as for sanity during testing, so in case of priority being ambigious (or none set at all), we arbitate on a static value, such as the resource uuid, name or whatever - so that the algorithm yields repeatable results.

This is a graph colouring problem. We follow the locational dependencies from top to bottom, setting all yet-uncoloured resources to our current colour, and narrowing down the set of nodes associated with the colour at the same time.

  • If a resource is uncoloured when we reach it, we take the union of the candidate nodes for our current colour and the resource (we we need to determine a list of nodes acceptable for all resources of this colour). For negative co-location constraints, we obviously do the opposite.
  • We also merge the weights of the candidate nodes, with the goal of running on the node preferred by most resources. This step needs elaboration, but factors include the weights in the different candidate sets, if a resource is already active on a given node, of course the resource priority and possibly other configuration directives and constraints. This merge may also be another place for PolicyPlugins to go in, but this is not yet clear.

  • If a resource is already coloured when we reach it (ie, a higher priority resource than the one from which we have currently started depends on it), we only update our set of nodes accordingly, but do not recolour the resource. Then, we go back up.
  • If in the last step we have found the sets of eligible node to be disjunct, whether or not this propagates upwards as an error is the must versus should distinction in the strength of the constraint. An unsatisfied must constraint will propagate upwards and essentially cause all higher-levels to be failed too.

At the end of this phase, all resources will have been assigned a colour, and each colour will in fact correspond to a node on which to run the resources of this colour.

However, if the current cluster state does not allow a certain subtree of the resources to be allocated, it may also be an empty set. (If all nodes are going offline, or if a high priority resource pushes off a low priority one after a failure.)

Phase 3

In phase three, we finally determine the ordering in which the operations have to be performed. Thus, everytime I say dependency in the following, I'm referring to the ordering, not to the location.

This time, we do not traverse the graph according to resource priority, but least number of dependencies. (Or should we use the resource priority once more, using the number of references as a tie-breaker only?)

  • We may find resources which cannot be started due to some reason or which are permanently failed; whether or not this propagates upwards as an error is the must versus should distinction in the strength of the constraint. An unsatisfied must constraint will propagate upwards and essentially cause all higher-levels to be failed too.

We are actively building the TransitionGraph in this step - we stop resources which are running where they should not be running, and start them on nodes where they should be (if any). We insert STONITH operations into the tree as needed.

Done

We write out the TransitionGraph and hand it back to the CRM.

OpenIssues

Please, do discuss these issues with me by e-mail instead of the Wiki ;)

  • A ResourceInstance in state stop failed is not relocatable to another node in phase 3. If a resource with a keep me running no matter what depends on it (directly or indirectly), the node in question needs to be fenced (after having tried to stop all other resources on it); we may need to insert a take node offline with force constraint in phase 1 for this.

    • TODO: This keep me running no matter what is not yet modelled in the constraints!

  • It is certainlybe possible to translate the above into a linear programming problem, and apply a Simplex algorithm to find a solution. However, it is not yet clear to me how that might be done. (Some of the constraints may turn out to be not satisfyable in the current configuration, and expressing ordering and parallelization in such systems is not straightforward too.) Linear programming is a very general approach but also heuristic approach, but for certain scenarios - such as flow in a network et cetera - more efficient algorithms exist. It is not clear whether an advantage is gained from employing a linear programming approach in this scenario, but it is clearly an interesting research problem for a graduate student.
    • The above algorithm certainly is flexible enough to satisfy quite a few requests, and is likely more easily understood, which is vital for a first release. Actually, I think the Phase 1 mapping of complex configuration directives into primitive ones is pretty similar to the translation which would be done by PolicyPlugins translating them into additional constraints for the linear equation system. I hope to revisit this problem when we find more time.

See also

TransitionGraph, ClusterResourceManager, NodeFencing, ClusterInformationBase.