
This is a rough sketch of the algorithm to be used. It is transferred from highly sophisticated tools and rigourous mathematical descriptions (aka drawings on handkerchiefs) to this mediocre format and thus will need a few rounds to settle down.
To recall, the purpose of this algorithm is to take the current cluster state, the configuration, and to determine the next steps.
The algorithm is basically divided into three phases. The first phase can be thought of as normalization of the input data. For the division between the second and third phase, keep in mind that we are solving two aspects: the location of the resources, and the ordering in which these actions are performed. By dividing of this into two distinct phases, the complexity of each is greatly reduced.
In phase one, we read in all the information from the ClusterInformationBase[1] XML description provided to us and convert them into our internal data structures:
However, during the building of this internal data structure, we also translate certain complex configuration directives into RISC-style primitives. This is actually a good place for PolicyPlugins[2] to be inserted!
Examples of such translations might include, but is not fully covered yet:
the splitting of a multi-node resource into several primitive one-node instances, which are then easier to place. This likely also includes proper handling of their dependencies etc.
Or the insertion of implicit dependencies into the graph, lets say for STONITH operations.
the translation of the one-way negated dependency Not on the same node as into two one-way dependency (bidirectional), to make sure that regardless of which node in the graph we evaluate first, we arrive at the same conclusions.
Maybe some of these also happen later, who knows
In phase two, we determine the location of the resources.
We traverse the graph according to the resource placement priority (so that a top-level resource with a weight of 100 will get it's wishes satisfied before a less important resource), following the placement dependencies, ignoring the ordering ones.
As AlanRobertson[3] points out, this ordering needs to be stable, both for production as for sanity during testing, so in case of priority being ambigious (or none set at all), we arbitate on a static value, such as the resource uuid, name or whatever - so that the algorithm yields repeatable results.
This is a graph colouring problem. We follow the locational dependencies from top to bottom, setting all yet-uncoloured resources to our current colour, and narrowing down the set of nodes associated with the colour at the same time.
We also merge the weights of the candidate nodes, with the goal of running on the node preferred by most resources. This step needs elaboration, but factors include the weights in the different candidate sets, if a resource is already active on a given node, of course the resource priority and possibly other configuration directives and constraints. This merge may also be another place for PolicyPlugins[2] to go in, but this is not yet clear.
If in the last step we have found the sets of eligible node to be disjunct, whether or not this propagates upwards as an error is the must versus should distinction in the strength of the constraint. An unsatisfied must constraint will propagate upwards and essentially cause all higher-levels to be failed too.
At the end of this phase, all resources will have been assigned a colour, and each colour will in fact correspond to a node on which to run the resources of this colour.
However, if the current cluster state does not allow a certain subtree of the resources to be allocated, it may also be an empty set. (If all nodes are going offline, or if a high priority resource pushes off a low priority one after a failure.)
In phase three, we finally determine the ordering in which the operations have to be performed. Thus, everytime I say dependency in the following, I'm referring to the ordering, not to the location.
This time, we do not traverse the graph according to resource priority, but least number of dependencies. (Or should we use the resource priority once more, using the number of references as a tie-breaker only?)
We may find resources which cannot be started due to some reason or which are permanently failed; whether or not this propagates upwards as an error is the must versus should distinction in the strength of the constraint. An unsatisfied must constraint will propagate upwards and essentially cause all higher-levels to be failed too.
We are actively building the TransitionGraph[4] in this step - we stop resources which are running where they should not be running, and start them on nodes where they should be (if any). We insert STONITH operations into the tree as needed.
We write out the TransitionGraph[4] and hand it back to the CRM.
Please, do discuss these issues with me by e-mail instead of the Wiki ;)
A ResourceInstance[5] in state stop failed is not relocatable to another node in phase 3. If a resource with a keep me running no matter what depends on it (directly or indirectly), the node in question needs to be fenced (after having tried to stop all other resources on it); we may need to insert a take node offline with force constraint in phase 1 for this.
TODO: This keep me running no matter what is not yet modelled in the constraints!
The above algorithm certainly is flexible enough to satisfy quite a few requests, and is likely more easily understood, which is vital for a first release. Actually, I think the Phase 1 mapping of complex configuration directives into primitive ones is pretty similar to the translation which would be done by PolicyPlugins[2] translating them into additional constraints for the linear equation system. I hope to revisit this problem when we find more time.
TransitionGraph[4], ClusterResourceManager[6], NodeFencing[7], ClusterInformationBase[1].
| [1] | http://www.linux-ha.org/ClusterInformationBase |
| [2] | http://www.linux-ha.org/PolicyPlugins |
| [3] | http://www.linux-ha.org/AlanRobertson |
| [4] | http://www.linux-ha.org/TransitionGraph |
| [5] | http://www.linux-ha.org/ResourceInstance |
| [6] | http://www.linux-ha.org/ClusterResourceManager |
| [7] | http://www.linux-ha.org/NodeFencing |
This information provided courtesy of the Linux-HA project at http://linux-ha.org/