Providing Open Source High-Availability Software for Linux and other OSes since 1999.

Last site update:
2008-09-07 06:55:55

Introduction and terminology

A regular fail-over style resource instance may be run only once in the cluster. Running it more than once is a fatal error; the cluster resource manager is expected to rather not run it at all than risk activating it more than once.

As always, exceptions exist. Some resources can be concurrently active on more than one node. We have decided to call this multiple incarnations of the same resource instance, so that it is always clear which entity we are discussing.

The ClusterResourceManager/MultiStateResources are a special case of this; not only do they support multiple incarnations, but they support / require that those be run in different states; typical examples include all things replicated. The extensions to the multiple incarnations model required for this are discussed on that page.

Our prime example for a resource supporting multiple incarnations but not multiple states is the ClusterAlias IP. Other examples might include running several identical Apache instances on top of a shared filesystem (NFS, GFS or read-only updates via rsync) or whatever else people may come up with once given this capability.

Additional resource parameters

This requires the following new parameters for affected resources in the CIB:

  • incarnations_max_global shall define the maximum number of concurrent incarnations of the resource instance in the whole cluster. The CRM will start up to this many incarnations in the whole cluster. This number shall default to 1 for normal resources.

  • incarnations_max_node shall define the maximum number of concurrent incarnations on a single node. If this number is greater than 1, the CRM will start several incarnations on the same node to reach the global incarnation count.

Resource Agent defaults

The defaults for the new parameters shall come in from the RA metadata. As they are essentially regular instance parameters, for which we need localized text to describe them and their effect on the resource etc, I propose to make them special instance parameters with well-known reserved names and of type integer.

It would be helpful if we extended the content element in the metadata to allow a min,max tuple to be set for integer.

As names, I suggest using the same ones as in the CIB; if desired, we could prefix them with _ to point out their reserved character.

For RAs which don't set these, they shall default to the regular behaviour of one incarnation maximum.

Q&A time: What if the admin wants to run several incarnations of a resource agent which doesn't support it, for example the above-mentioned Apache on top of a shared pool? Suggested answer: These options should still be available in an expert menu (with some default explanation), even if the RA metadata does not explicitly advertise them. (If the metadata does advertise them, they should, being a regular feature of the resource, show up in the regular settings.)

Exporting this information to the Resource Agent

As instance parameters, these maximum numbers from the settings would be exported as regular OCF_RESKEY_ fields. (This mechanism is not supported for resource agent classes which don't take name/value pairs!)

Additionally, we shall provide the Resource Agent with an incarnation_number parameter to tell it which incarnation it is supposed to be operating on.

So if, for example, start 7 out of 9 means something special to it (as it does for the Cluster Alias IP), it can do its magic. Otherwise, it may just ignore it.
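As a rough illustration of the export described above, the following sketch builds the environment an OCF-style RA might see for one incarnation. The function name and the sample ip parameter are made up for this example, and passing incarnation_number with the same OCF_RESKEY_ prefix as the other fields is an assumption, not something this proposal settles.

```python
def ra_environment(instance_params, incarnation_number):
    """Build the name/value environment for one incarnation of an OCF-style RA.

    Hypothetical helper: the proposal only says the parameters are exported
    as regular OCF_RESKEY_ fields; everything else here is illustrative.
    """
    params = dict(instance_params)
    # Assumption: the incarnation number travels like any other instance parameter.
    params["incarnation_number"] = incarnation_number
    return {f"OCF_RESKEY_{name}": str(value) for name, value in params.items()}


# "start 7 out of 9" for a cluster alias style resource (sample values only)
env = ra_environment(
    {"ip": "10.0.0.10", "incarnations_max_global": 9, "incarnations_max_node": 1},
    incarnation_number=7,
)
for key in sorted(env):
    print(f"{key}={env[key]}")
```

An RA that does not care about incarnations can simply ignore the extra variables.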

Start, stop etc semantics

One can see at least two different models for handling the RA semantics here.

The first one would be to start the resource instance as normal (as a local master) and then eventually add incarnations to it via some special, additional start-incarnation command, with equivalents for monitor, stop et cetera.

The second one would be to suggest that we can start, stop, monitor incarnations directly (via supplying the additional parameters as described), and if the RA needs to take special precautions when starting the first or stopping the last incarnation, this should be handled inside the RA itself.

The second approach extends the semantics of start, stop, monitor slightly, but it greatly eases the complexity in the CRM, and is thus the one to go with. In the LocalResourceManager, these would be handled as completely different objects (the LRM doesn't need to know about incarnations).

There is a slight deficiency in this scheme when it comes to discovery, both at start-up time and whenever we need to re-discover what runs where. Essentially, we need to query each node about every single resource instance and all its incarnations, which may be a performance problem for larger configurations. One could argue that an extended status which operates on the instance and reports back the state of all incarnations in a single go is preferable; this would also allow us to catch scenarios where the instance is running with different maximum values on different nodes. The performance hit likely doesn't matter much initially though, and the second case can only occur when the admin messes with us really badly; I'd defer this extension to a future version.

Numbering of incarnations

Incarnations are numbered from 1 to N.

Notifying the other incarnations

In some scenarios (discovered while discussing a specific failure scenario of drbd), we need to notify the other incarnations before and after we perform an operation on an incarnation.

This shall be implemented by an additional notify action in the RA, which will be called at the appropriate times with the following additional parameters to inform it about what happened or is about to happen:

  • target_incarnation the number of the incarnation which this notification is about

  • target_node the id (uname?) of the node on which said incarnation is. LarsMarowskyBree isn't quite sure about this one, but it's cheap to add and might be useful.

  • target_action the action which has been performed or is about to be performed. No notification is delivered for meta-data and notify itself.

  • target_notify_type is set to one of

    • pre implying that we intend to perform that operation for start, stop, recover, promote, demote, reload; we do not deliver pre-monitor events.

    • post implying that the operation has been performed, and the result status of it will be passed to the RA in the target_result parameter.

  • If the other incarnation has failed and the node been fenced, we'll deliver a synthetic pre fence notification before the fencing and a synthetic post fence notification after it.

  • If one incarnation reports a monitor failure, the other incarnations will receive a post monitor notification.

  • Before performing the action itself, we wait until all pre notifications have completed.

  • After performing the action but before performing any other actions which might have depended on it, we wait until all post notifications have completed.

  • As per discussion between Andrew and Lars, this is strictly a notification mechanism. While we wait until all notify actions have completed before performing the action itself, a failed status code here does not 'veto' the action; that way leads to madness.

  • After start, we ought to deliver an initial set of post notifications to the incarnation about the state of its peers as far as we know it.

  • Do we need a way for the RA to sign-up explicitly? Right now I'd deliver them all as soon as we see a notify operation in the meta-data, which will do.

  • LarsMarowskyBree can see a fast resource agent class coming up here which actually links in via some faster mechanism instead of always exec'ing, but this should be good enough for round one.
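The ordering rules in the list above can be sketched as follows. This is a minimal single-process model, not CRM code: run_action, perform and notify are hypothetical names, the calls are synchronous (so "wait until all notifications have completed" is trivially true), and delivering post monitor only on failure is one reading of the list. Synthetic fence notifications and the initial post notifications after start are left out.

```python
NO_NOTIFY = {"meta-data", "notify"}
PRE_ACTIONS = {"start", "stop", "recover", "promote", "demote", "reload"}


def run_action(action, target, node, peers, perform, notify):
    """Run one action on incarnation `target`, notifying its peer incarnations."""

    def broadcast(kind, extra=None):
        for peer in peers:
            params = {
                "target_incarnation": target,
                "target_node": node,
                "target_action": action,
                "target_notify_type": kind,
            }
            params.update(extra or {})
            # Return value deliberately ignored: a failed notify is no veto.
            notify(peer, params)

    if action in NO_NOTIFY:
        return perform(target, action)
    if action in PRE_ACTIONS:  # no pre-monitor events
        broadcast("pre")
    result = perform(target, action)
    # Assumption: a successful monitor produces no post notification.
    if action != "monitor" or result != 0:
        broadcast("post", {"target_result": result})
    return result


log = []
run_action(
    "start", 1, "node-a", [2, 3],
    perform=lambda inc, act: log.append(("perform", inc, act)) or 0,
    notify=lambda peer, p: log.append(("notify", peer, p["target_notify_type"])),
)
for entry in log:
    print(entry)
```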

Handling of dependencies

Dependencies between resources which support multiple incarnations are of course important. Note that here we are not yet discussing several states, so we are looking at plain dependencies.

Besides the first sane case, there are several special cases to be discussed; these follow largely from the suggested internal handling of the incarnations inside the CRM.

Internal handling inside the CRM policy engine

Just as already hinted at above for the LRM, internally the CRM will expand a resource instance capable of running several incarnations into several atomic objects (each one representing a single incarnation).

For coherency, it may be sane to expand resources which don't support multiple incarnations to a single incarnation object too, as this may avoid special code paths (except that we may wish to filter the special RA parameters in this case so as not to confuse the unsuspecting RA, but they are supposed to ignore unknown parameters anyway...).
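A minimal sketch of this expansion step, assuming a hypothetical Incarnation record and expand function (neither name comes from the CRM), with incarnations numbered from 1 to N and a regular resource treated as the N = 1 case:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incarnation:
    """One atomic object as seen by the policy engine (illustrative only)."""
    resource_id: str
    number: int


def expand(resource_id, incarnations_max_global=1):
    """Expand a resource instance into its atomic incarnation objects."""
    return [
        Incarnation(resource_id, n)
        for n in range(1, incarnations_max_global + 1)
    ]


print(expand("cluster_alias", 3))
print(expand("plain_resource"))  # a regular resource becomes a single object
```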

LarsMarowskyBree points out for the future, just to be annoying, that this is a neat place for plugins - ie the processing of complex resources into simple atomic objects.

Inter-dependencies between symmetric resources

If the administrator configures dependencies between resource instances which both have the same maximum number of incarnations, we can internally expand them to that number of N atomic dependencies, ie from incarnation 1 of res foo to incarnation 1 of res bar etc. This case is fairly straight-forward.

Explicit inter-dependencies between incarnations

If the administrator configures explicit dependencies between incarnation 2 of res foo and incarnation 3 of res bar (or even to res baz, if that is a regular resource instance), again our handling of this is simple enough and straightforward.

Inter-dependencies between asymmetric resources

We go through the resource tree top-down, and expand the dependencies in the same fashion. If we encounter a resource which supports more incarnations than the one it depends on, this essentially means we get dangling dependencies which cannot be satisfied, which we should warn about in the logfiles. Thus, those excess incarnations would not be started.

In the other scenario, where a resource depends on a resource which supports more incarnations than itself, this doesn't lead to dangling dependencies (as the lower-level incarnations can still be started without harm, and all dependencies for us are satisfied), and thus isn't a problem.
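The symmetric and asymmetric cases can be sketched with one function. expand_dependency is a hypothetical name, and pairing incarnation n of one resource with incarnation n of the other follows the symmetric case described earlier; the function only reports dangling incarnations, where the real thing would also log a warning.

```python
def expand_dependency(dependent, dependent_max, target, target_max):
    """Expand 'dependent depends on target' into per-incarnation dependencies.

    Returns (pairs, dangling): pairs are ((dependent, n), (target, n)) tuples,
    dangling lists the incarnations of `dependent` with nothing to depend on.
    """
    common = min(dependent_max, target_max)
    pairs = [((dependent, n), (target, n)) for n in range(1, common + 1)]
    # Excess incarnations of the dependent cannot be satisfied and stay stopped.
    dangling = [(dependent, n) for n in range(target_max + 1, dependent_max + 1)]
    return pairs, dangling


# foo supports 5 incarnations but depends on bar, which supports only 3
pairs, dangling = expand_dependency("foo", 5, "bar", 3)
print(pairs)
print("not started:", dangling)
```

When the target supports more incarnations than the dependent, dangling is empty and the extra target incarnations are simply not referenced.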

Intra-dependencies between incarnations of the same resource

This shall not be supported as I can't figure it out sanely and also don't see a need.

By default, the CRM shall try to balance the number of incarnations on each node, ie spread them out as evenly as possible within the constraints of the incarnations_max_node and the number of available nodes.
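One simple way to get such an even spread is round-robin assignment, sketched below. The function name place and the round-robin strategy are assumptions for illustration; the proposal only asks for a balanced result within the two limits.

```python
def place(nodes, incarnations_max_global, incarnations_max_node):
    """Spread incarnations 1..N over nodes as evenly as the limits allow."""
    placement = {node: [] for node in nodes}
    number = 1
    # Each pass hands at most one more incarnation to every node.
    for _ in range(incarnations_max_node):
        for node in nodes:
            if number > incarnations_max_global:
                return placement
            placement[node].append(number)
            number += 1
    return placement


print(place(["a", "b", "c"], 5, 2))
print(place(["a", "b"], 5, 1))  # only two of the five incarnations fit
```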

Error recovery

Losing a node

If we lose a node and now have fewer nodes available than incarnations_max_global, and incarnations_max_node does not allow us to start several incarnations in parallel on the remaining nodes to fill up, the incarnations we can no longer place will be flagged as failed.

At initial startup (except if explicit dependencies force other behaviour) we are likely to have a continuous numbering scheme from 1 through whatever number of nodes we have available; note that run-time errors may leave us with discontinuous incarnation numbering.
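A small sketch of the node-loss case, under the assumption that the incarnations of the lost node are moved to the least-loaded survivor with spare capacity and flagged as failed otherwise. handle_node_loss and the placement dictionary are hypothetical; incarnations keep their numbers, which is where the gaps come from.

```python
def handle_node_loss(placement, lost_node, incarnations_max_node):
    """Return (new_placement, failed) after `lost_node` disappears."""
    survivors = {n: list(incs) for n, incs in placement.items() if n != lost_node}
    failed = []
    for incarnation in placement.get(lost_node, []):
        candidates = [
            n for n in survivors if len(survivors[n]) < incarnations_max_node
        ]
        if candidates:
            # Prefer the least-loaded surviving node to stay balanced.
            target = min(candidates, key=lambda n: len(survivors[n]))
            survivors[target].append(incarnation)
        else:
            failed.append(incarnation)
    return survivors, failed


before = {"a": [1], "b": [2], "c": [3]}
print(handle_node_loss(before, "b", 1))  # 2 is flagged failed; 1 and 3 keep running
print(handle_node_loss(before, "b", 2))  # 2 moves to a surviving node
```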

Split-brain et al

After a resolved split-brain scenario, sophisticated administrator intervention or, although completely far-fetched, a software bug, we may discover that we are running more incarnations of a given resource instance than we should: either a resource incarnation is running twice somewhere, or excess incarnations are running because the admin has interfered directly or has changed the maximum number of incarnations in one half of the cluster.

Three options come to mind:

  • Do nothing (just complain and flag the resources as failed in the CIB) and let the administrator resolve the mess. While simple for us, probably not the way to go in a truly automated environment...
  • Just stop the excess incarnations and continue as if nothing has happened.
  • Stop all incarnations and restart from there.

This likely ought to be another configuration parameter, with the default policy also being specified by the RA. incarnations_error_recovery_policy=(manual|auto|complete)?
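To make the three options concrete, here is a sketch that maps them onto the suggested policy values. The mapping (manual = flag only, auto = stop the excess, complete = stop everything and restart) is my reading of the names, and recovery_actions together with its action tuples is hypothetical.

```python
def recovery_actions(running, incarnations_max_global, policy):
    """Decide what to do about excess incarnations.

    `running` is a list of (incarnation_number, node) pairs as discovered.
    An incarnation counts as excess if its number was already seen on
    another node or exceeds incarnations_max_global.
    """
    seen, excess = set(), []
    for number, node in running:
        if number in seen or number > incarnations_max_global:
            excess.append((number, node))
        else:
            seen.add(number)
    if not excess:
        return []
    if policy == "manual":
        return [("flag_failed", number, node) for number, node in excess]
    if policy == "auto":
        return [("stop", number, node) for number, node in excess]
    if policy == "complete":
        stops = [("stop", number, node) for number, node in running]
        return stops + [("restart_all", None, None)]
    raise ValueError(f"unknown policy: {policy}")


found = [(1, "a"), (2, "b"), (2, "c"), (4, "c")]
print(recovery_actions(found, 3, "auto"))
```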

Complexity analysis

Actually, this seems fairly straight forward. (From the CRM/LRM point of view, because all the complexity disappears after we have done the expansion step.)

Main complexity is in the GUI and obviously some RAs which want to take full advantage of it.

Example how this would work out for the Cluster Alias IP

...

Open Issues

  • None! This is the end-all, complete and correct answer! No weasel words! ;)