(I'm braindumping, structural cleanup will follow a bit later.)
Resources which can be in multiple states are of special interest to a Cluster Resource Manager, as they model important applications like replicated databases, drbd, the SAP CI Enqueue Server, failover firewalls et cetera very well. In particular, we consider resources which can be in two states, master/slave, primary/secondary, rw-master/ro-shadow et cetera.
These resources are an extension to ClusterResourceManager/MultipleIncarnationResources, and it is assumed that the reader is familiar with that page and its concepts.
As an extension to the multiple incarnations, where a resource instance can be run on several nodes, we now support running them on different nodes in one of two states.
active vs passive
active vs shadow (This is the choice of words I'll use for the rest of this text.)
master vs slave
primary vs secondary
As a restriction, we do not support resources which are in more than two states. (Mostly because I simply couldn't find any meaningful scenarios of this so far or resources which support this.)
We have a limit (see below) on the number of promoted resources in active state. From this it follows that the others are shadows.
incarnations_active_global_max - maximum number of active incarnations in the cluster ( <= incarnations_global_max )
incarnations_active_node_max - maximum number of active incarnations on a given node ( <= incarnations_node_max )
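To make the relationship between the limits concrete, here is a minimal sketch of a sanity check. The class and method names are made up for illustration; only the four parameter names come from this page and the multiple incarnations page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IncarnationLimits:
    """Hypothetical container for the four limits discussed above."""
    incarnations_global_max: int
    incarnations_node_max: int
    incarnations_active_global_max: int
    incarnations_active_node_max: int

    def problems(self) -> List[str]:
        """Return the violated constraints (empty list if the limits are sane)."""
        found = []
        if self.incarnations_active_global_max > self.incarnations_global_max:
            found.append("incarnations_active_global_max > incarnations_global_max")
        if self.incarnations_active_node_max > self.incarnations_node_max:
            found.append("incarnations_active_node_max > incarnations_node_max")
        return found

    def min_shadows(self, started: int) -> int:
        """Every started incarnation beyond the active limit must be a shadow."""
        return max(0, started - self.incarnations_active_global_max)


limits = IncarnationLimits(
    incarnations_global_max=2,
    incarnations_node_max=1,
    incarnations_active_global_max=1,
    incarnations_active_node_max=1,
)
print(limits.problems())      # no violations for this configuration
print(limits.min_shadows(2))  # with two started incarnations, one must be a shadow
```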
On start, we assume that a started incarnation comes up in shadow mode first, and that we later promote it to active, and then can demote to shadow again.
(It makes no sense for a resource to come up in active mode directly, as that could too easily violate the number of maximally active incarnations until it would be demoted; none of the resources looked at does.)
Usual idempotency rules apply: demote on a shadow shall have no effect, promote on an active shall have no effect etc.
promote may fail; an inconsistent shadow cannot become active.
demote must always succeed, like the stop operation. If necessary, all open connections/users etc. must be kicked off; this is a forced request by the CRM...
These are two new operations for the OCF RA API!
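The rules above (start comes up in shadow mode, promote and demote are idempotent, promote may fail, demote always succeeds) can be sketched as a tiny state machine. This is a toy model, not an OCF resource agent; the class, the `consistent` flag and the `clients` list are assumptions made up for the example.

```python
from enum import Enum


class State(Enum):
    STOPPED = "stopped"
    SHADOW = "shadow"
    ACTIVE = "active"


class PromoteError(Exception):
    """Raised when an incarnation cannot become active."""


class Incarnation:
    def __init__(self, consistent: bool = True) -> None:
        self.state = State.STOPPED
        self.consistent = consistent  # hypothetical: is the local data usable?
        self.clients = []             # hypothetical: open connections/users

    def start(self) -> None:
        # A started incarnation always comes up in shadow mode first.
        if self.state is State.STOPPED:
            self.state = State.SHADOW

    def promote(self) -> None:
        if self.state is State.ACTIVE:
            return  # idempotent: promote on an active has no effect
        if self.state is State.STOPPED:
            raise PromoteError("not started")
        if not self.consistent:
            raise PromoteError("inconsistent shadow cannot become active")
        self.state = State.ACTIVE

    def demote(self) -> None:
        if self.state is not State.ACTIVE:
            return  # idempotent: demote on a shadow has no effect
        self.clients.clear()  # forced request: kick off whoever is connected
        self.state = State.SHADOW


inc = Incarnation()
inc.start()
inc.promote()
inc.clients.append("some-user")
inc.demote()
print(inc.state, inc.clients)
```

Note that `demote` has no failure path at all, mirroring the requirement that it must always succeed.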
For the notify action discussed for multiple incarnation resources, we also provide target_state to inform the incarnation about the current state of the resource incarnation (prior to the operation).
Before promoting a resource incarnation to active, we first need to verify that it can be promoted. For this we once more need an extended status operation; it should tell us whether the resource is currently stopped, started/shadow, started/active, failed/something etc.
This status should also tell us:
Whether it already is active,
or whether it could become active,
and a preference for it to become active. This preference should be an arbitrary integer (and when we arbitrate which one to promote, we should pick the highest numbered one). (Assume for example the drbd case; with drbd 0.7 and the decoupling of Sync direction from the primary/secondary status, we could make either one active, but during sync, it's clearly preferable to be running on the SyncMaster if possible.)
Note: We are reporting back fairly extensive status messages per resource instance/incarnation. Maybe status should provide XML feedback on stdout so we can more readily extend this in the future?
This is just a snapshot taken at an arbitrary point in time, serving as a hint. The state may change transiently under us. A resource which couldn't become active a second ago may decide it can ten seconds later, or because of some internal failure a resource which said it could go active will reject the promote command. The only way to really find out is to try the real operation.
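As a rough illustration of the arbitration, the sketch below picks promotion candidates by highest preference while respecting the two active limits. The `StatusReport` fields and the function name are assumptions for the example; the real status output format is still open.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class StatusReport:
    """Hypothetical per-incarnation result of the extended status operation."""
    name: str
    node: str
    state: str         # "stopped", "shadow" or "active"
    can_promote: bool  # could it become active?
    preference: int    # arbitrary integer, higher wins


def pick_promotions(reports: List[StatusReport],
                    active_global_max: int,
                    active_node_max: int) -> List[StatusReport]:
    # Incarnations that are already active count against both limits.
    active_per_node = Counter(r.node for r in reports if r.state == "active")
    active_total = sum(active_per_node.values())

    candidates = sorted(
        (r for r in reports if r.state == "shadow" and r.can_promote),
        key=lambda r: r.preference,
        reverse=True,
    )
    chosen = []
    for r in candidates:
        if active_total >= active_global_max:
            break
        if active_per_node[r.node] >= active_node_max:
            continue
        chosen.append(r)
        active_per_node[r.node] += 1
        active_total += 1
    return chosen


reports = [
    StatusReport("drbd:0", "node1", "shadow", True, 100),  # e.g. the sync source
    StatusReport("drbd:1", "node2", "shadow", True, 10),
]
print([r.name for r in pick_promotions(reports, 1, 1)])
```

The result is only a plan; since the status is a hint, the promote itself may still fail and the next candidate would have to be tried.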
More complexity: For some resources (like drbd) the preference may change if both sides are up in secondary mode and connected, because only then does it know about the state of the other side. One answer is to first start all resource incarnations we can (in shadow mode), then invoke some magic wait_until_connected or whatever command, and then re-inquire about the status.
This no longer fits the model of retrieving the cluster status and computing a single transition graph; instead it chunks the transition into several iterations, each requiring a run of the PolicyEngine... This seems to be perfectly fine, just something to consider.
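The iteration described above can be sketched with a toy replica whose preference changes once it has seen its peer. Everything here is simulated and hypothetical (`FakeReplica`, `connect_all`, `plan_in_rounds`); it only shows why the status has to be asked for twice.

```python
from typing import Dict, List, Tuple


class FakeReplica:
    """Toy resource whose promotion preference changes once it sees its peer."""

    def __init__(self, name: str, alone_pref: int, connected_pref: int) -> None:
        self.name = name
        self.alone_pref = alone_pref
        self.connected_pref = connected_pref
        self.started = False
        self.connected = False

    def start(self) -> None:
        self.started = True  # comes up in shadow mode

    def preference(self) -> int:
        return self.connected_pref if self.connected else self.alone_pref


def connect_all(replicas: List[FakeReplica]) -> None:
    """Stand-in for the 'wait_until_connected or whatever' step."""
    for r in replicas:
        r.connected = True


def plan_in_rounds(replicas: List[FakeReplica]) -> Tuple[Dict[str, int], Dict[str, int], str]:
    # Round 1: start everything we can in shadow mode and take a first status.
    for r in replicas:
        r.start()
    before = {r.name: r.preference() for r in replicas}
    # Round 2: wait for the replicas to connect, then re-inquire.
    connect_all(replicas)
    after = {r.name: r.preference() for r in replicas}
    # Only the second snapshot is used to pick the promotion candidate.
    winner = max(after, key=after.get)
    return before, after, winner


replicas = [FakeReplica("drbd:0", 50, 10), FakeReplica("drbd:1", 50, 100)]
before, after, winner = plan_in_rounds(replicas)
print(before, after, winner)
```

In the first snapshot both sides look equally good; only after connecting does one of them turn out to be the better choice.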
We introduce a special flag to depend on the state of a resource in the rsc_to_rsc constraint, so the admin can easily depend on, for example, drbd being in active mode.
I suggest we expand these automatically according to the multi-incarnation rules too, ie one dependency / internal object for each active and shadow dependency.
Again, this could be a place for a plug-in architecture which takes care of the expansion, but here with the added complexity of the demotion/promotion, so maybe ignore that for now ;)
Superset of error recovery for multiple incarnations
Again, we have several options:
Demote excess resource incarnations (the ones with the lowest preference?)
Open: How to handle stacked multistate resources in this error case
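If we go with demoting the excess incarnations with the lowest preference (still an open question, see above), the selection could look roughly like this. The function name and the (name, preference) tuples are assumptions for the sketch.

```python
from typing import List, Tuple


def pick_demotions(active: List[Tuple[str, int]],
                   active_global_max: int) -> List[Tuple[str, int]]:
    """Return the active incarnations to demote, lowest preference first.

    `active` is a list of (name, preference) pairs for incarnations that
    currently report themselves as active.
    """
    excess = len(active) - active_global_max
    if excess <= 0:
        return []
    return sorted(active, key=lambda item: item[1])[:excess]


# Three incarnations claim to be active although only one is allowed.
too_many = [("rsc:0", 100), ("rsc:1", 10), ("rsc:2", 50)]
print(pick_demotions(too_many, 1))
```

Since demote must always succeed, this recovery step itself cannot get stuck; what happens to resources stacked on top of a demoted incarnation is the open part.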
How do we handle internal split-brain scenarios, where one or more masters or slaves are unable to talk to the other(s), but we (by virtue of our redundant cluster communication media) are still able to talk to them all? How is that error reported to us, and what's the recovery strategy? Do we simply stop and invalidate/disconnect the slaves (or lower priority incarnations) and continue running in degraded mode etc?
Some discussion between lge, AndrewBeekhof and LarsMarowskyBree suggests that this might be handled partially by the post notify after a monitor failure. The other part is provided by "no news is good news", i.e. as long as we don't deliver that monitor or stop notification, an incarnation has to assume that the other side is healthy and up, and thus that it is experiencing a split-brain if it can't talk to the other side itself. This would allow the instances/incarnations to detect a split-brain and then respond to it in a resource-specific manner.
(One example of how this might be handled in the resource agent is in the proof-of-concept code for the drbd agent.)
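The "no news is good news" reasoning boils down to a very small decision. The sketch below is not taken from the drbd agent; the function and its two inputs are hypothetical and only restate the rule from the discussion.

```python
def peer_view(can_reach_peer: bool, notified_peer_down: bool) -> str:
    """Classify the peer from the point of view of one incarnation.

    can_reach_peer:     the resource's own replication link works
    notified_peer_down: the CRM delivered a monitor-failure or stop
                        notification for the peer
    """
    if can_reach_peer:
        return "connected"
    if notified_peer_down:
        # The CRM told us the peer is gone; losing the link is expected.
        return "peer-down"
    # No news is good news: the peer must be assumed healthy and up,
    # so losing the link means an internal split-brain.
    return "split-brain"


print(peer_view(can_reach_peer=False, notified_peer_down=False))
print(peer_view(can_reach_peer=False, notified_peer_down=True))
```

What the incarnation then does about a detected split-brain stays resource-specific.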
demote and promote are fairly straightforward; the complexity is in evaluating which incarnations can or cannot become active etc.
Do we really need to make the names of the two states configurable from the metadata? I could see the point for the GUI (where this would automatically follow from the shortdesc, longdesc anyway), but internally, it would be quite cool if we could always use one neutral naming consistently.
AlanRobertson asked for named states, but, is that wanted (at this time), or is that maybe overkill? Extensibility is good, but more than two states will require sufficiently more complexity (ie, the transition matrix and other nasty details), that maybe we can get away with ignoring that for now and just have two states, which is really what we care about? Early abstraction and all that... (Alan doesn't remember this, so he must not have been too excited about it)
In particular, with clones 0 and 1, if clone 0 goes away, I do not want clone 1 renumbered as clone 0. This comes up because each clone might have its own storage that it is managing for replication. The only way the resource can know which storage it is supposed to use would be for it to correlate the clone number with the storage it owns.
A more concrete example:
Imagine FC storage replicating from one FC device to another, but both devices are actually accessible from every node in the cluster. So, if a node fails, another node which takes over its clone number must then be able to tell which storage it is to "own" by using the clone number. As an example, maybe the parameters are clone0_storage=/dev/sda1 and clone1_storage=/dev/sdb1, so if you came up as clone 1, you should use clone1 storage, and if you came up as clone 0, you should use clone0 storage. And, if clone0 goes away, you do not want to renumber clone1 as clone0, because it would imply a change in storage affinity. Using clone numbers is much better than using node names, because this would allow the replication to continue on and be up to date even if both the master and the slave had to run on the same node for some period of time. This seems like a very nice thing to be able to do, but it won't work without some persistent affinity between something and "which copy am I?". Right now, all I know of which comes close to doing this is the clone number.
For a picture, it might look like this:
     site1           site2
NODEA------+      +----- NODEC
           +------+
NODEB------+      +----- NODED
   FC storage 1------ FC storage 2
Now, if all nodes can (in theory) access both FC boxes, you would normally want to run clone0 on NODEA or NODEB (i.e., site1), and clone1 on NODEC or NODED (i.e., site2). But, if the FC is accessible, it would be permissible to run any of them from anywhere. If the replication software allowed it, it would even be possible to run them both from any single remaining node (of course, quorum is a separate issue).
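The storage affinity argument can be sketched as follows: the clone number, not the node name, selects the storage, and a failover moves a clone to another node without renumbering it. The parameter names clone0_storage and clone1_storage are the example values from above; the helper functions are made up for illustration.

```python
from typing import Dict

# Example parameters from the text above.
PARAMS = {"clone0_storage": "/dev/sda1", "clone1_storage": "/dev/sdb1"}


def storage_for_clone(clone_number: int, params: Dict[str, str]) -> str:
    """Look up the storage a clone owns, purely by its clone number."""
    key = f"clone{clone_number}_storage"
    if key not in params:
        raise ValueError(f"no storage configured for clone {clone_number}")
    return params[key]


def fail_over(placement: Dict[int, str], failed_node: str, spare_node: str) -> Dict[int, str]:
    """Move clones off a failed node; clone numbers are never renumbered."""
    return {
        clone: (spare_node if node == failed_node else node)
        for clone, node in placement.items()
    }


placement = {0: "NODEA", 1: "NODEC"}               # clone number -> node
after = fail_over(placement, "NODEA", "NODEC")     # both clones now on NODEC
for clone, node in sorted(after.items()):
    print(clone, node, storage_for_clone(clone, PARAMS))
```

Even with both clones on the same node, each one still finds its own device, which is exactly what renumbering clone 1 as clone 0 would break.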