This is the summary of the discussion on linux-ha-dev.
The goal is to make fencing as much like a regular resource as possible. This goal has been reached, except for one special action which STONITH resources need to support; see below.
STONITH requests are always initiated from the DesignatedCoordinator.
The STONITH controllers are configured in the resources section of the CIB as resources of the class stonith. All normal constraints for resource placement et cetera apply.
For sanity, a stonith-class resource may not require node fencing itself.
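The sanity rule above could be enforced by a configuration check. The following minimal Python sketch assumes the CIB resources have already been parsed into plain dictionaries with hypothetical "id", "class" and "node_fencing" keys. This is not the real CIB schema or any Heartbeat API, only an illustration of the rule.

```python
from typing import Dict, List


def check_stonith_resources(resources: List[Dict[str, str]]) -> List[str]:
    """Return the ids of stonith-class resources that illegally require node fencing.

    Each resource is a plain dict with hypothetical keys "id", "class" and
    "node_fencing"; this mirrors the CIB attributes only loosely.
    """
    violations = []
    for res in resources:
        # A STONITH controller that itself needed fencing would create a
        # circular dependency, so node_fencing="yes" is rejected for it.
        if res.get("class") == "stonith" and res.get("node_fencing") == "yes":
            violations.append(res["id"])
    return violations
```

Given one stonith resource with node_fencing="no", one with node_fencing="yes" and an ordinary OCF resource, only the second stonith resource would be reported.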
The STONITH device is controlled via a StonithAgent, which is a special resource agent running under the control of the LocalResourceManager; see LocalResourceManager/FencingOperations.
As the STONITH controller is internally a regular resource, just of a special class, the regular node placement rules apply. This limits access to the STONITH device to the nodes which can actually reach it: most likely a single node for a serial STONITH device, or a wildcard for most network power switches.
Because all requests go through the single node running the controller, we also work around the limitation that some network power switches only accept a single session at a time.
As explained on the StonithAgents page, we learn at start time which nodes a given STONITH device can control.
Because the STONITH controller, through which all further requests to a given STONITH device are gated, is a regular resource, it is also subject to monitoring. We can thus detect immediately (and not only when we want to use it) that a STONITH device has become unusable, inform the administrator, and re-allocate the STONITH controller to another node.
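Once each STONITH controller has reported which nodes its device can control, a fencing request for a given victim can be routed to a controller that covers it. The sketch below assumes this knowledge is available as a hypothetical dictionary from controller id to the node the controller runs on and the set of nodes it can fence. Skipping controllers hosted on the victim itself, and the alphabetical tie-breaking, are illustrative assumptions rather than documented behaviour.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical view of the running STONITH controllers:
# controller id -> (node the controller currently runs on,
#                   nodes its device reported it can control at start time)
Controllers = Dict[str, Tuple[str, Set[str]]]


def pick_controller(controllers: Controllers, victim: str) -> Optional[str]:
    """Return a controller able to fence `victim`, or None if there is none."""
    for ctrl_id in sorted(controllers):  # deterministic order for illustration
        host, targets = controllers[ctrl_id]
        # Assumption: a controller hosted on the victim itself is not used,
        # since the victim is presumed to have failed.
        if victim in targets and host != victim:
            return ctrl_id
    return None
```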
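The re-allocation after a failed monitor can be pictured as a small placement step. This sketch assumes a hypothetical ordered list of nodes that the placement constraints allow for the controller. In the real system the PolicyEngine makes this decision; the helper only illustrates the idea.

```python
from typing import List, Optional, Set


def relocate_controller(allowed: List[str], online: Set[str],
                        failed_host: str) -> Optional[str]:
    """Pick a new node for a STONITH controller whose monitor failed on failed_host.

    `allowed` is a hypothetical ordered list of nodes permitted by the
    placement constraints; the first online node other than the failed host
    is chosen. None means no node is available, so the administrator can
    only be alerted.
    """
    for node in allowed:
        if node in online and node != failed_host:
            return node
    return None
```

For a serial STONITH device only one node is allowed, so the helper returns None and alerting the administrator is the only option left; for a network power switch with a wildcard placement, another online node can usually take over.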
Whether or not a STONITH dependency is needed in the TransitionGraph is of course decided by the PolicyEngine via the resource parameters.
For regular resources, whether or not they need node-granularity fencing is controlled via the mandatory node_fencing="(yes|no)" attribute in the CIB.
The default for this attribute should be set by the GUI/administrator from the Resource Agent metadata for OCF agents (available in the CIB in the lrm_agent section); for safety, it should default to yes for heartbeat or lsb agents.
We need to compute the maximum set of eligible nodes for a given resource (assuming that all nodes were up right now and no other resources were running) and contrast it with the list of nodes which actually are up and healthy. Every eligible node not in that list needs killing.
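A GUI or administration tool might derive the initial node_fencing value as follows. The sketch assumes the value advertised by an OCF agent has already been extracted from the lrm_agent section into a hypothetical metadata_hint argument; how that element is named in the metadata is not specified here. Treating an OCF agent without a hint like a heartbeat or lsb agent is also an assumption.

```python
from typing import Optional


def default_node_fencing(agent_class: str, metadata_hint: Optional[str] = None) -> str:
    """Suggest the initial node_fencing value for a new resource.

    agent_class: "ocf", "heartbeat" or "lsb".
    metadata_hint: hypothetical value taken from the OCF agent metadata, if any.
    """
    if agent_class == "ocf" and metadata_hint in ("yes", "no"):
        return metadata_hint
    # heartbeat and lsb agents provide no such metadata, so pick the safe
    # default. Assumption: an OCF agent without a hint is treated the same.
    return "yes"
```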
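The computation above is a set difference. In this minimal sketch, the eligible set and the healthy set are assumed to be given as sets of node names. Nodes are only returned when the resource actually requires node-granularity fencing.

```python
from typing import Set


def nodes_to_fence(node_fencing: str, eligible: Set[str],
                   up_and_healthy: Set[str]) -> Set[str]:
    """Nodes that might still be running the resource but are not known to be healthy.

    eligible: every node the resource could run on if all nodes were up and
    no other resources were running (the maximum eligible set).
    up_and_healthy: nodes currently known to be up and healthy.
    """
    if node_fencing != "yes":
        return set()
    return eligible - up_and_healthy
```

For example, with eligible nodes node1, node2 and node3, of which only node1 and node3 are up and healthy, node2 has to be shot before the resource may be started elsewhere.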
Another scenario where a node may be STONITHed is a failed stop operation. Before we can recover the resource on another node, we must clean up by force.
Whether a failed stop operation causes the node the resource is running on to be STONITHed shall be controlled by a failstop_type=(ignore|block|stonith) attribute of either the resource or a resource depending on it.
ignore should only be used for self-fencing resources; the default must be either stonith or block for all others. As for the node_fencing attribute, the default should be retrieved from the resource agent metadata.
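Since failstop_type may be set on the resource itself or on a resource depending on it, some rule must combine these values. This sketch assumes the most severe setting wins (ignore < block < stonith). That ordering, the hypothetical metadata_default parameter and its value of stonith are illustrative choices, not part of the design above. It also does not check that ignore is only used for self-fencing resources.

```python
from typing import Iterable, Optional

# Assumption: when the resource and its dependents disagree, the most severe
# setting wins. This ordering is illustrative, not specified by the design.
SEVERITY = {"ignore": 0, "block": 1, "stonith": 2}


def effective_failstop(own: Optional[str], dependents: Iterable[Optional[str]],
                       metadata_default: str = "stonith") -> str:
    """Decide what a failed stop operation of a resource should trigger."""
    candidates = [v for v in [own, *dependents] if v is not None]
    if not candidates:
        # Nothing configured: fall back to the (hypothetical) metadata default.
        candidates = [metadata_default]
    for value in candidates:
        if value not in SEVERITY:
            raise ValueError(f"invalid failstop_type: {value!r}")
    return max(candidates, key=SEVERITY.__getitem__)
```

With this rule, a dependent resource asking for stonith overrides a resource that only asks for block, which is exactly the situation raised in the open question that follows.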
LarsMarowskyBree still wonders what happens if a lower priority resource has stonith set, fails to stop, but a higher priority resource (not depending on the first) is happily running along on that node; if we follow the wish of the lower prio resource, we affect the service level of the higher priority resource...
Yet another scenario: a STONITH-induced reboot of a failed node may cure an intermittent fault of the node and thus reduce the MeanTimeToRepair and the time we spend in a partially degraded mode. Even if no resource actively requires the node to be shot, shooting it may still be desirable for this reason.
Whether or not a potentially failed node is shot for this reason shall be controlled by a global always_stonith_failed_nodes flag; whether or not a given resource actually has to wait until this has succeeded is controlled via the other parameters discussed above.
If STONITH for a given node fails, we of course retry indefinitely, but in the meantime we block all resources which depend on it.
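The split between "shoot the node" and "which resources have to wait" could look like this. The sketch only considers node_fencing. It assumes a hypothetical mapping from each resource that may have been running on the failed node to its node_fencing value.

```python
from typing import Dict, FrozenSet, NamedTuple


class FencePlan(NamedTuple):
    shoot: bool                # issue a STONITH request for the node?
    must_wait: FrozenSet[str]  # resources held back until the node is fenced


def plan_failed_node(always_stonith_failed_nodes: bool,
                     resources_on_node: Dict[str, str]) -> FencePlan:
    """Plan the reaction to a potentially failed node.

    resources_on_node: hypothetical map of resource id -> node_fencing value
    ("yes"/"no") for resources that may have been running on the node.
    """
    must_wait = frozenset(r for r, nf in resources_on_node.items() if nf == "yes")
    # Shoot if the global flag asks for it (to shorten the MTTR) or if at
    # least one resource cannot be recovered without fencing.
    shoot = always_stonith_failed_nodes or bool(must_wait)
    return FencePlan(shoot, must_wait)
```

With the flag set and only resources that do not need fencing, the node is still shot, but nothing has to wait for it.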
A manual override needs to be possible; the admin needs to be able to manually confirm that a given node (or set of nodes) is really down, so that the cluster can proceed.
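The retry-and-block behaviour, together with the manual override, can be modelled as a small bookkeeping class. In this sketch the fence argument is a hypothetical caller-supplied function returning True on success. retry_pending performs a single round, where a real cluster would keep calling it until each node is fenced or confirmed down.

```python
from typing import Callable, Dict, Set


class FencingTracker:
    """Tracks outstanding STONITH requests and the resources they block."""

    def __init__(self, fence: Callable[[str], bool]) -> None:
        self._fence = fence  # hypothetical: attempts to shoot a node, True on success
        self.pending: Dict[str, Set[str]] = {}  # node -> resources blocked on it

    def request(self, node: str, dependents: Set[str]) -> None:
        """Queue a STONITH request and block the resources depending on it."""
        self.pending.setdefault(node, set()).update(dependents)

    def retry_pending(self) -> None:
        """One retry round; failed nodes simply stay pending for the next round."""
        for node in list(self.pending):
            if self._fence(node):
                del self.pending[node]

    def confirm_down(self, node: str) -> None:
        """Manual override: the administrator confirms the node is really down."""
        self.pending.pop(node, None)

    def is_blocked(self, resource: str) -> bool:
        return any(resource in deps for deps in self.pending.values())
```

If the fencing device keeps failing, the dependent resources stay blocked across retry rounds until confirm_down is called for the node.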