It appears it's a good time to take a step back, outline the requirements the fencing system needs to address, and make all the implicit assumptions explicit.
The process suggested by AlanRobertson was to first list all requirements (regardless of their importance), make sure we understand them and that they are all sane, then assign priorities to them, and build an implementation from there.
In this first step, please list all requirements you have for the fencing functionality. Everybody is invited to participate and to add questions / clarifications where necessary.
So far, the list is completely unordered, so please don't take offense at that.
Rationale: The fencing subsystem must work in clusters where the CRM is not present.
Rationale: There should be a single configuration frontend for the user, to reduce complexity. Thus the fencing subsystem should draw its configuration from the same source as the CRM (if present). The configuration must support localized help texts etc. just like the other parts of the system.
This will allow reuse of the moderately complex GUI and CIB infrastructure.
Rationale: If the CRM is not around, the fencing subsystem needs to be able to draw its configuration from some other means (simple text files).
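As an illustration only, such a standalone configuration could be as simple as a key-value text file. The file format, the `stonith_device` keyword, and the parser below are all hypothetical, invented for this sketch, not an existing heartbeat format:

```python
# Hypothetical standalone fencing configuration parser (sketch).
# The format and keywords are invented for illustration only.

def parse_fencing_conf(text):
    """Parse 'key value...' lines, ignoring blanks and '#' comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        conf.setdefault(key, []).append(value.strip())
    return conf

example = """
# hypothetical: device type, connection, and the nodes it can fence
stonith_device rps10 /dev/ttyS0 node1 node2
stonith_device meatware manual node3
"""

conf = parse_fencing_conf(example)
```

The point is only that a fallback configuration source can stay trivially simple compared to the CIB, while still being mappable onto the same internal representation.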
Rationale: If the lower-level layers (membership & fencing) have already performed fencing of a failed node, the CRM needs to be told so that it can use this information correctly.
Or, to put it differently, the CRM should be capable of registering fencing needs with an external fencing authority.
Rationale: As the CRM, if present, is assumed to be the sole enforcer of cluster (recovery) policy, lower levels have to register with the CRM if they want a node to be fenced.
LarsMarowskyBree: Yes, this conflicts with the previous requirement; maybe the two should both be abolished and replaced with the requirements hidden behind them, and we should then decide whether to go top-down or bottom-up?
Rationale: As outlined on the NodeFencing page, STONITH-style fencing may occur in response to non-node failures; for example, a failed resource stop when recovering a high-priority resource. In this case, all other resources (be they CRM-controlled ones or GFS mounts) need to be informed that we are about to perform such a recovery operation (and migrate other resources away cleanly first).
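This "inform everyone before fencing" step could be modeled as subsystems registering pre-fence hooks with the fencing authority. The class and hook names below are hypothetical, a sketch of the idea rather than any existing interface:

```python
# Sketch: subsystems register pre-fence hooks so they can migrate or
# release resources before a node is shot. All names are hypothetical.

class FencingNotifier:
    def __init__(self):
        self._hooks = []

    def register(self, hook):
        """hook(node) is called before 'node' is fenced; returns ack."""
        self._hooks.append(hook)

    def about_to_fence(self, node):
        # Collect acknowledgements from every registered subsystem.
        results = [hook(node) for hook in self._hooks]
        # Only proceed with the STONITH once everyone acknowledged.
        return all(results)

notifier = FencingNotifier()
notifier.register(lambda node: True)   # e.g. CRM migrates resources away
notifier.register(lambda node: True)   # e.g. GFS releases its locks
ok = notifier.about_to_fence("node2")
```

A real implementation would of course need timeouts and a policy for subsystems that fail to acknowledge; the sketch only shows the registration/acknowledgement shape.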
Rationale: This goes beyond fencing, but it certainly seems related: in case of a regular shutdown of the node, all resources and subsystems need to be disabled in an orderly, dependency-coherent fashion, so that one by one, each of them can release the node and it doesn't have to be fenced.
In RHAT's GFS, this seems to be done by their Service Manager, and in the heartbeat-TNG world, by the CRM.
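A dependency-coherent shutdown essentially means stopping resources in reverse topological order of their dependency graph. A minimal sketch, with an invented resource graph (the resource names are purely illustrative):

```python
# Sketch: compute a stop order so that dependents go down before the
# resources they depend on. Graph and names are illustrative only.

def shutdown_order(depends_on):
    """depends_on[r] lists the resources r needs to run."""
    order, seen = [], set()

    def visit(r):
        if r in seen:
            return
        seen.add(r)
        for dep in depends_on.get(r, []):
            visit(dep)
        order.append(r)   # r is appended after everything it needs

    for r in depends_on:
        visit(r)
    return list(reversed(order))  # reversed: dependents stop first

deps = {"webserver": ["filesystem"], "filesystem": ["iscsi"], "iscsi": []}
plan = shutdown_order(deps)
```

Here the webserver stops before its filesystem, which stops before the iSCSI connection, so each layer can release the node cleanly.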
Rationale: In scenarios where other functionality is available, it would be desirable to be able to use resource-level fencing; i.e., for GFS, while mounting from an iSCSI server (which can block the fenced node's access without power-cycling it).
Rationale: The fencing subsystem must be capable of monitoring the liveness and reachability of the fencing device, taking appropriate action, and issuing notifications.
Rationale: Some network power switches or serial power switches are, by their very nature, only reachable from one node at a time. Thus all access to them, be it for configuration inquiries, monitoring, or fencing operations, needs to be coordinated to go via a single node from the set of those which can reach the device, or be otherwise serialized to avoid contention and spurious monitoring failures.
This could probably also be phrased as managing the STONITH topology.
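One way to picture the serialization requirement is to funnel every operation against the device, monitoring included, through a single lock (standing in for a single owning node). A threading sketch with hypothetical class and device names:

```python
# Sketch: serialize all operations against a device that tolerates
# only one connection at a time. Names are hypothetical.
import threading

class SerializedDevice:
    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self.log = []

    def _op(self, kind, target=None):
        with self._lock:              # one caller at a time
            self.log.append((kind, target))
            return True

    def monitor(self):
        return self._op("monitor")

    def fence(self, node):
        return self._op("fence", node)

switch = SerializedDevice("rps10")
# Concurrent monitors from several places all queue on the same lock,
# so the device never sees two simultaneous connections.
threads = [threading.Thread(target=switch.monitor) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
switch.fence("node2")
```

In a cluster, the lock would be replaced by routing requests to whichever node currently owns the device, but the serialization property is the same.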
Rationale: As there is bound to be some cluster policy manager around, it seems sensible to re-use as much functionality there as possible to reduce the complexity of the fencing subsystem and speed up the implementation work.
Components which seemed to lend themselves to re-use were the management of resource topology, monitoring functionality, and configuration.