Cluster Resource Manager (CRM)

The ClusterResourceManager consists of the following components:

The ClusterResourceManager uses IPC to send messages to its subsystems and HeartbeatMessages[15] for communication with the ClusterResourceManagerDaemon[6] or DesignatedCoordinator[8] on other ClusterNodes[4].

See also

ClusterResourceManager/Setup[16], ClusterResourceManager/BugReports[17], ClusterResourceManager/Related[18], ClusterInformationBase/UserGuide[19]


Cluster Resource Manager

DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT! DRAFT!

Title: Design of a Cluster Resource Manager
Revision: $Id: crm.txt,v 1.2 2003/12/01 14:10:14 lars Exp $
Author: Lars Marowsky-Brée <lmb@suse.de>
Acknowledgements:
  Andrew Beekhof <andrew@beekhof.net>
  Luis Claudio R. Goncalves <lclaudio@conectiva.com.br>
  Fábio Olivé Leite <olive@conectiva.com.br>
  Alan Robertson <alanr@unix.sh>


Global TODO for this document

In this section I keep track of tasks which I still want to perform on this document; probably of little interest to anyone else, and this section should be gone in the final version ;-)


Abstract


Introduction

Requirements overview

The CRM needs to provide the following functionality:

Scenario description

Basic algorithm

  a) Every partition elects a "Designated Coordinator" (DC); this node will activate special logic to coordinate the recovery and administrative actions on all nodes in the cluster.
     a1) The DC has the full state of the cluster available (or is able to retrieve it), as well as an up-to-date copy of the administrative policies, information about fenced nodes, etc.; this shall further be referred to as the "Cluster Information Base" (CIB).
  b) Whenever a cluster event occurs, be it an administrative request, a node failure reported by membership services or a resource failure reported by a participating LRM, it is forwarded to the DC.
     b1) For administrative requests, the DC arbitrates whether or not they will be accepted into the CIB and serializes these updates; i.e., policy changes which cannot be satisfied or would lead to an inconsistent state of the cluster will be rejected (unless explicitly overridden).
  c) It then computes via the Policy Engine:
     c1) The new CIB.
     c2) The Transition Graph, a dependency-ordered graph of the actions necessary to go from the current cluster state as close as possible to the cluster state described by the CIB.
  d) Leading the transition to the target state:
     d1) Replicating the new CIB to all clients.
     d2) Executing each step of the transition graph in dependency order. (Potentially parallelized.)
  e) Exception handling if any event or failure occurs:
     e1) The algorithm is aborted cleanly; pending operations are allowed to finish, but no new commands are issued to clients (in particular during phase d2).
     e2) The algorithm is invoked again from scratch. (There is obvious room for optimization here by recomputing only smaller parts of the dependency tree or by not broadcasting the full CIB every time, in particular if the DC has not been re-elected. However, these optimizations complicate the implementation and are not necessary for the first phase.)
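
To make the execution and exception-handling steps more concrete, here is a minimal sketch in Python of walking a transition graph in dependency order and aborting cleanly when a new event arrives. It is an illustration only, not the CRM implementation: the function name run_transition, the callbacks execute and event_pending, and the action names in the usage note are all hypothetical. Actions run sequentially here; every action returned in one batch has all its dependencies satisfied and could in principle be dispatched in parallel.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_transition(graph, execute, event_pending=lambda: False):
    """Run the actions of a transition graph in dependency order.

    graph maps each action to the set of actions it depends on.
    execute(action) performs one action (hypothetical callback).
    event_pending() reports whether a new cluster event has arrived
    (hypothetical callback); if so, no further actions are issued.

    Returns (completed_actions, aborted).
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are cyclic
    completed = []
    while sorter.is_active():
        if event_pending():
            # Abort cleanly: what already ran stays done, nothing new starts.
            return completed, True
        # Every action in this batch has all of its dependencies satisfied.
        for action in sorted(sorter.get_ready()):
            execute(action)
            completed.append(action)
            sorter.done(action)
    return completed, False
```

For example, with the hypothetical graph {"start_ip": {"fence_node_b"}, "start_fs": {"fence_node_b"}, "start_db": {"start_ip", "start_fs"}}, the fencing action runs first, then the two independent starts, then the database start. On an abort, the caller would recompute the graph and call the function again.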

Feature analysis

Stability of the algorithm

Components

Local Resource Manager

Note: This section only documents the requirements on the LRM from the point of view of the cluster-wide resource management. For a more detailed explanation, see the documentation by lclaudio.

This component knows which resources the node currently holds, their status (running, running/failed, stopping, stopped, stopped/failed, etc) and can provide this information to the CRM. It can start, restart and stop resources on demand. It can provide the CRM with a list of supported resource types.

It will initiate any recovery action which is applicable and limited to the local node, and escalate all other events to the CRM.

Any correct node in the cluster is running this service. Any node which fails to run this service will be evicted and fenced.

It has access to the CIB for the resource parameters when starting a resource.

NOTE: The monitoring operation may only succeed if it is invoked with the same instance parameters the resource was started with. Because the CIB might change at runtime, the LRM needs to keep its own copy of that data.
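
The note above can be illustrated with a small Python sketch. It is not the LRM's actual interface: the class name, the method names and the dictionary-shaped CIB are assumptions made for illustration only. The point is that the parameters are copied at start time, so a later CIB change does not affect how the running instance is monitored.

```python
import copy


class LocalResourceManager:
    """Hypothetical sketch: remembers the parameters each resource was started with."""

    def __init__(self, cib):
        # cib maps a resource id to its instance parameters; it is shared
        # with the rest of the cluster and may change at runtime.
        self.cib = cib
        self.started_with = {}

    def start(self, resource_id):
        # Take a private snapshot so later CIB edits cannot alter it.
        params = copy.deepcopy(self.cib[resource_id])
        self.started_with[resource_id] = params
        return params

    def monitor_params(self, resource_id):
        # Monitoring uses the snapshot, not the current CIB contents.
        return self.started_with[resource_id]

    def stop(self, resource_id):
        # The snapshot is no longer needed once the resource is stopped.
        self.started_with.pop(resource_id, None)
```

A restart would go through stop and start again and would therefore pick up whatever parameters the CIB holds at that time.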

Cluster Resource Manager

The CRM coordinates all non-local interactions in the cluster. It interacts with:

Only one node is running the "Designated Coordinator" CRM at any given time in any given partition. All other nodes forward their input to this node only, and will relay its decisions to the local LRM.

The coordinator is a "primus inter pares"; in theory, any CRM can act in this fashion, but the arbitration algorithm will distinguish a designated node.

How to deal with failure of the designated CRM

Election algorithm for the DC

Consistency audits

Communication between the CRM and LRM

Executing the Transition Graph

Cluster Information Base

The CIB is also running on every node in the cluster. In essence, it provides a distributed database with weak transactional semantics, exploiting the fact that all updates are serialized by the DC and that each node itself knows its own latest status.
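
One way to picture how serialization through the DC gives weak transactional semantics is the following Python sketch. It is purely illustrative and assumes details this document does not specify: a single integer version number, dictionary-shaped configuration, and the class and method names are all hypothetical. Each replica accepts an update only if it is the direct successor of the version it already holds, so every replica that applies the updates applies them in the same order.

```python
class CibReplica:
    """Hypothetical per-node copy of the CIB."""

    def __init__(self):
        self.version = 0
        self.config = {}

    def apply(self, version, changes):
        # Accept only the direct successor of the current version; anything
        # else is stale or indicates a missed update and is refused.
        if version != self.version + 1:
            return False
        self.config.update(changes)
        self.version = version
        return True


class DesignatedCoordinator:
    """Hypothetical DC role: the single point that orders CIB updates."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.version = 0

    def submit(self, changes):
        # Assign the next version number, then replicate to every node.
        self.version += 1
        for replica in self.replicas:
            replica.apply(self.version, changes)
        return self.version
```

A replica that refuses an update has missed an earlier one; in this sketch it would need a full copy of the CIB from the DC to catch up, which matches the unoptimized "broadcast the full CIB" approach of the basic algorithm.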

Contents of the CIB

Process of generating an uptodate CIB

How are updates to the CIB handled?

Policy Engine

Functionality provided

Required constraints

Future extensions

Thoughts about algorithm implementation

...

Design considerations

Whether to deal with resource groups or "only" resources and dependencies?

Why a transition graph

Who orders resource operations on a single node

Executioner

Integration

Fencing algorithm

Un-fencing

After a successful fencing

Rejoining node while fencing requests are still pending

Rejoining node for which fencing has ultimately failed in the past

Interaction with quorum

Summary: CRM does not need quorum. However, the CRM could easily compute "quorum" as just another resource.

"Quorum" is in fact not necessary for this design. It is implicit in the policy engine / CRM, which will only bring up resources for which all dependencies - including fencing - have been satisfied.

This in fact is quorum with slightly finer granularity. It allows the cluster to proceed in a scenario like:

Of course, as soon as a global resource spanning {a,b,c,d} is added, this in fact translates to "global quorum".

This makes me think that if global quorum is in fact required, it can best be expressed in this design by mapping it to such a global ('configured on all nodes') resource. The partition is then told that it has quorum if it was able to recover this resource, and that it does not if it failed to recover it.

However, it also allows for "sub-quorum"; i.e., given the example of an application requiring quorum to operate, it will usually only be interested in quorum of the nodes eligible for the related resources. So the quorum reported could potentially differ from one client to another.
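
As a rough illustration of quorum evaluated per resource rather than per cluster, the Python sketch below decides quorum for one resource from the nodes eligible to run it. The strict-majority rule and the function name has_quorum are assumptions chosen for the example; this document does not prescribe how such a quorum would be computed.

```python
def has_quorum(partition, eligible):
    """Return True if the partition holds a strict majority of the nodes
    eligible for a resource (assumed rule, for illustration only)."""
    eligible = set(eligible)
    present = eligible & set(partition)
    return 2 * len(present) > len(eligible)
```

With a partition of nodes {a, b}, a resource eligible on {a, b, c} would be reported as having quorum, while a global resource spanning {a, b, c, d} would not, so two clients in the same partition can see different answers.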

Issues wrt quorum

Integration with other projects

Integration with heartbeat

Integration with CCM

Integration with Group Services

Integration with non-heartbeat clusters

Integration with cluster-aware applications

Relation to OCF

Monitoring

Integration with health monitoring

Monitoring the CRM


Attention: Here be dragons. Anything following these lines is unordered
thoughts which haven't yet been incorporated into the grand scheme of things.

X. ...


References

[1]http://www.linux-ha.org/ClusterInformationBase
[2]http://www.linux-ha.org/StateInformation
[3]http://www.linux-ha.org/resource
[4]http://www.linux-ha.org/ClusterNodes
[5]http://www.linux-ha.org/Constraints
[6]http://www.linux-ha.org/ClusterResourceManagerDaemon
[7]http://www.linux-ha.org/ClusterNode
[8]http://www.linux-ha.org/DesignatedCoordinator
[9]http://www.linux-ha.org/FullyConnected
[10]http://www.linux-ha.org/ConsensusClusterMembership
[11]http://www.linux-ha.org/PolicyEngine
[12]http://www.linux-ha.org/NextState
[13]http://www.linux-ha.org/Transitioner
[14]http://www.linux-ha.org/LocalResourceManager
[15]http://www.linux-ha.org/HeartbeatMessages
[16]http://www.linux-ha.org/ClusterResourceManager/Setup
[17]http://www.linux-ha.org/ClusterResourceManager/BugReports
[18]http://www.linux-ha.org/ClusterResourceManager/Related
[19]http://www.linux-ha.org/ClusterInformationBase/UserGuide
[20]http://www.linux-ha.org/FailSafe
[21]http://www.cs.washington.edu/research/constraints/cassowary/
[22]http://www.linux-ha.org/BasicArchitecture#PILS


This information provided courtesy of the Linux-HA project at http://linux-ha.org/