SBD Fencing

SBD expands to Storage-Based Death, and is named in reference to Novell's Cluster Services, which used SBD to exchange poison-pill messages.

The sbd daemon, combined with the external/sbd STONITH agent, provides a way to enable STONITH and fencing in clusters without external power switches, but with shared storage.

The sbd daemon runs on all nodes in the cluster, monitoring the shared storage. When it either loses access to the majority of sbd devices, or sees that another node has written a fencing request to its mailbox slot, the node will immediately fence itself.

sbd can be used in virtual environments where the hypervisor layer is not cluster-enabled, but a shared storage device between the guests is available; for other scenarios, please see DomUClusters.

While this mechanism requires minimal cooperation from the node to be fenced, the code has proven very stable over the course of several years.

Requirements

 * You must have shared storage.
 * You can use one, two, or three devices.
 * These can be connected via Fibre Channel, Fibre Channel over Ethernet, or even iSCSI.
 * Thus, an iSCSI target can serve as a kind of network-based quorum server; the advantage is that it does not require a smart host at your third location, just block storage.
 * You must dedicate a small partition of each device as the SBD device.
 * The SBD devices must not make use of host-based RAID.
 * The SBD devices must not reside on a DRBD instance.
 * Why? Because DRBD is replicated, not shared, storage. If your cluster communication breaks down and you finally do need STONITH, chances are that the DRBD replication link has broken down as well; whatever you write to your local DRBD instance then cannot reach the peer, and the peer's sbd daemon has no way to learn about the poison pill it is supposed to commit suicide upon.
 * The SBD device may of course be an iSCSI LU, which in turn may be exported from a DRBD based iSCSI target cluster.
 * An SBD device can be shared between different clusters, as long as no more than 255 nodes share the device.

How many devices should I use?
SBD supports one, two, or three devices. This affects the operation of SBD as follows:

One device
In its simplest form, you use one device only. (Older versions of SBD did not support more.) This is appropriate for clusters where all your data is on the same shared storage (with internal redundancy) anyway; the SBD device does not introduce an additional single point of failure then.

Three devices
In this most reliable configuration, SBD will only commit suicide if more than one device is lost; hence, the configuration is resilient against the outage of a single device (be it due to failure or maintenance). Fencing messages can still be relayed successfully as long as at least two devices remain reachable.

This configuration is appropriate for more complex scenarios where storage is not confined to a single array.


 * Host-based mirroring solutions could have one SBD per mirror leg (not mirrored itself), and an additional tie-breaker on iSCSI.

Two devices
This configuration is a trade-off, primarily aimed at environments where host-based mirroring is used, but no third storage device is available.

SBD will not commit suicide if it loses access to one mirror leg; this allows the cluster to continue to function even in the face of one outage.

However, SBD will not fence the other side while only one mirror leg is available, since it does not have enough knowledge to detect an asymmetric split of the storage. So it will not be able to automatically tolerate a second failure while one of the storage arrays is down. (Though you can use the appropriate crm command to acknowledge the fence manually.)
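The per-device-count behaviour above can be summarized as a decision table. The following is an illustrative sketch only (the function name and the shell encoding are mine, not SBD code): given the number of configured devices and the number still reachable, does the node self-fence?

```shell
# Illustrative only -- not actual SBD code. Encodes the suicide
# rules described above for 1-, 2-, and 3-device setups.
self_fence() {
  n_up=$1 total=$2
  case "$total" in
    1) [ "$n_up" -lt 1 ] ;;  # one device: losing it is fatal
    2) [ "$n_up" -lt 1 ] ;;  # two devices: one outage is tolerated
    3) [ "$n_up" -lt 2 ] ;;  # three devices: fence if more than one is lost
  esac
}
self_fence 1 3 && echo fence || echo ok   # fence (two of three lost)
self_fence 2 3 && echo fence || echo ok   # ok
self_fence 1 2 && echo fence || echo ok   # ok (one mirror leg down)
```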

Initialize the sbd device(s)
All these steps must be performed as root.

Decide which block device(s) will serve as the SBD device(s). This can be a logical unit, partition, or a logical volume; but it must be accessible from all nodes. Substitute the full path to this device wherever /dev/sbd is referenced below.


 * If you have selected more than one device, provide them by specifying the -d option multiple times, as in: sbd -d /dev/sda -d /dev/sdb -d /dev/sdc ...

After having made very sure that these are indeed the devices you want to use, and that they do not hold any data you need (the sbd command will overwrite them without further requests for confirmation), initialize the SBD devices:

# sbd -d /dev/sbd create
# sbd -d /dev/sbd3 -d /dev/sdc2 -d /dev/disk/by-id/foo-part1 create

This will write a header to the device, and create slots for up to 255 nodes sharing this device. You can look at what was written to the device using:

# sbd -d /dev/sbd dump
Header version     : 2
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 5
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 10

As you can see, the timeouts are also stored in the header, to ensure that all participating nodes agree on them.

Setup the software watchdog
It is most strongly suggested that you set up your Linux system to use a watchdog. Use the watchdog driver that best fits your hardware, e.g. hpwdt for HP servers. A list of available watchdog drivers can be found in /usr/src/linux/drivers/watchdog/

If no watchdog driver matches your hardware, use softdog.

You can do this by adding the line modprobe softdog to /etc/init.d/boot.local
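For example, as a boot-script fragment (softdog is only appropriate when no hardware watchdog driver matches):

```
# Appended to /etc/init.d/boot.local so the module loads at boot:
modprobe softdog
```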

Start the sbd daemon
The sbd daemon is a critical piece of the cluster stack. It must always be running while the cluster stack is up, even if the rest of the stack has crashed.

The heartbeat/openais init script starts and stops SBD if configured; add the following to /etc/sysconfig/sbd:

SBD_DEVICE="/dev/sbd"
SBD_OPTS="-W"

The -W option enables watchdog support, which you are most strongly advised to use. If you need to specify multiple devices here, use a semicolon to separate them (their order does not matter):

SBD_DEVICE="/dev/sbd;/dev/sde;/dev/sdc"
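As a sketch of the semantics (this is an illustration, not the actual init-script code), the semicolon-separated value maps onto repeated -d options like this:

```shell
# Split the semicolon-separated SBD_DEVICE value into -d options.
SBD_DEVICE="/dev/sbd;/dev/sde;/dev/sdc"
opts=""
oldIFS=$IFS; IFS=';'
for d in $SBD_DEVICE; do
  opts="$opts -d $d"
done
IFS=$oldIFS
opts="${opts# }"               # drop the leading space
echo "$opts"                   # -d /dev/sbd -d /dev/sde -d /dev/sdc
```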

If the SBD device is not accessible, the daemon will fail to start and prevent the cluster stack from coming up, too.

Testing the sbd daemon
sbd -d /dev/sbd list

Will dump the node slots, and their current messages, from the sbd device. You should see all cluster nodes listed there, most likely with their message slot reading clear.

You can now try sending a test message to one of the nodes:

sbd -d /dev/sbd message nodea test

The node will acknowledge the receipt of the message in the system logs:

Aug 29 14:10:00 nodea sbd: [13412]: info: Received command test from nodeb

Messages are considered to have been delivered successfully if they have been sent to more than half of the configured devices.
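The majority rule can be stated as a one-line predicate. This is a hypothetical helper for illustration only, not part of sbd:

```shell
# A message counts as delivered if it was written to more than
# half of the configured devices.
delivered() {
  written=$1 total=$2
  [ $((written * 2)) -gt "$total" ]
}
delivered 2 3 && echo delivered || echo failed   # delivered
delivered 1 2 && echo delivered || echo failed   # failed: not a majority of two
delivered 1 1 && echo delivered || echo failed   # delivered
```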

Configure the fencing resource
All that is required is to add a STONITH resource of type external/sbd to the CIB. Newer versions of the agent will automatically source the devices from the host's /etc/sysconfig/sbd. If this does not match your configuration, or if you are running an older version of the agent, set the sbd_device instance attribute accordingly.

Sample configuration for crm configure:

primitive stonith_sbd stonith:external/sbd

Or, if you need to specify the device name:

primitive stonith_sbd stonith:external/sbd \
        params sbd_device="/dev/sbd"

The sbd agent does not need to and should not be cloned. If all of your nodes run SBD, as is most likely, not even a monitor action provides a real benefit, since the daemon would suicide the node if there was a problem.

SBD also supports turning the reset request into a crash request, which may be helpful for debugging if you have kernel crash dumping configured; then, every fence request will cause the node to dump core. You can enable this via the crashdump="true" setting on the fencing resource. This is not recommended for ongoing production use, only for debugging phases.
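For example, the crm configure snippet might look like this (a sketch; combine it with the sbd_device parameter if your agent version requires that):

```
primitive stonith_sbd stonith:external/sbd \
        params crashdump="true"
```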

Multipathing
If your single sbd device resides on a multipath group, you may need to adjust the timeouts sbd uses, as MPIO's path down detection can cause delays. (If you have multiple devices, transient timeouts of a single device will not negatively affect SBD. However, if they all go through the same FC switches, you will still need to do this.)

After the msgwait timeout, the message is assumed to have been delivered to the node. For multipath, this should be the time required for MPIO to detect a path failure and switch to the next path. You may have to test this in your environment.

The node will perform suicide if it has not updated the watchdog timer fast enough; the watchdog timeout must be shorter than the msgwait timeout - half the value is a good rule of thumb.

You would set these values by adding -4 <msgwait> -1 <watchdog timeout> to the create command:

/usr/sbin/sbd -d /dev/sbd -4 20 -1 10 create

(All timeouts are in seconds.)
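The rule of thumb (watchdog timeout is half of msgwait) can be checked with simple shell arithmetic; the values below mirror the example above, and the device path is a placeholder:

```shell
# Derive the watchdog timeout as half the msgwait timeout.
msgwait=20
watchdog=$((msgwait / 2))
echo "sbd -d /dev/sbd -4 $msgwait -1 $watchdog create"
# prints: sbd -d /dev/sbd -4 20 -1 10 create
```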

Note: This can incur significant delays to fail-over, unfortunately.

Recovering from temporary device outages
If you have multiple devices, failure of a single device is not immediately fatal. SBD will retry ten times in succession to reattach to the device, then pause for an hour (so as not to flood the system) before retrying. Thus, SBD should automatically recover from temporary outages.

Should you wish to retry attaching to the device immediately, you can send SIGUSR1 to the SBD parent daemon.

The pause between retry rounds can be tuned via the -t option (the default of 3600 seconds corresponds to the one-hour pause above). Setting it to 0 disables automatic restarts.
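For example, assuming the option is passed through SBD_OPTS in /etc/sysconfig/sbd (a config fragment; 600 is an arbitrary example value):

```
SBD_DEVICE="/dev/sbd"
SBD_OPTS="-W -t 600"
```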

Limitations

 * Again, the sbd device must not use host-based RAID.
 * sbd is currently limited to 255 nodes per partition. If you have a need to configure larger clusters, create multiple sbd partitions, split the watch daemons across them, and configure one external/sbd STONITH resource per sbd device.

Misc
Slot allocation is automatic; when a daemon is started in watch mode, it will allocate one slot for itself if needed and then monitor it for incoming requests.

Similarly, no hostlist has to be provided to external/sbd; it retrieves this list automatically from the sbd device.

To overcome the MPIO delays, sbd should handle several paths internally, submitting the requests to all paths concurrently. However, this is not quite as trivial, as IO ordering is not guaranteed.