Sfex (resource agent)

Basic Concept

 * SFEX is a resource that controls ownership of a shared disk.
 * SFEX uses a special partition on the shared disk and maintains the following data there:
   * "status" shows whether the disk is owned by any node.
   * "node" holds the name of the node that owns the disk.
   * "count" is used to judge whether the owner node is up or down.
 * Typically, resources that use a data partition on the shared disk (such as PostgreSQL) are placed in a resource group with SFEX.
 * Only the resources on the node that holds ownership can access the data partition.
 * A node can get ownership in two cases:
   * Case 1: Nobody has ownership.
   * Case 2: The node can judge that the current owner node is down.
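The data and the two ownership cases can be sketched roughly as follows. This is an illustration only: the `LockRecord` class and the `may_take_ownership` function are hypothetical names, and the real on-disk format used by SFEX is not described here.

```python
from dataclasses import dataclass

NO_OWNED = "NO_OWNED"  # status values as named in this document
OWNED = "OWNED"


@dataclass
class LockRecord:
    """Hypothetical in-memory view of the data SFEX keeps on its partition."""
    status: str = NO_OWNED  # whether the disk is owned by any node
    node: str = ""          # name of the owner node
    count: int = 0          # incremented by the owner to show it is alive


def may_take_ownership(record: LockRecord, owner_is_down: bool) -> bool:
    """Return True in the two cases in which a node can get ownership."""
    if record.status == NO_OWNED:
        return True          # case 1: nobody has ownership
    return owner_is_down     # case 2: the owner is judged to be down
```

How `owner_is_down` is determined is the subject of the failure scenarios later in this page.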

Start Up Process
SFEX starts only on the node that has the highest score in cib.xml, because more than one node must not access the shared disk at the same time.

Node A
 * 1) SFEX reads the data from the shared disk and gets "status". Usually "status" is "NO_OWNED" because nobody owns the shared disk yet.
 * 2) Writes data that includes node=Node A and status=OWNED.
 * 3) Reads the data again and gets "node=Node A".
 * 4) Compares it with its own node name. If the node name has not been changed, Node A gets ownership.
 * 5) SFEX increments "count" on the shared disk during the monitor processing of Heartbeat. This is how ownership is renewed.
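The start-up steps can be illustrated with a small simulation. This is a sketch under assumptions, not the SFEX implementation: `SharedDisk` is a hypothetical in-memory stand-in for the SFEX partition, and `start` and `monitor` are illustrative names.

```python
from dataclasses import dataclass, replace


@dataclass
class LockRecord:
    status: str = "NO_OWNED"
    node: str = ""
    count: int = 0


class SharedDisk:
    """Hypothetical stand-in for the SFEX partition: it holds one record."""

    def __init__(self):
        self._record = LockRecord()

    def read(self) -> LockRecord:
        return replace(self._record)   # hand out a copy, like a disk read

    def write(self, record: LockRecord) -> None:
        self._record = replace(record)  # overwrite the single writable area


def start(disk: SharedDisk, my_name: str) -> bool:
    record = disk.read()                                     # step 1
    if record.status != "NO_OWNED":
        return False  # already owned; see the failure scenarios instead
    disk.write(LockRecord("OWNED", my_name, record.count))   # step 2
    return disk.read().node == my_name                       # steps 3 and 4


def monitor(disk: SharedDisk, my_name: str) -> bool:
    record = disk.read()
    if record.status != "OWNED" or record.node != my_name:
        return False                  # this node is not the owner
    record.count += 1                 # step 5: renew ownership
    disk.write(record)
    return True
```

Calling `start(disk, "nodeA")` on a fresh `SharedDisk()` returns `True`, and each later `monitor(disk, "nodeA")` call raises "count" by one.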

Heartbeat Communication Failure
Node A
 * 1) SFEX keeps updating ownership through the Heartbeat monitor processing.

Node B
 * 1) When heartbeat communication fails, the standby node (Node B) starts its resources.
 * 2) SFEX reads the data on the shared disk.
 * 3) Waits for a while. The wait time should be longer than the sfex monitor interval. During this wait, Node B gives Node A time to make its periodic update, which confirms that Node A still maintains ownership.
 * 4) Reads the data again.
 * 5) Checks the new value of "count". If the two "count" values are different, Node B can conclude that Node A is up.
 * 6) The SFEX start-up process is stopped.

Active Node Failure
Node A
 * 1) Node A goes down because of a failure.

Node B
Node B starts up in the same way as in the heartbeat communication failure case.
 * 1) Waits for a while. Node B waits for the periodic update from Node A, but Node A does not perform it.
 * 2) SFEX reads the data again.
 * 3) Checks the new value of "count". If the two "count" values are the same, Node B can conclude that Node A is down.
 * 4) Writes data that includes node=Node B and status=OWNED.
 * 5) Reads the data again.
 * 6) Compares it with its own node name. If the node name has not been changed, Node B gets ownership.
 * 7) Afterwards, the other resources start.
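Both failure scenarios follow the same decision on the standby node: read "count", wait, read again, and compare. The sketch below illustrates that decision only. `SharedDisk` and `try_takeover` are hypothetical names, the `sleep` parameter exists just so the wait can be simulated, and keeping the old "count" value on takeover is an assumption.

```python
import time
from dataclasses import dataclass, replace


@dataclass
class LockRecord:
    status: str = "NO_OWNED"
    node: str = ""
    count: int = 0


class SharedDisk:
    """Hypothetical in-memory stand-in for the SFEX partition."""

    def __init__(self, record=None):
        self._record = record or LockRecord()

    def read(self) -> LockRecord:
        return replace(self._record)

    def write(self, record: LockRecord) -> None:
        self._record = replace(record)


def try_takeover(disk, my_name, wait_seconds, sleep=time.sleep) -> bool:
    first = disk.read()
    if first.status == "OWNED":
        sleep(wait_seconds)           # must exceed the owner's monitor interval
        second = disk.read()
        if second.count != first.count:
            return False              # count changed: the owner is up, stop here
    # count unchanged (or nobody owned the disk): write our name and verify.
    # Assumption: the existing count value is carried over unchanged.
    disk.write(LockRecord("OWNED", my_name, disk.read().count))
    return disk.read().node == my_name
```

If the wait is shorter than the owner's monitor interval, a healthy owner may not have updated "count" yet, so the standby node could wrongly conclude that the owner is down.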

Disk Access At The Same Time
This rarely happens. However, it can occur, for example, when multiple nodes start up at the same time without heartbeat communication.

Node A / Node B
Writes to the shared disk are eventually serialized because there is only one writable area. As a result, the node name that was written last remains. In this example, the name of Node B remains.
 * 1) Both nodes read the data again.
 * 2) Node A: the value of "node" has changed, so this node does not get ownership. Node B: the value of "node" is the name of Node B, so Node B gets ownership.
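The outcome of such a collision can be illustrated with a short simulation. It is a sketch only: `resolve_race` is a hypothetical helper that assumes every node saw "NO_OWNED" on its first read and then wrote its own name in the given order.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LockRecord:
    status: str = "NO_OWNED"
    node: str = ""
    count: int = 0


def resolve_race(write_order: List[str]) -> Dict[str, bool]:
    """Map each node name to whether it ends up with ownership."""
    record = LockRecord()
    # The writes are serialized: each one overwrites the previous one.
    for name in write_order:
        record = LockRecord("OWNED", name, record.count)
    # Every node reads the data again and compares it with its own name.
    return {name: record.node == name for name in write_order}
```

With `resolve_race(["nodeA", "nodeB"])` only "nodeB", the last writer, sees its own name on the read-back, which matches the example above.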

How To Initialize a SFEX Device
Create an empty partition. SFEX needs about 1 KB per node.

fdisk /dev/sdX

The "-n" parameter allows multiple shared locks to be placed on one disk.

sfex_init -n 1 /dev/sdX1
sfex_stat /dev/sdX1

The following is an example configuration for an sfex resource using the "crm configure" shell. With index=1, the first slot is used.

primitive sfex_1 ocf:heartbeat:sfex \
    params device="/dev/sdX1" index="1" collision_timeout="1" \
    lock_timeout="70" monitor_interval="10" \
    op monitor interval="10" timeout="30" on_fail="fence" \
    op start interval="0" timeout="120" \
    op stop interval="0" timeout="30"