A score is a preference to run a resource on a node. It is an attribute of a node-resource tuple.
The script "showscores.sh" (shipped with pacemaker >= 0.6.0; see the contrib directory in the source tree) displays the current scores for you. Note that the output is not real-time; treat it as a snapshot of a single point in time.
You can get a recent version of the script here:
http://www.gossamer-threads.com/lists/engine?do=post_attachment;postatt_id=2763;list=linuxha
Pacemaker Repository Version, shipping with the current/next Pacemaker release
How these scores are calculated is a frequent question on the Mailing Lists. I'll try to explain the basics.
If a node's score changes and becomes higher than the current highest score, or a new node enters the cluster and gets a higher score for the resource, the resource might be migrated (if it is a single independent resource, it will be migrated).
If a node's score changes and becomes lower than the current highest score, the resource might be migrated (if it is a single independent resource, it will be migrated).
resource_stickiness should be >= 0 (negative values are allowed, but do not make any sense)
resource_failure_stickiness should be <= 0 (positive values are allowed, but do not make any sense)
<nvpair id="default-resource-stickiness" name="default-resource-stickiness" value="100"/>
<nvpair id="default-resource-failure-stickiness" name="default-resource-failure-stickiness" value="-100"/>
<meta_attributes id="ma-webserver">
<attributes>
<nvpair name="resource_stickiness" id="ma-1" value="100"/>
<nvpair name="resource_failure_stickiness" id="ma-2" value="-100"/>
</attributes>
</meta_attributes>
If a resource fails on a node, the value of resource_failure_stickiness times the number of failures will be added to the node's score. Therefore, as described in the FAQ, resource_failure_stickiness should be a negative value.
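The rule above can be written down as a small calculation. This is only a sketch of the arithmetic, not Pacemaker code; the function name `failure_penalty` and its parameters are hypothetical and simply mirror the sentence above.

```python
def failure_penalty(failcount: int, resource_failure_stickiness: int) -> int:
    """Amount added to a node's score after `failcount` failures on that node."""
    return failcount * resource_failure_stickiness


# With a negative resource_failure_stickiness, every failure lowers the score.
for failures in range(4):
    print(failures, failure_penalty(failures, -100))
```

A positive value would raise the score with every failure, which is why such values are allowed but do not make sense.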
Heartbeat < 2.1.4 and pacemaker < 0.6.2: If the cluster believes a resource is running, but the monitor operation reports "not running" (return $OCF_NOT_RUNNING), resource_stickiness will no longer be applied. The decision where to restart the resource is then made without the resource_stickiness bonus.
By contrast, if the cluster believes a resource is running and the monitor operation returns one of the error codes, then the stickiness will still apply.
Before reading about scores in groups, make sure you understand ClusterInformationBase/ResourceGroups.
If you look at the scores for resources within a colocated group (colocated is the default for groups, meaning all resources shall run on the same node), you will find only the first resource of the group to have the score you configured. All subsequent resources will have a score of INFINITY for the node the first resource was started on and -INFINITY for any other node(s).
When you use groups or other things with mandatory colocation constraints, it gets more complicated. Everything you force to be colocated on the same machine will be treated just as though they were in a group together. So, keep that in mind in the explanation below.
The resource-stickiness of a group is the sum of the resource_stickiness values of all resources within that group.
<group id="group-webserver">
<primitive id="htdocs-mount" ...
<meta_attributes id="ma-htdocs-mount">
<attributes>
<nvpair id="ma-htdocs-mount-1" name="resource_stickiness" value="200"/>
</attributes>
</meta_attributes>
...
</primitive>
<primitive id="webserver-ip" ...
<meta_attributes id="ma-webserver-ip">
<attributes>
<nvpair id="ma-webserver-ip-1" name="resource_stickiness" value="300"/>
</attributes>
</meta_attributes>
...
</primitive>
<primitive id="webserver" ...
<meta_attributes id="ma-webserver">
<attributes>
<nvpair id="ma-webserver-1" name="resource_stickiness" value="400"/>
</attributes>
</meta_attributes>
...
</primitive>
</group>
I don't think this example makes much sense; at least I cannot think of a "why" for this right now. All I want here is to show you how it works. So in the example above we have resource_stickiness values of 200, 300 and 400, which sum up to a group resource_stickiness of 900.
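The summation can be sketched in a few lines. The helper name `group_stickiness_from_members` is hypothetical, and the dictionary simply repeats the three resource ids and values from the example group.

```python
from typing import Dict


def group_stickiness_from_members(member_stickiness: Dict[str, int]) -> int:
    """Sum the per-resource resource_stickiness values of a group."""
    return sum(member_stickiness.values())


members = {"htdocs-mount": 200, "webserver-ip": 300, "webserver": 400}
print(group_stickiness_from_members(members))
```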
In most cases, you will not set different resource_stickiness values for each resource in a group. So the statement above can be simplified to: The resource_stickiness of a group is the group's resource_stickiness times the number of resources.
<group id="group-webserver">
<meta_attributes id="ma-group-webserver">
<attributes>
<nvpair id="ma-group-webserver-1" name="resource_stickiness" value="100"/>
</attributes>
</meta_attributes>
<primitive ...
<primitive ...
<primitive ...
</group>
In this example, you have 3 primitives and a resource_stickiness of 100. So the entire group's resource_stickiness will be 100 * 3 = 300.
In a lot of cases, maybe not most, you will not even set a resource_stickiness value for the group, but use default-resource-stickiness instead. So the resource_stickiness of the group will be default-resource-stickiness times the number of resources within that group.
<crm_config>
<cluster_property_set id="cib-bootstrap-options">
<attributes>
...
<nvpair name="default-resource-stickiness" value="100" id="default-resource-stickiness"/>
...
</attributes>
</cluster_property_set>
</crm_config>
<resources>
<group id="group-webserver">
<primitive ...
<primitive ...
<primitive ...
</group>
</resources>
This will give the same result as the previous example. 100 * 3 = 300.
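Both simplified cases (a value set on the group, or only default-resource-stickiness) come down to the same multiplication. The sketch below assumes that a value set on the group takes precedence over the cluster default; the function name `uniform_group_stickiness` and its parameters are hypothetical.

```python
from typing import Optional


def uniform_group_stickiness(num_resources: int,
                             group_value: Optional[int],
                             default_value: int) -> int:
    """Stickiness of a group whose members all share one stickiness value.

    Assumption: a value set on the group wins over the cluster-wide default.
    """
    per_resource = group_value if group_value is not None else default_value
    return per_resource * num_resources


# Group value of 100, three primitives.
print(uniform_group_stickiness(3, 100, 0))
# No group value, default-resource-stickiness of 100, three primitives.
print(uniform_group_stickiness(3, None, 100))
```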
The resource-failure-stickiness value for a group is computed in a similar way.
A master/slave resource is basically a clone (see v2/Concepts/Clones and v2/Concepts/MultiState).
<master_slave id="ms-drbd1" ...
...
<primitive id="drbd1" ...
...
</master_slave>
Note that the master_slave has a different id than the encapsulated primitive. As the primitive is cloned, you will have multiple instances of it running with names like "drbd1:0" and "drbd1:1". Keep that in mind, as it makes a difference whether you set a score for the master_slave id, for the primitive id, or for the primitive id's clone instances. We'll get to that later.
Master_slave resources do not only have a score like every other resource, they have an additional score for the master state. Let's refer to that as the "master score". The node with the highest master score gets to run the resource in master state. Every other instance will default to slave state (crm_mon sometimes shows "started" instead of "slave", but you should not worry about that).
By default, this master score is 0, just like any other score. If you do not set it, the cluster will choose where to promote the resource to master state.
You can set a master score like this:
<rsc_location id="loc:ms-drbd1_likes_ace" rsc="ms-drbd1">
<rule id="rule:ms-drbd1_likes_ace" role="master" score="100">
<expression attribute="#uname" operation="eq" value="ace"/>
</rule>
</rsc_location>
With this constraint, the master_slave resource "ms-drbd1" will be promoted on node "ace" after startup.
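As a rough illustration of "the node with the highest master score gets to run the resource in master state", the following sketch picks that node from a dictionary. The function name `node_to_promote` is hypothetical, and ties are deliberately not modeled, since in that case the cluster chooses by itself.

```python
from typing import Dict


def node_to_promote(master_scores: Dict[str, int]) -> str:
    """Return the node with the highest master score (ties are not modeled)."""
    return max(master_scores, key=lambda node: master_scores[node])


# Default master score is 0 everywhere; the location rule adds 100 on "ace".
master_scores = {"ace": 0, "king": 0}
master_scores["ace"] += 100
print(node_to_promote(master_scores))
```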
Now that we know what a master score is, we can get to the difference about setting a score for the master_slave id, the primitive id and the primitive id's clone instances:
If you set a score for the master_slave id, that value will increase both the master score and the "normal" score for the clone instances:
<rsc_location id="rscloc-ms-drbd1" rsc="ms-drbd1">
<rule id="rscloc-ms-drbd1-rule1" score="100">
<expression id="rscloc-ms-drbd1-rule1-expr" attribute="#uname" operation="eq" value="ace"/>
</rule>
</rsc_location>
This is the preferred way of setting a score for a master_slave resource.
If you set a score for the primitive id, this has no effect whatsoever:
<rsc_location id="rscloc-drbd1" rsc="drbd1">
<rule id="rscloc-drbd1-rule1" score="100">
<expression id="rscloc-drbd1-rule1-expr" attribute="#uname" operation="eq" value="ace"/>
</rule>
</rsc_location>
It might read weird that this does not affect any score, but if you think about it - there is no resource with that name in the cluster. The actual resources the cluster manages are "primitive_id:0" and "primitive_id:1".
If you set a score for the primitive id's clone instances (like drbd1:0, drbd1:1), that will only increase the "normal" score:
<rsc_location id="rscloc-drbd1:0" rsc="drbd1:0">
<rule id="rscloc-drbd1:0-rule1" score="100">
<expression id="rscloc-drbd1:0-rule1-expr" attribute="#uname" operation="eq" value="ace"/>
</rule>
</rsc_location>
<rsc_location id="rscloc-drbd1:1" rsc="drbd1:1">
<rule id="rscloc-drbd1:1-rule1" score="100">
<expression id="rscloc-drbd1:1-rule1-expr" attribute="#uname" operation="eq" value="ace"/>
</rule>
</rsc_location>
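The three cases can be summarized as a small lookup. This is only a restatement of the behaviour described above, not a model of the policy engine; the function name `score_effect` and the kind labels are hypothetical.

```python
from typing import Tuple


def score_effect(target_kind: str, score: int) -> Tuple[int, int]:
    """Return (normal_score_delta, master_score_delta) for a location score.

    target_kind is one of "master_slave", "primitive" or "clone_instance".
    """
    if target_kind == "master_slave":
        return (score, score)   # both scores are increased
    if target_kind == "clone_instance":
        return (score, 0)       # only the "normal" score is increased
    if target_kind == "primitive":
        return (0, 0)           # no resource with that name is managed
    raise ValueError("unknown target kind: " + target_kind)


for kind in ("master_slave", "primitive", "clone_instance"):
    print(kind, score_effect(kind, 100))
```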
A master_slave resource as such does not receive any resource_stickiness bonus after having started successfully. I don't know why that is, but it is (at least as of today, with heartbeat 2.1.3 + pacemaker 0.6.2).
Things change when you colocate a resource to the master state, which is a very common setup:
<rsc_order id="ms-drbd0_before_fs0" from="fs0" action="start" to="ms-drbd0" to_action="promote"/>
<rsc_colocation id="fs0_on_ms-drbd0" to="ms-drbd0" to_role="master" from="fs0" score="infinity"/>
Now, both the "normal" score and the master score will be increased by the resource_stickiness value of the colocated resource "fs0". If that colocated resource is a group, remember how resource_stickiness works in groups!
The same applies for resource_failure_stickiness. If the colocated resource fails, the master score and the "normal" score of the master_slave resource will be affected by resource_failure_stickiness. This way, you can have a resource fail over together with the master state.
Say we have the following setup:
<constraints>
<rsc_location id="rscloc-webserver" rsc="webserver">
<rule id="rscloc-webserver-rule-1" score="200">
<expression id="rscloc-webserver-expr-1" attribute="#uname" operation="eq" value="ace"/>
</rule>
<rule id="rscloc-webserver-rule-2" score="150">
<expression id="rscloc-webserver-expr-2" attribute="#uname" operation="eq" value="king"/>
</rule>
<rule id="rscloc-webserver-rule-3" score="-100">
<expression id="rscloc-webserver-expr-3" attribute="#uname" operation="eq" value="queen"/>
</rule>
</rsc_location>
</constraints>
So, without any stickiness values assigned, the scores will look like:
Node   Resource   Score
ace    webserver  200
king   webserver  150
queen  webserver  -100
This will make the webserver start on ace.
Now we add values for default-resource-stickiness and default-resource-failure-stickiness. Notice the "-" in the nvpair names!
<crm_config>
<cluster_property_set id="cib-bootstrap-options">
<attributes>
<nvpair name="default-resource-stickiness" id="default-resource-stickiness" value="100"/>
<nvpair name="default-resource-failure-stickiness" id="default-resource-failure-stickiness" value="-100"/>
</attributes>
</cluster_property_set>
</crm_config>
We could also add per-resource values. Notice the "_" in the nvpair names!
<resources>
<primitive class="ocf" provider="heartbeat" type="apache" id="webserver">
<meta_attributes id="ma-webserver">
<attributes>
<nvpair name="target_role" id="ma-webserver-1" value="started"/>
<nvpair name="resource_stickiness" id="ma-webserver-2" value="100"/>
<nvpair name="resource_failure_stickiness" id="ma-webserver-3" value="-100"/>
</attributes>
</meta_attributes>
<instance_attributes id="ia-webserver">
<attributes>
<nvpair id="ia-webserver-1" name="configfile" value="/usr/local/apache/conf/httpd.conf"/>
<nvpair id="ia-webserver-2" name="httpd" value="/usr/local/apache/bin/httpd"/>
<nvpair id="ia-webserver-3" name="port" value="80"/>
</attributes>
</instance_attributes>
<operations>
<op id="op-webserver-1" name="monitor" interval="30s" timeout="30s" start_delay="30s"/>
</operations>
</primitive>
</resources>
So assuming the webserver started successfully on ace, the scores will be:
ace = (constraint-score) + (resource_stickiness) + (failcount * (resource_failure_stickiness) )
ace = 200 + 100 + (0 * (-100) )
Node   Resource   Score
ace    webserver  300
king   webserver  150 (unchanged)
queen  webserver  -100 (unchanged)
Now, let's say the webserver's monitor operation reports an error (read: error; this is different from "not running"). Then the new scores will be:
ace = (constraint-score) + (resource_stickiness) + (failcount * (resource_failure_stickiness) )
ace = 200 + 100 + (1 * (-100) )
Node   Resource   Score
ace    webserver  200
king   webserver  150 (unchanged)
queen  webserver  -100 (unchanged)
After the reported error, the webserver will be stopped and then started again on ace, as this node still has the highest score.
If it fails again (failcount = 2), you will end up with
ace = (constraint-score) + (resource_stickiness) + (failcount * (resource_failure_stickiness) )
ace = 200 + 100 + (2 * (-100) )
Node   Resource   Score
ace    webserver  100
king   webserver  150 (unchanged)
queen  webserver  -100 (unchanged)
At this point, the resource will fail over to node king, as king now has the highest score (even with resource_stickiness applied on ace). After a successful start on king, the scores will look like this:
ace = (constraint-score) + (failcount * (resource_failure_stickiness) )
ace = 200 + (2 * (-100) )
king = (constraint-score) + (resource_stickiness) + (failcount * (resource_failure_stickiness) )
king = 150 + 100 + (0 * (-100) )
Node   Resource   Score
ace    webserver  0
king   webserver  250
queen  webserver  -100 (unchanged)
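The whole walk-through can be reproduced with a few lines of arithmetic. This is a sketch of the formula used above, not of the real policy engine; the names `CONSTRAINT`, `scores` and `best_node` are hypothetical, and the numbers are the ones from this example.

```python
from typing import Dict

CONSTRAINT = {"ace": 200, "king": 150, "queen": -100}  # rsc_location scores
STICKINESS = 100             # resource_stickiness
FAILURE_STICKINESS = -100    # resource_failure_stickiness


def scores(running_on: str, failcounts: Dict[str, int]) -> Dict[str, int]:
    """Score per node: constraint + stickiness (only where running) + failures."""
    result = {}
    for node, base in CONSTRAINT.items():
        score = base + failcounts.get(node, 0) * FAILURE_STICKINESS
        if node == running_on:
            score += STICKINESS
        result[node] = score
    return result


def best_node(node_scores: Dict[str, int]) -> str:
    """Node with the highest score (ties are not modeled)."""
    return max(node_scores, key=lambda node: node_scores[node])


# Let the webserver fail 0, 1 and 2 times on ace while it runs there.
for failures in range(3):
    current = scores("ace", {"ace": failures})
    print(failures, current, "->", best_node(current))

# After the failover, stickiness applies on king instead of ace.
print(scores("king", {"ace": 2}))
```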
The following scenario is only different from scenario 1 if you are using heartbeat < 2.1.4 or pacemaker < 0.6.3.
Let's start over. Nothing has failed; the webserver has just started successfully on ace.
Now say the monitor operation reports "not running".
Then we end up with:
ace = (constraint-score) + (failcount * (resource_failure_stickiness) )
ace = 200 + (1 * (-100) )
Node   Resource   Score
ace    webserver  100
king   webserver  150 (unchanged)
queen  webserver  -100 (unchanged)
So the webserver will fail over to node king, as king has the highest score. If it starts successfully on king, the scores will look like:
king = (constraint-score) + (resource_stickiness) + (failcount * (resource_failure_stickiness) )
king = 150 + 100 + (0 * (-100) )
Node   Resource   Score
ace    webserver  100 (unchanged)
king   webserver  250
queen  webserver  -100 (unchanged)
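The difference between the two scenarios is only whether resource_stickiness is still counted after the failed monitor operation. The sketch below captures that for the old versions named above; the function name `score_after_failure` and the flag `reported_not_running` are hypothetical.

```python
def score_after_failure(constraint_score: int,
                        stickiness: int,
                        failure_stickiness: int,
                        failcount: int,
                        reported_not_running: bool) -> int:
    """Score of the node the resource was running on, after a failed monitor.

    On the old versions described here, a "not running" result means that
    resource_stickiness is no longer applied; an error result keeps it.
    """
    score = constraint_score + failcount * failure_stickiness
    if not reported_not_running:
        score += stickiness
    return score


# One failure on ace: error result versus "not running" result.
print(score_after_failure(200, 100, -100, 1, reported_not_running=False))
print(score_after_failure(200, 100, -100, 1, reported_not_running=True))
```

With king at 150, the first case keeps the webserver on ace, while the second case already makes it fail over after a single failure.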
Say we expand the previous setup.
The resource group is named webserver:
<group id="webserver">
<primitive id="nfsmount" ...
<primitive id="ipaddress" ...
<primitive id="apache" ...
</group>
We can keep the constraints from the first example:
<constraints>
<rsc_location id="rscloc-webserver" rsc="webserver">
<rule id="rscloc-webserver-rule-1" score="200">
<expression id="rscloc-webserver-expr-1" attribute="#uname" operation="eq" value="ace"/>
</rule>
<rule id="rscloc-webserver-rule-2" score="150">
<expression id="rscloc-webserver-expr-2" attribute="#uname" operation="eq" value="king"/>
</rule>
<rule id="rscloc-webserver-rule-3" score="-100">
<expression id="rscloc-webserver-expr-3" attribute="#uname" operation="eq" value="queen"/>
</rule>
</rsc_location>
</constraints>
So, without any stickiness values assigned, the scores will look like:
Node   Resource   Score
ace    nfsmount   200
ace    ipaddress  0
ace    apache     0
king   nfsmount   150
king   ipaddress  0
king   apache     0
queen  nfsmount   -100
queen  ipaddress  0
queen  apache     0
This will make the group start on ace.
Now we add values for default-resource-stickiness and default-resource-failure-stickiness. Notice the "-" in the nvpair names!
<crm_config>
<cluster_property_set id="cib-bootstrap-options">
<attributes>
<nvpair name="default-resource-stickiness" id="default-resource-stickiness" value="100"/>
<nvpair name="default-resource-failure-stickiness" id="default-resource-failure-stickiness" value="-100"/>
</attributes>
</cluster_property_set>
</crm_config>
So assuming all resources of the webserver group started successfully on ace, the scores will be:
ace = (constraint-score) + (num_group_resources * resource_stickiness) + (failcount * (resource_failure_stickiness) )
ace = 200 + (3 * 100) + (0 * (-100) )
Node   Resource   Score
ace    nfsmount   500
ace    ipaddress  INFINITY
ace    apache     INFINITY
king   nfsmount   150
king   ipaddress  -INFINITY
king   apache     -INFINITY
queen  nfsmount   -100
queen  ipaddress  -INFINITY
queen  apache     -INFINITY
Now, let's say apache's monitor operation reports an error (read: error; this is different from "not running"). Then the new scores will be:
ace = (constraint-score) + (num_group_resources * resource_stickiness) + (failcount * (resource_failure_stickiness) )
ace = 200 + (3 * 100) + (1 * (-100) )
Node   Resource   Score
ace    nfsmount   400
ace    ipaddress  INFINITY
ace    apache     INFINITY
king   nfsmount   150 (unchanged)
king   ipaddress  -INFINITY (unchanged)
king   apache     -INFINITY (unchanged)
queen  nfsmount   -100 (unchanged)
queen  ipaddress  -INFINITY (unchanged)
queen  apache     -INFINITY (unchanged)
Notice that although apache failed, the score for the first resource of the group changes. So apache will be restarted on ace as ace still has the highest score.
Compare this with the first example: there, the resource failed over to king after 2 failures. Here, the group only fails over after 4. After 4 failures (counted as the sum of failures within that group), the scores would look like:
ace = (constraint-score) + (num_group_resources * resource_stickiness) + (failcount * (resource_failure_stickiness) )
ace = 200 + (3 * 100) + (4 * (-100) )
Node   Resource   Score
ace    nfsmount   100
ace    ipaddress  INFINITY
ace    apache     INFINITY
king   nfsmount   150 (unchanged)
king   ipaddress  -INFINITY (unchanged)
king   apache     -INFINITY (unchanged)
queen  nfsmount   -100 (unchanged)
queen  ipaddress  -INFINITY (unchanged)
queen  apache     -INFINITY (unchanged)
This will cause a failover to node king, as king now has the highest score. After a successful start on king, you will see:
ace = (constraint-score) + (failcount * (resource_failure_stickiness) )
ace = 200 + (4 * (-100) )
king = (constraint-score) + (num_group_resources * resource_stickiness) + (failcount * (resource_failure_stickiness) )
king = 150 + (3 * 100 ) + (0 * (-100) )
Node   Resource   Score
ace    nfsmount   -200
ace    ipaddress  -INFINITY
ace    apache     -INFINITY
king   nfsmount   450
king   ipaddress  INFINITY
king   apache     INFINITY
queen  nfsmount   -100 (unchanged)
queen  ipaddress  -INFINITY (unchanged)
queen  apache     -INFINITY (unchanged)
At this point, the group will only be allowed to run on node king, because all other nodes have negative scores. You will need to reset the failcount (crm_failcount -D) to change this.
Note: If you want a resource group to behave in the same way (e.g. failover after the same number of failures), you will have to adjust the stickiness values according to the number of items within that group.
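How the number of group members changes the number of tolerated failures can be sketched as follows. The function name `failures_until_failover` and its parameters are hypothetical; the sketch uses the formula from this example and assumes that a resource stays where it is when two nodes have equal scores.

```python
def failures_until_failover(current_preference: int,
                            other_preference: int,
                            stickiness: int,
                            failure_stickiness: int,
                            num_resources: int = 1) -> int:
    """Count failures until the current node scores below the other node.

    Assumption: on equal scores the resource stays on the current node.
    """
    if failure_stickiness >= 0:
        raise ValueError("failure_stickiness must be negative to ever fail over")
    failcount = 0
    while (current_preference
           + num_resources * stickiness
           + failcount * failure_stickiness) >= other_preference:
        failcount += 1
    return failcount


# Single webserver resource (first example): ace 200, king 150.
print(failures_until_failover(200, 150, 100, -100))
# Group of three resources with the same stickiness values.
print(failures_until_failover(200, 150, 100, -100, num_resources=3))
```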
The following scenario is only different from scenario 1 if you are using heartbeat < 2.1.4 or pacemaker < 0.6.3.
Let's start over. Nothing has failed; the group has just started successfully on ace.
Now say the nfsmount's monitor operation reports "not running".
Then we end up with:
ace = (constraint-score) + (failcount * (resource_failure_stickiness) )
ace = 200 + (1 * (-100) )
Node   Resource   Score
ace    nfsmount   100
ace    ipaddress  INFINITY
ace    apache     INFINITY
king   nfsmount   150 (unchanged)
king   ipaddress  -INFINITY (unchanged)
king   apache     -INFINITY (unchanged)
queen  nfsmount   -100 (unchanged)
queen  ipaddress  -INFINITY (unchanged)
queen  apache     -INFINITY (unchanged)
So the group will fail over to node king, as king has the highest score. If it starts successfully on king, the scores will look like:
ace = (constraint-score) + (failcount * (resource_failure_stickiness) )
ace = 200 + (1 * (-100) )
king = (constraint-score) + (num_group_resources * resource_stickiness) + (failcount * (resource_failure_stickiness) )
king = 150 + (3 * 100 ) + (0 * (-100) )
Node   Resource   Score
ace    nfsmount   100
ace    ipaddress  -INFINITY
ace    apache     -INFINITY
king   nfsmount   450
king   ipaddress  INFINITY
king   apache     INFINITY
queen  nfsmount   -100 (unchanged)
queen  ipaddress  -INFINITY (unchanged)
queen  apache     -INFINITY (unchanged)
You will often want your resource to run on a node with a working network connection. To express this, you need to use pingd. pingd sends ICMP echo requests to a list of ping nodes and sets a numeric node attribute, calculated as the number of ping nodes pingd can reach times the configured multiplier. You can then create constraints using this node attribute.
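The value of the node attribute follows directly from that description. The function name `pingd_attribute` is hypothetical; the sketch only shows the multiplication, not how pingd itself works.

```python
def pingd_attribute(reachable_ping_nodes: int, multiplier: int) -> int:
    """Value of the node attribute: reachable ping nodes times the multiplier."""
    return reachable_ping_nodes * multiplier


# One ping node and a multiplier of 200: reachable versus unreachable.
print(pingd_attribute(1, 200))
print(pingd_attribute(0, 200))
```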
Now here comes probably the most important point about scores and pingd, which a lot of people do not seem to understand at first (from what I read on the mailing list and on IRC): in order to move a resource when the network connection is lost (read: pingd cannot reach any of the ping nodes), the pingd score has to be greater than the score the resource has with all stickiness bonuses applied, minus the node preference of the node you want to move the resource to.
Say we have the Example 1 setup again. webserver just started on ace.
Node   Resource   Score
ace    webserver  300
king   webserver  150
queen  webserver  -100
Let's first assume all nodes have a healthy connection and we start pingd like this (one ping node, multiplier of 200):
ping 10.10.10.10
respawn root /usr/lib/heartbeat/pingd -a pingd -d 5s -m 200
Now we constrain the webserver with the pingd attribute:
<rsc_location id="webserver:connected" rsc="webserver">
<rule id="webserver:connected:rule" score_attribute="pingd" >
<expression id="webserver:connected:expr:defined" attribute="pingd" operation="defined"/>
</rule>
</rsc_location>
This will change the scores as follows (note: the webserver could now run on queen, too, but that is not important right now):
Node   Resource   Score
ace    webserver  500
king   webserver  350
queen  webserver  100
Now the network connection fails on node ace. This will produce
Node   Resource   Score
ace    webserver  200
king   webserver  350 (unchanged)
queen  webserver  100 (unchanged)
and make the webserver move to node king. Then, stickiness will be added to make it
Node   Resource   Score
ace    webserver  200 (unchanged)
king   webserver  450
queen  webserver  100 (unchanged)
As you can see, the number of ping nodes, the multiplier and the way you constrain the pingd attribute are things to think about when using pingd.
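The rule of thumb for sizing the pingd score can be checked against the numbers of this example. The function name `pingd_moves_resource` is hypothetical, and the sketch only encodes the inequality stated earlier: the pingd score must exceed the resource's score with stickiness applied, minus the preference of the node the resource should move to.

```python
def pingd_moves_resource(pingd_score: int,
                         current_score_with_stickiness: int,
                         target_preference: int) -> bool:
    """True if losing connectivity is enough to move the resource."""
    return pingd_score > current_score_with_stickiness - target_preference


# webserver on ace scores 300 with stickiness; king has a preference of 150.
print(pingd_moves_resource(200, 300, 150))  # one ping node, multiplier 200
print(pingd_moves_resource(100, 300, 150))  # one ping node, multiplier 100
```

With a multiplier of 100 the inequality does not hold, so under this rule the webserver would stay on ace even without network connectivity.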
You could also use a more strict constraint, which forbids running the webserver resource on a node with no pingd attribute or a pingd attribute value of 0:
<rsc_location id="webserver:connected" rsc="webserver">
<rule id="webserver:connected:rule" score="-INFINITY" boolean_op="or">
<expression id="webserver:connected:expr:undefined" attribute="pingd" operation="not_defined"/>
<expression id="webserver:connected:expr:zero" attribute="pingd" operation="lte" value="0"/>
</rule>
</rsc_location>
If the network connection fails now, or pingd is not started on the node (i.e. the node attribute is not set), you will see scores like this:
Node   Resource   Score
ace    webserver  -INFINITY
king   webserver  350 (unchanged)
queen  webserver  100 (unchanged)
Although this might look appealing, make sure you understand that when none of the ping nodes is reachable from any node, your resource will not run anywhere at all. There may be environments or use cases where you want this, but others where you certainly do not want such behaviour.
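The effect of the strict rule, including the "runs nowhere" case, can be sketched like this. The names `strict_score` and `eligible_nodes` are hypothetical, `float("-inf")` is only a stand-in for the cluster's -INFINITY, and the base scores are illustrative values taken from the tables above. The sketch only models the -INFINITY rule, nothing else.

```python
from typing import Dict, List, Optional

NEG_INFINITY = float("-inf")  # stand-in for the cluster's -INFINITY score


def strict_score(base_score: float, pingd: Optional[int]) -> float:
    """Apply the strict rule: no pingd attribute, or a value <= 0, forbids the node."""
    if pingd is None or pingd <= 0:
        return NEG_INFINITY
    return base_score


def eligible_nodes(base_scores: Dict[str, float],
                   pingd: Dict[str, Optional[int]]) -> List[str]:
    """Nodes that are not forbidden by the strict pingd rule."""
    return [node for node, base in base_scores.items()
            if strict_score(base, pingd.get(node)) != NEG_INFINITY]


base = {"ace": 300, "king": 350, "queen": 100}
# Only ace lost its connection.
print(eligible_nodes(base, {"ace": 0, "king": 200, "queen": 200}))
# The ping node is unreachable from every node: nowhere left to run.
print(eligible_nodes(base, {"ace": 0, "king": 0, "queen": 0}))
```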