Db2 (resource agent)

Scenario
A DB2 instance should be made highly available with two servers and shared storage. Once a server or the instance fails, Pacemaker restarts the instance or relocates it to the surviving server. Clients connect to a service address that is always active together with the healthy instance.

Functions of the agent
When Pacemaker calls the agent's start method the DB2 instance is started and all databases are activated (a subset of databases can be selected with the parameter <tt>dblist</tt> in version 1.0.5 or later). The monitor method checks whether the instance is running and then probes individual databases by selecting from internal tables. The stop method tries everything to bring the instance down: first <tt>db2stop force</tt> and, if this does not work or hangs, <tt>db2_kill</tt>. As a last resort Pacemaker will reboot the node with STONITH. Ultimately, bringing an instance down on one node is mandatory before starting it elsewhere.

Prerequisites / Assumptions
The environment for this example is:
 * nodes
 * <tt>node-a</tt> and <tt>node-b</tt>


 * IP-address for instance
 * <tt>192.168.178.17/24</tt> with DNS entry <tt>ha-inst1</tt>


 * DB2 instance user
 * <tt>db2inst1</tt>
 * The home directory of db2inst1 is <tt>/db2/db2inst1</tt> and must be on shared storage.

On <tt>node-a</tt> install the DB2 software (on shared storage!) and create the instance <tt>db2inst1</tt> as specified in the DB2 documentation (e.g. the DB2 Information Center).

Configure the IP address <tt>ha-inst1</tt> on <tt>node-a</tt> for now.

Cluster-enable your DB2 instance
Instance creation (<tt>db2icrt</tt>) creates a file <tt>~db2inst1/db2nodes.cfg</tt> containing the hostname of the node where the instance was created.

Of course this entry will be wrong on the other node <tt>node-b</tt>, so we have to replace it with something that is valid on either node, e.g. <tt>ha-inst1</tt>.

<tt>db2start</tt> and <tt>db2stop</tt> now consider this partition to be on a different node and try to access it with rsh. You either have to configure rsh or ssh, or you can provide a wrapper script that runs the command locally.
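As a sketch, such a wrapper simply discards the hostname argument and executes the rest of the command on the local node (the path <tt>/tmp/db2_localrsh</tt> is illustrative; that DB2 honors the <tt>DB2RSHCMD</tt> registry variable for this purpose is an assumption to verify against your DB2 version's documentation):

```shell
# Sketch of a local-exec rsh replacement (path is illustrative).
# DB2 invokes it as "<wrapper> <hostname> <command...>"; we drop the
# hostname and run the remaining command on the local node.
cat > /tmp/db2_localrsh <<'EOF'
#!/bin/sh
shift            # discard the target hostname
exec "$@"        # execute the remaining command locally
EOF
chmod +x /tmp/db2_localrsh

# As the instance owner, point DB2 at the wrapper, e.g.:
#   db2set DB2RSHCMD=/tmp/db2_localrsh
/tmp/db2_localrsh ha-inst1 echo "executed locally"
```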

Now try out <tt>db2start</tt> / <tt>db2stop</tt>. It should work.

Configure Pacemaker
The configuration of shared storage and IP addresses is described elsewhere. For DB2 it's essential that the file system and the service IP address are co-located with the DB2 resource and ordered before the DB2 resource e.g. put them in a group.
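As a sketch in crm shell syntax (device, directory, and timeout values are placeholders you must adapt), file system, service IP, and DB2 instance can be put in one group; the group both orders and colocates its members in sequence, so storage and IP come up before DB2:

```
primitive p_fs ocf:heartbeat:Filesystem \
        params device="/dev/sdb1" directory="/db2/db2inst1" fstype="ext4"
primitive p_ip ocf:heartbeat:IPaddr2 \
        params ip="192.168.178.17" cidr_netmask="24"
primitive p_db2 ocf:heartbeat:db2 \
        params instance="db2inst1" \
        op start timeout="120s" op stop timeout="120s" \
        op monitor interval="30s" timeout="60s"
group grp_db2 p_fs p_ip p_db2
```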

Be very deliberate when specifying timeout values, as these have to account for crash recovery etc.

Multipartition Support
Configure each partition as a separate resource using the <tt>dbpartitionnum</tt> parameter. Partition 0 must be started first.
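For example, two partitions could be modeled as separate primitives with an order constraint ensuring partition 0 starts first (resource names and timeouts below are illustrative):

```
primitive p_db2_part0 ocf:heartbeat:db2 \
        params instance="db2inst1" dbpartitionnum="0" \
        op monitor interval="30s" timeout="60s"
primitive p_db2_part1 ocf:heartbeat:db2 \
        params instance="db2inst1" dbpartitionnum="1" \
        op monitor interval="30s" timeout="60s"
order ord_db2_parts inf: p_db2_part0 p_db2_part1
```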

So what part of DB2 is now highly available?
That is the instance, including all databases. As pointed out above, the stop method brings down all databases. As a best practice, configure one database per instance.

Scenario
A DB2 database is configured for HADR on two servers with local storage. Each server has one instance configured. Once an instance or a complete server fails, Pacemaker performs a <tt>db2 takeover hadr</tt> on the surviving instance. Clients connect to a service address that is always active together with the Primary of the HADR pair.

Functions of the agent
The agent must be configured as a master/slave resource. When Pacemaker calls the agent's start method the DB2 instances are started. The databases are activated: one in Primary and one in Standby mode. Pacemaker then decides which member of the pair to promote. The database is then brought into the Primary role on that node. The monitor method checks whether the instances are running and then probes the Primary by selecting from internal tables. Should the Primary fail, the Standby is promoted (i.e. a <tt>db2 takeover hadr</tt> is performed). The stop method tries everything to bring the instance down: first <tt>db2stop force</tt> and, if this does not work or hangs, <tt>db2_kill</tt>.

Prerequisites / Assumptions
The environment for this example is:
 * nodes
 * <tt>node-a</tt> and <tt>node-b</tt>


 * DB2 instance user
 * <tt>db2inst1</tt>
 * This instance is configured on both servers on local storage.


 * DB2 database
 * A database with name db1 is configured for HADR on <tt>node-a</tt> and <tt>node-b</tt>


 * IP-address for the database
 * <tt>192.168.178.17/24</tt> with DNS entry <tt>ha-db1</tt>

Install the software and configure HADR for database db1 as specified in the DB2 documentation (e.g. DB2 Information Center).

Configure Pacemaker
Configure a resource for the IP address and a master/slave resource for the database. The IP address should always be colocated with the master (a.k.a. Primary) and started after promotion.
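A sketch in crm shell syntax (resource names, intervals, and timeouts are illustrative; note that the Master and Slave monitor operations need different intervals):

```
primitive p_ip ocf:heartbeat:IPaddr2 \
        params ip="192.168.178.17" cidr_netmask="24"
primitive p_db2_db1 ocf:heartbeat:db2 \
        params instance="db2inst1" dblist="db1" \
        op monitor interval="30s" role="Slave" timeout="60s" \
        op monitor interval="31s" role="Master" timeout="60s"
ms ms_db2_db1 p_db2_db1 \
        meta master-max="1" clone-max="2" notify="true"
colocation col_ip_with_master inf: p_ip ms_db2_db1:Master
order ord_ip_after_promote inf: ms_db2_db1:promote p_ip:start
```

The colocation constraint keeps the service address on the Primary's node, and the order constraint delays starting it until promotion has completed.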

Interaction of DB2's split brain prevention and Pacemaker
DB2 HADR has a built-in split brain prevention that can be summarized as follows:


 * A Primary cannot be cold-started while the Standby is down.
 * A takeover can be constrained to the HADR_PEER_WINDOW.

That means:
 * HADR_PEER_WINDOW must be configured (available with DB2 version >= V9).
 * On a cold start a database in role Master will not come up as long as the other Slave resource is down. The resource will be stuck in a start/stop loop for as long as Pacemaker allows.
 * Monitoring timeouts must be set so failure detection and takeover can be completed within HADR_PEER_WINDOW. The new Master then continues to work even if the other instance is down.
 * After a crash and a possible takeover, the crashed database comes up as Primary as well but will not activate because there is no Standby. Both databases exchange their "First active Log" position. Once the outdated 'old' Primary can be safely determined, it is restarted as Standby.
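<tt>HADR_PEER_WINDOW</tt> is a per-database configuration parameter set with the DB2 CLP, e.g. (the value is in seconds; 300 is only an illustration and must be large enough to cover your failure detection plus takeover time):

```
db2 update db cfg for db1 using HADR_PEER_WINDOW 300
```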