There are other FAQs in this wiki as well.
Yes! There are two public mailing lists for Linux-HA. You can find out about them by visiting the contact page.
HA (high-availability) cluster - This is a cluster that makes a host (or hosts) highly available: if one node goes down (or a service on that node goes down), another node can pick up that service and take over from the failed machine. http://linux-ha.org/
Computing cluster - This is what a Beowulf cluster is. It allows distributed computing over off-the-shelf components, usually cheap IA32 machines.
Load-balancing cluster - This is what the Linux Virtual Server project does. In this scenario, one machine load-balances requests for a given service (Apache, for example) across a farm of servers.
All of these sites have HOWTOs etc. on them. For a general overview on clustering under Linux, look at the Clustering HOWTO.
Resource scripts are basically (extended) System V init scripts. They have to support the stop, start, and status operations. In the future we will also add support for a monitor operation for monitoring services. The IPaddr script already implements this monitor operation (but Heartbeat doesn't use that function of it yet). For more info, see the Resource HOWTO.
Most likely there is some problem in your resource agent script. When the resource is not running, "<resource script> status" must not print "ok" or "running"; otherwise Heartbeat will think the resource is already running and will not run "<resource script> start". For a detailed description of a Heartbeat resource agent, see http://wiki.linux-ha.org/HeartbeatResourceAgent
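To illustrate, here is a minimal sketch of such a resource agent script; "mydaemon" and its pidfile are hypothetical placeholders, not a script shipped with Heartbeat:

#!/bin/sh
# /etc/ha.d/resource.d/mydaemon -- illustrative example only
case "$1" in
start)
        /usr/sbin/mydaemon
        ;;
stop)
        kill `cat /var/run/mydaemon.pid`
        ;;
status)
        # Print "running" only if the service really is running; any other
        # output tells Heartbeat the resource is stopped, so it will start it.
        if [ -f /var/run/mydaemon.pid ] && kill -0 `cat /var/run/mydaemon.pid` 2>/dev/null
        then
                echo running
        else
                echo stopped
        fi
        ;;
*)
        echo "Usage: $0 {start|stop|status}"
        exit 1
        ;;
esac
exit 0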
If one of my resources stops working Heartbeat doesn't do anything unless the server crashes. How do I monitor resources with Heartbeat?
For Heartbeat version 2, please see the Frequently Asked Questions for Heartbeat Version 2 page.
In version 1.x, Heartbeat was not designed to monitor resources. The best answer is to upgrade to an R2-style CRM configuration. If you need to monitor some resources with the 1.x software (for example, the availability of a WWW server), you need some third-party software. Mon is a reasonable solution.
Get Mon from http://kernel.org/software/mon/ . Note that you are encouraged to use a mirror, not kernel.org directly.
Mon requires several Perl modules, which are available at the CPAN archive. I am not very familiar with Perl, so I downloaded them from the CPAN archive as .tar.gz packages and installed them the usual way (perl Makefile.PL && make && make test && make install).
Mon is software for monitoring various network resources. It can ping computers, connect to various ports, monitor WWW servers, MySQL, etc. When a resource fails, Mon triggers alert scripts.
Unpack Mon in some directory. The best starting point is the README file. Complete documentation is in <dir>/doc, where <dir> is the place where you unpacked the Mon package.
Copy all subdirs found in <dir> to /usr/lib/mon
Create the directory /etc/mon
Copy auth.cf from <dir>/etc to /etc/mon
Now Mon is ready to work. You need to create your own mon.cf file, where you point to the resources Mon should watch and the actions Mon will take when a resource fails and when it becomes available again. All monitoring scripts are in /usr/lib/mon/mon.d/; at the beginning of every script you can find an explanation of how to use it. All alert scripts are placed in /usr/lib/mon/alert.d/; those are the scripts triggered when something goes wrong. If you are using IPVS (IP Virtual Server), you can find scripts for adding and removing servers from an IPVS list on its homepage (http://www.linuxvirtualserver.org).
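As a starting point, here is a sketch of a mon.cf fragment that pings a group of hosts; the addresses and alert recipient are placeholders, and fping.monitor / mail.alert are the standard scripts shipped in mon.d/ and alert.d/:

hostgroup servers 10.10.10.1 10.10.10.2

watch servers
    service ping
        interval 1m
        monitor fping.monitor
        period wd {Sun-Sat}
            alert mail.alert root@example.com
            upalert mail.alert root@example.com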
Consider using hb_standby to cause services to fail over to the other side in case of problems.
Yes!
For each interface you wish to monitor, specify one or more "ping nodes" in your configuration. Each node in your cluster will monitor these ping nodes.
Should one node detect a failure in one of these ping nodes, it will contact the other node in order to determine whether it or the ping node has the problem. If the cluster node has the problem, it will try to fail over its resources (if it has any).
To use ipfail you will need to add the following to your /etc/ha.d/ha.cf files:
respawn hacluster /usr/lib/heartbeat/ipfail
ping <IPaddr1> <IPaddr2> ... <IPaddrN>
IPaddr1..N are your ping nodes. NOTE: ipfail requires the auto_failback option to be set to on or off (not legacy).
See the ipfail page for more information on ipfail. Note: ipfail only works for R1-style resources.
For R2-style resources, use the pingd daemon instead of ipfail. The ping nodes are configured in the same way, but pingd provides much more flexibility in terms of what should fail over under what conditions. Note: pingd only works for R2-style resources.
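As an illustration, an R2-style ha.cf might contain something like the following; the ping node address is a placeholder, and the pingd flag values are illustrative only (check the pingd documentation for your release):

ping 10.10.10.254
respawn hacluster /usr/lib/heartbeat/pingd -m 100 -d 5s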
This isn't a problem with Heartbeat, but rather is caused by various versions of net-tools. Upgrade to the most recent version of net-tools and it will go away. You can test it with ifconfig manually.
Instead of failing over many IP addresses, just fail over one router address. On your router, do the equivalent of "route add -net x.x.x.0/24 gw x.x.x.2", where x.x.x.2 is the cluster IP address controlled by Heartbeat. Then, make every address within x.x.x.0/24 that you wish to failover a permanent alias of lo0 on both cluster nodes. This is done via "ifconfig lo:2 x.x.x.3 netmask 255.255.255.255 -arp" etc...
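Putting the pieces together, the commands look something like this; the x.x.x.* addresses are placeholders, and the router's command syntax will vary by vendor:

# On the router: route the whole failover network via the cluster IP
route add -net x.x.x.0/24 gw x.x.x.2
# On both cluster nodes: alias each failover address to lo, non-ARPing
ifconfig lo:2 x.x.x.3 netmask 255.255.255.255 -arp
ifconfig lo:3 x.x.x.4 netmask 255.255.255.255 -arp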
If anything makes your Ethernet / IP stack fail, you may lose both connections. You definitely should run the cables differently, depending on how important your data is... If you're running an ipchains firewall, then it's easy to accidentally cause a SplitBrain occurrence if you make a mistake in the rules, and currently our serial ports aren't subject to firewall rules at all.
However, most CRM-style configurations don't work well with serial configurations, because of the amount of data being transferred. If you want to try this, make sure your baud rate is as high as you can set it.
To make Heartbeat work with ipchains, you must accept incoming and outgoing traffic on UDP port 694. Add something like:
/sbin/ipchains -A output -i ethN -p udp -s <source_IP> -d <dest_IP> -j ACCEPT
/sbin/ipchains -A input -i ethN -p udp -s <source_IP> -d <dest_IP> -j ACCEPT
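If you are using iptables (2.4+ kernels) instead of ipchains, a roughly equivalent pair of rules - a sketch, not a tested recipe for every setup - would be:

/sbin/iptables -A INPUT -i ethN -p udp --dport 694 -s <source_IP> -j ACCEPT
/sbin/iptables -A OUTPUT -o ethN -p udp --dport 694 -d <dest_IP> -j ACCEPT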
First of all, Heartbeat never shuts itself down for no reason at all. This kind of occurrence indicates that Heartbeat is not working properly, which in our experience is caused by one of two things: heavy system load, or a kernel scheduling bug.
For how to deal with the first cause (heavy load), please read the answer to the next FAQ item. If your system was not under moderate to heavy load when it got this message, you probably have the kernel bug. The 2.4.18-2.4.20 Linux kernels had a bug which caused Heartbeat not to be scheduled for very long periods of time when the system was idle, or nearly so. If this is the case, you need to get a kernel that isn't broken.
"No local heartbeat" or "Cluster node returning after partition" under heavy load is typically caused by too small a deadtime interval, or an older version of Heartbeat. Make sure you're running at least version 1.2.0. Here is a suggestion for how to tune deadtime:
Set deadtime to 60 seconds or higher
Set warntime to 1/4 to 1/2 of whatever you *want* your deadtime to be, but keep it at least keepalive*2.
If deadtime is too short, a busy node may be falsely declared dead and later be seen as returning after a partition, which will cause Heartbeat to restart on all machines in the cluster. This will almost certainly annoy you, at a minimum.
Adding memory to the machine generally helps. Limiting the workload on the machine generally helps. Newer versions of Heartbeat are better about this than pre-1.2.x versions. Some customers report being able to set sub-second deadtimes in their applications with 1.2.3. YMMV(!).
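For reference, here is a sketch of ha.cf timing directives consistent with the advice above; the values are illustrative only and should be tuned to your own observed heartbeat latencies:

keepalive 2     # send a heartbeat every 2 seconds
warntime 15     # warn about late heartbeats (at least keepalive*2)
deadtime 60     # declare a node dead after 60 seconds of silence
initdead 120    # allow extra time for the network to come up at boot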
In general, this message means that a packet has been corrupted. It's common to get a single mangled packet on your serial interface when Heartbeat starts up; this message is an indication that that has happened, and in that scenario it's harmless. It will also happen occasionally under other circumstances, and as long as it only happens occasionally (say, less than once an hour), it's harmless. If it happens continually, something else is probably corrupting packets, and the cause should be investigated and corrected.
It's probably a permissions problem on authkeys. Heartbeat requires the file to be accessible only by its owner (mode 400, 600 or 700). Depending on where and when it discovers the problem, the message will wind up in different places, but it tends to be in /var/log/messages.
Newer releases (>= 1.0) are better about also putting out startup messages to stderr in addition to wherever you have configured them to go.
Use multicast, and give each cluster its own multicast group. If you need or want to use broadcast, then run each cluster on a different port number. An example of a configuration using multicast would be to have the following line in your ha.cf file:
mcast eth0 224.1.2.3 694 1 0
This sets eth0 as the interface over which to send the multicast, 224.1.2.3 as the multicast group (the same on every node in the same cluster), UDP port 694 (the Heartbeat default), a time-to-live of 1 (limiting the multicast to the local network segment so it will not propagate through routers), and multicast loopback disabled (the typical setting).
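For example, two independent clusters sharing the same network segment could each use their own group (the group addresses here are arbitrary examples):

mcast eth0 225.0.0.1 694 1 0    # in ha.cf on both nodes of cluster 1
mcast eth0 225.0.0.2 694 1 0    # in ha.cf on both nodes of cluster 2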
There is a CVS repository for Linux-HA. You can find it at http://cvs.linux-ha.org/viewcvs/viewcvs.cgi/linux-ha/ . Read-only access is via login guest, password guest, module name linux-ha. More details are to be found on the HeartbeatCVS page.
Heartbeat now uses automake and is generally quite portable at this point. Join the Linux-HA-dev mailing list if you want to help port it to your favorite platform.
Due to RPM package name differences between distributions, this was unavoidable. If you're not using STONITH, use the "--nodeps" option with RPM. Otherwise, use the Heartbeat source to build your own RPMs; you'll have the added dependencies of autoconf >= 2.53 and libnet (get it from http://www.packetfactory.net/libnet ). Use the Heartbeat source RPM (preferred), or unpack the Heartbeat source and, from the top directory, run ./ConfigureMe rpm. This will build RPMs and place them where it's customary for your particular distro. It may even tell you if you are missing some other required packages!
You configure a "meatware" STONITH device into the ha.cf file. The meatware STONITH device asks the operator to go power reset the machine which has gone down. When the operator has reset the machine he or she then issues a command to tell the meatware STONITH plugin that the reset has taken place. Heartbeat will wait indefinitely until the operator acknowledges the reset has occurred. During this time, the resources will not be taken over, and nothing will happen.
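A sketch of what this can look like; the node names are placeholders, and you should check the meatware plugin's documentation (stonith -h) for the exact parameter format in your release:

stonith_host * meatware node1 node2    # in /etc/ha.d/ha.cf

# After physically resetting the failed node, the operator acknowledges
# the reset on a surviving node:
meatclient -c node1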
STONITH is a form of fencing, and is an acronym standing for Shoot The Other Node In The Head. It allows one node in the cluster to reset the other. Fencing is essential if you're using shared disks, in order to protect the integrity of the disk data. Heartbeat supports STONITH fencing, and resources which are self-fencing. You need to configure some kind of fencing whenever you have a cluster resource which might be permanently damaged if both machines tried to make it active at the same time. When in doubt check with the Linux-HA mailing list.
To get the list of supported STONITH devices, issue this command:
stonith -L
To get all the gory details on exactly what these STONITH device names mean, and how to configure them, issue this command:
stonith -h
This is not something which Heartbeat supports directly, however, there are a few kinds of resources which are "self-fencing". This means that activating the resource causes it to fence itself off from the other node naturally. Since this fencing happens in the resource agent, Heartbeat doesn't know (and doesn't have to know) about it. Two possible hardware candidates are IBM's ServeRAID RAID controllers and ICP Vortex RAID controllers - but do your homework!!! When in doubt check with the mailing list.
Yes, Heartbeat has supported active/active configurations since its first release. The key to configuring active/active clusters is to understand that each resource group in the haresources file is preceded by the name of the server which is normally supposed to run that service. In an "auto_failback on" (or legacy; old-style "nice_failback off") configuration, a cluster node that comes up will take over any resources for which it is listed as the "normal master" in the haresources file. Below is an example of how to do this for an Apache/MySQL configuration.
server1 10.10.10.1 mysql
server2 10.10.10.2 apache
In this case, the IP address 10.10.10.1 should be replaced with the address at which you want to contact the MySQL server, and 10.10.10.2 should be replaced with the address people should use to contact the web server. Any time server1 is up, it will run the MySQL service. Any time server2 is up, it will run the Apache service. If both server1 and server2 are up, both servers will be active. Note that this is the opposite of the old nice_failback on behavior. With newer releases, which support hb_standby foreign, you can manually fail back into an active/active configuration even with auto_failback off. This gives administrators the flexibility to fail back in a more controlled way, at safer or more convenient times.
Heartbeat was written to use ifconfig to manage its interfaces. That's nice for portability to other platforms, but for some reason ifconfig truncates interface names. You need to limit your interface names to 7 characters if you have fewer than 10 aliases, and to 6 characters if you have fewer than 100. Or, you can use the IPaddr2 resource script, which avoids this limitation.
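For example, a haresources entry using IPaddr2 might look like the following; the address, prefix, interface, and service name are placeholders:

server1 IPaddr2::10.10.10.1/24/eth0 httpd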
The auto_failback parameter is a replacement for the old nice_failback parameter. The old value nice_failback on is replaced by auto_failback off. The old value nice_failback off is logically replaced by the new auto_failback on parameter. Unlike the old nice_failback off behavior, the new auto_failback on allows the use of the ipfail and hb_standby facilities.
During upgrades from nice_failback to auto_failback, it is sometimes necessary to set auto_failback to legacy, as described in the upgrade procedure below.
To upgrade from a pre-auto_failback version of Heartbeat to one which supports auto_failback, the following procedures are recommended to avoid a flash cut on the whole cluster.
Stop Heartbeat on one node in the cluster.
Upgrade this node. If the other node has nice_failback on in ha.cf then set auto_failback off in the new ha.cf file. If the other node in the cluster has nice_failback off then set auto_failback legacy in the new ha.cf file.
Restart Heartbeat on the upgraded node. Then repeat the steps above on the other node, setting auto_failback the same as it was set in the previous step.
If you set auto_failback to on or off, then you are done. Congratulations!
If you set auto_failback legacy in your ha.cf file, then continue as described below...
Once all nodes in the cluster are running the upgraded software, change the value of auto_failback to on in the ha.cf file on both sides and restart Heartbeat.
Congratulations, you're done! You can now use ipfail, and can also use the hb_standby command to cause manual resource moves.
Please be sure that you have read all the documentation and searched the mailing list archives. If you still can't find a solution, you can post questions to the mailing list. Try to start your email with a simple, clear explanation of what you think is wrong, surprising, or unexpected. We know that many of you don't have good English skills. That's OK. Just try to be logically organized and as clear as you can, and avoid really long sentences or really long paragraphs; those just make it harder to understand. Logical organization of the information and clarity really help everyone - including you. It's also good to send your email as plain text and not as HTML. HTML email causes problems for some people, and annoys others; plain text works great for mailing lists. Please include the following:
How you installed Heartbeat (tar.gz, rpm, src.rpm, or manual installation).
Include your configuration files from both machines. You can omit authkeys. Send them as text/plain attachments.
Please don't send "cleaned up" logs. The real logs have more information in them than cleaned up versions. Always include at least a little irrelevant data before and after the events in question so that we know nothing was missed. Please narrate what you did when, preferably with timestamps taken from the logs. Don't edit the logs unless you really have some super-secret high-security reason for doing so. This means you need to attach 5-8 files.
For Heartbeat versions >= 2.1.3, the easiest way to gather all of these files is to use hb_report. If you still want to go the manual way:
Include 5 files if you're running an R1 cluster with debug output going into the same file as your normal output, 6 if you're running an R1 cluster with debug output going to a separate file, 7 if you're running an R2 cluster with debug output going into a single file, and 8 otherwise. For each machine you need to send:
The output from cibadmin -Ql (for R2 clusters), or your haresources file (for R1 clusters)
If you want to file a bug report, then you can see the bugzilla information in the ContactUs page for more details.
We love to get good patches. Here's the preferred way:
Check with the Linux-HA-Dev mailing list for answers before starting
Make your changes against the current CVS source
Test them, and make sure they work
cvs -q diff -u >patchname.txt
Send an email to the Linux-HA-Dev mailing list with the patch as a [text/plain] attachment. If your mailer wants to zip it up for you, please fix it.
Another good way is to submit the patch attached to a bug report. See the ContactUs page for how to access our bug reporting system. There's nothing wrong with doing both - if you don't mind.
We just reinstalled our master node (paul) and Heartbeat (1.2.0) is saying this on the slave node (silas - which has the resources):
Mar 16 19:31:43 silas heartbeat[12561]: ERROR: should_drop_message: attempted replay attack [paul]? [gen = 1, curgen = 10]
Mar 16 19:32:15 silas last message repeated 38 times
Mar 16 19:33:17 silas last message repeated 62 times
What should we do to get the resources back on the master node?
Put 11 (curgen+1) in /var/lib/heartbeat/hb_generation on paul - from this log it should have a 1 (gen) in there now.
Basically, it should be one larger than the curgen number from the message above.
Then if you restart Heartbeat on the master node (paul), all should be well. This is the result of a feature called ReplayAttackProtection. You can also just restart Heartbeat on both nodes, if you prefer.
So, if you put any number larger than curgen into the hb_generation file on paul (the machine you reinstalled) and restart Heartbeat, all will be well.
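Assuming, as the advice above implies, that hb_generation simply holds the generation number, something like this on paul should do it:

echo 11 > /var/lib/heartbeat/hb_generation   # any value larger than curgen (10)
/etc/init.d/heartbeat restart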
IP failover within a single node can be set up using a bonding device. It is described in IpFailoverChannelBonding.
How to configure HA NFS, along with corresponding test results, can be found on the HaNFS page.
If both nodes are up, and your serial cable passes data, then the most probable explanation for the problem is that the serial cable does not pass the CTS and RTS leads through from end to end properly. Heartbeat requires these leads in order to avoid data loss. See the SerialCable page for more details.
This discussion assumes that your current machines are named A and B, and you want them to be named C and D respectively after the change.
A: hb_takeover # make A (the host that will become C) active for all services
B: /etc/init.d/heartbeat stop
B: vi /etc/ha.d/ha.cf /etc/ha.d/haresources # add entries for C and D so 4 hosts are listed
B: /etc/init.d/heartbeat start
B: hb_takeover
A: /etc/init.d/heartbeat stop
A: vi /etc/ha.d/ha.cf # add entries for C and D so 4 hosts are listed
A: hostname C # changes the first host's name; don't forget to update /etc/hostname to C
A: /etc/init.d/heartbeat start
A: hb_takeover
B: /etc/init.d/heartbeat stop
B: vi /etc/ha.d/ha.cf /etc/ha.d/haresources # remove A and B, leaving only C and D in both files
B: hostname D # changes the second host's name; don't forget to update /etc/hostname to D
B: /etc/init.d/heartbeat start
B: hb_takeover
A: /etc/init.d/heartbeat stop
A: vi /etc/ha.d/ha.cf /etc/ha.d/haresources # remove A and B, leaving only C and D in both files
A: /etc/init.d/heartbeat start
A: hb_takeover # Leave host A as the active load balancer, if you wish
The general case is somewhat painful for R1-style configurations, but I think these recipes should work starting with the 1.2.x series... But, the best answer is to use an R2-style CRM configuration, because then this all just works exactly like it should.
If you want to reorganize them, but not eliminate or add any, then this should work:
do an hb_standby on either side...
update haresources on both sides
If you don't like where the resources are now running...
do an hb_standby foreign on side A (wait for it to complete)
do an hb_standby foreign on side B
If you want to add or eliminate some, then this should work:
do an hb_standby on the A side
update the A side's haresources file
do an hb_standby on the B side (wait for it to complete)
update the haresources file on the B side
do an hb_standby foreign on side A (wait for it to complete)
do an hb_standby foreign on side B
This is almost always because of firewall rules. You may be running firewall rules even if you don't think you are. Most current distributions ship with firewall rules enabled by default. So, if you are having this problem, please check for firewall rules before asking for further help.
To check firewall rules, try using the "iptables-save" command, which dumps your current set of firewall rules. If you have no firewall rules, you would see output similar to this:
[2] guin:~# iptables-save
# Generated by iptables-save v1.3.0 on Fri Dec 30 16:29:07 2005
*filter
:FORWARD ACCEPT [0:0]
:INPUT ACCEPT [16:1935]
:OUTPUT ACCEPT [19:1790]
COMMIT
# Completed on Fri Dec 30 16:29:07 2005
[2] guin:~#
On Red Hat and Fedora Core systems, you can disable firewall rules by running "/etc/init.d/iptables stop". This is recommended only for testing; you should always run a firewall on your systems except when explicitly testing whether firewall rules are causing your problems.
If you have followed these steps and are sure that you have no firewall rules in place, but are still having problems, mention what you have done to rule out firewall rules when asking for help. Pretty much every time this error is reported, it has been because of a firewall, and in many cases the users insisted that there were no firewall rules. So, save yourself some time: check this and report the results in your request for community support.