No Local Heartbeat
I got this message "ERROR: No local heartbeat. Forcing shutdown" and then Heartbeat shut itself down for no reason at all!
First of all, Heartbeat never shuts itself down for no reason at all. This kind of occurrence indicates that Heartbeat is not working properly, which in our experience can be caused by one of two things:
- System under heavy I/O load, or
- Kernel bug.
For how to deal with the first cause (heavy load), please read the answer to the next FAQ item. If your system was not under moderate to heavy load when it got this message, you probably have the kernel bug. The 2.4.18 through 2.4.20 Linux kernels had a bug that could keep Heartbeat from being scheduled for very long periods of time when the system was idle, or nearly so. If this is the case, you need to get a kernel that isn't broken.
How to tune Heartbeat on a heavily loaded system to avoid split-brain?
"No local heartbeat" or "Cluster node returning after partition" under heavy load is typically caused by too small a deadtime interval in ha.cf, or by an older version of Heartbeat. Make sure you're running at least version 3.0.2. Here is a suggestion for how to tune deadtime:
- Set deadtime to 60 seconds or higher.
- Set warntime to 1/4 to 1/2 of whatever you want your deadtime to be.
- Run your system under heavy load for a few weeks.
- Look at your logs for the longest time either system went without hearing a heartbeat.
If you never saw a "late heartbeat" message, then your chosen deadtime is fine - use it. Otherwise:
- Set your deadtime to 1.5-2 times that amount.
- Set warntime to twice your keepalive interval (keepalive*2).
- Continue to monitor logs for warnings about long heartbeat times. If you don't do this, you may get "Cluster node ... returning after partition", which will cause Heartbeat to restart on all machines in the cluster. This will almost certainly annoy you at a minimum.
Adding memory to the machine generally helps. Limiting workload on the machine generally helps. Newer versions of Heartbeat are better about this than pre-3.0.x versions. Some customers report being able to set sub-second deadtimes in their applications. YMMV (!)
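As a rough illustration of the procedure above, the timing section of ha.cf might look like the following. This is only a sketch: the 2-second keepalive and the 8-second longest observed gap are made-up figures for illustration, not recommendations, so substitute the numbers from your own logs.

```
# ha.cf timing directives (values in seconds)

# Phase 1: conservative starting point while observing the system under load
keepalive 2     # assumed heartbeat interval
warntime 20     # between 1/4 and 1/2 of deadtime (15 to 30 here)
deadtime 60     # 60 seconds or higher to start

# Phase 2: after a few weeks, if the logs showed late heartbeats and the
# longest gap was 8 seconds (hypothetical), replace the values above with:
# warntime 4    # keepalive * 2
# deadtime 16   # 1.5 to 2 times the longest gap (12 to 16 here)
```

Keeping the observation phase and the tuned values side by side in comments makes it easier to see later why a particular deadtime was chosen.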
I got this message "TTY write timeout on [/dev/ttyxxx]" but both nodes are up and I tested my serial cable
If both nodes are up, and your serial cable passes data, then the most probable explanation for the problem is that the serial cable does not pass the CTS and RTS leads through from end to end properly. Heartbeat requires these leads in order to avoid data loss.
How to use Heartbeat with an ipchains firewall?
To make Heartbeat work with ipchains, you must accept incoming and outgoing traffic on UDP port 694. Add something like:
/sbin/ipchains -A output -i ethN -p udp -s <source_IP> -d <dest_IP> -j ACCEPT
/sbin/ipchains -A input -i ethN -p udp -s <source_IP> -d <dest_IP> -j ACCEPT
How to run multiple clusters on the same network segment?
Use multicast (the mcast directive in ha.cf) and give each cluster its own multicast group. If you need or want to use broadcast, then run each cluster on a different port number. An example of a configuration using multicast would be to have the following line in your ha.cf file:
mcast eth0 225.0.0.1 694 1 0
This sets eth0 as the interface over which to send the multicast, 225.0.0.1 as the multicast group (it will be the same on each node in the same cluster), UDP port 694 (the Heartbeat default), a time to live of 1 (which limits multicast to the local network segment so that it does not propagate through routers), and multicast loopback disabled (typical).
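To make the two approaches concrete, here is a sketch of the communication lines for two clusters sharing one network segment. The group addresses 225.0.0.1 and 225.0.0.2 and the port 695 are illustrative assumptions, not values taken from this FAQ; any otherwise unused multicast groups or ports should serve equally well.

```
# Option 1: multicast, one group per cluster
# ha.cf on every node of cluster A
mcast eth0 225.0.0.1 694 1 0
# ha.cf on every node of cluster B
mcast eth0 225.0.0.2 694 1 0

# Option 2: broadcast, one UDP port per cluster
# ha.cf on every node of cluster A
udpport 694
bcast eth0
# ha.cf on every node of cluster B
udpport 695
bcast eth0
```

Each cluster uses only one of the two options, and every node within a cluster carries the same lines. If you pick the broadcast option with a firewall in place, remember to open the second port as well.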