Setup of High-Availability NFS servers (HA-NFS)

This page was originally created by GuochunShi but is now maintained primarily by DaveDykstra.

Documentation

Please refer to the following papers for both introductory and detailed documentation:

Principle of Operation

HA-NFS requires the following for failover to be transparent to clients:

  1. identical device major/minor numbers,
  2. identical inodes for the file systems on both NFS servers,
  3. identical /etc/exports, and
  4. a shared /var/lib/nfs directory between the server nodes.

When an active/passive setup such as DRBD provides the shared storage, no service should hold /var/lib/nfs open during failover (or STONITH should be used with an iron fist).
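
To see which processes would get in the way of a clean handover, a quick check along these lines can help (output format varies between distributions):

    # list processes holding files open on the file system containing /var/lib/nfs
    fuser -vm /var/lib/nfs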

To ensure a clean handover, the following resources should be under HA control, i.e., configured under heartbeat (not covered here) for migration as a group between the servers (an illustrative grouping follows the list):

  • the nfslock resource

  • the nfsserver resource (on some distros named simply nfs)

  • the resources managing rpc.gssd (client) and rpc.svcgssd (server) for the RPCSEC GSS daemon (cf. RFC 2203)

  • the rpc.idmapd resource

  • the rpc_pipefs file system mount
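
As an illustration only, a heartbeat v1 haresources line grouping such resources together with the shared file system and the floating IP might look like the following; the node name, DRBD resource, device, mount point, IP address, and init script names are placeholders that depend on your distribution and storage setup:

    # /etc/ha.d/haresources (heartbeat v1) -- illustrative sketch only
    node1 drbddisk::r0 Filesystem::/dev/drbd0::/hafs::ext3 IPaddr::192.168.1.100 rpcidmapd nfslock nfs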

Device Numbering

The device major/minor numbers are embedded in the NFS filehandle, so the filehandle goes stale if the major/minor numbers change during failover. If the numbers differ between your servers, you can either adjust your configuration so that they match (easy if you are using DRBD), or, if you have shared disks, use EVMS or LVM to create a shared volume that has the same major/minor numbers on both nodes. You can download EVMS at http://evms.sourceforge.net/ and LVM at http://www.sistina.com/products_lvm.htm
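
To verify whether the numbers match, compare the device nodes on both servers; for block devices, ls prints the major and minor numbers (e.g. "8, 17") in place of the file size:

    # run on both servers and compare the major,minor pair
    ls -l /dev/sdb1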

Another way to deal with the NFS device numbering problem is offered by later versions of NFS (commonly included with 2.6 Linux kernels): the fsid export option lets you specify an integer to be used in place of the major/minor numbers of the mount device. For more details see exports(5).
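
For example, an export entry using fsid (the value is an arbitrary integer that must be unique per exported file system and identical on both servers) could look like this:

    /hafs/data  *(rw,sync,fsid=1)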

Quick Setup

For those who want a quick setup, here are the steps to prepare manual NFS failover. Once this is tested and working, consult the Heartbeat documentation for setting up automatic failover.

NFS v4: Keeping rpc_pipefs local

For NFS v4, rather than putting the rpc_pipefs file system mount and all resources that use it under HA control, you can keep the rpc_pipefs directory local to each server to get started. Do this before setting up the sharing of /var/lib/nfs described in the next section.

Note: This is a shortcut, which may not work in complex environments using all the services mentioned above.

The following procedure was tested on Red Hat Enterprise Linux 5.2. The servers were not NFS clients and did not use GSS. You may need to adjust details for your own distribution. The changes are to be made on both nodes.

  1. Change all occurrences of /var/lib/nfs/rpc_pipefs to /var/lib/rpc_pipefs:

    vi /etc/modprobe.d/modprobe.conf.dist /etc/idmapd.conf
    
    Suggestion: keep and comment out the original lines.

    Here is a simple but not bullet-proof automated version:

    perl -i.orig -pe '
            if (m,^[^#], and m,/var/lib/nfs/rpc_pipefs,)
                {print "#", $_; s,,/var/lib/rpc_pipefs,;}
        ' \
       /etc/modprobe.d/modprobe.conf.dist \
       /etc/idmapd.conf
    
    This was tested with the files shipped in the packages module-init-tools-3.3-0.pre3.1.37.el5 and nfs-utils-1.0.9-33.el5, respectively.

    Note that /etc/modprobe.d/modprobe.conf.dist has two occurrences for sunrpc (an install and a remove line).
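
    To double-check the edits, list the remaining occurrences of the old path; everything printed should be a commented-out original line:

    grep /var/lib/nfs/rpc_pipefs /etc/modprobe.d/modprobe.conf.dist /etc/idmapd.conf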

  2. Take down services:

    service nfslock stop
    service nfs stop
    service rpcidmapd stop
    /bin/umount -a -t rpc_pipefs
    
  3. Move the directory:

    mv /var/lib/nfs/rpc_pipefs /var/lib/
    
  4. Restart services:

    /bin/mount -t rpc_pipefs sunrpc /var/lib/rpc_pipefs
    service rpcidmapd start
    

    The first of these two commands performs the same mount that modprobe normally sets up, as specified in /etc/modprobe.d/modprobe.conf.dist.
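
    To confirm the relocated pipefs, the new mount should now show up (exact output differs between distributions):

    grep rpc_pipefs /proc/mounts
    service rpcidmapd status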

Sharing /var/lib/nfs

  1. Make sure you are using a shared disk device, either shared by hardware between your two servers or by DRBD.

  2. Mount your shared device in one of the machines. For example, if the device is /dev/sdb1 and the directory you want to mount to is /hafs:

    mount -t ext3 /dev/sdb1 /hafs 
    

    If the directory you will export is /hafs/data, create it if it is not there yet:

    mkdir /hafs/data
    
  3. Stop services (use the appropriate service names for your distro):

    service nfs stop       # or: nfsserver
    service nfslock stop   # or: nfs-common
    
  4. Move /var/lib/nfs to your shared disk and symlink the original location to it:

    mv /var/lib/nfs /hafs
    ln -sn /hafs/nfs /var/lib/nfs
    

    On the other machine, remove the /var/lib/nfs directory and create a link instead:

    rm -rf /var/lib/nfs
    ln -sn /hafs/nfs /var/lib/nfs
    

    If you get an error,

    rm: cannot remove directory `/var/lib/nfs/rpc_pipefs/statd': Operation not permitted
    
    follow the instructions in the previous section.
  5. For lock recovery to succeed, you need to supply rpc.statd with the "floating" (common) host name foo-ha, or the IP address, under which the HA services are to be reached.

    • On Debian, the locking service is started by /etc/init.d/nfs-common. You can configure it by putting into /etc/default/nfs-common:

      STATDOPTS="-n foo-ha"
      
    • On Red Hat/RHEL, the appropriate place to configure this is /etc/sysconfig/nfs:

      STATD_HOSTNAME=foo-ha
      
    • Otherwise, you could hardcode the host name in the /etc/init.d/nfslock script on many Linux systems:

      start() {
              ...
              echo -n $"Starting NFS statd: "
              ...
              daemon rpc.statd "$STATDARG" -n foo-ha
              ...
      }
      
      This is of course not recommended. Rather, scan the top of the script for any configuration files that are sourced and edit those instead.
  6. Start services, on the primary node only:

    service nfs start
    service nfslock start
    

    For heartbeat integration, make sure that both of these services are removed from the native init.d startup sequence and instead placed under heartbeat control, i.e., listed in the haresources file for heartbeat v1 or in the CIB for v2. The correct ordering of the two services is not entirely clear: most examples treat nfslock as subordinate to nfs, starting it after nfs and stopping it before nfs, though some init.d configurations reverse the order (notably RHEL 5.x).
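
    On Red Hat-style systems, for example, the init scripts can be taken out of the boot sequence with chkconfig (Debian-based systems use update-rc.d instead); adjust the service names to your distribution:

    chkconfig nfs off
    chkconfig nfslock off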

  7. Export the user directory. Add the following line to /etc/exports:

    /hafs/data  *(rw,sync)
    

    and run

    exportfs -va
    
  8. Either sync /etc/exports to the other node, or symlink it on both nodes into the shared disk, perhaps into /var/lib/nfs:

    mv /etc/exports /var/lib/nfs/       # one node only
    ln -sf /var/lib/nfs/exports /etc    # both nodes
    

Manual Failover Test

Assumptions

Heartbeat is set up and minimally configured to manage:

  1. DRBD resource demotion and promotion (if used)

  2. migration of the shared file system (/hafs in the examples above)

  3. migration of the shared IP for host foo-ha

Test

  1. On a client, mount the exported file system:

    mount foo-ha:/hafs/data /mnt/data
    
  2. Write some files to it, and watch, e.g.:

    while :; do date +"%s  %c -- $i" | tee -a /mnt/data/nfstest.log; sleep 1; (( ++i )); done
    
  3. On the primary server: take down NFS services:

    service nfslock stop
    service nfs stop
    
  4. Initiate heartbeat migration for all resources in the previous section, e.g., for a heartbeat v2 configuration:

    crm_resource -r drbdgroup -M
    crm_mon -1; sleep 5; crm_mon -1    # -1: print cluster status once and exit
    
  5. On the secondary server: bring up NFS services:

    service nfslock start
    service nfs start
    
  6. On the client, watch the above loop.

To migrate back, perform the analogous:

  1. On the secondary server: take down NFS services:

    service nfslock stop
    service nfs stop
    
  2. On the primary server:

    crm_resource -r drbdgroup -U
    crm_mon -1; sleep 5; crm_mon -1    # -1: print cluster status once and exit
    
  3. ... and bring up NFS services:

    service nfslock start
    service nfs start
    

Hints

  1. NFS-mounting any filesystem on your NFS servers is highly discouraged. DaveDykstra wanted both servers to NFS-mount the replicated filesystem from the active server; after a lot of trouble he mostly got it working, but still saw scenarios where "NFS server not responding" could interfere with heartbeat failovers, and he finally gave up on it. The biggest problem was the fuser command hanging. For more details see the archives of a mailing list discussion from 2005 and another from 2006.

  2. If your kernel defaults to using TCP for NFS (as is the case in 2.6 kernels), switch to UDP instead by using the 'udp' mount option. If you don't do this, you won't be able to quickly switch from server "A" to "B" and back to "A" because "A" will hold the TCP connection in TIME_WAIT state for 15-20 minutes and refuse to reconnect.
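
    For example, on the client (reusing the mount from the test above):

    mount -o udp foo-ha:/hafs/data /mnt/data
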
  3. For failover between the NFS server nodes to succeed, the shared /var/lib/nfs directory must be handed over cleanly and unconditionally.

    On Red Hat Enterprise Linux 5.x, nfsd occasionally refuses to die upon service nfs stop. To remedy the situation, apply the following patch:

    --- /etc/init.d/nfs     2008/05/24 23:02:19     1.1
    +++ /etc/init.d/nfs     2008/09/19 07:30:12
    @@ -109,6 +109,10 @@
            echo
            echo -n $"Shutting down NFS daemon: "
            killproc nfsd -2
    +
    +       # Need to be more persuasive for HA
    +       sleep 5
    +       killproc nfsd -9
            echo
            if [ -n "$RQUOTAD" -a "$RQUOTAD" != "no" ]; then
                    echo -n $"Shutting down NFS quotas: "
    

    This patch is for /etc/init.d/nfs from nfs-utils-1.0.9-33.el5. Adapt as necessary for your distribution.

Locking

NFS locking is a cooperative enterprise. Lock migration is coordinated by rpc.statd(8). Locks are stored in /var/lib/nfs/statd/sm and /var/lib/nfs/statd/sm.bak. Having /var/lib/nfs on shared storage as outlined above will enable cooperative lock migration upon failover, as initiated by rpc.statd on the new server.

For debugging, rpc.statd(8) allows signal-initiated recovery:

  • SIGUSR1 causes rpc.statd to re-read the notify list from disk and send notifications to clients. This can be used in High Availability NFS (HA-NFS) environments to notify clients to reacquire file locks upon takeover of an NFS export from another server.
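
    For example, to trigger the re-notification by hand (assuming pkill is available and a single rpc.statd instance is running):

      pkill -USR1 rpc.statd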

On some kernels or in some situations, locks may not survive HA failovers without the steps above. As a workaround for those situations, it is recommended that you mount NFS filesystems with the "nolock" option. For more details see a mailing list post from 2005 and a confirmation from 2007 using a more recent kernel.
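
For example, on the client:

    mount -o nolock foo-ha:/hafs/data /mnt/data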

For further information about locking, which includes testing using Connectathon, see HaNFSOldLocking.

Active/Active HA NFS

The directions and comments above apply mainly to active/passive arrangements. For information on active/active setups, please refer to Matt Schillinger's NFS page (note: old page).