Please have a look at some of the publications and documentation.
DRBD, developed by Philipp Reisner and Lars Ellenberg, is a Distributed Replicated Block Device for the Linux operating system. It lets you keep a real-time mirror of your local block devices on a remote machine. In conjunction with Heartbeat it allows you to build HA (high availability) Linux clusters.
At LinBit.
From LinBit, or from drbd.org. DRBD is also included in many Linux distributions, such as Debian, SuSE, Red Hat and others.
There is also a git repository and a mailing list.
The CIPE project. Of course, IPsec or OpenVPN will do, too.
Short answer: no. But see also the next question/answer.
Thus, if you want to mount the secondary, make the secondary the primary first. Mounting both devices at the same time does not work. Actually, DRBD v8 does support two primaries, see the next answer. If you need access to the data from both nodes, and from an arbitrary number of other clients, consider using HaNFS.
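For the usual one-node-at-a-time case, a manual role switch looks roughly like this (a minimal sketch; the resource name r0, device, and mount point are assumptions):

  # on the current primary:
  umount /mnt/data
  drbdadm secondary r0

  # then on the other node:
  drbdadm primary r0
  mount /dev/drbd0 /mnt/data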
If you need not just a mirrored, but a shared filesystem, use OCFS2 or GFS2 for example. But these are much slower, and typically expect write access on all nodes in question. If more than one node concurrently modifies distributed devices, we get some "interesting" problems deciding which part of the device is up-to-date on which node, and which blocks need to be resynchronized in which direction. These problems have been solved. You need to set net { allow-two-primaries; } to activate this mode (see the sketch below). But the handling of DRBD in "cluster fs mode" is still more complex and cumbersome than "classical" one-node-at-a-time access.
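A minimal drbd.conf fragment for dual-primary mode might look like this (a sketch; the resource name, host names, devices, and addresses are assumptions, only the allow-two-primaries option is the point here):

  resource r0 {
    protocol C;            # dual-primary requires synchronous replication
    net {
      allow-two-primaries;
    }
    on alpha {
      device    /dev/drbd0;
      disk      /dev/sda7;
      address   10.0.0.1:7788;
      meta-disk internal;
    }
    on bravo {
      device    /dev/drbd0;
      disk      /dev/sda7;
      address   10.0.0.2:7788;
      meta-disk internal;
    }
  }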
Also have a look at the DRBD Changelog.
Locally DRBD uses the configured disk size, which has to be <= the physical size; if not given, it is set to the physical size. On connect, the device size is set to the minimum of both nodes. And here you can run into problems if you do things without common sense: if you first use DRBD on one node only, without the disk size configured properly, and later connect a node with a smaller device, then the DRBD device size shrinks at runtime. You should find a message like "Your size hint is bogus, please change to <some value>" in the syslog in that case. This will confuse the file system on top of your device. Thus, if your device sizes differ, set the size to be used by DRBD explicitly. DRBD-0.7 stores information about the peer's device size in its local meta data, therefore usage of disk-size is deprecated (and is disallowed in the configuration file).
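With drbd 0.7 and later, an explicit size can be given in the disk section (a sketch, syntax per drbd.conf(5); the value is an assumption):

  resource r0 {
    disk {
      size 100G;   # use only this much of the backing device
    }
    ...
  }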
Can I put one machine in Los Angeles and the other machine in New York, connected only by a VPN link over the Internet? Or do they both need to be connected to the same local Ethernet network?
The settings of your running kernel and the .config of the kernel source against which DRBD was built do not match. On SuSE Linux you can get the right config with the following commands: cd /usr/src/linux/ && make cloneconfig && make dep. Usually you do not have to recompile your kernel, just DRBD. But read INSTALL in the drbd tgz to learn how to do it the proper way.
A summary of "LVM snapshots with DRBD" was posted on 2004-04-08 on drbd-user.
Maybe http://linux-vserver.org/advanced+DRBD+mount+issues helps.
There are always interesting discussions on http://lists.xensource.com/archives/html/xen-users/
Found $some filesystem which uses $somuch kB
current configuration leaves usable $less kB
Device size would be truncated, which
would corrupt data and result in
'access beyond end of device' errors.
You need to either
 * use external meta data (recommended)
 * shrink that filesystem first
 * zero out the device (destroy the filesystem)
Operation refused.
dd if=/dev/zero bs=1M count=1 of=/dev/sdXYZ; sync
drbdadm create-md $r
drbdadm -- -o primary $r
mkfs /dev/drbdY
attempt to access beyond end of device
drbd0: rw=1, want=211992584, limit=211986944
Buffer I/O error on device drbd0, logical block 26499072

Your file system then remounts read-only, panics, or similar. When you try to fsck, you get something like
The filesystem size (according to the superblock) is ... blocks.
The physical size of the device is ...+x blocks.

Envision this:
|-- usable area with drbd and internal meta data --|-IMD-|
|-- real device -----------------------------------------|

IMD is "internal meta data". Once created, it is fixed size. With drbd 0.7 it was a fixed 128 MB. With drbd 8.0 it is approximately [total storage of real device]/4/8/512/2 rounded up, +36k, rounded up to the next 4k.
Example:

grep -e hda4 -e drbd0 /proc/partitions
   3     4  105996744 hda4
 147     0  105993472 drbd0

ceil(105996744 kB / 32768) == 3235 kB
3235 kB + 36 kB            == 3271 kB
4k aligned                 == 3272 kB
105996744 kB - 3272 kB     == 105993472 kB
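The same calculation as a shell sketch (the partition size is the value from the example above):

  # compute the expected usable size (in kB) of a drbd 8 device
  # with internal meta data, from the backing partition size
  part_kb=105996744                              # from /proc/partitions
  md_kb=$(( (part_kb + 32767) / 32768 + 36 ))    # bitmap, rounded up, plus 36k
  md_kb=$(( (md_kb + 3) / 4 * 4 ))               # round up to the next 4k
  echo "usable: $(( part_kb - md_kb )) kB"       # prints: usable: 105993472 kB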
tune2fs -l /dev/whatever | awk '/^Block.size:/ { bs=$NF } /^Block.count:/ { bc=$NF } END { print bc * bs / 1024, "kB" }'
This is not a problem with drbd. It is a problem with using drbd incorrectly.
Also see http://thread.gmane.org/gmane.linux.network.drbd/12690/focus=12692 or search the list archives for more ASCII art and explanations.
Outdated, applies to drbd versions prior to drbd-0.6.4 only. For historical reasons, resynchronization used to work backwards. Most physical devices have pretty slow throughput when writing data backwards.
Double-check the value of sync-max in the net {} section (drbd-0.6), or rate in the syncer {} section (drbd-0.7). Keep in mind that the default value is very low, and the default unit is kByte/sec! (A sample configuration is sketched after these tips.)
Check whether DMA is enabled on your disks.
You may want to play with the values of protocol and sndbuf-size. If your NIC supports it, you may want to enable "jumbo frames" (increase the MTU). If nothing helps, ask the list for known good and performant setups...
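A drbd.conf fragment with these tuning knobs might look like this (a sketch; the values are assumptions, not recommendations, tune them for your hardware):

  resource r0 {
    protocol C;
    syncer {
      rate 30M;            # allow up to 30 MByte/sec for resynchronization
    }
    net {
      sndbuf-size 512k;    # larger TCP send buffer
    }
    ...
  }

  # jumbo frames are set on the network interface, not in drbd.conf, e.g.:
  ip link set dev eth1 mtu 9000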
not stopped
Note that all processes waiting for disk IO are counted as runnable! Therefore, if a lot of processes wait for disk IO, the "load average" goes straight up, though the system may actually be almost idle CPU-wise. E.g. crash your NFS server, and start 100 "ls /path/to/non-cached/dir/on/nfs/mount-point" processes on a client: you get a "load average" of 100+ for as long as the NFS timeout, which might be weeks, though the CPU does nothing. Verify your system load by other means, e.g. vmstat or sysstat/sar. This will give you an idea of the bottleneck of your system. Some ideas are using multiple disks (not just partitions!) or even a RAID with 10,000 rpm SCSI disks, and probably even Gigabit Ethernet. Even on a Fast Ethernet device you will rarely see more than 6 MByte per second (100 MBit/s is at most 12.5 MByte/s, minus protocol overhead, latency, etc.).
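For example (a sketch; column names per vmstat(8)):

  vmstat 1 5
  # watch the "b" column (processes blocked on IO) and "wa" (CPU time
  # spent waiting for IO); high values there combined with low "us"/"sy"
  # mean the disks, not the CPU, are the bottleneck.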
DRBD-0.6 only
Exit code 255 is most likely from a script-generated die, which includes a verbose error message. Capture the output of that script; this is the debugfile directive in your ha.cf, iirc. If that does not help, do it by hand, and see what error message it gives. datadisk says something like "cannot promote to primary, synchronization running" or "fsck failed" or ...
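In ha.cf that would look like this (a sketch; the paths are the common defaults, adjust to taste):

  debugfile /var/log/ha-debug
  logfile   /var/log/ha-log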
Feature ...
DRBD does not automatically mount the partition. The script datadisk (or drbddisk since 0.7) is made for that purpose. It is intended to be called by Heartbeat.
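With Heartbeat v1 style configuration, a haresources line might look like this (a sketch; node name, resource name, device, mount point and filesystem type are assumptions):

  node1 drbddisk::r0 Filesystem::/dev/drbd0::/mnt/data::ext3

Heartbeat then promotes r0 to primary via drbddisk and mounts it via the Filesystem resource script.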
For each device, drbd will (try to) allocate X MB of bitmap, plus some constant amount (<1MB). X = storage_size_in_GB/32, so 1 TB storage -> 32 MB bitmap.
By default, Linux reserves 128 MB of vmalloc space. For systems using more than 4 TB, this may cause an issue.
If you get the following error message in /var/log/messages, try a Linux 2.6 hugemem kernel:
kernel: allocation failed: out of vmalloc space - use vmalloc=<size> to increase size.
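Alternatively, increase the vmalloc area with the boot parameter the message suggests (a sketch; the 512M value and the kernel/root paths are illustrative assumptions):

  # e.g. in the bootloader configuration:
  kernel /vmlinuz-2.6.18 root=/dev/sda1 vmalloc=512M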
0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
^  ^            ^                    ^                    ^ ^-[*]
|  |            |                    |                    `- wire protocol
|  |            |                    `- disk state
|  |            `- state (should be named role, but historically)
|  `- connection state
`- minor number

ns:67582830 nr:1293290 dw:68880243 dr:124296536
   net sent   net read   disk write  disk read

al:13693 bm:101
   meta data updates for activity log / bitmap

lo:0 pe:0 ua:0 ap:0
   gauges of currently pending requests, see below

resync: used:0/31 hits:335 misses:109 starving:0 dirty:0 changed:109
act_log: used:0/1801 hits:6527480 misses:13790 starving:0 dirty:97 changed:13693
   cache statistics for the resync and activity log in-memory caches;
   you can safely ignore these.
[*]: four characters showing certain flag bits
Unconfigured                         | Device waits for configuration.
StandAlone                           | Not trying to connect to the peer; IO requests are only passed on locally.
Unconnected                          | Transitory state, while bind() blocks.
WFConnection                         | Device waits for the other side to connect.
WFReportParams                       | Transitory state, while waiting for the first packet on a new TCP connection.
Connected                            | Everything is fine.
Timeout, BrokenPipe, NetworkFailure  | Transitory states when the connection was lost.

SyncingAll                           | All blocks of the primary node are being copied to the secondary node.
SyncingQuick                         | The secondary is updated by copying the blocks which changed since the now-secondary node left the cluster.
SyncPaused                           | Sync of this device is paused while a higher-priority (lower sync-group value) device is resyncing.

WFBitMap{S,T}                        | Transitory state when synchronization starts; "dirty" bits are exchanged.
SyncSource                           | Synchronization in progress; this node has the good data.
SyncTarget                           | Synchronization in progress; this node has inconsistent data.
PausedSync{S,T}                      | See SyncPaused.
SkippedSync{S,T}                     | You should never see this. "Developers only."
Primary       | The active node; may access the device.
Secondary     | The passive node; must not access the device; expects mirrored writes from the other node.
Unconfigured  | This is not a role, obviously.
Diskless      | No storage attached, or storage had IO errors previously and got detached.
Attaching     | In the process of attaching the local storage.
Failed        | Storage had IO errors.
Negotiating   | Storage attached, but not yet decided whether it is UpToDate.
Inconsistent  | Storage is inconsistent (e.g. halfway through a bitmap-based resync).
Outdated      | Storage is consistent, but not UpToDate.
DUnknown      | (Peer's) storage state is not known.
Consistent    | Storage is consistent, not yet decided whether it is UpToDate or Outdated.
UpToDate      | Storage is good.
ns  | network send
nr  | network receive
dw  | disk write
dr  | disk read
al  | activity log updates (0.7 and later)
bm  | bitmap updates (0.7 and later)
lo  | reference count on local device
pe  | pending (waiting for ack)
ua  | unack'd (still need to send ack)
ap  | application requests expecting io-completion