This article explains how the discovery and monitoring of datanodes work. This is a function of the namenode.
Multicast discovery
Multicast discovery is the default. It works out of the box if there is only a layer-2 switch between the namenode and the datanodes. It needs tweaking if there is a router between the nodes, and it does not work at all if the network does not support multicast (e.g. the Amazon EC2 network does not).
It works as follows: The namenode sends multicast messages to 224.0.0.1:2728 and waits for responses from the datanodes in the network. Every IP address that responds is then checked more closely. The TTL value is set to 1 by default, so the messages cannot pass routers.
This default works automatically for many users, but not for everybody. The following subsections will guide you through the configuration. If it turns out that multicasting does not work for you, the alternative is to enter the IP addresses of all datanodes directly into the configuration file. This is described at the end of this article.
There is another advantage of the multicast discovery: If more datanodes are added to the system, it is not necessary to restart the namenode. The new datanodes will automatically be found, and it is sufficient to enable them (with a command).
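The probe-and-collect step described above can be sketched in Python as follows. This is only an illustration of the mechanism, not Plasma's implementation: the function names and the payload b"probe" are hypothetical, and the actual request format used by the namenode is not described in this article. Only the group address, the port, and the default TTL of 1 are taken from the text.

```python
import socket
import struct


def make_discovery_socket(ttl=1, timeout=2.0):
    """Create a UDP socket whose outgoing multicast packets carry the given TTL."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # With a TTL of 1 the packets are not forwarded by routers.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL,
                    struct.pack("B", ttl))
    sock.settimeout(timeout)
    return sock


def discover(group="224.0.0.1", port=2728, ttl=1, timeout=2.0,
             payload=b"probe"):
    """Send one probe to the multicast group and return the responding IPs.

    The payload is a placeholder (assumption); a real datanode would only
    answer a request in the protocol it actually speaks.
    """
    responders = set()
    sock = make_discovery_socket(ttl, timeout)
    try:
        sock.sendto(payload, (group, port))
        while True:
            try:
                # Stop once no further answer arrives within `timeout` seconds.
                _data, (ip, _port) = sock.recvfrom(1024)
            except socket.timeout:
                break
            responders.add(ip)
    finally:
        sock.close()
    return sorted(responders)
```

Calling discover() returns the sorted list of addresses that answered, and each of them would then have to be checked more closely. Raising the ttl argument corresponds to allowing the probe to cross routers.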
Inspect your network
If you are in a company network, you should ask the operator whether multicasting is possible and how.
If you are the operator, check in particular whether eth0 (or your default network device) has multicast enabled:
$ /sbin/ifconfig eth0
eth0 Link encap:Ethernet HWaddr bc:ae:c5:6c:b4:1e
inet addr:192.168.5.10 Bcast:192.168.5.255 Mask:255.255.255.0
inet6 addr: fe80::beae:c5ff:fe6c:b41e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:645651689 errors:0 dropped:0 overruns:0 frame:0
TX packets:980018099 errors:0 dropped:4360 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:248979503115 (231.8 GiB) TX bytes:1133161951156 (1.0 TiB)
Interrupt:251 Base address:0x6000
The keyword "MULTICAST" in the flags line indicates that multicast is enabled. Some devices do not support multicast (e.g. lo and Wifi cards).
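If you want to automate this check, a small helper along the following lines could look for the keyword in the output of ifconfig. The function name is made up for this example. It also accepts the newer ifconfig output format, which prints the flags as "flags=4163<UP,BROADCAST,RUNNING,MULTICAST>".

```python
import re


def multicast_enabled(ifconfig_output):
    """Return True if the MULTICAST flag appears in `ifconfig <device>` output."""
    for line in ifconfig_output.splitlines():
        # Split on whitespace and on the punctuation that newer ifconfig
        # versions put around the flag names.
        if "MULTICAST" in re.split(r"[\s,<>]+", line):
            return True
    return False
```

The text to inspect can be obtained, for example, with subprocess.run(["/sbin/ifconfig", "eth0"], capture_output=True, text=True).stdout.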
Also, there must be a route for "224.0.0.0/4" pointing to eth0. It is sufficient if the default route is set, i.e. if there is a line with the destination "0.0.0.0" in the output of
$ /sbin/route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
192.168.5.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
0.0.0.0 192.168.5.1 0.0.0.0 UG 0 0 0 eth0
If there is no such line, you need to set a special route for multicast traffic:
$ /sbin/route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0
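The routing check can also be expressed as a short script. The sketch below parses the output of "/sbin/route -n" and reports the device of the most specific route that covers the multicast address, or None if neither a multicast route nor a default route exists. The function name is hypothetical, and the column layout is assumed to be the one shown above.

```python
import ipaddress


def multicast_route_device(route_output, group="224.0.0.1"):
    """Return the device used to reach `group`, or None if no route matches."""
    target = ipaddress.ip_address(group)
    best = None  # (network, device) of the most specific matching route
    for line in route_output.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # not a routing entry
        destination, genmask, device = fields[0], fields[2], fields[7]
        try:
            network = ipaddress.ip_network(destination + "/" + genmask)
        except ValueError:
            continue  # header line or an entry we cannot interpret
        if target in network:
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, device)
    return best[1] if best else None
```

For the routing table shown above the function returns "eth0", because the default route covers 224.0.0.1. If it returns None, the special multicast route needs to be added.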
Inspect the Plasma configuration
The configuration of the namenode is relevant here (namenode.conf).
Check the following settings (all in the datanodes section):
discovery: This parameter sets the IP address to which the multicast messages are sent. The default is "224.0.0.1", which is the "all hosts" address. All hosts of the local network are automatically members of this group, or in other words, messages sent to this IP reach all hosts like a broadcast. In order to change it, use the syntax
discovery { addr = "224.0.0.2" };
multicast_ttl: This parameter limits the number of hops. You should set it to n+1, where n is the maximum number of routers between the namenode and any datanode.
After making the necessary changes and redeploying, you should check the log files of the namenode to see whether any datanodes are discovered. These messages look like
[Mon Jan 30 21:05:54 2012] [Nn_monitor] [info] Discovered datanode f758c8022530ce0ea8c23b35f28dedb1 at 127.0.0.1:2728 with size 6553600 (enabled)
If there is an error with multicasting, one possible error message is
[Mon Jan 30 18:34:38 2012] [Nn_monitor] [alert] Datanode discovery: Multicast request to 224.0.0.1 cannot be routed. This seems to be an error in the multicast configuration of this host
Not all problems can be discovered, however. Sometimes, the only visible effect is that no datanodes are found.
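To check a longer log file for these messages, a small script can help. The following sketch assumes that each log entry occupies a single line and has exactly the format of the two examples above. The function name and the names of the extracted fields are chosen for this example and are not part of Plasma.

```python
import re

# Pattern for the "Discovered datanode" message shown above.
DISCOVERED = re.compile(
    r"\[(?P<time>[^\]]+)\] \[Nn_monitor\] \[info\] Discovered datanode "
    r"(?P<identity>[0-9a-f]+) at (?P<host>[^\s:]+):(?P<port>\d+) "
    r"with size (?P<size>\d+) \((?P<state>\w+)\)"
)
# Pattern for any alert written by the monitor component.
ALERT = re.compile(
    r"\[(?P<time>[^\]]+)\] \[Nn_monitor\] \[alert\] (?P<message>.*)"
)


def scan_namenode_log(lines):
    """Return (discovered, alerts) found in an iterable of log lines.

    `discovered` is a list of dicts with string values, `alerts` a list of
    alert message texts.
    """
    discovered, alerts = [], []
    for line in lines:
        match = DISCOVERED.search(line)
        if match:
            discovered.append(match.groupdict())
            continue
        match = ALERT.search(line)
        if match:
            alerts.append(match.group("message").strip())
    return discovered, alerts
```

An empty list of discovered datanodes together with an empty list of alerts corresponds to the case where the only visible effect is that no datanodes are found.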
Unicast discovery
The alternative solution to the discovery problem is to enter the IP addresses to check manually. The list should at minimum include all hosts that are currently used as datanodes, but it is also allowed to add IP addresses of hosts that might become datanodes in the future.
There are two methods. First, you can put the IP addresses directly into namenode.conf. The datanodes section must be extended by discovery subsections with all IP addresses (or hostnames), as in
datanodes {
discovery { addr="10.0.0.1" };
discovery { addr="10.0.0.2" };
discovery { addr="10.0.0.3" };
...
}
The second method is to enter only the name of a file containing the IP addresses or hostnames:
datanodes {
discovery_list = "/path/to/file";
}
The file should contain one IP address or hostname per line. Of course, one possible candidate for this file is datanode.hosts.
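The two methods carry the same information, so one form can be derived from the other. As an illustration, the following sketch reads a host list with one entry per line and renders the corresponding discovery subsections for the first method. Both function names are hypothetical. The output contains only the discovery entries, whereas a real datanodes section may hold further parameters that must be kept.

```python
def read_hosts(text):
    """Return the non-empty lines of a host list, stripped of whitespace."""
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            hosts.append(line)
    return hosts


def render_datanodes_section(hosts):
    """Render one discovery subsection per host, in the syntax shown above."""
    lines = ["datanodes {"]
    for host in hosts:
        lines.append('  discovery { addr="%s" };' % host)
    lines.append("}")
    return "\n".join(lines)
```

For example, render_datanodes_section(read_hosts(open("/path/to/file").read())) produces a block that can be merged into namenode.conf by hand.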