Today, I would like to introduce the configuration of an HDFS server.
Setting Up an HDFS Server Using Apache Hadoop 2.6.0
1) Preparing the Hadoop Cluster Environment
- Operating System User: gbase
- SSH trust has been established between all cluster nodes.
- The C3 tool is already configured in the cluster.
Open Source Product Versions:
- Apache Hadoop 2.6.0
- JVM Version 1.6 or 1.7
Example Configuration:
IP | Hostname | Role |
---|---|---|
192.168.10.114 | ch-10-114 | NameNode, DataNode |
192.168.10.115 | ch-10-115 | DataNode |
192.168.10.116 | ch-10-116 | DataNode |
2) Configuring Hostnames
Each node needs the correct hostname mappings in its /etc/hosts file. For example, on node 192.168.10.114 the file should look as follows. The other nodes can copy this configuration directly.
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.10.114 ch-10-114
192.168.10.115 ch-10-115
192.168.10.116 ch-10-116
Note: If the first line also maps the node's own hostname to 127.0.0.1, as shown below, the Hadoop DataNodes cannot connect to the NameNode after installation.
127.0.0.1 ch-10-114 localhost localhost.localdomain localhost4 localhost4.localdomain4
If the cluster does not have a DNS server to resolve the hostnames of Hadoop's NameNode and DataNodes, you need to configure the /etc/hosts file on every coordinator node that executes the load task and on every data node in the cluster. Add the IP-to-hostname mappings of the NameNode and DataNodes as shown above. If the /etc/hosts file is not configured, an error like “Couldn't resolve hostname” is reported when loading files from the HDFS server.
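The loopback mapping described in the note above is easy to overlook, so a small check can help before installation. The following is a minimal sketch, not part of Hadoop: the function name hostname_on_loopback is hypothetical, and it assumes a standard hosts-file layout (address first, then names).

```shell
# Report whether a hostname is mapped to a loopback address in a hosts file.
# Usage: hostname_on_loopback HOSTNAME [HOSTS_FILE]
# Exit status 0 means the problematic mapping exists; 1 means it does not.
hostname_on_loopback() {
    host="$1"
    hosts_file="${2:-/etc/hosts}"
    awk -v h="$host" '
        { sub(/#.*/, "") }                  # strip trailing comments
        $1 ~ /^127\./ || $1 == "::1" {      # look at loopback entries only
            for (i = 2; i <= NF; i++) if ($i == h) found = 1
        }
        END { exit found ? 0 : 1 }
    ' "$hosts_file"
}

# Example (run on the NameNode):
#   hostname_on_loopback ch-10-114 && echo "fix /etc/hosts first"
```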
Check Method:
Use the jps command to check. If the DataNode has started but its log shows continuous attempts to connect to the NameNode's port 9000 (the HDFS RPC port), run netstat -an on the NameNode. You should see something like this:
$ netstat -an | grep 9000
tcp 0 0 127.0.0.1:9000 0.0.0.0:* LISTEN
Error Reason: The TCP listener is bound to 127.0.0.1, so only the local machine can connect to port 9000. This is caused by the incorrect /etc/hosts configuration on the NameNode.
Solution: Remove the hostname (ch-10-114) from the first line, or move the first line below the cluster entries.
Correct configuration:
192.168.10.114 ch-10-114
192.168.10.115 ch-10-115
192.168.10.116 ch-10-116
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
Restart HDFS and check again with netstat -an | grep 9000. The listener should now be bound to the node's cluster IP address:
$ netstat -an | grep 9000
tcp 0 0 192.168.10.114:9000 0.0.0.0:* LISTEN
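The netstat check above can also be scripted. This is only a sketch: listen_addresses is a hypothetical helper, and it assumes the Linux netstat -an column layout shown above (local address in the fourth column, state in the last).

```shell
# Print the local addresses that are listening on a TCP port,
# given `netstat -an` output on standard input.
# Usage: netstat -an | listen_addresses PORT
listen_addresses() {
    port="$1"
    awk -v p="$port" '
        $NF == "LISTEN" {
            n = split($4, parts, ":")       # the port is the last ":" field
            if (parts[n] == p)
                print substr($4, 1, length($4) - length(p) - 1)
        }
    '
}

# Example (run on the NameNode):
#   netstat -an | listen_addresses 9000
```

If the only address printed is 127.0.0.1, the NameNode is still bound to the loopback interface and the /etc/hosts file needs to be corrected.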
3) Directory Planning
Directory | Purpose |
---|---|
/home/gbase/bin | Stores the Hadoop ecosystem, including Hadoop itself |
/home/gbase/hdfs | Stores HDFS files, including tmp, name, and data |
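Hadoop creates most of these directories on demand, but creating them up front makes ownership and permission problems visible early. A minimal sketch, assuming the layout in the table above (make_hdfs_dirs is a hypothetical helper name):

```shell
# Create the tmp, name, and data subdirectories under an HDFS base directory.
# Usage: make_hdfs_dirs [BASE_DIR]   (defaults to /home/gbase/hdfs)
make_hdfs_dirs() {
    base="${1:-/home/gbase/hdfs}"
    for sub in tmp name data; do
        mkdir -p "$base/$sub" || return 1
    done
}
```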
Add the environment variable ${HADOOP_HOME}:
$ echo "export HADOOP_HOME=/home/gbase/bin/hadoop-2.6.0" >> ~/.bashrc
$ . ~/.bashrc
Note: ${HADOOP_HOME} refers to /home/gbase/bin/hadoop-2.6.0 below.
4) Preparing Hadoop 2.6.0
Extract hadoop-2.6.0.tar.gz to /home/gbase/bin on each node.
$ tar xfz hadoop-2.6.0.tar.gz -C /home/gbase/bin
5) Configuring hadoop-env.sh
File path: ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh
$ cd ${HADOOP_HOME}
$ vi etc/hadoop/hadoop-env.sh
Configure both NameNode and DataNode as follows.
Change export JAVA_HOME=$JAVA_HOME to:
export JAVA_HOME=/usr/lib/jvm/jre-1.6.0-openjdk.x86_64
Change export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"} to:
export HADOOP_CONF_DIR=/home/gbase/bin/hadoop-2.6.0/etc/hadoop
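Because the same edit has to be made on every node, it may be convenient to apply it without opening an editor. The sketch below rewrites the JAVA_HOME line with sed. The function name set_java_home is hypothetical, and it assumes the path contains no "|" or "&" characters, which would confuse the sed expression.

```shell
# Rewrite the `export JAVA_HOME=...` line of a hadoop-env.sh file in place.
# Usage: set_java_home ENV_FILE JAVA_HOME_PATH
set_java_home() {
    env_file="$1"
    java_home="$2"
    tmp_file="$env_file.tmp.$$"
    # "|" is the sed delimiter because the path itself contains "/".
    sed "s|^export JAVA_HOME=.*|export JAVA_HOME=$java_home|" "$env_file" > "$tmp_file" &&
        cat "$tmp_file" > "$env_file" &&    # keeps the original file's permissions
        rm -f "$tmp_file"
}

# Example:
#   set_java_home "$HADOOP_HOME/etc/hadoop/hadoop-env.sh" \
#       /usr/lib/jvm/jre-1.6.0-openjdk.x86_64
```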
6) Configuring core-site.xml
File path: ${HADOOP_HOME}/etc/hadoop/core-site.xml
$ cd ${HADOOP_HOME}
$ vi etc/hadoop/core-site.xml
Configure both NameNode and DataNode as follows. (In Hadoop 2.x, fs.default.name is a deprecated alias of fs.defaultFS; both are accepted.)
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://ch-10-114:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/gbase/hdfs/tmp</value>
</property>
</configuration>
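After copying the file to every node, it can be useful to confirm that each copy carries the same NameNode address. Hadoop's own hdfs getconf -confKey command reports the effective value. The sketch below is a lighter alternative that reads the file directly. It is not a real XML parser: get_property is a hypothetical helper, and it assumes each name and value element sits on its own line, as in the listing above.

```shell
# Print the <value> of a named property from a Hadoop *-site.xml file.
# Usage: get_property PROPERTY_NAME XML_FILE
get_property() {
    prop="$1"
    xml_file="$2"
    awk -v p="$prop" '
        index($0, "<name>" p "</name>") { want = 1; next }
        want && /<value>/ {
            sub(/.*<value>/, "")            # drop everything up to the value
            sub(/<\/value>.*/, "")          # drop the closing tag and the rest
            print
            exit
        }
    ' "$xml_file"
}

# Example:
#   get_property fs.default.name "$HADOOP_HOME/etc/hadoop/core-site.xml"
```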
7) Configuring hdfs-site.xml
File path: ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml
$ cd ${HADOOP_HOME}
$ vi etc/hadoop/hdfs-site.xml
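NameNode Configuration (dfs.name.dir and dfs.permissions are the older names of dfs.namenode.name.dir and dfs.permissions.enabled; Hadoop 2.6.0 still accepts them):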
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:/home/gbase/hdfs/name</value>
<description>name node dir </description>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
DataNode Configuration (dfs.data.dir is the older name of dfs.datanode.data.dir; Hadoop 2.6.0 still accepts it):
<configuration>
<property>
<name>dfs.data.dir</name>
<value>file:/home/gbase/hdfs/data</value>
<description>data node dir</description>
</property>
</configuration>
8) Configuring Masters and Slaves
File paths:
${HADOOP_HOME}/etc/hadoop/masters
${HADOOP_HOME}/etc/hadoop/slaves
These two files only need to be configured on the NameNode.
$ cd ${HADOOP_HOME}
$ vi etc/hadoop/masters
Contents of ${HADOOP_HOME}/etc/hadoop/masters:
ch-10-114
$ cd ${HADOOP_HOME}
$ vi etc/hadoop/slaves
Contents of ${HADOOP_HOME}/etc/hadoop/slaves:
ch-10-114
ch-10-115
ch-10-116
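Every hostname in the slaves file must resolve, either through DNS or through /etc/hosts as configured in step 2. As a rough cross-check for the /etc/hosts case, the following sketch lists slaves entries that have no mapping. The helper name unmapped_slaves is hypothetical.

```shell
# List hostnames from a slaves file that have no entry in a hosts file.
# Prints one missing hostname per line; prints nothing when all are mapped.
# Usage: unmapped_slaves SLAVES_FILE [HOSTS_FILE]
unmapped_slaves() {
    slaves_file="$1"
    hosts_file="${2:-/etc/hosts}"
    awk '
        FILENAME == ARGV[1] {               # first file: the hosts file
            sub(/#.*/, "")
            for (i = 2; i <= NF; i++) known[$i] = 1
            next
        }
        NF && !($1 in known) { print $1 }   # second file: the slaves file
    ' "$hosts_file" "$slaves_file"
}

# Example:
#   unmapped_slaves "$HADOOP_HOME/etc/hadoop/slaves"
```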
9) Formatting the NameNode
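The NameNode must be formatted before HDFS is started for the first time. Be careful: the first command below uses cexec to delete everything under /home/gbase/hdfs on every node in the cluster, so run it only on a new cluster or when the existing HDFS data can be discarded.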
$ cexec rm -fr /home/gbase/hdfs/*
$ cd ${HADOOP_HOME}
$ bin/hdfs namenode -format
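Formatting an existing NameNode discards its metadata, so a guard that detects an already formatted directory can prevent accidents. A formatted name directory contains a current/VERSION file. The sketch below tests for it. The function name namenode_is_formatted is hypothetical, and the default path assumes the dfs.name.dir value from step 7.

```shell
# Succeed if the given NameNode directory already holds formatted metadata,
# which is recognizable by the presence of current/VERSION.
# Usage: namenode_is_formatted [NAME_DIR]
namenode_is_formatted() {
    name_dir="${1:-/home/gbase/hdfs/name}"
    [ -f "$name_dir/current/VERSION" ]
}

# Example: format only when no metadata exists yet.
#   namenode_is_formatted || "$HADOOP_HOME/bin/hdfs" namenode -format
```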
10) Starting HDFS
$ cd ${HADOOP_HOME}
$ sbin/start-dfs.sh
After starting, use the jps command to check the processes on each node. The following output indicates successful startup:
$ cexec jps
************************* test *************************
--------- 192.168.10.114---------
31318 SecondaryNameNode
31133 NameNode
31554 Jps
--------- 192.168.10.115---------
10835 DataNode
11000 Jps
--------- 192.168.10.116---------
10145 DataNode
10317 Jps
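Reading the jps output by eye works for three nodes, but the check can be automated on larger clusters. The sketch below tests whether a given daemon appears in jps output. The helper has_daemon is hypothetical, and it assumes the two-column "PID name" format shown above. It compares the whole name, so SecondaryNameNode does not count as NameNode.

```shell
# Succeed if `jps` output on standard input lists the named daemon.
# Usage: jps | has_daemon DAEMON_NAME
has_daemon() {
    daemon="$1"
    awk -v d="$daemon" '$2 == d { found = 1 } END { exit found ? 0 : 1 }'
}

# Example (run on a DataNode host):
#   jps | has_daemon DataNode || echo "DataNode is not running"
```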
11) Stopping HDFS
$ cd ${HADOOP_HOME}
$ sbin/stop-dfs.sh