Wednesday, March 11, 2015

Introduction to Parallel Computing Part 1b - Creating a Hadoop cluster (the hard way)

Before we start deploying, we need to plan a few things ahead. For instance, is your system supported by Hadoop?

In theory, pretty much any system that can run Java can run Hadoop. In practice, it's best to use one of the following 64-bit operating systems:
  • Red Hat Enterprise Linux (RHEL) v6.x
  • Red Hat Enterprise Linux (RHEL) v5.x (deprecated)
  • CentOS v6.x
  • CentOS v5.x (deprecated)
  • Oracle Linux v6.x
  • Oracle Linux v5.x (deprecated)
  • SUSE Linux Enterprise Server (SLES) v11, SP1 and SP3
  • Ubuntu Precise v12.04

If you intend to deploy on one of these distros, then great. I'm going to use RHEL 6.6.

Let's press on with the environment setup. For this, I will have just one namenode (hadoop1) and five datanodes (hadoop4, hadoop5, hadoop6, hadoop7 and hadoop8).

Hadoop Cluster

Node Type and Number   Node Name   IP
Namenode #1            hadoop1     192.168.0.101
Datanode #1            hadoop4     192.168.0.104
Datanode #2            hadoop5     192.168.0.105
Datanode #3            hadoop6     192.168.0.106
Datanode #4            hadoop7     192.168.0.107
Datanode #5            hadoop8     192.168.0.108


First of all, let's update everything (This should be issued on every server in our cluster):

[root@hadoop1 ~]# yum -y update

We'll need these packages as well (again, this should be issued on every server in our cluster):

[root@hadoop1 ~]# yum -y install openssh-clients.x86_64
[root@hadoop1 ~]# yum -y install wget

Note: It's always best to actually have your node FQDNs on your DNS server and skip the next couple of steps (editing the /etc/hosts, /etc/host.conf and /etc/nsswitch.conf files).

Now, let's edit our /etc/hosts to reflect our cluster (This should be issued on every server in our cluster):

[root@hadoop1 ~]# vi /etc/hosts
192.168.0.101   hadoop1
192.168.0.104   hadoop4
192.168.0.105   hadoop5
192.168.0.106   hadoop6
192.168.0.107   hadoop7
192.168.0.108   hadoop8

We should also check our /etc/host.conf and our /etc/nsswitch.conf to make sure the names we just added to /etc/hosts actually get resolved:

[hadoop@hadoop1 ~]$ vi /etc/host.conf
multi on
order hosts bind
[hadoop@hadoop1 ~]$ vi /etc/nsswitch.conf
....
#hosts:     db files nisplus nis dns
hosts:      files dns
....
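
A quick optional sanity check before moving on: every node name should now resolve locally through /etc/hosts.

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do getent hosts "$host" || echo "cannot resolve $host"; done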

We'll need a large number of file descriptors (This should be issued on every server in our cluster):

[root@hadoop1 ~]# vi /etc/security/limits.conf
....
* soft nofile 65536
* hard nofile 65536
....
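
These limits only kick in for new login sessions (and the * wildcard doesn't cover root), so once the hadoop user is created further down you can verify from a fresh hadoop login; the values should match what we just set:

[hadoop@hadoop1 ~]$ ulimit -Sn    # soft nofile limit, expect 65536
[hadoop@hadoop1 ~]$ ulimit -Hn    # hard nofile limit, expect 65536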

We should make sure that our network interface comes up automatically:

[root@hadoop1 ~]# vi /etc/sysconfig/network-scripts/ifcfg-eth0
....
ONBOOT="yes"
....

And of course make sure our other networking settings, such as our hostname and gateway, are correct:

[root@hadoop1 ~]# vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=hadoop1
GATEWAY=192.168.0.1


We'll need to log in using SSH as root, so for the time being let's allow root logins. We might want to turn that off after we're done, as this is as insecure as they come:

[root@hadoop1 ~]# vi /etc/ssh/sshd_config
....
PermitRootLogin yes
....
[root@hadoop1 ~]# service sshd restart
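
Once we're done with the whole installation we can flip this back; something like the following should do it (a sketch, assuming the PermitRootLogin line reads exactly as above):

[root@hadoop1 ~]# sed -i 's/^PermitRootLogin yes/PermitRootLogin no/' /etc/ssh/sshd_config
[root@hadoop1 ~]# service sshd restart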

NTP should be installed on every server in our cluster. Now that we've edited our hosts file, though, this is much easier; we can just loop over the nodes:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "exec yum -y install ntp ntpdate"; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chkconfig ntpd on; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" ntpdate pool.ntp.org; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" service ntpd start; done
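
If you want to confirm the nodes are actually syncing, ntpq (it ships with the ntp package) lists each node's peers; this loop is purely an optional check:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do echo "== $host =="; ssh "$host" ntpq -p; done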

Let's set up our hadoop user:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" groupadd hadoop; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" useradd hadoop -g hadoop -m -s /bin/bash; done       
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh -t "$host" passwd hadoop; done

Set up passwordless ssh authentication:

[root@hadoop1 ~]# su - hadoop
[hadoop@hadoop1 ~]$ ssh-keygen -t rsa 
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do ssh "$host" mkdir -p .ssh; done
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh-copy-id -i ~/.ssh/id_rsa.pub "$host"; done
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chmod 700 .ssh; done
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" chmod 640 .ssh/authorized_keys; done
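
Before moving on it's worth confirming that key-based login really works everywhere. With BatchMode on, ssh fails instead of falling back to a password prompt, so this should print every hostname without asking for anything:

[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh -o BatchMode=yes "$host" hostname; done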

Time to tone down our security a bit so that our cluster runs without problems. My PC's IP is 192.168.0.55 so I will allow that as well:

[root@hadoop1 ~]# iptables -F
[root@hadoop1 ~]# iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -i lo -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -s 192.168.0.101,192.168.0.104,192.168.0.105,192.168.0.106,192.168.0.107,192.168.0.108 -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -s 192.168.0.55 -j ACCEPT
[root@hadoop1 ~]# iptables -A INPUT -j DROP
[root@hadoop1 ~]# iptables -A FORWARD -j DROP
[root@hadoop1 ~]# iptables -A OUTPUT -j ACCEPT
[root@hadoop1 ~]# iptables-save > /etc/sysconfig/iptables
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp /etc/sysconfig/iptables "$host":/etc/sysconfig/iptables; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "iptables-restore < /etc/sysconfig/iptables"; done
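
To double-check what actually ended up on each node (optional):

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do echo "== $host =="; ssh "$host" iptables -L INPUT -n; done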

Let's disable SELinux:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" setenforce 0; done
[root@hadoop1 ~]# vi /etc/sysconfig/selinux
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
#     targeted - Targeted processes are protected,
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp /etc/sysconfig/selinux "$host":/etc/sysconfig/selinux; done
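
getenforce should now report Permissive on every node (it will only say Disabled after a reboot, since the file we just copied takes effect at boot time):

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" getenforce; done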

Turn down swappiness:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" "echo vm.swappiness = 1 >> /etc/sysctl.conf"; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" sysctl -p; done
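
Each node should now report a swappiness of 1:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" cat /proc/sys/vm/swappiness; done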

Hadoop needs a Java SDK installed. Although it will probably work if you just pick the latest one, it's best to check which one is recommended for your Hadoop version; OpenJDK 1.7.0 was the recommendation for mine (2.6.0).

Download the latest Java SDK:

a) Go to http://www.oracle.com/technetwork/java/javase/downloads/index.html
b) Select the proper Java SDK version (make sure it's the x64 version)
c) Accept the licence
d) Select the .tar.gz archive that matches your system's architecture
e) Copy its link location
f) And:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" 'wget --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/7u75-b13/jdk-7u75-linux-x64.tar.gz'; done

I'm going to install Java SDK in /opt/jdk:

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" mkdir /opt/jdk; done
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh "$host" 'tar -zxf jdk-7u75-linux-x64.tar.gz -C /opt/jdk'; done
[root@hadoop1 ~]# /opt/jdk/jdk1.7.0_75/bin/java -version
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)
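
That checks the namenode; it doesn't hurt to confirm the JDK unpacked on every node as well (optional):

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do echo "== $host =="; ssh "$host" /opt/jdk/jdk1.7.0_75/bin/java -version; done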

Let's prepare our system environment variables. Here I'm going to install Hadoop in /opt/hadoop.

[root@hadoop1 ~]# su - hadoop
[hadoop@hadoop1 ~]$ vi ~/.bashrc
....
export HADOOP_HOME=/opt/hadoop
export HADOOP_PREFIX=$HADOOP_HOME  
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=/opt/hadoop/lib/native
export YARN_HOME=$HADOOP_HOME
export JAVA_HOME=/opt/jdk/jdk1.7.0_75
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export HIVE_HOME=/opt/hive
export PATH=$PATH:/opt/jdk/jdk1.7.0_75/bin:/opt/hadoop/bin:/opt/hadoop/sbin:/opt/pig/bin:/opt/hive/bin:/opt/hive/hcatalog/bin:/opt/hive/hcatalog/sbin
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp ~/.bashrc "$host":~/.bashrc; done
[hadoop@hadoop1 ~]$ source ~/.bashrc

Time to download and extract the latest Hadoop:

Go to:

http://hadoop.apache.org/releases.html

select "Download a release now", select a mirror and copy its link location

[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh hadoop@"$host" 'wget http://www.eu.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz'; done
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh root@"$host" mkdir /opt/hadoop; done
[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh root@"$host" 'tar -zxf /home/hadoop/hadoop-2.6.0.tar.gz -C /opt/hadoop/ --strip-components=1'; done
[hadoop@hadoop1 ~]$ exit

Let's edit our config files:

core-site.xml

[root@hadoop1 ~]# vi /opt/hadoop/etc/hadoop/core-site.xml
<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://hadoop1:9000/</value>
      <description>Hadoop Filesystem</description>
   </property>
   <property>
      <name>dfs.permissions</name>
      <value>false</value>
   </property>
</configuration>

Notice that I have put my intended namenode as the fs.defaultFS value.

hdfs-site.xml 

[root@hadoop1 ~]# vi /opt/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>/opt/hadoop/hdfs/datanode</value>       
      <description>DataNode directory for storing data chunks.</description>
      <final>true</final>
   </property>
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>/opt/hadoop/hdfs/namenode</value>  
      <description>NameNode directory for namespace and transaction logs storage.</description>
      <final>true</final>
   </property>
   <property>
      <name>dfs.replication</name>
      <value>2</value>
      <description>Level of replication for each chunk.</description>
   </property>
</configuration>

The replication factor is a property set in the HDFS configuration file that controls the default replication level for the entire cluster.

For each block stored in HDFS, there will be n–1 duplicated blocks distributed across the cluster.
For example, if the replication factor was set to 3 (default value in HDFS) there would be one original block
and two replicas.
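
The value in hdfs-site.xml is only the cluster default; replication can also be changed per file once HDFS is up. For example (an illustration only, /user/hadoop/somefile is a placeholder path):

[hadoop@hadoop1 ~]$ hdfs dfs -setrep -w 3 /user/hadoop/somefile      # raise this file to 3 replicas and wait for it
[hadoop@hadoop1 ~]$ hdfs fsck /user/hadoop/somefile -files -blocks   # show each block and its replication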

mapred-site.xml

[root@hadoop1 ~]# cp /opt/hadoop/etc/hadoop/mapred-site.xml.template /opt/hadoop/etc/hadoop/mapred-site.xml
[root@hadoop1 ~]# vi /opt/hadoop/etc/hadoop/mapred-site.xml
<configuration>
   <property>
      <name>mapreduce.jobtracker.address</name>
      <value>hadoop1:9001</value>
   </property>
</configuration>

This is the host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. 

yarn-site.xml 

[root@hadoop1 ~]# vi /opt/hadoop/etc/hadoop/yarn-site.xml
<configuration>
   <property>
     <name>yarn.resourcemanager.hostname</name>
     <value>hadoop1</value>
     <description>The hostname of the ResourceManager</description>
   </property>
   <property>
     <name>yarn.nodemanager.aux-services</name>
     <value>mapreduce_shuffle</value>
     <description>shuffle service for MapReduce</description>
   </property>
</configuration>


hadoop-env.sh  

[root@hadoop1 ~]# vi /opt/hadoop/etc/hadoop/hadoop-env.sh
....
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/opt/jdk/jdk1.7.0_75
....
#export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
....

Let's create the namenode and datanode directories that we set in hdfs-site.xml and do the rest of the menial configuration tasks:

[root@hadoop1 ~]# mkdir -p /opt/hadoop/hdfs/namenode
[root@hadoop1 ~]# mkdir -p /opt/hadoop/hdfs/datanode
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do scp -r /opt/hadoop/etc/* "$host":"/opt/hadoop/etc/."; done
[root@hadoop1 ~]# echo "hadoop1" > /opt/hadoop/etc/hadoop/masters
[root@hadoop1 ~]# echo $'hadoop4\nhadoop5\nhadoop6\nhadoop7\nhadoop8' > /opt/hadoop/etc/hadoop/slaves
[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}'); do ssh root@"$host" chown -R hadoop:hadoop /opt/hadoop/; done
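
One detail worth noting: the hdfs/datanode directory above was only created on hadoop1. The DataNode process should create it on first start now that the hadoop user owns /opt/hadoop, but if you prefer to create it explicitly on every datanode (optional):

[root@hadoop1 ~]# for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do ssh "$host" 'mkdir -p /opt/hadoop/hdfs/datanode && chown -R hadoop:hadoop /opt/hadoop/hdfs'; done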

Okay, let's start it all up and see if it works:

[root@hadoop1 ~]# su - hadoop
[hadoop@hadoop1 ~]$ hdfs namenode -format

This will format our HDFS filesystem. And:

[hadoop@hadoop1 ~]$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
15/03/10 17:43:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop1]
hadoop1: starting namenode, logging to /opt/hadoop/logs/hadoop-hadoop-namenode-hadoop1.out
hadoop4: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-hadoop4.out
hadoop7: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-hadoop7.out
hadoop5: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-hadoop5.out
hadoop6: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-hadoop6.out
hadoop8: starting datanode, logging to /opt/hadoop/logs/hadoop-hadoop-datanode-hadoop8.out

Looks like we're up and running, let's make sure:

[hadoop@hadoop1 ~]$ jps
13247 ResourceManager
12922 NameNode
13620 Jps
13113 SecondaryNameNode

Yup, how about on the other nodes?

[hadoop@hadoop1 ~]$ for host in $(grep hadoop /etc/hosts | awk '{print $2}' | grep -v hadoop1); do ssh hadoop@"$host" jps; done
1897 DataNode
1990 NodeManager
2159 Jps
2164 Jps
1902 DataNode
1995 NodeManager
2146 Jps
1884 DataNode
1977 NodeManager
2146 Jps
1884 DataNode
1977 NodeManager
12094 DataNode
12795 Jps
12217 NodeManager
[hadoop@hadoop1 ~]$ exit


Great. Now you can point your web browser at your namenode's IP on port 50070 (http://192.168.0.101:50070 in my case) and you should get an overview of your cluster.
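
You can also pull the same information from the command line; as the hadoop user, both of these should list all five datanodes (a quick optional check):

[hadoop@hadoop1 ~]$ hdfs dfsadmin -report     # should report five live datanodes
[hadoop@hadoop1 ~]$ yarn node -list           # should list five registered NodeManagers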

To finish this off, let's install Pig and Hive. WebHCat and HCatalog are installed with Hive, starting with Hive release 0.11.0.

Go to http://pig.apache.org/releases.html, choose a version, choose a mirror and download:

[root@hadoop1 ~]# wget http://www.eu.apache.org/dist/pig/latest/pig-0.14.0.tar.gz
[root@hadoop1 ~]# mkdir /opt/pig
[root@hadoop1 ~]# tar -zxf pig-0.14.0.tar.gz -C /opt/pig --strip-components=1
[root@hadoop1 ~]# chown -R hadoop:hadoop /opt/pig

I've already taken care of the environment variables earlier (you need to add /opt/pig/bin to the PATH), so all I need to do now is:

[root@hadoop1 ~]# su - hadoop
[hadoop@hadoop1 ~]$ pig -x mapreduce
grunt> QUIT;
[hadoop@hadoop1 ~]$ exit

Great. Now go to https://hive.apache.org/downloads.html, choose download, choose mirror:

[root@hadoop1 ~]# wget http://www.eu.apache.org/dist/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz
[root@hadoop1 ~]# mkdir /opt/hive
[root@hadoop1 ~]# tar -zxf apache-hive-1.1.0-bin.tar.gz -C /opt/hive --strip-components=1
[root@hadoop1 ~]# chown -R hadoop:hadoop /opt/hive

I've already taken care of the environment variables earlier (you need to add /opt/hive/bin to the PATH and export HIVE_HOME=/opt/hive), so all I need to do now is:

[root@hadoop1 ~]# su - hadoop
[hadoop@hadoop1 ~]$ hadoop fs -mkdir /tmp
[hadoop@hadoop1 ~]$ hadoop fs -mkdir -p /user/hive/warehouse
[hadoop@hadoop1 ~]$ hadoop fs -chmod g+w /tmp
[hadoop@hadoop1 ~]$ hadoop fs -chmod g+w /user/hive/warehouse
[hadoop@hadoop1 ~]$ hcat
usage: hcat { -e "<query>" | -f "<filepath>" } [ -g "<group>" ] [ -p "<perms>" ] [ -D"<name>=<value>" ]
 -D <property=value>    use hadoop value for given property
 -e <exec>              hcat command given from command line
 -f <file>              hcat commands in file
 -g <group>             group for the db/table specified in CREATE statement
 -h,--help              Print help information
 -p <perms>             permissions for the db/table specified in CREATE statement
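
As a final sanity check (optional), the Hive CLI itself should come up and answer a trivial query using the default embedded metastore:

[hadoop@hadoop1 ~]$ hive -e 'SHOW DATABASES;'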

Yup, looks like it all works.
