Install Hadoop 2.5.0 on Arch Linux

This post is roughly based on the Apache Hadoop 2.5.0 documentation of Single Node Cluster.

1 Preparation

Before installing Hadoop, we should make sure that ssh, rsync and OpenJDK are already installed on the computer. According to HadoopJavaVersions, OpenJDK 7, which is offered by the Arch official repositories, works well with Hadoop. However, a minor problem during the recent java-common upgrade from 1.6 to 1.7 means the default JVM might not be configured correctly. Run

archlinux-java status

to see whether a default JVM has already been chosen. If not, run

archlinux-java fix

to fix it.

We also want a dedicated user and group for Hadoop. This is quite easy:

sudo groupadd hadoop
sudo useradd -m -g hadoop hduser
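One detail that is easy to miss: useradd creates the account with a locked password, so give hduser one before trying to log in as that user (the su step is just a convenient way to continue the setup):

```shell
# useradd leaves the new account locked; set a password for hduser
sudo passwd hduser
# then switch to the new user for the remaining steps
su - hduser
```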

OK, done. Let's go ahead and install Hadoop.

Note: I did try the instructions on the Hadoop ArchWiki page and installed the Hadoop package from the AUR. But the wiki page seems out of date, and I could not find useful documentation on configuring the AUR package for pseudo-distributed mode.

2 Install Hadoop

2.1 Download the Distribution

First, download the Hadoop 2.5.0 distribution from the Apache Download Mirrors. Extracting it produces a hadoop-2.5.0 directory; move it to a directory you like. I moved it to /usr/local/ and ran the following commands (root privilege may be required):

ln -s /usr/local/hadoop-2.5.0 /usr/local/hadoop
chown -R hduser:hadoop /usr/local/hadoop-2.5.0

2.2 Change Environment Variables

The most important environment variable is JAVA_HOME in /usr/local/hadoop/etc/hadoop/hadoop-env.sh. Usually changing it to /usr/lib/jvm/default would be enough for Arch Linux users.
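If you prefer to make the change non-interactively, a one-liner like this works (assuming Hadoop lives under /usr/local/hadoop as in the previous section):

```shell
# Replace the JAVA_HOME line in hadoop-env.sh with Arch's default JVM symlink
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/default|' \
    /usr/local/hadoop/etc/hadoop/hadoop-env.sh
```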

The following change is not mandatory, but it saves some typing. I added these lines to the init script of my favorite shell (as user hduser):

export HADOOP_PREFIX=/usr/local/hadoop
export PATH=$PATH:${HADOOP_PREFIX}/bin:${HADOOP_PREFIX}/sbin
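To confirm the change took effect, open a new shell (or re-source the init script) and run a small check like this, using the /usr/local/hadoop path from above:

```shell
# Is the Hadoop bin directory actually on PATH?
case ":$PATH:" in
  *":/usr/local/hadoop/bin:"*) echo "hadoop on PATH" ;;
  *) echo "hadoop NOT on PATH" ;;
esac
```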

2.3 Standalone Mode

Now Hadoop should work in a non-distributed mode. The following commands are based on Apache Hadoop Documentation.

mkdir input
cp ${HADOOP_PREFIX}/etc/hadoop/*.xml input
hadoop jar ${HADOOP_PREFIX}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar grep input output 'dfs[a-z.]+'
cat output/*

The following sections are only essential for the pseudo-distributed operation.

2.4 Setup Passphraseless SSH

Now log in as hduser, then run:

ssh-keygen -t rsa -P '' -f $HOME/.ssh/id_rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
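Two things worth checking afterwards: sshd ignores authorized_keys files with loose permissions, and the whole point of the exercise is that ssh localhost no longer prompts. A quick verification:

```shell
# sshd silently ignores authorized_keys that are group/world-accessible,
# so tighten the permissions explicitly
chmod 0700 $HOME/.ssh
chmod 0600 $HOME/.ssh/authorized_keys
# This should now log in and out without prompting for a passphrase
ssh localhost exit && echo "passphraseless ssh works"
```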

2.5 Configuration for Pseudo-Distributed Mode

I added the following lines to ${HADOOP_PREFIX}/etc/hadoop/core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/hdfs</value>
  </property>
</configuration>

The hadoop.tmp.dir property defaults to /tmp/hadoop-${user.name}. I changed it because the contents of /tmp are removed on every reboot, which is not what I want. Then add these lines to ${HADOOP_PREFIX}/etc/hadoop/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
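A stray typo in these XML files produces confusing startup errors, so it can be worth verifying that both files are well-formed before going on (xmllint comes with libxml2, which virtually every Arch system has):

```shell
# Fail loudly if either config file is not well-formed XML
xmllint --noout /usr/local/hadoop/etc/hadoop/core-site.xml \
        /usr/local/hadoop/etc/hadoop/hdfs-site.xml && echo "configs OK"
```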

Then the following commands from the Apache Hadoop Documentation should work.

hdfs namenode -format
start-dfs.sh
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/hduser
hdfs dfs -put ${HADOOP_PREFIX}/etc/hadoop input
hadoop jar ${HADOOP_PREFIX}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar grep input output 'dfs[a-z.]+'
hdfs dfs -get output output
cat output/*
stop-dfs.sh
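While HDFS is up (i.e. after start-dfs.sh and before stop-dfs.sh), a quick way to confirm that the daemons actually started is jps, which ships with the JDK; in Hadoop 2.x the NameNode also serves a status page on port 50070:

```shell
# The running JVMs should include the three HDFS daemons
jps | grep -E 'NameNode|DataNode|SecondaryNameNode'
# The NameNode web interface should answer on port 50070
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:50070/
```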
Junpeng Qiu 05 September 2014