This post roughly follows the Apache Hadoop 2.5.0 Single Node Cluster documentation.
Before installing Hadoop, we should make sure that ssh, rsync and OpenJDK are already installed on the computer. According to HadoopJavaVersions, OpenJDK 7, which is offered by the official Arch repositories, works fine with Hadoop.
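For reference, all three prerequisites can be pulled in with pacman; the exact package name for OpenJDK 7 (jdk7-openjdk below) is an assumption on my part and may differ on your system:
sudo pacman -S --needed openssh rsync jdk7-openjdk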
However, there was a minor problem during java-common's recent upgrade from 1.6 to 1.7: the default JVM might not be configured correctly afterwards. Run
archlinux-java status
to see whether a default JVM has already been chosen. If not, run
archlinux-java fix
to fix it.
We also want a dedicated user and group for Hadoop. This is quite easy:
sudo groupadd hadoop
sudo useradd -m -g hadoop hduser
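Since we will later log into localhost as hduser, it is probably also worth giving the account a password (this step is my own addition, not from the Hadoop docs):
sudo passwd hduser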
OK, done. Let's go ahead and install Hadoop.
Note: I did try the instructions on the Hadoop ArchWiki page and installed the Hadoop package from the AUR. But the wiki page seems out of date, and it is hard to configure the AUR package for pseudo-distributed mode since I didn't manage to find useful documentation on that.
First, download the Hadoop 2.5.0 distribution from the Apache Download Mirrors. Extract it into a hadoop-2.5.0 directory and move that to a directory you like. I moved it to /usr/local/ and ran the following commands (root privileges may be needed):
ln -s /usr/local/hadoop-2.5.0 /usr/local/hadoop
chown -R hduser:hadoop /usr/local/hadoop-2.5.0
The most important environment variable is JAVA_HOME in /usr/local/hadoop/etc/hadoop/hadoop-env.sh. Usually changing it to /usr/lib/jvm/default is enough for Arch Linux users.
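After the change, the relevant line in hadoop-env.sh should look roughly like this:
export JAVA_HOME=/usr/lib/jvm/default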
The following change is not mandatory, but it may save some typing. I added these lines to the init script of my favorite shell (as user hduser):
HADOOP_PREFIX=/usr/local/hadoop
PATH=$PATH:${HADOOP_PREFIX}/bin:${HADOOP_PREFIX}/sbin
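After re-sourcing the init script (or opening a new shell), a quick sanity check that the Hadoop binaries are on the PATH:
hadoop version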
Now Hadoop should work in a non-distributed mode. The following commands are based on Apache Hadoop Documentation.
mkdir input
cp ${HADOOP_PREFIX}/etc/hadoop/*.xml input
hadoop jar ${HADOOP_PREFIX}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar grep input output 'dfs[a-z.]+'
cat output/*
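One caveat worth knowing: the example job refuses to run if the output directory already exists, so remove it before re-running:
rm -r output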
The following sections are only needed for pseudo-distributed operation.
Now ssh into localhost as hduser, then:
ssh-keygen -P ''
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
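As the Apache documentation suggests, check that passphraseless login now works before moving on:
ssh localhost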
I added the following lines to ${HADOOP_PREFIX}/etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/hdfs</value>
  </property>
</configuration>
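Since hadoop.tmp.dir (explained below) now points under hduser's home directory, I also made sure the directory exists; this precaution is mine, not from the docs:
mkdir -p /home/hduser/hdfs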
The hadoop.tmp.dir defaults to /tmp/hadoop-${user.name}. I changed it because the contents under /tmp are removed on every reboot, which is not what I want. Then add these lines to ${HADOOP_PREFIX}/etc/hadoop/hdfs-site.xml (dfs.replication is set to 1 because a pseudo-distributed setup runs only a single DataNode):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Then the following commands from the Apache Hadoop Documentation should work.
hdfs namenode -format
start-dfs.sh
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/hduser
hdfs dfs -put ${HADOOP_PREFIX}/etc/hadoop input
hadoop jar ${HADOOP_PREFIX}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar grep input output 'dfs[a-z.]+'
hdfs dfs -get output output
cat output/*
stop-dfs.sh
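While the DFS daemons are running (i.e. before stop-dfs.sh), the NameNode also serves a web interface, which the Apache 2.x docs say listens on port 50070 by default; it can be checked in a browser or from the command line:
curl -I http://localhost:50070/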