Wednesday, December 28, 2011

How to Set up Hadoop 1.0.0 on RHEL/CentOS

Finally, Hadoop 1.0.0 was released yesterday after six years :).

My research field is Natural Language Processing, and I have used Nutch (with Hadoop) since 2007. I have been following news and articles on Hadoop for a long time, but I had not set up any new version for about two years. So I was very surprised when I visited the Hadoop website today and saw... version 1.0.0. Obviously, I immediately downloaded and installed it on my server.

Now, I will note the steps I used to set up Hadoop on a single node, fix some minor errors, and test it. It is very easy to install Hadoop nowadays, much, much easier than it was two years ago.

Some information about my server:
  • Server: DELL, Intel Xeon X3450 @ 2.67GHz (4 cores, HT enabled).
  • RAM: 8.0 GB
  • OS: Red Hat Enterprise Linux Server release 6.2 (Santiago)
Based on the information I collected during the installation process, it should make no difference if you use CentOS 6, or even RHEL/CentOS 5.

Requirements:
  • Oracle JDK 1.6 (I am not sure whether it works well with OpenJDK or Oracle JDK 1.7.)
Now, we begin...

Step 1. Download and install Oracle JDK 1.6 from the Oracle Java SE site. I used jdk-6u30-linux-x64-rpm.bin and ran the following commands on my Linux box.

# wget http://download.oracle.com/otn-pub/java/jdk/6u30-b12/jdk-6u30-linux-x64-rpm.bin
# chmod +x jdk-6u30-linux-x64-rpm.bin
# ./jdk-6u30-linux-x64-rpm.bin

Make sure that the JDK was installed to /usr/java/default, where default is a symbolic link to latest, and latest is, in turn, a symbolic link to jdk1.6.0_30. Hadoop will use this default location regardless of which Java is the system default (for example, OpenJDK).
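
To double-check that the symlink chain points where Hadoop expects, you can verify it and the Java version directly (the jdk1.6.0_30 directory name depends on the exact update you installed):

# ls -l /usr/java/default /usr/java/latest
# /usr/java/default/bin/java -version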

Step 2. Download and install Hadoop. Hadoop 1.0.0 ships in several packages, including source, a generic binary tarball, rpm, and deb, for both 32-bit (i386) and 64-bit (amd64) systems. I used hadoop-1.0.0-1.amd64.rpm because my OS is 64-bit RHEL. If you use Debian, use the .deb package instead.

# wget http://ftp.jaist.ac.jp/pub/apache/hadoop/common/hadoop-1.0.0/hadoop-1.0.0-1.amd64.rpm
# rpm -ivh hadoop-1.0.0-1.amd64.rpm

Very easy, and very fast, huh!!!
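
Before moving on, you can quickly confirm that the package was installed and that the hadoop command is on your PATH (the exact package name reported by rpm may differ slightly from mine):

# hadoop version
# rpm -qa | grep -i hadoop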

Step 3. Set up Hadoop for use on a single node. This step shows just how easy the installation is.

# hadoop-setup-single-node.sh

Answer "yes" (y) for all questions. After the setup has finished, services of Hadoop will be started automatically, including hadoop-namenode, hadoop-datanode, hadoop-jobtracker, and hadoop-tasktracker.

Step 3a. Fix a minor bug.

All four services seem to start successfully. However, Hadoop is only at 1.0.0, so it is bound to have some bugs :D. Indeed, there is at least one bug related to the mapredsystem directory on HDFS that prevents hadoop-jobtracker from starting, even though the service reported OK when it was started. You can check the log file for details; it is located under /var/log/hadoop/mapred/ with the name hadoop-mapred-jobtracker-$HOSTNAME.log. The log might contain a portion that looks like this:

WARN org.apache.hadoop.mapred.JobTracker: Failed to operate on mapred.system.dir (hdfs://localhost:8020/mapred/mapredsystem) because of permissions.
WARN org.apache.hadoop.mapred.JobTracker: Manually delete the mapred.system.dir (hdfs://localhost:8020/mapred/mapredsystem) and then start the JobTracker.
WARN org.apache.hadoop.mapred.JobTracker: Bailing out ...
org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=mapred, access=WRITE, inode="":hdfs:supergroup:rwxr-xr-x

The log says that Hadoop cannot operate on the MapReduce system directory /mapred/mapredsystem, because the JobTracker runs as user mapred while the parent directory is owned by hdfs, and that owner does not allow others to write to it. I checked the details with the following command.

# hadoop fs -ls /
drwxrwxrwx   - hdfs   supergroup          0 2011-12-28 20:08 /tmp
drwxr-xr-x   - hdfs   supergroup          0 2011-12-28 21:46 

What??? I did not see the directory /mapred. Hadoop might have a bug in the installation script. I fixed it by creating /mapred as the hdfs user and changing the owner of that directory to the mapred user.

# sudo -u hdfs hadoop fs -mkdir /mapred
# sudo -u hdfs hadoop fs -chown mapred /mapred

Restart JobTracker.

# service hadoop-jobtracker restart

I checked the log file again, and everything was ok.
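
For the record, if you list the HDFS root again after the fix, the new directory should show up with mapred as its owner; on my machine the relevant line looked roughly like this (timestamps and sizes will differ):

# hadoop fs -ls /
drwxr-xr-x   - mapred supergroup          0 2011-12-28 21:50 /mapred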

Step 4. Test Hadoop.

Hadoop 1.0.0 ships with an example script to validate the setup. Run it as follows.

# hadoop-validate-setup.sh --user=hdfs

If you get "teragen, terasort, teravalidate passed." near the end of the output, everything is ok.

You can check the progress by accessing JobTracker website via http://localhost:50030/.

You can also check nodes by accessing NameNode website via http://localhost:50070/.
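
If you prefer the command line to the web UIs, roughly the same information is available from the HDFS and MapReduce clients:

# sudo -u hdfs hadoop dfsadmin -report
# sudo -u hdfs hadoop job -list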

I will try to set up a new Hadoop cluster over the New Year holiday and post my notes on this blog ASAP.

Enjoy your New Year Holiday!!!

4 comments:

Anonymous said...

Do you know how to solve the problems below when I run the script hadoop-setup-single-node.sh? Thanks,

/usr/sbin/hadoop-setup-single-node.sh: line 179: /etc/init.d/hadoop-namenode: No such file or directory
/usr/sbin/hadoop-setup-single-node.sh: line 180: /etc/init.d/hadoop-datanode: No such file or directory
/usr/sbin/hadoop-setup-single-node.sh: line 187: /etc/init.d/hadoop-jobtracker: No such file or directory
/usr/sbin/hadoop-setup-single-node.sh: line 188: /etc/init.d/hadoop-tasktracker: No such file or directory

Unknown said...

Hi,

Which distribution do you use? There might have been some error during the installation process: the required services (hadoop-{namenode,datanode,jobtracker,tasktracker}) are not in the right location (/etc/init.d).

Another thing to try is running the following command to locate them.

find / -name hadoop-namenode
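
Another check, assuming the package registered itself under the name hadoop, is to ask rpm where it put the init scripts:

rpm -ql hadoop | grep init.d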

Jinesh M K said...

Thanks for sharing this.
How do I configure a multi-node cluster using Hadoop 1.0?

Unknown said...

Great post Viet, especially the fix to allow JT to properly start.

Do you know how to create an account for yourself so that you don't have to run hadoop as hdfs or mapred? Otherwise, trying to run hadoop as yourself returns:

-bash-4.1$ hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.

Bad connection to FS. command aborted. exception: java.io.IOException: Unknown protocol to job tracker: org.apache.hadoop.hdfs.protocol.ClientProtocol
at org.apache.hadoop.mapred.JobTracker.getProtocolVersion(JobTracker.java:344)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

-bash-4.1$


Thank you.