Learn to build and use multi-node Hadoop clusters running in Amazon EC2. This article assumes a basic knowledge of Hadoop. If you haven't used Hadoop before, you probably want to read the Intro To Hadoop article first.
This article also uses EC2, so reading the EC2 article first is recommended. That article focuses on OpenSolaris, but the concepts are similar. This tutorial can also be followed on a set of local Linux boxes. The EC2 image used is ami-fa6a8e93 (hadoop-images/hadoop-0.19.0-i386.manifest.xml).
See the installing Hadoop article for installation instructions if you're working in a standalone environment. Note that this example uses an older version of Hadoop that uses different names for certain configuration files. Make sure that Hadoop is installed in the same directory on all nodes. Larger environments might want to look at storing Hadoop on NFS. If using EC2, launch the first instance of ami-fa6a8e93. In this tutorial the instance has the IP address 10.251.90.244, and it will serve as the NameNode server and first slave.
If you also wish to look at secondary NameNode failover, create an elastic IP address and assign it to this instance. Failover requires reassigning the IP address, and that must be done with a public elastic IP. In that case, wherever the internal master IP address is displayed, replace it with the elastic IP address.
First create the hadoop user and group.
# groupadd hadoop
# useradd -g hadoop hadoop
Next, set up key-based passwordless SSH.
# su - hadoop
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/*
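As a quick sanity check after the chmod step, the sketch below lists any file in an SSH directory whose mode is not exactly 600. The name check_ssh_perms is a hypothetical helper for illustration, not part of OpenSSH or Hadoop.

```shell
# Sketch: report files under a directory whose mode is not exactly 600.
# check_ssh_perms is a hypothetical helper name (an assumption).
check_ssh_perms() {
    # find prints every regular file whose permission bits differ from 600
    bad=$(find "$1" -type f ! -perm 600)
    if [ -n "$bad" ]; then
        echo "files with wrong permissions:"
        echo "$bad"
        return 1
    fi
    echo "all files in $1 are mode 600"
}

# Example, run as the hadoop user:
#   check_ssh_perms ~/.ssh
```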
For more information about SSH keys, see the SSH article. Now take a second to chown the Hadoop configuration files over to the hadoop user. This is done just to make copying the config files around easier after rebundling the instance.
# chown -R hadoop:hadoop /usr/local/hadoop-0.19.0/conf
At this point, rebundle the configured instance. If you aren't using EC2, skip this step. Rebundling will allow us to spin up multiple slaves already configured with our user setup.
Running EC2 Instance:
# rm -f /root/.*hist* /home/hadoop/.*hist*
# rm -f /var/log/*.gz
# mkdir /mnt/keys
# mv /root/.ssh/authorized_keys /mnt
$ scp /path/to/cert email@example.com:/mnt/keys
$ scp /path/to/key firstname.lastname@example.org:/mnt/keys
Clean up the running instance, then upload your private Amazon AWS certificate and key to it. The directory /mnt/keys will hold the private data. When the image is rebundled, /mnt will be excluded from the bundle. You should never create an AMI with these keys or your private login key stored in it. After the image is rebundled, root's authorized_keys file will be copied back.
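Before bundling, it can help to confirm that each private file really sits under one of the directories passed to the -e option of ec2-bundle-vol later in this article (/mnt,/tmp). The sketch below is a plain string-prefix check under that assumption. is_excluded is a hypothetical helper name, and it does not reproduce how ec2-bundle-vol itself evaluates exclusions (for example, it does not resolve symlinks).

```shell
# Sketch: succeed if a path lies inside one of a comma-separated list of directories.
# is_excluded is a hypothetical helper name (an assumption).
is_excluded() {
    path="$1"
    old_ifs="$IFS"
    IFS=,                       # split the second argument on commas
    for dir in $2; do
        case "$path" in
            "$dir"/*) IFS="$old_ifs"; return 0 ;;
        esac
    done
    IFS="$old_ifs"
    return 1
}

# Paths taken from this tutorial:
if is_excluded /mnt/keys/key.pem /mnt,/tmp; then
    echo "key.pem is under an excluded directory"
fi
```

A file such as /root/.ssh/authorized_keys would fail this check, which is why the tutorial moves it to /mnt before bundling.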
# export BUCKET=higherpass-hadoop
# export IMAGE=higherpass-hadoop0.19.0-linux
# export EC2_PRIVATE_KEY=/mnt/keys/key.pem
# export EC2_CERT=/mnt/keys/cert.pem
# export EC2_KEYID=
# export EC2_KEY=
# export EC2_USERID=
# export ARCH=i386
Set up a number of environment variables to be used during rebundling. EC2_KEYID is set to the access key ID, EC2_KEY is set to the secret access key, and EC2_USERID is set to your AWS account number. These can be found by going to Your Account > Security Credentials in the Amazon AWS web console.
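Because three of the variables above are left blank for you to fill in, a small guard can catch an empty value before the bundling commands run. The following is a minimal sketch. require_bundle_env is a hypothetical helper name, and the variable list simply mirrors the exports in this article.

```shell
# Sketch: fail early if any variable used for rebundling is unset or empty.
# require_bundle_env is a hypothetical helper name (an assumption).
require_bundle_env() {
    missing=0
    for name in BUCKET IMAGE EC2_PRIVATE_KEY EC2_CERT \
                EC2_KEYID EC2_KEY EC2_USERID ARCH; do
        # Indirectly read the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   require_bundle_env && echo "ready to bundle"
```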
# ec2-bundle-vol -r $ARCH -d /mnt -p $IMAGE -u $EC2_USERID -k $EC2_PRIVATE_KEY -c $EC2_CERT -s 10240 -e /mnt,/tmp
# ec2-upload-bundle -b $BUCKET -m /mnt/$IMAGE.manifest.xml -a $EC2_KEYID -s $EC2_KEY
Create the EC2 AMI bundle, excluding /mnt and /tmp, then upload the bundled AMI to Amazon S3. Since the base image doesn't include the ec2-register script, register the AMI from the Amazon AWS web console. You'll need to provide the path to the manifest.xml file stored in S3, which is the value of $BUCKET/$IMAGE.manifest.xml.
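To see the exact string to paste into the console, expand the two variables in the shell. This short sketch uses the BUCKET and IMAGE values from this article; MANIFEST is a name introduced here only for illustration.

```shell
# Values from the export step in this article
BUCKET=higherpass-hadoop
IMAGE=higherpass-hadoop0.19.0-linux

# MANIFEST is an illustrative variable name (an assumption)
MANIFEST="$BUCKET/$IMAGE.manifest.xml"
echo "$MANIFEST"
# prints higherpass-hadoop/higherpass-hadoop0.19.0-linux.manifest.xml
```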
Now that we have a clean AMI, copy root's authorized_keys file back to /root/.ssh. Also feel free to delete the contents of /mnt. On the next page, before we start the slaves, we'll take a second to start using Hadoop.