The AMI we chose includes Hadoop pre-configured for standalone use. The only configuration change needed is to store the HDFS filesystem under /mnt. Create the directory /mnt/hadoop and chown it to the hadoop user.
# mkdir /mnt/hadoop
# chown hadoop:hadoop /mnt/hadoop
Update hadoop.tmp.dir in /usr/local/hadoop-0.19.0/conf/hadoop-default.xml to point to the new location. Also update fs.default.name as shown, which points Hadoop at HDFS instead of the local filesystem.
<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/hadoop</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310/</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
Next, format the HDFS filesystem as the hadoop user. Then, before importing any data, take a moment to check with du or df how much space the HDFS filesystem is consuming.
$ /usr/local/hadoop-0.19.0/bin/hadoop namenode -format
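A minimal sketch of that disk usage check, assuming the storage directory is /mnt/hadoop as configured above. The function name hdfs_local_usage is only illustrative and is not part of Hadoop.

```shell
# Show how much local disk the HDFS storage directory uses.
# The default path matches the hadoop.tmp.dir value set earlier;
# the function name is illustrative, not part of Hadoop.
hdfs_local_usage() {
    dir="${1:-/mnt/hadoop}"
    du -sh "$dir"    # total size of everything under the directory
    df -h "$dir"     # usage of the filesystem that holds it
}

# Example: hdfs_local_usage            (checks /mnt/hadoop)
#          hdfs_local_usage /mnt/input (checks another path)
```

Running it now and again after the import makes the growth easy to compare.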
In the Amazon AWS console, create an EBS volume from EBS snapshot snap-92d333fb. This is the 2000 US Census data, which Amazon provides free of charge, although there are usage costs associated with the EBS volume you create.
Before we can copy the files, we need to attach and mount the new EBS volume. In the Instances pane, take note of the ID of the running instance (i-XXXXXX). Then go to the Volumes pane, select the newly created volume, and click Attach Volume. Select the correct instance and /dev/sdb for the device. Also, for some reason /tmp isn't mode 777, so we change it, since Hadoop writes to it.
# chmod 777 /tmp
# mkdir /mnt/input
# mount /dev/sdb /mnt/input
# su - hadoop
$ /usr/local/hadoop-0.19.0/bin/start-all.sh
$ /usr/local/hadoop-0.19.0/bin/hadoop dfs -mkdir /census_2000
$ for state in `ls /mnt/input/census_2000/datasets/100_and_sample_profile/ | egrep -v "^0"`; do
    unzip /mnt/input/census_2000/datasets/100_and_sample_profile/$state/2k*zip
    /usr/local/hadoop-0.19.0/bin/hadoop dfs -put 2kh*csv /census_2000
    rm 2kh*csv
  done
Here we create /mnt/input and mount the EBS volume on it, then become the hadoop user and start Hadoop. Next we make an input directory in HDFS to store the census data. The loop unzips each state's census data and copies the resulting CSV files into HDFS. Once all the data has been copied into HDFS, the EBS volume is no longer required. Also check the disk usage of the HDFS filesystem again.
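Local tools such as du and df show raw disk usage; to see what HDFS itself reports for the imported data, something like the following sketch can help. It relies only on the dfs -ls and dfs -du shell commands; the function name check_census_import and the HADOOP variable are illustrative.

```shell
# Path to the hadoop launcher used throughout this article.
HADOOP="${HADOOP:-/usr/local/hadoop-0.19.0/bin/hadoop}"

# List the imported files, then their sizes, as HDFS sees them.
# The function name is illustrative, not part of Hadoop.
check_census_import() {
    "$HADOOP" dfs -ls /census_2000 &&
    "$HADOOP" dfs -du /census_2000
}

# Example (as the hadoop user, with Hadoop running):
#   check_census_import
```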
Finally, it is time to run a MapReduce job. Get the code from the Java Hadoop article and compile it as shown. The complete code is available on the last page of the Java tutorial.
Now build the Java code and execute the job:
$ mkdir Gender_classes
$ /usr/local/jdk1.6.0_10/bin/javac -classpath /usr/local/hadoop-0.19.0/hadoop-0.19.0-core.jar -d Gender_classes Gender.java
$ /usr/local/jdk1.6.0_10/bin/jar -cvf Gender.jar -C Gender_classes/ .
$ /usr/local/hadoop-0.19.0/bin/hadoop jar Gender.jar Gender female /census_2000 /out
$ /usr/local/hadoop-0.19.0/bin/hadoop dfs -cat /out/part-00000
First, create a directory, Gender_classes, to store the Java class files. Then compile the Java file with the Hadoop core jar on the classpath. Next, create a jar file, Gender.jar, that packages the compiled classes. With the code ready, execute the Hadoop job. The jar option specifies that a jar will be run and is followed by the name of the jar, then the name of the class to execute. The final three arguments are passed to the code itself: the gender, the input path to the census files, and the job output path. The code parses the input files and finds areas with a larger population of the specified gender. The last command prints the job's output from HDFS.
Before we start preparing Hadoop for a multi-node setup, stop Hadoop.
$ /usr/local/hadoop-0.19.0/bin/stop-all.sh
First we need to set up the hadoop-site.xml configuration file. In later versions of Hadoop (0.20 onward) this file was split into core-site.xml, hdfs-site.xml, and mapred-site.xml.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/hadoop</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.251.90.244:54310</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://10.251.90.244:54311</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
Set the value of fs.default.name to point to the HDFS filesystem on this first node, which will become our NameNode server and JobTracker.
Next, configure the masters file. This file should contain the hostname or IP address of the secondary NameNode server. Set it to the IP address of the current host, which will function as the secondary NameNode server for now. In single-node and smaller multi-node clusters, the secondary NameNode service runs on the main NameNode server.
10.251.90.244
Before we configure the slaves file, let's spin up a second instance. Launch an instance from the AMI we created earlier. The slaves file should contain the hostnames of all the slaves in the cluster. At this point, just list the IP or hostname of the master and of the first slave that was just brought up. If you don't include the master, the HDFS filesystem will come up in safe mode. Normally in a multi-node environment the master would NOT be included as a slave.
10.251.90.244
10.251.121.177
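The slaves file conventionally lists one host per line. As the cluster grows, it can be easier to generate the file than to edit it by hand. The following is a small sketch of that idea; write_slaves is a hypothetical helper, not a Hadoop tool.

```shell
# Write one host per line to a slaves file, replacing its contents.
# write_slaves is a hypothetical helper, not part of Hadoop.
# Usage: write_slaves FILE HOST [HOST...]
write_slaves() {
    file="$1"
    shift
    printf '%s\n' "$@" > "$file"
}

# Example (master first, then the new slave):
#   write_slaves /usr/local/hadoop-0.19.0/conf/slaves 10.251.90.244 10.251.121.177
```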
Now a couple of final things to do. First, copy all the Hadoop configuration files to the slave:
$ scp /usr/local/hadoop-0.19.0/conf/* hadoop@10.251.121.177:/usr/local/hadoop-0.19.0/conf/
Also, on the new slave, create /mnt/hadoop and chown it to the hadoop user. As with the first server, chmod /tmp to 777.
slave# mkdir /mnt/hadoop
slave# chown hadoop:hadoop /mnt/hadoop
slave# chmod 777 /tmp
Now restart Hadoop:
hadoop@master$ /usr/local/hadoop-0.19.0/bin/start-all.sh
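One way to confirm from the master that the new DataNode has registered is to ask the NameNode for its view of the cluster. This sketch uses the dfsadmin -safemode get and dfsadmin -report commands; the function name cluster_status and the HADOOP variable are illustrative.

```shell
# Path to the hadoop launcher used throughout this article.
HADOOP="${HADOOP:-/usr/local/hadoop-0.19.0/bin/hadoop}"

# Print the safe mode state first, then the NameNode's report of
# capacity and of the DataNodes that have registered with it.
cluster_status() {
    "$HADOOP" dfsadmin -safemode get &&
    "$HADOOP" dfsadmin -report
}

# Example (as the hadoop user on the master):
#   cluster_status
```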
After Hadoop restarts, log in to the slave. After a minute or two you should notice the HDFS filesystem consuming more space as replication occurs. Run the job again.
$ /usr/local/hadoop-0.19.0/bin/hadoop jar Gender.jar Gender female /census_2000 /out1
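Note that this run writes to /out1 rather than /out. Hadoop jobs that use the standard file output formats fail at submission if the output directory already exists, so each run needs a fresh path. The sketch below wraps that in a helper that also reports elapsed time, which is handy for comparing runs as slaves are added. The name run_gender_job is illustrative, and the helper deletes the given output directory before running, so point it only at paths you can afford to lose.

```shell
# Path to the hadoop launcher used throughout this article.
HADOOP="${HADOOP:-/usr/local/hadoop-0.19.0/bin/hadoop}"

# Run the Gender job into a fresh output directory and report the
# elapsed wall-clock time. run_gender_job is an illustrative name.
# Usage: run_gender_job OUTPUT_DIR   (e.g. /out2)
run_gender_job() {
    outdir="$1"
    # Remove a stale output directory; ignore the error hadoop
    # reports when the directory does not exist.
    "$HADOOP" dfs -rmr "$outdir" >/dev/null 2>&1 || true
    start=$(date +%s)
    "$HADOOP" jar Gender.jar Gender female /census_2000 "$outdir" || return 1
    end=$(date +%s)
    echo "elapsed: $((end - start))s"
}

# Example (from the directory containing Gender.jar):
#   run_gender_job /out2
```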
Continue adding slaves this way until you feel you have enough. The average run time on a 10-node cluster of small Amazon EC2 instances was 40 seconds.
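When adding several slaves, the per-slave steps can be scripted. The following sketch only prints the commands so they can be reviewed first; it mirrors the manual steps above. Root SSH access to each slave is an assumption, and slave_setup_commands is a hypothetical helper name.

```shell
# Print (not run) the commands needed to prepare each new slave,
# mirroring the manual steps in this article. Root SSH access to
# the slave is an assumption; the function name is illustrative.
# Usage: slave_setup_commands IP [IP...]
slave_setup_commands() {
    conf=/usr/local/hadoop-0.19.0/conf
    for ip in "$@"; do
        # Create the HDFS directory and open up /tmp on the slave.
        echo "ssh root@$ip 'mkdir /mnt/hadoop && chown hadoop:hadoop /mnt/hadoop && chmod 777 /tmp'"
        # Register the slave on the master.
        echo "echo $ip >> $conf/slaves"
        # Copy the Hadoop configuration to the slave.
        echo "scp $conf/* hadoop@$ip:$conf/"
    done
}

# Example: review the printed commands before running any of them.
#   slave_setup_commands 10.251.121.177
```

After the slaves are prepared, restart Hadoop on the master with stop-all.sh and start-all.sh as before.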