Pseudo-distributed mode gives a taste of how Hadoop's distributed daemons work, although in this mode all of the daemons run on a single host. A few additional configuration files need to be updated. The versions below are minimal working examples; in this case the properties could actually all be combined into the configuration tags of core-site.xml.
First edit conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Set the URI of the Hadoop NameNode.
Next edit conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Set the default replication factor for files. This can be overridden with per-file replication settings.
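As an illustration of a per-file override, the fs shell's -setrep command changes the replication factor of a file already stored in HDFS (the path and factor here are hypothetical placeholders, not files created by this tutorial):

```shell
$ hadoop dfs -setrep -w 3 /some/file.txt
```

The -w flag waits for the replication change to actually complete before returning.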
Finally edit conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
Set the host and port of the MapReduce JobTracker.
Next set up password-less SSH with keys. Hadoop's control scripts use SSH to start and stop the daemons, so you need SSH keys that allow Hadoop to log in as needed. For more information see the Setting Up SSH Keys, Using SSH Agents, and Tunnels Tutorial.
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Format the HDFS filesystem to prepare it for use. This simply creates the needed directory structures for the HDFS filesystem.
$ hadoop namenode -format
Now that everything is configured and set up, start the daemons.
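With a stock 0.20 tarball the daemons are started with the start-all.sh script, run from the Hadoop install directory. If a JDK is on your path, jps is a quick way to confirm the processes came up; in pseudo-distributed mode you should see a NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker:

```shell
$ bin/start-all.sh
$ jps
```

The daemons log to the logs/ directory under the install directory, which is the first place to look if one of them is missing from the jps listing.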
Running a job in cluster mode is the same as launching a standalone job, except that the dataset needs to be copied into HDFS before Hadoop can process it. This example will process the text of The Federalist Papers, available from Project Gutenberg at Archive.org.
$ wget http://ia331428.us.archive.org/2/items/thefederalistpap00018gut/feder16.txt
$ hadoop dfs -put feder16.txt /
$ hadoop jar hadoop-0.20.1-examples.jar wordcount /feder16.txt /out
First use wget to download The Federalist Papers from the Project Gutenberg archive. Then copy the file into the Hadoop filesystem and run the wordcount job from the examples jar. That jar includes several examples worth looking at; the source is included in the Hadoop package under src/examples/org/apache/hadoop/examples/ in the standard Java class directory structure. The wordcount job simply counts how many times each word occurs in The Federalist Papers. The output is stored in the /out directory.
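The mechanics of wordcount can be sketched locally with ordinary shell tools. This pipeline is only an illustration of the map, sort, and reduce phases on a made-up line of input, not how the job actually executes:

```shell
# map: split the input into one word per line
# shuffle/sort: bring identical words together
# reduce: count each group of identical words
printf 'to be or not to be\n' | tr -s ' ' '\n' | sort | uniq -c
```

Each output line is a count followed by a word, e.g. a count of 2 for both "to" and "be".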
The streaming module allows any executable to be used as a mapper or reducer in MapReduce jobs. It is included as part of the Hadoop package in the contrib/streaming directory.
$ hadoop jar contrib/streaming/hadoop-0.20.1-streaming.jar -input feder16.txt -output out2 -mapper "grep -i liberty" -reducer "uniq -c"
Again a Hadoop job is run against The Federalist Papers file. This job greps the file for the word liberty, case-insensitively. The reducer then collapses the list of lines containing liberty into a unique list of those lines, counting the number of occurrences of each.
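Locally the same pipeline can be imitated with plain shell, with one caveat: in the real job the framework, not an explicit sort, orders the mapper output before it reaches the reducer, which is why uniq -c (which only counts adjacent duplicates) works there. The sample lines below are made up for illustration:

```shell
# mapper: emit only lines containing "liberty", any case
# (the framework sorts mapper output; here sort stands in for that step)
# reducer: count each distinct line
printf 'the blessings of liberty\nsafety\nthe blessings of liberty\n' \
  | grep -i liberty | sort | uniq -c
```

The output is a single line: a count of 2 for "the blessings of liberty".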