Hadoop is an open-source distributed computing framework built by the Apache project. It is useful for processing large datasets across one or more computers and includes custom filesystem to store data including replication across multiple nodes. There is no direct access to HDFS filesystem (you can't mount it) included in Hadoop.
The flow of a MapReduce job. First the data is split into chunks to be processed in parrallel in the Map job. Then the hadoop framework then takes that data and feeds that into the reduce job. Next the reduce job aggregates all the map jobs back into a single data set. This is discussed in more detail in the streaming jobs section.
The first thing to do is download and unpack Hadoop. Download Hadoop from http://www.apache.org/dyn/closer.cgi/hadoop/core/. Then copy to desired install directory. In a clustered environment Hadoop should be installed into the same path on all nodes.
$ wget http://www.gtlib.gatech.edu/pub/apache/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz $ tar -zxvf hadoop-0.20.1.tar.gz $ sudo cp -r hadoop-0.20.1/ /usr/local
As of this writing 0.20.1 is the current stable version of Hadoop. Use wget to download Hadoop then unpack it and copy it to /usr/local.
There are a few environment variables that need to be setup. First setup the HADOOP_HOME environment variable to the install directory. Then append $HADOOP_HOME/bin to your PATH environment variable. It is advisable to add these to profile setup scripts.
$ export HADOOP_HOME=/usr/local/hadoop-0.20.1 $ export PATH=$PATH:$HADOOP_HOME/bin
For the first example of a standalone setup the only file that needs to be configured is conf/hadoop-env.sh. This file contains the Hadoop environment settings. The JAVA_HOME variable must be set. Clustered setups require a few additional file configurations, conf/core-site.xml which contains core daemon configs, conf/hdfs-site.xml which holds settings for the hdfs filesystem, conf/mapred-site.xml which configures specific mapreduce options.
Run hadoop and if the help was displayed hadoop is ready for standalone operation. Standalone operation doesn't utilize the HDFS filesystem. This is useful for development and testing. The Hadoop job output has been pruned for brevity.
$ hadoop-0.20.1$ hadoop jar hadoop-0.20.1-examples.jar grep /var/log/httpd/access_log out 'favicon.ico'
This example simply greps an apache webserver error log located in /var/log/httpd/access_log for favicon.ico and returns the count of lines that match. Hadoop is executed with the jar option which tells hadoop to execute the following jar file. The grep option passed to the Hadoop binary tells Hadoop which Java class in the jar file to use. The next two options are the input files to read and the output directory to store the output. Finally the last option is the regex to match.