Installing And Using Hadoop

Hadoop Pseudo Distributed Cluster

The pseudo-distributed mode of operation gives a taste of Hadoop's distributed daemons, although in this mode all of the daemons run on a single host. A few additional configuration files need to be updated. The versions shown below are minimal working configurations; in this case the properties could actually all be combined into the configuration tags of core-site.xml.

First edit conf/core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Set the URI of the Hadoop NameNode.

Next edit conf/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Set the default replication factor for each file. This can be overridden with per-file replication settings.
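For example, the replication factor of a file already stored in HDFS can be changed with the setrep command. This is just an illustration; the path below is a placeholder, and on a single-node cluster a factor above 1 will simply leave the file under-replicated.

$ hadoop dfs -setrep 3 /some/file.txt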

Finally edit conf/mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

Set the host and port that the MapReduce JobTracker runs on.

Next set up password-less ssh with keys. Hadoop's start and stop scripts use ssh to launch the daemons, so you need to set up ssh keys to allow Hadoop to ssh as needed. For more information see the Setting Up SSH Keys, Using SSH Agents, and Tunnels Tutorial.

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
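A quick sanity check, not part of the original steps: confirm that key-based login to localhost works without a password prompt. The chmod is only needed if authorized_keys was created with looser permissions.

$ chmod 600 ~/.ssh/authorized_keys
$ ssh localhost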

Format the HDFS filesystem to prepare it for use. This simply creates the needed directory structures for the HDFS filesystem.

$ hadoop namenode -format

Now that everything is configured and set up, start the daemons.

$ start-all.sh
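One optional way to verify that the daemons came up is the jps command that ships with the JDK. In pseudo-distributed mode it should list the NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker processes; the Hadoop logs directory is the place to look if any of them are missing.

$ jps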

Running a job in cluster mode is the same as launching a standalone job, except that the dataset needs to be copied into HDFS before processing it with Hadoop. This example will process the text of The Federalist Papers, available from The Gutenberg Project at Archive.org.

$ wget http://ia331428.us.archive.org/2/items/thefederalistpap00018gut/feder16.txt
$ hadoop dfs -put feder16.txt /
$ hadoop jar hadoop-0.20.1-examples.jar wordcount /feder16.txt /out

First use wget to download the Federalist Papers from The Gutenberg Project archive. Then copy the file into the Hadoop filesystem and run the wordcount job out of the examples jar file. The examples jar includes several other examples worth looking at; the source is included in the Hadoop package under src/examples/org/apache/hadoop/examples/ in the standard Java class directory structure. The wordcount job simply counts the number of times each word occurs in The Federalist Papers. The output is stored in the /out directory.
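When the job finishes, the results can be read back directly from HDFS. The part-* glob below is an assumption about the output file naming, so list the /out directory first to see the actual file names. Since wordcount writes tab-separated word/count pairs, sorting on the second field shows the most frequent words first.

$ hadoop dfs -ls /out
$ hadoop dfs -cat /out/part-* | sort -k2 -n -r | head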

Streaming

The streaming module allows any executable to be used as the mapper or reducer in MapReduce jobs. It's included as part of the Hadoop package in the contrib/streaming directory.

$ hadoop jar contrib/streaming/hadoop-0.20.1-streaming.jar -input /feder16.txt -output /out2 -mapper "grep -i liberty" -reducer "uniq -c"

Again run a Hadoop job against The Federalist Papers file. This job greps the file for the word liberty, case-insensitively. The job then reduces the list of lines containing liberty to a unique list of those lines, counting the number of occurrences of each.
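The streaming output can be read back the same way as the wordcount results. The /out2 path matches the -output argument above; the part-* file name is again an assumption. Because uniq -c prefixes each line with its count, a numeric reverse sort shows the most repeated liberty lines first.

$ hadoop dfs -cat /out2/part-* | sort -n -r | head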
