Building Hadoop MapReduce Jobs In Java

Introduction

Hadoop is a parallel job processing framework from the Apache Software Foundation. The Hadoop framework is written in Java and accepts jobs packaged as jar files for execution. This tutorial covers building a MapReduce job in Java. The dataset used is the 2000 US Census, available as an EBS volume snapshot on Amazon EC2. The census dataset is extremely large, so only a small part of the overall dataset is used and explained here.

Required libraries

All required libraries are available in the Hadoop packages. The Hadoop Java classes are grouped by function. The ones used in this article are org.apache.hadoop.fs.Path, org.apache.hadoop.conf.*, org.apache.hadoop.io.*, org.apache.hadoop.mapred.*, and org.apache.hadoop.util.*.

Anatomy of a MapReduce Program

Each job is contained in its own class and should contain static nested classes for any included mapper, combiner, and reducer functions. These nested classes aren't strictly required, though; Hadoop includes several classes that can handle these functions. The mapper class in the first example is called Map; it extends the MapReduceBase class and implements the Mapper interface. The combiner and reducer functions are left to classes supplied by Hadoop. There is also a standard main method.

Building the Job

Create the file, the main class, and the imports. The first job returns locations where one gender outnumbers the other. The job accepts female or male as a parameter to determine which to look for, along with the path on the HDFS filesystem where the census data is located.

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class Gender {
    
    private static String genderCheck = "female";
    
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, LongWritable> {
        // The map method is filled in below.
    }

    public static void main(String[] args) throws Exception {
        
    }
}

This is the structure of the Java source, but let's take a moment to look at the census dataset before building the job.

Dissecting the census dataset

The census dataset contains an enormous amount of data; for simplicity, only a subset is used and covered here. PDFs included with the dataset help explain some of the data. The column mappings below were deduced by checking values in the CSV files against the PDF reports. This isn't meant to be a tutorial on data parsing, so that code isn't elegant, but it does work.

Important columns

Column  Contents
2       Level of detail code (State, County, etc.)
15      Location name (State; County, State; etc.)
16      Total population
17      Male population
18      Female population
19      Under 5 years
20      5-9 years
21      10-14 years
22      15-19 years
23      20-24 years
24      25-34 years
25      35-44 years
26      45-54 years
27      55-59 years
28      60-64 years
29      65-74 years
30      75-84 years
31      85 years and older
32      Median age
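
For illustration, a county-level record is shaped roughly like this; every value here is invented for the example, but the quoting is the important part. The location field is wrapped in double quotes because it contains a comma:

    050,...,"Adams County, Colorado",350000,174000,176000,...

A naive line.split(",") breaks the quoted field into two tokens, which is why the map code below rejoins fields 14 and 15 and finds the male and female counts at indexes 17 and 18.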

The map code

The map code reads in each line and rebuilds a smaller CSV record with only the three fields needed (location, male population, female population). The Map class is built on the Hadoop framework; the Hadoop engine expects the class to provide a map method that performs the mapping. There is one issue with this code: the census CSV quotes the location field, since it may itself contain a comma. Since this tutorial focuses on Hadoop MapReduce jobs rather than CSV libraries/parsers, a simple hack has been put in place. Consider it a homework assignment to fix this if you wish (a sketch of one possible fix appears after the map walkthrough below).

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, LongWritable> {
        private Text locText = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            String[] fields = line.split(",");
            if (fields.length < 19) {
                return; // skip records too short to hold the columns we need
            }
            // The quoted location field is split in two by the naive comma
            // split, so glue the two halves back together.
            String location = fields[14] + "," + fields[15];
            long male = 0L;
            long female = 0L;
            // Guard against header rows and stray non-numeric values.
            if (fields[17].matches("\\d+") && fields[18].matches("\\d+")) {
                male = Long.parseLong(fields[17]);
                female = Long.parseLong(fields[18]);
            }
            long diff = male - female;
            locText.set(location);
            if (Gender.genderCheck.toLowerCase().equals("female") && diff < 0) {
                output.collect(locText, new LongWritable(diff * -1L));
            }
            else if (Gender.genderCheck.toLowerCase().equals("male") && diff > 0) {
                output.collect(locText, new LongWritable(diff));
            }
        }
    }

Place this code in the empty stub from before. The parsing is still imperfect; this is worked around by checking that the expected numbers actually are numbers. The map function receives each record one at a time, converts the Text value to a String, and performs a naive comma split, rebuilding the quoted location field from its two halves. It then initializes two longs to store the male and female population totals; since robust parsing isn't the focus here, it first verifies that the inputs are actually numeric.

Now figure out the difference between the male and female populations and set the Text variable locText to the location. If the requested gender has the larger population, the record is added to the job output with the collect method.
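
As a starting point for that homework assignment, here is a minimal quote-aware splitter; this is a sketch of one possible approach, not part of the original job, and the splitCsv helper name is invented for this example. It could live inside the Gender class and replace the line.split(",") call:

    // Sketch of a quote-aware CSV split (splitCsv is a hypothetical helper,
    // not part of the original job). Commas inside double quotes are kept
    // as part of the field instead of acting as separators.
    private static String[] splitCsv(String line) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;           // entering/leaving a quoted field
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString()); // unquoted comma ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());         // don't forget the last field
        return fields.toArray(new String[fields.size()]);
    }

With a splitter like this, the location arrives as a single field at index 14, and the male and female counts shift down to indexes 16 and 17.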

Setting up and running the job in main

The main method of the outer class runs when Hadoop loads the jar. Command-line arguments to the job are passed to this method.

    public static void main(String[] args) throws Exception {
        // Configure the job: name, output key/value types, formats, and mapper.
        JobConf conf = new JobConf(Gender.class);
        conf.setJobName("gender");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setMapperClass(Map.class);

        // Validate the command-line arguments before submitting the job.
        if (args.length != 3) {
            System.out.println("Usage:");
            System.out.println("[male/female] /path/to/2kh/files /path/to/output");
            System.exit(1);
        }

        if (!args[0].equalsIgnoreCase("male") && !args[0].equalsIgnoreCase("female")) {
            System.out.println("first argument must be male or female");
            System.exit(1);
        }
        Gender.genderCheck = args[0];

        // Point the job at its input and output paths, then run it.
        FileInputFormat.setInputPaths(conf, new Path(args[1]));
        FileOutputFormat.setOutputPath(conf, new Path(args[2]));
        JobClient.runJob(conf);
    }

The first thing main does is set up a JobConf instance. The JobConf is used to set the job name, the input/output data formats, and the Mapper class. The setOutputKeyClass method sets the output key type to Text, and setOutputValueClass sets the output value type to LongWritable. The setInputFormat and setOutputFormat methods specify the input and output data formats; in this example we're simply working with text files. No reducer is set, so Hadoop's default identity reducer passes the map output straight through. Finally, set the input and output paths and run the job.
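
If the job did need to aggregate values per key, a stock Hadoop reducer could be plugged into the same JobConf. This is only a sketch of what that would look like; it is not used in this job, since each location appears at most once in the map output:

        // Not used in this job: sum LongWritable values per key using
        // Hadoop's stock LongSumReducer as both combiner and reducer.
        conf.setCombinerClass(org.apache.hadoop.mapred.lib.LongSumReducer.class);
        conf.setReducerClass(org.apache.hadoop.mapred.lib.LongSumReducer.class);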

Building the jar file

Before Hadoop can use the Java code, it must be compiled and packaged into a jar file.

 $ mkdir Gender_classes
 $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d Gender_classes Gender.java
 $ jar -cvf /home/hadoop/Gender.jar -C Gender_classes/ . 

Make a directory, Gender_classes, to store the class files written by javac, then create a jar file from the class files. This jar will not execute properly on its own; it should be run by Hadoop.

Run the job

Now launch the job onto the Hadoop cluster.

$ hadoop jar /home/hadoop/Gender.jar Gender female /2kh*csv /out
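
When the job completes, the results can be read directly from HDFS. Each line of a part file holds the Text key and the LongWritable value separated by a tab (the exact part-file names depend on the number of reduce tasks):

 $ hadoop fs -cat /out/part-00000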