My First Program on Hadoop – Word Count – Ubuntu

Prerequisite:

First of all, you need access to a Hadoop cluster to run any program on it. Either copy PuTTY to your desktop and connect to an EC2 server, or install Hadoop on Ubuntu (or any other Linux OS) on your own machine.

Now that we have Hadoop, we can write our first program: the famous WordCount problem.

Step1: Create the .java file below.

[ec2-user@ip-10-186-16-175 Roma]$ cat WordCountOld.java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCountOld extends Configured implements Tool {

    public static class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private Text mapKey = new Text();
        private IntWritable mapValue = new IntWritable(1);
        // Assumes tab-separated input with the text to count in the third column (index 2).
        private static final int TEXT_IDX = 2;

        @Override
        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> collector, Reporter reporter)
                throws IOException {
            String[] lineSplit = value.toString().split("\t");
            StringTokenizer wordList = new StringTokenizer(lineSplit[TEXT_IDX]);
            // Emit (word, 1) for every word in the text column.
            while (wordList.hasMoreTokens()) {
                mapKey.set(wordList.nextToken());
                collector.collect(mapKey, mapValue);
            }
        }
    }

    public static class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, LongWritable> {
        private LongWritable reduceValue = new LongWritable();

        @Override
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, LongWritable> collector, Reporter reporter)
                throws IOException {
            // Sum the 1s emitted by the mappers for this word.
            long counter = 0;
            while (values.hasNext())
                counter += values.next().get();
            reduceValue.set(counter);
            collector.collect(key, reduceValue);
        }
    }

    @Override
    public int run(String[] arg) throws Exception {
        JobConf job = new JobConf(getConf(), WordCountOld.class);
        job.setJobName("WordCount");
        job.setJarByClass(WordCountOld.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setMapperClass(WordCountOld.WordCountMapper.class);
        job.setReducerClass(WordCountOld.WordCountReducer.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(job, new Path(arg[0]));
        FileOutputFormat.setOutputPath(job, new Path(arg[1]));
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new WordCountOld(), args);
    }
}

Step2: Compile the above Java code with the following command.
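For example, a sketch assuming the hadoop command is on your PATH (on older 1.x installs you may need to point -classpath at the hadoop-core jar instead):

[ec2-user@ip-10-186-16-175 Roma]$ javac -classpath $(hadoop classpath) WordCountOld.java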

The creation of the .class files signifies that the code compiled successfully.

Step3: Create a .jar file (the executable packaging of the compiled classes) using the following command.
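For example (the wildcard also picks up the inner Mapper and Reducer classes):

[ec2-user@ip-10-186-16-175 Roma]$ jar -cvf WordCountOld.jar WordCountOld*.class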

NOTE: Up to this point we are working on the local machine (not on HDFS).

Step4: Check what’s on the Hadoop file system.
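For example, to list the contents of your HDFS home directory:

[ec2-user@ip-10-186-16-175 Roma]$ hadoop fs -ls /user/ec2-user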

Create a directory on HDFS:
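For example (the directory name Roma matches the paths used in Step6; adjust to taste):

[ec2-user@ip-10-186-16-175 Roma]$ hadoop fs -mkdir /user/ec2-user/Roma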

Step5: Copy the input file to HDFS.

By mistake I created the input file in a local directory, so I am moving it to HDFS (which is also a good way to show how to copy files from local to HDFS).
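For example, -put copies a local file to HDFS, while -moveFromLocal moves it (deleting the local copy):

[ec2-user@ip-10-186-16-175 Roma]$ hadoop fs -moveFromLocal input.txt /user/ec2-user/Roma/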

Step6: Execute the program. Note that the output path must not already exist; the job will fail if it does.

hadoop jar WordCountOld.jar WordCountOld /user/ec2-user/Roma/input.txt /user/ec2-user/Roma/output.txt

Step7: Check the output file that was created.

A directory named “output.txt” was created in HDFS for the output:
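For example, to list the output directory and print the reducer’s output file (with the old API a single reducer writes part-00000):

[ec2-user@ip-10-186-16-175 Roma]$ hadoop fs -ls /user/ec2-user/Roma/output.txt
[ec2-user@ip-10-186-16-175 Roma]$ hadoop fs -cat /user/ec2-user/Roma/output.txt/part-00000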

Step8: Copy the output file from HDFS to a local directory to view it.
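For example:

[ec2-user@ip-10-186-16-175 Roma]$ hadoop fs -get /user/ec2-user/Roma/output.txt .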

Step9: The output will be in the following format.

Each word with its count (the number of times it appeared in the input file):
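For illustration only (made-up data, not the tutorial’s actual output), if the text column contained “hello world hello”, the output file would hold one word per line, followed by a tab and its count:

hello	2
world	1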
