Word Count Program in Hadoop – Hortonworks

HadoopWordCount

“Word Count”, as the name says, counts how many times each word occurs, but not with traditional programming. Yes, of course it is possible to get word frequencies with a traditional single-machine program, but that gets slow as the input grows. So here we will use Hadoop and its parallel computing model, which is called “Map and Reduce”.

If you want to know what Hadoop and the HDFS file system are, you can go through this link.

We will follow the steps below to accomplish our target.

Step 1 – Download and Install Hadoop in Oracle VirtualBox:

We will use Hortonworks as our Hadoop platform. Hortonworks bundles the Hadoop components under one roof in an image that can be downloaded for free and installed in Oracle VirtualBox.

The Hortonworks image file can be downloaded from the link below:

http://hortonworks.com/hdp/downloads/

Oracle VirtualBox screenshot after “hortonworks” is installed in it:

Hotonworks_install

After we start the “hortonworks” machine in Oracle VirtualBox, we get the screen below, which shows instructions for starting “Hortonworks Sandbox Sessions”.

Hotonworks_started

Step 2 – Start Hortonworks:

As instructed, we can start the session at http://127.0.0.1:8888/ and will get the screen below.

Hotonworks_session

On the right-hand side we can see an orange button called “View Advanced Options”. If we click it, we get a few options, or paths, for using different features in Hortonworks.

For this exercise we will use Secure Shell (SSH) [url: http://127.0.0.1:4200] and Hue [url: http://127.0.0.1:8000].

Hotonworks_features_paths

We will first open the URL http://127.0.0.1:4200 to open SSH and get a screen like the one below, which asks for the sandbox login user and password. Enter “root” as the user and “hadoop” as the password.

Hadoop_Shell

Step 3 – Upload Java programs to the Hadoop system

Now we have 3 Java programs, for the Map step, the Reduce step and the main function of Word Count, which we need to put on the Hadoop system using shell commands. The Java programs are named WordMapper.java [for Map], SumReducer.java [for Reduce] and WordCount.java [the main program that counts the words].

The Java programs are available at this link.

First we will use “vi WordMapper.java” to open the file editor for the WordMapper program, like below:
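To make the roles of the three programs easier to picture, here is a minimal plain-Java sketch of the same idea, written without any Hadoop classes so it can be run on its own. It is not the code behind the link: the class names WordMapperSketch, SumReducerSketch and WordCountSketch are hypothetical stand-ins, and lowercasing and splitting on non-word characters are assumptions about how words are tokenized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Stand-in for the map step: one input line in, a list of (word, 1) pairs out.
class WordMapperSketch {
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        // Assumption: words are lowercased and split on non-word characters.
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }
}

// Stand-in for the reduce step: all the 1s emitted for one word are summed.
class SumReducerSketch {
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }
}

// Stand-in for the main program: wires map, grouping and reduce together.
class WordCountSketch {
    static Map<String, Integer> run(List<String> lines) {
        // Group the mapper output by word. In Hadoop the framework does this
        // between the map and reduce phases; here we do it by hand.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : WordMapperSketch.map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        // Reduce each group to a single count, keeping words in sorted order.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), SumReducerSketch.reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                run(List.of("the quick brown fox", "the lazy dog and the fox"));
        // Print one "word<TAB>count" line per word.
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

With the two sample lines in main, “the” is counted 3 times and “fox” 2 times. In the real job Hadoop runs many mappers and reducers in parallel on different parts of the input, which this single-process sketch does not attempt to show.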

Open_VI_EDITOR

and we will get the screen below, where we need to press “i” to get into “insert” mode.

Vi_editor_in_insert_mode

Unfortunately “Control + V” does not work in this screen, so to paste the Java programs we need to right-click and select the option called “Paste from browser”.

Vi_editor_paste_options

We get the field below, where we paste the Java code and click OK.

Vi_editor_paste_code

To save the code, press “ESC” followed by “:wq” and then press “ENTER”.

Vi_editor_java_programs

We repeat the same steps for the other 2 Java programs to upload them to the system.

To verify that the Java files were created successfully, we run the command “ls -l” like below:

java_programs_uploaded_verification

 

Step 4 – Compile Java programs

Java programs need to be compiled before execution, but first we need to create a directory where the system can store the class files after compilation.

Verify_classes_after_compile

To compile the programs we need to find some core library paths in the system (which differ from version to version) and pass them in the compile command like below.

Java compile

After the above compile command, 3 class files should be created in the WCclasses directory. We verify that like below.

Verify_classes_after_compile

Now our class files are ready, but the job is submitted to Hadoop as a jar file, so we need to package the class files into a jar file for the main program WordCount.java.

create_jar_file

 

Step 5 – Upload input files

Now our executable program is ready, but we are here to count words, right? So where are those words, where are the input files? Yes, we need to upload the input files to the system as well before running the program.

The input files are available at this link.

For this we don’t have to use shell commands, because Hortonworks provides an interface for uploading files in “Hue”.

Hotonworks_features_paths

We follow the URL http://127.0.0.1:8000 to log in to Hue with the user name “hue” and the password “1111” in the screen below.

Hue_login

In the screen below, click the icon in the blue circle to open the “file browser”.

Hue_file_browser

We get the screen below with a directory listing, where we have created a “wc” directory for the input files.

Hue_file_browser2

Inside the “wc” directory we have uploaded the input files, which look like below.

Hue_input_files

 

Step 6 – Execute WordCount.jar to get the final result:

As our inputs are uploaded and the Java programs are compiled and packaged successfully, we are ready to execute the final step and get the result. Before that we need to delete the output directory if it is present, because Hadoop will not run a job whose output directory already exists. We use the command below:

“hdfs dfs -rm -r wc-out2”, where wc-out2 is our output directory.

The final command, which executes WordCount.jar and stores the result in the wc-out2 directory using the input files stored in the wc directory, looks like below.

Running_final_step_wordcount

 

Step 7 – Verify the job

As the command ran without any error, we now monitor the job that was submitted in the background. To do so we go back to “Hue” and click the “job browser” icon. By default the user in the job browser is “hue”, so we need to change it to “root”, as we submitted the commands as the “root” user. There should be a job in either the “running” or the “succeeded” state.

Running_jobs_word_count

 

Step 8 – Verify the output

Once the job is in the “succeeded” state, we go back to the file browser to check the final output.

There should be a new directory called “wc-out2”. We enter that directory and get the screen below, where a text file called “part-r-00000” should be present; this file contains the actual result.

Final_output_file_word_count

 

In the “part-r-00000” file we see two columns: the 1st column is the word and the 2nd is the count for that word.
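If we want to work with this result outside Hue, the two-column layout is easy to read back in. The sketch below is only an illustration: it assumes the job used Hadoop's default text output, where the word and the count on each line are separated by a tab, and it uses a few made-up sample lines instead of the real file. The class name WordCountOutputSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class WordCountOutputSketch {
    // Turn "word<TAB>count" lines into (word, count) pairs.
    // Blank lines and lines without a tab are skipped.
    static List<Map.Entry<String, Integer>> parse(List<String> lines) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (line.trim().isEmpty() || tab < 0) {
                continue;
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            rows.add(Map.entry(word, count));
        }
        return rows;
    }

    // Return the n rows with the highest counts, highest first.
    static List<Map.Entry<String, Integer>> topN(List<Map.Entry<String, Integer>> rows, int n) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(rows);
        sorted.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    public static void main(String[] args) {
        // Made-up sample lines in the same two-column shape as part-r-00000.
        List<String> sample = List.of("dog\t1", "fox\t2", "the\t3");
        for (Map.Entry<String, Integer> row : topN(parse(sample), 2)) {
            System.out.println(row.getKey() + " occurs " + row.getValue() + " times");
        }
    }
}
```

To use it on the real output, the lines of “part-r-00000” would first have to be downloaded from the file browser and read into a list of strings.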

Final_output_word_count.PNG

Somenath