“Word Count”: as the name says, we will count how many times each word occurs, but not with traditional programming. It is of course possible to get word frequencies with a traditional single-machine program, but that can take a long time once the input grows large. So here we will use Hadoop and its parallel computing model, which is called “MapReduce”.
If you want to know what Hadoop and the HDFS file system are, you can go through this link
We will follow the steps below to accomplish our target.
Step 1 – Download and install Hadoop in Oracle VirtualBox
We will use Hortonworks as our Hadoop platform. Hortonworks bundles the Hadoop tools in one image, which can be downloaded for free and installed in Oracle VirtualBox.
The Hortonworks image file can be downloaded from the link below
Oracle VirtualBox screenshot after “hortonworks” is installed in it.
After we start the “hortonworks” machine in Oracle VirtualBox, we will get the screen below, which shows instructions on how to start “Hortonworks Sandbox Sessions”.
Step 2 – Start Hortonworks
As instructed, we can start the session at http://127.0.0.1:8888/ and will get the screen below
On the right-hand side we can see an orange button called “View Advanced Options”. If we click it, we will get a few options, or paths, for using different features in Hortonworks.
For this exercise we will use Secure Shell (SSH) [url: http://127.0.0.1:4200] and Hue [url: http://127.0.0.1:8000]
We will first open the URL http://127.0.0.1:4200 to open SSH and will get a screen like the one below, where we are asked for the sandbox login user and password. We need to enter “root” as the user and “hadoop” as the password.
Step 3 – Upload Java programs to the Hadoop system
We have 3 Java programs for the Map, Reduce and main functions of Word Count, which we need to put on the Hadoop system using shell commands. The Java programs are named WordMapper.java [for Map], SumReducer.java [for Reduce] and WordCount.java [the main program to count the words]
The Java programs are available at this link
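For readers who want to see how the three roles fit together before opening the files, here is a minimal plain-Java sketch. It is not the code behind the link and it does not use the Hadoop API. The class name MapReduceSketch and its methods map, reduce and run are hypothetical names chosen for illustration. It also assumes that words are separated by non-word characters and compared case-insensitively, which the linked WordMapper.java may or may not do.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the map -> group -> reduce flow.
// Hypothetical names; this is NOT the Hadoop API and NOT the linked programs.
class MapReduceSketch {

    // "Map" role (the part WordMapper.java plays):
    // emit a (word, 1) pair for every word found in one line of input.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        // Assumption: lower-case everything and split on non-word characters.
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // "Reduce" role (the part SumReducer.java plays):
    // add up all the counts that were collected for one word.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // "Main" role (the part WordCount.java plays): feed each line to map,
    // group the emitted pairs by word, then call reduce once per word.
    // In a real Hadoop job the framework does the grouping between the phases.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick brown fox", "The lazy dog and the fox");
        // prints {and=1, brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
        System.out.println(run(lines));
    }
}
```

The point of the split is that map only ever looks at one line and reduce only ever looks at one word, so Hadoop can run many copies of each on different machines at the same time.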
First we will use “vi WordMapper.java” to open the file editor for the WordMapper program, like below
and we will get the screen below, where we need to press “i” to get into “insert” mode.
Unfortunately “Control+V” will not work in this screen, so to paste the Java programs we need to right-click and select the option called “paste from browser”.
We will get the field below, where we paste the Java code and click OK
To save the code, press “ESC”, type “:wq” and press “ENTER”
We will repeat the same steps for the other 2 Java programs to create them on the system.
To verify that the Java files were created successfully, we will run the command “ls -l” like below
Step 4 – Compile Java programs
Java programs need to be compiled before execution, but first we need to create a directory where the system can store the class files after compiling.
To compile the programs we need to find some core library paths on the system (which differ from version to version) and pass them in the compile commands, like below.
After the above Java compile command, 3 class files should be created in the WCclasses directory. We will verify that like below.
Now our class files are ready, but Hadoop runs a job from a jar file rather than from loose class files, so we need to package them into a jar file, WordCount.jar, for the main program WordCount.java
Step 5 – Upload input files
Now our executable programs are ready, but we are here to count words, right? So where are those words, where are the input files? Yes, we need to upload the input files to the system as well before running the program.
Input files are available at this link
For this we don’t have to use shell commands, because Hortonworks gives us an interface to upload files in the “Hue” system.
We will go to the URL http://127.0.0.1:8000 to log in to Hue with the user name “hue” and the password “1111” in the screen below
In the screen below, click on the icon in the blue circle to open the “file browser”
and we will get the screen below with a directory listing, where we have created a “wc” directory for the input files.
Inside the “wc” directory we have uploaded the input files, which will look like below.
Step 6 – Execute WordCount.jar to get the final result
As our inputs are uploaded and the Java programs are compiled and packaged successfully, we are ready for the final step. Before that, we need to delete the output directory if it is already present, because the job fails when its output directory already exists. We use the command below
“hdfs dfs -rm -r wc-out2” where wc-out2 is our output directory.
The final command, which executes WordCount.jar and stores the result in the wc-out2 directory using the input files stored in the wc directory, will be like below.
Step 7 – Verify the job
As the command ran without any error, we will now monitor the job that was submitted in the background. To do so, we will go back into “Hue” and click the “job browser” icon. By default the user in the job browser is “hue”, so we need to change it to “root”, as we submitted the commands as the “root” user. There should be a job in either the “running” or the “succeeded” state.
Step 8 – Verify the output
If the job is already in the “succeeded” state, we will go back to the file browser to check the final output.
There should be a new directory called “wc-out2”. We will enter that directory and get the screen below, where a text file called “part-r-00000” should be present, which contains the actual result.
In the “part-r-00000” file we will see two columns: the 1st column is the word and the 2nd is the count for that word.
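The result can then be post-processed like any other text file. The sketch below is an illustration under stated assumptions, not part of the original walkthrough. It assumes each line of “part-r-00000” holds a word and its count separated by a tab, which is the default separator of Hadoop's text output (the linked WordCount.java could configure something else). It also assumes the file's contents are already available as a list of lines. The class OutputSketch and its methods parse and topWords are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for reading "word<TAB>count" lines; not part of Hadoop.
class OutputSketch {

    // Parse each "word<TAB>count" line into a map, keeping the file order.
    // Blank or malformed lines are skipped rather than treated as errors.
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length != 2) {
                continue; // not exactly two columns
            }
            try {
                counts.put(cols[0], Integer.parseInt(cols[1].trim()));
            } catch (NumberFormatException e) {
                // second column is not a number: skip the line
            }
        }
        return counts;
    }

    // Return up to n words with the highest counts, most frequent first.
    // The sort is stable, so words with equal counts keep their file order.
    static List<String> topWords(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Sample lines in the assumed format, not real job output.
        List<String> lines = Arrays.asList("and\t1", "fox\t2", "the\t3");
        // prints [the, fox]
        System.out.println(topWords(parse(lines), 2));
    }
}
```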