Lets start with Hive – Hadoop

hive-logoJava, Pig , R those are all programming language, but what if you are not comfortable with those regular programming language, what if you only know SQL. There is still a way out for you in Hadoop, it’s called Hive. It is old SQL in different packet called HQL (Hadoop Query Language).

As an elementary task in Hive we are going to do the same kind data processing task as we did with Pig

 

 

 

Steps we will follow:

  1. We have several files of baseball statistics that we are going to upload into Hive.
  2. Do some simple computing with them.
  3. Find the player with the highest runs for each year.
  4. Once we have the highest runs we will extend the script to translate a player id field into the first and last names of the players.

This file has all the statistics from 1871–2011 and contains more than 90,000 rows.

Input file path:

http://seanlahman.com/files/database/lahman591-csv.zip

Step 1 –  Load input file:

We need to unzip it into a directory. We will be uploading just the Master.csv and Batting.csv files from the dataset in “file browser” like below

hive3

In Hue there is a button called “Hive” and inside Hive there are query options like “Query Editor”, “My Queries” and “Tables” etc.

On left there is a “query editor”. A query may span multiple lines, there are buttons to Execute the query, Explain the query, Save the query with a name and to open a new window for another query.

Pig is a scripting language so there all data objects are operated on in the script. Once the script is complete all data objects are deleted unless you stored them.

In the case of Hive we are operating on the Apache Hadoop data store. Any query you make, table that you create, data that you copy persists from query to query. 

 

Step 2 – Create empty table and load data in Hive

In “Table” we need to select “Create a new table from a file”, which will lead us to the “file browser”, where we will select “batting.csv” file and we will name the new table as “temp_batting”

Else we can select “query editor” and run “create” query to create the table.

Create table temp_batting (col_value STRING);

 

hive4 Next we load the contents from ‘Batting.csv’ into temp_batting table, through the following command which need to be executed through the Query Editor

LOAD DATA INPATH ‘/user/admin/Batting.csv’ OVERWRITE INTO TABLE temp_batting;

Once data has been loaded, the file (batting.csv) will be deleted by HIVE, and it will no longer be seen in the file browser.

hive6

Now we know that we have loaded the data, we have to verify the same. To do so we execute the following command, this will show us the first 100 rows from the table.

 

SELECT * from temp_batting LIMIT 100;

 hive7

 

The results of the query should look like:

 

hive8

 

Step 3 – Create a batting table and transfer data from the temporary table to batting table

Now we will extract the contents of temp_batting into a new table called ‘batting’ which should contain the following columns:

a)  player_id
b)  year
c)  runs

hive9

Next object is to create the ‘batting’ table and insert in it from ‘temp_batting’ (player_id, year and run) using regular expression.

create table batting (player_id STRING, year INT, runs INT);

 

insert overwrite table batting 

SELECT    

 regexp_extract(col_value, ‘^(?:([^,]*),?){1}’, 1) player_id,    

regexp_extract(col_value, ‘^(?:([^,]*),?){2}’, 1) year,    

regexp_extract(col_value, ‘^(?:([^,]*),?){9}’, 1) run 

from temp_batting 

Step 4 – Create a query to show the highest score per year

Next is simple command to do a “group by” in ‘batting’ by year, so that we have the highest scores by year.    

    SELECT year, max(runs) FROM batting GROUP BY year;  

Result of executing the above query is shown below:

hive10

 

hive11

Step 5 – Get final result (who scored the maximum runs, year-wise)

As our year wise maximum runs are ready, we will execute final query which will show the player who scored the maximum runs in a year.

    SELECT a.year, a.player_id, a.runs from batting a 

    JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b 

    ON (a.year = b.year AND a.runs = b.runs) ;

 

     The result of the above query

Hive_final1

Hive_final2

Share the joy
  •  
  •  
  •  
  •  
  •  
  •  
  •  
  •  
  •  
  •  

Leave a Reply

Your email address will not be published. Required fields are marked *