First Program in Pig – Hadoop

We have already run our first Hadoop program, “WordCount”, using Java. But apart from Java, Hadoop also offers several other languages and tools, such as Pig and Hive.

Let’s first see what Pig is.

Pig is a high-level scripting platform used with Hadoop; its scripts are written in a language called Pig Latin. It is generally used for data analysis problems, in the way we would use R or SAS. The data manipulations we would do with R on an ordinary Windows machine can all be done on Apache Hadoop with Pig.

Pig Latin includes operators for many of the data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions.

Objective:

We have an input file of baseball statistics from 1871–2011, containing over 90,000 rows. Our objective is to find the highest number of runs scored by a player in each year. Once we have the highest runs, we will extend the script to translate the player ID field into the players’ first and last names.

Link to download the input csv file:

http://hortonassets.s3.amazonaws.com/pig/lahman591-csv.zip

Unzip the archive and upload the Master.csv and Batting.csv files using the “File Browser”, as shown below.

[Screenshot: File Browser]

Code to achieve the above objective:

Open the “Pig” interface in Hortonworks, as shown below.

[Screenshot: PigScript01]

Step 1:

First we load the data, using the “PigStorage” function to split each line on commas.

batting = LOAD 'Batting.csv' USING PigStorage(',');
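
Because the LOAD statement gives no schema, Pig treats every field as a bytearray, and fields can only be referenced by position ($0, $1, ...). Before going further it can help to look at a few rows. A minimal sketch, in which `sample_rows` is an alias made up for this check and not part of the original script:

```pig
-- Load the CSV as in Step 1 (no schema, so fields are positional).
batting = LOAD 'Batting.csv' USING PigStorage(',');

-- Keep only a handful of rows so the check stays cheap.
-- `sample_rows` is a hypothetical alias introduced for illustration.
sample_rows = LIMIT batting 5;

-- Print the sampled rows to the console.
DUMP sample_rows;
```

Looking at the sampled rows is a quick way to confirm which positions hold the player ID, the year and the runs before they are named in the next step.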

Step 2:

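The next step is to drop the header row and name the fields. The FILTER statement keeps only rows whose second column is a number greater than zero, which removes the CSV header line. The FOREACH ... GENERATE statement then picks the columns we need and assigns names to them.

raw_runs = FILTER batting BY $1 > 0;
runs = FOREACH raw_runs GENERATE $0 as playerID, $1 as year, $8 as runs;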
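The statement above leaves the three fields untyped. One possible variant, offered as a sketch rather than as the original author's method, casts each field explicitly so that later comparisons and aggregates work on integers. It assumes that the first line of Batting.csv is a header row, that column 1 holds the year, and that column 8 holds the runs, as in the step above; `data_rows` is a hypothetical alias.

```pig
-- Load the CSV without a schema, as in Step 1.
batting = LOAD 'Batting.csv' USING PigStorage(',');

-- Assumption: the first line is a header. Casting header text such as
-- "yearID" to int yields null, the comparison is then not true,
-- and the row is dropped.
data_rows = FILTER batting BY (int)$1 > 0;

-- Name the fields and give each one an explicit type.
runs = FOREACH data_rows GENERATE
    (chararray)$0 AS playerID,
    (int)$1       AS year,
    (int)$8       AS runs;

-- Print the resulting schema to confirm the names and types.
DESCRIBE runs;
```

The result is still called `runs`, so the remaining steps can use it without changes.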

Step 3:

Group the records by the “year” field.

grp_data = GROUP runs BY (year);

Step 4:

Find the maximum runs within each group.

max_runs = FOREACH grp_data GENERATE group as grp, MAX(runs.runs) as max_runs;
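GROUP produces one record per year with two fields: `group`, which holds the year, and a bag named after the grouped relation (`runs`) that holds all the original records for that year. That is why the aggregate is written `runs.runs`: it is the `runs` column inside the `runs` bag. The sketch below inspects that structure with DESCRIBE and adds a per-year row count next to the maximum. The names `year_stats` and `num_rows` are made up for illustration, and the FILTER line assumes the CSV starts with a header row.

```pig
-- Steps 1 to 3, repeated so the snippet runs on its own.
batting  = LOAD 'Batting.csv' USING PigStorage(',');
raw_runs = FILTER batting BY $1 > 0;   -- assumption: drops a header row
runs     = FOREACH raw_runs GENERATE $0 AS playerID, $1 AS year, $8 AS runs;
grp_data = GROUP runs BY (year);

-- Print the schema of the grouped relation: a `group` key plus a bag called `runs`.
DESCRIBE grp_data;

-- COUNT takes the whole bag; MAX takes one column projected out of the bag.
year_stats = FOREACH grp_data GENERATE
    group          AS year,
    COUNT(runs)    AS num_rows,
    MAX(runs.runs) AS max_runs;

DUMP year_stats;
```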

Step 5:

We join this result with the runs relation in order to pick up the player ID.

Then we DUMP the data to the output.

join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);

join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;

DUMP join_data;
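The objective also mentions translating the player ID into first and last names. A minimal sketch of that extension, meant to be appended to the script after Step 5, might look like the following. The column positions $1, $16 and $17 in Master.csv are assumptions and should be checked against the file's header row. The aliases `master`, `names`, `join_names` and `named_data` are hypothetical.

```pig
-- Assumes `join_data` (year, playerID, runs) from Step 5 is defined
-- earlier in the same script.

-- Load the player master file, again without a schema.
master = LOAD 'Master.csv' USING PigStorage(',');

-- ASSUMPTION: $1 = playerID, $16 = first name, $17 = last name.
-- Verify these positions against the header line of Master.csv.
names = FOREACH master GENERATE $1 AS playerID, $16 AS nameFirst, $17 AS nameLast;

-- Match each yearly top scorer to a name via the player ID.
join_names = JOIN join_data BY playerID, names BY playerID;

-- Keep the year, the player's name and the run total.
named_data = FOREACH join_names GENERATE
    join_data::year  AS year,
    names::nameFirst AS nameFirst,
    names::nameLast  AS nameLast,
    join_data::runs  AS runs;

DUMP named_data;
```

The header line of Master.csv has no matching player ID in `join_data`, so the join drops it without a separate filter.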

Output

After executing the above code, check the “Job Browser” to see whether the jobs ran successfully, as shown below.

[Screenshot: Job Browser - pig]

If the jobs completed successfully, we can check the output, which appears below the code in the “Pig” screen. It should look like this:

[Screenshot: PigResult01]

With this, we have executed our first program in Pig.
