Monday, February 20, 2017

Apache Pig CLI basic commands

Basic Pig scripting steps -

Pig CLI (command-line environment)

  You can start Pig in either local or mapreduce mode -

pig -x local -- local mode.

In local mode, Pig runs on a single machine and expects all input files to be on the local file system.

pig -x mapreduce or pig -- mapreduce (distributed) mode, which is the default. Pig runs on the Hadoop cluster and reads and writes HDFS files.


Common Pig statements - 

1. Every Pig statement that produces a relation (load, foreach, group etc.) needs an alias on the left-hand side. Statements such as dump, store and describe are written without one.

x = load '<path>/file' using PigStorage('<delimiter>') as (<schema>);

x -> the relation (alias) produced by the load statement; it holds the contents of the file.

PigStorage - sets the column delimiter (tab if omitted)

as - describes the schema of the data (field names and types)
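
To make the load statement above concrete, here is a minimal sketch. The file name 'sales.csv', the comma delimiter, and the field names and types are all assumptions made up for illustration, not something from a real dataset.

```pig
-- Hypothetical input: a comma-delimited file with three columns.
-- File name, field names and types are assumed for this example.
sales = load 'sales.csv' using PigStorage(',')
        as (region:chararray, units:int, price:double);
```

If you leave out the "as" clause, the fields have no names and you refer to them by position ($0, $1, ...), and their type defaults to bytearray.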

2. foreach -> a very common statement that processes every tuple in a relation and generates a new set of fields from it (think of a tuple as a record in a table, and a relation as a database table)

example -

y = foreach x generate $0..;

  this gives you all columns of the relation 'x'

y = foreach x generate $0..$4;

  this gives you the 1st to 5th columns of the relation 'x'

  you can also perform computations and apply business logic on the columns inside a foreach statement

example -

y = foreach x generate $0*10 as A, $1/10 as B, $2..;

 multiply $0 by 10, divide $1 by 10, and get all the rest of the columns ($2 onwards) as is.

or you can do like this.. 

y = foreach x generate $0*$1 as B, $2..;
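
The same kind of computation reads more clearly when the relation was loaded with a schema, because you can use field names instead of positions. A small sketch, assuming a hypothetical 'sales.csv' with the made-up fields region, units and price:

```pig
-- Hypothetical input and schema, assumed for illustration only.
sales = load 'sales.csv' using PigStorage(',')
        as (region:chararray, units:int, price:double);

-- Keep region as is and compute a new column from two existing ones.
with_revenue = foreach sales generate region, units * price as revenue;
```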


You can apply string, math and other built-in functions and operators (like FLATTEN, TOKENIZE, STRSPLIT etc.) to the columns inside foreach.
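
As a rough sketch of functions inside foreach, the snippet below splits lines of text into words. The input file 'lines.txt' and all alias and field names are assumptions for this example; TOKENIZE, FLATTEN and UPPER are Pig built-ins.

```pig
-- Hypothetical input: one line of free text per record.
lines = load 'lines.txt' as (line:chararray);

-- TOKENIZE splits each line into a bag of words;
-- FLATTEN turns that bag into one output tuple per word.
words = foreach lines generate FLATTEN(TOKENIZE(line)) as word;

-- UPPER is a built-in string function, applied to every tuple.
upper_words = foreach words generate UPPER(word) as word_upper;
```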


3. group ... by - groups the tuples of a relation by one or more fields


  z = group y by $0;

  This will group 'y' by column $0
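
It helps to look at what group actually produces. In the sketch below (the file, aliases and field names are assumed for illustration), each tuple of the grouped relation has two fields: "group", which holds the key, and a bag named after the input relation, which holds all the tuples that share that key.

```pig
-- Hypothetical input and schema, assumed for illustration only.
sales = load 'sales.csv' using PigStorage(',')
        as (region:chararray, units:int, price:double);

-- Group by a single field: the key field 'group' is a chararray.
by_region = group sales by region;

-- Group by more than one field: the key field 'group' becomes a tuple.
by_region_units = group sales by (region, units);

-- Print the schema of each grouped relation to see the 'group' field and the bag.
describe by_region;
describe by_region_units;
```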

4. Aggregate operations -

  we can perform aggregate operations like SUM, COUNT, MIN, MAX on the grouped result

L = foreach z generate group as <alias>, SUM(y.$1) as <alias>;

 "group" here is the field of the grouped relation 'z' that holds the grouping key. The second field of 'z' is a bag named 'y' that contains the grouped tuples, which is why the aggregate is written as SUM(y.$1) and not SUM($1). You can rename any generated column with "as".

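The group and aggregate steps usually go together. Here is a minimal sketch with named fields; the file 'sales.csv', the schema and all aliases are assumptions for this example.

```pig
-- Hypothetical input and schema, assumed for illustration only.
sales = load 'sales.csv' using PigStorage(',')
        as (region:chararray, units:int, price:double);

by_region = group sales by region;

-- Aggregates take a column of the bag, written as <bag>.<field>.
totals = foreach by_region generate
             group            as region,
             SUM(sales.units) as total_units,
             COUNT(sales)     as num_rows,   -- COUNT skips tuples whose first field is null
             MAX(sales.price) as max_price;
```
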
5. dump/store - dump prints a relation to the console, store writes it to a file.

dump L;

store L into '<path>' using PigStorage('<delimiter>');   -- "using PigStorage" is optional; use it to set your own delimiter (default is tab)
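
A small sketch of both statements, converting a comma-delimited file to a pipe-delimited one. The input file, the output path 'out/sales_pipe' and the schema are assumptions for this example.

```pig
-- Hypothetical input and schema, assumed for illustration only.
sales = load 'sales.csv' using PigStorage(',')
        as (region:chararray, units:int, price:double);

-- Print every tuple to the console (best kept for small data).
dump sales;

-- Write the relation out with '|' as the column delimiter.
store sales into 'out/sales_pipe' using PigStorage('|');
```

Note that the store path is a directory, not a single file: Pig writes part files inside it, and the job fails if that directory already exists.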

6. describe - prints the schema (data structure) of a relation

describe x;



