Basic Pig scripting steps -
Pig CLI (command-line environment)
You can start Pig in either local or mapreduce mode -
pig -x local -- local mode.
In local mode Pig runs on a single machine and expects all input files to be on the local file system.
pig -x mapreduce or pig -- mapreduce (distributed) mode, which is the default and works with HDFS files.
Common Pig statements -
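1. Every Pig statement that produces a relation (load, foreach, group, ...) needs an alias on the left-hand side; only output and diagnostic statements such as store, dump and describe do not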
x = load '<path>/file' using PigStorage('<delimiter>') as (<structure of file>);
x -> the relation produced by the load statement; it holds the contents of the file.
PigStorage - sets the column delimiter (tab if omitted)
as - describes the schema of the data (field names and types)
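As a minimal sketch of the load statement above, assuming a hypothetical comma-delimited file 'sales.csv' with three columns (the file name, field names and types are made up for illustration and are not part of the original notes):

```pig
-- Hypothetical input: sales.csv with lines such as  east,widget,12.5
-- File name, field names and types are assumptions for illustration.
sales = load 'sales.csv' using PigStorage(',')
        as (region:chararray, product:chararray, amount:double);

-- With no "using" clause Pig falls back to PigStorage with a tab delimiter,
-- and with no "as" clause the fields are untyped and referenced as $0, $1, ...
raw = load 'sales.csv';
```

Naming the fields in the "as" clause lets later statements refer to region or amount instead of $0 or $2.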
2. foreach -> a very common statement that goes through every tuple in a relation and generates a new tuple from its fields (think of a tuple as a record in a table and a relation as a database table)
example -
y = foreach x generate $0.. ;
this gives you all the columns of the relation 'x'
y = foreach x generate $0..$4;
this gives you the 1st to 5th columns of the relation 'x'
You can perform computations and apply business logic to the columns inside a foreach statement
example -
y = foreach x generate $0*10 as A, $1/10 as B, $2..;
this multiplies $0 by 10, divides $1 by 10 and keeps all the remaining columns as they are.
or you can combine columns like this..
y = foreach x generate $0*$1 as B, $2..;
You can also apply string, math and other built-in functions and operators (such as FLATTEN, TOKENIZE, STRSPLIT etc.) to the columns inside foreach.
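A short sketch of functions inside foreach, assuming a hypothetical text file 'notes.txt' with one line of free text per record (the file name and the aliases are illustrative only):

```pig
-- Hypothetical input: one line of free text per record.
lines = load 'notes.txt' as (line:chararray);

-- TOKENIZE splits each line into a bag of words;
-- FLATTEN un-nests that bag so each word becomes its own tuple.
words = foreach lines generate FLATTEN(TOKENIZE(line)) as word;

-- Built-in string functions can be applied to a field in the same way.
upper_words = foreach words generate UPPER(word) as word;
```

Built-in function names such as TOKENIZE and UPPER are case-sensitive, so they are written in upper case here.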
3. group by - groups the tuples of a relation by one or more fields
z = group y by $0;
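This groups 'y' by column $0. The result 'z' has two fields: "group" (the key) and a bag named 'y' that holds all the tuples sharing that key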
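To illustrate grouping by one field and by several, here is a sketch that assumes a hypothetical relation loaded with named fields (the file 'sales.csv' and its schema are assumptions for illustration):

```pig
-- Hypothetical relation with a named schema (names are assumptions).
sales = load 'sales.csv' using PigStorage(',')
        as (region:chararray, product:chararray, amount:double);

-- Group by a single field.
by_region = group sales by region;

-- Group by more than one field: the key becomes a tuple (region, product).
by_region_product = group sales by (region, product);

-- Each grouped relation has two fields: "group" (the key) and a bag
-- named after the input relation ("sales") holding the matching tuples.
describe by_region;
```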
4. Aggregate operations -
You can perform aggregate operations such as SUM, COUNT, MIN and MAX on the grouped result
L = foreach z generate group as <alias>, SUM(y.$1) as <alias>;
In 'z' the grouping key is exposed as a field named "group", and the grouped tuples sit in a bag named after the input relation ('y'), so the aggregate function takes a column projected from that bag (y.$1). You can rename either output column with "as"
5. dump/store - dump prints a relation to the console; store writes it to a file.
dump L;
store L into '<path>' using PigStorage('<delimiter>'); -- the using clause is optional; add it to set your own delimiter
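Steps 3 to 5 can be put together roughly as follows. This is only a sketch: the input file 'sales.csv', its schema and the output path 'output/region_totals' are hypothetical names chosen for illustration.

```pig
-- Hypothetical input and schema (assumptions for illustration).
sales = load 'sales.csv' using PigStorage(',')
        as (region:chararray, product:chararray, amount:double);
by_region = group sales by region;

-- "group" holds the key; sales.amount projects one column out of the bag
-- so the aggregate function receives a single-column bag.
totals = foreach by_region generate
             group             as region,
             SUM(sales.amount) as total_amount,
             COUNT(sales)      as num_rows;

-- Print to the console (best kept for small results).
dump totals;

-- Write comma-delimited output; the target directory must not already exist.
store totals into 'output/region_totals' using PigStorage(',');
```

Pig builds the plan lazily, so nothing actually runs until a dump or store statement is reached.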
6. describe -- shows the schema (data structure) of a relation
describe x;