Big Data Titbits

Monday, February 20, 2017

Apache Pig CLI basic commands

Basic Pig scripting steps -

Pig CLI (command-line environment)

You can start Pig in either local or mapreduce mode -

pig -x local -- local mode

In local mode, Pig expects all input files to be on the local file system.

pig -x mapreduce or pig -- distributed (mapreduce) mode, which works with HDFS files

Common Pig statements -

1. Every Pig statement that produces a relation, including load, needs an alias on the left-hand side; only statements such as store, dump and describe do not.

x = load '<path>/file' using PigStorage('<delimiter>') as (<structure of file>);

x -> the relation produced by the load statement; it holds the contents of the file.

PigStorage - defines the column delimiter

as - describes the data structure (schema)

2. foreach -> a very common statement that loops through the tuples in a relation and generates new tuples from their fields (think of a tuple as a record in a table, and a relation as a database table)

example -

y = foreach x generate $0.. ;

this will give you all columns in the relation in 'x'

y = foreach x generate $0..$4;

this will give you 1st to 5th columns in the relation in 'x'

You can perform computations and apply any business logic to the columns inside a foreach statement

example -

y = foreach x generate $0*10 as A, $1/10 as B, $2..;

This multiplies $0 by 10, divides $1 by 10, and passes all the remaining columns through as is.

Or you can combine columns like this -

y = foreach x generate $0*$1 as B, $2..;

You can also apply string, math and other functions and operators (such as FLATTEN, TOKENIZE, STRSPLIT etc.) to the columns inside foreach.

3. group by - groups the tuples by one or more fields in the relation

z = group y by $0;

This groups 'y' by column $0. Each tuple in 'z' has two fields: 'group', which holds the grouping key, and a bag named 'y', which holds all the tuples of 'y' that share that key.

4. Aggregate operations -

You can apply aggregate functions such as SUM, COUNT, MIN and MAX to the grouped result

L = foreach z generate group as <alias>, SUM(y.$1) as <alias>;

In aggregate statements, "generate" usually starts with the "group" field, which holds the grouping key. You can rename it, and the aggregated column, with "as".

5. dump/store - dump prints a relation to the console, store writes it to a file

dump L;

store L into '<path>' using PigStorage('<delimiter>'); -- PigStorage is optional; use it to set your own delimiter

6. describe - describes the data structure of a relation (alias names are case sensitive)

describe x;
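
To make the group and aggregate steps above concrete, here is a minimal plain-Java sketch of what grouping by $0 and applying SUM to $1 computes. It does not use Pig at all; the class name GroupSumSketch, the method groupAndSum and the sample tuples are made up for illustration, and each int[] simply stands in for a two-column Pig tuple.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class GroupSumSketch {

    // Each int[] stands in for a Pig tuple with two columns: {$0, $1}.
    static Map<Integer, Integer> groupAndSum(List<int[]> relation) {
        // Bucket the tuples by $0, then sum $1 inside every bucket.
        // A TreeMap keeps the keys sorted so the output is easy to read.
        return relation.stream().collect(
                Collectors.groupingBy(t -> t[0], TreeMap::new,
                        Collectors.summingInt(t -> t[1])));
    }

    public static void main(String[] args) {
        List<int[]> y = Arrays.asList(
                new int[]{1, 10},
                new int[]{2, 5},
                new int[]{1, 7});
        System.out.println(groupAndSum(y)); // prints {1=17, 2=5}
    }
}
```

In Pig the grouping and the aggregation are two separate statements; the sketch folds them into one call only to keep it short.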

Tuesday, September 20, 2016

Custom Record Delimiter in Apache Pig

In a big data project you will sometimes need to ingest data files that use a custom record (new line) delimiter. In Pig, the PigStorage class uses "\n" as the default record delimiter when loading data. If your data files contain "\n" inside data fields, you must either clean the files and remove the "\n" characters from the data, or create a new PigStorage class that reads the data file with a custom record delimiter.

Recommended steps to implement custom record delimiter in Pig

1. Create a custom InputFormat built on a record reader that uses the custom delimiter
2. Create a custom PigStorage class that uses the newly created InputFormat
3. Use the new PigStorage class to load the data file with the custom record delimiter

1. Custom InputFormat Java Class - This piece of code implements a custom InputFormat with ctrl-b or '\u0002' as the new record delimiter.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class LoaderInputFormat extends FileInputFormat<LongWritable, Text> {

    // ctrl-b marks the end of a record instead of the default new line.
    private static final String RECORD_DELIMITER = "\u0002";

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        // LineRecordReader breaks the input on the given delimiter bytes.
        byte[] recordDelimiterBytes = RECORD_DELIMITER.getBytes(StandardCharsets.UTF_8);
        return new LineRecordReader(recordDelimiterBytes);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Only uncompressed files can be split across mappers.
        CompressionCodec codec =
            new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        return codec == null;
    }
}

2. Custom PigStorage Class -

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.pig.builtin.PigStorage;

public class NewPigStorage extends PigStorage {

    public NewPigStorage(String delimiter) {
        super(delimiter);
    }

    public NewPigStorage() {
        super();
    }

    @SuppressWarnings("rawtypes")
    @Override
    public InputFormat getInputFormat() {
        return new LoaderInputFormat();
    }
}

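3. Using the custom PigStorage in Pig (the jar that contains the two classes has to be registered first) -

grunt> register <path to jar>;
grunt> recs = load '<data-file>' using NewPigStorage('<column delimiter>') as (<record structure>);
-- verify the record structure
grunt> dump recs;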
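
To show why a custom record delimiter helps, here is a small plain-Java sketch, independent of Hadoop and Pig, that splits a buffer on ctrl-b ('\u0002') so that a "\n" inside a field stays part of its record. The class name RecordSplitSketch, the method splitRecords and the sample data are made up for illustration; the real splitting is done by the record reader shown earlier.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordSplitSketch {

    // ctrl-b, the same record delimiter used by the custom InputFormat.
    static final char RECORD_DELIMITER = '\u0002';

    // Splits the data on the record delimiter only, so any "\n"
    // inside a field stays part of the record it belongs to.
    static List<String> splitRecords(String data) {
        List<String> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length(); i++) {
            if (data.charAt(i) == RECORD_DELIMITER) {
                records.add(data.substring(start, i));
                start = i + 1;
            }
        }
        // Keep a trailing record that is not followed by a delimiter.
        if (start < data.length()) {
            records.add(data.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "1,first line\nsecond line" + RECORD_DELIMITER
                + "2,plain" + RECORD_DELIMITER;
        List<String> records = splitRecords(data);
        System.out.println(records.size());             // prints 2
        System.out.println(records.get(0).contains("\n")); // prints true
    }
}
```

Splitting the same data on "\n" would cut the first record in two, which is exactly the problem the custom delimiter avoids.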

Sunday, September 18, 2016

Join optimization in Apache Pig

In the traditional Hadoop world, Apache Pig plays a crucial role in building the data pipeline. Pig supports a variety of user-friendly constructs and operators that enable ingestion, transformation and storage of the data passing through a batch process. It gives developers the power to orchestrate the data flow as a seamless sequence of steps that mimic equivalent SQL operations such as join, filter, group, order by and many others. In doing so it hides the low-level details of MapReduce.

In Pig we can join datasets in a couple of different ways

1. Reduce side join
2. Mapside join or Replicate join

Reduce side join -

This is the default join approach in Pig when you join two or more relations. It is also known as a shuffle join. In a typical MapReduce life cycle the datasets flow from input splits through the mappers to the reducers, and the join happens on the reducer nodes. This works because all tuples with the same key end up on the same reducer node. However, it is also the least efficient type of join in MapReduce, because the underlying data has to go through the full life cycle before the required fields are projected or reduced. It bears the overhead of I/O to temporary local files, data movement over the network, and sorting of data that spills from memory to disk.
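
The idea of a reduce side join can be sketched in plain Java, without Hadoop: every tuple from both inputs is first routed to a bucket by its join key (the "shuffle"), and the matching happens only afterwards, bucket by bucket (the "reduce"). The class name ReduceSideJoinSketch, the method join and the sample tuples are made up for illustration, and the join key is assumed to be the first field of each tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSideJoinSketch {

    // Inner join of two relations on their first field.
    static List<String> join(List<String[]> left, List<String[]> right) {
        // "Shuffle": every tuple of BOTH inputs is moved to a bucket by key.
        Map<String, List<String[]>> leftByKey = new TreeMap<>();
        Map<String, List<String[]>> rightByKey = new TreeMap<>();
        for (String[] t : left) {
            leftByKey.computeIfAbsent(t[0], k -> new ArrayList<>()).add(t);
        }
        for (String[] t : right) {
            rightByKey.computeIfAbsent(t[0], k -> new ArrayList<>()).add(t);
        }

        // "Reduce": per key, pair every left tuple with every right tuple.
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> entry : leftByKey.entrySet()) {
            List<String[]> matches = rightByKey.get(entry.getKey());
            if (matches == null) {
                continue; // no partner on the right side, so no output
            }
            for (String[] l : entry.getValue()) {
                for (String[] r : matches) {
                    joined.add(String.join(",", l) + "," + String.join(",", r));
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> left = Arrays.asList(
                new String[]{"k1", "a"}, new String[]{"k2", "b"});
        List<String[]> right = Arrays.asList(
                new String[]{"k1", "x"}, new String[]{"k3", "y"});
        System.out.println(join(left, right)); // prints [k1,a,k1,x]
    }
}
```

Note that both inputs are fully bucketed before a single output row is produced; on a cluster that bucketing is the disk and network cost described above.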

Mapside join or Replicated join -

Replicated join is a specialized type of join that works well when one of the joining datasets is small enough to fit into memory. In such situations MapReduce can perform the join on the mappers, which avoids the I/O and network traffic of the subsequent stages.

big = LOAD 'big_data' AS (b1,b2,b3);
tiny = LOAD 'tiny_data' AS (t1,t2,t3);
mini = LOAD 'mini_data' AS (m1,m2,m3);
C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';

Both of the smaller relations in the above join (tiny and mini) must fit into memory for the join to execute successfully; otherwise Pig raises an error. The large relation has to be listed first, followed by the small ones.
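
The map-side idea can be sketched in plain Java as a hash join: the small relation is loaded into an in-memory map once, and the big relation is streamed past it tuple by tuple, so the big relation is never bucketed or sorted. The class name ReplicatedJoinSketch, the method join and the sample tuples are made up for illustration, and the join key is assumed to be the first field of each tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReplicatedJoinSketch {

    // Inner join on the first field; 'tiny' must fit into memory.
    static List<String> join(List<String[]> big, List<String[]> tiny) {
        // Build phase: index the small relation by its join key.
        Map<String, List<String[]>> tinyByKey = new HashMap<>();
        for (String[] t : tiny) {
            tinyByKey.computeIfAbsent(t[0], k -> new ArrayList<>()).add(t);
        }

        // Probe phase: stream the big relation and look up each key.
        List<String> joined = new ArrayList<>();
        for (String[] b : big) {
            List<String[]> matches =
                    tinyByKey.getOrDefault(b[0], Collections.emptyList());
            for (String[] t : matches) {
                joined.add(String.join(",", b) + "," + String.join(",", t));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> big = Arrays.asList(
                new String[]{"k1", "a"},
                new String[]{"k2", "b"},
                new String[]{"k1", "c"});
        List<String[]> tiny = Arrays.asList(
                new String[]{"k1", "x"});
        System.out.println(join(big, tiny)); // prints [k1,a,k1,x, k1,c,k1,x]
    }
}
```

The in-memory map is the reason for the size limit: if the small relation does not fit, the build phase fails before any tuple of the big relation is read.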

Wednesday, June 3, 2015

Roadmap to a Successful Career in Big Data

The hype around Big Data has subsided and reality has kicked in. As more and more companies realize the true potential of Big Data technologies, they are joining in droves to experiment with, adopt and grow their Big Data portfolios. As a result there has been a noticeable spurt in Big Data job requirements in the marketplace. As with any technology that is new and hot, supply generally lags demand. In fact the shortage in Big Data skills is so severe that any average IT professional with a Hadoop certification on his or her resume can comfortably land an 80k salary job. Add a good chunk of DW/BI and/or Java skills or experience, and the salary ask can easily jump another 20%.

The need for a structured learning roadmap is critical for success in new Big Data Technologies. Here is a nice blog that goes step by step into the preparation and execution of this strategy.

URL to the Blog: The Roadmap to a successful Big Data Career

Ratikant Pratapsingh
Big Data Evangelist and Educator
@KnowledgePact.com, a Premier Big Data training and consulting firm

Wednesday, March 19, 2014

Bigdata TitBits

Aloha.. Welcome, Big Data lovers. This is my first attempt to share the stuff that I love to do at work and outside work. And that is "Big Data", of course. A big thanks to the data explosion that each one of us has been witnessing in recent years, and to the resulting rush to digest and derive insights from the otherwise mundane details of our blogs, machine logs and everything around us. It is truly amazing how our thought process has shifted so dramatically and broken all the barriers of structured data and formatted outputs to fall in love with an ocean of unstructured and semi-structured data. But as always happens, from the ocean of confusion and lack of standards the nuggets of truth will arise, and innovation will drive new tools and ammunition for Big Data to carve out a space for itself. In the meantime, commoners like you and me will surely like to dip our feet in the ocean and taste the salt that is going to keep our thirst unsatiated for a long time to come.. For the time being, I would say - Enjoy the ride !! Ratikant