In traditional Hadoop world, Apache Pig plays a crucial role in establishing the data pipe line . Pig supports a variety of use friendly constructs and operators that enables data ingestion, transformation and storage of the data passing through a batch process. It gives the developers the power to orchastrate the data flow in a seamless sequence of steps that could mimic equivalent sql functions like join, filter, group, order by and many other such tasks. In doing so it hides the low level abstraction of Map reduce.
In Pig we could join datasets in couple of different ways
1. Reduce side join
2. Mapside join or Replicate join
Reduce side join -
This is the default join approach used inside Pig when you join 2 or more relations. This is also known as shuffle join. In a typical Map Reduce life cycle, as the datasets flow from inputsplit, mappers to reducers, the join of the datasets will happen inside the reducer nodes. And this makes sense as all the similar keys end up in the same reducer node. But on the contrary, this is the most inefficient type of join in mapreduce. The reason being, underlying data has to traverse the full life cycle before the required fields get projected or reduced. It has to bear the overhead of IO to temp local files, data movement over the network and sorting operation on memory spilled over disk.
Mapside join or Replicated join -
Replicated join is a specialized type of join that works well when one of the joining datasets is small enough to into the memory. In such situations Map Reduce can perform the join on the mappers and reduces the overhead of IO and network trafiic during the subsequent stages.
big = LOAD 'big_data' AS (b1,b2,b3);
tiny = LOAD 'tiny_data' AS (t1,t2,t3);
mini = LOAD 'mini_data' AS (m1,m2,m3);
C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
Both the smaller relations in the above join must fit into the memory for the join to execute successfully. Otherwise error will be generated
No comments:
Post a Comment