Working with Spark – Big Data Hadoop MapReduce

Introduction


Before moving to the Spark RDD concept, which is the foundation of Spark, we need to understand the concept of Hadoop MapReduce. RDDs improve on Hadoop MapReduce by keeping intermediate results in memory, which is how Spark can run jobs up to 100 times faster. In this blog post we are going to discuss Hadoop MapReduce to understand the concept only.

Hope it will be interesting.

 

What is MapReduce


MapReduce is a software framework and programming model for processing huge volumes of data. Hadoop can run MapReduce programs written in different languages such as Java, Ruby, Python, and C++.


How MapReduce Works


As the name suggests, MapReduce is a combination of Mapping and Reducing.

 

We can divide MapReduce into the following phases.


· Input Splits

· Mapping

· Shuffling

· Reducing

 

Before examining those phases, let’s take an example.

 

Please consider the following input data to a MapReduce word-count program.

 

Mayuree how are you

We welcome Mayuree in India

Mayuree is a best developer

 

Please follow the diagram below.

[Diagram: the sample input flowing through the Input Splits, Mapping, Shuffling, and Reducing phases]

Final Output:

Mayuree 3
how 1
are 1
you 1
We 1
welcome 1
in 1
India 1
is 1
a 1
best 1
developer 1
 

Input Splits

An input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map task.
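To make this concrete, here is a minimal plain-Python sketch. It is not how Hadoop actually splits files (Hadoop splits by byte ranges, typically one split per HDFS block); for this sketch we simply treat each line of our sample input as one split.

# Sample input from the example above. For this sketch we treat
# each line as one input split; real Hadoop splits are byte ranges
# of the input file, roughly one per HDFS block.
input_data = (
    "Mayuree how are you\n"
    "We welcome Mayuree in India\n"
    "Mayuree is a best developer"
)

splits = input_data.split("\n")  # one "split" per line
print(splits)
# ['Mayuree how are you', 'We welcome Mayuree in India', 'Mayuree is a best developer']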

Mapping

This is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function, which produces output key/value pairs.
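For the word-count example, the mapping function reads a split and emits a (word, 1) pair for every word it finds. A minimal plain-Python sketch (Hadoop’s real Mapper API is different; this only illustrates the idea):

# Word-count mapper: emit the key/value pair (word, 1) for every
# word in a split.
def mapper(split):
    for word in split.split():
        yield (word, 1)

splits = ["Mayuree how are you",
          "We welcome Mayuree in India",
          "Mayuree is a best developer"]

mapped = [pair for s in splits for pair in mapper(s)]
print(mapped[:4])
# [('Mayuree', 1), ('how', 1), ('are', 1), ('you', 1)]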

Shuffling

This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output, grouping all values that share the same key.
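Continuing the sketch, shuffling groups every (word, 1) pair by its key, so each word ends up with the list of all its values (in real Hadoop this grouping happens across the network between the map and reduce tasks):

from collections import defaultdict

# Shuffle: group the pairs emitted by the mappers by key.
mapped = [('Mayuree', 1), ('how', 1), ('are', 1), ('you', 1),
          ('We', 1), ('welcome', 1), ('Mayuree', 1), ('in', 1),
          ('India', 1), ('Mayuree', 1), ('is', 1), ('a', 1),
          ('best', 1), ('developer', 1)]

shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

print(shuffled['Mayuree'])  # [1, 1, 1]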

Reducing

In this phase, the output values from the Shuffling phase are aggregated. The reducer combines the values for each key and returns a single output value per key. In short, this phase summarizes the complete dataset.
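Finally, the reducer collapses each key’s list of values into a single number by summing it, which yields the word counts shown in the final output above. Again a plain-Python sketch, not the real Hadoop Reducer API:

# Reduce: sum each key's list of 1s to get the final word count.
shuffled = {'Mayuree': [1, 1, 1], 'how': [1], 'are': [1], 'you': [1],
            'We': [1], 'welcome': [1], 'in': [1], 'India': [1],
            'is': [1], 'a': [1], 'best': [1], 'developer': [1]}

def reducer(key, values):
    return (key, sum(values))

counts = dict(reducer(k, v) for k, v in shuffled.items())
print(counts['Mayuree'])  # 3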


Hope you like it.

 

