Working with Spark – Spark RDD

Introduction

In my previous blog post, we discussed Hadoop MapReduce. In this blog post, we are going to discuss a fundamental Spark concept: the RDD.

RDD stands for Resilient Distributed Dataset. RDDs are the elements that run and operate on multiple nodes to perform parallel processing on a cluster.

Hope it will be interesting.

What is an RDD?


According to the Apache Spark documentation:

"Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat".

RDDs are immutable elements, which means once we create an RDD we cannot change it. RDDs are also fault tolerant: in case of any failure, they recover automatically. We can apply multiple operations on these RDDs to achieve a certain task.
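
Here is a minimal sketch of both creation methods in Scala (assuming Spark is available on the classpath; the HDFS path is just a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("RddCreation").setMaster("local[*]")
val sc   = new SparkContext(conf)

// Way 1: parallelize an existing collection in the driver program
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Way 2: reference a dataset in an external storage system
// (placeholder path; any Hadoop-supported filesystem works)
val lines = sc.textFile("hdfs:///data/input.txt")
```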

We can apply operations on an RDD in two ways:

1. Transformation

2. Action


Transformation


These are operations that are applied on an RDD to create a new RDD. filter, groupBy, and map are examples of transformations.

A Spark transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output. A new RDD is created each time we apply a transformation; the input RDDs cannot be changed, since RDDs are immutable in nature.
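
A small sketch of this (reusing the SparkContext `sc` from the example above):

```scala
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Each transformation returns a brand-new RDD; `numbers` is never modified
val doubled = numbers.map(_ * 2)         // 2, 4, 6, 8, 10
val evens   = doubled.filter(_ % 4 == 0) // 4, 8
```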


Applying transformations builds up an RDD lineage, which records all the parent RDDs of the final RDD(s). The RDD lineage is also known as the RDD operator graph or RDD dependency graph. It is a logical execution plan, i.e., a Directed Acyclic Graph (DAG) of all the parent RDDs of an RDD.
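
Spark can print the lineage it has recorded for an RDD via `toDebugString`. A quick sketch:

```scala
val result = sc.parallelize(1 to 10)
  .map(_ * 2)
  .filter(_ > 5)

// Prints the logical plan (the DAG of parent RDDs) for `result`
println(result.toDebugString)
```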


Transformations are lazy in nature, i.e., they are not executed immediately; they execute only when we call an action.
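
A sketch of this laziness (the file path is again a placeholder):

```scala
val logLines = sc.textFile("hdfs:///data/app.log")  // nothing is read yet
val errors   = logLines.filter(_.contains("ERROR")) // still nothing executed

// Only this action triggers the actual read and filter
val numErrors = errors.count()
```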


After the transformation, the resultant RDD is always different from its parent RDD. It can be smaller (filter(), distinct(), sample()), bigger (flatMap(), union(), cartesian()) or the same size (map()).
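
To make the size differences concrete:

```scala
val lines = sc.parallelize(Seq("spark rdd", "spark core"))

val words   = lines.flatMap(_.split(" ")) // bigger: 2 lines -> 4 words
val sparks  = words.filter(_ == "spark")  // smaller: drops non-matching words
val shouted = words.map(_.toUpperCase)    // same size: strict one-to-one mapping
```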


There are two types of transformations:

Narrow Transformation


In a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD. Only a limited subset of the partitions is needed to calculate the result. map() and filter() are examples of narrow transformations.
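
A minimal sketch (the choice of 4 partitions is arbitrary):

```scala
// Each output partition is computed from exactly one parent partition,
// so no data has to move between nodes (no shuffle)
val nums    = sc.parallelize(1 to 100, numSlices = 4)
val squared = nums.map(n => n * n)
val bigOnes = squared.filter(_ > 1000)
```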



Wide Transformation


In a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD. groupByKey() and reduceByKey() are examples of wide transformations.
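
A minimal sketch on a pair RDD:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// All values for a key may sit in different parent partitions, so Spark
// must shuffle data across the cluster to bring them together
val grouped = pairs.groupByKey()       // ("a", [1, 3]), ("b", [2, 4])
val summed  = pairs.reduceByKey(_ + _) // ("a", 4), ("b", 6)
```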


Summary: Basic Facts about Spark RDDs


  • Resilient Distributed Datasets (RDDs) are an immutable collection of elements, used as the fundamental data structure in Apache Spark.
  • We can create RDDs in two ways: by parallelizing an existing collection or by referencing an external dataset.
  • RDDs are immutable, i.e., read-only data structures, so we can't change the original RDD; but we can always create a new one.
  • RDDs support two types of Spark operations: transformations and actions (see the word-count sketch after this list).
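
Putting it all together, a small end-to-end word-count sketch (the input path is a placeholder):

```scala
val counts = sc.textFile("hdfs:///data/input.txt") // create an RDD from storage
  .flatMap(_.split(" "))                           // narrow transformation
  .map(word => (word, 1))                          // narrow transformation
  .reduceByKey(_ + _)                              // wide transformation (shuffle)

counts.collect().foreach(println)                  // action: triggers execution
```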


Hope you like it.

