Working with Spark – Spark RDD
Introduction
In my previous blog post, we discussed Hadoop MapReduce. In this blog post we are going to discuss a fundamental Spark concept: the RDD.
RDD stands for Resilient Distributed Dataset; RDDs are the elements that run and operate on multiple nodes to do parallel processing on a cluster.
Hope it will be interesting.
What is RDD?
According to the Apache Spark documentation:
"Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat."
RDDs are immutable elements, which means once we create an RDD we cannot change it. RDDs are fault tolerant as well; in case of any failure, they recover automatically. We can apply multiple operations on these RDDs to achieve a certain task.
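To make the two creation methods from the quote concrete, here is a minimal sketch in Scala, assuming a spark-shell session where the SparkContext `sc` is already defined; the HDFS path is a hypothetical placeholder.

// 1. Parallelizing an existing collection in the driver program
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Referencing a dataset in external storage (hypothetical HDFS path)
val lines = sc.textFile("hdfs:///user/demo/data.txt")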
We can apply operations on an RDD in two ways:
1. Transformation
2. Action
Transformation
These are the operations which are applied on an RDD to create a new RDD. filter(), groupBy() and map() are examples of transformations.
A Spark transformation is a function that produces a new RDD from an existing RDD. It takes an RDD as input and produces one or more RDDs as output. Every time we apply a transformation, it creates a new RDD; the input RDD cannot be changed, since RDDs are immutable in nature.
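A small sketch of this immutability, again assuming `sc` from a spark-shell session: the transformation returns a new RDD and leaves its parent untouched.

val original = sc.parallelize(Seq(1, 2, 3, 4, 5))
val doubled  = original.map(_ * 2)          // a brand-new RDD

println(original.collect().mkString(", "))  // 1, 2, 3, 4, 5 (unchanged)
println(doubled.collect().mkString(", "))   // 2, 4, 6, 8, 10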
Applying transformations builds an RDD lineage, which records all the parent RDDs of the final RDD(s). RDD lineage is also known as the RDD operator graph or RDD dependency graph. It is a logical execution plan, i.e., a Directed Acyclic Graph (DAG) of all the parent RDDs of an RDD.
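We can inspect this lineage with the RDD's toDebugString method. A minimal sketch, assuming `sc` from a spark-shell session:

val base     = sc.parallelize(1 to 10)
val filtered = base.filter(_ % 2 == 0)
val mapped   = filtered.map(_ * 10)

// Prints the chain of parent RDDs behind `mapped`, i.e. its DAG
println(mapped.toDebugString)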
Transformations are lazy in nature, i.e., they are not executed immediately; they only execute when we call an action.
After the transformation, the resultant RDD is always different from its parent RDD. It can be smaller (filter(), distinct(), sample()), bigger (flatMap(), union(), cartesian()) or the same size (map()).
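The sketch below (assuming `sc` as before) shows both points: nothing runs while we only chain transformations, and each resultant RDD can differ in size from its parent. Only the final action forces execution.

val lines = sc.parallelize(Seq("spark makes rdds", "rdds are lazy"))
val words = lines.flatMap(_.split(" "))  // bigger than its parent
val short = words.filter(_.length <= 4)  // smaller than its parent
val upper = short.map(_.toUpperCase)     // same size as its parent

// The action below triggers the whole chain of transformations
println(upper.collect().mkString(", "))  // RDDS, RDDS, ARE, LAZY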
There are two types of transformations.
Narrow Transformation
In a narrow transformation, all the elements that are required to compute the records in a single partition live in a single partition of the parent RDD. A limited subset of partitions is used to calculate the result, so no data needs to move between partitions. Narrow transformations are the result of map() and filter().
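A minimal sketch of narrow transformations, assuming `sc` as before. Each output partition depends on exactly one input partition, so no shuffle is needed:

val nums  = sc.parallelize(1 to 8, 4)   // 4 partitions
val evens = nums.filter(_ % 2 == 0)     // narrow: filter()
val sqrs  = evens.map(n => n * n)       // narrow: map()

println(sqrs.collect().mkString(", "))  // 4, 16, 36, 64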
Wide Transformation
In a wide transformation, all the elements that are required to compute the records in a single partition may live in many partitions of the parent RDD, so data must be shuffled across partitions. Wide transformations are the result of groupByKey() and reduceByKey().
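A minimal sketch of a wide transformation, assuming `sc` as before. reduceByKey() must gather all values for a key from every partition, so it shuffles data across partition boundaries:

val pairs  = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("b", 2)))
val counts = pairs.reduceByKey(_ + _)   // wide: triggers a shuffle

counts.collect().foreach(println)       // (a,2) and (b,3)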
Summarizing the basic facts about Spark RDD
- Resilient Distributed Datasets (RDDs) are basically an immutable collection of elements, used as the fundamental data structure in Apache Spark.
- We can create RDDs by two methods: parallelizing a collection & referencing external datasets.
- RDDs are immutable, i.e., read-only data structures, so you can't change the original RDD. But we can always create a new one.
- RDD supports two types of Spark operations: Transformations & Actions.
Hope you like it.