Lecture 10
Introduction to Spark
Refer slide time: (0:13)
Introduction to Spark.
The need for Spark. Apache Spark is a big data analytics framework that was originally developed at the University of California, Berkeley, in the AMPLab, in 2012. Since then, it has gained a lot of traction, both in academia and in industry. It is another system for big data analytics. Now, before Spark, MapReduce was already in use. So why is MapReduce not good enough? That is what we are going to explore, to understand the need for the Spark system. Before that, we have to understand that MapReduce simplifies batch processing on large commodity clusters.
So the solution which Spark has given is in the form of an abstract data type, which is called the 'Resilient Distributed Dataset'. So Spark provides a new way of supporting this abstract data type, called the resilient distributed dataset, or in short, RDD. Now, we will see in this part of the discussion how these RDDs solve the drawbacks of batch-processing MapReduce, by providing more efficient data sharing and by supporting iterative and interactive applications.
Now, these particular RDDs, resilient distributed datasets, are immutable. Immutable means they cannot be changed: once an RDD is formed, it stays as it is. Such an immutable resilient distributed dataset can be partitioned in various ways across different cluster nodes; an RDD is a partitioned collection of records. For example, if we have a cluster of machines connected with each other, the RDD can be partitioned and stored at different places, in different segments. Hence, an immutable, partitioned collection of records is possible, and that is what is called an 'RDD'. Now, another thing is that RDDs are built through coarse-grained transformations, such as map, join and so on. These RDDs can be cached for efficient reuse, and we are going to see that a lot of further operations can be performed on them. So, again, let us summarize: Spark has given its solution in the form of an abstract data type, which is called an 'RDD', and an RDD can be built using transformations and can also be changed into another form; that is, one RDD can become another RDD by applying various transformations, such as map, join and so on. That we will see in due course. These RDDs are immutable, partitioned collections of records; once an RDD is formed, it is immutable, meaning we cannot change it, and the entire collection of records can be stored in a convenient manner on the cluster system. These RDDs can be cached in memory for efficient reuse. So, as we have seen, MapReduce lacks this data sharing, and now, using RDDs, Spark will provide efficient sharing of data in the form of RDDs.
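To make this concrete, here is a minimal sketch in Scala of what an immutable, partitioned RDD looks like in code. It assumes a SparkContext named sc (as provided by the Spark shell); the numbers and partition count are illustrative only.

// "sc" is an existing SparkContext, e.g. the one created by spark-shell.
val nums = sc.parallelize(1 to 1000, numSlices = 4)  // a collection partitioned across the cluster
println(nums.getNumPartitions)                       // prints 4

// RDDs are immutable: a transformation returns a NEW RDD,
// it never modifies the original one.
val doubled = nums.map(_ * 2)
doubled.cache()                                      // keep the derived RDD in memory for reuse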
Now let us see, through a word count example, the need for Spark. In the word count application, the dataset is stored in the HDFS file system and is read from there. After reading this dataset from the file system, the reading operation will build the RDDs, which is shown here in block number 1. That means once the data is read, the RDDs are built from the word count dataset. Once these RDDs are built, they can be stored at different places, as an immutable, partitioned collection, and now we can perform various operations on them. We know that we can apply various transformations to these RDDs, and the first transformation which we are going to perform on them is the 'map function'. The map function for the word count program is applied on the different RDD partitions which are stored. After applying the map function, the output will be kept in memory, and then the reduce function will be applied on these RDDs, that is, on the RDDs which are the output of the map transformation. The result of this reduce will remain in cache, so that it can be used by different applications. So you can see that these transformations change the RDDs from one form to another: after reading from the file, the data becomes an RDD; after applying the map function, it changes to another RDD; and after applying the reduce function, it changes to yet another form, and the final output remains in cache memory. The output will remain in the cache, so that whatever application requires it, this output can be used. So this particular pipeline which we have shown is quite easy to understand, and it is convenient to manage and to store, in a partitioned-collection manner, on the cluster computing system.
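The pipeline described above can be written as a short Scala program. This is a minimal sketch; the HDFS path is a placeholder, and sc is assumed to be an existing SparkContext.

// Block 1: read the dataset from HDFS; this builds the initial RDD.
val lines  = sc.textFile("hdfs:///data/wordcount.txt")
// Map stage: split lines into words and emit (word, 1) pairs (new RDDs).
val words  = lines.flatMap(line => line.split(" "))
val pairs  = words.map(word => (word, 1))
// Reduce stage: sum the counts per word.
val counts = pairs.reduceByKey(_ + _)
// Keep the result in memory so other applications and queries can reuse it.
counts.cache()
// An action materialises the pipeline and returns results to the driver.
counts.take(10).foreach(println)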
Refer slide time: (18:11)
So, we have seen that these RDDs have simplified this particular task and have also made the operation efficient. Hence, an RDD is an immutable, partitioned collection of records, built through the coarse-grained transformations that we have seen in the previous example. Now, another question arises. MapReduce stores the intermediate results of the map, before they are used in the reduce function, into the HDFS file system, to ensure fault tolerance. Since Spark does not use HDFS to store these intermediate results, but rather keeps them in memory, how does Spark ensure fault tolerance? That we have to understand now. The concept which Spark uses for fault tolerance is called 'lineage'. So Spark uses lineage to achieve fault tolerance. What does Spark do? It logs the coarse-grained operations which are applied to the partitioned dataset. Meaning to say, all the operations, like reading a file (which becomes an RDD), then making a transformation on that RDD using the map function, then another transformation using the reduce function, the join function and so on, all these form the coarse-grained operations, and they are logged before being applied. So if the data is lost, or if the system or a node crashes, Spark simply recomputes the lost partitions whenever there is a failure; if there is no failure, obviously no extra cost is incurred in this process.
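As a small illustration of lineage, every RDD can print the chain of coarse-grained transformations it was built from; Spark uses exactly this recorded chain to recompute a lost partition. The path below is a placeholder.

val lines  = sc.textFile("hdfs:///data/wordcount.txt")
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
// Print the lineage (the logged coarse-grained operations) of this RDD.
println(counts.toDebugString)
// If a partition of "counts" is lost, only the transformations needed to
// rebuild that partition are re-executed; with no failure there is no extra cost.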
Refer slide time: (20:37)
Let us see this through an example.
So again, let us explain that lineage, as we said, is a log of the coarse-grained operations. This lineage will keep a record of all the coarse-grained operations which are applied, and that record will be kept in a log file. Now we will see how, using this lineage, or log file, fault tolerance can be achieved.
Now we will see what more we can do here in Spark. RDDs provide various operations, and all these operations are divided into two different categories. The first category is called 'transformations', which we can apply as RDD operations. The second is called 'actions', which we can also perform as RDD operations. As far as transformations are concerned, RDDs support them in the form of filter, join, map, groupBy; all of these are different transformations which RDDs support in the Spark system. The other set of operations which RDDs support is called 'actions'. Actions, in the sense that whenever some operation produces an output, it is called an 'action'; for example, count, print and so on. Then another thing which Spark provides is 'control operations', at the programmer level.
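A small Scala sketch of these two kinds of RDD operations follows; the sample data is made up for illustration.

val logs   = sc.parallelize(Seq("INFO ok", "ERROR disk", "INFO ok", "ERROR net"))

// Transformations: lazy, each one returns a new RDD and nothing runs yet.
val errors = logs.filter(_.startsWith("ERROR"))
val byWord = logs.flatMap(_.split(" ")).groupBy(identity)

// Actions: trigger the actual execution and produce output.
println(errors.count())            // e.g. 2
errors.collect().foreach(println)  // brings the records back to the driver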
So there are two interesting controls which are provided by Spark to the programmers. The first is called 'partitioning'. Spark gives you control over how you partition your RDDs across the different cluster nodes. The second one is called 'persistence'. Persistence allows you to choose whether you want to persist RDDs onto the disk or not. By default, an RDD is not persisted, but if you choose persistence, then the RDDs will be stored in HDFS. Hence, both the persistence and partitioning controls are given to the programmer, that is, the user, by the Spark system.
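These two controls can be expressed in code roughly as follows; this is a sketch, with an illustrative number of partitions and storage level.

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val pairs = sc.textFile("hdfs:///data/wordcount.txt")
              .flatMap(_.split(" "))
              .map((_, 1))

// 1. Partitioning: the programmer chooses how the RDD is spread over the cluster.
val partitioned = pairs.partitionBy(new HashPartitioner(8))

// 2. Persistence: the programmer chooses whether and where the RDD is kept.
partitioned.persist(StorageLevel.MEMORY_AND_DISK)   // by default an RDD is not persisted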
There are various other applications where Spark can be used. These applications include Twitter spam classification, algorithms for traffic prediction, the k-means clustering algorithm, alternating least squares matrix factorization, in-memory OLAP aggregation on Hive data, and SQL on Spark.
Now, these executors will execute the tasks and give the results back to the driver.
Refer slide time: (27:22)
And then again, the driver will send further operations or functions.
Refer slide time: (27:34)
And these particular functions are used in the shuffle, and the results are again given back to the driver.
So, in this diagram, which I have already explained, these are all RDDs, and the arrows are transformations from one RDD to another RDD; and if there is an action on these RDDs, that action will be performed.
Refer slide time: (30:44)
So let us see a simple word count application, which is written in Spark, using FlumeJava.
Refer slide time: (32:44)
Let us see the Spark implementation in more detail.
So the Spark idea is an expressive computing system which is not limited by the MapReduce model. That means programming beyond MapReduce can now also be done in Spark. Spark will make use of the system memory and avoid saving intermediate results to the disk; it will cache them for repetitive queries. That means the output of the transformations will remain in the cache, so that iterative applications can make use of this fast, efficient data sharing.
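For example, a dataset that is queried repeatedly can be cached once and then served from memory for every later query; the file path and query strings below are only illustrative.

val data = sc.textFile("hdfs:///data/logs.txt").cache()

val q1 = data.filter(_.contains("ERROR")).count()     // first action reads the file and caches it
val q2 = data.filter(_.contains("timeout")).count()   // answered from the in-memory copy
val q3 = data.map(_.length).reduce(_ + _)             // also answered from memory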
Refer slide time: (33:31)
Spark is also compatible with the Hadoop system. An RDD is an abstraction, as I told you: a resilient distributed dataset, a partitioned collection of records, spread across the cluster, read-only, with caching of datasets in memory possible at different storage levels.
As I told you, transformations and actions are the two kinds of operations RDDs support. Transformations include map, filter, join; they are lazy operations. Actions include count, collect, save; they trigger execution. Now let us go and discuss the Spark components.
Different tasks are then executed. Similarly, for job scheduling: once the operations on an RDD are given, Spark automatically builds the DAG, and the DAG is given to the DAG scheduler. The DAG scheduler will split the graph into stages of tasks and submit each stage as it is ready. So a task set is created and given to the task scheduler, which launches the tasks via the cluster manager and retries the failed or straggling tasks. These tasks are given to the workers, as we have seen in the previous slide, and the workers will create threads and execute the tasks there.
There are different APIs available, and you can program against these APIs using Java, Scala or Python. There is also an interactive interpreter available, accessible through Scala and Python. Standalone applications are also possible; there are many such applications. As for performance, Java and Scala are faster, thanks to static typing.
Refer slide time: (40:26)
Now let us see, in the hands-on session, how we can perform these operations using Spark. We can run Spark as a Scala shell, so a Spark shell will be created.
Refer slide time: (40:44)
And we can download the dataset file; then the application can be built using the package command, and this particular task, or data file, can be submitted to the master node of the Spark system. So Spark can store it directly into the system and can now perform various operations on this particular dataset using a Scala program.
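A rough sketch of this hands-on flow is given below; the shell commands are shown as comments, and the paths, master URL, class and jar names are placeholders, not the exact ones used in the demo.

//   $ spark-shell                      -- interactive Scala shell with "sc" predefined
//   $ sbt package                      -- build a standalone application jar
//   $ spark-submit --master spark://master:7077 --class WordCount \
//         target/scala-2.12/wordcount_2.12-0.1.jar
//
// Inside spark-shell, the same word count can be run interactively:
val counts = sc.textFile("hdfs:///data/wordcount.txt")
               .flatMap(_.split(" "))
               .map((_, 1))
               .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///data/wordcount-output")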