Analyze large datasets and discover techniques for testing, immunizing, and
parallelizing Spark jobs
Rudy Lai
Bartłomiej Potaczek
BIRMINGHAM - MUMBAI
Hands-On Big Data Analytics with
PySpark
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any
means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or
reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the
information contained in this book is sold without warranty, either express or implied. Neither the authors nor Packt Publishing or
its dealers and distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this
book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this
book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
ISBN 978-1-83864-413-0
www.packtpub.com
mapt.io
Mapt is an online digital library that gives you full access to over 5,000
books and videos, as well as industry-leading tools to help you plan your
personal development and advance your career. For more information,
please visit our website.
Why subscribe?
Spend less time learning and more time coding with practical eBooks
and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
At www.packt.com, you can also read a collection of free technical articles, sign
up for a range of free newsletters, and receive exclusive discounts and
offers on Packt books and eBooks.
Contributors
About the authors
Colibri Digital is a technology consultancy company founded in 2015 by
James Cross and Ingrid Funie. The company works to help its clients
navigate the rapidly changing and complex world of emerging technologies,
with deep expertise in areas such as big data, data science, machine
learning, and cloud computing. Over the past few years, they have worked
with some of the world's largest and most prestigious companies, including
a tier 1 investment bank, a leading management consultancy group, and one
of the world's most popular soft drinks companies, helping each of them to
better make sense of their data, and process it in more intelligent ways. The
company lives by its motto: Data -> Intelligence -> Action.
This book will help you implement some practical and proven techniques to
improve aspects of programming and administration in Apache Spark. You
will not only learn how to use Spark and the Python API to create high-
performance analytics with big data, but also discover techniques to test,
immunize, and parallelize Spark jobs.
This book covers the installation and setup of PySpark, RDD operations,
big data cleaning and wrangling, and aggregating and summarizing data
into useful reports. You will learn how to source data from all popular data
hosting platforms, including HDFS, Hive, JSON, and S3, and deal with
large datasets with PySpark to gain practical big data experience. This book
will also help you to work on prototypes on local machines and
subsequently go on to handle messy data in production and on a large scale.
Who this book is for
This book is for developers, data scientists, business analysts, or anyone
who needs to reliably analyze large amounts of large-scale, real-world data.
Whether you're tasked with creating your company's business intelligence
function, or creating great data platforms for your machine learning models,
or looking to use code to magnify the impact of your business, this book is
for you.
What this book covers
Chapter 1, Installing PySpark and Setting Up Your Development Environment, covers an overview of PySpark, setting up Spark on Windows with PySpark, and the core concepts in Spark and PySpark.
Chapter 2, Getting Your Big Data into the Spark Environment Using RDDs, explains how to get your big data into the Spark environment using RDDs, using a wide array of tools to interact with and modify this data so that useful insights can be extracted.
Chapter 4, Aggregating and Summarizing Data into Useful Reports, describes how to calculate averages with the map and reduce functions, perform faster average computations, and use pivot tables with key/value pair data points.
and the operations of Spark API that should be used. We will then test
operations that cause a shuffle in Apache Spark to know which operations
should be avoided.
the correct format and also save data in plain text using Spark's standard
API.
Chapter 12, Testing Apache Spark Jobs, goes into further detail about testing Apache Spark jobs in different versions of Spark.
GraphX API. We will carry out experiments with the Edge API and Vertex
API.
To get the most out of this book
This book requires some basic programming experience in PySpark,
Python, Java, and Scala.
Download the example code files
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of your archive extraction tool.
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Big-Data-Analytics-with-PySpark. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Download the color images
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781838644130_ColorImages.pdf.
Conventions used
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names,
filenames, file extensions, pathnames, dummy URLs, user input, and
Twitter handles. Here is an example: "Mount the downloaded WebStorm-
10*.dmg disk image file as another disk in your system."
Bold: Indicates a new term, an important word, or words that you see on
screen. For example, words in menus or dialog boxes appear in the text like
this. Here is an example: "Select System info from
the Administration panel."
Warnings or important notes appear like this.
Tips and tricks appear like this.
Get in touch
Feedback from our readers is always welcome.
Errata: Although we have taken every care to ensure the accuracy of our
content, mistakes do happen. If you have found a mistake in this book, we
would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on
the internet, we would be grateful if you would provide us with the location
address or website name. Please contact us at copyright@packt.com with a link
to the material.
Installing PySpark and Setting Up Your Development Environment
In this chapter, we will cover the following topics:
An overview of PySpark
Setting up Spark on Windows and PySpark
Core concepts in Spark and PySpark
An overview of PySpark
Before we start with installing PySpark, which is the Python interface for
Spark, let's go through some core concepts in Spark and PySpark. Spark is
the latest big data tool from Apache, which can be found by simply going
to http://spark.apache.org/. It's a unified analytics engine for large-scale data
processing. This means that, if you have a lot of data, you can feed that data
into Spark to create some analytics at a good speed. If we look at the
running times between Hadoop and Spark, Spark is more than a hundred
times faster than Hadoop. It is very easy to use because there are very good
APIs for use with Spark.
Scala
Java
Python
Python works with Spark through the PySpark integration, which we will cover soon. For now, we will import some libraries from the PySpark package to help us work with Spark. The best way for us to understand Spark is to look at an example, as shown in the following code:
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
In line two and line three, we have a MapReduce function. In line two, we have mapped the length function using a lambda function to each line of data.txt. In line three, we have called a reduce function to add all the lineLengths together to produce the total length of the document. While Python's lines is a variable that contains all the lines in data.txt, under the hood, Spark is actually handling the distribution of fragments of data.txt across two different instances on the Spark cluster, and is handling the MapReduce computation over all of these instances.
Spark SQL
Spark SQL is one of the four components on top of the Spark platform, as
we saw earlier in the chapter. It can be used to execute SQL queries or read
data from any existing Hive installation, where Hive is a database
implementation also from Apache. Spark SQL looks very similar to
MySQL or Postgres. The following code snippet is a good example:
#Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
#+----+-------+
#| age|   name|
#+----+-------+
#|null|Jackson|
#|  30| Martin|
#|  19| Melvin|
#+----+-------+
You'll need to select all the columns from a certain table, such as people, and
using the Spark objects, you'll feed in a very standard-looking SQL
statement, which is going to show an SQL result much like what you would
expect from a normal SQL implementation.
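As a minimal sketch of such a query (assuming the people temporary view registered above and spark as an existing SparkSession):

spark.sql("SELECT * FROM people").show()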
Let's look at how DataFrames can be used. The following code snippet is a
quick example of a DataFrame:
# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
#+----+-------+
#| age|   name|
#+----+-------+
#|null|Jackson|
#|  30| Martin|
#|  19| Melvin|
#+----+-------+
In the next section, we will see how we can set up Spark on Windows, and
set up PySpark as the interface.
Setting up Spark on Windows and
PySpark
Complete the following steps to install PySpark on a Windows machine:
3. Download and install Anaconda. If you need help, you can go through
the following tutorial: https://medium.com/@GalarnykMichael/install-python-on-w
indows-anaconda-c63c7c3d1444.
4. Close the previous command line and open a new command line.
5. Go to the Apache Spark website (https://spark.apache.org/).
6. To download Spark, choose the following from the drop-down menu:
A recent Spark release
A proper package type
10. Make sure you have Java installed on your machine. You can use the
following command to see the Java version:
java --version
12. Let's edit our environment variables so that we can open Spark in any directory, as follows:
setx SPARK_HOME C:\opt\spark\spark-2.1.0-bin-hadoop2.7
setx HADOOP_HOME C:\opt\spark\spark-2.1.0-bin-hadoop2.7
setx PYSPARK_DRIVER_PYTHON ipython
setx PYSPARK_DRIVER_PYTHON_OPTS notebook
13. Close the Terminal, open a new one, and type the following command:
--master local[2]
The PYSPARK_DRIVER_PYTHON and the PYSPARK_DRIVER_PYTHON_OPTS parameters are used to launch
the PySpark shell in Jupyter Notebook. The --master parameter is used for setting the
master node address.
14. The next thing to do is to run the PySpark command in the bin folder:
.\bin\pyspark
SparkContext
SparkConf
Spark shell
SparkContext
SparkContext is an object or concept within Spark. It is a big data analytical
engine that allows you to programmatically harness the power of Spark.
The power of Spark can be seen when you have a large amount of data that
doesn't fit into your local machine or your laptop, so you need two or more
computers to process it. You also need to maintain the speed of processing
this data while working on it. We not only want the data to be split among a
few computers for computation; we also want the computation to be
parallel. Lastly, you want this computation to look like one single
computation.
Let's consider an example where we have a large contact database that has
50 million names, and we might want to extract the first name from each of
these contacts. Obviously, it is difficult to fit 50 million names into your
local memory, especially if each name is embedded within a larger contacts
object. This is where Spark comes into the picture. Spark allows you to give
it a big data file, and will help in handling and uploading this data file,
while handling all the operations carried out on this data for you. This
power is managed by Spark's cluster manager, as shown in the following
diagram:
The cluster manager manages multiple workers; there could be 2, 3, or even 100. The main point is that Spark's technology helps in managing this cluster of workers, and you need a way to control how the cluster is behaving, and also to pass data back and forth to and from the cluster.
Let's see what this looks like in practice and see how to set up a
SparkContext:
We can see that we've started a shell session with Spark in the following
screenshot:
Spark is now available to us as a spark variable. Let's try a simple thing in
Spark. The first thing to do is to load a random file. In each Spark
installation, there is a README.md markdown file, so let's load it into our
memory as follows:
text_file = spark.read.text("README.md")
If we use spark.read.text and then put in README.md, we get a few warnings, but
we shouldn't be too concerned about that at the moment, as we will see later
how we are going to fix these things. The main thing here is that we can use
Python syntax to access Spark.
What we have done here is read README.md into Spark as text data, and we can use text_file.count() to get Spark to count how many lines are in our text file, as follows:
text_file.count()
We can also see what the first line is with the following:
text_file.first()
We can now count the number of lines that contain the word Spark by doing the
following:
lines_with_spark = text_file.filter(text_file.value.contains("Spark"))
Here, we have filtered for lines using the filter() function, and within the filter() function, we have specified that text_file.value.contains should include the word "Spark", and we have put those results into the lines_with_spark variable.
We can modify the preceding command and simply add .count(), as follows:
text_file.filter(text_file.value.contains("Spark")).count()
We can see that 20 lines in the text file contain the word Spark. This is just a
simple example of how we can use the Spark shell.
SparkConf
SparkConf allows us to configure a Spark application. It sets various Spark parameters as key-value pairs, so you will usually create a SparkConf object with the SparkConf() constructor, which then loads values from the spark.* Java system properties.
There are a few useful functions; for example, we can use the set() function to set a configuration property. We can use the setMaster() function to set the master URL to connect to. We can use the setAppName() function to set the application name, and setSparkHome() in order to set the path where Spark is installed on the worker nodes.
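As a minimal sketch of how these setters might be chained together in a standalone script (local[2] and the MyApp name are just placeholder values):

from pyspark import SparkConf, SparkContext

# Build a configuration and hand it to a new SparkContext
conf = SparkConf().setMaster("local[2]").setAppName("MyApp")
sc = SparkContext(conf=conf)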
You can learn more about SparkConf at https://spark.apache.org/docs/0.9.0/api/pyspark/pyspark.conf.SparkConf-class.html.
Summary
In this chapter, we learned about the core concepts in Spark and PySpark.
We learned about setting up Spark and using PySpark on Windows. We also
went through the three main pillars of Spark, which are SparkContext,
Spark shell, and SparkConf.
In the next chapter, we're going to look at getting your big data into the Spark environment using RDDs.
Getting Your Big Data into the
Spark Environment Using RDDs
Primarily, this chapter will provide a brief overview of how to get your big
data into the Spark environment using resilient distributed datasets
(RDDs). We will be using a wide array of tools to interact with and modify
this data so that useful insights can be extracted. We will first load the data
on Spark RDDs and then carry out parallelization with Spark RDDs.
Let's start with an overview of the UCI machine learning data repository.
The UCI machine learning
repository
We can access the UCI machine learning repository by navigating to https://archive.ics.uci.edu/ml/. So, what is the UCI machine learning repository?
UCI stands for the University of California Irvine machine learning
repository, and it is a very useful resource for getting open source and free
datasets for machine learning. Although PySpark's main focus isn't machine learning, we can use this as a chance to get big datasets that help us test out the functions of PySpark.
Let's take a look at the KDD Cup 1999 dataset, which we will download,
and then we will load the whole dataset into PySpark.
Getting the data from the
repository to Spark
We can follow these steps to download the dataset and load it in PySpark:
You can see that there's kddcup.data.gz, and there is also 10% of that data available in kddcup.data_10_percent.gz. We will be working with the full dataset. To do so, right-click on kddcup.data.gz, select Copy link address, and then go back to the PySpark console and import the data.
Let's take a look at how this works using the following steps:
2. The next thing to do is use the urllib.request library to pull some resources from the internet, as shown in the following code:
f = urllib.request.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/kddcup.data.gz", "kddcup.data.gz")
This command will take some time to process. Once the file has been
downloaded, we can see that Python has returned and the console is
active.
3. In the following command, we can see that the raw data is now in the raw_data variable:
raw_data
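The step that loads the downloaded file into raw_data isn't shown above; a minimal sketch, assuming the SparkContext is available as sc and the file was saved to the working directory:

raw_data = sc.textFile("./kddcup.data.gz")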
Now that we know how to load the data into Spark, let's learn
about parallelization with Spark RDDs.
Parallelization with Spark RDDs
Now that we know how to create RDDs within the text file that we received
from the internet, we can look at a different way to create this RDD. Let's
discuss parallelization with our Spark RDDs.
What is parallelization?
How do we parallelize Spark RDDs?
On the other hand, if we look at the definition of parallelize, we can see that
this is creating an RDD by distributing a local Scala collection.
So, the main difference between using parallelize to create an RDD and
using the textFile to create an RDD is where the data is sourced from.
Let's look at how this works practically. Let's go to the PySpark installation
screen, from where we left off previously. So, we imported urllib, we used
urllib.request to retrieve some data from the internet, and we
used SparkContext and textFile to load this data into Spark. The other way to
do this is to use parallelize.
Let's look at how we can do this. Let's first assume that our data is already
in Python, and so, for demonstration purposes, we are going to create a
Python list of a hundred numbers as follows:
a = range(100)
a
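The parallelization step that turns this list into an RDD isn't shown above; a minimal sketch, assuming the SparkContext is available as sc:

list_rdd = sc.parallelize(a)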
Now, let's look at what we can do with this list. The first thing we can do is
count how many elements are present in list_rdd by using the following
command:
list_rdd.count()
We can see that list_rdd is counted at 100. If we run it again, we can see that, because Spark works through the parallelized version of the list in real time, the count is slower than simply taking the length of a in Python, which is instant.
However, RDD takes some time, because it needs time to go through the
parallelized version of the list. So, at small scales, where there are only a
hundred numbers, it might not be very helpful to have this trade-off, but
with larger amounts of data and larger individual sizes of the elements of
the data, it will make a lot more sense.
We can also take an arbitrary amount of elements from the list, as follows:
list_rdd.take(10)
When we run the preceding command, we can see that PySpark has
performed some calculations before returning the first ten elements of the
list. Notice that all of this is now backed by PySpark, and we are using
Spark's power to manipulate this list of 100 items.
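The reduce call discussed next would look something like this minimal sketch:

list_rdd.reduce(lambda a, b: a + b)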
Here, lambda takes two parameters, a and b. It simply adds these two numbers
together, hence a+b, and returns the output. With the RDD reduce call, we can sequentially add the first two numbers of the RDD together, return the result, and then add the third number to that result, and so on. Eventually, we add all 100 numbers to the same result by using reduce.
Now, after some work through the distributed database, we can now see that
adding numbers from 0 to 99 gives us 4950, and it is all done using PySpark's
RDD methodology. You might recognize this function from the term
MapReduce, and, indeed, it's the same thing.
The reason why this is very important is that the documentation is the golden
source of how a function is defined and what it is designed to be used as. By
reading the documentation, we make sure that we are as close to the source
as possible in our understanding. The link to the relevant documentation is https://spark.apache.org/docs/latest/rdd-programming-guide.html.
So, let's start with the map function. The map function returns an RDD by
applying the f function to each element of this RDD. In other words, it
works the same as the map function we see in Python. On the other hand,
the filter function returns a new RDD containing only the elements that
satisfy a predicate, and that predicate, which is a Boolean, is often returned
by an f function fed into the filter function. Again, this works very similarly
to the filter function in Python. Lastly, the collect() function returns a list
that contains all the elements in this RDD. And this is where I think reading
the documentation really shines, when we see notes like this. This would
never come up in Stack Overflow or a blog post if you were simply googling
what this is.
So, we're saying that collect() should only be used if the resulting array is
expected to be small, as all the data is loaded in a driver's memory. What that
means is, if we think back on Chapter 01, Installing PySpark and Setting Up
Your Development Environment, Spark is superb because it can collect and
parallelize data across many different unique machines, and have it
transparently operable from one Terminal. What the note on collect() is saying is that, if we call collect(), the resulting RDD would be completely loaded into the driver's memory, in which case we lose the benefits of distributing the data around a cluster of Spark instances.
Now that we know all of this, let's see how we actually apply these three
functions to our data. So, go back to the PySpark Terminal; we have already
loaded our raw data as a text file, as we have seen in previous chapters.
We will write a filter function to find all the lines in the RDD data that contain the word normal, as seen in the following code:
contains_normal = raw_data.filter(lambda line: "normal." in line)
Let's analyze what this means. Firstly, we are calling the filter function for
the RDD raw data, and we're feeding it an anonymous lambda function that
takes one line parameter and returns the predicates, as we have read in the
documentation, on whether or not the word normal exists in the line. At this
moment, as we have discussed in the previous chapters, we haven't actually
computed this filter operation. What we need to do is call a function that
actually consolidates the data and forces Spark to calculate something. In this case, we can call count() on contains_normal, as shown below.
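A minimal sketch of that call:

contains_normal.count()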
You can see that it has counted just over 970,000 lines in the raw data that contain the word normal. To use the filter function, we provide it with the lambda function and then use a consolidating function, such as count(), that forces Spark to calculate and compute the underlying data.
For the second example, we will use map. Since we downloaded the KDD cup data, we know that it is a comma-separated value file, so one of the very easy things for us to do is to split each line by commas, as follows:
split_file = raw_data.map(lambda line: line.split(","))
Let's analyze what is happening. We call the map function on raw_data. We feed it an anonymous lambda function that takes a line parameter and splits the line by commas. The result is split_file. Now, here the power of Spark really comes into play. Recall that, with the contains_normal filter, when we called a function that forced Spark to calculate count(), it took us a few minutes to come up with the correct results. If we perform the map function, it is going to have the same effect, because there are going to be millions of lines of data that we need to map through. So, one of the ways to quickly preview whether our mapping function runs correctly is to materialize a few lines instead of the whole file.
To do this, we can use the take function that we have used before, as
demonstrated in the following screenshot:
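A minimal sketch of the call, taking the first five split rows:

split_file.take(5)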
This should only take a few seconds, because we are only taking five lines of the split file, which is quite manageable. If we look at this sample output, we can see that our map function has worked correctly. The last thing we can do is to call collect() on the raw data as
follows:
raw_data.collect()
This is designed to move all of the raw data from Spark's RDD data structure
into the memory.
Summary
In this chapter, we learned how to load data on Spark RDDs and also
covered parallelization with Spark RDDs. We had a brief overview of the
UCI machine learning repository before loading the data. We had an
overview of the basic RDD operations, and also checked the functions from
the official documentation.
In the next chapter, we will cover big data cleaning and data wrangling.
Big Data Cleaning and Wrangling
with Spark Notebooks
In this chapter, we will learn about big data cleaning and wrangling with
Spark Notebooks. We will also look at how using Spark on a Notebook
application allows us to use RDDs effectively. We will use Spark
Notebooks for quick iteration of ideas and carry out sampling/filtering
RDDs to pick out relevant data points. We will also learn how to
split datasets and create new combinations with set operations.
So, we can use Spark Notebooks as it is, and all we need to do is go to the
Spark Notebook website and click on Quick Start to get the Notebook
started, as shown in the following screenshot:
We need to make sure that we are running Java 7. We can see that the setup
steps are also mentioned in the documentation, as shown in the following
screenshot:
The main website for Spark Notebook is spark-notebook.io, where we can see
many options. A few of them have been shown in the following screenshot:
We can download the TAR file and unzip it. You can use Spark Notebook,
but we will be using Jupyter Notebook in this book. So, going back to the
Jupyter environment, we can look at the PySpark-accompanying code files.
In the Chapter 3 notebook, we have included a convenient way for us to set up the environment variables to get PySpark working with Jupyter, as shown in the following screenshot:
First, we need to create two new environment variables. If you are using Linux, you can use your .bashrc file. If you are using Windows, all you need to do is change and edit your system environment variables.
There are multiple tutorials online to help you do this. What we want to do
here is to edit or include the PYSPARK_DRIVER_PYTHON variable and point it to your
Jupyter Notebook installation. If you are on Anaconda, you probably would
be pointed to the Anaconda Jupyter Bash file. Since we are on WinPython, I
have pointed it to my WinPython Jupyter Notebook Bash file. The second
environment variable we want to export is simply PYSPARK_DRIVER_PYTHON_OPTS.
One of the suggestions is that we include the Notebook folder and the
Notebook app in the options, ask it not to open in the browser, and tell it
what port to bind to. In practice, if you are on Windows and WinPython
environments then you don't really need this line here, and you can simply
skip it. After this has been done, simply restart your PySpark from a
command line. What will happen is that, instead of having the console that
we have seen before, it directly launches into a Jupyter Notebook instance,
and, furthermore, we can use Spark and SparkContext variables as in Jupyter
Notebook. So, let's test it out as follows:
sc
We instantly get access to our SparkContext that tells us that Spark is version
2.3.3, our Master is at local, and the AppName is the Python SparkShell
(PySparkShell), as shown in the following code snippet:
SparkContext
Spark UI
Version
v2.3.3
Master
local[*]
AppName
PySparkShell
Let's now check how sampling not only speeds up our calculations, but also
gives us a good approximation of the statistic that we are trying to calculate.
To do this, we first import the time library as follows:
from time import time
The next thing we want to do is load the KDD cup data, in which we will look at the lines or data points that contain the word normal:
raw_data = sc.textFile("./kdd.data.gz")
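The sampling and filtering step that produces contains_normal_sample isn't shown above; a minimal sketch, assuming a 10% sample taken without replacement with a seed of 42, reusing the filter from the previous chapter:

sampled = raw_data.sample(False, 0.1, 42)
contains_normal_sample = sampled.filter(lambda line: "normal." in line)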
Next, we need to time how long it would take for us to count the number of
rows in the sample:
t0 = time()
num_sampled = contains_normal_sample.count()
duration = time() - t0
We issue the count statement here. As you know from the previous section,
this is going to trigger all the calculations in PySpark as defined in
contains_normal_sample, and we're recording the time before the sample count
happens. We are also recording the time after the sample count happens, so
we can see how long it takes when we're looking at a sample. Once this is
done, let's take a look at how long the duration was in the following code
snippet:
duration
It took us 23 seconds to run this operation over 10% of the data. Now, let's
look at what happens if we run the same transform over all of the data:
contains_normal = raw_data.map(lambda x: x.split(",")).filter(lambda x: "normal" in x)
t0 = time()
num_sampled = contains_normal.count()
duration = time() - t0
The last thing to look at is how we can use takeSample. All we need to do is
use the following code:
data_in_memory = raw_data.takeSample(False, 10, 42)
As we learned earlier, takeSample gives us 10 items with a random seed of 42, which we will now put into memory. Now that this data is in memory, we can call the
same map and filter functions using native Python methods as follows:
contains_normal_py = [line.split(",") for line in data_in_memory if "normal" in line]
len(contains_normal_py)
We can see that the calculation is completed in the previous code block, and
it took longer and used more memory than if we were doing it in PySpark.
And that's why we use Spark: it allows us to parallelize any big dataset and operate on it in a parallel fashion, which means that we can do more with less memory and less time. In the next section, we're
going to talk about splitting datasets and creating new combinations with
set operations.
Splitting datasets and creating
some new combinations
In this section, we are going to look at splitting datasets and creating new
combinations with set operations. We're going to look at the subtract and cartesian operations in particular.
Let's go back to the Chapter 3 Jupyter Notebook, where we've been looking at lines in the dataset that contain the word normal. Let's try to get all the lines
that don't contain the word normal. One way is to use the filter function to
look at lines that don't have normal in it. But, we can use something different
in PySpark: a function called subtract to take the entire dataset and subtract
the data that contains the word normal. Let's have a look at the following
snippet:
normal_sample = sampled.filter(lambda line: "normal." in line)
We can then obtain interactions or data points that don't contain the word
normal by subtracting the normal ones from the entire sample as follows:
non_normal_sample = sampled.subtract(normal_sample)
We take the normal sample and we subtract it from the entire sample, which
is 10% of the entire dataset. Let's issue some counts as follows:
sampled.count()
As you can see, 10% of the dataset gives us 490705 data points, and within it,
we have a number of data points containing the word normal. To find out its
count, write the following code:
normal_sample.count()
This will give us the following output:
97404
So, here we have 97404 data points. If we count the non-normal samples, because we're simply subtracting one sample from another, the count should be roughly just below 400,000 data points: we have 490,000 data points minus 97,000 data points, which should result in something like 390,000. Let's see what happens using the following code snippet:
non_normal_sample.count()
Let's now discuss the other function, called cartesian. This allows us to give
all the combinations between the distinct values of two different features.
Let's see how this works in the following code snippet:
feature_1 = sampled.map(lambda line: line.split(",")).map(lambda features: features[1]).distinct()
Here, we're splitting each line by commas. For all the features that we come up with after splitting, we take feature 1 and we find all the distinct values of that column. We can repeat this for feature 2 as follows:
feature_2 = sampled.map(lambda line: line.split(",")).map(lambda features: features[2]).distinct()
And so, we now have two features. We can look at the actual items in feature_1 and feature_2 as follows, by issuing the collect() call that we saw earlier:
f1 = feature_1.collect()
f2 = feature_2.collect()
Let's look at each one as follows:
f1
f2 has a lot more values, and we can use the cartesian function to collect all the combinations between f1 and f2 as follows:
len(feature_1.cartesian(feature_2).collect())
This is how we use the cartesian function to find the Cartesian product
between two features. In this chapter, we looked at Spark Notebooks;
sampling, filtering, and splitting datasets; and creating new combinations
with set operations.
Summary
In this chapter, we looked at Spark Notebooks for quick iterations. We then
used sampling or filtering to pick out relevant data points. We also learned
how to split datasets and create new combinations with set operations.
In the next chapter, we will cover aggregating and summarizing data into
useful reports.
Aggregating and Summarizing
Data into Useful Reports
In this chapter, we will learn how to aggregate and summarize data into
useful reports. We will learn how to calculate averages with map and reduce
functions, perform faster average computation, and use pivot tables with
key-value pair data points.
The map function takes two arguments, one of which is optional. The first
argument to map is f, which is a function that gets applied to the RDD
throughout by the map function. The second argument, or parameter, is the
preservesPartitioning parameter, which is False by default.
If we look at the documentation, it says that map simply returns a new RDD
by applying a function to each element of this RDD, and obviously, this
function refers to f that we feed into the map function itself. There's a very
simple example in the documentation, where it says that if we parallelize an rdd that contains a list of three characters, b, a, and c, and we map a function that creates a tuple of each element, then we'll create a list of three tuples, where the original character is placed in the first element of the tuple, and the integer 1 is placed in the second, as follows:
rdd = sc.parallelize(["b", "a", "c"])
sorted(rdd.map(lambda x: (x, 1)).collect())
Let's take an example using the KDD data we have been using. We launch
our Jupyter Notebook instance that links to a Spark instance, as we have
done previously. We then create a raw_data variable by loading a
kddcup.data.gz text file from the local disk as follows:
raw_data = sc.textFile("./kddcup.data.gz")
The next thing to do is to split this file into csv, and then we will filter for
rows where feature 41 includes the word normal:
csv = raw_data.map(lambda x: x.split(","))
normal_data = csv.filter(lambda x: x[41]=="normal.")
Then we use the map function to convert the duration field of this data into an integer, and, finally, we can use the reduce function to compute the total_duration, which we can then print as follows:
duration = normal_data.map(lambda x: int(x[0]))
total_duration = duration.reduce(lambda x, y: x+y)
total_duration
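Turning the total into an average isn't shown here; a minimal sketch using a second pass for the count:

count = duration.count()
average_duration = total_duration / count
average_duration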
After a little computation, we have both the total and the count, from which the average follows. We have just learned how we can calculate averages with PySpark, and what the map and reduce functions are in PySpark.
Faster average computations with
aggregate
In the previous section, we saw how we can use map and reduce to calculate
averages. Let's now look at faster average computations with the aggregate
function. You can refer to the documentation mentioned in the previous
section.
aggregate is a function that takes three arguments, none of which are optional.
The first one is the zeroValue argument, where we put in the base case of the
aggregated results.
The second argument is the sequential operator (seqOp), which allows you to
stack and aggregate values on top of zeroValue. You can start with zeroValue,
and the seqOp function that you feed into aggregate takes values from your
RDD, and stacks or aggregates it on top of zeroValue.
The last argument is combOp, which stands for combination operation. Here, we take the partial results that have been aggregated through the seqOp argument on different partitions, and combine them into one value so that we can finish the aggregation.
So, here we are aggregating the elements of each partition and then the
results for all the partitions using a combined function and a neutral zero
value. Here, we have two things to note:
In this case, we need one operation for merging a T into a U, and one operation for merging two Us.
Let's go to our Jupyter Notebook to check how this is done. aggregate allows us to calculate both the total duration and the count at the same time. We create a duration_count variable by taking the durations from normal_data and aggregating them. Remember that there are three arguments to aggregate. The first one is the initial value; that is, the zero value, (0,0). The second one is a sequential operation and the third is a combination operation, as follows:
duration_count = duration.aggregate(
(0,0),
(lambda db, new_value: (db[0] + new_value, db[1] + 1)),
(lambda db1, db2: (db1[0] + db2[0], db1[1] + db2[1]))
)
We need to specify a lambda function with two arguments. The first argument
is the current accumulator, or the aggregator, or what can also be called a
database (db). Then, we have the second argument in our lambda function
as new_value, or the current value we're processing in the RDD. We simply
want to do the right thing to the database, so to speak. We know that our database looks like a tuple, where the sum of the durations is the first element and the count is the second element. Whenever we look at a new value, we need to add the new value to the current running total and add 1 to the current running count.
The running total is the first element, db[0]. And we then simply need to add
1 to the second element db[1], which is the count. That's the sequential
operation.
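The combination operation then merges the per-partition (sum, count) tuples, and the average can be recovered from the final tuple; a minimal sketch:

total_duration, count = duration_count
total_duration / count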
You can see that it returns the same results as we saw in the previous
section, which is great. In the next section, we are going to look at pivot
tabling with key-value paired data points.
Pivot tabling with key-value paired
data points
Pivot tables are very simple and easy to use. What we are going to do is use
big datasets, such as the KDD cup dataset, and group certain values by
certain keys.
For example, we have a dataset of people and their favorite fruits. We want
to know how many people have apple as their favorite fruit, so we will
group the number of people, which is the value, against a key, which is the
fruit. This is the simple concept of a pivot table.
We can use the map function to move the KDD dataset into a key-value pair paradigm. We map the dataset using a lambda function, where feature 41 of each data point becomes the key and the data point itself becomes the value, as follows:
kv = csv.map(lambda x: (x[41], x))
kv.take(1)
We use feature 41 as the key, and the value is the data point, which is x. We
can use the take function to take one of these transformed rows to see how it
looks.
Let's now try something similar to the previous example. To figure out the total duration against each type of value that is present in feature 41, we can use the map function again and simply take feature 41 as our key. We can take the float of the first number in the data point as our value. We will then use the reduceByKey function to reduce the durations by their key.
So, instead of just reducing all of the data points regardless of which key
they belong to, reduceByKey reduces duration numbers depending on which
key it is associated with. You can view the documentation at https://spark.apa
che.org/docs/latest/api/python/pyspark.html?highlight=map#pyspark.RDD.reduceByKey.
reduceByKey merges the values for each key using an associative and
commutative reduce function. It performs local merging on each mapper
before sending the results to the reducer, which is similar to a combiner in
MapReduce.
The reduceByKey function simply takes one argument. We will be using the
lambda function. We take two different durations and add them together, and
PySpark is smart enough to apply this reduction function depending on a
key, as follows:
kv_duration = csv.map(lambda x: (x[41], float(x[0]))).reduceByKey(lambda x, y: x+y)
kv_duration.collect()
If we collect the key-value duration data, we can see that the duration is
collected by the value that appears in feature 41. Similar to pivot tables in Excel, there is also a convenience function, countByKey, which groups and counts data points by key, as demonstrated here:
kv.countByKey()
You can see that calling the kv.countByKey() function gives us a count for each key, much like calling reduceByKey preceded by a mapping from each key to a value of 1.
Summary
In this chapter, we have learned how to calculate averages with map and
reduce. We also learned faster average computations with aggregate. Finally,
we learned that pivot tables allow us to aggregate data based on different values of features, and that, with pivot tables in PySpark, we can leverage handy functions, such as reduceByKey or countByKey.
In the next chapter, we will learn about MLlib, which involves machine
learning, which is a very hot topic.
Powerful Exploratory Data
Analysis with MLlib
In this chapter, we will explore Spark's capability to perform regression
tasks with models such as linear regression and support-vector machines
(SVMs). We will learn how to compute summary statistics with MLlib, and
discover correlations in datasets using Pearson and Spearman correlations.
We will also test our hypothesis on large datasets.
MLlib is the machine learning library that comes with Spark. There has
been a recent new development that allows us to use Spark's data-
processing capabilities to pipe into machine learning capabilities native to
Spark. This means that we can use Spark not only to ingest, collect, and
transform data, but we can also analyze and use it to build machine learning
models on the PySpark platform, which allows us to have a more seamless
deployable solution.
Summary statistics are a very simple concept. We are familiar with average,
or standard deviation, or the variance of a particular variable. These are
summary statistics of a dataset. The reason why it's called a summary
statistic is that it gives you a summary of something via a certain statistic.
For example, when we talk about the average of a dataset, we're
summarizing one characteristic of that dataset, and that characteristic is the
average.
Let's check how to compute summary statistics in Spark. The key factor
here is the colStats function. The colStats function computes the column-
wise summary statistics for an rdd input. The colStats function accepts one
parameter, that is rdd, and it allows us to compute different summary
statistics using Spark.
Let's take the first feature, x[0], of the data file; this feature represents the duration, that is, an aspect of the data. We will transform it into an integer here. Keeping the features in a list helps us do summary statistics over multiple variables, and not just one of them. To activate the colStats function, we need to import the Statistics package, as shown in the following snippet:
from pyspark.mllib.stat import Statistics
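The code that builds the feature RDD and calls colStats isn't reproduced here; a minimal sketch, assuming the csv RDD defined in other chapters, with each duration wrapped in a one-element list so that colStats sees a single column (the features name is just a placeholder):

features = csv.map(lambda x: [int(x[0])])
summary = Statistics.colStats(features)
summary.mean()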
If we don't index the summary statistics with [0], we can see that summary.max() and summary.min() give us back an array, of which the first element is the summary statistic that we desire, as shown in the following code snippet:
summary.max()
array ([58329.]) #output
summary.min()
array([0.]) #output
Using Pearson and Spearman
correlations to discover
correlations
In this section, we will look at two different ways of computing correlations
in your datasets, and these two methods are called Pearson and Spearman
correlations.
The Pearson correlation
The Pearson correlation coefficient shows us how two different variables
vary at the same time, and then adjusts it for how much they vary. This is
probably one of the most popular ways to compute a correlation if you have
a dataset.
The Spearman correlation
Spearman's rank correlation is not the default correlation calculation that is
built into PySpark, but it is very useful. The Spearman correlation
coefficient is the Pearson correlation coefficient between the ranked
variables. Using different ways of looking at correlation gives us more
dimensions of understanding on how correlation works. Let's look at how
we calculate this in PySpark.
Computing Pearson and Spearman
correlations
To understand this, let's assume that we are taking the first three numeric
variables from our dataset. For this, we want to access the csv variable that
we defined previously, where we simply split raw_data using a comma (,).
We will consider only the first three columns that are numeric. We will not
take anything that contains words; we're only interested in features that are
purely based on numbers. In our case, in kddcup.data, the first feature is indexed at 0, while feature 5 and feature 6 are indexed at 4 and 5, respectively; these are the numeric variables that we have. We use a lambda function to take all three of these into a list and put it into the metrics variable:
metrics = csv.map(lambda x: [x[0], x[4], x[5]])
Statistics.corr(metrics, method="spearman")
If we run corr on metrics again, but specify that the method is pearson, then it gives us the Pearson correlations. So, you might ask why we need to be qualified data scientists or machine learning researchers just to call these two simple functions and change the value of the second parameter. The answer is that a lot of machine learning and data science revolves around our understanding of statistics, an understanding of how data behaves, an understanding of how machine learning models are grounded, and what gives them their predictive power.
For example, if we had a retail store without any footfall, and suddenly we get footfall, how likely is it that this is random, or is there even any statistically significant difference in the level of visitors that we are getting now in comparison to before? The reason why this is called the chi-square test is that the test itself references the chi-square distribution. You can refer to online documentation to understand more about chi-square distributions.
There are three variations within Pearson's chi-square test. We will check
whether the observed datasets are distributed differently than in a theoretical
dataset.
Let's see how we can implement this. Let's start by importing the Vectors
package from pyspark.mllib.linalg. Using this vector, we're going to create a
dense vector of the visitor frequencies by day in our store.
Let's imagine that the frequencies go from 0.13 an hour to 0.61, 0.8,
and 0.5, finally ending on Friday at 0.3. So, we are putting these visitor
frequencies into the visitors_freq variable. Since we're using PySpark, it is
very simple for us to run a chi-square test from the Statistics package, which
we have already imported as follows:
from pyspark.mllib.linalg import Vectors
visitors_freq = Vectors.dense(0.13, 0.61, 0.8, 0.5, 0.3)
print(Statistics.chiSqTest(visitors_freq))
By running the chi-square test, the visitors_freq variable gives us a bunch of
useful information, as demonstrated in the following screenshot:
The preceding output shows the chi-square test summary. We've used the
pearson method, where there are 4 degrees of freedom in our Pearson chi-
square test, and the statistics are 0.585, which means that the pValue is 0.964.
This results in no presumption against the null hypothesis. In this way, the
observed data follows the same distribution as expected, which means our
visitors are not actually different. This gives us a good understanding of
hypothesis testing.
Summary
In this chapter, we learned summary statistics and computing the summary
statistics with MLlib. We also learned about Pearson and Spearman
correlations, and how we can discover these correlations in our datasets
using PySpark. Finally, we learned one particular way of performing
hypothesis testing, which is called the Pearson chi-square test. We then used
PySpark's hypothesis-testing functions to test our hypotheses on large
datasets.
In the next chapter, we're going to look at putting the structure on our big
data with Spark SQL.
Putting Structure on Your Big Data
with SparkSQL
In this chapter, we'll learn how to manipulate DataFrames with Spark SQL
schemas, and use the Spark DSL to build queries for structured data
operations. By now, we have already learned how to get big data into the Spark environment using RDDs, and have carried out multiple operations on that big data. Let us now look at how to manipulate our DataFrames and build queries for structured data operations.
The Spark SQL interface is very simple. For this reason, taking away labels
means that we are in unsupervised learning territory. Also, Spark has great
support for clustering and dimensionality reduction algorithms. We can
tackle learning problems effectively by using Spark SQL to give big data a
structure.
Let's take a look at the code that we will be using in our Jupyter
Notebook. To maintain consistency, we will be using the same KDD cup
data:
2. What's new here is that we are importing two new packages from
pyspark.sql:
Row
SQLContext
3. The following code shows us how to import these packages:
from pyspark.sql import Row, SQLContext
sql_context = SQLContext(sc)
csv = raw_data.map(lambda l: l.split(","))
4. We'll leverage our new important Row objects to create a new object that
has defined labels. This is to label our datasets by what feature we are
looking at, as follows:
rows = csv.map(lambda p: Row(duration=int(p[0]), protocol=p[1], service=p[2]))
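The intermediate steps that turn these rows into a DataFrame and register it as a temporary table named rdd aren't shown here; a minimal sketch using the sql_context created above (registerTempTable is the older SQLContext-era API that matches the query in the next step):

df = sql_context.createDataFrame(rows)
df.registerTempTable("rdd")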
7. Now, when we call the show function, it gives us every single data point
that matches these criteria:
sql_context.sql("""SELECT duration FROM rdd WHERE protocol = 'tcp' AND
duration > 2000""").show()
8. We will then get the following output:
+--------+
|duration|
+--------+
| 12454|
| 10774|
| 13368|
| 10350|
| 10409|
| 14918|
| 10039|
| 15127|
| 25602|
| 13120|
| 2399|
| 6155|
| 11155|
| 12169|
| 15239|
| 10901|
| 15182|
| 9494|
| 7895|
| 11084|
+--------+
only showing top 20 rows
Using the preceding example, we can infer that we can use the
SQLContext variable from the PySpark package to package our data in a SQL
friendly format.
Therefore, not only does PySpark support using SQL syntax to query the
data, but it can also use the Spark domain-specific language (DSL) to
build queries for structured data operations.
Using Spark DSL to build queries
In this section, we will use Spark DSL to build queries for structured data
operations:
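A minimal sketch of the DSL equivalent of the earlier SQL query, assuming a df DataFrame built from the rows RDD as sketched previously:

df.filter(df.protocol == "tcp").filter(df.duration > 2000).select("duration").show()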
Here, we have the top 20 rows of the data points again that fit the
description of the code used to get this result.
Summary
In this chapter, we covered Spark DSL and learned how to build queries.
We also learned how to manipulate DataFrames with Spark SQL schemas,
and then we used Spark DSL to build queries for structured data operations.
Now that we have a good knowledge of Spark, let's look at a few tips,
tricks, and techniques in Apache Spark in the following chapters.
First, we need to initialize Spark. Every test we carry out will be the same.
We need to initialize it before we start using it, as shown in the following
example:
class DeferComputations extends FunSuite {
val spark: SparkContext =
SparkSession.builder().master("local[2]").getOrCreate().sparkContext
Then, we will have the actual test. Here, the test is called should defer computations. It is simple, but shows a very powerful abstraction of Spark. We start by creating an rdd of InputRecord, as shown in the following example:
test("should defer computations") {
//given
val input = spark.makeRDD(
List(InputRecord(userId = "A"),
InputRecord(userId = "B")))
InputRecord has an ID that defaults to a random uuid if we do not supply one, and a required argument, userId. The InputRecord class will be used throughout this book for testing purposes. We have created two records of InputRecord that we will apply a transformation on, as shown in the following example:
//when apply transformation
val rdd = input
.filter(_.userId.contains("A"))
.keyBy(_.userId)
.map(_._2.userId.toLowerCase)
//.... built processing graph lazy
We will only filter records that have A in the userId field. We will then transform the RDD with keyBy(_.userId), extract the userId from the value, and map it to lowercase. This is our rdd. So far, we have only created the DAG; we have not executed it yet. Let's assume that we have a complex program and we are creating a lot of those acyclic graphs before the actual logic.
The advantage of Spark is that this is not executed until an action is issued, so we can have some conditional logic. For example, we can get a fast-path execution. Let's assume that we have shouldExecutePartOfCode(), which can check a configuration switch, or go to a REST service to calculate whether the rdd calculation is still relevant, as shown in the following example:
if (shouldExecutePartOfCode()) {
//rdd.saveAsTextFile("") ||
rdd.collect().toList
} else {
//condition changed - don't need to evaluate DAG
}
}
We have used simple methods for testing purposes that we are just returning
true for, but, in real life, this could be complex logic:
After it returns true, we can decide if we want to execute the DAG or not. If
we want to, we can call rdd.collect().toList or saveAsTextFile to execute the
rdd. Otherwise, we can have a fast path and decide that we are no longer
interested in the input rdd. By doing this, only the graph will be created.
When we start the test, it will take some time to complete and return the
following output:
"C:\Program Files\Java\jdk-12\bin\java.exe" "-javaagent:C:\Program
Files\JetBrains\IntelliJ IDEA 2018.3.5\lib\idea_rt.jar=50627:C:\Program
Files\JetBrains\IntelliJ IDEA 2018.3.5\bin" -Dfile.encoding=UTF-8 -classpath
C:\Users\Sneha\IdeaProjects\Chapter07\out\production\Chapter07 com.company.Main
Process finished with exit code 0
We can see that our test passed and we can conclude that it worked as
expected. Now, let's look at some transformations that should be avoided.
Avoiding transformations
In this section, we will look at the transformations that should be avoided.
Here, we will focus on one particular transformation.
We have created four transactions for userId = "A", and one for userId = "B".
Now, let's consider that we want to coalesce transactions for a specific userId
to have the list of transactions. We have an input that we are grouping by
userId, as shown in the following example:
For every x element, we will create a tuple. The first element of a tuple is an
ID, while the second element is an iterator of every transaction for that
specific ID. We will transform it into a list using toList. Then, we will collect everything to have our result. Let's assert the
result. rdd should contain the same element as B, that is, the key and one
transaction, and A, which has four transactions, as shown in the following
code:
//then
rdd should contain theSameElementsAs List(
("B", List(UserTransaction("B", 13))),
("A", List(
UserTransaction("A", 1001),
UserTransaction("A", 100),
UserTransaction("A", 102),
UserTransaction("A", 1))
)
)
}
}
Let's start this test and check if this behaves as expected. We get the
following output:
"C:\Program Files\Java\jdk-12\bin\java.exe" "-javaagent:C:\Program
Files\JetBrains\IntelliJ IDEA 2018.3.5\lib\idea_rt.jar=50822:C:\Program
Files\JetBrains\IntelliJ IDEA 2018.3.5\bin" -Dfile.encoding=UTF-8 -classpath
C:\Users\Sneha\IdeaProjects\Chapter07\out\production\Chapter07 com.company.Main
At first glance, it has passed and it works as expected. But the question
arises as to why we want to group it. We want to group it to save it to the
filesystem or do some further operations, such as concatenating all the
amounts.
We can see that our input is not evenly distributed, since almost all the
transactions are for userId = "A". Because of that, we have a skewed key: one key
holds the majority of the data, while the other keys hold far less. When we use
groupBy in Spark, it takes all the elements with the same grouping key, which in
this example is userId, and sends their values to exactly the same executor.
We will first focus on the reduce API. First, we need to create an input of
UserTransaction. We have the user transaction A with amount 10, B with amount
1, and A with amount 101. Let's say that we want to find out the global
maximum. We are not interested in the data for the specific key, but in the
global data. We want to scan it, take the maximum, and return it, as shown
in the following example:
test("should use reduce API") {
//given
val input = spark.makeRDD(List(
UserTransaction("A", 10),
UserTransaction("B", 1),
UserTransaction("A", 101)
))
So, this is the reduce use case. Now, let's see how we can implement it, as
per the following example:
//when
val result = input
.map(_.amount)
.reduce((a, b) => if (a > b) a else b)
//then
assert(result == 101)
}
For the input, we need to first map the field that we're interested in. In this
case, we are interested in amount. We will take amount and then take the
maximum value.
In the preceding code example, reduce takes two parameters, a and b. The lambda
returns a if a is greater than b, and b otherwise, so at every step we keep the
larger of the two values. We go through all of the elements and, at the end, the
result is a single number.
So, let's test this and check whether the result is indeed 101, as shown in the
following code output. This means our test passed:
"C:\Program Files\Java\jdk-12\bin\java.exe" "-javaagent:C:\Program
Files\JetBrains\IntelliJ IDEA 2018.3.5\lib\idea_rt.jar=50894:C:\Program
Files\JetBrains\IntelliJ IDEA 2018.3.5\bin" -Dfile.encoding=UTF-8 -classpath
C:\Users\Sneha\IdeaProjects\Chapter07\out\production\Chapter07 com.company.Main
However, if we want the maximum per key rather than globally, reduce is not the
right choice, because it goes through all of the values and gives us only the
global maximum. Spark has key-wise operations for this: we use keyBy to tell
Spark which field should be treated as the key, and the reduce function is then
executed only within each key. So, we use keyBy(_.userId) and then reduceByKey.
The reduceByKey function is similar to reduce, but it works key-wise, so inside
the lambda we only ever see values for a single key, as shown in the following
example:
//when
val result = input
.keyBy(_.userId)
.reduceByKey((firstTransaction, secondTransaction) =>
TransactionChecker.higherTransactionAmount(firstTransaction,
secondTransaction))
.collect()
.toList
With this, we get the first transaction and then the second one. The first one is
the current maximum and the second one is the transaction that we are examining
right now. We create a helper function that takes those two transactions and call
it higherTransactionAmount.
When using Spark's reduceByKey method, we need to return the same type as the
input arguments. If firstTransaction.amount is higher than
secondTransaction.amount, we return firstTransaction; otherwise, we return
secondTransaction. Note that we return transaction objects, not just the amounts.
This is shown in the following example:
object TransactionChecker {
def higherTransactionAmount(firstTransaction: UserTransaction,
secondTransaction: UserTransaction): UserTransaction = {
if (firstTransaction.amount > secondTransaction.amount) firstTransaction
else secondTransaction
}
}
Now, we will collect the results and test them. After our test, for the key B, we
should get the transaction ("B", 1) and, for the key A, the transaction ("A",
101). There will be no ("A", 10), because the reduce step kept only the
transaction with the higher amount for that key, but we can see that, for every
key, we are able to find the maximum. This is shown in the following example:
//then
result should contain theSameElementsAs
List(("B", UserTransaction("B", 1)), ("A", UserTransaction("A", 101)))
}
}
We can see that the test passed and everything is as expected, as shown in
the following output:
"C:\Program Files\Java\jdk-12\bin\java.exe" "-javaagent:C:\Program
Files\JetBrains\IntelliJ IDEA 2018.3.5\lib\idea_rt.jar=50909:C:\Program
Files\JetBrains\IntelliJ IDEA 2018.3.5\bin" -Dfile.encoding=UTF-8 -classpath
C:\Users\Sneha\IdeaProjects\Chapter07\out\production\Chapter07 com.company.Main
In the next section, we will perform actions that trigger the computations of
our data.
Performing actions that trigger
computations
Spark has many more actions that trigger the DAG, and we should be aware of
them because they are very important. In this section, we'll understand what
counts as an action in Spark, walk through the available actions, and test
whether they behave as expected.
The first action we covered is collect. We also covered reduce and reduceByKey in
the previous section; reduce is an action because it returns a single result to
the driver, while reduceByKey is a transformation and still needs an action, such
as collect, to trigger the computation.
First, we will create the input of our transactions and then apply some
transformations just for testing purposes. We will keep only the users whose ID
contains A, key them using keyBy(_.userId), and then keep only the amount of each
transaction, as shown in the following example:
test("should trigger computations using actions") {
//given
val input = spark.makeRDD(
List(
UserTransaction(userId = "A", amount = 1001),
UserTransaction(userId = "A", amount = 100),
UserTransaction(userId = "A", amount = 102),
UserTransaction(userId = "A", amount = 1),
UserTransaction(userId = "B", amount = 13)))
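The transformation under test mirrors the chain we reuse later in the caching
section: filter for user A, key by userId, and keep only the amount:
//when apply transformation
val rdd = input
.filter(_.userId.contains("A"))
.keyBy(_.userId)
.map(_._2.amount)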
The first action that we are already aware of is rdd.collect().toList. The next
one is count(), which needs to take all of the values and calculate how many are
inside the rdd; there is no way to execute count() without triggering the
computation. Spark also has variants such as countApprox, countApproxDistinct,
countByValue, and countByValueApprox. The following example shows the code
for rdd.collect().toList:
//then
println(rdd.collect().toList)
println(rdd.count()) //and all count*
If we have a huge dataset and an approximate count is enough, you can use
countApprox, as it will be a lot faster. We then use rdd.first(), which is a bit
different because it only needs to take the first element; it still triggers
evaluation of the DAG, though, only as far as is needed to produce that element,
so we need to be aware of that and check it in the following way:
println(rdd.first())
Also, on the rdd, we have foreach(), which is like a loop to which we can pass any
function; a Scala or Java function (a lambda) is applied to every element, and
executing it requires the DAG to be computed, because it is an action. Another
variant is foreachPartition(), which hands us an iterator for every partition;
inside it, we iterate over the partition's elements and print them. We also have
the max() and min() methods and, as expected, max() takes the maximum value and
min() takes the minimum value. Both of these rely on an implicit ordering: if we
have an rdd of a simple primitive type, such as Long, we don't need to supply
one, but if we had not mapped down to amounts, we would need to define an
ordering for UserTransaction so that Spark could tell which element is the
maximum and which is the minimum. Both methods need to execute the DAG, so they
are classed as actions, as shown in the following example:
rdd.foreach(println(_))
rdd.foreachPartition(t => t.foreach(println(_)))
println(rdd.max())
println(rdd.min())
println(rdd.takeOrdered(1).toList)
println(rdd.takeSample(false, 2).toList)
Now, let's start the test and see the output of implementing the previous
actions, as shown in the following screenshot:
List(1001, 100, 102, 1)
4
1001
1001
100
102
1
The first action returns all the values. The second action returns 4 as the count.
We then get the first element, 1001; first() does not imply any ordering, it
simply returns the first element it finds. We then print all the elements in the
loop, as shown in the following output:
102
1
1001
1
List(1)
List(100, 1)
We then get the max and min values, 1001 and 1. After that, we get the ordered
list, List(1), from takeOrdered, and the sample, List(100, 1), from takeSample,
which picks random values from the input data after the transformations have been
applied.
In the next section, we will learn how to reuse the rdd for different actions.
Reusing the same rdd for different
actions
In this section, we will reuse the same rdd for different actions. First, we will
minimize the execution time by reusing the rdd. We will then look at
caching and a performance test for our code.
The following example is the test from the preceding section, slightly modified:
here, we record the start time with currentTimeMillis() and compute the elapsed
time at the end, so we are measuring how long all the executed actions take:
//then every call to action means that we are going up to the RDD chain
//if we are loading data from external file-system (I.E.: HDFS), every action
means
//that we need to load it from FS.
val start = System.currentTimeMillis()
println(rdd.collect().toList)
println(rdd.count())
println(rdd.first())
rdd.foreach(println(_))
rdd.foreachPartition(t => t.foreach(println(_)))
println(rdd.max())
println(rdd.min())
println(rdd.takeOrdered(1).toList)
println(rdd.takeSample(false, 2).toList)
val result = System.currentTimeMillis() - start
Someone who doesn't know Spark very well might assume that the results of all
these actions are cleverly reused. In fact, every action means that we go back up
the rdd chain, that is, through all the transformations, to load the data again.
In a production system, the data is loaded from an external system such as HDFS,
so every action causes a call to the filesystem, which retrieves all the data and
then applies the transformations, as shown in the following example:
//when apply transformation
val rdd = input
.filter(_.userId.contains("A"))
.keyBy(_.userId)
.map(_._2.amount)
Let's compare this with using a cache. Our test, at first glance, looks very
similar, but it is not the same, because we are issuing cache() and returning
that rdd. So, the rdd will already be cached and every subsequent call to it
will go through the cache.
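A sketch of the cached variant; the only change from the earlier transformation
chain is the trailing cache() call:
//when apply transformation
val rdd = input
.filter(_.userId.contains("A"))
.keyBy(_.userId)
.map(_._2.amount)
.cache()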
The first action will execute DAG, save the data into our cache, and then
the subsequent actions will just retrieve the specific things according to the
method that was called from memory. There will be no HDFS lookup, so
let's start this test, as per the following example, and see how long it takes:
//then every call to action means that we are going up to the RDD chain
//if we are loading data from external file-system (I.E.: HDFS), every action
means
//that we need to load it from FS.
val start = System.currentTimeMillis()
println(rdd.collect().toList)
println(rdd.count())
println(rdd.first())
rdd.foreach(println(_))
rdd.foreachPartition(t => t.foreach(println(_)))
println(rdd.max())
println(rdd.min())
println(rdd.takeOrdered(1).toList)
println(rdd.takeSample(false, 2).toList)
val result = System.currentTimeMillis() - start
}
}
The first output will be as follows:
List(1)
List(100, 102)
time taken (no-cache): 585
List(1001, 100, 102, 1)
4
Without cache, the value is 585 milliseconds and with cache, the value is 336.
The difference is not much as we are just creating data in tests. However, in
real production systems, this will be a big difference because we need to
look up data from external filesystems.
Summary
So, let's sum up this chapter. Firstly, we used Spark transformations to defer
computation to a later time, and then we learned which transformations
should be avoided. Next, we looked at how to use reduceByKey and reduce to
calculate our result globally and per specific key. After that, we performed
actions that trigger computations and learned that every action means another
call to load the data. To alleviate that problem, we learned how to reuse the
same rdd for different actions.
In the next chapter, we'll be looking at the immutable design of the Spark
engine.
Immutable Design
In this chapter, we will look at the immutable design of Apache Spark. We
will delve into the Spark RDD's parent/child chain and use RDD in an
immutable way. We will then use DataFrame operations for transformations
to discuss immutability in a highly concurrent environment. By the end of
this chapter, we will use the Dataset API in an immutable way.
Extending an RDD
Chaining a new RDD with the parent
Testing our custom RDD
Extending an RDD
This is a simple test that has a lot of hidden complexity. Let's start by
creating a list of the record, as shown in the following code block:
class InheritanceRdd extends FunSuite {
val spark: SparkContext = SparkSession
.builder().master("local[2]").getOrCreate().sparkContext
The Record is just a case class that has an amount and a description; here, the
amount is 1 and d1 is the description.
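A minimal sketch of that input, assuming Record(amount, description) is the case
class described above:
//given
val rdd = spark.makeRDD(List(Record(1, "d1")))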
We then create a MultipliedRDD and pass rdd to it, setting the multiplier to 10,
as shown in the following code:
val extendedRdd = new MultipliedRDD(rdd, 10)
We are passing the parent RDD because it has data that was loaded in
another RDD. In this way, we build the inheritance chain of two RDD's.
Chaining a new RDD with the
parent
We first created the MultipliedRDD class. It takes two constructor parameters:
the parent RDD of records and the multiplier.
RDD has a lot of methods and we can override any of them. This time, we override
the compute method, where we apply the multiplier. Here, we receive a Partition
split and a TaskContext; these are passed in by the Spark execution engine, so we
don't need to worry about them. However, we need to return an iterator of exactly
the same type as the type parameter of the RDD class in the inheritance chain;
this will be an iterator of Record.
We then delegate to the first parent, which is simply the first RDD in our chain.
Its type is Record, and we ask it for an iterator over the given split and
context, where the split is just the partition that is being executed. We know
that a Spark RDD is divided up by the partitioner, but here we simply receive the
specific partition that we need to process. The iterator takes the partition and
the task context, so it knows which values should be returned from that compute
method. For every record in that iterator (a sales record with an amount and a
description), we multiply the amount by the multiplier that was passed to the
constructor, which gives us a Double amount.
By doing this, we have multiplied the amount by the multiplier, and we can return
a new record with the new amount: the old record's amount multiplied by our
multiplier, together with the sales record's description. The second method we
need to override is getPartitions, because we want to keep the partitioning of
the parent RDD. If the previous RDD has 100 partitions, for example, we also want
our MultipliedRDD to have 100 partitions; we want to retain that information
rather than lose it. For the same reason, we simply proxy the call to
firstParent; the firstParent of the RDD then returns the partitions of the
previous RDD in the chain.
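Putting the pieces described here together, a minimal MultipliedRDD might look
like the following sketch; the Record shape (a Double amount and a String
description) is an assumption based on the text:
import org.apache.spark.rdd.RDD
import org.apache.spark.{Partition, TaskContext}

case class Record(amount: Double, description: String) // assumed shape

class MultipliedRDD(parent: RDD[Record], multiplier: Double)
  extends RDD[Record](parent) {

  // Multiply every record's amount, partition by partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Record] =
    firstParent[Record].iterator(split, context).map { salesRecord =>
      Record(salesRecord.amount * multiplier, salesRecord.description)
    }

  // Keep the parent RDD's partitioning instead of defining a new one.
  override protected def getPartitions: Array[Partition] =
    firstParent[Record].partitions
}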
In this way, we have created a new MultipliedRDD, which takes the parent and the
multiplier. For our extendedRdd, we collect it and call toList, and our list
should contain Record(10, "d1"), as shown in the following example:
extendedRdd.collect().toList should contain theSameElementsAs List(
Record(10, "d1")
)
}
}
compute is invoked automatically by the Spark engine when an action is executed
on the new RDD, so it always runs without an explicit method call.
Testing our custom RDD
Let's start this test to check if this has created our RDD. By doing this, we
can extend our parent RDD and add behavior to our RDD. This is shown in
the following screenshot:
"C:\Program Files\Java\jdk-12\bin\java.exe" "-javaagent:C:\Program
Files\JetBrains\IntelliJ IDEA 2018.3.5\lib\idea_rt.jar=51687:C:\Program
Files\JetBrains\IntelliJ IDEA 2018.3.5\bin" -Dfile.encoding=UTF-8 -classpath
C:\Users\Sneha\IdeaProjects\Chapter07\out\production\Chapter07 com.company.Main
Let's first understand directed acyclic graph immutability and what it gives us.
We will create two leaves from one root RDD and check whether both leaves behave
totally independently when we apply a transformation to one of them. We will then
examine the results from both leaves and check that no transformation on any leaf
changes or impacts the root RDD. This matters because, if transformations changed
the root RDD, it would be mutable and we could not safely create yet another leaf
from it. To avoid this, the Spark designers made the RDD immutable.
There is a simple test to show that the RDD is immutable. First, we will create
an RDD of the numbers 0 to 5, using Scala's 0 to 5 range; to is a method that
becomes available on Int through an implicit conversion from the Scala standard
library, as shown in the following example:
class ImmutableRDD extends FunSuite {
val spark: SparkContext = SparkSession
.builder().master("local[2]").getOrCreate().sparkContext
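The range itself can then be turned into an RDD; a sketch, assuming the test uses
makeRDD and the variable name data mentioned below:
//given
val data = spark.makeRDD(0 to 5)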
Once we have our RDD data, we can create the first leaf. The first leaf is the
result (res), where we simply map every element, multiplying it by 2. Let's also
create a second leaf, but this time multiplied by 4, as shown in the following
example:
//when
val res = data.map(_ * 2)
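The second leaf might look like the following; the name leaf2 is an illustrative
choice:
val leaf2 = data.map(_ * 4)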
So, we have our root RDD and two leaves. First, we will collect the first
leaf and see that the elements in it are 0, 2, 4, 6, 8, 10, so everything here is
multiplied by 2, as shown in the following example:
//then
res.collect().toList should contain theSameElementsAs List(
0, 2, 4, 6, 8, 10
)
However, even though we applied that transformation to obtain res, the data RDD
is still exactly the same as it was at the beginning, that is, 0, 1, 2, 3, 4, 5,
as shown in the following example:
data.collect().toList should contain theSameElementsAs List(
0, 1, 2, 3, 4, 5
)
}
}
When we run the test, we will see that every path in our execution, the root,
that is, data, or the first leaf and second leaf, behave independently from
each other, as shown in the following code output:
"C:\Program Files\Java\jdk-12\bin\java.exe" "-javaagent:C:\Program
Files\JetBrains\IntelliJ IDEA 2018.3.5\lib\idea_rt.jar=51704:C:\Program
Files\JetBrains\IntelliJ IDEA 2018.3.5\bin" -Dfile.encoding=UTF-8 -classpath
C:\Users\Sneha\IdeaProjects\Chapter07\out\production\Chapter07 com.company.Main
So, we'll have the first DataFrame with one column, the second one with
result and source, and the third one with only one result. Let's look at the
code for this section.
Our result should have only one record, so the filter drops two rows, but the
userData source that we created still has 3 rows. Filtering therefore creates
yet another DataFrame, which we call res, without modifying the input userData.
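A sketch of that filtering step, assuming it mirrors the Dataset version shown
later in this chapter (the exact DataFrame code is not in this excerpt); the
assertions then check both counts:
//when
val res = userData.filter(userData("userId").isin("a"))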
assert(res.count() == 1)
assert(userData.count() == 3)
}
}
So, let's start this test and see how immutable data from API behaves, as
shown in the following screenshot:
"C:\Program Files\Java\jdk-12\bin\java.exe" "-javaagent:C:\Program
Files\JetBrains\IntelliJ IDEA 2018.3.5\lib\idea_rt.jar=51713:C:\Program
Files\JetBrains\IntelliJ IDEA 2018.3.5\bin" -Dfile.encoding=UTF-8 -classpath
C:\Users\Sneha\IdeaProjects\Chapter07\out\production\Chapter07 com.company.Main
As we can see, our test passes, and, from the result (res), we know that our
parent was not modified. So, for example, if we want to map something with
res.map(), we can map over the userId column, as shown in the following example:
res.map(a => a.getAs[String]("userId") + "can")
This produces yet another leaf with an additional derived value, without changing
the userData source, so that is the immutability of the DataFrame.
Immutability in the highly
concurrent environment
We saw how immutability affects the creation and design of programs; now we will
see why it matters in a highly concurrent environment.
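A minimal sketch of the setup this test assumes: a mutable list, a thread pool,
and a latch that the threads count down on, together with the first thread's
logic (the latch count of 2 is an assumption):
import java.util.concurrent.{CountDownLatch, Executors}
import scala.collection.mutable.ListBuffer

val listMutable = new ListBuffer[String]()
val executors = Executors.newFixedThreadPool(2)
val latch = new CountDownLatch(2)

// First thread: signals readiness and appends "A" if it is not there yet.
executors.submit(new Runnable {
  override def run(): Unit = {
    latch.countDown()
    if (!listMutable.contains("A")) {
      listMutable += "A"
    }
  }
})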
Then, a second thread starts and also calls countDown to signal that it is ready.
It too checks whether the list contains "A" and, if not, appends "A", as shown in
the following example:
executors.submit(new Runnable {
override def run(): Unit = {
latch.countDown()
if(!listMutable.contains("A")) {
listMutable += "A"
}
}
})
We then use await() to wait until both threads have counted down and, once that
has happened, we can proceed with the verification of our program, as shown in
the following example:
But there is a race condition here. It is possible that, between one thread's
check if(!listMutable.contains("A")) and its listMutable += "A", the other thread
has already added "A" to the list. Since we are already inside the if, we then
add a second "A". Because the state is mutable and can be modified by another
thread between the check and the update, we can end up with a corrupted,
duplicated state. We need to be careful when using mutable state. To alleviate
this problem, we can use the java.util collections and a synchronized list.
However, if we use a synchronized block, our program will be slower, because
access has to be coordinated exclusively. We can also employ a lock from the
java.util.concurrent.locks package, for example, the read lock or write lock of a
ReentrantReadWriteLock. In the following example, we obtain the write lock:
val lock = new ReentrantReadWriteLock().writeLock()
We then need to acquire the lock by calling lock() and only then proceed, as
shown in the following example:
lock.lock()
Afterwards, we call unlock(). We should do the same locking in the second thread
as well, so that our list ends up with only one element, as shown in the
following example:
lock.unlock()
Dataset immutability
Creating two leaves from the one root dataset
Adding a new column by issuing transformation
The test case for the Dataset is quite similar, but we call toDS() on our data so
that it is type-safe. The type of the dataset is Dataset[UserData], as shown in
the following example:
import com.tomekl007.UserData
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite
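A sketch of creating that Dataset; the concrete values are assumptions, apart
from the requirement that one row has userId "a" so that the filter below keeps
exactly one record:
val spark: SparkSession = SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._

//given: three UserData rows, one of them with userId "a"
val userData = spark.sparkContext.makeRDD(List(
  UserData("a", "1"),
  UserData("b", "2"),
  UserData("d", "200")
)).toDS()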
Now, we will issue a filter of userData and specify isin, as shown in the
following example:
//when
val res = userData.filter(userData("userId").isin("a"))
It returns the result (res), which is a leaf with one element. userData still has
3 elements, because it is the root dataset and it is immutable. Let's execute
this program, as shown in the following example:
assert(res.count() == 1)
assert(userData.count() == 3)
}
}
We can see that our test passed, which means that the Dataset is also an
immutable abstraction, built on top of the DataFrame and with the same
characteristics. userData has something very useful: it is typed, and if you use
the show() method, Spark will print it with the inferred schema, knowing, for
example, that the userId field is a string, as shown in the following example:
userData.show()
In the next chapter, we'll look at how to avoid shuffle and reduce operational
expenses.
Avoiding Shuffle and Reducing
Operational Expenses
In this chapter, we will learn how to avoid shuffle and reduce the
operational expense of our jobs, along with detecting a shuffle in a process.
We will then test operations that cause a shuffle in Apache Spark to find out
when we should be very careful and which operations we should avoid.
Next, we will learn how to change the design of jobs with wide
dependencies. After that, we will be using the keyBy() operations to reduce
shuffle and, in the last section of this chapter, we'll see how we can use
custom partitioning to reduce the shuffle of our data.
We will load randomly partitioned data to see how and where the data is
loaded. Next, we will issue a partition using a meaningful partition key. We
will then repartition data to the proper executors using the deterministic and
meaningful key. In the end, we will explain our queries by using the explain()
method and understand the shuffle. Here, we have a very simple test: we create
records with a transaction ID and a user ID, where the first two records belong
to user_1 and the last record to user_2. Let's imagine that this data is loaded
through an external data system; it could come from HDFS or from a database, such
as Cassandra or another NoSQL store:
class DetectingShuffle extends FunSuite {
val spark: SparkSession = SparkSession.builder().master("local[2]").getOrCreate()
test("should explain plan showing logical and physical with UDF and DF") {
//given
import spark.sqlContext.implicits._
val df = spark.sparkContext.makeRDD(List(
InputRecord("1234-3456-1235-1234", "user_1"),
InputRecord("1123-3456-1235-1234", "user_1"),
InputRecord("1123-3456-1235-9999", "user_2")
)).toDF()
Using the userId column, we repartition the data so that all the records with the
same user ID end up on the same partition. So the data for user_1, for example,
will end up on one executor:
//when
val q = df.repartition(df("userId"))
That executor will then have all the data for the given userID. So, even if
InputRecord("1234-3456-1235-1234", "user_1") and InputRecord("1123-3456-1235-1234",
"user_1") start out on different executors, after the repartition they will land
on the same one. We can repartition the data even further, but this should be done at the beginning of
our chain. Let's start the test to explain our query:
q.explain(true)
A join is a specific operation that causes shuffle, and we will use it to join
our two DataFrames. We will first check whether it causes shuffle and then
we will check how to avoid it. To understand this, we will use two
DataFrames that are partitioned differently and check the operation of
joining two datasets or DataFrames that are not partitioned or partitioned
randomly. It will cause shuffle because there is no way to join two datasets
with the same partition key if they are on different physical machines.
Before we join the dataset, we need to send them to the same physical
machine. We will be using the following test.
We need to create UserData, which is a case class that we have seen already. It
has the user ID and data. We have user IDs, that is, user_1, user_2, and user_4:
test("example of operation that is causing shuffle") {
import spark.sqlContext.implicits._
val userData =
spark.sparkContext.makeRDD(List(
UserData("user_1", "1"),
UserData("user_2", "2"),
UserData("user_4", "200")
)).toDS()
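The transactionData side is built in the same way; the user IDs follow the text
(user_1, user_2, and user_3), while the amounts here are illustrative
assumptions:
val transactionData = spark.sparkContext.makeRDD(List(
  UserTransaction("user_1", 100),
  UserTransaction("user_2", 300),
  UserTransaction("user_3", 1300)
)).toDS()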
We use the joinWith transformation on userData, joining on the userId column of
userData and transactionData. Since we have issued an inner join, the result has
only two elements: there is a matching UserData and UserTransaction pair for
user_1 and user_2, while user_4 has no transaction and user_3 has no user data:
//shuffle: userData can stay on the current executors, but data from
//transactionData needs to be send to those executors according to joinColumn
//causing shuffle
//when
val res: Dataset[(UserData, UserTransaction)]
= userData.joinWith(transactionData, userData("userId") ===
transactionData("userId"), "inner")
When we join the data this way, it is not pre-partitioned by the join column;
Spark has no way of knowing that the user ID column is the partition key, as it
cannot guess this. Since the data is not pre-partitioned, joining the two
datasets requires sending the records with the same user ID to the same executor.
Hence, a lot of data is shuffled between executors, simply because the data is
not partitioned appropriately.
Let's explain the query, perform an assert, and show the results by starting
the test:
//then
res.show()
assert(res.count() == 2)
}
}
In the next section, we'll learn how to change the design of jobs with wide
dependencies, so we will see how to avoid unnecessary shuffling when
performing a join on two datasets.
Changing the design of jobs with
wide dependencies
In this section, we will change the job that was performing the join on non-
partitioned data. We'll be changing the design of jobs with wide
dependencies.
Here, we have our example test case, with the data we used previously in
the Testing operations that cause a shuffle in Apache Spark section. We have
UserData with three records for user ID – user_1, user_2, and user_4 – and
the UserTransaction data with the user ID – that is, user_1, user_2, user_3:
test("example of operation that is causing shuffle") {
import spark.sqlContext.implicits._
val userData =
spark.sparkContext.makeRDD(List(
UserData("user_1", "1"),
UserData("user_2", "2"),
UserData("user_4", "200")
)).toDS()
Then, we need to repartition the data, which is the first very important thing
to do. We are repartitioning our userData using the userId column:
val repartitionedUserData = userData.repartition(userData("userId"))
Then, we will repartition our data using the userId column, this time for
transactionData:
val repartitionedTransactionData =
transactionData.repartition(transactionData("userId"))
Once we have our data repartitioned, we have the assurance that any data
that has the same partition key – in this example, it's userId – will land on the
same executor. Because of that, our repartitioned data will not have the
shuffle, and the joins will be faster. In the end, we are able to join, but this
time we are joining the pre-partitioned data:
//when
//data is already partitioned using join-column. Don't need to shuffle
val res: Dataset[(UserData, UserTransaction)]
= repartitionedUserData.joinWith(repartitionedTransactionData, userData("userId")
=== transactionData("userId"), "inner")
We get a SortMergeJoin operation on data that was already pre-partitioned in the
previous step of our execution plan. This way, the Spark engine performs a
sort-merge join instead of a hash join; it sorts the data properly and the join
is faster.
In the next section, we'll be using keyBy() operations to reduce shuffle even
further.
Using keyBy() operations to reduce
shuffle
In this section, we will use keyBy() operations to reduce shuffle. We will load
randomly partitioned data, but this time using the RDD API; repartition the data
in a meaningful way and extract what is going on underneath, similar to the
DataFrame and Dataset APIs; and learn how to leverage the keyBy() function to
give our data some structure and cause pre-partitioning in the RDD API.
Here is the test we will be using in this section. We create three input records:
the first has a random-looking ID and the user ID user_1, the second also has the
user ID user_1, and the third has the user ID user_2:
test("Should use keyBy to distribute traffic properly"){
//given
val rdd = spark.sparkContext.makeRDD(List(
InputRecord("1234-3456-1235-1234", "user_1"),
InputRecord("1123-3456-1235-1234", "user_1"),
InputRecord("1123-3456-1235-9999", "user_2")
))
At this point, our data is spread randomly and the records for the user ID
field could be on different executors because the Spark execution engine
cannot guess whether user_1 is a meaningful key for us or whether
1234-3456-1235-1234 is. We know that 1234-3456-1235-1234 is not a meaningful key
and that user_1 is, but Spark has no way of knowing that data for the same user
ID should land on the same executor. That's why we need to use the user ID field,
user_1 or user_2, when partitioning the data. To achieve that in the RDD API, we
can use keyBy(_.userId) on our data, but note that this changes the RDD type:
val res = rdd.keyBy(_.userId)
If we check the type, we'll see that the RDD is no longer an RDD of InputRecord
but an RDD of a (String, InputRecord) pair; the String is the type of the key
field, userId. We can also extract information about the keyBy() step by calling
toDebugString on the result:
println(res.toDebugString)
Once we use keyBy(), all the records for the same user ID will land on the same
executor. As we have discussed, this can be dangerous: if we have a skewed key, a
key that carries a very large share of the data, we can run out of memory. Also,
all subsequent operations on the result will be key-wise, so we will be working
on the pre-partitioned data:
res.collect()
In the next section, we'll use a custom partitioner to reduce shuffle even
further.
Using a custom partitioner to
reduce shuffle
In this section, we will use a custom partitioner to reduce shuffle. We will
implement a custom partitioner with our own logic, which will decide how the data
is partitioned; it tells Spark on which partition, and hence on which executor,
each record should land. We will be using the partitionBy method on the RDD. In
the end, we will validate that our data was partitioned properly. For the
purposes of this test, we are assuming that we have two executors:
import com.tomekl007.UserTransaction
import org.apache.spark.sql.SparkSession
import org.apache.spark.{Partitioner, SparkContext}
import org.scalatest.FunSuite
import org.scalatest.Matchers._
Let's assume that we want to split our data evenly into 2 executors and that
the instances of data with the same key will land on the same executor. So,
our input data is a list of UserTransactions: "a", "b", "a", "b", and "c". The values
are not so important, but we need to keep them in mind to test the
behavior later. The amount is 100, 101, 202, 1, and 55 for the given UserTransactions:
val data = spark
.parallelize(List(
UserTransaction("a", 100),
UserTransaction("b", 101),
UserTransaction("a", 202),
UserTransaction("b", 1),
UserTransaction("c", 55)
))
The getPartition method takes a key, which will be the userId. The key will be
passed here and the type will be a string:
override def getPartition(key: Any): Int = {
key.hashCode % numberOfExecutors
}
})
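Pieced together, the whole chain might look like the following sketch; the
numberOfExecutors value and the variable names are assumptions consistent with
the text:
val numberOfExecutors = 2

val keyed = data.keyBy(_.userId)

val partitioned = keyed.partitionBy(new Partitioner {
  override def numPartitions: Int = numberOfExecutors

  // hashCode is non-negative for these keys; a production partitioner
  // should guard against negative hash codes.
  override def getPartition(key: Any): Int =
    key.hashCode % numberOfExecutors
})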
The signature of getPartition takes the key as Any, and we also need to override
numPartitions, which returns the number of partitions.
We then print our two partitions, and numPartitions returns the value of 2:
println(data.partitions.length)
We then use mapPartitions, which gives us an iterator per partition; here, we
extract the amount from each transaction for test purposes:
//when
val res = data.mapPartitions[Long](iter =>
iter.map(_._2).map(_.amount)
).collect().toList
In the end, we assert 55, 100, 202, 101, and 1; the order is random, so there is no
need to take care of the order:
//then
res should contain theSameElementsAs List(55, 100, 202, 101, 1)
}
}
If we cared about the order, we could use the sortBy method. Let's start this
test and see whether our custom partitioner works as expected. We have 2
partitions, so it works as expected, as shown in the following output:
Summary
In this chapter, we learned how to detect shuffle in a process. We covered
testing operations that cause a shuffle in Apache Spark. We also learned
how to employ partitioning in the RDD. It is important to know how to use
the API if partitioned data is needed, because RDD is still widely used, so
we use the keyBy operations to reduce shuffle. We also learned how to use
the custom partitioner to reduce shuffle.
In the next chapter, we'll learn how to save data in the correct format using
the Spark API.
Saving Data in the Correct Format
In the previous chapters, we were focusing on processing and loading data.
We learned about transformations, actions, joining, shuffling, and other
aspects of Spark.
In this chapter, we will learn how to save data in the correct format and also
save data in plain text format using Spark's standard API. We will also
leverage JSON as a data format, and learn how to use standard APIs to save
JSON. Spark has a CSV format and we will leverage that format as well.
We will then learn more advanced schema-based formats, where support is
required to import third-party dependencies. Following that, we will use
Avro with Spark and learn how to use and save the data in a columnar
format known as Parquet. By the end of this chapter, we will have also
learned how to retrieve data to validate whether it is stored in the proper
way.
We will save our data in plain text format and investigate how it is saved into
the Spark directory. We will then load the plain text data back, and test and
save it to check whether we can yield the same results. This is our
SavePlainText.scala file:
package com.tomekl007.chapter_4
import java.io.File
import com.tomekl007.UserTransaction
import org.apache.spark.sql.SparkSession
import org.apache.spark.{Partitioner, SparkContext}
import org.scalatest.{BeforeAndAfterEach, FunSuite}
import org.scalatest.Matchers._
import scala.reflect.io.Path
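The input itself is not shown in this listing; based on the records we see later
in the saved file, it is presumably something like the following:
//given
val rdd = spark.makeRDD(List(UserTransaction("a", 100), UserTransaction("b", 200)))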
//when
rdd.coalesce(1).saveAsTextFile(FileName)
We will need a FileName variable, which, in our case, will be a folder name,
and Spark will then create a couple of files underneath:
import java.io.File
import com.tomekl007.UserTransaction
import org.apache.spark.sql.SparkSession
import org.apache.spark.{Partitioner, SparkContext}
import org.scalatest.{BeforeAndAfterEach, FunSuite}
import org.scalatest.Matchers._
import scala.reflect.io.Path
class SavePlainText extends FunSuite with BeforeAndAfterEach{
val spark: SparkContext =
SparkSession.builder().master("local[2]").getOrCreate().sparkContext
private val FileName = "transactions.txt"
We will use BeforeAndAfterEach in our test case to clean our directory after
every test, which means that the path should be deleted recursively. The
whole path is deleted after the test, as it is required to rerun the tests without
a failure. We need to comment the following code out for the first run to
investigate the structure of the saved text file:
//override def afterEach() {
// val path = Path (FileName)
// path.deleteRecursively()
// }
We will then coalesce our data to one partition. coalesce() is a very important
aspect. If we want to save our data in a single file, we need to coalesce it into
one, but there is an important implication of doing so:
rdd.coalesce(1).saveAsTextFile(FileName)
If we coalesce it to a single file, then only one executor can save the data to
our system. This means that saving the data will be very slow and, also,
there will be a risk of being out of memory because all data will be sent to
one executor. Generally, in the production environment, we save it as many
partitions, based on the executors available, or even multiplied by its own
factor. So, if we have 16 executors, then we can save it to 64. But this results
in 64 files. For test purposes, we will save it to one file, as shown in the
preceding code snippet:
rdd.coalesce (numPartitions = 1).saveAsTextFile(FileName)
Now, we'll load the data. We only need to pass the file name to the textFile
method and assign the result to fromFile:
val fromFile = spark.textFile(FileName)
The important thing to note is that for a list of strings, Spark doesn't know
the schema of our data because we are saving it in plain text.
This is one of the points to note when it comes to saving plain text, because
loading the data is not easy, since we need to manually map every string to
UserTransaction. So, we will have to parse every record manually, but, for test
purposes, we will treat our transaction as strings.
Now, let's start the test and see the structure of the folder that was created:
In the preceding screenshot, we can see that our test passed and that we get a
transactions.txt folder. Inside the folder, we have four files. The first one is
._SUCCESS.crc, the checksum for the success marker. Next, we have .part-00000.crc,
the checksum used to validate that the part file was written properly. We then
have _SUCCESS, which indicates that the save succeeded, and part-00000, which
holds the actual data, UserTransaction(a,100) and UserTransaction(b,200):
In the next section, we will learn what will happen if we increment the
number of partitions.
Leveraging JSON as a data format
In this section, we will leverage JSON as a data format and save our data in
JSON. The following topics will be covered:
This data is human-readable and gives us more meaning than simple plain
text because it carries some schema information, such as a field name. We
will then learn how to save data in JSON format and load our JSON data.
We then call coalesce(), this time with a value of 2, so we will have two
resulting files. We then call write.format, for which we need to specify a
format; we will use the json format:
rdd.coalesce(2).write.format("json").save(FileName)
If we use the unsupported format, we will get an exception. Let's test this by
entering the source as not:
rdd.coalesce(2).write.format("not").save(FileName)
In the preceding code output, we can see that our test passed and that the
DataFrame includes all the meaningful data.
From the output, we can see that DataFrame has all the schema required. It
has amount and userId, which is very useful.
The transactions.json folder has two parts—one part is r-00000, and the other
part is r-00001, because we issued two partitions. If we save the data in a
production system with 100 partitions, we will end up with 100 part files
and, furthermore, every part file will have a CRC checksum file.
Here, each JSON record carries its own schema, so we have a userId field and an
amount field.
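The first part file holds the first record, presumably along the lines of the
following:
{"userId":"a","amount":"100"}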
On the other hand, we have the second file with the second record with userID
and amount as well:
{"userId":"b","amount":"200"}
The advantage of this is that Spark is able to infer the schema from the data and
load it into a DataFrame with proper column names and types. The disadvantage,
however, is that every record carries some additional overhead: each record
repeats the field names as strings, so, if we have a file with millions of rows
and we are not compressing it, there will be substantial overhead, which is not
ideal.
Saving CSV files is even more involved than JSON and plain text because
we need to specify whether we want to retain headers of our data in our CSV
file.
We then use the csv write format. Here, we set the header option to false, so the
header is not included:
//when
rdd.coalesce(1)
.write
.format("csv")
.option("header", "false")
.save(FileName)
We then run the test again, this time writing with the header option set to true:
//when
rdd.coalesce(1)
.write
.format("csv")
.option("header", "true")
.save(FileName)
In the preceding code output, we can see that the data is loaded, but we lost
our schema. c0 and c1 are the aliases for column 0 (c0) and column 1 (c1) that
were created by Spark.
If we want the header to retain that information, we need to specify the header
option both when writing and when reading:
val fromFile = spark.read.option("header", "true").csv(FileName)
We specify that the header should retain our information. In the following
output, we can see that the schema information was preserved throughout the write
and read operations:
+------+------+
|userId|amount|
+------+------+
| a| 100|
| b| 200|
+------+------+
Let's see what happens if we write with the header and read without it. Our test
should fail, as demonstrated in the following code screenshot:
In the preceding screenshot, we can see that our test failed because we don't
have a schema as we were reading without headers. The first record, which
was a header, was treated as the column value.
Let's try a different situation, where we are writing without header and
reading with header:
//when
rdd.coalesce(1)
.write
.format("csv")
.option("header", "false")
.save(FileName)
Our test will fail again because this time, we treated our first record as the
header record.
Let's set both the read and write operations with header and test our code after
removing the comment we added previously:
override def afterEach() {
val path = Path(FileName)
path.deleteRecursively()
}
The CSV file still carries schema information in its header, but with less
overhead per record; therefore, it can be even better than JSON.
In the next section, we'll see how we can use a schema-based format as a
whole with Spark.
Using Avro with Spark
So far, we have looked at text-based files. We worked with plain text, JSON,
and CSV. JSON and CSV are better than plain text because they carry some
schema information.
Avro has a schema and data embedded within it. This is a binary format and
is not human-readable. We will learn how to save data in Avro format, load
it, and then test it.
When using CSV, we specified csv as the format and, for JSON, json was the format
as well. With Avro, however, we have a dedicated method. This method is not a
standard Spark method; it comes from a third-party library. To have Avro support,
we need to go to build.sbt and add the spark-avro dependency from com.databricks.
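For reference, the dependency line might look like the following in build.sbt;
the version shown is an assumption and should be matched to your Spark version:
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"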
We then need to import the proper method. We will import
com.databricks.spark.avro._ to give us the implicit function that is extending the
Spark DataFrame:
import com.databricks.spark.avro._
We are using the avro method from that library; the implicit class wraps Spark's
DataFrameWriter and writes our data in the Avro format. In the coalesce code we
used previously, we can call write and then avro; the avro method is a shortcut
that saves us from writing com.databricks.spark.avro as the full format string:
//when
rdd.coalesce(2)
.write
.avro(FileName)
Let's comment out the afterEach() cleanup, so that the Avro output is not removed,
to check how it is saved:
// override def afterEach() {
// val path = Path(FileName)
// path.deleteRecursively()
// }
The first part file contains binary data: a number of binary records preceded by
some human-readable data, which is our schema. We have two fields - userId, which
is of type string or null, and amount, which is an integer; being a primitive
type, it cannot hold null values.
The important thing to note is that, in production systems, we save really large
datasets with thousands of records, yet the schema appears only once, at the top
of every file. If we check the second part file, we will see exactly the same
schema followed by the binary data. Usually, the schema takes just one line, or a
few more if it is complex, so it remains a very small amount of data.
We can see that in the resulting dataset, we have userID and amount:
+------+------+
|userId|amount|
+------+------+
| a| 100|
| b| 200|
+------+------+
In the preceding output, we can see that the schema was preserved in the file;
although it is a binary file, we can still extract the schema. Parquet, which we
will look at next, is a columnar format: the data is stored column-wise rather
than row-wise, unlike the JSON, CSV, plain text, and Avro files we have seen.
This is a very interesting and important format for big data processing and
for making the process faster. In this section, we will focus on adding
Parquet support to Spark, saving the data into the filesystem, reloading it
again, and then testing. Parquet is similar to Avro as it gives you a parquet
method but this time, it is a slightly different implementation.
In the build.sbt file, for the Avro format, we need to add an external
dependency, but for Parquet, we already have that dependency within
Spark. So, Parquet is the way to go for Spark because it is inside the
standard package.
Let's have a look at the logic that's used in the SaveParquet.scala file for
saving and loading Parquet files.
First, we coalesce the two partitions, specify the format, and then specify
that we want to save parquet:
package com.tomekl007.chapter_4
import com.databricks.spark.avro._
import com.tomekl007.UserTransaction
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FunSuite}
import scala.reflect.io.Path
//when
rdd.coalesce(2)
.write
.parquet(FileName)
//then (assuming spark here is the SparkSession)
val fromFile = spark.read.parquet(FileName)
fromFile.show()
assert(fromFile.count() == 2)
}
Let's begin this test but, before that, we will comment out the following code
within our SaveParquet.scala file to see the structure of the files:
// override def afterEach() {
// val path = Path(FileName)
// path.deleteRecursively()
// }
A new transactions.parquet folder gets created and we have two parts inside it
—part-r-00000 and part-r-00001. This time, however, the format is entirely
binary and there is some metadata embedded with it.
We have the metadata embedded and also the amount and userID fields, which
are of the string type. The part r-00000 is exactly the same and has the
schema embedded. Hence, Parquet is also a schema-based format. When we
read the data, we can see that we have the userID and amount columns
available.
Summary
In this chapter, we learned how to save data in plain text format. We noticed
that schema information is lost when we do not load the data properly. We
then learned how to leverage JSON as a data format and saw that JSON
retains the schema, but it has a lot of overhead because the schema is for
every record. We then learned about CSV and saw that Spark has embedded
support for it. The disadvantage of this approach, however, is that the header
does not carry the types of the individual fields, so the types need to be
inferred implicitly. Toward the end of this chapter, we covered Avro and Parquet,
schema-based binary formats, with Parquet being a columnar format that ships with
Spark.
First, we will create an array of user transactions for users A, B, A, B, and C for
some amount, as per the following example:
val keysWithValuesList =
Array(
UserTransaction("A", 100),
UserTransaction("B", 4),
UserTransaction("A", 100001),
UserTransaction("B", 10),
UserTransaction("C", 10)
)
We then need to key our data by a specific field, as per the following
example:
val keyed = data.keyBy(_.userId)
Now, our data is assigned to the keyed variable and its type is a tuple. The
first element is a string, that is, userId and the second element is
UserTransaction.
Let's look at the transformations that are available. First, we will look at
countByKey.
This returns a Map[K, Long], where K is a generic type because the key can be of
any kind; in this example, the key is a string and Long is the count. Every
operation that returns a map is not entirely safe: if you see a method signature
that returns a map, it is a sign that this data will be sent to the driver and it
needs to fit in memory. If there is too much data to fit into one driver's
memory, we will run out of memory. Hence, we need to be cautious when using this
method.
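The call itself is a one-liner on the keyed RDD; the variable name counted
matches the assertion below:
//when
val counted = keyed.countByKey()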
We then perform an assert count that should contain the same elements as
the map, as per the following example:
counted should contain theSameElementsAs Map("B" -> 2, "A" -> 2, "C" -> 1)
B is 2 because we have two values for it, and so is A; C is 1 because it has only
one value. countByKey() is not memory-expensive because it only stores keys and
counters. However, if the key is a complex, large object, for example, a
transaction with many fields, then the resulting map could still be really big.
We also have the combineByKey() method, which combines values for the same key,
and the more general aggregateByKey(), which is able to aggregate into a
different type. There is also foldByKey, which takes the current state and a
value but must return the same type as the value for that key, as sketched below.
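As a rough illustration of that difference, foldByKey has to stay within the
value type, so to sum amounts per key we would first map the values down to the
amounts (the variable names here are illustrative):
// fold amounts per key; the zero value and the folded values share one type
val totals = keyed
.mapValues(_.amount)
.foldByKey(0)(_ + _)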
First, we will create our array of user transactions, as shown in the following
example:
val keysWithValuesList =
Array(
UserTransaction("A", 100),
UserTransaction("B", 4),
UserTransaction("A", 100001),
UserTransaction("B", 10),
UserTransaction("C", 10)
)
We will then use parallelize to create an RDD, as we want our data to be key-
wise. This is shown in the following example:
val data = spark.parallelize(keysWithValuesList)
val keyed = data.keyBy(_.userId)
In the preceding code, we invoked keyBy on userId so that our data becomes pairs
of a key (the userId) and a UserTransaction.
The next argument to aggregateByKey is a function that takes the current element
that we are processing - (transaction: UserTransaction) - together with the state
that we initialized the aggregation with, which is an array buffer here. The
state needs to be of the same type as the zero value shown in the following code
block, so this is our type T:
mutable.ArrayBuffer.empty[Long]
At this point, we are able to take any transaction and add its amount to the
per-key state. This is done in a distributed way: partial states for the same key
can be built on different executors in parallel, so multiple transactions are
added to states for the same key. Spark then knows that, for exactly the same
key, it has multiple intermediate states of type T (ArrayBuffer) that it needs to
merge, so we need to mergeAmounts for our transactions for the same key.
The mergeArgument is a method that takes two states, both of which are
intermediate states of type T, as shown in the following code block:
val mergeAmounts = (p1: mutable.ArrayBuffer[Long], p2: mutable.ArrayBuffer[Long])
=> p1 ++= p2
In this example, we want to merge the two array buffers into one, so we issue p1
++= p2.
Now, we have all the arguments ready and we are able to execute aggregateByKey
and see what the results look like. The result is an RDD of the string key and
our state of type T, the ArrayBuffer[Long]. We no longer keep UserTransaction in
our RDD, which helps to reduce the amount of memory used; UserTransaction is a
heavier object because it can have multiple fields and, in this example, we are
only interested in the amount field.
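A sketch of the call itself; the name addAmount for the per-element function is a
hypothetical choice for the function described above (this assumes
scala.collection.mutable is imported, as in the earlier snippet):
// seqOp: add the current transaction's amount to the per-key state
val addAmount = (state: mutable.ArrayBuffer[Long], transaction: UserTransaction) =>
  state += transaction.amount

val aggregatedTransactionsForUserId = keyed
  .aggregateByKey(mutable.ArrayBuffer.empty[Long])(addAmount, mergeAmounts)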
The following example shows what our result should look like:
aggregatedTransactionsForUserId.collect().toList should contain theSameElementsAs
List(
("A", ArrayBuffer(100, 100001)),
("B", ArrayBuffer(4,10)),
("C", ArrayBuffer(10)))
We should have the key A with an ArrayBuffer of 100 and 100001, since that is our
input data; B should be 4 and 10; and, lastly, C should be 10.
In the next section, we'll be looking at the actions that are available on
key/value pairs.
Actions on key/value pairs
In this section, we'll be looking at the actions on key/value pairs.
Therefore, we'll be using collect() and we'll be examining the output of our
action on these key/value pairs.
First, we will create our transactions array and RDD according to userId, as
shown in the following example:
val keysWithValuesList =
Array(
UserTransaction("A", 100),
UserTransaction("B", 4),
UserTransaction("A", 100001),
UserTransaction("B", 10),
UserTransaction("C", 10)
)
The first action that comes to mind is collect(). collect() takes every element
and returns it to the driver so, unlike the result of keyBy, our result is a
local collection rather than an RDD.
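A sketch of the keying step and the collect() call; the variable names follow the
earlier sections:
val data = spark.parallelize(keysWithValuesList)
val keyed = data.keyBy(_.userId)

//when
val res = keyed.collect().toList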
Our result is a pair of keys, userId, and a value, that is, UserTransaction. We can
see, from the following example, that we can have a duplicated key:
res should contain theSameElementsAs List(
("A",UserTransaction("A",100)),
("B",UserTransaction("B",4)),
("A",UserTransaction("A",100001)),
("B",UserTransaction("B",10)),
("C",UserTransaction("C",10))
)//note duplicated key
We can see, from the preceding output, that our test has passed. To see the
other actions, we will look at different methods.
If a method returns an RDD, it is not an action; this holds for key/value pairs
too. collect() does not return an RDD but an array, so it is an action. count
returns a Long, so it is also an action, and countByKey returns a map, so it is
an action as well. reduce is an action, but reduceByKey is not - that is the big
difference between reduce and reduceByKey. So, for key/value pairs, the actions
behave just as they do for ordinary RDDs; the differences are only in the
transformations.
Examining HashPartitioner
Examining RangePartitioner
Testing
We will then use keyBy (as shown in the following example) because the
partitioner will automatically work on the key for our data:
val keyed = data.keyBy(_.userId)
We can also check that, at this point, no partitioner has been assigned, as shown
in the following example:
assert(partitioner.isEmpty)
HashPartitioner's getPartition method first checks whether our key is null. If it
is null, the record will land in partition number 0. If we have a lot of data with
null keys, it will all land on the same executor and, as we know, this is not a
good situation, because that executor will have a lot of memory overhead and can
fail with out-of-memory exceptions.
If the key is not null, then it does a nonNegativeMod from hashCode and the number
of partitions. It has to be the modulus of the number of partitions so that it
can be assigned to the proper partition. Thus, the hashCode method is very
important for our key.
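A simplified sketch of this logic is shown below; it paraphrases the behavior just
described and is not the exact Spark source:
// Simplified, paraphrased version of hash-based partition assignment.
def getPartitionSketch(key: Any, numPartitions: Int): Int = {
  if (key == null) {
    0                                                  // null keys all land in partition 0
  } else {
    val rawMod = key.hashCode % numPartitions
    rawMod + (if (rawMod < 0) numPartitions else 0)    // non-negative modulus of the partition count
  }
}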
If we are supplying a custom key, rather than a primitive type such as an integer
or a string that already has a well-known hashCode, we need to implement a proper
hashCode ourselves. The best practice, however, is to use Scala case classes, as
they have hashCode and equals implemented for you.
We have now seen how the partitioner is defined, but the partitioner is something
that can be changed dynamically. We can change our partitioner to a
RangePartitioner. RangePartitioner takes the desired number of partitions and the
RDD whose keys it should sample to compute the ranges.
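A sketch of how both partitioners might be applied, assuming the keyed RDD built
with keyBy above (the partition counts are arbitrary choices):
import org.apache.spark.{HashPartitioner, RangePartitioner}
// Repartition by the hash of the key into 100 partitions.
val hashPartitioned = keyed.partitionBy(new HashPartitioner(100))
// Repartition into ranges sampled from the keyed RDD itself.
val rangePartitioned = keyed.partitionBy(new RangePartitioner(100, keyed))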
Let's start our test to check if we were able to assign partitioner properly, as
shown in the following output:
Our tests have passed. This means that, at the initial point, the partitioner was
empty, and that after we shuffled the RDD with partitionBy, a partitioner such as
a RangePartitioner was assigned. The output only shows the line number where the
instance of the partitioner interface was created.
In the next section, we'll tweak and play with the partitioner by implementing a
custom one.
Implementing a custom partitioner
In this section, we'll implement a custom partitioner that takes a list of ranges.
If our key falls into a specific range, we will assign the partition number that
is the index of that range in the list. We will implement our own
range-partitioning logic and then test our partitioner. Let's start with the
black-box test, without looking at the implementation.
The first part of the code is similar to what we have used already, but this time
we key the data by amount, as shown in the following example:
val keysWithValuesList =
Array(
UserTransaction("A", 100),
UserTransaction("B", 4),
UserTransaction("A", 100001),
UserTransaction("B", 10),
UserTransaction("C", 10)
)
val data = spark.parallelize(keysWithValuesList)
val keyed = data.keyBy(_.amount)
We are keying by the amount and we have the following keys: 100, 4, 100001,
10, and 10.
Let's look at the following example, which shows the implementation of our
custom range partitioner:
class CustomRangePartitioner(ranges: List[(Int,Int)]) extends Partitioner{
override def numPartitions: Int = ranges.size
override def getPartition(key: Any): Int = {
if(!key.isInstanceOf[Int]){
throw new IllegalArgumentException("partitioner works only for Int type")
}
val keyInt = key.asInstanceOf[Int]
val index = ranges.lastIndexWhere(v => keyInt >= v._1 && keyInt <= v._2)
println(s"for key: $key return $index")
index
}
}
Next, we have the getPartition method. First, our partitioner will work only
for integers, as shown in the following example:
if(!key.isInstanceOf[Int])
This partitioner works only for integer keys and cannot be used for other types.
For that reason, we first need to check whether our key is an instance of Int and,
if it is not, we throw an IllegalArgumentException, because the partitioner works
only for the Int type.
We can now cast our key to keyInt by using asInstanceOf. Once this is done, we are
able to iterate over the ranges and take the last range for which our predicate
holds. The predicate is expressed on a tuple v, as follows:
val index = ranges.lastIndexWhere(v => keyInt >= v._1 && keyInt <= v._2)
keyInt should be greater than or equal to v._1, which is the first element of the
tuple, and it should also be lower than or equal to the second element, v._2.
The start of the range is v._1 and the end of the range is v._2, so we can check
that our element is within range.
In the end, we will print the key and the index that was found for it, for
debugging purposes, and we will return the index, which will be our partition
number. This is shown in the following example:
println(s"for key: $key return $index")
In the next chapter, we will learn how to test Apache Spark jobs.
Testing Apache Spark Jobs
In this chapter, we will test Apache Spark jobs and learn how to separate
logic from the Spark engine.
We will first cover unit testing of our code, which will then be exercised by an
integration test using SparkSession. Later, we will mock data sources using
partial functions, and then learn how to leverage ScalaCheck for property-based
testing of types in Scala. By the end of this chapter, we will have performed
tests against different versions of Spark.
Let's look at the logic first, and then at a simple test.
We have a BonusVerifier object with only one method, qualifyForBonus, which takes
our UserTransaction model class. According to the logic in the following code, we
load user transactions and filter all users that qualify for a bonus. To test this
through Spark, we would first need to create an RDD and filter it; we would need a
SparkSession, plus data for mocking an RDD or DataFrame, and we would end up
exercising the whole Spark API. Since this is pure logic, we will test it in
isolation instead. The logic is as follows:
package com.tomekl007.chapter_6
import com.tomekl007.UserTransaction
object BonusVerifier {
private val superUsers = List("A", "X", "100-million")
def qualifyForBonus(userTransaction: UserTransaction): Boolean = {
superUsers.contains(userTransaction.userId) && userTransaction.amount > 100
}
}
We have a list of super users with the A, X, and 100-million user IDs. If our
userTransaction.userId is within the superUsers list, and if the
userTransaction.amount is higher than 100, then the user qualifies for a bonus;
otherwise, they don't. In the real world, the bonus-qualification logic will be
even more complex, and thus it is very important to test the logic in isolation.
The following code shows our domain model, UserTransaction, which is shared
between the Spark integration test and the unit test that is separated from Spark.
We know that a user transaction includes a userId and an amount:
package com.tomekl007
import java.util.UUID
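A minimal sketch of the model, reconstructed from how it is used elsewhere in this
chapter, could look like this:
// Reconstructed sketch of the domain model; the field types are inferred from usage
// such as UserTransaction("X", 101) and userTransaction.amount > 100.
case class UserTransaction(userId: String, amount: Int)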
We need to create our UserTransaction for user ID X and the amount 101, as
shown in the following example:
package com.tomekl007.chapter_6
import com.tomekl007.UserTransaction
import org.scalatest.FunSuite
class SeparatingLogic extends FunSuite {
test("test complex logic separately from spark engine") {
//given
val userTransaction = UserTransaction("X", 101)
//when
val res = BonusVerifier.qualifyForBonus(userTransaction)
//then
assert(res)
}
}
We can also cover the opposite case: a user, X, that spends only 99, for which the
result should be false.
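A sketch of that second test might look as follows (the test name is an assumption):
test("test complex logic separately from spark engine - non qualify") {
  //given: a superuser whose amount is not above the threshold
  val userTransaction = UserTransaction("X", 99)
  //when
  val res = BonusVerifier.qualifyForBonus(userTransaction)
  //then
  assert(!res)
}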
When we validate our code, we can see, from the following output, that our
test has passed:
We have covered two cases, but in real-world scenarios there are many more. For
example, we can test the case where the userId is not on the superuser list:
some_new_user spends a lot of money, 100000 in our case, and we get the following
result:
test(testName = "test complex logic separately from spark engine - non qualify2") {
//given
val userTransaction = UserTransaction("some_new_user", 100000)
//when
val res = BonusVerifier.qualifyForBonus(userTransaction)
//then
assert(!res)
}
Even though the amount is large, this user should not qualify because they are not
on the superuser list; logic like this is subtle, which is why we cover it with
unit tests:
Our tests are very fast, so we are able to check that everything works as expected
without involving Spark at all. In the next section, we'll test the same logic
with integration testing using SparkSession.
Integration testing using
SparkSession
Let's now learn about integration testing using SparkSession.
Here, we are creating the Spark engine. The following line is crucial for the
integration test:
val spark: SparkContext =
SparkSession.builder().master("local[2]").getOrCreate().sparkContext
For this reason, we should use unit tests to cover all the edge cases, and use
integration testing only for a smaller, critical part of the logic.
This is the first time that Spark has been involved in our integration testing.
Creating an RDD is also a time-consuming operation. Compared to just
creating an array, it is really slow to create an RDD because that is also a
heavy object.
This function was already unit tested, so we don't need to consider all edge
cases, different IDs, different amounts, and so on. We are just creating a
couple of IDs with some amounts to test whether or not our whole chain of
logic is working as expected.
After we have applied this logic, our output should be similar to the
following:
UserTransaction("A", 100001)
Let's start this test and check how long it takes to execute a single integration
test, as shown in the following output:
It took around 646 ms to execute this simple test.
If we want to cover every edge case, the value will be multiplied by a factor
of hundreds compared to the unit test from the previous section. Let's start
this unit test with three edge cases, as shown in the following output:
We can see that our test took only 18 ms, roughly 35 times faster than the single
integration test, even though it covered three edge cases rather than just one.
Here, we have covered a lot of logic with hundreds of edge cases, and we
can conclude that it is really wise to have unit tests at the lowest possible
level.
In the next section, we will be mocking data sources using partial functions.
Mocking data sources using partial
functions
In this section, we will mock data sources using partial functions, so that our
logic can be tested without connecting to an external system.
In production, we load user transactions through the HiveDataLoader component,
which has only one method that issues sparkSession.sql("select * from
transactions"), as shown in the following code block:
object HiveDataLoader {
def loadUserTransactions(sparkSession: SparkSession): DataFrame = {
sparkSession.sql("select * from transactions")
}
}
This means that the function goes to Hive to retrieve the data and returns a
DataFrame. According to our logic, we execute a provider that returns a DataFrame,
and from that DataFrame we select only the amount column. This logic is not simple
to test, because in production our SparkSession provider interacts with an
external system. So, we structure our code as a function such as the following:
UserDataLogic.loadAndGetAmount(spark, HiveDataLoader.loadUserTransactions)
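A sketch of what UserDataLogic might look like, consistent with the provider call
above and the select shown later in this section (the exact signature is inferred
rather than quoted from the source):
import org.apache.spark.sql.{DataFrame, SparkSession}
// Inferred sketch: the provider abstracts away where the DataFrame comes from
// (Hive in production, an in-memory DataFrame in tests).
object UserDataLogic {
  def loadAndGetAmount(sparkSession: SparkSession,
                       provider: SparkSession => DataFrame): DataFrame = {
    val df = provider(sparkSession)
    df.select(df("amount"))
  }
}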
Let's see how to test such a component. First, we will create a DataFrame of
user transactions, which is our mock data, as shown in the following
example:
val df = spark.sparkContext
.makeRDD(List(UserTransaction("a", 100), UserTransaction("b", 200)))
.toDF()
Without this approach, we would need to save the data to Hive, embed Hive, and
then start it. Since we are using partial functions, we can instead pass a partial
function as the second argument, as shown in the following example:
val res = UserDataLogic.loadAndGetAmount(spark, _ => df)
The first argument is spark, but it is not used in our method this time. The
second argument is a method that is taking SparkSession and returning
DataFrame.
But, for the logic shown, it is transparent and doesn't consider whether the
DataFrame comes from Hive, SQL, Cassandra, or any other source, as
shown in the following example:
val df = provider(sparkSession)
df.select(df("amount"))
In our example, df comes from the memory that we created for the purposes
of the test. Our logic continues and it selects the amount.
Then, we show our columns with res.show(), and the logic should end up with a
single amount column. Let's start this test, as shown in the following example:
We can see, from the preceding example, that our resulting DataFrame has one
amount column with the values 100 and 200. This means it worked as expected,
without the need to start an embedded Hive. The key here is to use a provider and
not to embed the select statement within our logic.
Property-based testing
Creating a property-based test
First, let's define the first property of our string type in the following way:
property("length of strings") = forAll { (a: String, b: String) =>
a.length + b.length >= a.length
}
Let's assume that we want to get two random strings, and that the invariant should
hold for them. If we add the length of string a to the length of string b, the sum
should be greater than or equal to a.length, because if b is empty then both sides
will be equal, as shown in the following example:
a.length + b.length >= a.length
This is an invariant of strings, and it should be true for every input string.
The second property that we are defining is a bit more complex, as shown in
the following code:
property("creating list of strings") = forAll { (a: String, b: String, c: String)
=>
List(a,b,c).map(_.length).sum == a.length + b.length + c.length
}
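A self-contained sketch of how these properties might be wrapped into a runnable
ScalaCheck suite follows; the object name is an assumption, while the two
properties are the ones discussed above:
import org.scalacheck.Prop.forAll
import org.scalacheck.Properties

// Hypothetical suite wrapping the two properties so that ScalaCheck can run them.
object StringTypeProperties extends Properties("StringType") {

  property("length of strings") = forAll { (a: String, b: String) =>
    a.length + b.length >= a.length
  }

  property("creating list of strings") = forAll { (a: String, b: String, c: String) =>
    List(a, b, c).map(_.length).sum == a.length + b.length + c.length
  }
}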
When we map every element to its length, the sum of those lengths should be equal
to the sum of a.length, b.length, and c.length. Here, we are exercising the
collections API to check whether map and the other functions work as expected.
Let's start this property-based test to check if our properties are correct, as
shown in the following example:
We can see that the StringType length of strings property passed and that 100
tests were executed. It might be surprising that 100 tests were run, so let's see
what inputs were used. We will print the a and b arguments and retry our property,
which produces the following output:
We can see that a lot of weird strings were generated; these are edge cases that
we would not have been able to come up with up-front. Property-based testing
generates very unusual, unique strings that are nothing like hand-written test
data, so it is a great tool for checking whether our logic works as expected for a
specific type.
Let's go back to mocking data sources, covered in the third section of this
chapter, Mocking data sources using partial functions, and compare it with Spark
pre-2.x. This time, we are unable to use DataFrames. The following example shows
the same logic for earlier versions of Spark:
test("mock loading data from hive"){
//given
import spark.sqlContext.implicits._
val df = spark.sparkContext
.makeRDD(List(UserTransaction("a", 100), UserTransaction("b", 200)))
.toDF()
.rdd
//when
val res = UserDataLogicPre2.loadAndGetAmount(spark, _ => df)
//then
println(res.collect().toList)
}
}
We can see that we are not able to use DataFrames this time. In the previous
section, loadAndGetAmount was taking spark and a provider that returned a
DataFrame; in the following example, the provider returns an RDD rather than a
DataFrame, so we pass an rdd:
val res = UserDataLogicPre2.loadAndGetAmount(spark, _ => rdd)
To do this, we need a different UserDataLogicPre2 that takes a SparkSession and a
provider returning an RDD[Row], and then maps that RDD to an RDD of integers, as
shown in the following example:
object UserDataLogicPre2 {
def loadAndGetAmount(sparkSession: SparkSession, provider: SparkSession =>
RDD[Row]): RDD[Int] = {
provider(sparkSession).map(_.getAs[Int]("amount"))
}
}
object HiveDataLoaderPre2 {
def loadUserTransactions(sparkSession: SparkSession): RDD[Row] = {
sparkSession.sql("select * from transactions").rdd
}
}
In the preceding code, we can see that we execute the provider, map every element
of the resulting RDD, and extract the amount field as an Int. Row is a generic
type that can hold a variable number of fields.
In the next chapter, we will learn how to leverage the Spark GraphX API.
Leveraging the Spark GraphX API
In this chapter, we will learn how to create a graph from a data source. We
will then carry out experiments with the Edge API and Vertex API. By the
end of this chapter, you will know how to calculate the degree of vertex
and PageRank.
We will take the graph.g file, load it, and see how it will provide results in
Spark. First, we need to get a resource to our graph.g file. We will do this
using the getClass.getResource() method to get the path to it, as follows:
package com.tomekl007.chapter_7
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite
import org.apache.spark.graphx.{Graph, GraphLoader}
object GraphBuilder {
assert(graph.triplets.count() == 4)
}
We will start the test and see whether we are able to load our graph properly.
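A hedged sketch of what that test could look like, assuming graph.g is an edge
list on the test classpath that GraphLoader can read (the class name, test name,
and resource path are assumptions):
import org.apache.spark.SparkContext
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class GraphBuilderTest extends FunSuite {
  val spark: SparkContext =
    SparkSession.builder().master("local[2]").getOrCreate().sparkContext

  test("should load graph.g from the test resources") {
    //given: the path to the bundled graph.g file
    val path = getClass.getResource("/graph.g").getPath
    //when: load it as an edge list
    val graph = GraphLoader.edgeListFile(spark, path)
    //then: the file is expected to contain four edges
    assert(graph.triplets.count() == 4)
  }
}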
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite
VertexId is of the Long type; it is only a type alias for Long:
type VertexId = Long
Since our graph can contain a lot of vertices, each VertexId should be unique, and
Long gives it a very large range. Every vertex in our vertices RDD should have a
unique VertexId. The custom data associated with a vertex can be any class, but we
will go for simplicity with the String class. First, we are
creating a vertex with ID 1 and string data a, the next with ID 2 and string
data b, the next with ID 3 and string data c, and similarly for the data with
ID 4 and string d, as follows:
val users: RDD[(VertexId, (String))] =
spark.parallelize(Array(
(1L, "a"),
(2L, "b"),
(3L, "c"),
(4L, "d")
))
Creating a graph from only vertices will be correct but not very useful. A graph is the
best way to find relationships between the data, which is why a graph is the main
building block for social networks.
Creating couple relationships
In this section, we will create couple relationships and edges between our
vertices. Here, we'll have a relationship that is an Edge. An Edge is a case class
from the org.apache.spark.graphx package. It is a bit more involved because we
need to specify the source vertex ID and destination vertex ID. We want to
specify that vertex ID 1 and 2 have a relationship, so let's make a label for
this relationship. In the following code, we will specify vertex ID 1 and ID 2
as a friend, then we will specify vertex ID 1 and ID 3 as a friend as well.
Lastly, vertex ID 2 and ID 4 will be a wife:
val relationships =
spark.parallelize(Array(
Edge(1L, 2L, "friend"),
Edge(1L, 3L, "friend"),
Edge(2L, 4L, "wife")
))
A label can be of any type; it doesn't need to be a String, so we can pass
whatever we want. Once we have our users vertices and our edge relationships, we
can create a graph. We are using the Graph class' apply method to construct our
Spark GraphX graph. We need to pass the users RDD of (VertexId, data) pairs and
the relationships, as follows:
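The construction, together with the uppercase transformation discussed next, might
look like the following sketch; the mapVertices call is an assumption based on the
behavior described below:
//when: build the graph from vertices and edges
val graph = Graph(users, relationships)
// Map every vertex attribute to uppercase (assumed implementation of the described step).
val res = graph.mapVertices((_, attr) => attr.toUpperCase)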
Once we convert it into uppercase, we can just collect all the vertices and
perform toList(), as follows:
println(res.vertices.collect().toList)
}
We can see that after applying the transformation to the values, our graph
has the following vertices:
Using the Edge API
In this section, we will construct the graph using the Edge API. We'll also
use the vertex, but this time we'll focus on the edge transformations.
Constructing the graph using edge
As we saw in the previous sections, we have edges and vertices, and both are RDDs.
Because the edges form an RDD, we have access to the usual RDD methods: max, min,
sum, and all the other actions. We could, for example, apply the reduce method,
which takes two edges, e1 and e2, and performs some logic on them.
Since an edge chains together two vertices, we can also perform logic based on its
attribute. For example, if we want to keep only the edges whose attribute is equal
to friend, we can use the filter operation. The filter method takes one edge at a
time and keeps it in the result if its attribute is friend.
At the end, we can collect the result and perform a toList, so that the standard
Spark API is available for our use. The following code will help us implement our
logic:
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite
class EdgeAPI extends FunSuite {
val spark: SparkContext =
SparkSession.builder().master("local[2]").getOrCreate().sparkContext
val relationships =
spark.parallelize(Array(
Edge(1L, 2L, "friend"),
Edge(1L, 3L, "friend"),
Edge(2L, 4L, "wife")
))
//when
val resFromFilter = graph.edges.filter((e1) => e1.attr ==
"friend").collect().toList
println(resFromFilter)
The graph also has a couple of methods on top of the standard RDD. For example, we
can use mapEdges, which takes an edge, and map the attribute of every edge,
turning each label to uppercase, as follows:
val res = graph.mapEdges(e => e.attr.toUpperCase)
On the graph, we can also group edges. Grouping edges is similar to GROUP BY, but
only for edges.
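A sketch of what grouping edges could look like follows, assuming the graph built
from users and relationships as before; groupEdges merges parallel edges between
the same pair of vertices, and the merge function shown here (concatenating the
labels) is an illustrative assumption:
import org.apache.spark.graphx.PartitionStrategy
// groupEdges requires the graph to be partitioned first; duplicate edges between the
// same source and destination are then merged by the supplied function.
val grouped = graph
  .partitionBy(PartitionStrategy.EdgePartition2D)
  .groupEdges((a, b) => a + "," + b)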
Let's run our code. We can see in the output that it has filtered out the wife
edge; we only keep the friend edges, from vertex ID 1 to ID 2 and from vertex ID 1
to ID 3, and the mapped edges are shown in the following screenshot:
Calculating the degree of the vertex
In this section, we will cover the total degree, then we'll split it into two parts
—an in-degree and an out-degree—and we will understand how this works
in the code.
For our first test, let's construct the graph that we already know about:
package com.tomekl007.chapter_7
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite
import org.scalatest.Matchers._
val relationships =
spark.parallelize(Array(
Edge(1L, 2L, "friend"),
Edge(1L, 3L, "friend"),
Edge(2L, 4L, "wife")
))
We can get the degrees using the degrees method. The degrees method returns a
VertexRDD, because a degree is a per-vertex value:
val graph = Graph(users, relationships)
//when
val degrees = graph.degrees.collect().toList
The preceding code explains that for the 4L instance of VertexId, there is only
one relationship because there is a relationship between 2L and 4L.
Then, for the 2L instance of VertexId, there are two, so it is between 1L, 2L and
2L, 4L. For the 1L instance of VertexId, there are two, which are 1L, 2L and 1L,
3L, and for VertexId 3L, there is only one relationship, between 1L and 3L. This
way, we can check how our graph is coupled and how many relationships
there are. We can find out which vertex is best known by sorting them, so we
can see that our test passed in the following screenshot:
The in-degree
The in-degree tells us how many edges point into a vertex, not out of it. This
time, we can see that the 2L instance of VertexId has one inbound edge, coming
from 1L; 3L also has one inbound edge, from 1L; and 4L has one inbound edge, from
2L. There will be no entry for VertexId 1L in the resulting dataset, because no
edge points at 1L; it is only a source and never a destination:
test("should calculate in-degree of vertices") {
//given
val users: RDD[(VertexId, (String))] =
spark.parallelize(Array(
(1L, "a"),
(2L, "b"),
(3L, "c"),
(4L, "d")
))
val relationships =
spark.parallelize(Array(
Edge(1L, 2L, "friend"),
Edge(1L, 3L, "friend"),
Edge(2L, 4L, "wife")
))
//when
val degrees = graph.inDegrees.collect().toList
//then
degrees should contain theSameElementsAs List(
(2L, 1L),
(3L, 1L),
(4L, 1L)
)
}
The outDegrees method also returns a VertexRDD (which is itself an RDD), and we
collect it to a list using the collect and toList methods:
val relationships =
spark.parallelize(Array(
Edge(1L, 2L, "friend"),
Edge(1L, 3L, "friend"),
Edge(2L, 4L, "wife")
))
//when
val degrees = graph.outDegrees.collect().toList
//then
degrees should contain theSameElementsAs List(
(1L, 2L),
(2L, 1L)
)
}
}
Also, VertexId 2L should have one outbound edge, as there is a relationship from
2L to 4L and not the other way around, as shown in the preceding code.
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite
import org.scalatest.Matchers._
We are splitting each line on the comma; the first field is parsed as a Long,
which will be the vertex ID, and fields(1) is the name of the vertex, as follows:
val fields = line.split(",")
(fields(0).toLong, fields(1))
}
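The ranks themselves come from running PageRank on the graph; a hedged sketch is
shown below, where the edge-list file name and the tolerance value are
illustrative assumptions:
// Load the follower graph and run PageRank until it converges within the given tolerance.
val graph = GraphLoader.edgeListFile(spark, getClass.getResource("/followers.txt").getPath)
val ranks = graph.pageRank(0.0001).vertices   // RDD of (VertexId, rank) pairs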
Next, we will join the users with the ranks. We join them on the VertexId, which
gives us the username and rank of each user. Once we have that, we can sort
everything by the rank, taking the second element of the tuple with
sortBy((t) => t._2, ascending = false). At the beginning of the output, we will
have the user with the most influence:
//when
val rankByUsername = users.join(ranks).map {
case (_, (username, rank)) => (username, rank)
}.sortBy((t) => t._2, ascending = false)
.collect()
.toList
}
If we skip the sortBy method, Spark does not guarantee any ordering of elements; to keep
the ordering, we need to issue the sortBy method.
When we start running this test, we can see whether the GraphX PageRank
was able to calculate the influence of our users. We get the output that's
shown in the preceding screenshot, where BarackObama was first with an influence
of 1.45, then ladygaga with an influence of 1.39, odersky with 1.29, jeresig with
0.99, matei_zaharia with 0.70, and, at the end, justinbieber with an influence of
0.15.