
BIG DATA TECHNOLOGIES LAB

CSC - 394
Name: Addhyan Pant DATE: 25th March 2021
UID: 18BCS3780 SEC: 18AITBDA1 (Group 1)

EXPERIMENT – 5

AIM: Write the code of a word count program using Apache Spark

1) Working on the notebook and creating the desired program


i) First, check the versions of Python and Spark, and import the required packages in the notebook.

ii) Using the filter and map functions with a lambda function.
Lambda Function: Lambda functions are anonymous functions in Python. An anonymous function is not bound to a name at runtime; the lambda expression simply returns a function object without a name. Lambdas are typically used with the map and filter methods, creating small functions to be called later.
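For instance, a lambda can be passed straight to filter and map without ever being bound to a name. A minimal sketch in plain Python:

    nums = [1, 2, 3, 4, 5, 6]

    evens = list(filter(lambda x: x % 2 == 0, nums))  # keep only even numbers
    squares = list(map(lambda x: x * x, evens))       # square each survivor

    print(evens)    # [2, 4, 6]
    print(squares)  # [4, 16, 36]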

iii) Creating RDD from Object


RDD: A Resilient Distributed Dataset (RDD) is a collection of data partitioned across the computers of a cluster, with each partition processed on a different machine. RDDs are the most basic data structure in Spark.
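A minimal sketch of creating an RDD from a Python object, assuming the notebook already exposes a SparkContext as sc:

    data = ["hello spark", "hello big data", "spark is fast"]
    rdd = sc.parallelize(data)  # distribute the list across the cluster
    print(rdd.count())          # 3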
iv) Performing Transformations and Actions on an RDD
Transformations and actions are the two types of operations in Spark. Transformations create new RDDs, while actions perform computations on RDDs and return results to the driver.
map, filter, flatMap, and union are basic RDD transformations; collect, take, first, and count are basic RDD actions.
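A short sketch chaining transformations and actions, again assuming an active SparkContext sc:

    rdd = sc.parallelize([1, 2, 3, 4, 5])

    doubled = rdd.map(lambda x: x * 2)     # transformation: returns a new RDD
    big = doubled.filter(lambda x: x > 4)  # transformation: returns a new RDD

    print(big.collect())  # action: [6, 8, 10]
    print(big.count())    # action: 3
    print(big.first())    # action: 6
    print(big.take(2))    # action: [6, 8]

Note that transformations are lazy: nothing is computed until an action such as collect or count is called.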

v) Performing Transformation and Action on PAIR RDD


A pair RDD is a special type of RDD for working with datasets of key/value pairs. All regular transformations work on pair RDDs.
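Since the experiment's aim is a word count, a pair RDD is the natural fit: flatMap and map are regular transformations, while reduceByKey is a pair-RDD transformation that sums the counts per key. A minimal sketch, assuming sc exists and using inline sample lines rather than the lab's file:

    lines = sc.parallelize(["hello spark", "hello big data"])
    counts = (lines.flatMap(lambda line: line.split())  # split lines into words
                   .map(lambda word: (word, 1))         # build (word, 1) pairs
                   .reduceByKey(lambda a, b: a + b))    # sum counts per word
    print(counts.collect())  # e.g. [('hello', 2), ('spark', 1), ('big', 1), ('data', 1)]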
2) Creating a SparkSession and performing the subsequent tasks.
i) SparkContext is the main entry point for creating RDDs, while SparkSession provides a single point of entry for working with Spark DataFrames.
SparkSession is used to create DataFrames, register them as tables, and execute SQL queries.
In PySpark notebooks we can access the SparkSession through the spark variable.
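A minimal sketch of obtaining a SparkSession when one is not pre-created; the application name here is an arbitrary assumption:

    from pyspark.sql import SparkSession

    # getOrCreate() returns the existing session if one is already running
    spark = SparkSession.builder.appName("WordCountLab").getOrCreate()
    print(spark.version)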

ii) Creating PySpark DataFrame from RDD


Spark SQL, the Spark module for structured data processing, provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
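A minimal sketch of turning an RDD into a DataFrame, assuming spark and sc exist; the rows and column names are illustrative:

    rdd = sc.parallelize([("Alice", 34), ("Bob", 29)])
    df = spark.createDataFrame(rdd, schema=["name", "age"])
    df.show()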

iii) Adding Dataset


a) people.csv: In the people.csv tile, click Insert to Code and then click SparkSession DataFrame.
Set the ‘File_name’ and ‘Bucket_name’ as shown in the inserted code.
b) 5000_people.txt: In the 5000_people.txt tile, click Insert to Code and then click Credentials.
Set the ‘File_name’ and ‘Bucket_name’ as shown in the inserted code.

iv) Create PySpark DataFrame from external file


Create a PySpark DataFrame using the SparkSession's read.csv method, passing the path of the CSV file as an argument, as sketched below.
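A minimal sketch; the path and options are illustrative rather than the lab's exact values:

    df = spark.read.csv("people.csv", header=True, inferSchema=True)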

The show action prints the first 20 rows of the DataFrame.

The count action returns the number of rows in the DataFrame.

The columns attribute gives the list of column names in the DataFrame.

The printSchema method prints the type of each column in the DataFrame and shows whether a column may contain null values.

The select method selects a subset of the DataFrame's columns. If we pass a number to the show method, it prints that many rows.

Getting the count of the data elements in the DataFrame:
● We can use the dropDuplicates method to remove/drop all the duplicate rows in the DataFrame.
● We can get the count of the females in the DataFrame using the filter and count methods.
● Group the DataFrame by Sex using the groupBy method.
● Sort the DataFrame by DOB using the orderBy method.
● Rename a column using the withColumnRenamed method.
These operations are sketched together in the code below.
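The sketch assumes the DataFrame df loaded above; the columns Sex and DOB follow the lab's dataset, while the remaining names and the 'female' label are assumptions:

    df.show(5)                                    # first 5 rows
    print(df.count())                             # number of rows
    print(df.columns)                             # list of column names
    df.printSchema()                              # column types and nullability
    df.select("Sex", "DOB").show()                # project a subset of columns
    df.dropDuplicates().show()                    # drop duplicate rows
    print(df.filter(df.Sex == "female").count())  # count of females (assumed label)
    df.groupBy("Sex").count().show()              # group rows by Sex
    df.orderBy("DOB").show()                      # sort rows by DOB
    df2 = df.withColumnRenamed("DOB", "date_of_birth")  # rename a column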
3) Using SQL queries with DataFrames via the Spark SQL module
SQL queries can achieve the same results as the DataFrame methods. First, we create a temporary table with the createOrReplaceTempView method, passing the name of the temporary table as an argument. Then we can pass any query we want to execute to the SparkSession's sql method.
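A minimal sketch, reusing the df loaded earlier; the view name and query are illustrative:

    df.createOrReplaceTempView("people")
    result = spark.sql("SELECT Sex, COUNT(*) AS n FROM people GROUP BY Sex")
    result.show()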

4) Create RDD from external file
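A minimal sketch, assuming sc exists; the file name mirrors the dataset added earlier:

    rdd = sc.textFile("5000_people.txt")  # one RDD element per line of the file
    print(rdd.take(3))                    # peek at the first three lines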


5) Machine Learning with PySpark MLlib
PySpark MLlib is Apache Spark's scalable machine learning library for Python, consisting of common learning algorithms and utilities. We use the KMeans algorithm from MLlib to cluster the data in the 5000_points.txt dataset.
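A minimal sketch of the clustering step, assuming sc exists and that 5000_points.txt holds whitespace-separated numeric coordinates, one point per line; the choice of k=5 is an assumption:

    from pyspark.mllib.clustering import KMeans

    points = (sc.textFile("5000_points.txt")
                .map(lambda line: [float(v) for v in line.split()]))

    model = KMeans.train(points, k=5, maxIterations=10)  # cluster into 5 groups
    print(model.clusterCenters)                          # the learned centroids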

Result: We have successfully created an Apache Spark word count program and have also implemented it.
