
BIG DATA TECHNOLOGIES LAB

CSC - 394
Name: Addhyan Pant DATE: 25th March 2021
UID: 18BCS3780 SEC: 18AITBDA1 (Group 1)

EXPERIMENT – 5

AIM: Write the code of a word count program using Apache Spark

1) Working on the notebook and creating the desired program


i) First, check the versions of Python and Spark, and import the required packages in the notebook.

ii) Using the filter and map functions with a lambda function.
Lambda Function: Lambda functions are anonymous functions in Python. An anonymous function is not bound to a name at runtime; the lambda expression simply returns a function object without a name. Lambdas are typically used with the map and filter methods, creating small functions to be called later.
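For instance, a lambda can be passed straight to filter and map without ever being bound to a name. A minimal sketch in plain Python:

    nums = [1, 2, 3, 4, 5, 6]

    evens = list(filter(lambda x: x % 2 == 0, nums))  # keep only even numbers
    squares = list(map(lambda x: x * x, evens))       # square each survivor

    print(evens)    # [2, 4, 6]
    print(squares)  # [4, 16, 36]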

iii) Creating RDD from Object


RDD: A Resilient Distributed Dataset (RDD) is a collection of data partitioned across the computers of a cluster, with each partition processed on a different machine. RDDs are the most basic data structure in Spark.
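A minimal sketch of creating an RDD from a Python object, assuming the notebook already exposes a SparkContext as sc:

    data = ["hello spark", "hello big data", "spark is fast"]
    rdd = sc.parallelize(data)  # distribute the list across the cluster
    print(rdd.count())          # 3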
iv) Performing Transformations and Actions on an RDD
Transformations and actions are the two types of operations in Spark. Transformations create new RDDs, while actions perform computations on RDDs and return results to the driver.
map, filter, flatMap, and union are basic RDD transformations; collect, take, first, and count are basic RDD actions.
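A short sketch chaining transformations and actions, again assuming an active SparkContext sc:

    rdd = sc.parallelize([1, 2, 3, 4, 5])

    doubled = rdd.map(lambda x: x * 2)     # transformation: returns a new RDD
    big = doubled.filter(lambda x: x > 4)  # transformation: returns a new RDD

    print(big.collect())  # action: [6, 8, 10]
    print(big.count())    # action: 3
    print(big.first())    # action: 6
    print(big.take(2))    # action: [6, 8]

Note that transformations are lazy: nothing is computed until an action such as collect or count is called.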

v) Performing Transformation and Action on PAIR RDD


A pair RDD is a special type of RDD for working with datasets of key/value pairs. All regular transformations work on pair RDDs.
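Since the experiment's aim is a word count, a pair RDD is the natural fit: flatMap and map are regular transformations, while reduceByKey is a pair-RDD transformation that sums the counts per key. A minimal sketch, assuming sc exists and using inline sample lines rather than the lab's file:

    lines = sc.parallelize(["hello spark", "hello big data"])
    counts = (lines.flatMap(lambda line: line.split())  # split lines into words
                   .map(lambda word: (word, 1))         # build (word, 1) pairs
                   .reduceByKey(lambda a, b: a + b))    # sum counts per word
    print(counts.collect())  # e.g. [('hello', 2), ('spark', 1), ('big', 1), ('data', 1)]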
2) Creating a SparkSession and performing the subsequent tasks.
i) SparkContext is the main entry point for creating RDDs, while SparkSession provides a single point of entry for working with Spark DataFrames.
SparkSession is used to create DataFrames, register them as tables, and execute SQL queries.
In PySpark notebooks we can access the SparkSession through the spark variable.
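A minimal sketch of obtaining a SparkSession when one is not pre-created; the application name here is an arbitrary assumption:

    from pyspark.sql import SparkSession

    # getOrCreate() returns the existing session if one is already running
    spark = SparkSession.builder.appName("WordCountLab").getOrCreate()
    print(spark.version)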

ii) Creating PySpark DataFrame from RDD


Spark SQL, the Spark module for structured data processing, provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
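A minimal sketch of turning an RDD into a DataFrame, assuming spark and sc exist; the rows and column names are illustrative:

    rdd = sc.parallelize([("Alice", 34), ("Bob", 29)])
    df = spark.createDataFrame(rdd, schema=["name", "age"])
    df.show()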

iii) Adding Dataset


a) people.csv: In the people.csv tile, click Insert to Code and then click SparkSession DataFrame.
Set the ‘File_name’ and ‘Bucket_name’ as shown in the inserted code.
b) 5000_people.txt: In the 5000_people.txt tile, click Insert to Code and then click Credentials.
Set the ‘File_name’ and ‘Bucket_name’ as shown in the inserted code.

iv) Create PySpark DataFrame from external file


Create a PySpark DataFrame using the SparkSession's read.csv method, passing the path of the CSV file as an argument, as sketched below.
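A minimal sketch; the path and options are illustrative rather than the lab's exact values:

    df = spark.read.csv("people.csv", header=True, inferSchema=True)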

The show action prints the first 20 rows of the DataFrame.

The count action returns the number of rows in the DataFrame.

The columns attribute gives the list of column names in the DataFrame.

The printSchema method prints the type of each column in the DataFrame and shows whether a column may contain null values.

The select method selects a subset of the DataFrame's columns. If we pass a number to the show method, it prints that many rows.

Getting the count of the data elements in the DataFrame:
● We can use the dropDuplicates method to remove/drop all the duplicate rows in the DataFrame.
● We can get the count of the females in the DataFrame using the filter and count methods.
● Group the DataFrame by Sex using the groupBy method.
● Sort the DataFrame by DOB using the orderBy method.
● Rename a column using the withColumnRenamed method.
These operations are sketched together in the code below.
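The sketch assumes the DataFrame df loaded above; the columns Sex and DOB follow the lab's dataset, while the remaining names and the 'female' label are assumptions:

    df.show(5)                                    # first 5 rows
    print(df.count())                             # number of rows
    print(df.columns)                             # list of column names
    df.printSchema()                              # column types and nullability
    df.select("Sex", "DOB").show()                # project a subset of columns
    df.dropDuplicates().show()                    # drop duplicate rows
    print(df.filter(df.Sex == "female").count())  # count of females (assumed label)
    df.groupBy("Sex").count().show()              # group rows by Sex
    df.orderBy("DOB").show()                      # sort rows by DOB
    df2 = df.withColumnRenamed("DOB", "date_of_birth")  # rename a column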
3) Using SQL queries with DataFrames via the Spark SQL module
SQL queries can achieve the same results as the DataFrame methods. First, we create a temporary table with the createOrReplaceTempView method, passing the name of the temporary table as an argument. Then we can pass any query we want to execute to the SparkSession's sql method.
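A minimal sketch, reusing the df loaded earlier; the view name and query are illustrative:

    df.createOrReplaceTempView("people")
    result = spark.sql("SELECT Sex, COUNT(*) AS n FROM people GROUP BY Sex")
    result.show()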

4) Create RDD from external file
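A minimal sketch, assuming sc exists; the file name mirrors the dataset added earlier:

    rdd = sc.textFile("5000_people.txt")  # one RDD element per line of the file
    print(rdd.take(3))                    # peek at the first three lines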


5) Machine Learning with PySpark MLlib
PySpark MLlib is Apache Spark's scalable machine learning library for Python, consisting of common learning algorithms and utilities. We use the KMeans algorithm from MLlib to cluster the data in the 5000_points.txt dataset.
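A minimal sketch of the clustering step, assuming sc exists and that 5000_points.txt holds whitespace-separated numeric coordinates, one point per line; the choice of k=5 is an assumption:

    from pyspark.mllib.clustering import KMeans

    points = (sc.textFile("5000_points.txt")
                .map(lambda line: [float(v) for v in line.split()]))

    model = KMeans.train(points, k=5, maxIterations=10)  # cluster into 5 groups
    print(model.clusterCenters)                          # the learned centroids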

Result: We have successfully created an Apache Spark word count program and have also implemented it.
