Week 8-10 BDA

The document discusses exploding arrays into rows, finding word frequencies by grouping and counting, and storing analyzed results in a CSV file. The key steps are using explode() to expand arrays, groupBy() and count() to find word frequencies, and write.csv() or write.text() to save results to files.
Week-8

Aim: Explode a column of arrays into rows of elements and perform text
formatting operations

Procedure: To explode and flatten an array-of-arrays (nested array) column into rows using Spark, you can use the explode and flatten functions from the pyspark.sql.functions module. The explode function takes an array or map column and creates a new row for each element in it. The flatten function takes an array-of-arrays column and converts it into a single array column.

Program:
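A minimal sketch of such a program (the sample data, column names, and app name below are illustrative assumptions, not taken from the original):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, flatten, lower, trim

spark = SparkSession.builder.appName("week8").getOrCreate()

# Sample data: each row holds a nested array (array of arrays) of strings
data = [(1, [["Hello ", "WORLD"], ["Spark"]]),
        (2, [["Big", " Data "]])]
df = spark.createDataFrame(data, ["id", "nested"])

# flatten() converts the array-of-arrays column into a single array column
flat = df.withColumn("all_words", flatten(df["nested"]))

# explode() creates one row per element of the flattened array
rows = flat.select("id", explode(flat["all_words"]).alias("word"))

# Text formatting: trim surrounding whitespace and normalize case
formatted = rows.select("id", lower(trim(rows["word"])).alias("word"))
formatted.show()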


Week-9
Aim: Find the word frequencies of the text data using groupBy() and count()

Procedure:
groupBy() is used to group the records based on single or multiple column values. Once the
records are grouped, we can perform various aggregate functions like count(), sum(), avg(),
etc. on the grouped data.

count() is used to get the number of records in each group. To perform the count, first call groupBy() on the DataFrame to group the records by one or more column values, then call count() to get the number of records for each group.

Program:

Week-10
Aim: Store the analyzed results in a CSV file.

Procedure:
In PySpark you can save (write) a DataFrame to a CSV file on disk using dataframeObj.write.csv("path"). The same API can also write the DataFrame to AWS S3, Azure Blob Storage, HDFS, or any other file system PySpark supports.

Program:
Loading the dataset

from pyspark.shell import spark

book = spark.read.text("books.txt")
book.show()

Analysing the dataset – finding 3-letter words

from pyspark.sql.functions import split, explode, length

lines = book.withColumn("words", split(book["value"], " "))
lines_exp = lines.select(lines.value, explode(lines.words).alias("words"))
word_frequencies = lines_exp.groupBy("words").count()
three_letter_words = word_frequencies.filter(length(word_frequencies.words) == 3)
three_letter_words.show()

Writing into .txt file

words_to_save = three_letter_words.select("words")
words_to_save.write.text("three_letter_words.txt")
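
Note that write.text() expects a DataFrame with a single string column and, like the other Spark writers, produces a directory of part files rather than a single file. To force a single part file (a variation not in the original listing), coalesce the DataFrame first:

# Coalesce to one partition so Spark writes a single part file
words_to_save.coalesce(1).write.text("three_letter_words.txt")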

Loading a dataset

from pyspark.shell import spark

df = spark.read.csv("input.csv")

Writing into .csv file

df.write.csv("output.csv")
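
A common variation (an assumption here, not part of the original listing) is to treat the first row as a header when reading and to write the header back out, overwriting any previous output:

# header=True reads/writes column names; inferSchema=True guesses column types
df = spark.read.csv("input.csv", header=True, inferSchema=True)
df.write.mode("overwrite").csv("output.csv", header=True)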
