Week 8-10 BDA
Aim: Explode a column of arrays into rows of elements and perform text
formatting operations
Procedure:
groupBy() is used to group the records of a DataFrame based on one or more column values. Once the
records are grouped, we can apply aggregate functions such as count(), sum(), and avg()
to the grouped data.
count() returns the number of records in each group. To use it, first call groupBy() on the
DataFrame to group the records by one or more columns, then call count() to get the number of
records in each group.
Program:
Input data:
Week-10
Aim: Store the analyzed results in a CSV file.
Procedure:
In PySpark you can save (write) a DataFrame to a CSV file on disk by using
dataframeObj.write.csv("path"). The same API can also write a DataFrame
to AWS S3, Azure Blob Storage, HDFS, or any other file system supported by PySpark.
Program:
Loading the dataset
book = spark.read.text("books.txt")
book.show()
# (Assumed intermediate step, implied by the variable name below:
# split each line into words, explode into one word per row,
# and keep only the three-letter words.)
words = book.select(explode(split("value", r"\s+")).alias("words"))
three_letter_words = words.filter(length("words") == 3)
words_to_save = three_letter_words.select("words")
words_to_save.write.text("three_letter_words.txt")
Loading a CSV dataset
df = spark.read.csv("input.csv")
Writing into a .csv file
df.write.csv("output.csv")