Pyspark Interview Questions
data = [
("East", "Jan", 200), ("East", "Feb", 300),
("East", "Mar", 250), ("West", "Jan", 400),
("West", "Feb", 350), ("West", "Mar", 450) ]
from pyspark.sql.functions import count, sum
df = spark.createDataFrame(data, ['region', 'month', 'sales'])
df = df.groupby('region') \
    .pivot('month') \
    .agg(count('sales'), sum('sales'))
df.show()
df = spark.createDataFrame([
    ('John', '123', '00015', '1'), ('John', '123', '00016', '2'),
    ('John', '345', '00205', '3'), ('John', '345', '00206', '4'),
    ('John', '789', '00283', '5'), ('John', '789', '00284', '6'),
    ('John', '789', '00285', '7')
], ['Customer', 'ID', 'unit', 'order'])
from pyspark.sql import functions as F, Window

# The original referenced undefined names (df1, sid, dr, N); one reading is that
# 'dr' is a per-(Customer, ID) sequence number and N the largest group size.
w = Window.partitionBy('Customer', 'ID').orderBy('order')
df1 = df.withColumn('dr', F.row_number().over(w))
N = df1.groupBy('Customer', 'ID').count().agg(F.max('count')).collect()[0][0]
df_new = df1.groupby('Customer', 'ID') \
    .pivot('dr', list(range(1, N + 1))) \
    .agg(
        F.first('unit').alias('unit'),
        F.first('order').alias('order')
    )
---------------------------
2. For the user-activity dataset below:
   1. Drop duplicate rows from the dataset.
   2. Handle any missing values appropriately.
   3. Determine the top 3 most frequent activity_type values for each user_id.
   4. Calculate the time spent by each user on each activity_type.
data = [
("U1", "2024-12-30 10:00:00", "LOGIN"),
("U1", "2024-12-30 10:05:00", "BROWSE"),
("U1", "2024-12-30 10:20:00", "LOGOUT"),
("U2", "2024-12-30 11:00:00", "LOGIN"),
("U2", "2024-12-30 11:15:00", "BROWSE"),
("U2", "2024-12-30 11:30:00", "LOGOUT"),
("U1", "2024-12-30 10:20:00", "LOGOUT"), # Duplicate entry
(None, "2024-12-30 12:00:00", "LOGIN"), # Missing user_id
("U3", None, "LOGOUT") # Missing timestamp
]
data = [
(1, 'Alice', 'Engineer', 'IT', 70000),
(2, 'Bob', 'Engineer', 'IT', 80000),
(3, 'Carol', 'Manager', 'HR', 90000),
(4, 'David', 'Manager', 'Finance', 95000),
(5, 'Eve', 'Engineer', 'IT', 72000),
(6, 'Frank', 'Analyst', 'Finance', 60000),
(7, 'Grace', 'Analyst', 'Finance', 62000),
(8, 'Hannah', 'Engineer', 'IT', 71000),
(9, 'Ivy', 'Manager', 'HR', 88000),
(10, 'Jack', 'Engineer', 'IT', 73000)
]
6. Find the customers who have not placed an order in the last one year.
customer_data = [
(1, 'Alice'),
(2, 'Bob'),
(3, 'Carol'),
(4, 'David'),
(5, 'Eve')
]
orders_data = [
(1, 101, '2022-01-01'),
(2, 102, '2023-04-15'),
(3, 103, '2022-11-20'),
(4, 104, '2023-05-10'),
(5, 105, '2021-12-25')
]
markings = [('RN','Maths',87),('RN','English',89),('RN','Science',95),
('NR','Maths',27),('NR','English',90),('NR','Science',91)]
schema = ["name","subject","marks"]
8.Fill None values in the level column with the average level of that pollutant.
9.Fill None values in the date column with the previous available date.
data = [
(1, "Delhi", "PM2.5", 320, None),
(2, "Mumbai", "PM10", 180, "2024-03-01"),
(3, "Bangalore", "SO2", None, "2024-03-02"),
(4, "Kolkata", "NO2", 45, None),
(5, "Chennai", "CO", 10, "2024-03-03")
]
columns = ["id", "city", "pollutant", "level", "date"]
10.Find the top 2 routes with the highest total fare collected.
11.Rank routes by total fare using a window function.
data = [
("2024-03-01", "Delhi", "Mumbai", 2500),
("2024-03-01", "Delhi", "Bangalore", 4000),
("2024-03-02", "Mumbai", "Chennai", 3500),
("2024-03-02", "Delhi", "Mumbai", 2600),
("2024-03-03", "Kolkata", "Delhi", 2000)
]
columns = ["date", "source", "destination", "fare"]
12. For the employee dataset below:
    1. Display the schema and the first 3 rows.
    2. Filter employees earning more than ₹70,000.
    3. Use groupBy to get the average salary for each department.
    4. Filter employees whose names start with the letter 'A'.
    5. Use groupBy and count() to find the number of employees in each department.
    6. Add a new column Tax that deducts 10% from Salary.
    7. Display employees sorted in descending order of salary.
    8. Find the second highest salary without using `LIMIT` and `OFFSET`.
    9. Filter records where the department is either "HR" or "IT".
    10. Calculate the sum of all salaries.
data = [
(1, "Amit", "IT", 60000),
(2, "Priya", "HR", 55000),
(3, "Rahul", "Finance", 75000),
(4, "Sneha", "IT", 80000),
(5, "Karan", "HR", 65000)
]
columns = ["EmpID", "Name", "Department", "Salary"]