Pyspark Interview Questions

The document outlines various data analysis tasks using Spark DataFrame operations, including calculating cumulative sales, ranking monthly sales, handling duplicates and missing values, and performing aggregations on employee salaries. It also includes tasks related to customer order analysis, route fare ranking, and data cleaning for pollutants. Each task is accompanied by sample data and expected operations to achieve the desired results.


1.

Imagine you're analyzing the monthly sales performance of a company across different regions. You want to calculate:
1. The cumulative sales for each region over months.
2. The rank of each month based on sales within the same region.

data = [
("East", "Jan", 200), ("East", "Feb", 300),
("East", "Mar", 250), ("West", "Jan", 400),
("West", "Feb", 350), ("West", "Mar", 450) ]

columns = ["Region", "Month", "Sales"]

from pyspark.sql import functions as F

df = spark.createDataFrame([
    ("East", "Jan", 200), ("East", "Feb", 300),
    ("East", "Mar", 250), ("West", "Jan", 400),
    ("West", "Feb", 350), ("West", "Mar", 450)], ['region', 'month', 'sales'])

# Pivot the months into columns, one total-sales column per month (Jan, Feb, Mar)
pivoted = df.groupby('region') \
    .pivot('month') \
    .agg(F.sum('sales'))

pivoted.show()

from pyspark.sql import functions as func

# Build a map of month name -> pivoted sales value, then explode it
# back into (region, month, sales) rows
pivoted = pivoted.withColumn('mapCol',
    func.create_map(func.lit('Jan'), pivoted.Jan,
                    func.lit('Feb'), pivoted.Feb,
                    func.lit('Mar'), pivoted.Mar))

# explode emits one row per map entry (key, value)
res = pivoted.select('*', func.explode(pivoted.mapCol).alias('month', 'sales'))
res.show()
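
The pivot/unpivot above only reshapes the data. The two results the question actually asks for, the running total and the per-region rank, come from window functions. A minimal sketch, mapping the month names to numbers for ordering (an assumption, since alphabetical ordering of "Jan"/"Feb"/"Mar" would be wrong):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Give each month a sortable index (only Jan-Mar appear in the sample data)
ordered = df.withColumn('month_num',
    F.when(F.col('month') == 'Jan', 1)
     .when(F.col('month') == 'Feb', 2)
     .otherwise(3))

w_cum = Window.partitionBy('region').orderBy('month_num') \
              .rowsBetween(Window.unboundedPreceding, Window.currentRow)
w_rank = Window.partitionBy('region').orderBy(F.desc('sales'))

result = ordered \
    .withColumn('cumulative_sales', F.sum('sales').over(w_cum)) \
    .withColumn('sales_rank', F.rank().over(w_rank))

result.show()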

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Pivot each customer's orders into numbered (ID, unit, order) column sets
df = spark.createDataFrame([
    ('John', '123', '00015', '1'), ('John', '123', '00016', '2'),
    ('John', '345', '00205', '3'), ('John', '345', '00206', '4'),
    ('John', '789', '00283', '5'), ('John', '789', '00284', '6'),
    ('John', '789', '00285', '7')
], ['Customer', 'ID', 'unit', 'order'])

# Number each customer's rows so they can be spread across columns
w = Window.partitionBy('Customer').orderBy('order')
df1 = df.withColumn('dr', F.row_number().over(w))

# N = the largest number of rows any single customer has
N = df1.groupBy('Customer').count().agg(F.max('count')).first()[0]

df_new = df1.groupby('Customer') \
    .pivot('dr', list(range(1, N + 1))) \
    .agg(
        F.first('ID').alias('ID'),
        F.first('unit').alias('unit'),
        F.first('order').alias('order')
    )
df_new.show()
---------------------------
2.
1. Drop duplicates from the dataset.
2. Handle any missing values appropriately.
3. Determine the top 3 most frequent activity_type values for each user_id.
4. Calculate the time spent by each user on each activity_type.

data = [
("U1", "2024-12-30 10:00:00", "LOGIN"),
("U1", "2024-12-30 10:05:00", "BROWSE"),
("U1", "2024-12-30 10:20:00", "LOGOUT"),
("U2", "2024-12-30 11:00:00", "LOGIN"),
("U2", "2024-12-30 11:15:00", "BROWSE"),
("U2", "2024-12-30 11:30:00", "LOGOUT"),
("U1", "2024-12-30 10:20:00", "LOGOUT"), # Duplicate entry
(None, "2024-12-30 12:00:00", "LOGIN"), # Missing user_id
("U3", None, "LOGOUT") # Missing timestamp
]
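
A minimal sketch for the four steps, assuming the columns are user_id, timestamp, and activity_type, and treating "time spent" as the gap until the user's next event:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(data, ["user_id", "timestamp", "activity_type"]) \
          .withColumn("timestamp", F.to_timestamp("timestamp"))

# 1-2: drop exact duplicates and rows missing user_id or timestamp
clean = df.dropDuplicates().dropna(subset=["user_id", "timestamp"])

# 3: top 3 most frequent activity_type per user_id
freq = clean.groupBy("user_id", "activity_type").count()
top3 = freq.withColumn(
    "rnk", F.row_number().over(Window.partitionBy("user_id").orderBy(F.desc("count")))
).filter("rnk <= 3")

# 4: time spent = seconds until the user's next event (last event gets null)
w = Window.partitionBy("user_id").orderBy("timestamp")
spent = clean.withColumn("next_ts", F.lead("timestamp").over(w)) \
             .withColumn("seconds_spent",
                         F.unix_timestamp("next_ts") - F.unix_timestamp("timestamp"))
time_per_activity = spent.groupBy("user_id", "activity_type") \
                         .agg(F.sum("seconds_spent").alias("total_seconds"))
time_per_activity.show()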

3. Calculate the average salary of each role.

4. Find the second largest salary in the HR department.

data = [
(1, 'Alice', 'Engineer', 'IT', 70000),
(2, 'Bob', 'Engineer', 'IT', 80000),
(3, 'Carol', 'Manager', 'HR', 90000),
(4, 'David', 'Manager', 'Finance', 95000),
(5, 'Eve', 'Engineer', 'IT', 72000),
(6, 'Frank', 'Analyst', 'Finance', 60000),
(7, 'Grace', 'Analyst', 'Finance', 62000),
(8, 'Hannah', 'Engineer', 'IT', 71000),
(9, 'Ivy', 'Manager', 'HR', 88000),
(10, 'Jack', 'Engineer', 'IT', 73000)
]

columns = ["id", "name", "role", "dept", "salary"]

5. Write a query to get the top five sales.


data = [
(1, 10, 100.0),
(2, 5, 200.0),
(3, 7, 150.0),
(4, 20, 50.0),
(5, 1, 500.0),
(6, 15, 120.0),
(7, 2, 300.0),
(8, 3, 250.0),
(9, 8, 100.0),
(10, 4, 400.0)
]

columns = ["id", "quantity", "price"]

6. Find the customers who have not ordered in the last one year.
customer_data = [
(1, 'Alice'),
(2, 'Bob'),
(3, 'Carol'),
(4, 'David'),
(5, 'Eve')
]

orders_data = [
(1, 101, '2022-01-01'),
(2, 102, '2023-04-15'),
(3, 103, '2022-11-20'),
(4, 104, '2023-05-10'),
(5, 105, '2021-12-25')
]

customers_columns = ["id", "name"]


orders_columns = ["id", "orderId", "orderDate"]

7. Write a query to get the output with the following columns:

"Name", "Maths", "English", "Science", "highest_marks"

markings = [('RN','Maths',87),('RN','English',89),('RN','Science',95),
('NR','Maths',27),('NR','English',90),('NR','Science',91)]
schema = ["name","subject","marks"]

8. Fill None values in the level column with the average level of that pollutant.
9. Fill None values in the date column with the previous available date.

data = [
(1, "Delhi", "PM2.5", 320, None),
(2, "Mumbai", "PM10", 180, "2024-03-01"),
(3, "Bangalore", "SO2", None, "2024-03-02"),
(4, "Kolkata", "NO2", 45, None),
(5, "Chennai", "CO", 10, "2024-03-03")
]
columns = ["id", "city", "pollutant", "level", "date"]

10. Find the top 2 routes with the highest total fare collected.
11. Rank routes by total fare using a window function.

data = [
("2024-03-01", "Delhi", "Mumbai", 2500),
("2024-03-01", "Delhi", "Bangalore", 4000),
("2024-03-02", "Mumbai", "Chennai", 3500),
("2024-03-02", "Delhi", "Mumbai", 2600),
("2024-03-03", "Kolkata", "Delhi", 2000)
]
columns = ["date", "source", "destination", "fare"]

12.
1. Display the schema and first 3 rows.
2. Write a query to filter employees earning more than ₹70,000.
3. Use groupBy to get the average salary for each department.
4. Filter employees whose names start with the letter 'A'.
5. Use groupBy and count() to find the number of employees in each department.
6. Add a new column Tax that deducts 10% from Salary.
7. Display employees sorted in descending order of salary.
8. Find the second highest salary without using LIMIT and OFFSET.
9. Filter records where the department is either "HR" or "IT".
10. Calculate the sum of all salaries.

data = [
(1, "Amit", "IT", 60000),
(2, "Priya", "HR", 55000),
(3, "Rahul", "Finance", 75000),
(4, "Sneha", "IT", 80000),
(5, "Karan", "HR", 65000)
]
columns = ["EmpID", "Name", "Department", "Salary"]
