Pyspark Interview Questions
data = [
("East", "Jan", 200), ("East", "Feb", 300),
("East", "Mar", 250), ("West", "Jan", 400),
("West", "Feb", 350), ("West", "Mar", 450) ]
from pyspark.sql.functions import count, sum
df = spark.createDataFrame(data, ['region', 'month', 'sales'])
df = df.groupby('region') \
    .pivot('month') \
    .agg(count('sales'), sum('sales'))
df.show()
df = spark.createDataFrame([
    ('John', '123', '00015', '1'), ('John', '123', '00016', '2'),
    ('John', '345', '00205', '3'), ('John', '345', '00206', '4'),
    ('John', '789', '00283', '5'), ('John', '789', '00284', '6'),
    ('John', '789', '00285', '7')
], ['Customer', 'ID', 'unit', 'order'])
from pyspark.sql import functions as F, Window

# The original referenced undefined names (df1, sid, dr, N); one reading is that
# 'dr' is a per-(Customer, ID) sequence number and N the largest group size.
w = Window.partitionBy('Customer', 'ID').orderBy('order')
df1 = df.withColumn('dr', F.row_number().over(w))
N = df1.groupBy('Customer', 'ID').count().agg(F.max('count')).collect()[0][0]
df_new = df1.groupby('Customer', 'ID') \
    .pivot('dr', list(range(1, N + 1))) \
    .agg(
        F.first('unit').alias('unit'),
        F.first('order').alias('order')
    )
---------------------------
2. For the user-activity dataset below:
   1. Drop duplicate rows from the dataset.
   2. Handle any missing values appropriately.
   3. Determine the top 3 most frequent activity_type values for each user_id.
   4. Calculate the time spent by each user on each activity_type.
data = [
("U1", "2024-12-30 10:00:00", "LOGIN"),
("U1", "2024-12-30 10:05:00", "BROWSE"),
("U1", "2024-12-30 10:20:00", "LOGOUT"),
("U2", "2024-12-30 11:00:00", "LOGIN"),
("U2", "2024-12-30 11:15:00", "BROWSE"),
("U2", "2024-12-30 11:30:00", "LOGOUT"),
("U1", "2024-12-30 10:20:00", "LOGOUT"), # Duplicate entry
(None, "2024-12-30 12:00:00", "LOGIN"), # Missing user_id
("U3", None, "LOGOUT") # Missing timestamp
]
data = [
(1, 'Alice', 'Engineer', 'IT', 70000),
(2, 'Bob', 'Engineer', 'IT', 80000),
(3, 'Carol', 'Manager', 'HR', 90000),
(4, 'David', 'Manager', 'Finance', 95000),
(5, 'Eve', 'Engineer', 'IT', 72000),
(6, 'Frank', 'Analyst', 'Finance', 60000),
(7, 'Grace', 'Analyst', 'Finance', 62000),
(8, 'Hannah', 'Engineer', 'IT', 71000),
(9, 'Ivy', 'Manager', 'HR', 88000),
(10, 'Jack', 'Engineer', 'IT', 73000)
]
6. Find the customers who have not placed an order in the last one year.
customer_data = [
(1, 'Alice'),
(2, 'Bob'),
(3, 'Carol'),
(4, 'David'),
(5, 'Eve')
]
orders_data = [
(1, 101, '2022-01-01'),
(2, 102, '2023-04-15'),
(3, 103, '2022-11-20'),
(4, 104, '2023-05-10'),
(5, 105, '2021-12-25')
]
markings = [('RN','Maths',87),('RN','English',89),('RN','Science',95),
('NR','Maths',27),('NR','English',90),('NR','Science',91)]
schema = ["name","subject","marks"]
8.Fill None values in the level column with the average level of that pollutant.
9.Fill None values in the date column with the previous available date.
data = [
(1, "Delhi", "PM2.5", 320, None),
(2, "Mumbai", "PM10", 180, "2024-03-01"),
(3, "Bangalore", "SO2", None, "2024-03-02"),
(4, "Kolkata", "NO2", 45, None),
(5, "Chennai", "CO", 10, "2024-03-03")
]
columns = ["id", "city", "pollutant", "level", "date"]
10.Find the top 2 routes with the highest total fare collected.
11.Rank routes by total fare using a window function.
data = [
("2024-03-01", "Delhi", "Mumbai", 2500),
("2024-03-01", "Delhi", "Bangalore", 4000),
("2024-03-02", "Mumbai", "Chennai", 3500),
("2024-03-02", "Delhi", "Mumbai", 2600),
("2024-03-03", "Kolkata", "Delhi", 2000)
]
columns = ["date", "source", "destination", "fare"]
12. For the employee dataset below:
    1. Display the schema and the first 3 rows.
    2. Filter employees earning more than ₹70,000.
    3. Use groupBy to get the average salary for each department.
    4. Filter employees whose names start with the letter 'A'.
    5. Use groupBy and count() to find the number of employees in each department.
    6. Add a new column Tax that deducts 10% from Salary.
    7. Display employees sorted in descending order of salary.
    8. Find the second highest salary without using `LIMIT` and `OFFSET`.
    9. Filter records where the department is either "HR" or "IT".
    10. Calculate the sum of all salaries.
data = [
(1, "Amit", "IT", 60000),
(2, "Priya", "HR", 55000),
(3, "Rahul", "Finance", 75000),
(4, "Sneha", "IT", 80000),
(5, "Karan", "HR", 65000)
]
columns = ["EmpID", "Name", "Department", "Salary"]