Day 60
Interview Question
Ganesh. R
Problem Statement:
We have two datasets: one for employees and one for
departments. Your task is to calculate the difference
between the maximum salaries in the 'engineering'
and 'marketing' departments.
Input Table Data
# Employee Data
employee_data = [
    (1, "John", "Doe", 60000, 1),
    (2, "Jane", "Smith", 55000, 2),
    (3, "Emily", "Johnson", 70000, 1),
    (4, "Michael", "Brown", 80000, 3),
    (5, "Chris", "Davis", 45000, 4),
    (6, "Anna", "Wilson", 52000, 5),
]
# Schema design (salary is IntegerType to match the integer values above)
employee_schema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("first_name", StringType(), True),
        StructField("last_name", StringType(), True),
        StructField("salary", IntegerType(), True),
        StructField("department_id", IntegerType(), True),
    ]
)
# Department Data
department_data = [
    (1, "engineering"),
    (2, "human resource"),
    (3, "operation"),
    (4, "marketing"),
    (5, "sales"),
    (6, "customer care"),
]
# Schema design
department_schema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("department", StringType(), True),
    ]
)
Output Table
(max(Max_Salary) - min(Max_Salary))
25000
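As a quick sanity check on the 25000 shown above, the same result can be worked out in plain Python without Spark. This is only a sketch: the names employees, departments and max_by_dept are illustrative and not part of the original post, and the department list is the full one given in the post's department data.

```python
# Employee rows: (id, first_name, last_name, salary, department_id)
employees = [
    (1, "John", "Doe", 60000, 1),
    (2, "Jane", "Smith", 55000, 2),
    (3, "Emily", "Johnson", 70000, 1),
    (4, "Michael", "Brown", 80000, 3),
    (5, "Chris", "Davis", 45000, 4),
    (6, "Anna", "Wilson", 52000, 5),
]

# Department lookup: id -> department name
departments = {
    1: "engineering",
    2: "human resource",
    3: "operation",
    4: "marketing",
    5: "sales",
    6: "customer care",
}

# Highest salary per department, restricted to the two departments of interest
max_by_dept = {}
for _id, _first, _last, salary, dept_id in employees:
    name = departments[dept_id]
    if name in ("engineering", "marketing"):
        max_by_dept[name] = max(salary, max_by_dept.get(name, salary))

# Absolute difference between the two maxima
diff = abs(max_by_dept["engineering"] - max_by_dept["marketing"])
print(max_by_dept, diff)
```

Engineering has two employees (60000 and 70000), so its maximum is 70000; marketing has one (45000). The difference is 25000.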
Employee Dataset:
id: Unique identifier for each employee.
first_name: Employee's first name.
last_name: Employee's last name.
salary: Employee's salary.
department_id: Foreign key to the department the employee belongs to.
Department Dataset:
id: Unique identifier for each department.
department: Name of the department.
Write a query that calculates the difference between the highest salaries found in the marketing and engineering departments. Output just the absolute difference in salaries.
# Imports for the schema types. `spark` is assumed to be the SparkSession
# that the notebook environment already provides.
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Employee Data
employee_data = [
    (1, "John", "Doe", 60000, 1),
    (2, "Jane", "Smith", 55000, 2),
    (3, "Emily", "Johnson", 70000, 1),
    (4, "Michael", "Brown", 80000, 3),
    (5, "Chris", "Davis", 45000, 4),
    (6, "Anna", "Wilson", 52000, 5),
]
employee_schema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("first_name", StringType(), True),
        StructField("last_name", StringType(), True),
        StructField("salary", IntegerType(), True),
        StructField("department_id", IntegerType(), True),
    ]
)
employee_df = spark.createDataFrame(employee_data, schema=employee_schema)
# Department Data
department_data = [
    (1, "engineering"),
    (2, "human resource"),
    (3, "operation"),
    (4, "marketing"),
    (5, "sales"),
    (6, "customer care"),
]
department_schema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("department", StringType(), True),
    ]
)
department_df = spark.createDataFrame(department_data, schema=department_schema)
# Register both DataFrames as temp views so the SQL query can reference them
employee_df.createOrReplaceTempView("employee")
department_df.createOrReplaceTempView("dept")
%sql
WITH CTE AS (
    SELECT
        D.department,
        MAX(E.salary) AS Max_Salary
    FROM
        employee E
        JOIN dept D ON E.department_id = D.id
    WHERE
        D.department IN ('engineering', 'marketing')
    GROUP BY
        D.department
)
SELECT
    MAX(Max_Salary) - MIN(Max_Salary) AS Diff
FROM
    CTE
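Expected Output:
A single value, 25000: the highest salary in 'engineering' (70000) minus the highest salary in 'marketing' (45000).
Breakdown:
max_salary_df: Joins employees to departments, keeps only 'engineering' and 'marketing', groups the data by department, and computes the maximum salary for each. This corresponds to the CTE in the query.
salary_diff_df: Aggregates that two-row result to find the difference between the max and min of the maximum salaries. Because max minus min is never negative, this is the absolute difference. This corresponds to the final SELECT.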