Day 60

The document outlines a problem statement requiring the calculation of the salary difference between the highest salaries in the 'engineering' and 'marketing' departments using two datasets: employees and departments. It includes sample data, schema designs, and a SQL query to achieve the desired output. The expected result is the absolute difference in maximum salaries from the specified departments.

Uploaded by

Lapi Lapil
Scenario Based Interview Question

Ganesh. R
Problem Statement

We have two datasets: one for employees and one for departments. Your task is to calculate the difference between the maximum salaries in the 'engineering' and 'marketing' departments.

Input Table Data

# Employee Data
employee_data = [
    (1, "John", "Doe", 60000, 1),
    (2, "Jane", "Smith", 55000, 2),
    (3, "Emily", "Johnson", 70000, 1),
    (4, "Michael", "Brown", 80000, 3),
    (5, "Chris", "Davis", 45000, 4),
    (6, "Anna", "Wilson", 52000, 5),
]

# Schema design
# salary is IntegerType because the sample salaries are Python ints;
# Spark's schema verification rejects int values for a FloatType field.
employee_schema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("first_name", StringType(), True),
        StructField("last_name", StringType(), True),
        StructField("salary", IntegerType(), True),
        StructField("department_id", IntegerType(), True),
    ]
)

Input Table Data

# Department Data
department_data = [
    (1, "engineering"),
    (2, "human resource"),
    (3, "operation"),
    (4, "marketing"),
    (5, "sales"),
    (6, "customer care"),
]

# Schema design
department_schema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("department", StringType(), True),
    ]
)

Output Table

Diff
25000

Employee Dataset:

id: Unique identifier for each employee.
first_name: Employee's first name.
last_name: Employee's last name.
salary: Employee's salary.
department_id: Foreign key to the department the employee belongs to.

Department Dataset:

id: Unique identifier for each department.
department: Name of the department.

Write a query that calculates the difference between the highest salaries found in the marketing and engineering departments. Output just the absolute difference in salaries.

from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# `spark` is assumed to be the SparkSession that the notebook environment provides.

# Employee Data
employee_data = [
    (1, "John", "Doe", 60000, 1),
    (2, "Jane", "Smith", 55000, 2),
    (3, "Emily", "Johnson", 70000, 1),
    (4, "Michael", "Brown", 80000, 3),
    (5, "Chris", "Davis", 45000, 4),
    (6, "Anna", "Wilson", 52000, 5),
]

employee_schema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("first_name", StringType(), True),
        StructField("last_name", StringType(), True),
        StructField("salary", IntegerType(), True),
        StructField("department_id", IntegerType(), True),
    ]
)

employee_df = spark.createDataFrame(employee_data, schema=employee_schema)

# Department Data
department_data = [
    (1, "engineering"),
    (2, "human resource"),
    (3, "operation"),
    (4, "marketing"),
    (5, "sales"),
    (6, "customer care"),
]

department_schema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("department", StringType(), True),
    ]
)

department_df = spark.createDataFrame(department_data, schema=department_schema)

# Show both DataFrames
# display() is a Databricks notebook helper; use .show() in plain PySpark.
employee_df.display()
department_df.display()

# Register temporary views so the DataFrames can be queried with SQL
employee_df.createOrReplaceTempView("employee")
department_df.createOrReplaceTempView("dept")

%sql
WITH CTE AS (
    SELECT
        D.department,
        MAX(E.salary) AS Max_Salary
    FROM
        employee E
        JOIN dept D ON E.department_id = D.id
    WHERE
        D.department IN ('engineering', 'marketing')
    GROUP BY
        D.department
)
SELECT
    MAX(Max_Salary) - MIN(Max_Salary) AS Diff
FROM
    CTE
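
The query above can be checked without a Spark cluster. The sketch below is only a cross-check, not part of the original notebook: it loads the same sample rows into an in-memory SQLite database (Python's standard sqlite3 module, used here as a stand-in for Spark SQL) and runs an equivalent CTE. The table names employee and dept follow the temporary views registered above; the variable names are illustrative.

```python
import sqlite3

# Sample rows copied from the input tables in this document.
employees = [
    (1, "John", "Doe", 60000, 1),
    (2, "Jane", "Smith", 55000, 2),
    (3, "Emily", "Johnson", 70000, 1),
    (4, "Michael", "Brown", 80000, 3),
    (5, "Chris", "Davis", 45000, 4),
    (6, "Anna", "Wilson", 52000, 5),
]
departments = [
    (1, "engineering"),
    (2, "human resource"),
    (3, "operation"),
    (4, "marketing"),
    (5, "sales"),
    (6, "customer care"),
]

# Assumption: SQLite stands in for Spark SQL; both accept this CTE syntax.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER, first_name TEXT, last_name TEXT, "
    "salary INTEGER, department_id INTEGER)"
)
conn.execute("CREATE TABLE dept (id INTEGER, department TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?)", employees)
conn.executemany("INSERT INTO dept VALUES (?, ?)", departments)

query = """
WITH cte AS (
    SELECT d.department, MAX(e.salary) AS max_salary
    FROM employee e
    JOIN dept d ON e.department_id = d.id
    WHERE d.department IN ('engineering', 'marketing')
    GROUP BY d.department
)
SELECT MAX(max_salary) - MIN(max_salary) AS diff
FROM cte
"""
diff = conn.execute(query).fetchone()[0]
conn.close()

print(diff)  # 25000 (70000 for engineering minus 45000 for marketing)
```
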

from pyspark.sql import functions as F

# Step 1: Join Employee and Department DataFrames
joined_df = employee_df.join(
    department_df, employee_df.department_id == department_df.id
)

# Step 2: Filter by departments 'engineering' and 'marketing'
filtered_df = joined_df.filter(
    department_df.department.isin("engineering", "marketing")
)

# Step 3: Group by department and calculate the maximum salary for each department
max_salary_df = filtered_df.groupBy(department_df.department).agg(
    F.max("salary").alias("Max_Salary")
)

# Step 4: Find the difference between max and min salaries from the grouped result.
# alias() is applied to the column expression so the output column is named Diff.
salary_diff_df = max_salary_df.agg(
    (F.max("Max_Salary") - F.min("Max_Salary")).alias("Diff")
)

# Display the result
salary_diff_df.display()

Expected Output:

The absolute difference between the highest salary in 'engineering' (70000) and the highest salary in 'marketing' (45000), which is 25000 for the sample data.

Breakdown:

joined_df: Joins the employee and department DataFrames on employee_df.department_id == department_df.id.

filtered_df: Filters the departments to only include 'engineering' and 'marketing'.

max_salary_df: Groups the data by department and computes the maximum salary for each.

salary_diff_df: Aggregates to find the difference between the max and min of the maximum
salaries.
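
The four steps in the breakdown can also be mirrored in plain Python, which makes one edge case visible: if one of the two departments has no employees, the grouped result has a single row and max minus min is 0, which looks like a valid answer. The sketch below is an illustration under that assumption, not the notebook's method; the function name max_salary_gap is hypothetical, and returning None for a missing department is a design choice made here.

```python
# Sample rows copied from the input tables in this document.
employees = [
    (1, "John", "Doe", 60000, 1),
    (2, "Jane", "Smith", 55000, 2),
    (3, "Emily", "Johnson", 70000, 1),
    (4, "Michael", "Brown", 80000, 3),
    (5, "Chris", "Davis", 45000, 4),
    (6, "Anna", "Wilson", 52000, 5),
]
departments = {
    1: "engineering",
    2: "human resource",
    3: "operation",
    4: "marketing",
    5: "sales",
    6: "customer care",
}


def max_salary_gap(employees, departments, dept_a, dept_b):
    """Hypothetical helper: absolute gap between the top salaries of two
    departments, or None when either department has no employees."""
    max_by_dept = {}
    for _id, _first, _last, salary, dept_id in employees:
        name = departments.get(dept_id)      # join: look up the department name
        if name in (dept_a, dept_b):         # filter: keep the two departments
            current = max_by_dept.get(name, salary)
            max_by_dept[name] = max(salary, current)  # group by + max
    if dept_a not in max_by_dept or dept_b not in max_by_dept:
        return None                          # avoid reporting a misleading 0
    return abs(max_by_dept[dept_a] - max_by_dept[dept_b])


gap = max_salary_gap(employees, departments, "engineering", "marketing")
print(gap)  # 25000
```

In the sample data 'customer care' has no employees, so max_salary_gap(employees, departments, "engineering", "customer care") returns None instead of 0.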