Window Functions


Data Engineering

Data Transformation

WINDOW
FUNCTIONS

DHANESH SARPALE
BASIC EXAMPLE

In the above example, the average salary has
been calculated by aggregating the salaries
based on the job titles of the employees.
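
The example this slide refers to is not reproduced in the text, so the following is only a minimal sketch of the same idea in pandas. The `employees` DataFrame, its column names, and its values are made up for illustration and are not the author's data.

```python
import pandas as pd

# Hypothetical employee data; names, titles, and salaries are illustrative only.
employees = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dara"],
    "job_title": ["Analyst", "Analyst", "Engineer", "Engineer"],
    "salary": [50000, 60000, 80000, 100000],
})

# Window-style average: every row is kept, and each row receives the average
# salary of its own job title, similar to
# AVG(salary) OVER (PARTITION BY job_title) in SQL.
employees["avg_salary_by_title"] = (
    employees.groupby("job_title")["salary"].transform("mean")
)

print(employees)
```

Unlike a plain GROUP BY, the result still has one row per employee, which is what distinguishes a window calculation from an ordinary aggregation.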

GENERIC SYNTAX
MySQL

SELECT <column_list>,
<aggregate_function>(<column_expression>) OVER
(
PARTITION BY <partition_expression>
ORDER BY <order_expression>
ROWS <window_frame>
) AS <alias>
FROM <table_name>;

Python

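import pandas as pd

# Group by the partition column(s), select the column to aggregate,
# and broadcast the aggregate back onto every row with transform()
df['<alias>'] = df.groupby('<partition_expression>')['<column_expression>'] \
    .transform('<aggregate_function>')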

LIST OF SQL WINDOW FUNCTIONS

DATA ENGINEERING COMMON
OPERATIONS WITH WINDOW
FUNCTIONS
Window functions support aggregating, transforming, and analyzing data
within precise partitions or windows. Common operations:

1. Data Aggregation
2. Data Cleansing
3. Data Enrichment
4. Data Partitioning
5. Data Ordering

1. DATA AGGREGATION

To perform aggregations over subsets of
data within a given window.
To calculate aggregated values such as
cumulative sums, averages, counts, or
percentages.
To perform these aggregations efficiently
and flexibly, so that data can be
aggregated at different levels of
granularity.
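
The aggregates listed above can be sketched in pandas as follows. This is an illustrative example only: the `sales` DataFrame and the new column names are assumptions, not part of the original deck.

```python
import pandas as pd

# Hypothetical sales rows; the values are illustrative only.
sales = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B"],
    "sales": [10, 30, 60, 20, 80],
})

grouped = sales.groupby("category")["sales"]

# Cumulative sum within each category, in the current row order.
sales["category_cumulative_sales"] = grouped.cumsum()

# Number of rows in each category, repeated on every row of that category.
sales["category_row_count"] = grouped.transform("count")

# Each row's share of its category total, as a percentage.
sales["pct_of_category"] = 100 * sales["sales"] / grouped.transform("sum")

print(sales)
```

Each new column sits alongside the original rows, so the same table can carry row-level and category-level figures at once.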

1. DATA AGGREGATION
MySQL

-- No GROUP BY is needed: window functions keep every row of sales_data
SELECT product_id, category, sales,
    SUM(sales) OVER (PARTITION BY category) AS category_total_sales,
    AVG(sales) OVER (PARTITION BY category) AS category_avg_sales,
    SUM(sales) OVER () AS overall_total_sales,
    AVG(sales) OVER () AS overall_avg_sales
FROM sales_data;

Python
import pandas as pd

# Assume the data is already loaded into a pandas DataFrame called 'df'

# Calculate the total and average sales per category, plus the overall
# total and average, and attach them to every row
df['category_total_sales'] = df.groupby('category')['sales'].transform('sum')
df['category_avg_sales'] = df.groupby('category')['sales'].transform('mean')
df['overall_total_sales'] = df['sales'].sum()
df['overall_avg_sales'] = df['sales'].mean()

# Display the DataFrame
print(df)

2. DATA CLEANSING

To assist in data cleansing tasks by
identifying and handling duplicates,
missing values, or outliers within specific
windows.
To rank rows based on certain criteria and
identify duplicate records.
To calculate statistical measures within
windows to identify outliers that need to be
handled or removed during the ETL process.
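
As a rough illustration of the duplicate and outlier handling described above, the sketch below numbers rows within each window to find repeats, then flags outliers per group. The `readings` DataFrame is hypothetical, and the interquartile-range rule is just one possible statistical measure chosen for this example, not a method the deck prescribes.

```python
import pandas as pd

# Hypothetical sensor readings; column names and values are illustrative only.
readings = pd.DataFrame({
    "sensor": ["s1", "s1", "s1", "s1", "s1", "s2", "s2", "s2"],
    "reading_time": [1, 2, 2, 3, 4, 1, 2, 3],
    "value": [10.0, 11.0, 11.0, 9.0, 50.0, 5.0, 6.0, 5.5],
})

# Number the rows inside each (sensor, reading_time) window, similar to
# ROW_NUMBER() OVER (PARTITION BY sensor, reading_time).
# Any row numbered above 1 is a repeat of an earlier row.
readings["row_num"] = readings.groupby(["sensor", "reading_time"]).cumcount() + 1
deduplicated = readings[readings["row_num"] == 1].copy()

# Compute quartiles per sensor and broadcast them back onto each row.
per_sensor = deduplicated.groupby("sensor")["value"]
q1 = per_sensor.transform(lambda s: s.quantile(0.25))
q3 = per_sensor.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Flag values more than 1.5 * IQR outside the quartiles of their own sensor.
deduplicated["is_outlier"] = (
    (deduplicated["value"] < q1 - 1.5 * iqr)
    | (deduplicated["value"] > q3 + 1.5 * iqr)
)

print(deduplicated)
```

Whether flagged rows are dropped, capped, or only reported depends on the pipeline; the sketch stops at flagging.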

2. DATA CLEANSING
MySQL

-- RANK is a reserved word in MySQL 8.0, so the alias is quoted with backticks
SELECT name, score,
    RANK() OVER (ORDER BY score DESC) AS `rank`
FROM students;

Python

import pandas as pd

# Assume the data is already loaded into a pandas DataFrame called 'df'

# Assign ranks to students based on their exam scores
df['rank'] = df['score'].rank(ascending=False, method='min')

# Display the DataFrame
print(df)

3. DATA ENRICHMENT

Window functions provide the ability to
enrich data by computing values based on a
subset of related records within a window.
They can be used to derive new information or
generate additional features for your dataset.
For instance, moving averages, running totals,
or cumulative sums calculated within a
window provide insights into trends or
patterns in the data.

3. DATA ENRICHMENT
MySQL
SELECT product_id, sales,
    AVG(sales) OVER (ORDER BY date_column
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_average,
    SUM(sales) OVER (ORDER BY date_column) AS running_total,
    SUM(sales) OVER (ORDER BY date_column) AS cumulative_sum
FROM sales_data;

Python
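# Assume the data is already loaded into a pandas DataFrame called 'df'

# Sort by date so the row order matches the SQL ORDER BY date_column
df = df.sort_values('date_column')

# Calculate the 3-row moving average of sales (current row and 2 preceding)
df['moving_average'] = df['sales'].rolling(window=3, min_periods=1).mean()

# Calculate the running total (cumulative sum) of sales
df['running_total'] = df['sales'].cumsum()
df['cumulative_sum'] = df['sales'].cumsum()

# Display the DataFrame
print(df)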
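
The snippet above applies the rolling window to the whole `sales` column. If the figures are wanted per product instead, one option is to group by `product_id` first. The following is a minimal sketch under that assumption; the data is hypothetical, and in SQL the corresponding change would be adding `PARTITION BY product_id` inside each `OVER (...)` clause.

```python
import pandas as pd

# Hypothetical daily sales for two products; the values are illustrative only.
sales = pd.DataFrame({
    "product_id": ["p1", "p2", "p1", "p2", "p1", "p1"],
    "date_column": pd.to_datetime([
        "2024-01-01", "2024-01-01", "2024-01-02",
        "2024-01-02", "2024-01-03", "2024-01-04",
    ]),
    "sales": [10, 100, 20, 300, 30, 40],
})

# Order rows by product and date so each window runs in time order.
sales = sales.sort_values(["product_id", "date_column"]).reset_index(drop=True)

by_product = sales.groupby("product_id")["sales"]

# 3-row moving average computed separately inside each product.
sales["moving_average"] = by_product.transform(
    lambda s: s.rolling(window=3, min_periods=1).mean()
)

# Running total that restarts for each product.
sales["running_total"] = by_product.cumsum()

print(sales)
```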

4. DATA PARTITIONING

Window functions make it possible to partition
data into logical groups based on one or more
columns.
This is particularly helpful during the
transformation phase of ETL when
performing calculations or aggregations
separately for different partitions.
For example, data can be partitioned by region,
time period, or any other relevant attribute, and
window functions applied within each partition
to obtain partition-specific results.
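
Partitioning by one or several columns can be sketched in pandas with `groupby`. The `orders` DataFrame, its columns, and its values below are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical order amounts by region and quarter; values are illustrative only.
orders = pd.DataFrame({
    "region": ["East", "East", "East", "West", "West"],
    "quarter": ["Q1", "Q1", "Q2", "Q1", "Q1"],
    "amount": [100, 300, 200, 50, 150],
})

# Partition by a single column: total amount per region on every row.
orders["region_total"] = orders.groupby("region")["amount"].transform("sum")

# Partition by two columns: largest amount per region and quarter.
orders["region_quarter_max"] = (
    orders.groupby(["region", "quarter"])["amount"].transform("max")
)

# Rank rows inside each region, highest amount first, similar to
# RANK() OVER (PARTITION BY region ORDER BY amount DESC).
orders["rank_in_region"] = (
    orders.groupby("region")["amount"]
    .rank(method="min", ascending=False)
    .astype(int)
)

print(orders)
```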

5. DATA ORDERING

Window functions provide the ability to order
data within each partition based on specified
criteria.
This is useful when performing calculations
or aggregations in a specific order.
For example, to order time series data by
timestamp and use window functions to
calculate moving averages or detect trends
over a specified window size.
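
The time series case can be sketched as below. The `metrics` DataFrame is hypothetical, the window size of 2 is arbitrary, and treating a positive row-to-row change as "rising" is only one simple way to look at a trend.

```python
import pandas as pd

# Hypothetical measurements that arrive out of order; values are illustrative only.
metrics = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-03", "2024-01-01", "2024-01-04", "2024-01-02"]
    ),
    "value": [30.0, 10.0, 25.0, 20.0],
})

# Order by timestamp first, since rolling windows depend on row order.
ordered = metrics.sort_values("timestamp").reset_index(drop=True)

# Moving average over the current and the previous row.
ordered["moving_average"] = ordered["value"].rolling(window=2, min_periods=1).mean()

# Change from the previous row; the first row has no predecessor, so it is NaN.
ordered["change"] = ordered["value"].diff()

# A simple trend flag: True where the value increased from the previous row.
ordered["is_rising"] = ordered["change"] > 0

print(ordered)
```

Skipping the sort would still run, but the windows would follow arrival order rather than time order, so the results would not describe the series.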

Thank you for taking the time to read
this document! If you found it valuable,
I would greatly appreciate it if you
could show your support by liking and
sharing it with your network. I am
eager to connect with you on LinkedIn.
Let's connect and collaborate to foster
growth together!
