Unit 5 - Time Series Analysis and Predictive Modeling
By
Dr. G. Sunitha
Professor
Department of AI & ML
School of Computing
• A time series is a sequence of data points that occur in
successive order over some period of time.
mj-clean Dataset
S.No.  Attribute  Description
1      city       A string representing the name of the city where the transaction occurred.
2      state      Two-letter state abbreviation indicating the U.S. state in which the transaction took place.
3      price      Price paid in dollars
Importing and Cleaning
# importing pandas
import pandas as pd
• The events in this dataset are not equally spaced in time; the number of
transactions reported each day varies from 0 to several hundred. Many methods
used to analyze time series require the measurements to be equally spaced.
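The uneven spacing can be seen by counting transactions per calendar day. The following is a minimal sketch on hypothetical data (the rows are made up and are not taken from mj-clean); it assumes the 'date' column is already parsed as datetimes.

```python
import pandas as pd

# Hypothetical transactions with irregular dates (not from mj-clean)
transactions = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-01', '2014-01-01', '2014-01-02',
                            '2014-01-04', '2014-01-04', '2014-01-04']),
    'ppg': [10.0, 12.0, 11.0, 9.0, 10.0, 11.0],
})

# Count transactions on each calendar day, including days with none
per_day = transactions.set_index('date').resample('D').size()
print(per_day.tolist())   # [2, 1, 0, 3]
```

The zero for 2014-01-03 is the kind of gap that equally spaced methods have to account for.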
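import numpy as np

def GroupByDay(transactions):
    # Average price per gram (ppg) over all transactions on the same date
    grouped = transactions[['date', 'ppg']].groupby('date')
    daily = grouped.aggregate(np.mean)
    # Keep the date as an ordinary column as well as the index
    daily['date'] = daily.index
    # First date, selected by position (daily.date[0] in older pandas)
    start = daily.date.iloc[0]
    # One year as a fixed 365-day span; recent pandas rejects the
    # np.timedelta64(1, 'Y') unit in the division below
    one_year = np.timedelta64(365, 'D')
    daily['years'] = (daily.date - start) / one_year
    return daily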
• transactions[['date', 'ppg']].groupby('date'): This line selects the 'date' and
'ppg' columns from the 'transactions' DataFrame and groups the data by the 'date'
column. It creates a grouped object.
• grouped.aggregate(np.mean): This line applies the mean function to the 'ppg'
values in each group of transactions on the same date. The result is a DataFrame
with one row of aggregated values per date.
• daily['date'] = daily.index: This line creates a new column 'date' in the 'daily'
DataFrame and sets its values to be the index of the 'daily' DataFrame, which
represents the dates.
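• start = daily.date[0]: This line assigns the first date in the 'daily' DataFrame
to the variable 'start'. Recent pandas versions prefer the explicitly positional
daily.date.iloc[0] when the index holds dates.
• one_year = np.timedelta64(1, 'Y'): This line creates a NumPy timedelta
object representing one year. Recent pandas versions may reject the 'Y' unit in
the division that follows; a fixed span such as np.timedelta64(365, 'D') is a
common substitute.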
• daily['years'] = (daily.date - start) / one_year: This line calculates the
number of years from the 'start' date for each date in the 'daily' DataFrame and
adds a new column 'years' with these values.
• return daily: The function returns the modified 'daily' DataFrame.
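To make the steps above concrete, here is a small sketch that applies the same logic to a hypothetical three-row DataFrame. The name group_by_day and the sample rows are illustrative assumptions, and a year is approximated as a fixed 365-day span.

```python
import numpy as np
import pandas as pd

def group_by_day(transactions):
    """Daily mean ppg plus elapsed years; mirrors the GroupByDay steps."""
    daily = transactions[['date', 'ppg']].groupby('date').mean()
    daily['date'] = daily.index
    start = daily['date'].iloc[0]
    # Assumption: a year is approximated by a fixed 365-day span
    daily['years'] = (daily['date'] - start) / np.timedelta64(365, 'D')
    return daily

# Hypothetical data: two sales on the first day, one sale 365 days later
toy = pd.DataFrame({
    'date': pd.to_datetime(['2013-01-01', '2013-01-01', '2014-01-01']),
    'ppg': [10.0, 14.0, 9.0],
})
daily = group_by_day(toy)
print(daily[['ppg', 'years']])
```

The two sales on the first day collapse into a single row with a mean ppg of 12.0, and the second row falls one year after the start.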
def GroupByQualityAndDay(transactions):
    # Split the transactions by quality level
    groups = transactions.groupby('quality')
    dailies = {}
    # Map each quality name to its daily aggregated DataFrame
    for name, group in groups:
        dailies[name] = GroupByDay(group)
    return dailies
# calling function to find daily aggregated information for each quality level
d = GroupByQualityAndDay(dataset)
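The result is a dictionary keyed by quality level. The sketch below shows that shape on hypothetical data; the quality labels and prices are made up, and for brevity each value is a Series of daily means rather than the full DataFrame that GroupByDay returns.

```python
import pandas as pd

# Hypothetical transactions; the 'quality' values are illustrative
toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-01', '2014-01-01',
                            '2014-01-02', '2014-01-02']),
    'quality': ['high', 'low', 'high', 'high'],
    'ppg': [12.0, 5.0, 13.0, 11.0],
})

# One Series of daily mean ppg per quality level, keyed by quality name
dailies = {name: group.groupby('date')['ppg'].mean()
           for name, group in toy.groupby('quality')}

for name, daily in dailies.items():
    print(name, daily.tolist())
```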
Plotting
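# plotting using matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns
# Create subplots, one per quality level
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(15, 15), sharex=True)
# Assumed plotting loop: one panel of daily mean ppg per quality level
for i, (quality, daily_data) in enumerate(d.items()):
    axes[i].scatter(daily_data['date'], daily_data['ppg'], s=10)
    axes[i].set_title(f'Daily price per gram - {quality}')
    axes[i].set_ylabel('ppg')
# show plot
plt.tight_layout()
plt.show()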
• One apparent feature in these plots is a gap around November 2013. It is possible
that data collection was not active during this time, or the data might not be
available. We will consider ways to deal with this missing data later.
• Visually, it looks like the price of high quality cannabis is declining during this
period, and the price of medium quality is increasing. The price of low quality
might also be increasing, but it is harder to tell, since it seems to be more volatile.
Moving Averages
• Most time series analysis is based on the modeling assumption that the observed
series is the sum of three components:
– Trend
A smooth function that captures persistent changes
– Seasonality
Periodic variation, possibly including daily, weekly, monthly, or yearly cycles
– Noise
Random variation around the long-term trend
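The additive assumption can be illustrated with a synthetic series. This is only a sketch: the slope, the weekly cycle, and the noise level are arbitrary choices for illustration and are not estimates from the mj-clean data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)            # fixed seed for repeatability
dates = pd.date_range('2014-01-01', periods=365, freq='D')
t = np.arange(len(dates))

trend = 10.0 + 0.01 * t                          # slow linear rise
seasonality = 0.5 * np.sin(2 * np.pi * t / 7)    # weekly cycle
noise = rng.normal(0.0, 0.3, size=len(t))        # random variation

# Observed series = trend + seasonality + noise
series = pd.Series(trend + seasonality + noise, index=dates)
print(series.head())
```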
• When the trend is not a simple function, a good way to estimate it is a moving average.
• A moving average divides the series into overlapping regions, called windows,
and computes the average of the values in each window.
• One of the simplest moving averages is the rolling mean, which computes the
mean of the values in each window. For example, if the window size is 3, the
rolling mean computes the mean of values 0 through 2, 1 through 3, 2 through
4, etc.
• pandas provides the rolling(window).mean() method chain: called on a Series with
a window size, it returns a new Series of rolling means.
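The window-of-3 example above can be checked directly. This is a minimal sketch on a made-up five-value Series.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each output is the mean of the current value and the two before it
rolled = s.rolling(window=3).mean()
print(rolled.tolist())   # NaN, NaN, 2.0, 3.0, 4.0
```

The first two entries are NaN because a full window of three values is not yet available at those positions.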
import matplotlib.pyplot as plt
import seaborn as sns
# Create subplots
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10, 7), sharex=True)
# Iterate over quality levels and plot the rolling mean for each
for i, (quality, daily_data) in enumerate(d.items()):
    # Adjust the window size as needed
    rolling_mean = daily_data['ppg'].rolling(window=30).mean()
    axes[i].plot(daily_data['date'], rolling_mean,
                 label=f'{quality} - Rolling Mean', color='blue')
    axes[i].set_title(f'Rolling Mean - {quality}')
    axes[i].set_xlabel('Date')
    axes[i].set_ylabel('Rolling Mean (ppg)')
    axes[i].legend()
# show plot
plt.tight_layout()
plt.show()
• The rolling mean seems to do a good job of smoothing out the noise and
extracting the trend. The first 29 values are NaN, and wherever there’s a
missing value, it’s followed by another 29 NaNs. There are ways to fill in these
gaps, but they are a minor nuisance.
• An alternative is the Exponentially-Weighted Moving Average (EWMA), which
has two advantages.
– It computes a weighted average where the most recent value has the
highest weight and the weights for previous values drop off exponentially.
– The pandas implementation of EWMA handles missing values better.
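The difference in missing-value handling can be seen on a short made-up Series with one gap. This sketch uses a window and span of 3 rather than 30 so the effect is easy to count; the values are hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, 6.0])

# Rolling mean: every window that touches the NaN is lost as well
rolled = s.rolling(window=3).mean()
# EWMA: pandas carries the running average across the missing value
ewma = s.ewm(span=3).mean()

print(rolled.isna().sum())   # 5
print(ewma.isna().sum())     # 0
```

With a window of 3, the gap costs the rolling mean its own position plus the next two, on top of the two NaNs at the start.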
import matplotlib.pyplot as plt
import seaborn as sns
# Create subplots
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10, 7), sharex=True)
# Iterate over quality levels and plot the EWMA for each
for i, (quality, daily_data) in enumerate(d.items()):
    # Calculate EWMA with a specified span (adjust span as needed)
    ewma = daily_data['ppg'].ewm(span=7, adjust=False).mean()
    axes[i].plot(daily_data['date'], ewma, label=f'{quality} - EWMA', color='blue')
    axes[i].set_title(f'EWMA - {quality}')
    axes[i].set_xlabel('Date')
    axes[i].set_ylabel('EWMA (ppg)')
    axes[i].legend()
# show plot
plt.tight_layout()
plt.show()
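To show what ewm(span=7, adjust=False) computes, the sketch below reproduces it with an explicit recursion and compares the result with pandas. The helper name ewma_manual and the sample values are hypothetical; the relation alpha = 2 / (span + 1) is how pandas converts a span into a smoothing factor.

```python
import pandas as pd

def ewma_manual(values, span):
    """EWMA by the recursion y[t] = (1 - alpha) * y[t-1] + alpha * x[t]."""
    alpha = 2.0 / (span + 1.0)      # pandas' span-to-alpha conversion
    out = [values[0]]               # the first output is the first value
    for x in values[1:]:
        out.append((1.0 - alpha) * out[-1] + alpha * x)
    return out

values = [10.0, 12.0, 11.0, 13.0]
manual = ewma_manual(values, span=7)
from_pandas = pd.Series(values).ewm(span=7, adjust=False).mean().tolist()
print(manual)
print(from_pandas)
```

A larger span gives a smaller alpha, so each new value moves the average less and the curve is smoother.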