TSF Sparkling
Forecasting - Sparkling Wine
19.02.2023
─
Taniya Dubey
PGP-DSBA
Module 8 - Time series
INDEX
S.No  Title
1  Read the data as an appropriate Time Series data and plot the data.
1. Data dictionary
2. Rows of dataset
3. Rows of new dataset
4. Statistical summary
Problem Statement:
ABC Estate Wines has been a leader in the wine industry for many years, offering high-quality wines to
consumers all around the world. As the company continues to expand its reach and grow its customer
base, it is essential to analyze market trends and forecast future sales to ensure continued success.
In this report, we will focus on analyzing the sales data for sparkling wine in the 20th century. As an
analyst for ABC Estate Wines, I have been tasked with reviewing this data to identify patterns, trends,
and opportunities for growth in the sparkling wine market. This knowledge will help us to make informed
decisions about how to position our products in the market, optimize our sales strategies, and forecast
Overall, this report aims to provide valuable insights into the sparkling wine market and how ABC Estate
Data Dictionary:
Column name Details
Data Type:
Index: DateTime
Sales: integer
Month: integer
Year: integer
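A minimal sketch of how a frame with this structure could be set up in pandas. The sales values below are illustrative only (they are not taken from the dataset), and the index name `Time_Stamp` is a hypothetical choice; only the column names and types follow the data dictionary above.

```python
import pandas as pd

# Illustrative values only; they are NOT taken from the report's dataset.
sales = [1500, 1400, 2100, 1700, 1450, 1350]

# Month-start DatetimeIndex beginning in January 1980 (first training year).
index = pd.date_range(start="1980-01-01", periods=len(sales), freq="MS")

df = pd.DataFrame({"Sales": sales}, index=index)
df.index.name = "Time_Stamp"  # hypothetical index name

# Derive the integer Month and Year columns from the DateTime index.
df["Month"] = df.index.month
df["Year"] = df.index.year

print(df.dtypes)          # column types
print(df.isnull().sum())  # null-value count per column
```

Keeping the dates in the index (rather than in an ordinary column) is what lets later steps slice by date and resample by month or year.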
Statistical summary:
Table 4: statistical summary of data
Null Value:
There are no null values present in the dataset. So we can do further analysis smoothly.
The line plot shows the patterns of trend and seasonality and also shows that there was a peak in the
year 1988.
Boxplot Yearly:
Plot 4: boxplot yearly
This yearly box plot shows there is consistency over the years and there was a peak in 1988-1989.
Outliers are present in all years.
Boxplot Monthly:
Plot 5: boxplot monthly
The plot shows that sales are highest in December and lowest in January.
Sales are consistent from January to July, then start to increase from August. Outliers are
present in January, February and July.
Tuesday has more sales than other days and Wednesday has the lowest sales of the week.
Outliers are present on all days which is understandable.
Graph of Monthly Sales over the years:
Plot 7: graph of monthly sales over the year
This plot shows that December has the highest sales over the years and the year 1988 was the year with
the highest number of sales.
Correlation plot
Plot 8: correlation plot
This heat map shows a low correlation between sales and year, and a higher correlation
between month and sales. This indicates seasonal patterns in sales.
Plot ECDF: Empirical Cumulative Distribution Function
This graph shows the distribution of data.
Decomposition - Additive
Plot 10: decomposition plot additive
Decomposition - Multiplicative
3. Split the data into training and test. The test data should
start in 1991.
As per the project instructions, we split the data at 1991. The training data
runs from 1980 to December 1990, and the test data runs from January 1991
to the end of the series.
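The split described above can be sketched as follows. The series here is synthetic, and its end date (December 1992) is an arbitrary assumption made only so the example runs; the split date is the one stated in the text.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data; the end date is an arbitrary assumption.
index = pd.date_range(start="1980-01-01", end="1992-12-01", freq="MS")
rng = np.random.default_rng(0)
df = pd.DataFrame({"Sales": rng.integers(1000, 7000, size=len(index))},
                  index=index)

# Everything before January 1991 is training data; the rest is test data.
split = pd.Timestamp("1991-01-01")
train = df[df.index < split]
test = df[df.index >= split]

print(train.index.min(), train.index.max(), len(train))
print(test.index.min(), test.index.max(), len(test))
```

Splitting by date rather than at random keeps the test period strictly after the training period, which is what a forecasting evaluation needs.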
The green line indicates the predictions made by the model, while the orange
values are the actual test values. It is clear that the predicted values are very far
from the actual values.
Model was evaluated using the RMSE metric. Below is the RMSE calculated for this model.
The green line indicates the predictions made by the model, while the orange
values are the actual test values. It is clear that the predicted values are very far
from the actual values.
Model was evaluated using the RMSE metric. Below is the RMSE calculated for this model.
The green line indicates the predictions made by the model, while the orange
values are the actual test values. It is clear that the predicted values are very far
from the actual values.
Model was evaluated using the RMSE metric. Below is the RMSE calculated for this model.
Model was evaluated using the RMSE metric. Below is the RMSE calculated for this model.
2pointTrailingMovingAverage 813.400684
4pointTrailingMovingAverage 1156.589694
6pointTrailingMovingAverage 1283.927428
9pointTrailingMovingAverage 1346.278315
We built multiple moving average models with rolling windows varying
from 2 to 9. A rolling average is generally a better method than a simple average
because it uses only the previous n values to make the prediction, where n is the
rolling window. It therefore reflects recent trends and is in general more
accurate. The larger the rolling window, the smoother the curve, since
more values are averaged. In the table above, the 2-point window gives the
lowest test RMSE of the four windows.
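As an illustration of the idea, the sketch below forecasts each point as the mean of the previous n observations and scores the forecast with RMSE. The helper names `trailing_ma_forecast` and `rmse` are hypothetical, the toy series is made up, and the `shift(1)` (so that a prediction never uses the value it is predicting) is an assumption about how "previous n values" is applied; the report does not show its own code.

```python
import numpy as np
import pandas as pd

def trailing_ma_forecast(series: pd.Series, window: int) -> pd.Series:
    """Forecast each point as the mean of the previous `window` observations."""
    return series.shift(1).rolling(window).mean()

def rmse(actual, predicted) -> float:
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy series, not taken from the dataset.
series = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])

pred = trailing_ma_forecast(series, window=2)
valid = pred.notna()  # the first `window` forecasts are undefined (NaN)

print(pred.tolist())
print(rmse(series[valid], pred[valid]))  # 3.0 for this toy series
```

On a steadily rising series the trailing mean always lags behind, which is why a wider window (more lag) tends to give a smoother curve but a larger error.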
Model was evaluated using the RMSE metric. Below is the RMSE calculated for this model.
0.1 1375.393398
0.2 1595.206839
0.3 1935.507132
0.4 2311.919615
0.5 2666.351413
0.6 2979.204388
0.7 3249.944092
0.8 3483.801006
The green line indicates the predictions made by the model, while the orange
values are the actual test values. It is clear that the predicted values are very far
from the actual values.
Model was evaluated using the RMSE metric. Below is the RMSE calculated for this model.
The output for the best alpha, beta, and gamma values is shown by the green
line in the above plot. The best model had a multiplicative trend as well as
seasonality. It was evaluated using the RMSE metric. Below is the
RMSE calculated for this model.
Alpha=0.4,Beta=0.1,Gamma=0.3,TripleExponentialSmoothing 317.434302
● H0 : The Time Series has a unit root and is thus non-stationary.
● H1 : The Time Series does not have a unit root and is thus stationary.
We want the series to be stationary before building ARIMA models, so we want the p-value
of this test to be less than the significance level α.
To make the series stationary we used differencing. We called the .diff()
function on the existing series without any argument, which uses the default period of 1, and
dropped the NaN values, since first-order differencing leaves the first value as NaN.
Plot 21: plot for Dickey-Fuller test after differencing approach
p-value 0.000000
The p-value of the Dickey-Fuller test was 0.000, which is less than α = 0.05. Hence the null hypothesis
that the series is non-stationary was rejected at difference = 1, which implies that the series became
stationary after differencing. The rolling mean plot was also approximately a straight line this time, and
the series looked more or less the same from both directions, indicating stationarity.
We could now proceed ahead with ARIMA/ SARIMA models, since we had made
the series stationary.
We used a for loop to determine the optimum values of p, d and q, where p is
the order of the AR (Auto-Regressive) part of the model, q is the order of
the MA (Moving Average) part, and d is the order of differencing required
to make the series stationary. Values of p and q in the range (0,4) were passed to the
loop, while d was fixed at 1, since the ADF test had already shown the series
to be stationary after first-order differencing.
Some parameter combinations for the Model...
Model: (0, 1, 1)
Model: (0, 1, 2)
Model: (0, 1, 3)
Model: (1, 1, 0)
Model: (1, 1, 1)
Model: (1, 1, 2)
Model: (1, 1, 3)
Model: (2, 1, 0)
Model: (2, 1, 1)
Model: (2, 1, 2)
Model: (2, 1, 3)
Model: (3, 1, 0)
Model: (3, 1, 1)
Model: (3, 1, 2)
Model: (3, 1, 3)
Akaike information criterion (AIC) value was evaluated for each of these models and the model with
least AIC value was selected.
Summary report for the ARIMA model with values (p=2, d=1, q=2).
Auto_ARIMA 1299.978401
Akaike information criterion (AIC) value was evaluated for each of these models and the model with
least AIC value was selected. Here only the top 5 models are shown.
Summary report for the best SARIMA model with values (2,1,2)(2,0,2,12).
We also plotted the graphs for the residual to determine if any further information can be extracted or
all the usable information has already been extracted. Below were the plots for the best auto SARIMA
model.
Plot 22: SARIMA plot
RMSE of Model:
528.6069474180102
Looking at the ACF plot, we can see a sharp decay after lag 1 for the original as well as the differenced
data. Hence we select q = 1.
Looking at the PACF plot, we again see significant bars up to lag 1 for the differenced series, which is
stationary; after lag 1 the decay is large enough. Hence we choose p = 1. The value of d is 1,
since we saw earlier that the series is stationary after first-order differencing. Hence the values selected
for the manual ARIMA are p=1, d=1, q=1.
Summary from this manual ARIMA model:
1319.9367298218867
359.612454
8. Build a table (create a data frame) with all the models built
along with their corresponding parameters and the respective
RMSE values on the test data.
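A sketch of how such a comparison table could be assembled with pandas. It includes only the rows whose label and test RMSE both appear together in this report; the remaining models would be added in the same way.

```python
import pandas as pd

# Labels and test RMSE values as reported in the sections above.
results = pd.DataFrame(
    {
        "Test RMSE": [
            813.400684,
            1156.589694,
            1283.927428,
            1346.278315,
            317.434302,
            1299.978401,
        ]
    },
    index=[
        "2pointTrailingMovingAverage",
        "4pointTrailingMovingAverage",
        "6pointTrailingMovingAverage",
        "9pointTrailingMovingAverage",
        "Alpha=0.4,Beta=0.1,Gamma=0.3,TripleExponentialSmoothing",
        "Auto_ARIMA",
    ],
)

# Sort so that the best (lowest RMSE) model comes first.
results = results.sort_values("Test RMSE")
print(results)
```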
Based on the above comparison of all the models that we built, we
can conclude that triple exponential smoothing (the Holt-Winters model)
gives the lowest RMSE on the test data and is therefore the best model.
Sales predictions were made with this best model.
The sales predictions are plotted on the graph along with the confidence intervals. Please find the graph below.
Plot 27: prediction plot
Predictions one year into the future are shown in orange, while the confidence interval is
shown in grey.
10. Comment on the model thus built and report your findings
and suggest the measures that the company should be taking
for future sales.
● The sales for Sparkling wine for the company are predicted to be at least
the same as last year, if not more, with peak sales for next year potentially
higher than this year.
● Sparkling wine has been a consistently popular wine among customers
with only a very marginal decline in sales, despite reaching its peak
popularity in the late 1980s.
● Seasonality has a significant impact on the sales of Sparkling wine, with
sales being slow in the first half of the year and picking up from August to
December.
● It is recommended for the company to run campaigns in the first half of
the year when sales are slow, particularly in the months of March to July.
● Combining promotions where Sparkling wine is paired with a less popular
wine such as "Rose wine" under a special offer may encourage customers
to try the underperforming wine, which could potentially boost its sales
and benefit the company.