Intro To Analytics Modeling Homework 2
5/28/2020
Contents
Question 4.1
Question 4.2
Question 5.1
Question 6.1
Question 6.2.1
Question 6.2.2
Question 4.1
My project for a utility company is to identify whether a particular customer is stealing electricity by tampering with the meter and bypassing the device that records the kWh the premises consume. Customers have been known to use several methods to achieve this, including installing jumper cables on the meter, and essentially receive free power for their homes.
Some predictors I could feed into a clustering model to identify patterns in such customers include: latitude/longitude, since the relative location of previous theft cases compared to new cases could show a strong correlation (a theft method used by one customer is often duplicated in surrounding neighborhoods); property type, such as residential vs. commercial vs. mobile home vs. vacation home; and account delinquency/credit score, i.e., how clustered the credit scores of customer accounts are within a geographic area. A sketch of how these could feed a clustering model follows.
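As a rough sketch of how these predictors could feed k-means (every name and number below is hypothetical, purely to illustrate the shape of the model):
# hypothetical customer table: location, property type, delinquency score
customers <- data.frame(
  lat    = runif(100, 33.6, 34.0),            # made-up coordinates
  lon    = runif(100, -84.6, -84.2),
  prop   = sample(1:4, 100, replace = TRUE),  # property type, integer-encoded
  delinq = rnorm(100, 600, 80)                # made-up delinquency/credit score
)
scaled <- scale(customers)                    # put predictors on a common scale
fit <- kmeans(scaled, centers = 4, nstart = 20)
table(fit$cluster)                            # cluster sizes to review against known theft cases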
Question 4.2
The first step is exploratory data analysis and visualization of the iris dataset. Plotting the correlation of all four predictor variables shows that Petal.Length and Petal.Width are the most prominent; Sepal.Length also shows strong correlation with both, while Sepal.Width is the only predictor that is not highly correlated with the others. Therefore, we will fit the clustering model on combinations of these three variables and validate what we can eyeball from the graphs. Versicolor and virginica scatter significantly away from setosa in the Petal.Length and Petal.Width measurements. Another observation is that there are three different species of flower, which makes three clusters a good hypothesis; we verify this with the elbow diagram below.
# load libraries
pacman::p_load(kernlab, dplyr, ggthemes, corrplot, ggplot2, tidyverse, tidyr,
               outliers, moments, lubridate, changepoint)
# preview the data
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
plot(iris)
[Scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width]
[Boxplot: Iris Dataset, measurements in inches by variable and species]
[Correlation plot (corrplot) of the four predictors]
# explore each combination of 2 predictors and see which combination clusters better
ggplot(iris, aes(Petal.Length, Petal.Width, color= Species)) + geom_point()
[Pairwise scatterplots of each two-predictor combination, colored by species]
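The remaining pairwise views were produced the same way; as a compact alternative, a sketch (assuming only ggplot2, already loaded above) can loop over every two-predictor combination:
# plot every pair of numeric predictors, colored by species
pred_pairs <- combn(names(iris)[1:4], 2)
for (j in 1:ncol(pred_pairs)) {
  p <- ggplot(iris, aes(.data[[pred_pairs[1, j]]], .data[[pred_pairs[2, j]]],
                        color = Species)) +
    geom_point()
  print(p)
}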
In order to determine the optimal k, we test several possible values of k (ranging from 1 to 10) and see how additional clusters affect the grouping of points. We achieve this by plotting the within-groups sum of squares against the number of clusters and using the elbow method: we look for the value of k after which adding another cluster yields only a marginal reduction. Doing this, we can see that after k = 3 the improvement is not substantial and we would be overfitting.
Next, we fit the model with 3 clusters and experiment with the three correlated predictors identified in the exploratory data analysis above (Petal.Length, Petal.Width, Sepal.Length), comparing the resulting clusters with the actual species to pick the best performing model. Since the starting assignments are random, we specify nstart = 20 so that R tries 20 different random starting assignments and keeps the one with the lowest within-cluster variation.
From the confusion matrices below, Model 2, which uses only the predictors Petal.Width and Petal.Length, is the highest performing model at correctly classifying the species: Models 1 and 3 reach an accuracy of 89%, while Model 2 reaches 96%. Every model classifies setosa and versicolor well but misclassifies virginica most often. This is consistent with the EDA: the graphs show setosa well separated from the other species, with overlap between versicolor and virginica.
# within-groups sum of squares for k = 1 to 10
# (loop body reconstructed; assumed to use all four predictors)
wss <- numeric(10)
for (k in 1:10) {
  wss[k] <- kmeans(iris[, 1:4], centers = k, nstart = 20)$tot.withinss
}
plot(x = 1:10, y = wss, type = "b", xlab = "# of clusters", ylab = "Within-groups sum of squares")
[Elbow plot: within-groups sum of squares vs. number of clusters]
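For reference, a minimal sketch of the three fits (the exact predictor sets for models 1 and 3 are my assumption, inferred from the discussion above) could look like:
# k-means with k = 3 on three predictor combinations, nstart = 20 random starts
model1 <- kmeans(iris[, c("Petal.Length", "Petal.Width", "Sepal.Length")],
                 centers = 3, nstart = 20)  # all three correlated predictors
model2 <- kmeans(iris[, c("Petal.Length", "Petal.Width")],
                 centers = 3, nstart = 20)  # petal measurements only
model3 <- kmeans(iris[, c("Petal.Length", "Sepal.Length")],
                 centers = 3, nstart = 20)  # assumed pairing for model 3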
table(model1$cluster, iris$Species)
##
## setosa versicolor virginica
## 1 0 2 36
## 2 0 48 14
## 3 50 0 0
table(model2$cluster, iris$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 48 4
## 3 0 2 46
table(model3$cluster, iris$Species)
##
## setosa versicolor virginica
## 1 0 2 36
## 2 50 0 0
## 3 0 48 14
# model 1 and 3 accuracy
(50 + 48 + 36) / length(iris$Species)
## [1] 0.8933333
#model 2 accuracy
(50 + 48 + 46) / length(iris$Species)
## [1] 0.96
Question 5.1
Grubbs type 10 is a test for a single outlier (the tail is detected automatically and can be reversed with the opposite parameter). Type 11 is a test for two outliers on opposite tails, while type 20 tests for two outliers in the same tail. We start off with some summary statistics and look at some descriptive features.
# last column of dataset
crime = crime_df$Crime
# stats summary
summary(crime)
# skewness, kurtosis, and maximum value (moments package)
skewness(crime)
## [1] 1.08848
kurtosis(crime)
## [1] 3.943658
max(crime)
## [1] 1993
# visualize plots
plot(crime)
[Scatterplot of the Crime values by index]
boxplot(crime)
[Boxplot of Crime]
Using the moments package, we can see the skewness and kurtosis of the data. A skewness of 1.09 indicates the data is positively (right) skewed, a kurtosis of 3.94, just above the normal distribution's 3, indicates slightly heavier tails, and the maximum value of 1993 lies farthest from the mean.
We use the Grubbs test to check for outliers more rigorously. Its statistic is the suspect value minus the mean, divided by the standard deviation (G = |x - x̄| / s), which is compared to a critical value to test the alternative hypothesis. Using the two-tailed test (type = 11), the p-value is 1, so we clearly fail to reject the null hypothesis (p-value greater than 0.05); the alternative hypothesis that the lowest and highest values are both outliers is not supported. The box-and-whisker plot shows that any potential outlier sits in the upper tail only, with no candidates at the bottom of the dataset, so a one-tailed test will be more representative.
# check outlier
grubbs.test(crime, type = 11)
##
## Grubbs test for two opposite outliers
##
## data: crime
## G = 4.26877, U = 0.78103, p-value = 1
## alternative hypothesis: 342 and 1993 are outliers
##
grubbs.test(crime, type = 10)
##
## Grubbs test for one outlier
##
## data: crime
## G = 2.81287, U = 0.82426, p-value = 0.07887
## alternative hypothesis: highest value 1993 is an outlier
Now we use the one-tailed test (type = 10). Since the p-value of 0.079 is greater than 0.05, we fail to reject the null hypothesis that there is no outlier: both the highest and lowest values in the Crime column fall within the expected deviation and are not outliers according to the Grubbs test. Although our data is skewed, the result is close to, but not under, the significance level it would need to reach for us to throw out the data points. Since there isn't a clearly significant p-value, I would err on the side of keeping the suspect points in the dataset.
Question 6.1
For my project, a change detection model could be applied when a tamper event is followed by a drop in kWh usage. Another good check would account for seasonality: is the change consistent with the same months in previous years (e.g., moving from hotter to colder months normally drops electricity usage because customers switch to gas heating)? Therefore, the detection would compare the average kWh usage before the tamper event with the average usage after it to see how much usage has dropped. The critical value would come from the average usage in the same month of past years (compare this July with past Julys), flagging the account if the current drop exceeds that normal month-over-month difference by more than the chosen sensitivity.
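A rough sketch of that comparison (every value and name below is hypothetical):
# hypothetical: kwh is daily usage, tamper_day is the flagged event index
kwh <- c(rnorm(60, 45, 5), rnorm(60, 30, 5))  # made-up series with a drop
tamper_day <- 60
observed_drop <- mean(kwh[1:tamper_day]) - mean(kwh[(tamper_day + 1):length(kwh)])
seasonal_drop <- 5   # assumed normal drop for this month, from prior years
sensitivity  <- 4    # assumed allowance before flagging
if (observed_drop > seasonal_drop + sensitivity) print("flag account for possible theft")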
Question 6.2.1
We first load the data and visualize it in several graphs to get a picture of where to start the analysis. The daily temperatures are plotted by year and by month to spot any trends. In the temperature-by-year plot, we can check whether the climate has generally warmed over time; temperatures stay relatively similar across the 20 years, with the exception of abnormally high mean temperatures starting in 2010. In the temperature-by-month plot, there is a clear drop in temperature from August to September, marking the transition out of summer; this makes sense, since temperatures begin to cool in September and October shows a definite further drop in average temperature. From this, the baseline average temperature for the CUSUM equation to detect a decrease should come from the summer months of July/August, and we expect the change detection to fire near the end of August.
## DAY X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 1 1-Jul 98 86 91 84 89 84 90 73 82 91 93 95
## 2 2-Jul 97 90 88 82 91 87 90 81 81 89 93 85
## 3 3-Jul 97 93 91 87 93 87 87 87 86 86 93 82
## 4 4-Jul 90 91 91 88 95 84 89 86 88 86 91 86
## 5 5-Jul 89 84 91 90 96 86 93 80 90 89 90 88
## 6 6-Jul 93 84 89 91 96 87 93 84 90 82 81 87
## X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
## 1 85 95 87 92 105 82 90 85
## 2 87 90 84 94 93 85 93 87
## 3 91 89 83 95 99 76 87 79
## 4 90 91 85 92 98 77 84 85
## 5 88 80 88 90 100 83 86 84
## 6 82 87 89 90 98 83 87 84
library(readr)
[Boxplots: Temperature by Year, 1996-2015]
# opening ggplot call reconstructed; assumes temps_df is the long-form data
# with month and temp columns (implied by the geom_hline call below)
ggplot(temps_df, aes(x = factor(month), y = temp)) +
  geom_jitter(pch = 21, alpha = .2, color = 'dark orange') +
  geom_boxplot(color = 'dark blue') +
  geom_hline(yintercept = mean(temps_df$temp), linetype = 'dotted') +
  xlab('Month') + ylab('Temperature') + labs(title = 'Temperature by Month')
[Boxplots with jittered points: Temperature by Month (July-October)]
I use the CUSUM equation for detecting a decrease, setting the threshold t = 4 and the sensitivity c = 0.5. We start with these values because there isn't a huge relative scale change in this scenario. The average temperature across all days from 1996-2015 is 83.34 degrees, which is visible on the graphs as well. The drop from the low 80s downward is small enough that 0.5 was chosen as the sensitivity, and a threshold of 4 degrees matches the difference between the August and September means, where we see the drop in temperature. We then run a for loop that accumulates s_t = max(0, s_{t-1} + (mu - x_t - c)) until it crosses the threshold. As shown below, the change is detected on the 61st day, August 30th, with a mean temperature of 85.8 degrees; averaged across the years, this is the day where we see the unofficial end of summer.
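The loop below relies on row_mean (the mean temperature for each calendar day across all years) and mu (the baseline); their construction did not appear above, so this prep is an assumption inferred from the calls that follow:
# assumed prep: per-day means across 1996-2015 and a July/August baseline
row_mean <- rowMeans(temps[, -1])   # mean temperature for each calendar day
mu <- mean(row_mean[1:62])          # baseline from the July/August rows (days 1-62)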
t = 4
c = 0.5
st = 0
# detecting a decrease: accumulate s_t and stop once it crosses the threshold
for (i in 1:length(row_mean)) {
  st = max(0, st + (mu - row_mean[i] - c))
  if (st > t) {
    print(i)
    break
  }
}
## [1] 61
row_mean[61]
## [1] 85.8
temps$DAY[61]
## [1] "30-Aug"
Question 6.2.2
Using the column means to run the same algorithm by year rather than by day, we set t = 2 and c = 1. The mean temperature by year does not show as big a change as the temperature-by-month plot, and 2 degrees is a sufficient threshold since that is roughly the difference between yearly means. As the year plot suggested, the for loop detects a change at 2010.
t = 2
c = 1
st = 0
column_mean = colMeans(temps[, -1])   # mean temperature for each year
mu_yr = mean(column_mean)             # assumed baseline: overall yearly mean
# detecting an increase in the yearly means (2010 runs abnormally warm)
for (i in 1:length(column_mean)) {
  st = max(0, st + (column_mean[i] - mu_yr - c))
  if (st > t) {
    print(i)
    break
  }
}
## [1] 15
column_mean[15]
## X2010
## 87.21138