Intro To Analytics Modeling Homework 2
5/28/2020
Contents
Question 4.1
Question 4.2
Question 5.1
Question 6.1
Question 6.2.1
Question 6.2.2
Question 4.1
My project for a utility company is to identify whether a particular customer is stealing electricity by tampering with the meter and bypassing the device that records the kWh the premises consume. Customers have been known to use several methods to achieve this, including installing jumper cables on the meter, and essentially receive free power for their homes.
Some predictors I could feed into a clustering model to identify patterns in such customers include: latitude/longitude, since the relative location of previous theft cases compared to new cases could show a strong correlation (a theft method used by one customer is often duplicated in surrounding neighborhoods); property type, such as residential vs. commercial vs. mobile home vs. vacation home; and account delinquency/credit score, i.e., how clustered the credit scores of customer accounts are within a geographic area. A sketch of how these could feed a clustering model follows.
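As a rough sketch of how these predictors could feed k-means (every name and number below is hypothetical, purely to illustrate the shape of the model):
# hypothetical customer table: location, property type, delinquency score
customers <- data.frame(
  lat    = runif(100, 33.6, 34.0),            # made-up coordinates
  lon    = runif(100, -84.6, -84.2),
  prop   = sample(1:4, 100, replace = TRUE),  # property type, integer-encoded
  delinq = rnorm(100, 600, 80)                # made-up delinquency/credit score
)
scaled <- scale(customers)                    # put predictors on a common scale
fit <- kmeans(scaled, centers = 4, nstart = 20)
table(fit$cluster)                            # cluster sizes to review against known theft cases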
Question 4.2
The first step is exploratory data analysis and visualization of the iris dataset. Plotting the correlation of all four predictor variables shows that Petal.Length and Petal.Width are the most prominent; Sepal.Length also shows strong correlation with both, while Sepal.Width is the only predictor that is not highly correlated with the others. Therefore, we will fit the clustering model on combinations of these three variables and validate what we can eyeball from the graphs. Versicolor and virginica scatter significantly away from setosa in the Petal.Length and Petal.Width measurements. Another observation is that there are three different species of flower, which makes three clusters a good hypothesis; we verify this with the elbow diagram below.
# load libraries
pacman::p_load(kernlab, dplyr, ggthemes, corrplot, ggplot2, tidyverse, tidyr,
               outliers, moments, lubridate, changepoint)
# preview the data
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
plot(iris)
[Scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width]
[Boxplot: Iris Dataset, measurements in inches by variable and species]
[Correlation plot (corrplot) of the four predictors]
# explore each combination of 2 predictors and see which combination clusters better
ggplot(iris, aes(Petal.Length, Petal.Width, color= Species)) + geom_point()
[Pairwise scatterplots of each two-predictor combination, colored by species]
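The remaining pairwise views were produced the same way; as a compact alternative, a sketch (assuming only ggplot2, already loaded above) can loop over every two-predictor combination:
# plot every pair of numeric predictors, colored by species
pred_pairs <- combn(names(iris)[1:4], 2)
for (j in 1:ncol(pred_pairs)) {
  p <- ggplot(iris, aes(.data[[pred_pairs[1, j]]], .data[[pred_pairs[2, j]]],
                        color = Species)) +
    geom_point()
  print(p)
}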
In order to determine the optimal k, we test several possible values of k (ranging from 1 to 10) and see how additional clusters affect the grouping of points. We achieve this by plotting the within-groups sum of squares against the number of clusters and using the elbow method: we look for the value of k after which adding another cluster yields only a marginal reduction. Doing this, we can see that after k = 3 the improvement is not substantial and we would be overfitting.
Next, we fit the model with 3 clusters and experiment with the three correlated predictors identified in the exploratory data analysis above (Petal.Length, Petal.Width, Sepal.Length), comparing the resulting clusters with the actual species to pick the best performing model. Since the starting assignments are random, we specify nstart = 20 so that R tries 20 different random starting assignments and keeps the one with the lowest within-cluster variation.
From the confusion matrices below, Model 2, which uses only the predictors Petal.Width and Petal.Length, is the highest performing model at correctly classifying the species: Models 1 and 3 reach an accuracy of 89%, while Model 2 reaches 96%. Every model classifies setosa and versicolor well but misclassifies virginica most often. This is consistent with the EDA: the graphs show setosa well separated from the other species, with overlap between versicolor and virginica.
# within-groups sum of squares for k = 1 to 10
# (loop body reconstructed; assumed to use all four predictors)
wss <- numeric(10)
for (k in 1:10) {
  wss[k] <- kmeans(iris[, 1:4], centers = k, nstart = 20)$tot.withinss
}
plot(x = 1:10, y = wss, type = "b", xlab = "# of clusters", ylab = "Within-groups sum of squares")
[Elbow plot: within-groups sum of squares vs. number of clusters]
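For reference, a minimal sketch of the three fits (the exact predictor sets for models 1 and 3 are my assumption, inferred from the discussion above) could look like:
# k-means with k = 3 on three predictor combinations, nstart = 20 random starts
model1 <- kmeans(iris[, c("Petal.Length", "Petal.Width", "Sepal.Length")],
                 centers = 3, nstart = 20)  # all three correlated predictors
model2 <- kmeans(iris[, c("Petal.Length", "Petal.Width")],
                 centers = 3, nstart = 20)  # petal measurements only
model3 <- kmeans(iris[, c("Petal.Length", "Sepal.Length")],
                 centers = 3, nstart = 20)  # assumed pairing for model 3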
table(model1$cluster, iris$Species)
##
## setosa versicolor virginica
## 1 0 2 36
## 2 0 48 14
## 3 50 0 0
table(model2$cluster, iris$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 48 4
## 3 0 2 46
table(model3$cluster, iris$Species)
##
## setosa versicolor virginica
## 1 0 2 36
## 2 50 0 0
## 3 0 48 14
# model 1 and 3 accuracy
(50 + 48 + 36) / length(iris$Species)
## [1] 0.8933333
#model 2 accuracy
(50 + 48 + 46) / length(iris$Species)
## [1] 0.96
Question 5.1
Grubbs type 10 is a test for a single outlier (the tail is detected automatically and can be reversed with the opposite parameter). Type 11 is a test for two outliers on opposite tails, while type 20 tests for two outliers in the same tail. We start off with some summary statistics and look at some descriptive features.
# last column of dataset
crime = crime_df$Crime
# stats summary
summary(crime)
# skewness, kurtosis, and maximum value (moments package)
skewness(crime)
## [1] 1.08848
kurtosis(crime)
## [1] 3.943658
max(crime)
## [1] 1993
# visualize plots
plot(crime)
[Scatterplot of the Crime values by index]
boxplot(crime)
[Boxplot of Crime]
Using the moments package, we can see the skewness and kurtosis of the data. A skewness of 1.09 indicates the data is positively (right) skewed, a kurtosis of 3.94, just above the normal distribution's 3, indicates slightly heavier tails, and the maximum value of 1993 lies farthest from the mean.
We use the Grubbs test to check for outliers more rigorously. Its statistic is the suspect value minus the mean, divided by the standard deviation (G = |x - x̄| / s), which is compared to a critical value to test the alternative hypothesis. Using the two-tailed test (type = 11), the p-value is 1, so we clearly fail to reject the null hypothesis (p-value greater than 0.05); the alternative hypothesis that the lowest and highest values are both outliers is not supported. The box-and-whisker plot shows that any potential outlier sits in the upper tail only, with no candidates at the bottom of the dataset, so a one-tailed test will be more representative.
# check outlier
grubbs.test(crime, type = 11)
##
## Grubbs test for two opposite outliers
##
## data: crime
## G = 4.26877, U = 0.78103, p-value = 1
## alternative hypothesis: 342 and 1993 are outliers
##
grubbs.test(crime, type = 10)
##
## Grubbs test for one outlier
##
## data: crime
## G = 2.81287, U = 0.82426, p-value = 0.07887
## alternative hypothesis: highest value 1993 is an outlier
Now we use the one-tailed test (type = 10). Since the p-value of 0.079 is greater than 0.05, we fail to reject the null hypothesis that there is no outlier: both the highest and lowest values in the Crime column fall within the expected deviation and are not outliers according to the Grubbs test. Although our data is skewed, the result is close to, but not under, the significance level it would need to reach for us to throw out the data points. Since there isn't a clearly significant p-value, I would err on the side of keeping the suspect points in the dataset.
Question 6.1
For my project, a change detection model could be applied when a tamper event is followed by a drop in kWh usage. Another good check would account for seasonality: is the change consistent with the same months in previous years (e.g., moving from hotter to colder months normally drops electricity usage because customers switch to gas heating)? Therefore, the detection would compare the average kWh usage before the tamper event with the average usage after it to see how much usage has dropped. The critical value would come from the average usage in the same month of past years (compare this July with past Julys), flagging the account if the current drop exceeds that normal month-over-month difference by more than the chosen sensitivity.
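A rough sketch of that comparison (every value and name below is hypothetical):
# hypothetical: kwh is daily usage, tamper_day is the flagged event index
kwh <- c(rnorm(60, 45, 5), rnorm(60, 30, 5))  # made-up series with a drop
tamper_day <- 60
observed_drop <- mean(kwh[1:tamper_day]) - mean(kwh[(tamper_day + 1):length(kwh)])
seasonal_drop <- 5   # assumed normal drop for this month, from prior years
sensitivity  <- 4    # assumed allowance before flagging
if (observed_drop > seasonal_drop + sensitivity) print("flag account for possible theft")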
Question 6.2.1
We first load the data and visualize it in several graphs to get a picture of where to start the analysis. The daily temperatures are plotted by year and by month to spot any trends. In the temperature-by-year plot, we can check whether the climate has generally warmed over time; temperatures stay relatively similar across the 20 years, with the exception of abnormally high mean temperatures starting in 2010. In the temperature-by-month plot, there is a clear drop in temperature from August to September, marking the transition out of summer; this makes sense, since temperatures begin to cool in September and October shows a definite further drop in average temperature. From this, the baseline average temperature for the CUSUM equation to detect a decrease should come from the summer months of July/August, and we expect the change detection to fire near the end of August.
## DAY X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 1 1-Jul 98 86 91 84 89 84 90 73 82 91 93 95
## 2 2-Jul 97 90 88 82 91 87 90 81 81 89 93 85
## 3 3-Jul 97 93 91 87 93 87 87 87 86 86 93 82
## 4 4-Jul 90 91 91 88 95 84 89 86 88 86 91 86
## 5 5-Jul 89 84 91 90 96 86 93 80 90 89 90 88
## 6 6-Jul 93 84 89 91 96 87 93 84 90 82 81 87
## X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
## 1 85 95 87 92 105 82 90 85
## 2 87 90 84 94 93 85 93 87
## 3 91 89 83 95 99 76 87 79
## 4 90 91 85 92 98 77 84 85
## 5 88 80 88 90 100 83 86 84
## 6 82 87 89 90 98 83 87 84
library(readr)
[Boxplots: Temperature by Year, 1996-2015]
# opening ggplot call reconstructed; assumes temps_df is the long-form data
# with month and temp columns (implied by the geom_hline call below)
ggplot(temps_df, aes(x = factor(month), y = temp)) +
  geom_jitter(pch = 21, alpha = .2, color = 'dark orange') +
  geom_boxplot(color = 'dark blue') +
  geom_hline(yintercept = mean(temps_df$temp), linetype = 'dotted') +
  xlab('Month') + ylab('Temperature') + labs(title = 'Temperature by Month')
[Boxplots with jittered points: Temperature by Month (July-October)]
I use the CUSUM equation for detecting a decrease, setting the threshold t = 4 and the sensitivity c = 0.5. We start with these values because there isn't a huge relative scale change in this scenario. The average temperature across all days from 1996-2015 is 83.34 degrees, which is visible on the graphs as well. The drop from the low 80s downward is small enough that 0.5 was chosen as the sensitivity, and a threshold of 4 degrees matches the difference between the August and September means, where we see the drop in temperature. We then run a for loop that accumulates s_t = max(0, s_{t-1} + (mu - x_t - c)) until it crosses the threshold. As shown below, the change is detected on the 61st day, August 30th, with a mean temperature of 85.8 degrees; averaged across the years, this is the day where we see the unofficial end of summer.
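The loop below relies on row_mean (the mean temperature for each calendar day across all years) and mu (the baseline); their construction did not appear above, so this prep is an assumption inferred from the calls that follow:
# assumed prep: per-day means across 1996-2015 and a July/August baseline
row_mean <- rowMeans(temps[, -1])   # mean temperature for each calendar day
mu <- mean(row_mean[1:62])          # baseline from the July/August rows (days 1-62)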
t = 4
c = 0.5
st = 0
# detecting a decrease: accumulate s_t and stop once it crosses the threshold
for (i in 1:length(row_mean)) {
  st = max(0, st + (mu - row_mean[i] - c))
  if (st > t) {
    print(i)
    break
  }
}
## [1] 61
row_mean[61]
## [1] 85.8
temps$DAY[61]
## [1] "30-Aug"
Question 6.2.2
Using the column means to run the same algorithm by year rather than by day, we set t = 2 and c = 1. The mean temperature by year does not show as big a change as the temperature-by-month plot, and 2 degrees is a sufficient threshold since that is roughly the difference between yearly means. As the year plot suggested, the for loop detects a change at 2010.
t = 2
c = 1
st = 0
column_mean = colMeans(temps[, -1])   # mean temperature for each year
mu_yr = mean(column_mean)             # assumed baseline: overall yearly mean
# detecting an increase in the yearly means (2010 runs abnormally warm)
for (i in 1:length(column_mean)) {
  st = max(0, st + (column_mean[i] - mu_yr - c))
  if (st > t) {
    print(i)
    break
  }
}
## [1] 15
column_mean[15]
## X2010
## 87.21138