
Application of Linear Algebra

This document is an R Markdown report detailing the analysis of Olympic decathlon scores from 2012-2020 using PCA and k-means clustering. The analysis shows that using PCA to reduce dimensionality before clustering improves the classification of data into three distinct classes. The results indicate that the third class is more accurately defined after applying PCA compared to the original dataset.


HW3

Freeman Chen

11/16/2021

R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring
HTML, PDF, and MS Word documents. For more details on using R Markdown see
http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as
well as the output of any embedded R code chunks within the document. You can embed an
R code chunk like this:
summary(cars)

##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00

library(devtools)

## Loading required package: usethis

#install_github("vqv/ggbiplot")
library(ggbiplot)

## Loading required package: ggplot2

## Loading required package: plyr

## Loading required package: scales

## Loading required package: grid

#install.packages("factoextra")
library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at
## https://goo.gl/ve3WBa

data <- read.csv("Olympic_Dec.csv")


head(data)
## X100M LongJump ShotPut HighJump X400M X110M DiscusThrow PoleVault
## 1 1011 1068 769 850 963 1032 716 972
## 2 994 942 807 794 904 1035 834 849
## 3 801 940 759 906 859 917 782 819
## 4 850 970 819 850 853 863 835 849
## 5 980 945 712 850 899 926 785 819
## 6 940 864 782 714 906 989 852 880
## JavelinThrow X1500M
## 1 767 721
## 2 838 674
## 3 996 744
## 4 763 795
## 5 780 746
## 6 698 695

## Perform PCA (centering and scaling the variables)
pca.fit <- prcomp(data, center = TRUE, scale=TRUE)
summary(pca.fit)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     1.8102 1.3196 1.1844 0.95432 0.88665 0.82298 0.6760
## Proportion of Variance 0.3277 0.1741 0.1403 0.09107 0.07861 0.06773 0.0457
## Cumulative Proportion  0.3277 0.5018 0.6421 0.73314 0.81176 0.87949 0.9252
##                            PC8     PC9    PC10
## Standard deviation     0.54742 0.49014 0.45629
## Proportion of Variance 0.02997 0.02402 0.02082
## Cumulative Proportion  0.95516 0.97918 1.00000

plot(pca.fit$sdev^2, main="SCREE Diagram", type="l")


#From
the SCREE Diagram and the cumulative proportions of variances, we can observe that only
about the first 6 components may be sufficient to explain the variation in the original
dataset
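The cutoff can also be checked programmatically. A minimal sketch, assuming `pca.fit` from the `prcomp()` chunk above; the 85% threshold is an illustrative choice, not part of the original analysis:

```r
# Proportion of variance explained by each principal component
var.explained <- pca.fit$sdev^2 / sum(pca.fit$sdev^2)
cum.var <- cumsum(var.explained)

# Smallest number of components reaching an (assumed) 85% threshold;
# with the cumulative proportions printed above, this gives k = 6
k <- which(cum.var >= 0.85)[1]
k
```

Because `prcomp()` was called with `scale=TRUE`, `pca.fit$sdev^2` sums to the number of variables, so dividing by the sum gives the same proportions as `summary(pca.fit)`.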
# Compute k-means with k = 3
set.seed(123)
km.res <- kmeans(data, 3, nstart = 25)
# Print the results
print(km.res)

## K-means clustering with 3 clusters of sizes 36, 32, 1
##
## Cluster means:
##      X100M LongJump  ShotPut HighJump    X400M    X110M DiscusThrow PoleVault
## 1 917.3611 936.0556 757.4167 838.0833 894.5833 935.1667    762.7222  884.6944
## 2 858.5312 825.1250 753.0938 774.3438 826.6250 873.9375    764.5312  825.3438
## 3 825.0000 847.0000   0.0000 925.0000 765.0000 869.0000    618.0000  941.0000
##   JavelinThrow   X1500M
## 1     779.3056 728.8333
## 2     707.9688 649.5938
## 3     746.0000 634.0000
##
## Clustering vector:
##  [1] 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 2 1 2 2 2 2 1 2 2 2 2 3 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 1433565 1346603       0
##  (between_SS / total_SS =  32.7 %)
##
## Available components:
##
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

# graph the cluster

fviz_cluster(km.res, data = data)

## Cluster after PCA, using the first 6 principal components


km.res.pca <- kmeans(pca.fit$x[,1:6], 3, nstart = 25)
# Print the results
print(km.res.pca)

## K-means clustering with 3 clusters of sizes 32, 24, 13
##
## Cluster means:
##         PC1        PC2        PC3         PC4          PC5         PC6
## 1 -1.407563  0.2612312  0.3442238 -0.13098381  0.006352374  0.09804290
## 2  1.686239  0.7046040 -0.2202246 -0.04617983 -0.177651106  0.04757325
## 3  0.351714 -1.9438381 -0.4407515  0.40767675  0.312334659 -0.32916390
##
## Clustering vector:
##  [1] 1 1 1 1 1 1 1 1 1 2 3 2 3 2 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 3 1 1 3 1
## [39] 3 3 2 2 3 2 3 2 2 3 2 3 1 1 1 1 1 1 1 1 1 1 1 1 2 1 3 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 156.8503 117.7800 112.9131
##  (between_SS / total_SS =  35.2 %)
##
## Available components:
##
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

# graph the cluster

fviz_cluster(km.res.pca, data = data)


ggbiplot(pca.fit,ellipse=TRUE,choices=c(1,2))
# I use the Olympic decathlon scores from 2012-2020 as the dataset. First I
# perform k-means clustering to make 3 classes for the dataset; from the graph
# we can see that the 3rd class is not classified well on the original
# dataset. I then decided to use PCA to lower the dimension of the dataset
# before clustering it. Compared to other methods, I think PCA is the best
# choice here because all the variables in this dataset are integer scores, so
# we do not have to normalize the columns. After applying PCA, I pick the
# first 6 components and use k-means clustering to define 3 classes; from the
# graph we can see that the 3rd class is defined much better than in the
# original clustering.
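The visual comparison can be backed with a number already present in the two printouts. A minimal sketch, assuming `km.res` and `km.res.pca` from the chunks above: `kmeans()` returns `betweenss` and `totss`, and their ratio is the between_SS / total_SS percentage shown in each output (32.7% on the raw data, 35.2% after PCA).

```r
# Percentage of total variation captured between clusters
# (higher means more separated clusters)
bss.ratio <- function(km) 100 * km$betweenss / km$totss

bss.ratio(km.res)      # 32.7 for clustering on the raw data
bss.ratio(km.res.pca)  # 35.2 for clustering on the first 6 PCs
```

The increase after PCA is consistent with the cleaner separation of the 3rd class seen in the `fviz_cluster` plots.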
