Advanced Statistics - Project Report
MINI PROJECT
ADVANCED
STATISTICS MODULE
Submitted by Rohan Kanungo
5th June 2019
Advanced Statistics Module Mini-Project Rohan Kanungo
TABLE OF CONTENTS
Project Objective
Problem Analysis
Evidence of Multicollinearity
Factor Analysis
Naming of Factors
Multiple Regression Analysis
R-Code
Project Objective
The project is focused on market segmentation in the context of product service
management. The data file Factor-Hair-Revised is to be used for performing the analysis.
Problem Analysis
The data set consists of 100 observations on 13 variables: an ID column, eleven
independent variables, and Satisfaction, the dependent variable that the
independent variables determine.
For the purposes of market segmentation, principal component/factor analysis can
be used to identify the structure of a set of variables as well as provide a
process for data reduction.
We therefore examine and analyze the data set to:
- Understand whether these variables can be “grouped.” By grouping the
variables, we will be able to see the big picture in terms of understanding the
customer.
- Reduce the eleven independent variables to a smaller number of composite
variables.
str(Hairdata_original)
'data.frame': 100 obs. of 13 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ ProdQual : num 8.5 8.2 9.2 6.4 9 6.5 6.9 6.2 5.8 6.4 ...
$ Ecom : num 3.9 2.7 3.4 3.3 3.4 2.8 3.7 3.3 3.6 4.5 ...
$ TechSup : num 2.5 5.1 5.6 7 5.2 3.1 5 3.9 5.1 5.1 ...
$ CompRes : num 5.9 7.2 5.6 3.7 4.6 4.1 2.6 4.8 6.7 6.1 ...
$ Advertising : num 4.8 3.4 5.4 4.7 2.2 4 2.1 4.6 3.7 4.7 ...
$ ProdLine : num 4.9 7.9 7.4 4.7 6 4.3 2.3 3.6 5.9 5.7 ...
$ SalesFImage : num 6 3.1 5.8 4.5 4.5 3.7 5.4 5.1 5.8 5.7 ...
$ ComPricing : num 6.8 5.3 4.5 8.8 6.8 8.5 8.9 6.9 9.3 8.4 ...
$ WartyClaim : num 4.7 5.5 6.2 7 6.1 5.1 4.8 5.4 5.9 5.4 ...
$ OrdBilling : num 5 3.9 5.4 4.3 4.5 3.6 2.1 4.3 4.4 4.1 ...
$ DelSpeed : num 3.7 4.9 4.5 3 3.5 3.3 2 3.7 4.6 4.4 ...
$ Satisfaction: num 8.2 5.7 8.9 4.8 7.1 4.7 5.7 6.3 7 5.5 ...
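The data-reduction idea described above (collapsing correlated variables into a few composite scores) can be sketched outside R. The following is a minimal illustration in Python with numpy, using synthetic data rather than the Factor-Hair file:

```python
import numpy as np

# Synthetic stand-in for survey data: six observed variables driven by
# two underlying traits (hypothetical, not the Factor-Hair data)
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 2))
noise = rng.normal(scale=0.4, size=(100, 6))
data = np.column_stack([latent[:, [0, 0, 0]], latent[:, [1, 1, 1]]]) + noise

# Eigenvalues of the correlation matrix reveal how many composites
# are needed to summarise the six columns
r = np.corrcoef(data, rowvar=False)
values = np.linalg.eigvalsh(r)[::-1]          # sorted, largest first
n_components = int(np.sum(values > 1))        # Kaiser-style count
print(n_components)                           # 2: the two latent traits
```

Two large eigenvalues absorb most of the shared variance here, mirroring how the eleven rating variables later collapse into four factors.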
Evidence of Multicollinearity
The sample size of 100 provides an adequate basis for calculating the correlations
between the variables.
To check for collinearity, we compute and plot the correlation matrix of the eleven
independent variables (dropping the ID and Satisfaction columns).
## Find the correlation
Hairdata <- Hairdata_original[, 2:12]
cor(Hairdata)
cor.plot(Hairdata, numbers = TRUE, xlas = 2, upper = FALSE)
The plot above shows evidence of multicollinearity: the cells marked in blue
correspond to variable pairs with a high degree of correlation.
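Reading the blue cells amounts to flagging variable pairs with large absolute correlations. A small Python/numpy sketch of that screening step, on synthetic data and with a threshold of 0.7 chosen purely for illustration:

```python
import numpy as np

# Two columns built from a shared component are highly correlated;
# the third is independent (synthetic data for illustration)
rng = np.random.default_rng(0)
base = rng.normal(size=100)
x1 = base + rng.normal(scale=0.3, size=100)
x2 = base + rng.normal(scale=0.3, size=100)
x3 = rng.normal(size=100)
r = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)

# Flag off-diagonal pairs whose |correlation| exceeds the threshold
pairs = [(i, j) for i in range(r.shape[0]) for j in range(i)
         if abs(r[i, j]) > 0.7]
print(pairs)                                  # [(1, 0)]: x2 vs x1
```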
Bartlett's test of sphericity, cortest.bartlett(Hairdata, n = 100), gives:
$chisq
[1] 619.2726
$p.value
[1] 1.79337e-96
$df
[1] 55
Conclusion:
Since the p-value is very small, Bartlett's test rejects the hypothesis that the
correlation matrix is an identity matrix: statistically significant correlations,
i.e. multicollinearity, exist in the data set.
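The chi-squared statistic printed above follows Bartlett's formula, chi2 = -(n - 1 - (2p + 5)/6) * ln|R| with df = p(p - 1)/2; for p = 11 variables this gives the 55 degrees of freedom reported. A sketch of the computation in Python (function name hypothetical):

```python
import numpy as np

def bartlett_sphericity(R, n):
    """Bartlett's test statistic: H0 says the correlation matrix R
    is an identity matrix, i.e. there is nothing to factor."""
    p = R.shape[0]
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    return chi2, df

# Sanity check: an identity matrix yields a statistic of zero,
# and p = 11 variables give the 55 degrees of freedom seen above
chi2, df = bartlett_sphericity(np.eye(11), n=100)
print(chi2, df)
```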
Factor Analysis
1. Eigenvalue Computation
## Eigenvalues of the correlation matrix of the 11 predictors
eigen(cor(Hairdata))
eigen() decomposition
$values
3.426971 2.550897 1.690976 1.086556 0.609424 0.551884 0.401518 0.246952
0.203553 0.132842 0.098427
2. Scree Plot
## Scree Plot: eigenvalue against factor number
HairEigenValue <- eigen(cor(Hairdata))$values
Hairfactor <- seq_along(HairEigenValue)
HairScree <- data.frame(Hairfactor, HairEigenValue)
plot(HairScree, col = "red", pch = 18, main = "Scree Plot")
lines(HairScree, col = "blue")
abline(h = 1, col = "purple")   # Kaiser cut-off at eigenvalue 1
Using the Kaiser rule (retain factors with eigenvalue greater than 1), we
determine that there are four principal factors.
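The Kaiser count can be verified directly from the eigenvalues listed above (a one-line check, shown here in Python):

```python
# Eigenvalues reported in the computation step above
eigenvalues = [3.426971, 2.550897, 1.690976, 1.086556, 0.609424,
               0.551884, 0.401518, 0.246952, 0.203553, 0.132842,
               0.098427]

# Kaiser rule: keep components whose eigenvalue exceeds 1
n_factors = sum(e > 1 for e in eigenvalues)
print(n_factors)  # 4
```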
3. Rotation of Loadings
## Loadings
## Unrotated principal loadings
Hair_unrotate <- principal(Hairdata, nfactors = 4, rotate = "none")
print(Hair_unrotate, digits = 5)
UnRotatedprofile <- plot(Hair_unrotate, row.names(Hair_unrotate$loadings))
UnRotatedprofile
To make the boundaries sharper, we perform an orthogonal (varimax) rotation to
clearly identify the factors.
## Rotated principal loadings
Hair_rotate <- principal(Hairdata, nfactors = 4, rotate = "varimax")
print(Hair_rotate, digits = 5)
Naming of Factors
                 RC1       RC2       RC3       RC4
ProdQual      0.00152  -0.01274  -0.03282   0.87566
Ecom          0.05680   0.87056   0.04735  -0.11746
TechSup       0.01833  -0.02446   0.93919   0.10051
CompRes       0.92582   0.11593   0.04860   0.09123
Advertising   0.13876   0.74152  -0.08160   0.01467
ProdLine      0.59122  -0.06397   0.14598   0.64200
SalesFImage   0.13252   0.90045   0.07559  -0.15924
ComPricing   -0.08515   0.22563  -0.24551  -0.72258
WartyClaim    0.10982   0.05483   0.93099   0.10218
OrdBilling    0.86376   0.10683   0.08390   0.03931
DelSpeed      0.93820   0.17734  -0.00463   0.05227
Based on the variables that load heavily on each rotated component, the four
factors are named as follows.
Factor 1 (RC1): Customer Service
i. CompRes
ii. OrdBilling
iii. DelSpeed
Factor 2 (RC2): Marketing
i. SalesFImage
ii. Ecom
iii. Advertising
Factor 3 (RC3): Technical Support
i. TechSup
ii. WartyClaim
Factor 4 (RC4): Product Value
i. ProdQual
ii. ProdLine
iii. ComPricing
Multiple Regression Analysis
## Build the regression data set from the rotated factor scores
mydata <- data.frame(Hair_rotate$scores)
mydataforregression <- cbind(mydata, Hairdata_original$Satisfaction)
names(mydataforregression) <- c("customerservice", "marketing", "techsupport",
"productvalue", "customersatisfaction")
str(mydataforregression)
attach(mydataforregression)
## Regress satisfaction on the four factor scores
Hair_lm <- lm(customersatisfaction ~ customerservice + marketing + techsupport +
productvalue)
summary(Hair_lm)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.91800 0.07089 97.589 < 2e-16 ***
customerservice 0.61805 0.07125 8.675 1.12e-13 ***
marketing 0.50973 0.07125 7.155 1.74e-10 ***
techsupport 0.06714 0.07125 0.942 0.348
productvalue 0.54032 0.07125 7.584 2.24e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-squared Interpretation
Multiple R-squared: 0.6605 means that 66.05% of the variance in the dependent
variable is explained by the independent variables; i.e., 66.05% of the variation
in customer satisfaction is accounted for by the four factors identified.
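R-squared is one minus the ratio of the residual to the total sum of squares. A toy computation in Python (the y values are the first five Satisfaction scores from the data; the fitted values are hypothetical):

```python
# R^2 = 1 - SSE/SST
y = [8.2, 5.7, 8.9, 4.8, 7.1]          # first five Satisfaction scores
y_hat = [7.9, 6.0, 8.5, 5.2, 6.9]      # hypothetical fitted values

y_bar = sum(y) / len(y)
sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained part
r2 = 1 - sse / sst
print(round(r2, 4))
```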
Probability (F-statistic > 46.21) = p-value < 2.2e-16, which is far smaller than 5%.
Hence, we REJECT the NULL HYPOTHESIS that all betas are zero
and conclude that at least one beta is non-zero; the model is significant.
The individual coefficients for customer service, marketing and product value are
also highly significant, as evidenced by their large t-statistics and p-values far
below 5%; technical support (p = 0.348) is the exception and is not significant at
the 5% level.
Overall, the regression model holds in the population, meaning that the linear
model of customer satisfaction depending on customer service, marketing, technical
support and product value is statistically valid.
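The reported F-statistic is consistent with the R-squared: F = (R^2 / k) / ((1 - R^2) / (n - k - 1)) with k = 4 predictors and n = 100 observations. A quick cross-check in Python:

```python
# Cross-check: the overall F-statistic recovered from R-squared
r2, k, n = 0.6605, 4, 100
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(f_stat, 2))  # ~46.21, matching the regression output
```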
R-Code
## =======================================================================
## MINI-PROJECT 2
## MODULE - ADVANCED STATISTICS
## =======================================================================
## Environment set-up
## Read input file "Factor-Hair-Revised"
## Libraries nFactors and psych are required for factor analysis
library(nFactors)
library(psych)
getwd()
Hairdata_original <- read.csv("Factor-Hair-Revised.csv", header = TRUE)
View(Hairdata_original)
attach(Hairdata_original)
str(Hairdata_original)
## Drop the ID and Satisfaction columns, keeping the 11 predictors
Hairdata <- Hairdata_original[, 2:12]
## Correlation matrix and plot
cor(Hairdata)
cor.plot(Hairdata, numbers = TRUE, xlas = 2, upper = FALSE)
## Significance of correlation
## Bartlett's Test
cortest.bartlett(Hairdata, n = 100)
## Eigenvalues of the correlation matrix
HairEigenValue <- eigen(cor(Hairdata))$values
Hairfactor <- seq_along(HairEigenValue)
## Scree Plot
HairScree <- data.frame(Hairfactor, HairEigenValue)
plot(HairScree, col = "red", pch = 18, main = "Scree Plot")
lines(HairScree, col = "blue")
abline(h = 1, col = "purple")
## Loadings
## Unrotated principal loadings
Hair_unrotate <- principal(Hairdata, nfactors = 4, rotate = "none")
print(Hair_unrotate, digits = 5)
UnRotatedprofile <- plot(Hair_unrotate, row.names(Hair_unrotate$loadings))
UnRotatedprofile
## Rotated principal loadings (varimax)
Hair_rotate <- principal(Hairdata, nfactors = 4, rotate = "varimax")
print(Hair_rotate, digits = 5)
par(mfrow = c(1, 2))
fa.diagram(Hair_unrotate, main = "Unrotated factors")
fa.diagram(Hair_rotate, main = "Rotated factors")
## Regression of satisfaction on the rotated factor scores
mydata <- data.frame(Hair_rotate$scores)
mydataforregression <- cbind(mydata, Hairdata_original$Satisfaction)
names(mydataforregression) <- c("customerservice", "marketing", "techsupport",
"productvalue", "customersatisfaction")
attach(mydataforregression)
Hair_lm <- lm(customersatisfaction ~ customerservice + marketing + techsupport +
productvalue)
summary(Hair_lm)
## =======================================================================
## END MINI-PROJECT 2
## =======================================================================