Vijayalakshmi
Before answering the questions, we performed a basic analysis to understand the given
dataset.
From this we learned that there are 3 columns with 40 entries each and no null values in
the dataset. The data type of Education and Occupation is object, and Salary is an
integer.
Problem 1.1:
State the null and the alternate hypothesis for conducting one-way ANOVA for both Education
and Occupation individually.
Solution:
For Education: the null hypothesis (H0) is that the mean salary is the same across all education levels; the alternate hypothesis (Ha) is that the mean salary of at least one education level differs from the others.
For Occupation: the null hypothesis (H0) is that the mean salary is the same across all occupations; the alternate hypothesis (Ha) is that the mean salary of at least one occupation differs from the others.
Problem 1.2:
Perform a one-way ANOVA on Salary with respect to Education. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
Solution:
              df    sum_sq        mean_sq       F         PR(>F)
C(Education)   2.0  1.026955e+11  5.134773e+10  30.95628  1.257709e-08
Residual      37.0  6.137256e+10  1.658718e+09  NaN       NaN
Since the p-value (1.26e-08) is less than the significance level of 0.05, we reject the null
hypothesis and conclude that Salary does depend on the education qualification.
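The ANOVA table above can be reproduced with a sketch like the following; the data here are hypothetical stand-ins for the case-study file (column names Education and Salary are assumed from the problem statement):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical salary data with three education levels (10 records each)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Education": np.repeat(["HS-grad", "Bachelors", "Doctorate"], 10),
    "Salary": np.concatenate([
        rng.normal(40000, 5000, 10),
        rng.normal(70000, 5000, 10),
        rng.normal(110000, 5000, 10),
    ]),
})

# Fit Salary ~ C(Education) and build the one-way ANOVA table
model = ols("Salary ~ C(Education)", data=df).fit()
table = anova_lm(model, typ=1)
print(table)

# Reject H0 when the p-value falls below the 5% significance level
p_value = table.loc["C(Education)", "PR(>F)"]
print("Reject H0:", p_value < 0.05)
```

With the real dataset (40 rows), the degrees of freedom become 2 for Education and 37 for the residual, as in the table above.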
Problem 1.3:
Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
Solution:
               df    sum_sq        mean_sq       F         PR(>F)
C(Occupation)   3.0  1.125878e+10  3.752928e+09  0.884144  0.458508
Residual       36.0  1.528092e+11  4.244701e+09  NaN       NaN
Since the p-value (0.4585) is greater than the significance level of 0.05, we fail to reject
the null hypothesis and conclude that Salary does not depend on the occupation.
Problem 1.4:
If the null hypothesis is rejected in either (2) or in (3), find out which class means are
significantly different. Interpret the result. (Non-Graded)
CASE SNIPPET 1B:
Problem 1b.1:
What is the interaction between two treatments? Analyze the effects of one variable on the other
(Education and Occupation) with the help of an interaction plot. [Hint: use the 'pointplot' function
from the 'seaborn' library.]
Solution:
Problem 1b.2:
Perform a two-way ANOVA based on Salary with respect to both Education and Occupation
(along with their interaction Education*Occupation). State the null and alternative hypotheses
and state your results. How will you interpret this result?
Solution:
               df    sum_sq        mean_sq       F          PR(>F)
C(Education)    2.0  1.026955e+11  5.134773e+10  31.257677  1.981539e-08
C(Occupation)   3.0  5.519946e+09  1.839982e+09  1.120080   3.545825e-01
Residual       34.0  5.585261e+10  1.642724e+09  NaN        NaN
Considering both factors (Education and Occupation), Education is a significant factor since its
p-value (1.98e-08) is < 0.05, whereas Occupation is not significant since its p-value (0.355) is > 0.05.
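A two-way ANOVA including the interaction term can be sketched as follows, again with hypothetical stand-in data (the formula syntax is the statsmodels convention for the Education*Occupation interaction):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical balanced design: 3 education levels x 4 occupations x 3 reps
rng = np.random.default_rng(0)
edu = np.repeat(["HS-grad", "Bachelors", "Doctorate"], 12)
occ = np.tile(["Sales", "Exec-managerial", "Prof-specialty", "Adm-clerical"], 9)
salary = rng.normal(60000, 10000, 36) + np.where(edu == "Doctorate", 40000, 0)
df = pd.DataFrame({"Education": edu, "Occupation": occ, "Salary": salary})

# Two-way ANOVA with main effects and the Education:Occupation interaction
model = ols("Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)",
            data=df).fit()
table = anova_lm(model, typ=2)
print(table)
```

Each row of the resulting table carries its own F statistic and p-value, so the main effects and the interaction can be judged against the 5% significance level separately.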
Problem 1b.3:
Explain the business implications of performing ANOVA for this particular case study.
Solution:
The business implication of the above analysis is that, before fixing an employee's salary, an
organization should take the employee's education qualification into account, since the ANOVA
shows it to be a significant driver of salary, whereas occupation on its own does not explain the
salary differences in this data.
CASE SNIPPET 2:
PROBLEM 2.1:
Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed].
What insight do you draw from the EDA?
SOLUTION:
To understand the dataset we first obtained its statistical summary, which produced the
following outputs.
Data types and data information:
From the above table it is clear that the dataset contains 17 columns, with the row index ranging
from 0 to 777, and there are no null values in the data provided. All variables are of integer type
except S.F.Ratio, whose type is float.
Univariate distribution plots were drawn for variables including Accept, Phd, Perc.Alumni, and
Outstate.
From this analysis it is understood that the variables are either approximately normally
distributed or skewed to the left or right.
Multivariate Analysis:
Is scaling necessary for PCA in this case? Give justification and perform scaling.
SOLUTION:
Since the variables have different means and standard deviations, and their ranges differ
substantially, scaling is necessary for PCA so that all variables are transformed onto a
comparable scale and no single wide-range variable dominates the components.
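The scaling step can be sketched with scikit-learn's StandardScaler; the column names below are hypothetical stand-ins for the dataset's features:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric columns with very different ranges
df = pd.DataFrame({
    "Apps":      [1660, 2186, 1428, 417],
    "Outstate":  [7440, 12280, 11250, 12960],
    "S.F.Ratio": [18.1, 12.2, 12.9, 7.7],
})

# z-score scaling: every column ends up with mean 0 and unit variance
scaled = StandardScaler().fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)
print(scaled_df.mean().round(6))
print(scaled_df.std(ddof=0).round(6))
```

After this transform, each feature contributes to the PCA on the same footing regardless of its original units.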
Comment on the comparison between the covariance and the correlation matrices from this data.
Solution:
Covariance matrix
From the above output we can clearly see that the majority of the variables are highly correlated.
The covariance matrix, on the other hand, is harder to interpret directly, because its values depend
on the units and scales of the variables rather than on the strength of the relationships.
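The contrast can be illustrated with a small hypothetical two-variable sample: the covariance value reflects the (arbitrary) units, while the correlation is scale-free and bounded in [-1, 1]:

```python
import numpy as np
import pandas as pd

# Two strongly related hypothetical features on very different scales
rng = np.random.default_rng(3)
x = rng.normal(0, 1, 200)
df = pd.DataFrame({
    "Apps":   1000 * x + rng.normal(0, 100, 200),
    "Accept":  800 * x + rng.normal(0, 100, 200),
})

# Covariance grows with the units; correlation stays in [-1, 1]
print(df.cov())   # large, unit-dependent numbers
print(df.corr())  # off-diagonal close to 1 for this pair
```

This is why the correlation matrix is the more readable summary for unscaled data, and why, after z-scaling, the covariance and correlation matrices coincide.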
Problem 2.4:
Check the dataset for outliers before and after scaling. What insight do you derive here?
Solution:
Boxplots
From the charts it is very obvious that many outliers are present in the majority of the variables;
the black scatter points in the box plots represent the outliers before and after scaling.
Since the instructions state not to eliminate outliers unless explicitly asked, we proceed with the
data as is.
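The key insight here, that z-scaling changes the scale of the data but not which points are outliers, can be verified with a small sketch using the standard 1.5*IQR box-plot rule (the data below are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def iqr_outlier_count(s):
    """Count points outside the 1.5*IQR whiskers of a box plot."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

# Hypothetical feature: mostly normal values plus a few extreme points
rng = np.random.default_rng(7)
s = pd.Series(np.concatenate([rng.normal(0, 1, 95), [8, 9, 10, -8, -9]]))

before = iqr_outlier_count(s)
scaled = pd.Series(StandardScaler().fit_transform(s.to_frame()).ravel())
after = iqr_outlier_count(scaled)

# z-scaling is a linear transform, so the same points remain outliers
print(before, after)
```

Because scaling subtracts the mean and divides by the standard deviation, quartiles, IQR, and whiskers all shift together, so the outlier count is identical before and after.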
Problem 2.5 :
Perform PCA and export the data of the Principal Component scores into a data frame.
Solution:
Principal Component matrix (obtained after PCA of main dataset in dataframe) is as follows:
Eigenvectors: the 17 x 17 loading matrix from the PCA output, one column per principal
component (full matrix omitted here).
Eigenvalues: the variance captured by each component, sorted in decreasing order, with the
smallest values on the order of 1e-04 (full list omitted here).
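The PCA fit and the export of component scores into a data frame can be sketched as follows; the matrix here is a hypothetical standardized stand-in for the scaled dataset:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix (rows = institutions, columns = features)
rng = np.random.default_rng(42)
X = StandardScaler().fit_transform(rng.normal(size=(100, 5)))

# Fit PCA and export the principal-component scores as a data frame
pca = PCA()
scores = pca.fit_transform(X)
scores_df = pd.DataFrame(scores,
                         columns=[f"PC{i + 1}" for i in range(scores.shape[1])])
print(scores_df.head())

# Loadings (eigenvectors, one row per PC) and eigenvalues for later steps
eigenvectors = pca.components_
eigenvalues = pca.explained_variance_
```

The `scores_df` frame is the exported Principal Component score matrix; `eigenvectors` and `eigenvalues` correspond to the matrices shown above.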
Problem 2.6:
Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two
places of decimals only).
Solution:
Hence the first column of the rearranged eigenvector matrix is the principal component that
captures the highest variability.
Explicit form of the first PC = [-0.07, -0.05, -0.06, -0.39, -0.45, -0.04, 0.02, -0.40, -0.26, -0.02,
0.05, -0.30, -0.34, 0.12, -0.29, -0.18, -0.25]
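Writing the first PC explicitly as a linear combination of the original features can be sketched like this (a hypothetical 4-feature example; with the real data the loop runs over all 17 features):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 4-feature sample
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
pca = PCA().fit(X)

# First PC as an explicit weighted sum of the original features,
# with loadings rounded to two decimal places as the problem asks
loadings = pca.components_[0].round(2)
features = [f"x{i + 1}" for i in range(X.shape[1])]
pc1 = " + ".join(f"({w})*{name}" for w, name in zip(loadings, features))
print("PC1 =", pc1)
```

Each eigenvector has unit length, so the loadings can be read directly as the relative weight of each original variable in the component.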
Problem 2.7:
Consider the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate?
Solution:
A subset of the rearranged eigenvalue matrix can be selected as needed; for example,
number_comp = 2 selects the first two principal components.
From the plot of cumulative eigenvalues, we observe that after 8-10 components the cumulative
explained variance almost converges, and this gives us the optimum number of principal
components. The eigenvectors indicate the loading (weight) of each original variable in a
principal component, i.e. how much each variable contributes to that component.
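The cumulative-eigenvalue criterion can be sketched as follows, using a hypothetical data matrix and an assumed 90% variance threshold:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 10-feature sample
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 90%
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3))
print("Components for 90% variance:", n_components)
```

Plotting `cumulative` against the component index gives the elbow-style curve referred to above; the component count where the curve flattens is the chosen dimensionality.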
Problem 2.9 :
Explain the business implication of using the Principal Component Analysis for this case study.
How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal
Components Obtained]
Solution:
The business implication from this analysis is that PCA lets us reduce the dimensionality of the
given dataset, summarizing the original variables with 8 to 10 principal components. Doing so
drastically reduces the time, complexity, and cost of any further analysis while retaining almost
all of the variability in the data.