Vijayalakshmi


CASE SNIPPET 1:

Salary is hypothesized to depend on educational qualification and occupation. To understand the


dependency, the salaries of 40 individuals [SalaryData.csv] are collected and each person’s
educational qualification and occupation are noted. Educational qualification has three levels:
High school graduate, Bachelor, and Doctorate. Occupation has four levels: Administrative and
clerical, Sales, Professional or specialty, and Executive or managerial. The number of
observations differs across the education–occupation combinations. [Assume that the data
follow a normal distribution. In reality, the normality assumption may not hold when the
sample size is small.]

Before answering the questions, we perform a basic analysis to understand the dataset.

From this we learn that there are 3 columns with 40 entries each, and there are no null values
in the dataset. Education and Occupation are of the object data type, while Salary is an
integer.

Problem 1.1:

State the null and the alternate hypothesis for conducting one-way ANOVA for both Education
and Occupation individually.

Solution:

Hypothesis for Education:

H0 (Null Hypothesis): The mean salary is the same across all education levels, i.e. salary does not depend on educational qualification.

H1 (Alternate Hypothesis): The mean salary differs for at least one education level, i.e. salary depends on educational qualification.

Hypothesis for Occupation:

H0 (Null Hypothesis): The mean salary is the same across all occupation levels, i.e. salary does not depend on occupation.

H1 (Alternate Hypothesis): The mean salary differs for at least one occupation level, i.e. salary depends on occupation.

Problem 1.2:
Perform a one-way ANOVA on Salary with respect to Education. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
Solution:
                df        sum_sq       mean_sq         F        PR(>F)
C(Education)   2.0  1.026955e+11  5.134773e+10  30.95628  1.257709e-08
Residual      37.0  6.137256e+10  1.658718e+09       NaN           NaN

Since the p value (1.26e-08) is less than the significance level of 0.05, we reject the null
hypothesis and conclude that Salary does depend on educational qualification.

Problem 1.3:
Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.

Solution:
                 df        sum_sq       mean_sq         F    PR(>F)
C(Occupation)   3.0  1.125878e+10  3.752928e+09  0.884144  0.458508
Residual       36.0  1.528092e+11  4.244701e+09       NaN       NaN

Since the p value (0.4585) is greater than the significance level of 0.05, we fail to reject the
null hypothesis and conclude that Salary does not depend on occupation.

Problem 1.4:
If the null hypothesis is rejected in either (2) or in (3), find out which class means are
significantly different. Interpret the result. (Non-Graded)
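This problem has no worked solution in the report. A hedged sketch of the post-hoc step it asks for is a Tukey HSD test on Education (the factor whose null hypothesis was rejected), which reports, pair by pair, which class means differ significantly. Synthetic data again stands in for SalaryData.csv, and column and level names are assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in for SalaryData.csv
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Education": np.repeat(["HS-grad", "Bachelors", "Doctorate"], [13, 13, 14]),
    "Salary": np.concatenate([
        rng.normal(50_000, 5_000, 13),
        rng.normal(90_000, 5_000, 13),
        rng.normal(160_000, 5_000, 14),
    ]),
})

# Pairwise comparisons of group means; reject=True marks significantly
# different pairs at the 5% family-wise error level
tukey = pairwise_tukeyhsd(endog=df["Salary"], groups=df["Education"], alpha=0.05)
print(tukey.summary())
```

Each row of the summary compares one pair of education levels and gives the mean difference, confidence interval, and the reject decision.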
CASE SNIPPET 1B:
Problem 1b.1:

What is the interaction between the two treatments? Analyze the effect of one variable on the
other (Education and Occupation) with the help of an interaction plot. [hint: use the
‘pointplot’ function from the ‘seaborn’ library]

Solution:
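The report does not include the plot itself; a minimal sketch of the interaction plot the hint asks for looks like this, with synthetic data standing in for SalaryData.csv. Lines that cross or have clearly different slopes suggest an Education × Occupation interaction; roughly parallel lines suggest little interaction.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for SalaryData.csv
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "Education": rng.choice(["HS-grad", "Bachelors", "Doctorate"], 40),
    "Occupation": rng.choice(
        ["Adm-clerical", "Sales", "Prof-specialty", "Exec-managerial"], 40),
    "Salary": rng.normal(90_000, 30_000, 40),
})

# One line per Occupation level; the plotted points are group mean salaries
ax = sns.pointplot(data=df, x="Education", y="Salary", hue="Occupation")
plt.savefig("interaction_plot.png")
```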
Problem 1b.2:

Perform a two-way ANOVA based on Salary with respect to both Education and Occupation
(along with their interaction Education*Occupation). State the null and alternative hypotheses
and state your results. How will you interpret this result?

Solution:
                 df        sum_sq       mean_sq          F        PR(>F)
C(Education)    2.0  1.026955e+11  5.134773e+10  31.257677  1.981539e-08
C(Occupation)   3.0  5.519946e+09  1.839982e+09   1.120080  3.545825e-01
Residual       34.0  5.585261e+10  1.642724e+09        NaN           NaN

Considering both factors (Education and Occupation), Education is the significant factor, as its
p value (1.98e-08) is < 0.05, whereas Occupation is not a significant variable, as its p value
(0.3546) is > 0.05.

Problem 1b.3:

Explain the business implications of performing ANOVA for this particular case study.

Solution:

The business implication of the above analysis is that, before fixing an employee’s salary, an
organization should take the employee’s educational qualification into account: the ANOVA shows
that salary depends significantly on education, whereas the designation (occupation) held in the
organization does not, on its own, explain the salary differences.
CASE SNIPPET 2:

The dataset Education - Post 12th Standard.csv contains information on various colleges. You


are expected to do a Principal Component Analysis for this case study according to the
instructions given.

PROBLEM 2.1:

Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed].
What insight do you draw from the EDA?

SOLUTION:

To understand the dataset, we first look at its statistical summary.

Data types and data information:

From the above table it is clear that the dataset contains 777 rows and 17 columns, and there
are no null values in the dataset. All variables are of integer type except S.F.Ratio, whose
data type is float.

Histograms were plotted for every variable; representative examples include Accept, PhD,
Perc.Alumni, and Outstate (plots omitted here).

From this analysis it is understood that each variable is either approximately normally
distributed or left/right skewed.

Multivariate Analysis:

We can also analyze using pairplot for multivariate analysis:


PROBLEM 2.2:

Is scaling necessary for PCA in this case? Give justification and perform scaling.

SOLUTION:

Since the variables have very different means and standard deviations, and their ranges vary
widely, scaling is necessary before PCA: it brings all variables onto the same scale, so that
the variables with the largest raw variances do not dominate the principal components.

The matrix obtained after scaling is as follows:
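A minimal sketch of the scaling step using z-scores (`StandardScaler`). The three columns shown are illustrative stand-ins for the 17 numeric columns of the real file.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for a slice of "Education - Post 12th Standard.csv"
df = pd.DataFrame({
    "Accept": [1200, 1900, 350, 800],
    "Outstate": [7440, 12280, 11250, 12960],
    "S.F.Ratio": [18.1, 12.2, 12.9, 7.7],
})

# z-score: subtract the column mean, divide by the column std
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(scaled.round(2))
```

After scaling, every column has mean 0 and (population) standard deviation 1, which is exactly the "same range" property the justification above relies on.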


Problem 2.3:

Comment on the comparison between the covariance and the correlation matrices from this data.

Solution:

Covariance matrix and correlation matrix of the dataset (outputs omitted here).

From the above output, we can clearly see that the majority of the variables are highly
correlated. The covariance matrix, on the other hand, is harder to interpret directly: its
entries depend on the units and scales of the variables, so the raw values do not convey the
strength of the relationships.
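One way to see why the two matrices line up after scaling: on z-scored data, the covariance matrix equals the correlation matrix. A small synthetic sketch (standing in for the college dataset):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=(100, 3))
x[:, 1] = 2 * x[:, 0] + rng.normal(scale=0.1, size=100)  # a strongly correlated pair

# z-score each column (population std), then take the covariance
z = (x - x.mean(axis=0)) / x.std(axis=0)
cov_z = np.cov(z, rowvar=False, ddof=0)

# Correlation of the *original* data: identical to cov_z
corr = np.corrcoef(x, rowvar=False)
print(np.round(cov_z, 3))
```

This is why, once the data is standardized, PCA on the covariance matrix and PCA on the correlation matrix give the same result.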

Problem 2.4:
Check the dataset for outliers before and after scaling. What insight do you derive here?

Solution:

Boxplots (plots omitted here)

From the box plots, it is clear that many of the variables contain outliers both before and
after scaling (the black scatter points in the box plots mark the outliers). Scaling changes
the scale of each variable but does not remove its outliers.

However, since the instructions say not to eliminate outliers unless asked, we proceed with the
data as it is.
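The before/after comparison can be sketched as two side-by-side boxplots; synthetic data with a planted outlier stands in for the college dataset. The same row stays flagged in both panels, illustrating that scaling only rescales the axes.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["Accept", "Outstate", "S.F.Ratio"])
df.iloc[0] = [8.0, 9.0, 10.0]  # plant an obvious outlier row

scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df.boxplot(ax=axes[0])
axes[0].set_title("Before scaling")
scaled.boxplot(ax=axes[1])
axes[1].set_title("After scaling")
fig.savefig("boxplots.png")
```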

Problem 2.5 :
Perform PCA and export the data of the Principal Component scores into a data frame.

Extract the eigenvalues and eigenvectors.[print both]

Solution:

Principal Component matrix (obtained after PCA of main dataset in dataframe) is as follows:

Eigenvectors (stored as the columns of the 17×17 matrix below):

[[-6.84622439e-02,  2.13767089e-01,  3.04972237e-03, -1.86732482e-01,  8.75794411e-02, -8.64946288e-03,  3.31441114e-02, -4.85110244e-02,  6.38613800e-02, -4.74278428e-03,  7.04245678e-02,  4.35902670e-02, -1.80258214e-01,  1.89437570e-01, -5.81812205e-01,  2.63532465e-01,  6.45171674e-01],
 [-5.45226942e-02,  2.67634316e-01,  3.36326579e-02, -2.10569610e-01,  1.46921061e-01, -6.28667146e-03,  5.14109360e-02, -9.18269071e-02,  1.00600486e-02,  2.64870698e-02, -2.35358564e-02, -4.44916040e-02, -2.39988072e-01,  1.52165425e-01, -4.92142782e-01, -3.69540234e-02, -7.20814074e-01],
 [-5.95619427e-02,  4.62062818e-01, -3.04373473e-02, -2.61365642e-01,  2.57790627e-01, -7.53531308e-02,  4.97284852e-02, -8.13982545e-02,  3.74691405e-02,  1.30424797e-02, -9.34664446e-03, -1.10437618e-02, -9.49438736e-02, -1.03488038e-01,  2.21509919e-01, -7.17976665e-01,  2.15440433e-01],
 [-3.92335224e-01,  4.82976775e-02, -3.86227996e-01, -3.74801619e-02, -2.73994220e-01, -8.21348432e-02,  2.15340219e-02, -9.56299177e-03,  1.17385358e-01, -1.65612206e-01,  4.37064614e-01,  3.40840838e-01,  4.55941155e-02, -4.86364097e-01, -1.32326033e-01, -1.80844037e-03, -7.90092535e-02],
 [-4.50040853e-01,  1.22251513e-01, -5.00999551e-01,  2.86806970e-02, -3.35673334e-01,  7.08368508e-02, -2.12908838e-01, -8.17591287e-02, -1.56054234e-01,  1.28622001e-01, -4.09791493e-01, -2.32735366e-01, -9.10555804e-03,  3.03975769e-01,  6.71208627e-02, -1.74892712e-02,  2.64360728e-02],
 [-4.41063922e-02,  5.00826888e-01, -2.59169587e-02, -2.40563029e-01,  2.35602171e-01, -8.70468752e-02,  1.20023131e-03, -5.16875027e-02,  1.46856990e-02,  1.54703554e-02, -5.20439997e-02, -1.02771938e-02,  1.08236919e-01, -1.44554412e-01,  4.26447647e-01,  6.32319274e-01, -7.32157463e-02],
 [ 1.68489663e-02,  1.45695544e-01,  5.67865856e-02, -3.87473700e-02,  2.92442157e-02, -1.00427764e-01, -5.50643563e-02,  5.98776194e-02, -3.27920521e-02, -4.71280335e-02, -1.08428010e-01, -8.86508224e-02,  9.06786638e-01, -2.18665506e-02, -3.15211227e-01, -1.04418195e-01, -1.70608181e-02],
 [-4.00717980e-01, -2.99219672e-01,  2.65519582e-01, -2.70470155e-01,  2.60933060e-02, -2.16277995e-01,  4.57759303e-01, -2.49308966e-01, -5.15565220e-01,  4.59238985e-02, -6.35026310e-02, -5.49306290e-02,  2.66195786e-02, -9.48367705e-02,  2.27662723e-02,  2.36557041e-02,  3.65430664e-02],
 [-2.57512558e-01, -1.26895167e-01,  4.85090810e-01, -4.34826126e-01, -1.88493704e-01,  1.18591135e-02, -6.34769427e-01,  1.16478121e-01,  1.13275987e-01, -1.35901561e-01, -3.81471713e-02,  2.62057466e-02, -5.31919911e-02, -4.29113497e-02,  5.06421199e-02, -1.79135973e-02, -5.40619707e-03],
 [-1.78659636e-02,  3.13632559e-02,  5.54374810e-03, -5.98432985e-02, -7.20302892e-02, -8.05531059e-02, -8.54008112e-02,  2.41513253e-01, -9.08249553e-02,  5.88307658e-01,  5.55023373e-01, -5.04944614e-01,  1.13152461e-02, -1.50478134e-02,  7.02107771e-03,  2.27173069e-03,  1.44058416e-03],
 [ 4.97872663e-02,  1.48407808e-01, -3.31875630e-02, -1.41732335e-02, -8.47063350e-02, -2.52298177e-01,  9.98295497e-02,  8.19860448e-01, -3.64002515e-01, -1.71055288e-01, -1.46352942e-01,  1.37824614e-01, -1.33809171e-01,  1.25668786e-02, -2.95527406e-02, -9.28586082e-03, -5.69794709e-03],
 [-3.00457520e-01,  2.36738735e-01,  2.73986367e-01,  4.26742928e-01, -1.67730234e-02,  1.03877767e-01,  9.55755633e-02,  1.34314672e-02,  2.06516075e-02, -5.50007904e-01,  1.91785114e-01, -4.81496847e-01, -5.65990877e-02,  2.50936465e-02,  2.29350190e-02,  8.76062155e-03,  1.10967720e-02],
 [-3.40704931e-01,  2.48256809e-01,  3.97160064e-01,  4.92759437e-01,  1.89748481e-03,  4.27155056e-02, -5.97475319e-03,  4.52132623e-02,  4.90411316e-02,  4.86141909e-01, -1.41131130e-01,  3.92519153e-01, -3.71327092e-03, -5.87912239e-02, -3.82965362e-02, -1.29087605e-02,  9.21275404e-04],
 [ 1.21285741e-01,  1.53648296e-01, -1.38465070e-02,  7.43784261e-02,  6.27199479e-02,  2.99025699e-01, -3.11154810e-01, -1.89409802e-01, -6.61248821e-01, -1.04253889e-01,  3.65038679e-01,  2.95470587e-01,  6.49899529e-02,  2.32302965e-01,  4.93418828e-02, -2.23873037e-02, -1.07419610e-02],
 [-2.90468404e-01, -2.99832292e-01, -2.21611856e-01,  1.95481298e-01,  7.32261808e-01, -2.83776732e-01, -3.30857090e-01,  8.60094384e-02, -7.43125629e-03, -3.90288957e-02,  2.59400842e-02, -1.31183861e-02, -1.68504542e-02,  1.98986803e-02, -3.83546424e-02,  1.39240115e-02, -5.91725448e-03],
 [-1.75716049e-01, -3.56634896e-02,  3.45969814e-02, -9.30664459e-02, -6.63732329e-02, -2.47203488e-01,  2.11443968e-01,  5.93298116e-02,  2.93885493e-01, -6.16332067e-02,  3.08809752e-01,  2.56797337e-01,  1.51086553e-01,  7.14123703e-01,  2.36229185e-01, -2.72781214e-02, -5.85347628e-02],
 [-2.53182160e-01, -1.36899634e-01, -6.01087629e-02, -2.18042058e-01,  2.58397482e-01,  7.77206748e-01,  2.34649399e-01,  3.38105908e-01,  9.54042443e-02,  4.11940339e-02,  5.68293330e-03,  2.63907146e-02,  1.12411958e-01,  1.68742916e-02,  3.58348927e-02,  3.90162627e-04, -8.55466507e-03]]

Eigenvalues:

[0.16638095, 0.07915147, 0.03180768, 0.02573551, 0.02131162,
 0.01437382, 0.01024677, 0.00897856, 0.00674422, 0.00554829,
 0.00464682, 0.00383839, 0.00300555, 0.00246455, 0.00183457,
 0.00065865, 0.00022908]
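A minimal sketch of how this output can be produced: eigendecompose the covariance matrix of the scaled data (which yields eigenvectors as columns, matching the layout above), sort by eigenvalue, and export the component scores to a DataFrame. Synthetic data stands in for the real 777×17 scaled matrix, so the numbers will differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
scaled = rng.normal(size=(100, 5))  # stand-in for the scaled matrix

# Eigendecomposition of the covariance matrix; eigenvectors are columns
cov = np.cov(scaled, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort descending so column 0 of `eigenvectors` is the first PC
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Principal component scores, exported to a DataFrame
scores = pd.DataFrame(
    scaled @ eigenvectors,
    columns=[f"PC{i + 1}" for i in range(scaled.shape[1])],
)
print(eigenvalues.round(4))
print(eigenvectors.round(2))
```

`sklearn.decomposition.PCA` gives the same result via `explained_variance_` and `components_` (with components stored as rows rather than columns).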

Problem 2.6:
Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two
places of decimals only).

Solution:

Each column of the eigenvector matrix corresponds to a principal component. To arrange the
principal components in order of the variability they capture, we sort the eigenvalues in
descending order and reorder the columns accordingly.

Hence the first column in the rearranged eigenvector matrix is the principal component that
captures the highest variability.

Explicit form of the first PC = [-0.07, -0.05, -0.06, -0.39, -0.45, -0.04, 0.02, -0.4 , -0.26, -0.02,
0.05, -0.3 , -0.34, 0.12, -0.29, -0.18, -0.25]

Problem 2.7:

Consider the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate?

Solution:

We can select a subset of the rearranged eigenvalue matrix as per our need, e.g.
number_comp = 2 selects the first two principal components.

From the plot of cumulative eigenvalues, we see that after 8-10 components the cumulative sum
almost converges (flattens out): components beyond that point add very little explained
variance, and this gives us the optimum number of principal components.
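The cumulative-eigenvalue check can be sketched as follows, using a 90% explained-variance cutoff as an illustrative threshold; synthetic data stands in for the scaled college dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
scaled = rng.normal(size=(200, 17))  # stand-in for the scaled matrix

pca = PCA().fit(scaled)

# Cumulative share of variance captured by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative share reaches 90%
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3))
print("components for 90% variance:", n_components)
```

Plotting `cumulative` against the component index gives the elbow/convergence plot described above; the chosen cutoff (90% here) is a judgment call, not a fixed rule.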
Problem 2.9 :

Explain the business implication of using the Principal Component Analysis for this case study.
How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal
Components Obtained]

Solution:

The business implication of this analysis is that PCA lets us reduce the dimensionality of the
dataset: the 17 original variables can be summarized by 8 to 10 principal components while
retaining most of the information. Doing so drastically reduces the time, complexity, and cost
of any further analysis built on top of the component scores.
