Data Science Unit-II

Instructor: Krishna Dutt, krishnadutt.rvs@gmail.com

March 23, 2024

Disclaimer: The views expressed in this presentation are those of the author. Much open-source content is referenced, with all authors duly acknowledged.
UNIT - II Syllabus

Multiple linear regression: estimation and testing of coefficients; R² and adjusted R²
Logistic regression: estimation and testing of coefficients; interpretation of coefficients
K-Nearest Neighbor classifier
Random forest
Classification errors
Ridge regression
Support Vector Machine
Analysis of Variance (ANOVA)
Linear Regression - Single Variable

Data / features / dimensions: represent the state of real-world phenomena
Representable in primitive formats like numbers, characters, strings
Composite formats: image, video, sound, text, etc.
Data can be in structured or non-structured (SQL / NoSQL) formats
Both the input and the corresponding output can be considered as data
In some cases there is no explicit output data, but a goal shall be considered
Usually an n-dimensional input and an m-dimensional output
Assume all data is converted to numerical format

X      Y      Y − Ỹ
x1     y1     (y1 − h(θ0, θ1, x1))
x2     y2     (y2 − h(θ0, θ1, x2))
···    ···    ···
xm     ym     (ym − h(θ0, θ1, xm))
Linear Regression - Single Variable

Assume all data is converted to numerical format. In linear regression, we assume a model

Ỹi = h(θ0, θ1, xi) = θ0 + θ1·xi   (1)

The model is described by two parameters, θ0 and θ1, which are unknown. The goal is to estimate these unknowns from the given data. In the graph, which line represents the best fit for the given data?
Matrix Calculus required for Linear Regression

We need matrix calculus to solve regression problems; accordingly, we provide the formulas for differentiating a scalar function of a vector variable. Consider a vector variable X and a constant vector A, with y a scalar function of X:

y = Xᵀ · A = Aᵀ · X;  X = (x1, x2, ···, xm)ᵀ;  A = (a1, a2, ···, am)ᵀ

∂(Aᵀ · X)/∂X = ∂(Xᵀ · A)/∂X = A   (2)

∂(Xᵀ · A · X)/∂X = 2 · A · X   (3)

(In (3), A is a symmetric matrix; for a general matrix the gradient is (A + Aᵀ) · X.)
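Identities (2) and (3) can be spot-checked numerically. Below is a minimal sketch (ours, not from the slides) comparing each analytic gradient with a central finite-difference estimate; the helper name num_grad is hypothetical:

```python
import numpy as np

def num_grad(f, x, h=1e-6):
    # Central finite-difference approximation of the gradient of scalar f at x.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
m = 4
X = rng.normal(size=m)          # vector variable X
a = rng.normal(size=m)          # constant vector A in eq. (2)
S = rng.normal(size=(m, m))
S = S + S.T                     # symmetric matrix A in eq. (3)

# Eq. (2): d(A^T X)/dX = A
print(np.allclose(num_grad(lambda x: a @ x, X), a))              # True
# Eq. (3): d(X^T A X)/dX = 2 A X, for symmetric A
print(np.allclose(num_grad(lambda x: x @ S @ x, X), 2 * S @ X))  # True
```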
Linear Regression - Uni-variate

Consider the uni-variate case. The linear model is defined by

ỹi = h(θ0, θ1, xi) = θ0 + θ1·xi   (4)

where ỹi and xi are the i-th response and predictor values, respectively, from the data set. We consider a dataset of m values, and the corresponding vectors are

X = (x1, x2, ···, xm)ᵀ;  Y = (y1, y2, ···, ym)ᵀ;  θ = (θ0, θ1)ᵀ   (5)

Recast X as

X = [ 1, x1
      1, x2
      ···
      1, xm ]   (6)
Linear Regression - Ordinary Least-Squares Minimization

Ỹ = X · θ   (7)

The deviation between the actual and model-predicted values is given by

ϵ = (Y − Ỹ)   (8)
Linear Regression .. contd.

In an ordinary least-squares fit of the linear model we minimize J(θ); the cost function is convex, so an explicit solution is easy to obtain by equating the gradient of J(θ) to 0.

J(θ) = ϵᵀ·ϵ = (Y − Ỹ)ᵀ·(Y − Ỹ) = (Yᵀ − Ỹᵀ)(Y − Ỹ)
     = Yᵀ·Y − Yᵀ·Ỹ − Ỹᵀ·Y + Ỹᵀ·Ỹ
     = Yᵀ·Y − Yᵀ·(X·θ) − (X·θ)ᵀ·Y + (X·θ)ᵀ·(X·θ)
     = Yᵀ·Y − Yᵀ·X·θ − θᵀ·Xᵀ·Y + θᵀ·Xᵀ·X·θ   (9)

∂J(θ)/∂θ = −∂(Yᵀ·X·θ)/∂θ − ∂(θᵀ·Xᵀ·Y)/∂θ + ∂(θᵀ·Xᵀ·X·θ)/∂θ
         = −(Xᵀ·Y) − (Xᵀ·Y) + 2·Xᵀ·X·θ = −2·(Xᵀ·Y) + 2·(Xᵀ·X·θ)   (10)

∂J(θ)/∂θ = 0:  −2·(Xᵀ·Y) + 2·(Xᵀ·X·θ) = 0   (11)

(Xᵀ·Y) = (Xᵀ·X)·θ;  θ = (Xᵀ·X)⁻¹·(Xᵀ·Y)   (12)
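Equation (12) translates directly into code. A minimal NumPy sketch, assuming synthetic data and our own function name fit_ols:

```python
import numpy as np

def fit_ols(x, y):
    # Recast x as the design matrix of eq. (6): a column of ones, then x.
    X = np.column_stack((np.ones(len(x)), x))
    # Normal equations of eq. (12): (X^T X) theta = X^T Y.
    # Solving the linear system is numerically preferable to an explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=50)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=50)  # true theta = (3, 2)
print(fit_ols(x, y))   # approximately [3., 2.]
```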
Linear Regression - Multivariate

Consider multivariate data with n features and m observations:

X = [ x11, x12, ···, x1n
      x21, x22, ···, x2n
      ···
      xm1, xm2, ···, xmn ];  Y = (y1, y2, ···, ym)ᵀ;  θ = (θ0, θ1, ···, θn)ᵀ   (13)

Recast X as

X = [ 1, x11, x12, ···, x1n
      1, x21, x22, ···, x2n
      ···
      1, xm1, xm2, ···, xmn ];  Ỹ = X · θ;  ϵ = Y − Ỹ   (14)

J(θ) = ϵᵀ · ϵ   (15)

The objective J(θ) is still a scalar function of the vector θ. The earlier solution for θ still holds!
Linear Regression - Example

Consider a uni-variate hypothetical example.

X = (1, 2, 4, 5)ᵀ;  Y = (2, 3, 5, 6)ᵀ;  θ = (θ0, θ1)ᵀ;  recast X as

X = [ 1, 1
      1, 2
      1, 4
      1, 5 ]   (16)

Xᵀ·X = [ 4, 12
         12, 46 ]   (17)

(Xᵀ·X)⁻¹ = (1/40) · [ 46, −12
                      −12,  4 ]   (18)

Xᵀ·Y = [ 16
         58 ]   (19)
Linear Regression - Example - contd..

θ = (Xᵀ·X)⁻¹·Xᵀ·Y = (1/40) · [[46, −12], [−12, 4]] · (16, 58)ᵀ = (1, 1)ᵀ   (20)
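The arithmetic in eqs. (17)-(20) can be verified with a few lines of NumPy (an illustrative sketch):

```python
import numpy as np

# Step-by-step check of eqs. (17)-(20) for the worked example.
X = np.array([[1, 1], [1, 2], [1, 4], [1, 5]], dtype=float)
Y = np.array([2, 3, 5, 6], dtype=float)

XtX = X.T @ X                  # [[4, 12], [12, 46]], eq. (17)
XtX_inv = np.linalg.inv(XtX)   # (1/40)[[46, -12], [-12, 4]], eq. (18)
XtY = X.T @ Y                  # [16, 58], eq. (19)
theta = XtX_inv @ XtY          # eq. (20)
print(theta)                   # [1. 1.]
```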
The following bi-variate problem can be solved using the above steps:

X = [ 1, 2
      2, 3
      4, 5
      5, 6 ];  Y = (3, 4, 5, 6)ᵀ   (21)

Verify that the answer for the above bi-variate problem is

θ = (1.7, 0, 0.7)ᵀ   (22)

(Note that here the second column equals the first plus one, so after adding the intercept column Xᵀ·X is singular; the least-squares solution is not unique, and the θ above is one valid minimizer.)
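Because of the collinearity noted above, a direct inverse fails here. A sketch using np.linalg.lstsq (which returns the minimum-norm least-squares solution) confirms that the stated θ attains the same minimal residual norm:

```python
import numpy as np

# With the intercept column, column3 = column2 + column1 below, so X^T X is
# singular and the least-squares solution is not unique. lstsq returns the
# minimum-norm solution; we check that the slide's theta = (1.7, 0, 0.7)
# attains the same (minimal) residual norm.
X = np.array([[1, 1, 2], [1, 2, 3], [1, 4, 5], [1, 5, 6]], dtype=float)
y = np.array([3, 4, 5, 6], dtype=float)

theta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)
theta_slide = np.array([1.7, 0.0, 0.7])
print(np.linalg.norm(y - X @ theta_min_norm))  # ~0.316
print(np.linalg.norm(y - X @ theta_slide))     # same residual norm, ~0.316
```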
Bivariate Least square - Example

Another bi-variate problem can be solved using the above steps: marks obtained in theory and lab, and the corresponding course grade.

Theory   Lab   Grade
60       70    A
70       75    A
40       55    B
30       60    F

X = [ 60, 70
      70, 75
      40, 55
      30, 60 ];  Y = (3, 3, 2, 0)ᵀ   (23)

In the above, the ordinal grade values are transformed into numerical values (A = 3, B = 2, F = 0) to facilitate linear regression.
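A hedged sketch of this fit, with the grade encoding A = 3, B = 2, F = 0 taken from the slide and everything else our own choice:

```python
import numpy as np

# Grades example: ordinal grades encoded as numbers (A=3, B=2, F=0, per slide).
X = np.array([[60, 70], [70, 75], [40, 55], [30, 60]], dtype=float)
Y = np.array([3, 3, 2, 0], dtype=float)

Xd = np.column_stack((np.ones(len(Y)), X))       # add intercept column
theta, *_ = np.linalg.lstsq(Xd, Y, rcond=None)   # least-squares fit
print(theta)                                     # fitted coefficients
print(Xd @ theta)                                # fitted (numeric) grades
```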
Ridge Regression

The least-squares regression above minimizes the deviations of the model-predicted values with respect to the actual data points, but the problem of overfitting cannot be avoided. An additional loss term consisting of the model parameters alone is added to the objective to penalize large parameters and control the overfitting.

J(θ) = ϵᵀ·ϵ + λ·θᵀ·θ   (24)

The least-squares fit is obtained, as earlier, by solving

∂J(θ)/∂θ = 0   (25)

The first part, ∂(ϵᵀ·ϵ)/∂θ, is as derived earlier; the penalty term contributes ∂(λ·θᵀ·θ)/∂θ = 2·λ·θ, so

−2·(Xᵀ·Y) + 2·(Xᵀ·X·θ) + 2·λ·θ = 0   (26)

Therefore the total solution with ridge regression is obtained as

θ = (Xᵀ·X + λ·I)⁻¹·(Xᵀ·Y)   (27)
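A minimal sketch of eq. (27); fit_ridge is our own name, and note that in practice the intercept is often left unpenalized, whereas this literal implementation penalizes it too:

```python
import numpy as np

def fit_ridge(X, y, lam):
    # Eq. (27): theta = (X^T X + lam * I)^{-1} X^T Y, via a linear solve.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

X = np.array([[1, 1], [1, 2], [1, 4], [1, 5]], dtype=float)  # intercept column included
y = np.array([2, 3, 5, 6], dtype=float)
for lam in (0.0, 0.1, 1.0, 10.0):
    print(lam, fit_ridge(X, y, lam))  # coefficients shrink toward zero as lam grows
```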
Hypothesis Testing in Linear Regression: T-Test

Objective: Assess the significance of individual coefficients in linear regression.

Hypothesis Testing for Individual Coefficients:
Null Hypothesis (H0): the coefficient is equal to zero (βi = 0).
Alternative Hypothesis (H1): the coefficient is not equal to zero (βi ≠ 0).

T-Test Statistic for Linear Regression:

ti = β̂i / Standard Error(β̂i)

Degrees of Freedom:

df = n − p − 1

Decision Rule:
If |ti| is significantly different from zero, reject H0 in favor of H1.
Common significance levels include 0.05, 0.01, etc.

Interpretation:
A significant coefficient suggests that the corresponding predictor is associated with the response variable.
Numerical Example: Hypothesis Testing in Linear Regression

Scenario: We have a linear regression model with a single predictor X and the response variable Y.

Hypotheses:
Null Hypothesis (H0): β1 = 0 (no relationship between X and Y).
Alternative Hypothesis (H1): β1 ≠ 0 (there is a significant relationship).

Given:
Sample size (n) = 50
Estimated coefficient (β̂1) = 2.5
Standard error (SE(β̂1)) = 1.2

Test Statistic:

t = β̂1 / SE(β̂1) = 2.5 / 1.2 ≈ 2.083

Decision Rule:
If |t| > tα/2,df, reject H0.
For α = 0.05 and df = 48, tα/2,df ≈ 2.011.

Conclusion:
Since |t| > 2.011, we reject H0 in favor of H1. There is sufficient evidence to suggest a significant relationship between X and Y.
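The same computation with SciPy (a sketch; scipy.stats is our choice of tool):

```python
from scipy import stats

# Reproducing the worked example: t = beta1_hat / SE(beta1_hat),
# df = n - 2 for one predictor plus an intercept.
beta1_hat, se, n, alpha = 2.5, 1.2, 50, 0.05
t = beta1_hat / se                              # ~2.083
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # ~2.011
p_value = 2 * stats.t.sf(abs(t), df=n - 2)      # ~0.043
print(t, t_crit, p_value, abs(t) > t_crit)      # reject H0 at the 5% level
```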
Correlation Coefficient R

R = (X − 1µX)ᵀ·(Y − 1µY) / √[ (X − 1µX)ᵀ·(X − 1µX) · (Y − 1µY)ᵀ·(Y − 1µY) ]

Range of R:

−1 ≤ R ≤ 1

R measures the strength and direction of the linear relationship between X and Y. It doesn't directly tell the proportion of variance explained by the relationship.
Correlation Coefficient R and R² - another formulation

R = (Xcᵀ·Yc) / √[ (Xcᵀ·Xc)·(Ycᵀ·Yc) ];  Xc = (X − 1·µx);  Yc = (Y − 1·µy)   (28)

cov(X, Y) = (Xcᵀ·Yc) / n   (29)

var(X) = (X − 1µx)ᵀ(X − 1µx) / n;  var(Y) = (Y − 1µy)ᵀ(Y − 1µy) / n   (30)

R = cov(X, Y) / √(var(X)·var(Y))   (31)
Correlation Coefficient R²

0 ≤ R² ≤ 1

The difference between R and R²:

R (Pearson correlation coefficient): measures the strength and direction of the linear relationship between X and Y. Doesn't directly tell you the proportion of variance explained by the relationship.

R²: signifies the proportion of variance in Y explained by the linear relationship with X. Doesn't directly tell you the direction (positive or negative) of the relationship.

While mathematically R² is the square of R, their interpretations differ.
Coefficient of Determination (R-squared) and Adjusted R-squared

There is another way of defining R², shown below; the expression above and the one here are equivalent.

Coefficient of Determination (R-squared):
Measures the proportion of the variance in the dependent variable that is predictable from the independent variables.
In vector notation, it is calculated as:

R² = 1 − SSR/SST

where SSR is the sum of squared residuals and SST is the total sum of squares.
Calculation of SSR and SST in Linear Regression

Sum of Squared Residuals (SSR):
SSR measures the sum of the squared differences between the predicted and actual values.
In vector notation, it is calculated as:

SSR = (y − Xθ)ᵀ(y − Xθ)

where y is the vector of actual values, X is the data matrix, and θ is the coefficient vector.

Total Sum of Squares (SST):
SST measures the total sum of squared differences between the actual values and their mean:

SST = (y − µy·1)ᵀ(y − µy·1)
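A short sketch computing SSR, SST, and R² = 1 − SSR/SST for the running example (the helper name r_squared is our own):

```python
import numpy as np

def r_squared(X, y, theta):
    resid = y - X @ theta      # y - X theta
    ssr = resid @ resid        # SSR = (y - X theta)^T (y - X theta)
    dev = y - y.mean()         # y - mu_y 1
    sst = dev @ dev            # SST = (y - mu_y 1)^T (y - mu_y 1)
    return 1.0 - ssr / sst

X = np.array([[1, 1], [1, 2], [1, 4], [1, 5]], dtype=float)
y = np.array([2, 3, 5, 6], dtype=float)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(r_squared(X, y, theta))   # 1.0 for this exactly-linear data
```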
Adjusted R-squared

R² (Coefficient of Determination): measures the percentage of variance in the dependent variable explained by the independent variables. It increases (or stays the same) whenever more independent variables are added, even if those variables don't actually explain any additional variation. This can increase model complexity and lead to fitting noise (overfitting).

Adjusted R²: adjusts R² to account for the number of independent variables in the model. It penalizes adding variables that don't improve the model's fit. Adjusted R² can increase or decrease when you add variables.

Adjusted R-squared:

Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1)

where n is the number of observations and p is the number of predictors.
R² and Adjusted R²: Numerical Example

X = (1, 2, 4, 5)ᵀ;  Y = (2, 3, 5, 6)ᵀ

µx = (1/4)·(1 + 2 + 4 + 5) = 3   (32)

µy = (1/4)·(2 + 3 + 5 + 6) = 4   (33)

Xc = X − µx·1 = (−2, −1, 1, 2)ᵀ;  Yc = Y − µy·1 = (−2, −1, 1, 2)ᵀ

Xcᵀ·Yc = (−2, −1, 1, 2)·(−2, −1, 1, 2)ᵀ = 10   (34)

Xcᵀ·Xc = 10   (35)

Ycᵀ·Yc = 10   (36)

R = (Xcᵀ·Yc) / √[ (Xcᵀ·Xc)·(Ycᵀ·Yc) ] = 10 / √(10·10) = 1   (37)

R² = 1   (38)
Steps for Numerical Solution for R, R², and Adjusted R²

1. Find Means: calculate µx and µy, the means of X and Y, respectively.

2. Center Variables: obtain centered variables Xc = X − 1·µx and Yc = Y − 1·µy.

3. Calculate the Pearson Correlation Coefficient (R):

   R = (Xcᵀ·Yc) / √[ (Xcᵀ·Xc)·(Ycᵀ·Yc) ]

4. Calculate R²:  R² = R·R

5. Calculate Adjusted R²:

   Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)
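The five steps map directly to NumPy. A sketch with our own helper name r_stats, reproducing the R = R² = 1 of the worked example (k = number of predictors):

```python
import numpy as np

def r_stats(x, y, k=1):
    xc, yc = x - x.mean(), y - y.mean()             # steps 1-2: center
    r = (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))  # step 3: Pearson R
    r2 = r * r                                      # step 4
    n = len(x)
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)      # step 5
    return r, r2, adj

x = np.array([1.0, 2.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 6.0])
print(r_stats(x, y))   # R = 1, R^2 = 1, adjusted R^2 = 1
```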
Multicollinearity in Multiple Linear Regression

Definition
Multicollinearity refers to the phenomenon in which two or more independent variables in a regression model are highly correlated with each other.

Issues Caused by Multicollinearity
Unstable estimation of regression coefficients
Inflated standard errors
Difficulty in determining the true relationship between independent variables and the dependent variable

Detection of Multicollinearity
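One common detector, offered here as an assumption since the slide does not name one, is the variance inflation factor VIFj = 1/(1 − Rj²), where Rj² comes from regressing predictor j on the remaining predictors; values above about 10 conventionally flag strong multicollinearity. A minimal sketch:

```python
import numpy as np

def vif(X):
    # Variance inflation factor for each column of X (no intercept column).
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        A = np.column_stack((np.ones(len(X)), others))    # regress x_j on the rest
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)    # nearly collinear with x1
print(vif(np.column_stack((x1, x2))))    # both VIFs very large
```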