
Data Science - Unit - II

Instructor: Krishna Dutt, krishnadutt.rvs@gmail.com

March 23, 2024

Disclaimer: The views expressed in this presentation are those of the author; much open-source content is referenced, with all authors duly acknowledged.

Copyright: This beamer is for private targeted circulation only. Content in this beamer should not be copied, either in part or in full, without prior approval of the instructor.
UNIT - II Syllabus

Linear multiple regression: estimation and testing of coefficients, R² and adjusted R². Logistic regression: estimation and testing of coefficients, R² and adjusted R², and interpretation of coefficients. K-Nearest Neighbor classifier. Random forest. Classification errors. Ridge regression. Support Vector Machine. Analysis of Variance (ANOVA).
Linear Regression - Single Variable

Data / features / dimensions: represent the state of real-world phenomena.
Representable in primitive formats such as numerical values, characters, and strings.
Composite formats: image, video, sound, text, etc.
Data can be in structured or unstructured (SQL / NoSQL) formats.
Both the input and the corresponding output can be considered data.
In some cases there is no explicit output data, but a goal shall be considered.
Usually an n-dimensional input and an m-dimensional output.
Assume all data is converted to numerical format.

    X     Y     Y − Ỹ
    x1    y1    (y1 − h(θ0, θ1, x1))
    x2    y2    (y2 − h(θ0, θ1, x2))
    ...   ...   ...
    xm    ym    (ym − h(θ0, θ1, xm))
Linear Regression - Single Variable

Assume all data is converted to numerical format. In linear regression, we assume a model

    ỹi = h(θ0, θ1, xi) = θ0 + θ1·xi    (1)

The model is described by two parameters, θ0 and θ1, which are unknown. The goal is to estimate these unknowns for the given data. In the graph (a scatter plot of the data with several candidate lines), which line represents the best fit for the given data?
Matrix Calculus required for Linear Regression

Matrix calculus deals with differentiating scalar and vector functions of a vector variable. We encounter a scalar function of a vector variable in linear regression problems; accordingly, we provide the formulas for differentiating a scalar function of a vector variable. Consider a vector variable X = (x1, x2, ..., xm)^T, a constant vector A = (a1, a2, ..., am)^T, and the scalar function of X

    y = X^T·A = A^T·X

    ∂(A^T·X)/∂X = ∂(X^T·A)/∂X = A    (2)

    ∂(X^T·A·X)/∂X = 2·A·X    (3)

(In equation (3), A is a square, symmetric matrix.)
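These identities can be sanity-checked numerically. Below is a minimal NumPy sketch (not from the slides; the helper num_grad and the random test data are illustrative) that compares both formulas against central finite-difference gradients:

    import numpy as np

    rng = np.random.default_rng(0)
    m = 4
    x = rng.standard_normal(m)
    a = rng.standard_normal(m)
    A = rng.standard_normal((m, m))
    A = (A + A.T) / 2  # symmetrize: identity (3) assumes a symmetric A

    def num_grad(f, x0, h=1e-6):
        """Central-difference gradient of a scalar function f at x0."""
        g = np.zeros_like(x0)
        for i in range(len(x0)):
            e = np.zeros_like(x0)
            e[i] = h
            g[i] = (f(x0 + e) - f(x0 - e)) / (2 * h)
        return g

    print(np.allclose(num_grad(lambda v: a @ v, x), a))              # identity (2)
    print(np.allclose(num_grad(lambda v: v @ A @ v, x), 2 * A @ x))  # identity (3)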
Linear Regression - Uni-variate

Consider the uni-variate case. The linear model is defined by

    ỹi = θ0 + θ1·xi    (4)

where ỹi and xi are the i-th response and predictor values, respectively, from the data set. We consider a dataset of m values, and the corresponding vectors are

    X = (x1, x2, ..., xm)^T ;  Y = (y1, y2, ..., ym)^T ;  θ = (θ0, θ1)^T    (5)

Recast X, prepending a column of ones, as

    X = [1, x1 ; 1, x2 ; ... ; 1, xm]    (6)

(matrix rows are separated by semicolons).
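As a small illustration, the recast in equation (6) is a one-liner in NumPy (the predictor values here are made up):

    import numpy as np

    # Prepending a column of ones lets theta_0 act as the intercept in Y = X . theta.
    x = np.array([1.0, 2.0, 4.0, 5.0])
    X = np.column_stack((np.ones_like(x), x))
    print(X)   # [[1. 1.] [1. 2.] [1. 4.] [1. 5.]]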
Linear Regression - Ordinary least squares minimization

The model, in vector form, is represented as

    Ỹ = X·θ    (7)

The deviation between the actual and model-predicted values is given by

    ϵ = (Y − Ỹ)    (8)

The squared error ϵ^T·ϵ is called the cost function J(θ), or sometimes the error function. This cost function represents the sum of squared errors between each actual value of the dependent variable and the model-predicted value, taken over every point of the training data set.
Linear Regression .. contd.

In the ordinary least squares fit of the linear model, we minimize J(θ); since the cost function is convex, an explicit solution is easy to obtain by equating the gradient of J(θ) to 0.

    J(θ) = ϵ^T·ϵ = (Y − Ỹ)^T·(Y − Ỹ) = (Y^T − Ỹ^T)(Y − Ỹ)
         = Y^T·Y − Y^T·Ỹ − Ỹ^T·Y + Ỹ^T·Ỹ
         = Y^T·Y − Y^T·(X·θ) − (X·θ)^T·Y + (X·θ)^T·(X·θ)
         = Y^T·Y − Y^T·X·θ − θ^T·X^T·Y + θ^T·X^T·X·θ    (9)

    ∂J(θ)/∂θ = −∂(Y^T·X·θ)/∂θ − ∂(θ^T·X^T·Y)/∂θ + ∂(θ^T·X^T·X·θ)/∂θ
             = −(X^T·Y) − (X^T·Y) + 2·X^T·X·θ = −2·(X^T·Y) + 2·(X^T·X·θ)    (10)

    ∂J(θ)/∂θ = 0 :  −2·(X^T·Y) + 2·(X^T·X·θ) = 0    (11)

    (X^T·Y) = (X^T·X)·θ ;  θ = (X^T·X)⁻¹·(X^T·Y)    (12)
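A minimal NumPy sketch of the closed-form solution (12); the helper name ols_fit is illustrative, and np.linalg.solve is used instead of an explicit inverse for numerical stability:

    import numpy as np

    def ols_fit(X, y):
        """Closed-form OLS: theta = (X^T X)^(-1) X^T y, equation (12)."""
        return np.linalg.solve(X.T @ X, X.T @ y)

    x = np.array([1.0, 2.0, 4.0, 5.0])
    X = np.column_stack((np.ones_like(x), x))   # recast per equation (6)
    y = np.array([2.0, 3.0, 5.0, 6.0])
    print(ols_fit(X, y))   # [1. 1.], matching the worked example below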
Linear Regression - Multivariate

Consider multivariate data

    X = [x11, x12, ..., x1n ; x21, x22, ..., x2n ; ... ; xm1, xm2, ..., xmn] ;
    Y = (y1, y2, ..., ym)^T ;  θ = (θ0, θ1, ..., θn)^T    (13)

Recast X as

    X = [1, x11, x12, ..., x1n ; 1, x21, x22, ..., x2n ; ... ; 1, xm1, xm2, ..., xmn] ;
    Ỹ = X·θ ;  ϵ = Y − Ỹ    (14)

    J(θ) = ϵ^T·ϵ    (15)

The objective J(θ) is still a scalar function of the vector θ, so the earlier solution for θ still holds!
Linear Regression - Example

Consider a uni-variate hypothetical example.

    X = (1, 2, 4, 5)^T ;  Y = (2, 3, 5, 6)^T ;  θ = (θ0, θ1)^T ;
    recast X as X = [1, 1 ; 1, 2 ; 1, 4 ; 1, 5]    (16)

    X^T·X = [1, 1, 1, 1 ; 1, 2, 4, 5] · [1, 1 ; 1, 2 ; 1, 4 ; 1, 5] = [4, 12 ; 12, 46]    (17)

    (X^T·X)⁻¹ = 1/(4·46 − 12·12) · [46, −12 ; −12, 4] = (1/40)·[46, −12 ; −12, 4]    (18)

    X^T·Y = [1, 1, 1, 1 ; 1, 2, 4, 5] · (2, 3, 5, 6)^T = (16, 58)^T    (19)
Linear Regression - Example - contd..

    θ = (X^T·X)⁻¹·X^T·Y = (1/40)·[46, −12 ; −12, 4] · (16, 58)^T = (1, 1)^T    (20)

The following bi-variate problem can be solved using the above steps.

    X = [1, 2 ; 2, 3 ; 4, 5 ; 5, 6] ;  Y = (3, 4, 5, 6)^T    (21)

Verify that the answer for the above bi-variate problem is

    θ = (1.7, 0, 0.7)^T    (22)

(Note: in this data the second column equals the first plus one, so the recast X has linearly dependent columns and X^T·X is singular; the θ in (22) is one of infinitely many least-squares solutions, as illustrated in the sketch below.)
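A sketch checking the bi-variate exercise (21) with NumPy. Because the recast X is rank-deficient here, the sketch uses np.linalg.lstsq, which returns the minimum-norm least-squares solution; it differs from the θ quoted in (22) but yields identical fitted values:

    import numpy as np

    X2 = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 5.0], [5.0, 6.0]])
    y2 = np.array([3.0, 4.0, 5.0, 6.0])
    Xb = np.column_stack((np.ones(len(X2)), X2))   # recast with intercept column

    # lstsq handles the singular X^T X and returns the minimum-norm solution.
    theta_minnorm, *_ = np.linalg.lstsq(Xb, y2, rcond=None)
    theta_slides = np.array([1.7, 0.0, 0.7])

    print(theta_minnorm)                                       # ~ [ 1.367 -0.333  1.033]
    print(np.allclose(Xb @ theta_minnorm, Xb @ theta_slides))  # True: same fitted values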
Bivariate Least square - Example

Another bi-variate problem can be solved using the above steps: marks obtained in theory and lab, and the corresponding grade.

    Theory   Lab   Grade
    60       70    A
    70       75    A
    40       55    B
    30       60    F

    X = [60, 70 ; 70, 75 ; 40, 55 ; 30, 60] ;  Y = (3, 3, 2, 0)^T    (23)

In the above, the ordinal grade values are transformed into numerical values to facilitate linear regression.
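A sketch of this exercise in NumPy; the grade encoding A=3, B=2, F=0 follows the Y vector in (23) (an assumption of this particular encoding, not a universal rule), and an intercept column is added as in the earlier recast:

    import numpy as np

    grade_map = {"A": 3.0, "B": 2.0, "F": 0.0}   # ordinal grades -> numbers
    theory = np.array([60.0, 70.0, 40.0, 30.0])
    lab = np.array([70.0, 75.0, 55.0, 60.0])
    y = np.array([grade_map[g] for g in ["A", "A", "B", "F"]])

    X = np.column_stack((np.ones_like(theory), theory, lab))
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(theta)   # intercept and weights for theory and lab marks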
Ridge Regression

The above least squares regression minimizes the deviations of the model-predicted values with respect to the actual data points, but the problem of overfitting cannot be avoided. An additional loss term, consisting of the model parameters alone, is added to the objective to penalize the model parameters and control the overfitting.

    J(θ) = ϵ^T·ϵ + λ·θ^T·θ    (24)

The least squares fit is obtained, as earlier, by solving

    ∂J(θ)/∂θ = 0    (25)

The first part, ∂(ϵ^T·ϵ)/∂θ, is already obtained above. The second part is

    ∂(λ·θ^T·θ)/∂θ = 2·λ·θ    (26)
Ridge Regression .. contd.

Setting the full gradient to zero, −2·(X^T·Y) + 2·(X^T·X)·θ + 2·λ·θ = 0, the total solution with ridge regression is obtained as

    θ = (X^T·X + λ·I)⁻¹·(X^T·Y)    (27)
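A minimal NumPy sketch of equation (27); the helper name ridge_fit is illustrative, and note that in practice the intercept column is often left out of the penalty:

    import numpy as np

    def ridge_fit(X, y, lam):
        """Closed-form ridge: theta = (X^T X + lam I)^(-1) X^T y, equation (27)."""
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    x = np.array([1.0, 2.0, 4.0, 5.0])
    X = np.column_stack((np.ones_like(x), x))
    y = np.array([2.0, 3.0, 5.0, 6.0])
    print(ridge_fit(X, y, lam=0.1))   # shrunk slightly away from the OLS [1, 1]

For any λ > 0 the matrix X^T·X + λ·I is invertible, so ridge regression also resolves the singularity seen in the collinear bi-variate example above.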
Hypothesis Testing in Linear Regression: T-Test

Objective: Assess the significance of individual coefficients in linear regression.

Hypothesis Testing for Individual Coefficients:
Null Hypothesis (H0): The coefficient is equal to zero (βi = 0).
Alternative Hypothesis (H1): The coefficient is not equal to zero (βi ≠ 0).

T-Test Statistic for Linear Regression:

    ti = β̂i / Standard Error(β̂i)

Degrees of Freedom: In the context of hypothesis testing for linear regression, degrees of freedom are calculated as df = n − p − 1, where n is the sample size (the number of observations) and p is the number of predictors in the model.
Hypothesis testing in Linear Regression - T test - contd..

Decision Rule:
If |ti| is significantly different from zero, reject H0 in favor of H1.
Common significance levels include 0.05, 0.01, etc.

Interpretation:
A significant coefficient suggests that the corresponding predictor is associated with the response variable.
Numerical Example: Hypothesis Testing in Linear Regression

Scenario: We have a linear regression model with a single predictor X and the response variable Y.

Hypotheses:
Null Hypothesis (H0): β1 = 0 (no relationship between X and Y).
Alternative Hypothesis (H1): β1 ≠ 0 (there is a significant relationship).

Given:
Sample size (n) = 50
Estimated coefficient (β̂1) = 2.5
Standard error (SE(β̂1)) = 1.2
Degrees of freedom (df) = n − p − 1 = 48 (assuming p = 1)
Significance level (α) = 0.05

T-Test Statistic:

    t = 2.5 / 1.2 ≈ 2.08
Numerical Example: Hypothesis Testing in Linear Regression - contd..

Decision Rule:
If |t| > tα/2,df , reject H0.
For α = 0.05 and df = 48, tα/2,df ≈ 2.011.

Conclusion:
Since |t| ≈ 2.08 > 2.011, we reject H0 in favor of H1.
There is sufficient evidence to suggest a significant relationship between X and Y.
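The example is easy to reproduce with scipy.stats; below is a sketch with the slide's numbers hard-coded:

    from scipy import stats

    beta_hat, se, n, p, alpha = 2.5, 1.2, 50, 1, 0.05
    df = n - p - 1                               # 48

    t_stat = beta_hat / se                       # ~ 2.083
    t_crit = stats.t.ppf(1 - alpha / 2, df)      # ~ 2.011 (two-sided critical value)
    p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-sided p-value

    print(t_stat, t_crit, p_value)
    print("reject H0" if abs(t_stat) > t_crit else "fail to reject H0")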
Correlation Coefficient R

The formula for R:

    R = (X − 1µX)^T·(Y − 1µY) / sqrt[ (X − 1µX)^T·(X − 1µX) · (Y − 1µY)^T·(Y − 1µY) ]

Range of R:

    −1 ≤ R ≤ 1

R measures the strength and direction of the linear relationship between X and Y. It doesn't directly tell the proportion of variance explained by the relationship.
Correlation Coefficient R and R² - another formulation

There is another way of showing both R and R²:

    R = (Xc^T·Yc) / sqrt[(Xc^T·Xc)·(Yc^T·Yc)] ;  Xc = (X − 1·µx) ;  Yc = (Y − 1·µy)    (28)

    cov(X, Y) = (Xc^T·Yc) / n    (29)

    var(X) = (X − 1µx)^T·(X − 1µx) / n ;  var(Y) = (Y − 1µy)^T·(Y − 1µy) / n    (30)

    R = cov(X, Y) / sqrt[var(X)·var(Y)]    (31)
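A minimal NumPy sketch of the centered-vector form (28), cross-checked against np.corrcoef (the helper name pearson_r is illustrative):

    import numpy as np

    def pearson_r(x, y):
        """Pearson R via the centered-vector form of equation (28)."""
        xc, yc = x - x.mean(), y - y.mean()
        return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

    x = np.array([1.0, 2.0, 4.0, 5.0])
    y = np.array([2.0, 3.0, 5.0, 6.0])
    print(pearson_r(x, y))            # 1.0 for this perfectly linear data
    print(np.corrcoef(x, y)[0, 1])    # cross-check against NumPy's built-in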
Correlation Coefficient R²

R² is the squared value of R. Accordingly, the range of R² is

    0 ≤ R² ≤ 1

The difference between R and R²:
R (Pearson correlation coefficient): measures the strength and direction of the linear relationship between X and Y. It doesn't directly tell you the proportion of variance explained by the relationship.
R²: signifies the proportion of variance in Y explained by the linear relationship with X. It doesn't directly tell you the direction (positive or negative) of the relationship.
While mathematically R² is the square of R, their interpretations differ: R focuses on the strength and direction of the association, while R² focuses on the proportion of variance explained.
Coefficient of Determination (R-squared) and Adjusted R-squared

There is another way of defining R², shown below; both expressions, the one shown above and the one shown here, are the same (for a least-squares fit with an intercept).

Coefficient of Determination (R-squared):
Measures the proportion of the variance in the dependent variable that is predictable from the independent variables.
In vector notation, it is calculated as

    R² = 1 − SSR/SST

where SSR is the sum of squared residuals, and SST is the total sum of squares.
Calculation of SSR and SST in Linear Regression

Sum of Squared Residuals (SSR):
SSR measures the sum of the squared differences between the predicted and actual values.
In vector notation, it is calculated as

    SSR = (y − X·θ)^T·(y − X·θ)

where y is the vector of actual values, X is the data matrix, and θ is the coefficient vector.

Total Sum of Squares (SST):
SST measures the total sum of squared differences between the actual values and the mean of the actual values.
In vector notation, it is calculated as

    SST = (y − µy·1)^T·(y − µy·1)
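A sketch computing R² = 1 − SSR/SST from these vector forms; the helper name r_squared is illustrative, and the data reuses the earlier worked example:

    import numpy as np

    def r_squared(X, y, theta):
        """R^2 = 1 - SSR/SST using the vector forms on this slide."""
        residual = y - X @ theta
        ssr = residual @ residual        # (y - X theta)^T (y - X theta)
        centered = y - y.mean()
        sst = centered @ centered        # (y - mu_y 1)^T (y - mu_y 1)
        return 1 - ssr / sst

    x = np.array([1.0, 2.0, 4.0, 5.0])
    X = np.column_stack((np.ones_like(x), x))
    y = np.array([2.0, 3.0, 5.0, 6.0])
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    print(r_squared(X, y, theta))   # 1.0: the example data lies exactly on a line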
Adjusted R-squared

R² (Coefficient of Determination): measures the percentage of variance in the dependent variable explained by the independent variables. It increases (or stays the same) when more independent variables are added, even if those variables don't actually explain any additional variation; this can increase model complexity and lead to fitting noise (overfitting).
Adjusted R²: adjusts R² to account for the number of independent variables in the model. It penalizes adding variables that don't improve the model's fit. Adjusted R² can increase or decrease when you add variables.

Adjusted R-squared adjusts R-squared for the number of predictors in the model:

    Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1)

where n is the number of observations and p is the number of predictors.
R² and Adjusted R²: Numerical Example

    X = (1, 2, 4, 5)^T ;  Y = (2, 3, 5, 6)^T ;  µx = (1/4)·(1 + 2 + 4 + 5) = 3    (32)

    µy = (1/4)·(2 + 3 + 5 + 6) = 4 ;
    Xc = X − µx·1 = (1, 2, 4, 5)^T − (3, 3, 3, 3)^T = (−2, −1, 1, 2)^T ;
    Yc = Y − µy·1 = (2, 3, 5, 6)^T − (4, 4, 4, 4)^T = (−2, −1, 1, 2)^T    (33)

    Xc^T·Yc = (−2)·(−2) + (−1)·(−1) + 1·1 + 2·2 = 10    (34)

    Xc^T·Xc = (−2)² + (−1)² + 1² + 2² = 10    (35)

    Yc^T·Yc = (−2)² + (−1)² + 1² + 2² = 10    (36)

    R = (Xc^T·Yc) / sqrt[(Xc^T·Xc)·(Yc^T·Yc)] = 10/10 = 1    (37)

    R² = 1    (38)
Steps for Numerical Solution for R, R², and Adjusted R²

1. Find Means: Calculate µx and µy, the means of X and Y, respectively.
2. Center Variables: Obtain the centered variables Xc = X − 1·µx and Yc = Y − 1·µy.
3. Calculate the Pearson Correlation Coefficient (R):

       R = (Xc^T·Yc) / sqrt[(Xc^T·Xc)·(Yc^T·Yc)]

4. Calculate R²:

       R² = R·R

5. Calculate Adjusted R²:

       Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)

   where n is the number of observations and k is the number of predictors. These steps are implemented in the sketch below.
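A minimal NumPy sketch following the five steps above (the helper name r_r2_adjusted is illustrative; the data is from the worked example):

    import numpy as np

    def r_r2_adjusted(x, y, k=1):
        """Steps 1-5 above; k is the number of predictors."""
        n = len(x)
        xc, yc = x - x.mean(), y - y.mean()                # steps 1-2: center
        r = (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))     # step 3: Pearson R
        r2 = r * r                                         # step 4: R^2
        adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)         # step 5: adjusted R^2
        return r, r2, adj

    x = np.array([1.0, 2.0, 4.0, 5.0])
    y = np.array([2.0, 3.0, 5.0, 6.0])
    print(r_r2_adjusted(x, y))   # (1.0, 1.0, 1.0) for the worked example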
Multicollinearity in Multiple Linear Regression

Definition
Multicollinearity refers to the phenomenon in which two or more independent variables in a regression model are highly correlated with each other.

Issues Caused by Multicollinearity
Unstable estimation of regression coefficients
Inflated standard errors
Difficulty in determining the true relationship between independent variables and the dependent variable

Detection of Multicollinearity
Mathematically, multicollinearity can be detected using the Variance Inflation Factor (VIF):

    VIFi = 1 / (1 − Ri²)

where Ri² is the R² obtained by regressing the i-th independent variable on all the other independent variables.
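A sketch of the VIF computation in NumPy; the helper name vif and the synthetic data are illustrative (an Ri² of exactly 1 would make the formula blow up, so the second column is only nearly collinear):

    import numpy as np

    def vif(X):
        """VIF_i = 1/(1 - R_i^2): regress column i of X on the remaining
        columns (with an intercept) and apply the formula above."""
        n, p = X.shape
        out = np.empty(p)
        for i in range(p):
            target = X[:, i]
            others = np.column_stack((np.ones(n), np.delete(X, i, axis=1)))
            theta, *_ = np.linalg.lstsq(others, target, rcond=None)
            resid = target - others @ theta
            tc = target - target.mean()
            r2 = 1 - (resid @ resid) / (tc @ tc)
            out[i] = 1.0 / (1.0 - r2)
        return out

    # Hypothetical data: the second column is nearly a copy of the first,
    # so both VIFs come out large (VIF > 10 is a common rule of thumb).
    rng = np.random.default_rng(1)
    x1 = rng.standard_normal(100)
    X = np.column_stack((x1, x1 + 0.05 * rng.standard_normal(100)))
    print(vif(X))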
