
Introduction to regression
SUPERVISED LEARNING WITH SCIKIT-LEARN
George Boorman, Core Curriculum Manager, DataCamp
Predicting blood glucose levels
import pandas as pd
diabetes_df = pd.read_csv("diabetes.csv")
print(diabetes_df.head())

   pregnancies  glucose  triceps  insulin   bmi  age  diabetes
0            6      148       35        0  33.6   50         1
1            1       85       29        0  26.6   31         0
2            8      183        0        0  23.3   32         1
3            1       89       23       94  28.1   21         0
4            0      137       35      168  43.1   33         1

Creating feature and target arrays
X = diabetes_df.drop("glucose", axis=1).values
y = diabetes_df["glucose"].values
print(type(X), type(y))

<class 'numpy.ndarray'> <class 'numpy.ndarray'>

Making predictions from a single feature
X_bmi = X[:, 3]  # bmi is at column index 3 of X (after dropping "glucose")
print(y.shape, X_bmi.shape)

(752,) (752,)

X_bmi = X_bmi.reshape(-1, 1)  # scikit-learn expects a 2D array: (n_samples, n_features)
print(X_bmi.shape)

(752, 1)

Plotting glucose vs. body mass index
import matplotlib.pyplot as plt
plt.scatter(X_bmi, y)
plt.ylabel("Blood Glucose (mg/dl)")
plt.xlabel("Body Mass Index")
plt.show()

[Figure: scatter plot of blood glucose (mg/dl) against body mass index]


Fitting a regression model
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_bmi, y)
predictions = reg.predict(X_bmi)  # predicted values along the fitted line
plt.scatter(X_bmi, y)
plt.plot(X_bmi, predictions)  # overlay the fitted regression line
plt.ylabel("Blood Glucose (mg/dl)")
plt.xlabel("Body Mass Index")
plt.show()

[Figure: scatter plot with the fitted regression line overlaid]


Let's practice!

The basics of linear regression
SUPERVISED LEARNING WITH SCIKIT-LEARN
George Boorman, Core Curriculum Manager, DataCamp
Regression mechanics
y = ax + b

Simple linear regression uses one feature:
y = target
x = single feature
a, b = parameters/coefficients of the model (slope and intercept)

How do we choose a and b? (A worked sketch follows below.)
Define an error function for any given line
Choose the line that minimizes the error function
Error function = loss function = cost function
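To make the mechanics concrete, here is a minimal sketch with made-up numbers (the slope a and intercept b are hypothetical, not fitted values): it computes predictions from y = ax + b and measures the squared error of that candidate line.

import numpy as np

x = np.array([20.0, 25.0, 30.0, 35.0])     # feature values (e.g. bmi)
y = np.array([90.0, 105.0, 125.0, 140.0])  # observed targets

# candidate line: hypothetical slope a and intercept b
a, b = 3.5, 20.0
y_pred = a * x + b

# squared-error loss for this candidate line; fitting means searching
# for the (a, b) pair that minimizes this quantity
loss = np.sum((y - y_pred) ** 2)
print(loss)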

The loss function
[Figure sequence: scatter plot of the data with a candidate line; the vertical distances (residuals) from each point to the line are highlighted]


Ordinary Least Squares
$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Ordinary Least Squares (OLS): minimize RSS
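As a quick numeric check, here is a minimal sketch computing the RSS by hand, reusing small hypothetical arrays like those above:

import numpy as np

y = np.array([90.0, 105.0, 125.0, 140.0])       # observed targets
y_pred = np.array([90.0, 107.5, 125.0, 142.5])  # predictions from some line

# residual sum of squares: the quantity OLS minimizes
rss = np.sum((y - y_pred) ** 2)
print(rss)  # 12.5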

Linear regression in higher dimensions
$y = a_1 x_1 + a_2 x_2 + b$

To fit a linear regression model here:
Need to specify 3 variables: $a_1$, $a_2$, $b$

In higher dimensions (known as multiple regression):
Must specify a coefficient for each feature and the variable b
$y = a_1 x_1 + a_2 x_2 + a_3 x_3 + ... + a_n x_n + b$

scikit-learn works exactly the same way:
Pass two arrays: features and target

Linear regression using all features
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
reg_all = LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)
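After fitting, the learned parameters are exposed on the estimator: coef_ holds one coefficient per feature (the a_i values) and intercept_ holds b. For example:

print(reg_all.coef_)       # one coefficient per feature: a_1 ... a_n
print(reg_all.intercept_)  # the intercept b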

R-squared
$R^2$: quantifies the variance in target values explained by the features
Values range from 0 to 1 (a model can score below 0 on unseen data if it performs worse than predicting the mean)

[Figures: example regression fits illustrating high $R^2$ versus low $R^2$]

R-squared in scikit-learn
reg_all.score(X_test, y_test)

0.356302876407827
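The score method computes $R^2 = 1 - RSS/TSS$; here is a minimal sketch reproducing it by hand from the test-set predictions:

import numpy as np

rss = np.sum((y_test - y_pred) ** 2)           # residual sum of squares
tss = np.sum((y_test - np.mean(y_test)) ** 2)  # total sum of squares
print(1 - rss / tss)  # matches reg_all.score(X_test, y_test)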

Mean squared error and root mean squared error
$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

MSE is measured in target units, squared

$RMSE = \sqrt{MSE}$

RMSE is measured in the same units as the target variable
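Both metrics are straightforward to compute with NumPy; a minimal sketch, assuming the y_test and y_pred arrays from the model above:

import numpy as np

mse = np.mean((y_test - y_pred) ** 2)  # in squared target units
rmse = np.sqrt(mse)                    # back in the target's units
print(mse, rmse)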

RMSE in scikit-learn
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred, squared=False)  # squared=False returns the RMSE instead of the MSE

24.028109426907236
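In newer scikit-learn releases (1.4+), squared=False is deprecated in favor of a dedicated function that returns the same value:

from sklearn.metrics import root_mean_squared_error
root_mean_squared_error(y_test, y_pred)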

Let's practice!

Cross-validation
SUPERVISED LEARNING WITH SCIKIT-LEARN
George Boorman, Core Curriculum Manager, DataCamp
Cross-validation motivation
Model performance is dependent on the way we split up the data
A single train/test split may not be representative of the model's ability to generalize to unseen data
Solution: cross-validation!

Cross-validation basics
[Figure sequence: the data are split into k folds; each fold in turn is held out as the test set while the model is trained on the remaining folds, yielding one score per fold]


Cross-validation and model performance
5 folds = 5-fold CV

10 folds = 10-fold CV

k folds = k-fold CV

More folds = more computationally expensive (a toy illustration of the splits follows)
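To see which samples land in each fold, here is a small sketch (using a purely illustrative toy array of six samples) that prints the train/test indices KFold generates:

import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(12).reshape(6, 2)  # six samples, two features
kf = KFold(n_splits=3, shuffle=True, random_state=42)

# each round holds out one fold (two samples here) as the test set
for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy)):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")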

Cross-validation in scikit-learn
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=6, shuffle=True, random_state=42)
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv=kf)  # uses the estimator's default scorer (R^2 for regression)

Evaluating cross-validation performance
print(cv_results)

[0.70262578 0.7659624  0.75188205 0.76914482 0.72551151 0.73608277]

import numpy as np
print(np.mean(cv_results), np.std(cv_results))

0.7418682216666667 0.023330243960652888

print(np.quantile(cv_results, [0.025, 0.975]))

[0.7054865  0.76874702]

Let's practice!

Regularized regression
SUPERVISED LEARNING WITH SCIKIT-LEARN
George Boorman, Core Curriculum Manager, DataCamp
Why regularize?
Recall: linear regression minimizes a loss function
It chooses a coefficient, a, for each feature variable, plus b
Large coefficients can lead to overfitting
Regularization: penalize large coefficients

Ridge regression
Loss function = OLS loss function + $\alpha \sum_{i=1}^{n} a_i^2$

Ridge penalizes large positive or negative coefficients
$\alpha$: parameter we need to choose
Picking $\alpha$ is similar to picking k in KNN
Hyperparameter: variable used to optimize model parameters
$\alpha$ controls model complexity:
$\alpha = 0$: same as OLS (can lead to overfitting)
Very high $\alpha$: can lead to underfitting

A sketch of this loss appears below.
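As a sketch only (scikit-learn's Ridge also handles the intercept and uses efficient solvers internally), the quantity ridge regression minimizes can be written as:

import numpy as np

def ridge_loss(y, y_pred, coefs, alpha):
    rss = np.sum((y - y_pred) ** 2)       # the OLS loss
    penalty = alpha * np.sum(coefs ** 2)  # L2 penalty on the coefficients
    return rss + penalty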

Ridge regression in scikit-learn
from sklearn.linear_model import Ridge
scores = []
for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    y_pred = ridge.predict(X_test)
    scores.append(ridge.score(X_test, y_test))
print(scores)

[0.2828466623222221, 0.28320633574804777, 0.2853000732200006,
 0.26423984812668133, 0.19292424694100963]

Lasso regression
Loss function = OLS loss function + $\alpha \sum_{i=1}^{n} |a_i|$
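The only change from ridge is the penalty: the sum of absolute coefficient values (L1) rather than their squares (L2). A sketch in the same style as the ridge one above:

import numpy as np

def lasso_loss(y, y_pred, coefs, alpha):
    rss = np.sum((y - y_pred) ** 2)          # the OLS loss
    penalty = alpha * np.sum(np.abs(coefs))  # L1 penalty on the coefficients
    return rss + penalty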

Lasso regression in scikit-learn
from sklearn.linear_model import Lasso
scores = []
for alpha in [0.01, 1.0, 10.0, 20.0, 50.0]:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train, y_train)
    lasso_pred = lasso.predict(X_test)
    scores.append(lasso.score(X_test, y_test))
print(scores)

[0.99991649071123, 0.99961700284223, 0.93882227671069, 0.74855318676232, -0.05741034640016]

Lasso regression for feature selection
Lasso can select important features of a dataset

Shrinks the coe cients of less important features to zero

Features not shrunk to zero are selected by lasso

Lasso for feature selection in scikit-learn
from sklearn.linear_model import Lasso
X = diabetes_df.drop("glucose", axis=1).values
y = diabetes_df["glucose"].values
names = diabetes_df.drop("glucose", axis=1).columns
lasso = Lasso(alpha=0.1)
lasso_coef = lasso.fit(X, y).coef_  # fitted coefficients, one per feature
plt.bar(names, lasso_coef)
plt.xticks(rotation=45)
plt.show()

[Figure: bar chart of the lasso coefficients for each feature]


Let's practice!
