Scikit Learn What Were Covering

scikit-learn-what-were-covering
June 8, 2023
1 What we’re covering in the Scikit-Learn Introduction

This notebook outlines the content covered in the Scikit-Learn Introduction.
It’s a quick stop to see all the Scikit-Learn functions and modules for each section outlined.
The sections follow the steps of the Scikit-Learn workflow diagram.
1.1 0. Standard library imports

In most machine learning projects, you’ll see these libraries (Matplotlib, NumPy and pandas)
imported at the top.
[1]: %matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
We’ll use 2 datasets for demonstration purposes:
* heart_disease - a classification dataset (predicting whether someone has heart disease or not)
* boston_df - a regression dataset (predicting the median house prices of towns in the Boston area)
[2]: # Classification data

heart_disease = pd.read_csv("../data/heart-disease.csv")
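# Regression data
# NOTE: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
boston = load_boston() # loads as a dictionary-like Bunch object
# Convert dictionary to dataframe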
boston_df = pd.DataFrame(boston["data"], columns=boston["feature_names"])
boston_df["target"] = pd.Series(boston["target"])
1.2 1. Get the data ready
[3]: # Split data into X & y
X = heart_disease.drop("target", axis=1) # use all columns except target
y = heart_disease["target"] # we want to predict y using X
[4]: # Split the data into training and test sets

from sklearn.model_selection import train_test_split
# Example use case (requires X & y)
X_train, X_test, y_train, y_test = train_test_split(X, y)
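As a minimal sketch of what the split returns, the snippet below uses a small synthetic array instead of heart_disease. The toy data, the variable names and random_state=42 are assumptions for illustration. By default train_test_split holds out 25% of the rows for testing; test_size changes that fraction and random_state makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 features, binary labels
toy_X = np.arange(300).reshape(100, 3)
toy_y = np.arange(100) % 2

# Default split: 75% train, 25% test
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, random_state=42)
print(X_tr.shape, X_te.shape)

# Explicit 80/20 split; the same random_state gives the same split every run
X_tr20, X_te20, y_tr20, y_te20 = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42)
print(X_tr20.shape, X_te20.shape)
```

Fixing random_state is mainly useful when comparing models, since each model then sees exactly the same training and test rows.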
1.3 2. Pick a model/estimator (to suit your problem)

To pick a model we use the Scikit-Learn machine learning map.
Note: Scikit-Learn refers to machine learning models and algorithms as estimators.
[5]: # Random Forest Classifier (for classification problems)

from sklearn.ensemble import RandomForestClassifier
# Instantiating a Random Forest Classifier (clf short for classifier)
clf = RandomForestClassifier()
[6]: # Random Forest Regressor (for regression problems)

from sklearn.ensemble import RandomForestRegressor
# Instantiating a Random Forest Regressor
model = RandomForestRegressor()
1.4 3. Fit the model to the data and make a prediction
[7]: # All models/estimators have the fit() function built-in
clf.fit(X_train, y_train)
# Once fit is called, you can make predictions using predict()

y_preds = clf.predict(X_test)
# You can also predict with probabilities (on classification models)

y_probs = clf.predict_proba(X_test)
# View preds/probabilities
y_preds, y_probs
/Users/daniel/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-
packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of
n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
[7]: (array([0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,
0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,
0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0,
1, 1, 1, 1, 1, 0, 1, 0, 1, 0]), array([[0.5, 0.5],
[0.2, 0.8],
[0.4, 0.6],
[0.5, 0.5],
[0.8, 0.2],
[0.9, 0.1],
[0.5, 0.5],
[0.9, 0.1],
[0.2, 0.8],
[1. , 0. ],
[0.5, 0.5],
[0.8, 0.2],
[0.5, 0.5],
[0.5, 0.5],
[0.2, 0.8],
[0.5, 0.5],
[0.6, 0.4],
[0.2, 0.8],
[0.4, 0.6],
[1. , 0. ],
[0.5, 0.5],
[0.5, 0.5],
[0.9, 0.1],
[0.6, 0.4],
[0.2, 0.8],
[0.4, 0.6],
[0.5, 0.5],
[0.4, 0.6],
[0.9, 0.1],
[0.7, 0.3],
[0.7, 0.3],
[0.2, 0.8],
[1. , 0. ],
[0.1, 0.9],
[0.6, 0.4],
[0.8, 0.2],
[1. , 0. ],
[0.1, 0.9],
[1. , 0. ],
[0.5, 0.5],
[0.6, 0.4],
[1. , 0. ],
[0.3, 0.7],
[0. , 1. ],
[0.9, 0.1],
[0.6, 0.4],
[0. , 1. ],
[0.3, 0.7],
[0.8, 0.2],
[0.6, 0.4],
[0.4, 0.6],
[0.6, 0.4],
[0.2, 0.8],
[0.4, 0.6],
[0. , 1. ],
[0.9, 0.1],
[0.8, 0.2],
[0.3, 0.7],
[0. , 1. ],
[0.9, 0.1],
[0.1, 0.9],
[0.1, 0.9],
[0.3, 0.7],
[1. , 0. ],
[0.9, 0.1],
[0.6, 0.4],
[0.3, 0.7],
[0.3, 0.7],
[0. , 1. ],
[0.3, 0.7],
[0.1, 0.9],
[0.6, 0.4],
[0. , 1. ],
[0.7, 0.3],
[0. , 1. ],
[1. , 0. ]]))
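To make the link between the two outputs above concrete, here is a small sketch on synthetic data. make_classification, the toy_ names and random_state=0 are assumptions, not part of the notebook. Each row of predict_proba() sums to 1, with one column per class in classes_ order, and for a random forest predict() returns the class with the highest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy classification problem
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)
toy_clf = RandomForestClassifier(n_estimators=10, random_state=0)
toy_clf.fit(toy_X, toy_y)

toy_probs = toy_clf.predict_proba(toy_X[:5])  # shape (5, 2): one column per class
toy_preds = toy_clf.predict(toy_X[:5])        # shape (5,): hard labels

# Each row of probabilities sums to 1
row_sums = toy_probs.sum(axis=1)
# Picking the most probable class per row reproduces predict()
labels_from_probs = toy_clf.classes_[np.argmax(toy_probs, axis=1)]

print(toy_probs)
print(toy_preds, labels_from_probs)
```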
1.5 4. Evaluate the model

Every Scikit-Learn model has a default metric, accessible through the score() function: mean
accuracy for classifiers and R^2 for regressors.
However, there is a range of different evaluation metrics you can use, depending on the model
you’re using.
A full list of evaluation metrics can be found in the documentation.
[8]: # All models/estimators have a score() function

clf.score(X_test, y_test)
[8]: 0.8026315789473685
[9]: # Evaluating a model using cross-validation is possible with cross_val_score

from sklearn.model_selection import cross_val_score
# scoring=None means default score() metric is used

print(cross_val_score(estimator=clf,
                      X=X,
                      y=y,
                      cv=5, # use 5-fold cross-validation
                      scoring=None))
# Evaluate a model with a different scoring method

print(cross_val_score(estimator=clf,
                      X=X,
                      y=y,
                      cv=5, # use 5-fold cross-validation
                      scoring="precision"))
[0.78688525 0.86885246 0.7704918 0.78333333 0.81666667]

[0.8 0.92592593 0.85185185 0.83870968 0.75 ]
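As a rough sketch of what cv=5 does, the pure-Python helper below cuts row indices into 5 consecutive folds; each fold serves once as the test set while the remaining rows train the model. kfold_indices is a hypothetical name, not a Scikit-Learn function. The sketch assumes 303 rows, which is consistent with the scores above but not stated in the notebook, and it leaves out the stratification that cross_val_score applies for classifiers. The five scores are usually summarised by their mean.

```python
from statistics import mean

def kfold_indices(n_samples, n_splits):
    """Return (train_indices, test_indices) pairs for consecutive folds."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds get one additional sample
    fold_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in fold_sizes:
        stop = start + size
        # Everything outside [start, stop) trains; the slice itself tests
        folds.append((indices[:start] + indices[stop:], indices[start:stop]))
        start = stop
    return folds

folds = kfold_indices(n_samples=303, n_splits=5)  # 303 rows is an assumption
test_sizes = [len(test_idx) for _, test_idx in folds]
print(test_sizes)

# Accuracy scores printed by the first cross_val_score call above
cv_scores = [0.78688525, 0.86885246, 0.7704918, 0.78333333, 0.81666667]
mean_score = mean(cv_scores)
print(round(mean_score, 4))
```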
[10]: # Different classification metrics
# Accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_preds))
# Receiver Operating Characteristic (ROC curve)/Area under curve (AUC)

from sklearn.metrics import roc_curve, roc_auc_score
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_probs[:, 1])
# NOTE: hard labels (y_preds) score a single threshold; y_probs[:, 1] scores the full curve
print(roc_auc_score(y_test, y_preds))
# Confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_preds))
# Classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_preds))
0.8026315789473685
0.804920304920305
[[33  4]
 [11 28]]
              precision    recall  f1-score   support

           0       0.75      0.89      0.81        37
           1       0.88      0.72      0.79        39

    accuracy                           0.80        76
   macro avg       0.81      0.80      0.80        76
weighted avg       0.81      0.80      0.80        76
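The figures in the report above can be reproduced by hand from the confusion matrix. The sketch below is plain Python arithmetic on the four printed counts, reading the matrix as [[TN, FP], [FN, TP]], which is Scikit-Learn's layout for binary labels 0 and 1. It also suggests why the roc_auc_score call printed 0.8049: with hard 0/1 predictions the score reduces to the average of recall and specificity. Passing predicted probabilities instead would score the full curve, which is the more common usage.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
tn, fp = 33, 4
fn, tp = 11, 28

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # of predicted positives, how many are right
recall = tp / (tp + fn)         # of actual positives, how many were found
specificity = tn / (tn + fp)    # recall of the negative class
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions the ROC curve has a single interior point,
# so its area reduces to the average of recall and specificity
auc_from_labels = (recall + specificity) / 2

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} auc={auc_from_labels:.4f}")
```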
[11]: # Different regression metrics
# Make predictions first

X = boston_df.drop("target", axis=1)
y = boston_df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestRegressor()
model.fit(X_train, y_train)
y_preds = model.predict(X_test)
# R^2 (pronounced r-squared) or coefficient of determination

from sklearn.metrics import r2_score
print(r2_score(y_test, y_preds))
# Mean absolute error (MAE)

from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(y_test, y_preds))
# Mean square error (MSE)

from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, y_preds))
0.8987155770408454
1.9618627450980388
7.75367352941176
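For reference, the three regression metrics above come from simple formulas. The sketch below computes them in plain Python on four made-up values (y_true and y_pred here are hypothetical, chosen only for easy arithmetic, and unrelated to the Boston results).

```python
# Hypothetical targets and predictions, chosen only for easy arithmetic
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# Mean absolute error: average size of the errors, in the target's units
mae = sum(abs(e) for e in errors) / n
# Mean squared error: average squared error, punishes large misses more
mse = sum(e ** 2 for e in errors) / n

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean)
y_mean = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - y_mean) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, mse, r2)
```

An R^2 of 1.0 means perfect predictions, while 0.0 is what a model that always predicts the mean of y_true would score.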
1.6 5. Improve through experimentation

There are two main ways to improve a model’s baseline metrics (the first evaluation metrics you
get): from a data perspective and from a model perspective.
From a data perspective, ask:
* Could we collect more data? In machine learning, more data is generally better, as it gives a model more opportunities to learn patterns.
* Could we improve our data? This could mean filling in missing values or finding a better encoding (turning things into numbers) strategy.
From a model perspective, ask:
* Is there a better model we could use? If you’ve started out with a simple model, could you use a more complex one? (We saw an example of this when looking at the Scikit-Learn machine learning map; ensemble methods are generally considered more complex models.)
* Could we improve the current model? If the model you’re using performs well straight out of the box, can the hyperparameters be tuned to make it even better?
Hyperparameters are like settings on a model: adjusting them changes how the model finds
patterns and can potentially improve it. Adjusting hyperparameters is referred to as
hyperparameter tuning.
[12]: # How to find a model's hyperparameters

clf = RandomForestClassifier()
clf.get_params() # returns a dictionary of adjustable hyperparameters
[12]: {'bootstrap': True,

'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 'warn',
'n_jobs': None,
'oob_score': False,
'random_state': None,
'verbose': 0,
'warm_start': False}
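A short sketch of reading and changing these settings in code follows. The toy_clf name and the chosen values are assumptions for illustration. get_params() returns a plain dictionary, and set_params() updates hyperparameters on an existing estimator.

```python
from sklearn.ensemble import RandomForestClassifier

toy_clf = RandomForestClassifier(n_estimators=50)

# get_params() returns a dict mapping hyperparameter names to current values
params = toy_clf.get_params()
print(type(params).__name__, params["n_estimators"])

# set_params() changes hyperparameters in place and returns the estimator
toy_clf.set_params(n_estimators=200, max_depth=5)
updated = toy_clf.get_params()
print(updated["n_estimators"], updated["max_depth"])
```

The model needs to be fit again after set_params() for the new values to take effect.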
[13]: # Example of adjusting hyperparameters by hand
# Split data into X & y

X = heart_disease.drop("target", axis=1) # use all columns except target
y = heart_disease["target"] # we want to predict y using X
# Split data into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y)
# Instantiate two models with different settings

clf_1 = RandomForestClassifier(n_estimators=100)
clf_2 = RandomForestClassifier(n_estimators=200)
# Fit both models on training data

clf_1.fit(X_train, y_train)
clf_2.fit(X_train, y_train)
# Evaluate both models on test data and see which is best

print(clf_1.score(X_test, y_test))
print(clf_2.score(X_test, y_test))
0.868421052631579
0.8552631578947368
[14]: # Example of adjusting hyperparameters computationally (recommended)
from sklearn.model_selection import RandomizedSearchCV
# Define a grid of hyperparameters

grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
        "max_depth": [None, 5, 10, 20, 30],
        "max_features": ["auto", "sqrt"],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [1, 2, 4]}
# The train and test sets from the previous cell are reused here

# Set n_jobs to -1 to use all cores (NOTE: n_jobs=-1 is broken as of 8 Dec 2019, using n_jobs=1 works)
clf = RandomForestClassifier(n_jobs=1)
# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=clf,
                            param_distributions=grid,
                            n_iter=10, # try 10 models total
                            cv=5, # 5-fold cross-validation
                            verbose=2) # print out results
# Fit the RandomizedSearchCV version of clf

rs_clf.fit(X_train, y_train);
# Find the best hyperparameters

print(rs_clf.best_params_)
# Scoring automatically uses the best hyperparameters

rs_clf.score(X_test, y_test)
Fitting 5 folds for each of 10 candidates, totalling 50 fits

[CV] n_estimators=100, min_samples_split=2, min_samples_leaf=2,
max_features=sqrt, max_depth=10
max_features=sqrt, max_depth=10, total= 0.2s
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.2s remaining: 0.0s
max_features=auto, max_depth=None
max_features=auto, max_depth=None, total= 0.2s
max_features=auto, max_depth=20
max_features=auto, max_depth=20, total= 0.0s
[Parallel(n_jobs=1)]: Done 50 out of 50 | elapsed: 31.0s finished
packages/sklearn/model_selection/_search.py:814: DeprecationWarning: The default
of the `iid` parameter will change from True to False in version 0.22 and will
be removed in 0.24. This will change numeric results when test-set sizes are
unequal.
DeprecationWarning)
{'n_estimators': 1000, 'min_samples_split': 4, 'min_samples_leaf': 2,
'max_features': 'auto', 'max_depth': 5}
[14]: 0.819672131147541
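The "10 candidates, totalling 50 fits" line in the output above follows from simple counting. The sketch below uses only the standard library to approximate what the search does with the grid; it does not call Scikit-Learn, and the seed is an arbitrary assumption. It counts every combination in the grid and then draws n_iter of them without replacement.

```python
import itertools
import math
import random

grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
        "max_depth": [None, 5, 10, 20, 30],
        "max_features": ["auto", "sqrt"],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [1, 2, 4]}

# Every combination an exhaustive grid search would have to try
all_combinations = [dict(zip(grid, values))
                    for values in itertools.product(*grid.values())]
n_combinations = math.prod(len(values) for values in grid.values())

# Random search only evaluates n_iter of them, each with cv folds
n_iter, cv = 10, 5
random.seed(0)  # arbitrary seed, only to make this sketch repeatable
candidates = random.sample(all_combinations, n_iter)
total_fits = n_iter * cv

print(n_combinations, len(candidates), total_fits)
```

Trying all 540 combinations with 5 folds would take 2700 fits, which is why a random sample of the grid is a common first step.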
1.7 6. Save and reload your trained model

You can save and load a model with pickle.
[15]: # Saving a model with pickle

import pickle
# Save an existing model to file

with open("rs_random_forest_model_1.pkl", "wb") as f:
    pickle.dump(rs_clf, f)
[16]: # Load a saved pickle model

with open("rs_random_forest_model_1.pkl", "rb") as f:
    loaded_pickle_model = pickle.load(f)
# Evaluate loaded model
loaded_pickle_model.score(X_test, y_test)
[16]: 0.819672131147541
You can do the same with joblib. joblib is usually more efficient for objects that carry large
NumPy arrays, which fitted Scikit-Learn models often do.
[17]: # Saving a model with joblib

from joblib import dump, load
# Save a model to file

dump(rs_clf, filename="gs_random_forest_model_1.joblib")
[17]: ['gs_random_forest_model_1.joblib']
[18]: # Import a saved joblib model

loaded_joblib_model = load(filename="gs_random_forest_model_1.joblib")
[19]: # Evaluate joblib predictions

loaded_joblib_model.score(X_test, y_test)
[19]: 0.819672131147541
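The save-and-reload round trip can be checked end to end with a small, self-contained sketch. The synthetic data, the toy_ names and the temporary file name are assumptions for illustration. It writes a fitted model to a temporary directory, loads it back, and confirms that both copies make identical predictions.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy model to persist
toy_X, toy_y = make_classification(n_samples=100, n_features=4, random_state=0)
toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")  # hypothetical file name
    # The with-blocks close the file handles even if an error occurs
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(model_path, "rb") as f:
        reloaded_model = pickle.load(f)

# The reloaded model should behave exactly like the original
same_predictions = bool((reloaded_model.predict(toy_X) == toy_model.predict(toy_X)).all())
print(same_predictions)
```

Pickle and joblib files can execute code when loaded, so only load model files from sources you trust, ideally with the same Scikit-Learn version that saved them.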
1.8 7. Putting it all together (not pictured)

We can put a number of different Scikit-Learn functions together using Pipeline.
As an example, we’ll use car-sales-extended-missing-data.csv, which has missing data as
well as non-numeric data. For most machine learning models to work, there can be no missing data or
non-numeric values.
The problem we’re solving here is predicting a car’s sale price given a number of parameters about
the car (a regression problem).
[20]: # Getting data ready

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
# Modelling
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
# Setup random seed

import numpy as np
np.random.seed(42)
# Import data and drop the rows with missing labels

data = pd.read_csv("../data/car-sales-extended-missing-data.csv")
data.dropna(subset=["Price"], inplace=True)
# Define different features and transformer pipelines

categorical_features = ["Make", "Colour"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])
door_feature = ["Doors"]
door_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=4))])
numeric_features = ["Odometer (KM)"]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))
])
# Setup preprocessing steps (fill missing values, then convert to numbers)

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, categorical_features),
        ("door", door_transformer, door_feature),
        ("num", numeric_transformer, numeric_features)])
# Create a preprocessing and modelling pipeline

model = Pipeline(steps=[("preprocessor", preprocessor),
                        ("model", RandomForestRegressor())])
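# Split data
X = data.drop("Price", axis=1)
y = data["Price"]
# The fit below needs a split of the car sales data (test_size=0.2 is assumed here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Fit and score the model

model.fit(X_train, y_train)
model.score(X_test, y_test)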
[20]: 0.1821575815702311
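To see what the preprocessing part of such a pipeline produces, here is a cut-down sketch on a four-row toy table. The toy values are made up; only the column names "Make" and "Odometer (KM)" come from the cell above. Missing categories are filled with the string "missing" and one-hot encoded, and missing odometer readings are filled with the column mean.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows with one missing value in each column
toy = pd.DataFrame({
    "Make": pd.Series(["Toyota", "Honda", np.nan, "Toyota"], dtype=object),
    "Odometer (KM)": [35000.0, np.nan, 84000.0, 61000.0],
})

categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])
numeric = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean"))])

preprocessor = ColumnTransformer(transformers=[
    ("cat", categorical, ["Make"]),
    ("num", numeric, ["Odometer (KM)"])])

transformed = preprocessor.fit_transform(toy)
# ColumnTransformer may return a sparse matrix; make it dense for printing
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()
transformed = np.asarray(transformed, dtype=float)

# Columns: one per Make category (including "missing"), then the odometer
print(transformed.shape)
print(transformed)
```

Wrapping these steps in a Pipeline together with the model means the same filling and encoding is learned on the training data and then applied unchanged to the test data.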