Scikit-Learn: What We're Covering
June 8, 2023
We’ll use two datasets for demonstration purposes:
* heart_disease - a classification dataset (predicting whether someone has heart disease or not)
* boston_df - a regression dataset (predicting the median house price of towns and suburbs in the Boston area)
# Regression data
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston() # loads as a dictionary-like object
# Convert dictionary to dataframe
boston_df = pd.DataFrame(boston["data"], columns=boston["feature_names"])
boston_df["target"] = pd.Series(boston["target"])
1.2 1. Get the data ready
[3]: # Split data into X & y
X = heart_disease.drop("target", axis=1) # use all columns except target
y = heart_disease["target"] # we want to predict y using X
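The later cells use `X_train`, `X_test`, `y_train` and `y_test`, which this excerpt never defines. A minimal sketch of how such a split is usually produced, assuming a small made-up DataFrame in place of `heart_disease` (the `toy_df` name, its columns and its values are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for heart_disease: two features and a binary target
toy_df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 57, 56, 44, 52, 57],
    "chol": [233, 250, 204, 236, 354, 192, 294, 263, 199, 168],
    "target": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

# Same X/y split as above: every column except the target, then the target
X_toy = toy_df.drop("target", axis=1)
y_toy = toy_df["target"]

# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_train_toy.shape, X_test_toy.shape)
```

The model only ever sees the training rows during `fit()`, so the test rows give an honest estimate of how it handles unseen data.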
1.4 3. Fit the model to the data and make a prediction
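[7]: # All models/estimators have the fit() method built in
clf.fit(X_train, y_train)
# Make predictions: class labels and class probabilities
y_preds = clf.predict(X_test)
y_probs = clf.predict_proba(X_test)
# View preds/probabilities
y_preds, y_probs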
/Users/daniel/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-
packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of
n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
[7]: (array([0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,
0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,
0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0,
1, 1, 1, 1, 1, 0, 1, 0, 1, 0]), array([[0.5, 0.5],
[0.2, 0.8],
[0.4, 0.6],
[0.5, 0.5],
[0.8, 0.2],
[0.9, 0.1],
[0.5, 0.5],
[0.9, 0.1],
[0.2, 0.8],
[1. , 0. ],
[0.5, 0.5],
[0.8, 0.2],
[0.5, 0.5],
[0.5, 0.5],
[0.2, 0.8],
[0.5, 0.5],
[0.6, 0.4],
[0.2, 0.8],
[0.4, 0.6],
[1. , 0. ],
[0.5, 0.5],
[0.5, 0.5],
[0.9, 0.1],
[0.6, 0.4],
[0.2, 0.8],
[0.4, 0.6],
[0.5, 0.5],
[0.4, 0.6],
[0.9, 0.1],
[0.7, 0.3],
[0.7, 0.3],
[0.2, 0.8],
[1. , 0. ],
[0.1, 0.9],
[0.6, 0.4],
[0.8, 0.2],
[1. , 0. ],
[0.1, 0.9],
[1. , 0. ],
[0.5, 0.5],
[0.6, 0.4],
[1. , 0. ],
[0.3, 0.7],
[0. , 1. ],
[0.9, 0.1],
[0.6, 0.4],
[0. , 1. ],
[0.3, 0.7],
[0.8, 0.2],
[0.6, 0.4],
[0.4, 0.6],
[0.6, 0.4],
[0.2, 0.8],
[0.4, 0.6],
[0. , 1. ],
[0.9, 0.1],
[0.8, 0.2],
[0.3, 0.7],
[0. , 1. ],
[0.9, 0.1],
[0.1, 0.9],
[0.1, 0.9],
[0.3, 0.7],
[1. , 0. ],
[0.9, 0.1],
[0.6, 0.4],
[0.3, 0.7],
[0.3, 0.7],
[0. , 1. ],
[0.3, 0.7],
[0.1, 0.9],
[0.6, 0.4],
[0. , 1. ],
[0.7, 0.3],
[0. , 1. ],
[1. , 0. ]]))
[8]: 0.8026315789473685
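The two arrays above are related: each row of probabilities sums to 1, and the predicted label is the class with the highest probability. A small sketch of that relationship, assuming synthetic data from `make_classification` instead of the heart disease data (all `_demo` names are hypothetical):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (an assumption, not the course dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# 10 trees, like the old default mentioned in the warning above
clf_demo = RandomForestClassifier(n_estimators=10, random_state=0)
clf_demo.fit(X_demo[:150], y_demo[:150])

probs = clf_demo.predict_proba(X_demo[150:])  # shape (50, 2), one column per class
preds = clf_demo.predict(X_demo[150:])        # shape (50,), class labels

# Column order follows clf_demo.classes_, so column 1 is the positive class here
positive_probs = probs[:, 1]

# Rebuild the labels from the probabilities (ties go to the first class)
labels_from_probs = clf_demo.classes_[np.argmax(probs, axis=1)]
print(probs[:3], preds[:3])
```

With only 10 trees the probabilities tend to land on multiples of 0.1, which is why the output above is full of values like 0.2, 0.5 and 0.8.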
# Accuracy
from sklearn.metrics import accuracy_score, roc_auc_score
print(accuracy_score(y_test, y_preds))
# ROC AUC, computed here from hard class labels rather than probabilities
print(roc_auc_score(y_test, y_preds))
# Confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_preds))
# Classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_preds))
0.8026315789473685
0.804920304920305
[[33  4]
 [11 28]]
              precision    recall  f1-score   support
    accuracy                           0.80        76
   macro avg       0.81      0.80      0.80        76
weighted avg       0.81      0.80      0.80        76
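The accuracy and ROC AUC values printed above can be reproduced by hand from the confusion matrix. A worked sketch, assuming scikit-learn's layout where rows are true labels and columns are predicted labels (variable names are illustrative):

```python
# Confusion matrix from the output above: [[TN, FP], [FN, TP]]
tn, fp = 33, 4
fn, tp = 11, 28

total = tn + fp + fn + tp

# Accuracy: fraction of all predictions that were correct
accuracy = (tp + tn) / total

# Precision and recall for the positive class (label 1)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)        # also called the true positive rate
specificity = tn / (tn + fp)     # true negative rate

# F1 is the harmonic mean of precision and recall
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# With hard 0/1 predictions, ROC AUC reduces to the mean of TPR and TNR
auc_from_labels = (recall_1 + specificity) / 2

print(accuracy, auc_from_labels)
```

Passing probabilities (for example the positive-class column of `predict_proba`) to `roc_auc_score` instead of hard labels would generally give a different, usually more informative, AUC.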
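from sklearn.ensemble import RandomForestRegressor
# Fit a regressor on the training split, then predict on the test split
model = RandomForestRegressor()
model.fit(X_train, y_train)
y_preds = model.predict(X_test)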
0.8987155770408454
1.9618627450980388
7.75367352941176
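The cell that printed the three numbers above is missing from this excerpt. They look like R^2, mean absolute error and mean squared error, but treat that reading as an assumption. A sketch of how those three regression metrics are computed, using made-up toy values:

```python
import numpy as np

# Toy values, not taken from the Boston data
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

errors = y_true - y_pred

# Mean absolute error: average size of the errors, in the target's units
mae = np.mean(np.abs(errors))

# Mean squared error: average squared error, punishes large misses more
mse = np.mean(errors ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(r2, mae, mse)
```

scikit-learn exposes the same calculations as `r2_score`, `mean_absolute_error` and `mean_squared_error` in `sklearn.metrics`.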
[13]: # Example of adjusting hyperparameters by hand
0.868421052631579
0.8552631578947368
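The code behind the two scores above is not shown here. One way to adjust a hyperparameter by hand is to fit the same model with a few different values and compare scores. A minimal sketch, assuming `n_estimators` is the setting being varied and using synthetic data (both are assumptions, not the original cell):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the heart disease dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Try a few values of n_estimators and record the validation accuracy of each
scores = {}
for n in (10, 50, 100):
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_tr, y_tr)
    scores[n] = model.score(X_val, y_val)

# Pick the value with the highest validation accuracy
best_n = max(scores, key=scores.get)
print(scores, best_n)
```

Comparing settings on a validation split, rather than the final test split, keeps the test score an honest estimate.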
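from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
clf = RandomForestClassifier(n_jobs=1)
# Setup RandomizedSearchCV (grid is the hyperparameter dictionary defined earlier, not shown here)
rs_clf = RandomizedSearchCV(estimator=clf,
                            param_distributions=grid,
                            n_iter=10, # try 10 models total
                            cv=5, # 5-fold cross-validation
                            verbose=2) # print out results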
max_features=auto, max_depth=None
[CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4,
max_features=auto, max_depth=None, total= 0.3s
[CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4,
max_features=auto, max_depth=None
[CV] n_estimators=200, min_samples_split=6, min_samples_leaf=4,
max_features=auto, max_depth=None, total= 0.2s
[CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4,
max_features=sqrt, max_depth=20
[CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4,
max_features=sqrt, max_depth=20, total= 0.0s
[CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4,
max_features=sqrt, max_depth=20
[CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4,
max_features=sqrt, max_depth=20, total= 0.0s
[CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4,
max_features=sqrt, max_depth=20
[CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4,
max_features=sqrt, max_depth=20, total= 0.0s
[CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4,
max_features=sqrt, max_depth=20
[CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4,
max_features=sqrt, max_depth=20, total= 0.0s
[CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4,
max_features=sqrt, max_depth=20
[CV] n_estimators=10, min_samples_split=4, min_samples_leaf=4,
max_features=sqrt, max_depth=20, total= 0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=20
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=20, total= 0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=20
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=20, total= 0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=20
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=20, total= 0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=20
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=20, total= 0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=20
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=20, total= 0.0s
[CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1,
max_features=sqrt, max_depth=20
[CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1,
max_features=sqrt, max_depth=20, total= 0.1s
[CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1,
max_features=sqrt, max_depth=20
[CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1,
max_features=sqrt, max_depth=20, total= 0.1s
[CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1,
max_features=sqrt, max_depth=20
[CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1,
max_features=sqrt, max_depth=20, total= 0.1s
[CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1,
max_features=sqrt, max_depth=20
[CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1,
max_features=sqrt, max_depth=20, total= 0.1s
[CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1,
max_features=sqrt, max_depth=20
[CV] n_estimators=100, min_samples_split=6, min_samples_leaf=1,
max_features=sqrt, max_depth=20, total= 0.1s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2,
max_features=auto, max_depth=5
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2,
max_features=auto, max_depth=5, total= 1.4s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2,
max_features=auto, max_depth=5
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2,
max_features=auto, max_depth=5, total= 1.5s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2,
max_features=auto, max_depth=5
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2,
max_features=auto, max_depth=5, total= 1.4s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2,
max_features=auto, max_depth=5
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2,
max_features=auto, max_depth=5, total= 1.9s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2,
max_features=auto, max_depth=5
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=2,
max_features=auto, max_depth=5, total= 2.2s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1,
max_features=auto, max_depth=None
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1,
max_features=auto, max_depth=None, total= 2.8s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1,
max_features=auto, max_depth=None
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1,
max_features=auto, max_depth=None, total= 1.7s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1,
max_features=auto, max_depth=None
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1,
max_features=auto, max_depth=None, total= 1.6s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1,
max_features=auto, max_depth=None
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1,
max_features=auto, max_depth=None, total= 1.3s
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1,
max_features=auto, max_depth=None
[CV] n_estimators=1000, min_samples_split=4, min_samples_leaf=1,
max_features=auto, max_depth=None, total= 2.0s
[CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2,
max_features=sqrt, max_depth=10
[CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2,
max_features=sqrt, max_depth=10, total= 0.6s
[CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2,
max_features=sqrt, max_depth=10
[CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2,
max_features=sqrt, max_depth=10, total= 0.6s
[CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2,
max_features=sqrt, max_depth=10
[CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2,
max_features=sqrt, max_depth=10, total= 1.1s
[CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2,
max_features=sqrt, max_depth=10
[CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2,
max_features=sqrt, max_depth=10, total= 0.7s
[CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2,
max_features=sqrt, max_depth=10
[CV] n_estimators=500, min_samples_split=2, min_samples_leaf=2,
max_features=sqrt, max_depth=10, total= 0.6s
[CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4,
max_features=auto, max_depth=30
[CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4,
max_features=auto, max_depth=30, total= 1.4s
[CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4,
max_features=auto, max_depth=30
[CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4,
max_features=auto, max_depth=30, total= 1.4s
[CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4,
max_features=auto, max_depth=30
[CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4,
max_features=auto, max_depth=30, total= 1.3s
[CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4,
max_features=auto, max_depth=30
[CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4,
max_features=auto, max_depth=30, total= 1.3s
[CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4,
max_features=auto, max_depth=30
[CV] n_estimators=1200, min_samples_split=6, min_samples_leaf=4,
max_features=auto, max_depth=30, total= 1.3s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=30
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=30, total= 0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=30
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=30, total= 0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=30
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=30, total= 0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=30
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=30, total= 0.0s
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=30
[CV] n_estimators=10, min_samples_split=2, min_samples_leaf=4,
max_features=auto, max_depth=30, total= 0.0s
[Parallel(n_jobs=1)]: Done 50 out of 50 | elapsed: 31.0s finished
/Users/daniel/Desktop/ml-course/zero-to-mastery-ml/env/lib/python3.7/site-
packages/sklearn/model_selection/_search.py:814: DeprecationWarning: The default
of the `iid` parameter will change from True to False in version 0.22 and will
be removed in 0.24. This will change numeric results when test-set sizes are
unequal.
DeprecationWarning)
{'n_estimators': 1000, 'min_samples_split': 4, 'min_samples_leaf': 2,
'max_features': 'auto', 'max_depth': 5}
[14]: 0.819672131147541
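The log above shows 50 fits, which matches `n_iter=10` candidate settings times `cv=5` folds each. An end-to-end sketch of the same kind of search on synthetic data, assuming a deliberately small `demo_grid` (hypothetical, not the original `grid`) so it runs quickly. `max_features="auto"` from the log is left out because newer scikit-learn releases no longer accept it:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data standing in for the heart disease dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Hypothetical search space: each key is a RandomForestClassifier argument
demo_grid = {
    "n_estimators": [10, 50],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [1, 2, 4],
}

rs_demo = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=demo_grid,
    n_iter=5,          # sample 5 combinations at random
    cv=3,              # 3-fold cross-validation, so 15 fits in total
    random_state=42,   # make the sampled combinations repeatable
    verbose=0,
)
rs_demo.fit(X_tr, y_tr)

print(rs_demo.best_params_)        # best combination found
print(rs_demo.score(X_te, y_te))   # accuracy of the refit best model on test data
```

By default the search refits the best combination on the whole training set, which is why `rs_demo.score()` and `rs_demo.predict()` work directly afterwards.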
# Evaluate loaded model
loaded_pickle_model.score(X_test, y_test)
[16]: 0.819672131147541
You can do the same with joblib. joblib is usually more efficient than pickle for objects that carry large NumPy arrays, which fitted scikit-learn models often do.
[17]: ['gs_random_forest_model_1.joblib']
[19]: 0.819672131147541
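The joblib cells themselves are missing from this excerpt. A minimal sketch of the save/load round trip, assuming a small model trained on synthetic data and a hypothetical file name in a temporary folder:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small model on synthetic data, just to have something to save
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    saved = joblib.dump(model, path)   # returns a list of the file names written
    loaded = joblib.load(path)         # rebuilds the fitted model from disk

# The loaded model should behave exactly like the one that was saved
same_predictions = bool((loaded.predict(X_demo) == model.predict(X_demo)).all())
print(saved, same_predictions)
```

Like pickle, joblib files can run arbitrary code when loaded, so only load model files from sources you trust.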
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
# Modelling
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
np.random.seed(42)
door_feature = ["Doors"]
door_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=4))])
# Split data
X = data.drop("Price", axis=1)
y = data["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
[20]: 0.1821575815702311
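Only fragments of the pipeline survive in this excerpt. A sketch of how a `door_transformer` like the one above can be combined with a model in a single `Pipeline`, assuming a tiny made-up table (only the `Doors` and `Price` column names come from the original; `Odometer`, `num_transformer` and the values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical toy data with missing values
toy = pd.DataFrame({
    "Doors": [4, np.nan, 3, 4, 5, np.nan, 4, 3],
    "Odometer": [35000, 120000, np.nan, 60000, 15000, 90000, 80000, 45000],
    "Price": [15000, 6000, 9000, 12000, 20000, 7000, 8000, 11000],
})

# Fill missing door counts with 4, as in the cell above
door_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=4))])
# Fill other missing numeric values with the column mean (an assumption)
num_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))])

# Route each group of columns to its own transformer
preprocessor = ColumnTransformer(transformers=[
    ("door", door_transformer, ["Doors"]),
    ("num", num_transformer, ["Odometer"]),
])

# Preprocessing and modelling chained into one estimator
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(n_estimators=10, random_state=42)),
])

X_toy = toy.drop("Price", axis=1)
y_toy = toy["Price"]
model.fit(X_toy, y_toy)

# Inspect what the fitted preprocessing step hands to the model
filled = model.named_steps["preprocessor"].transform(X_toy)
print(filled)
```

Because imputation lives inside the pipeline, the fill values are learned from whatever data is passed to `fit()`, which helps avoid leaking information from the test split.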