Applied Predictive Modeling: Central Iowa R Users Group
Max Kuhn
Pfizer R&D
“Predictive Modeling”
Define That!
Predictive Modeling
is the process of creating a model whose primary goal is to achieve high
levels of accuracy.
In other words, a situation where we are concerned with making the best
possible prediction on an individual data instance.
(aka pattern recognition) (aka machine learning)
For example, does anyone care why an email or SMS is labeled as spam?
In drug discovery, we want compounds that are:
biologically potent
safe
soluble, permeable, drug–like, etc.
Individual cell results are aggregated so that decisions can be made about
specific compounds
Improperly segmented objects might compromise the quality of the data, so
an algorithmic filter is needed.
In this application, we have measurements of the size, intensity, and shape of
several parts of the cell (e.g. nucleus, cell body, cytoskeleton).
Can these measurements be used to predict poor segmentation using a
set of manually labeled cells?
> dim(segmentationData)
[1] 2019   61
The more data we spend, the better estimates we’ll get (provided the data
is accurate). Given a fixed amount of data,
too much spent in training won’t allow us to get a good assessment
of predictive performance. We may find a model that fits the training
data very well, but is not generalizable (over–fitting)
too much spent in testing won’t allow us to get good estimates of
model parameters
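The seg_train and seg_test data frames used throughout are not created on these slides; a plausible sketch, assuming the split stored in the segmentation data's own Case column is used (which matches the 1009 training and 1010 test cells reported later):

> ## Load the cell segmentation data and split it with its built-in Case column
> library(caret)
> data(segmentationData)
> seg_train <- subset(segmentationData, Case == "Train")
> seg_test  <- subset(segmentationData, Case == "Test")
> ## Drop the identifier columns so only Class and the 58 predictors remain
> seg_train$Cell <- seg_train$Case <- NULL
> seg_test$Cell  <- seg_test$Case  <- NULL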
[Figure: example data plotted by Predictor A and Predictor B]
Over–Fitting
On the next slide, two classification boundaries are shown for a
different model type not yet discussed.
The difference in the two panels is solely due to different choices of tuning
parameters.
One over–fits the training data.
[Figure: classification boundaries for Model #1 and Model #2, plotted by Predictor A and Predictor B]
These procedures repeatedly split the training data into subsets used for
modeling and performance evaluation.
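As an aside, a minimal sketch of creating such resamples directly with caret (train does this automatically, so this is only illustrative):

> ## Five repeats of 10-fold cross-validation indices, stratified by class
> set.seed(20792)
> cv_splits <- createMultiFolds(seg_train$Class, k = 10, times = 5)
> length(cv_splits)  ## 50 resamples in total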
[Figure: resampled accuracy (cross−validation) as a function of #Neighbors]
K–Nearest Neighbors Tuning – Individual Resamples
[Figure: accuracy of the individual resamples as a function of #Neighbors]
Typical Process for Model Building
Now that we know how to evaluate models on the training set, we can try
different techniques (including pre-processing) and optimize model
performance.
Performance might not be the only consideration. Others might include:
simplicity of prediction
reducing the number of predictors (aka features) in the model to
reduce cost or complexity
smoothness of the prediction equation
robustness of the solution
Once we have 1-2 candidate models, we can evaluate the results on the
test set.
A simple model for fitting linear class boundaries to these data is linear
discriminant analysis (LDA).
The model computes the mean vector for the data within each class and a
common covariance matrix across the entire training set, then uses the
differences (discriminant functions):

D(u) = u' S^{-1} (x̄_PS − x̄_WS)
A few packages have this model, but we’ll use the lda function in the
MASS package:
> library(MASS)
> lda_fit <- lda(Class ~ ., data = seg_train, tol = 1.0e-15)
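The output below presumably comes from predicting a few individual cells with this fit; a minimal sketch of such a call (the exact rows shown on the slide are not identified, so the first three test-set cells are used here for illustration):

> ## Hypothetical call: predict a few held-out cells with the LDA fit
> predict(lda_fit, newdata = seg_test[1:3, ])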
$class
[1] WS WS PS
Levels: PS WS
$posterior
PS WS
2 0.26358158 0.73641842
3 0.05269107 0.94730893
4 0.95044135 0.04955865
$x
LD1 LD2
2 1.055130 -1.50625016
3 2.015937 -0.26134111
4 -1.013353 -0.07737231
For our example, let’s choose the event to be a poorly segmented cell:
Sensitivity = (# PS predicted to be PS) / (# truly PS)

Specificity = (# truly WS predicted to be WS) / (# truly WS)
With two classes the Receiver Operating Characteristic (ROC) curve can
be used to estimate performance using a combination of sensitivity and
specificity.
Here, many alternative cutoffs are evaluated and, for each cutoff, we
calculate the sensitivity and specificity.
The ROC curve plots the sensitivity (i.e. the true positive rate) against one
minus the specificity (i.e. the false positive rate).
The area under the ROC curve is a common metric of performance.
[Figure: an example ROC curve with several probability cutoffs labeled: 0.00 (Sp = 0.00, Sn = 1.00), 0.20 (Sp = 0.76, Sn = 0.97), 0.50 (Sp = 0.86, Sn = 0.93), 0.80 (Sp = 0.95, Sn = 0.82)]
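The lda_test_pred object referenced below is not created on these slides; presumably it holds the LDA predictions for the test set, along the lines of this sketch using the pROC package (the roc call mirrors the one echoed in the output):

> ## Assumed steps: predict the whole test set, then build the ROC curve with pROC
> library(pROC)
> lda_test_pred <- predict(lda_fit, newdata = seg_test)
> lda_roc <- roc(response = seg_test$Class,
+                predictor = lda_test_pred$posterior[, "PS"])
> lda_roc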
Call:
roc.default(response = seg_test$Class, predictor = lda_test_pred$posterior[, "PS"])

Data: lda_test_pred$posterior[, "PS"] in 346 controls (seg_test$Class WS) < 664 cases (seg_test$Class PS).
Area under the curve: 0.874
[Figure: ROC curve for the LDA test-set predictions, with the 0.500 probability cutoff marked (specificity 0.723, sensitivity 0.827)]
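The confusion matrix below uses the test-set class predictions at the default 50% probability cutoff; a minimal sketch of the call, assuming caret's confusionMatrix and the lda_test_pred object sketched above:

> ## Assumed call: cross-tabulate predicted vs. observed classes for the test set
> confusionMatrix(data = lda_test_pred$class, reference = seg_test$Class)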
          Reference
Prediction  PS  WS
        PS 549  96
        WS 115 250
Accuracy : 0.7911
95% CI : (0.7647, 0.8158)
No Information Rate : 0.6574
P-Value [Acc > NIR] : <2e-16
Kappa : 0.5422
Mcnemar's Test P-Value : 0.2153
Sensitivity : 0.8268
Specificity : 0.7225
Pos Pred Value : 0.8512
Neg Pred Value : 0.6849
Prevalence : 0.6574
Detection Rate : 0.5436
Detection Prevalence : 0.6386
Balanced Accuracy : 0.7747
'Positive' Class : PS
Since there are many modeling packages written by different people, there
are some inconsistencies in how models are specified and predictions are
made.
For example, many models have only one method of specifying the model
(e.g. formula method only)
How can we get resampled estimates of the area under the ROC curve for
the LDA model (without going to the test set)?
Let’s use five repeats of 10-fold cross-validation to assess the area under
the ROC curve with the LDA model.
First, we need to specify the model terms and what type of technique we
are using:
> ## setting the seed before calling 'train' controls the resamples
> set.seed(20792)
> lda_mod <- train(Class ~ ., data = seg_train, method = "lda")
train can use the formula and the non–formula method. The two
interfaces may lead to different results for some models that do not need
dummy variable conversions of factors.
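For illustration, a sketch of the non-formula interface for the same model (the object name lda_mod2 is made up):

> ## Non-formula interface: pass the predictors and the outcome separately
> set.seed(20792)
> lda_mod2 <- train(x = seg_train[, names(seg_train) != "Class"],
+                   y = seg_train$Class,
+                   method = "lda")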
The default resampling scheme is the bootstrap. Let’s use five repeats of
10–fold cross–validation instead.
Instead of the default measures (accuracy and the Kappa statistic), let's
measure the area under the ROC curve, sensitivity, and specificity.
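caret's twoClassSummary function computes these three statistics from a data frame of observed classes, predicted classes, and one class-probability column per level. The contents of fakeData are not shown on the slide; the values below are made up purely to illustrate the expected structure:

> ## Hypothetical stand-in for the slide's fakeData (requires caret and pROC)
> fakeData <- data.frame(obs  = factor(c("PS", "PS", "WS", "WS"), levels = c("PS", "WS")),
+                        pred = factor(c("PS", "WS", "WS", "PS"), levels = c("PS", "WS")),
+                        PS   = c(0.90, 0.40, 0.20, 0.60),
+                        WS   = c(0.10, 0.60, 0.80, 0.40))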
> twoClassSummary(fakeData, lev = levels(fakeData$obs))
However, to calculate the ROC curve, we need the model to predict the
class probabilities. The classProbs option in trainControl turns this on:
Finally, we tell the function to optimize the area under the ROC curve
using the metric argument:
> ctrl <- trainControl(method = "repeatedcv", repeats = 5,
+ classProbs = TRUE,
+ summaryFunction = twoClassSummary)
>
> set.seed(20792)
> lda_mod <- train(Class ~ ., data = seg_train,
+ method = "lda",
+ ## Add the metric argument
+ trControl = ctrl, metric = "ROC",
+                   ## Also pass in options to 'lda' using '...'
+ tol = 1.0e-15)
To loop through the models and data sets, caret uses the foreach package,
which parallelizes for loops.
For example, the doMC package uses the multicore package, which forks processes to
split computations (on Unix and OS X). doParallel works well for Windows
(I'm told).
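A minimal sketch of registering a backend before calling train (the number of cores is arbitrary):

> ## On Unix/OS X: fork 4 workers; subsequent train calls run the resamples in parallel
> library(doMC)
> registerDoMC(cores = 4)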
> lda_mod
1009 samples
58 predictor
2 classes: 'PS', 'WS'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 908, 908, 909, 909, 908, 908, ...
Resampling results
The value 0.8735 is the average of the 50 resamples. The test set estimate
was 0.874.
Another new argument that we can pass to train is preProc. This applies
different types of pre–processing to the predictors and is done within the
resamples. The same pre–processing is automatically applied when predicting, too.
We will center and scale the predictors so that the distance metric isn’t
biased by scale.
> ## The same resamples are used
> set.seed(20792)
> knn_mod <- train(Class ~ ., data = seg_train,
+ method = "knn",
+ trControl = ctrl,
+ ## tuning parameter values to evaluate
+ tuneGrid = data.frame(k = seq(1, 25, by = 2)),
+ preProc = c("center", "scale"),
+ metric = "ROC")
k-Nearest Neighbors
1009 samples
58 predictor
2 classes: 'PS', 'WS'
ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 23.
> ggplot(knn_mod)
[Figure: resampled ROC (repeated cross−validation) as a function of #Neighbors]
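The predictions below presumably come from predict calls on a few new samples; a minimal sketch of those calls (using the first six test-set cells as an assumed example):

> ## Hypothetical calls: predicted classes, then class probabilities, for six cells
> predict(knn_mod, newdata = head(seg_test))
> predict(knn_mod, newdata = head(seg_test), type = "prob")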
[1] PS PS WS WS PS WS
Levels: PS WS
PS WS
1 0.9565217 0.04347826
2 0.8695652 0.13043478
3 0.2173913 0.78260870
4 0.3043478 0.69565217
5 1.0000000 0.00000000
6 0.4782609 0.52173913
Many of the predictors are skewed. Would transforming them via the
Yeo–Johnson transformation help?
> ## The same resamples are used
> set.seed(20792)
> knn_yj_mod <- train(Class ~ ., data = seg_train,
+ method = "knn",
+ trControl = ctrl,
+ tuneGrid = data.frame(k = seq(1, 25, by = 2)),
+ preProc = c("center", "scale", "YeoJohnson"),
+ metric = "ROC")
>
> ## What was the best area under the ROC curve?
> getTrainPerf(knn_yj_mod)
> ## Conduct a paired t-test on the resampled AUC values to control for
> ## resample-to-resample variability:
> compare_models(knn_yj_mod, knn_mod, metric = "ROC")
data: x
t = 4.4595, df = 49, p-value = 4.796e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.004999086 0.013200217
sample estimates:
mean of x
0.009099651
There’s a lot more to tell about predictive modeling in R and what caret
can do.
There are many functions for feature selection in the package. The website
has more information on this and other aspects.
Thanks for listening!