Akash Mahanty 1
R topics documented:
autoplot.partial
boston
partial
pdp
pima
plotPartial
topPredictors
Index

autoplot.partial
Description
Plots partial dependence functions (i.e., marginal effects) using ggplot2 graphics.
Usage
## S3 method for class 'partial'
autoplot(object, center = FALSE, plot.pdp = TRUE,
pdp.color = "red", pdp.size = 1, pdp.linetype = 1, rug = FALSE,
smooth = FALSE, smooth.method = "auto", smooth.formula = y ~ x,
smooth.span = 0.75, smooth.method.args = list(), contour = FALSE,
contour.color = "white", palette = "Spectral", train = NULL,
xlab = NULL, ylab = NULL, main = NULL, legend.title = NULL, ...)
Arguments
object An object that inherits from the "partial" class.
center Logical indicating whether or not to produce centered ICE curves (c-ICE curves).
Only useful when object represents a set of ICE curves; see partial for details.
Default is FALSE.
plot.pdp Logical indicating whether or not to plot the partial dependence function on top
of the ICE curves. Default is TRUE.
pdp.color Character string specifying the color to use for the partial dependence function
when plot.pdp = TRUE. Default is "red".
pdp.size Positive number specifying the line width to use for the partial dependence
function when plot.pdp = TRUE. Default is 1.
pdp.linetype Positive number specifying the line type to use for the partial dependence
function when plot.pdp = TRUE. Default is 1.
rug Logical indicating whether or not to include rug marks on the predictor axes.
Default is FALSE.
smooth Logical indicating whether or not to overlay a LOESS smooth. Default is FALSE.
smooth.method Character string specifying the smoothing method (function) to use (e.g., "auto",
"lm", "glm", "gam", "loess", or "rlm"). Default is "auto". See geom_smooth
for details.
smooth.formula Formula to use in smoothing function (e.g., y ~ x, y ~ poly(x, 2), or
y ~ log(x)).
smooth.span Controls the amount of smoothing for the default loess smoother. Smaller
numbers produce wigglier lines, larger numbers produce smoother lines. Default is
0.75.
smooth.method.args
List containing additional arguments to be passed on to the modelling function
defined by smooth.method.
contour Logical indicating whether or not to add contour lines to the level plot. Only
used when levelplot = TRUE. Default is FALSE.
contour.color Character string specifying the color to use for the contour lines when contour = TRUE.
Default is "white".
palette If a string, will use that named palette. If a number, will index into the list of
palettes of appropriate type. Default is "Spectral".
train Data frame containing the original training data. Only required if rug = TRUE
or chull = TRUE.
xlab Character string specifying the text for the x-axis label.
ylab Character string specifying the text for the y-axis label.
main Character string specifying the text for the main title of the plot.
legend.title Character string specifying the text for the legend title. Default is "yhat".
... Additional optional arguments to be passed on to geom_line.
Value
A "ggplot" object.
Examples
## Not run:
#
# Regression example (requires randomForest package to run)
#
## End(Not run)
boston
Description
Data on median housing values from 506 census tracts in the suburbs of Boston from the 1970
census. This data frame is a corrected version of the original data by Harrison and Rubinfeld
(1978) with additional spatial information. The data were taken directly from BostonHousing2 and
unneeded columns (i.e., name of town, census tract, and the uncorrected median home value) were
removed.
Usage
data(boston)
Format
A data frame with 506 rows and 16 variables.
References
Harrison, D. and Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of
Environmental Economics and Management, 5, 81-102.
Gilley, O.W., and R. Kelley Pace (1996). On the Harrison and Rubinfeld Data. Journal of
Environmental Economics and Management, 31, 403-405.
Newman, D.J. & Hettich, S. & Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine
learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html] Irvine, CA: University of
California, Department of Information and Computer Science.
Pace, R. Kelley, and O.W. Gilley (1997). Using the Spatial Configuration of the Data to Improve
Estimation. Journal of the Real Estate Finance and Economics, 14, 333-340.
Friedrich Leisch & Evgenia Dimitriadou (2010). mlbench: Machine Learning Benchmark
Problems. R package version 2.1-1.
Examples
head(boston)
partial
Description
Compute partial dependence functions (i.e., marginal effects) for various model fitting objects.
Usage
partial(object, ...)
## Default S3 method:
partial(object, pred.var, pred.grid, pred.fun = NULL,
grid.resolution = NULL, ice = FALSE, center = FALSE,
quantiles = FALSE, probs = 1:9/10, trim.outliers = FALSE,
type = c("auto", "regression", "classification"), inv.link = NULL,
which.class = 1L, prob = FALSE, recursive = TRUE, plot = FALSE,
smooth = FALSE, rug = FALSE, chull = FALSE, train, cats = NULL,
check.class = TRUE, progress = "none", parallel = FALSE,
paropts = NULL, ...)
Arguments
object A fitted model object of appropriate class (e.g., "gbm", "lm", "randomForest",
"train", etc.).
... Additional optional arguments to be passed onto predict.
pred.var Character string giving the names of the predictor variables of interest. For
reasons of computation/interpretation, this should include no more than three
variables.
pred.grid Data frame containing the joint values of interest for the variables listed in
pred.var.
pred.fun Optional prediction function that requires two arguments: object and newdata.
If specified, then the function must return a single prediction or a vector of
predictions (i.e., not a matrix or data frame). Default is NULL.
grid.resolution
Integer giving the number of equally spaced points to use for the continuous
variables listed in pred.var when pred.grid is not supplied. If left NULL, it
will default to the minimum of 51 and the number of unique data points
for each of the continuous independent variables listed in pred.var.
ice Logical indicating whether or not to compute individual conditional expectation
(ICE) curves. Default is FALSE. See Goldstein et al. (2015) for details.
center Logical indicating whether or not to produce centered ICE curves (c-ICE curves).
Only used when ice = TRUE. Default is FALSE. See Goldstein et al. (2015) for
details.
quantiles Logical indicating whether or not to use the sample quantiles of the continuous
predictors listed in pred.var. If quantiles = TRUE and grid.resolution = NULL
the sample quantiles will be used to generate the grid of joint values for which
the partial dependence is computed.
probs Numeric vector of probabilities with values in [0,1]. (Values up to 2e-14
outside that range are accepted and moved to the nearby endpoint.) Default is
1:9/10 which corresponds to the deciles of the predictor variables. These
specify which quantiles to use for the continuous predictors listed in pred.var when
quantiles = TRUE.
trim.outliers Logical indicating whether or not to trim off outliers from the continuous
predictors listed in pred.var (using the simple boxplot method) before generating
the grid of joint values for which the partial dependence is computed. Default is
FALSE.
type Character string specifying the type of supervised learning. Current options
are "auto", "regression" or "classification". If type = "auto" then
partial will try to extract the necessary information from object.
inv.link Function specifying the transformation to be applied to the predictions before
the partial dependence function is computed (experimental). Default is NULL
(i.e., no transformation). This option is intended to be used for models that
allow for non-Gaussian response variables (e.g., counts). For these models,
predictions are not typically returned on the original response scale by default. For
example, Poisson GBMs typically return predictions on the log scale. In this
case, setting inv.link = exp will return the partial dependence function on the
response (i.e., raw count) scale.
which.class Integer specifying which column of the matrix of predicted probabilities to use
as the "focus" class. Default is to use the first class. Only used for classification
problems (i.e., when type = "classification").
prob Logical indicating whether or not partial dependence for classification problems
should be returned on the probability scale, rather than the centered logit. If
FALSE, the partial dependence function is on a scale similar to the logit. Default
is FALSE.
recursive Logical indicating whether or not to use the weighted tree traversal method
described in Friedman (2001). This only applies to objects that inherit from class
"gbm". Default is TRUE which is much faster than the exact brute force approach
used for all other models. (Based on the C++ code behind plot.gbm.)
plot Logical indicating whether to return a data frame containing the partial
dependence values (FALSE) or plot the partial dependence function directly (TRUE).
Default is FALSE. See plotPartial for plotting details.
smooth Logical indicating whether or not to overlay a LOESS smooth. Default is FALSE.
rug Logical indicating whether or not to include a rug display on the predictor axes.
The tick marks indicate the min/max and deciles of the predictor distributions.
This helps reduce the risk of interpreting the partial dependence plot outside the
region of the data (i.e., extrapolating). Only used when plot = TRUE. Default
is FALSE.
chull Logical indicating whether or not to restrict the values of the first two variables
in pred.var to lie within the convex hull of their training values; this affects
pred.grid. This helps reduce the risk of interpreting the partial dependence
plot outside the region of the data (i.e., extrapolating). Default is FALSE.
train An optional data frame, matrix, or sparse matrix containing the original training
data. This may be required depending on the class of object. For objects that
do not store a copy of the original training data, this argument is required. For
reasons discussed below, it is good practice to always specify this argument.
cats Character string indicating which columns of train should be treated as
categorical variables. Only used when train inherits from class "matrix" or
"dgCMatrix".
check.class Logical indicating whether or not to make sure each column in pred.grid has
the correct class, levels, etc. Default is TRUE.
progress Character string giving the name of the progress bar to use while constructing
the partial dependence function. See create_progress_bar for details. Default
is "none".
parallel Logical indicating whether or not to run partial in parallel using a backend
provided by the foreach package. Default is FALSE.
paropts List containing additional options to be passed onto foreach when parallel = TRUE.
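The grid-related arguments above (grid.resolution, quantiles, probs, and trim.outliers) can be illustrated with a small base R sketch. This is only an approximation of the behavior described in the argument list, not the package's internal code; the use of boxplot.stats for the "simple boxplot method" and all object names (x, res, grid.equal, grid.quant, grid.trim) are assumptions made for illustration.

```r
x <- mtcars$hp  # a continuous predictor from a built-in data set

# Equally spaced grid with a resolution of min(51, number of unique values),
# mirroring the documented default for grid.resolution = NULL.
res <- min(51, length(unique(x)))
grid.equal <- seq(min(x), max(x), length.out = res)

# Quantile grid: the deciles, as described for quantiles = TRUE, probs = 1:9/10.
grid.quant <- unname(quantile(x, probs = 1:9 / 10))

# Boxplot-rule trimming before building the grid (cf. trim.outliers = TRUE).
# Assumption: boxplot.stats() is used here as a stand-in for the package's rule.
x.trim <- x[!x %in% boxplot.stats(x)$out]
grid.trim <- seq(min(x.trim), max(x.trim), length.out = res)
```

A quantile grid places more points where the data are dense, while trimming keeps the grid from stretching into regions supported by only a few extreme observations.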
Value
By default, partial returns an object of class c("data.frame", "partial"). If ice = TRUE and
center = FALSE then an object of class c("data.frame", "ice") is returned. If ice = TRUE
and center = TRUE then an object of class c("data.frame", "cice") is returned. These three
classes determine the behavior of the plotPartial function which is automatically called
whenever plot = TRUE. Specifically, when plot = TRUE, a "trellis" object is returned (see lattice
for details); the "trellis" object will also include an additional attribute, "partial.data",
containing the data displayed in the plot.
Note
In some cases it is difficult for partial to extract the original training data from object. In these
cases an error message is displayed requesting the user to supply the training data via the train
argument in the call to partial. In most cases where partial can extract the required training
data from object, it is taken from the same environment in which partial is called. Therefore, it
is important to not change the training data used to construct object before calling partial. This
problem is completely avoided when the training data are passed to the train argument in the call
to partial.
It is recommended to call partial with plot = FALSE and store the results. This allows for more
flexible plotting, and the user will not have to waste time calling partial again if the default plot
is not sufficient.
It is possible to retrieve the last printed "trellis" object, such as those produced by plotPartial,
using trellis.last.object().
If ice = TRUE or the prediction function given to pred.fun returns a prediction for each
observation in newdata, then the result will be a curve for each observation. These are called individual
conditional expectation (ICE) curves; see Goldstein et al. (2015) and ice for details.
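The relationship between ICE curves, centered ICE curves, and the partial dependence function can be sketched in base R. This is an illustration of the idea only, not how partial computes its results; the model, the grid, and the choice to center each curve at the first grid value are assumptions.

```r
# A model with an interaction, so the ICE curves are not all parallel.
fit <- lm(mpg ~ wt * hp, data = mtcars)
wt.grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 10)

# One row per observation, one column per grid value: each row is an ICE curve.
ice <- sapply(wt.grid, function(value) {
  newdata <- mtcars
  newdata$wt <- value                 # fix wt, keep the other predictors as observed
  predict(fit, newdata = newdata)
})

# Averaging the ICE curves pointwise gives the partial dependence function.
pd <- colMeans(ice)

# Centered ICE (c-ICE) curves: subtract each curve's value at the first grid
# point so that every curve starts at zero (assumed anchoring).
cice <- ice - ice[, 1]
```

The curves could then be drawn with, for example, matplot(wt.grid, t(ice), type = "l").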
References
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of
Statistics, 29: 1189-1232.
Goldstein, A., Kapelner, A., Bleich, J., and Pitkin, E. (2015). Peeking Inside the Black Box:
Visualizing Statistical Learning With Plots of Individual Conditional Expectation. Journal of
Computational and Graphical Statistics, 24(1): 44-65.
Examples
## Not run:
#
# Regression example (requires randomForest package to run)
#
#
# Individual conditional expectation (ICE) curves
#
#
# Classification example (requires randomForest package to run)
#
## End(Not run)
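As a further illustration of the inv.link argument of partial, the following base R sketch shows why an inverse link is needed for count models. It uses a Poisson GLM on the built-in warpbreaks data as a stand-in for the Poisson GBM mentioned in the argument description; the model and all object names are assumptions, and the averaging is done by hand rather than by partial.

```r
# Poisson regression on a built-in count data set.
fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

# Predictions on the link (log) scale versus the response (count) scale.
eta <- predict(fit, newdata = warpbreaks, type = "link")
mu  <- predict(fit, newdata = warpbreaks, type = "response")

# Applying the inverse link (exp) to the log-scale predictions recovers counts.
stopifnot(isTRUE(all.equal(exp(eta), mu)))

# Partial dependence on tension, applying exp to each prediction before
# averaging (our reading of the inv.link description).
pd.count <- sapply(levels(warpbreaks$tension), function(level) {
  newdata <- warpbreaks
  newdata$tension <- factor(level, levels = levels(warpbreaks$tension))
  mean(exp(predict(fit, newdata = newdata, type = "link")))
})
```

Without the exp step, the averages would be on the log scale and could not be read as expected counts.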
pdp
Description
Partial dependence plots (PDPs) help visualize the relationship between a subset of the features
(typically 1-3) and the response while accounting for the average effect of the other predictors in
the model. They are particularly effective with black box models like random forests and support
vector machines.
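The averaging idea behind a partial dependence plot can be sketched in a few lines of base R. The helper pd_brute below is hypothetical and written only for this illustration (it is not part of pdp); it fixes one predictor at each grid value in turn and averages the model's predictions over the training data.

```r
# Brute-force partial dependence for a single predictor.
# `pd_brute` is a hypothetical helper, not a pdp function.
pd_brute <- function(object, train, pred.var, grid) {
  vapply(grid, function(value) {
    newdata <- train
    newdata[[pred.var]] <- value              # fix the predictor at one grid value
    mean(predict(object, newdata = newdata))  # average over the other predictors
  }, numeric(1))
}

fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
wt.grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 10)
pd <- data.frame(wt = wt.grid, yhat = pd_brute(fit, mtcars, "wt", wt.grid))
print(pd)
```

For an additive linear model such as this one, the resulting curve is a straight line whose slope equals the fitted coefficient of wt; for black box models the same procedure can reveal nonlinear relationships.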
Details
The development version can be found on GitHub: https://github.com/bgreenwell/pdp. As of right
now, pdp exports four functions:
• partial - construct partial dependence functions (i.e., objects of class "partial") from
various fitted model objects;
• plotPartial - plot partial dependence functions (i.e., objects of class "partial") using
lattice graphics;
• autoplot - plot partial dependence functions (i.e., objects of class "partial") using ggplot2
graphics;
• topPredictors - extract most "important" predictors from various types of fitted models.
pima
Description
Diabetes test results collected by the US National Institute of Diabetes and Digestive and Kidney
Diseases from a population of women who were at least 21 years old, of Pima Indian heritage, and
living near Phoenix, Arizona. The data were taken directly from PimaIndiansDiabetes2.
Usage
data(pima)
Format
A data frame with 768 observations on 9 variables.
• pregnant Number of times pregnant.
• glucose Plasma glucose concentration (glucose tolerance test).
• pressure Diastolic blood pressure (mm Hg).
• triceps Triceps skin fold thickness (mm).
• insulin 2-Hour serum insulin (mu U/ml).
• mass Body mass index (weight in kg/(height in m)^2).
• pedigree Diabetes pedigree function.
• age Age (years).
• diabetes Factor indicating the diabetes test result (neg/pos).
References
Newman, D.J. & Hettich, S. & Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine
learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University
of California, Department of Information and Computer Science.
Brian D. Ripley (1996), Pattern Recognition and Neural Networks, Cambridge University Press,
Cambridge.
Grace Wahba, Chong Gu, Yuedong Wang, and Richard Chappell (1995), Soft Classification a.k.a.
Risk Estimation via Penalized Log Likelihood and Smoothing Spline Analysis of Variance, in D.
H. Wolpert (1995), The Mathematics of Generalization, 331-359, Addison-Wesley, Reading, MA.
Friedrich Leisch & Evgenia Dimitriadou (2010). mlbench: Machine Learning Benchmark
Problems. R package version 2.1-1.
Examples
head(pima)
plotPartial
Description
Plots partial dependence functions (i.e., marginal effects) using lattice graphics.
Usage
plotPartial(x, ...)
Arguments
x An object that inherits from the "partial" class.
... Additional optional arguments to be passed onto dotplot, levelplot, xyplot,
or wireframe.
center Logical indicating whether or not to produce centered ICE curves (c-ICE curves).
Only useful when x represents a set of ICE curves; see partial for details.
Default is FALSE.
plot.pdp Logical indicating whether or not to plot the partial dependence function on top
of the ICE curves. Default is TRUE.
pdp.col Character string specifying the color to use for the partial dependence function
when plot.pdp = TRUE. Default is "red".
pdp.lwd Integer specifying the line width to use for the partial dependence function when
plot.pdp = TRUE. Default is 1. See par for more details.
pdp.lty Integer or character string specifying the line type to use for the partial
dependence function when plot.pdp = TRUE. Default is 1. See par for more details.
rug Logical indicating whether or not to include rug marks on the predictor axes.
Default is FALSE.
train Data frame containing the original training data. Only required if rug = TRUE
or chull = TRUE.
smooth Logical indicating whether or not to overlay a LOESS smooth. Default is FALSE.
chull Logical indicating whether or not to restrict the first two variables in pred.var
to lie within the convex hull of their training values; this affects pred.grid.
Default is FALSE.
levelplot Logical indicating whether or not to use a false color level plot (TRUE) or a 3-D
surface (FALSE). Default is TRUE.
contour Logical indicating whether or not to add contour lines to the level plot. Only
used when levelplot = TRUE. Default is FALSE.
number Integer specifying the number of conditional intervals to use for the continuous
panel variables. See co.intervals and equal.count for further details.
overlap The fraction of overlap of the conditioning variables. See co.intervals and
equal.count for further details.
col.regions Color vector to be used if levelplot is TRUE. Defaults to the wonderful
Matplotlib 'viridis' color map provided by the viridis package. See viridis for
details.
Examples
## Not run:
#
# Regression example (requires randomForest package to run)
#
library(randomForest)
## End(Not run)
topPredictors
Description
Extract the most "important" predictors for regression and classification models.
Usage
topPredictors(object, n = 1L, ...)
## Default S3 method:
topPredictors(object, n = 1L, ...)
Arguments
object A fitted model object of appropriate class (e.g., "gbm", "lm", "randomForest",
etc.).
n Integer specifying the number of predictors to return. Default is 1 meaning
return the single most important predictor.
... Additional optional arguments to be passed onto varImp.
Details
This function uses the generic function varImp to calculate variable importance scores for each
predictor. The scores are then sorted, and the names of the n highest-scoring predictors are returned.
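The sort-and-select step described above can be sketched in base R. The importance scores here are a stand-in: absolute t-statistics from a linear model are used instead of varImp output, purely as an assumption for illustration, and the object names are hypothetical.

```r
# Stand-in importance scores: absolute t-statistics from a linear model.
# In practice the scores would come from varImp.
fit <- lm(mpg ~ ., data = mtcars)
tstat <- summary(fit)$coefficients[-1, "t value"]  # drop the intercept
scores <- abs(tstat)

# Sort in decreasing order and keep the names of the n highest-scoring predictors.
n <- 4
top <- names(sort(scores, decreasing = TRUE))[seq_len(n)]
print(top)
```

The resulting character vector has the same shape as what the description above says topPredictors returns, so it could be passed on as a set of predictor names of interest.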
Examples
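## Not run:
#
# Regression example (requires randomForest package to run)
#
library(randomForest)
# Assumed model fit: the original fitting code is not shown in this section
set.seed(101)
mtcars.rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)
# Top four predictors
top4 <- topPredictors(mtcars.rf, n = 4)
## End(Not run)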
Index
∗Topic datasets
boston
pima
autoplot.cice (autoplot.partial)
autoplot.ice (autoplot.partial)
autoplot.partial
boston
BostonHousing2
co.intervals
create_progress_bar
equal.count
foreach
geom_smooth
ggplot2
ice
lattice
par
partial
pdp
pdp-package (pdp)
pima
PimaIndiansDiabetes2
plot.gbm
plotPartial
predict
topPredictors
varImp
viridis