


Visualizing Trees and Forests

Simon Urbanek

AT&T Labs - Research, 180 Park Ave, Florham Park, NJ 07932


urbanek@research.att.com

1 Introduction

Tree-based models provide an appealing alternative to conventional models for many reasons. They are more readily interpretable, can handle both continuous and categorical covariates, can accommodate data with missing values, provide implicit variable selection and model interactions well. The most frequently used tree-based models are classification, regression and survival trees.
Visualization is important in conjunction with tree models because, in their graphical form, they are easily interpretable even without special knowledge. The interpretation of decision trees displayed as a hierarchy of decision rules is highly intuitive.

Moreover, tree models reflect properties of the underlying data and have other supplemental information associated with them, such as the quality of cut points, split stability and prediction trustworthiness. All this, along with the complex structure of the trees themselves, provides a wealth of information that needs to be explored and conveyed. Visualization is a powerful tool for presenting the key aspects of the models in a concise manner that allows quick comparisons.
In this chapter we first give a quick introduction to tree models and present techniques for visualizing individual trees, ranging from classical hierarchical views to less widely known methods such as treemaps and sectioned scatterplots.

In the next section we use visualization tools to discuss the stability of splits and entire tree models, motivating the use of tree ensembles and forests. Finally, we present methods for displaying entire forests at a glance and other ways of analyzing multiple tree models.

2 Individual Trees
The basic principle of all tree-based methods is a recursive partitioning of the covariate space to separate out subgroups that constitute a basis for prediction. Starting with the full data set, at each step a rule is consulted that specifies how the data are split into disjoint partitions. This process is repeated recursively until there is no rule defined for further partitioning.

Commonly used classification and regression trees use univariate decision rules in each partitioning step, that is, the rule specifying which cases fall into which partition evaluates only one data variable at a time. For a continuous variable the rule usually creates two partitions satisfying x_i < s and x_i ≥ s respectively, where s is a constant. Partitions induced by rules on categorical variables are based on the categories assigned to each partition. We often refer to a partitioning step as a split and to the value s as the cut point.
The recursive partitioning process can be described by a tree. The root node corresponds to the first split and its children to subsequent splits in the resulting partitions. The tree is built recursively in the same way as the partitioning, and terminal nodes (also called leaves) represent the final partitions. Therefore each inner node corresponds to a partitioning rule and each terminal node to a final partition.

Each final partition has an assigned prediction value or model. For classification trees the value is the predicted class; for regression trees it is the predicted constant, although more complex tree models exist, such as those featuring linear models in the terminal nodes. In the following we will mostly use classification trees with binary splits for illustration purposes, but all methods can be generalized to more complex tree models unless specifically stated otherwise. We call a tree consisting of rules in the inner nodes, regardless of the type of prediction in the leaves, a decision tree.
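As a concrete illustration, such a tree can be grown in R (the software used for the plots in this chapter) with the rpart package, which implements CART-style recursive partitioning. This is a minimal sketch; the data frame olive and its response column Region are assumed names for the Italian olive oil data introduced below.

    ## Sketch: growing a classification tree by recursive partitioning.
    library(rpart)

    ## 'olive' is assumed to hold the fatty acid proportions plus a factor
    ## 'Region' with the five consolidated regions of origin.
    fit <- rpart(Region ~ ., data = olive, method = "class")

    print(fit)              # textual listing of split rules and leaves
    plot(fit); text(fit)    # simple hierarchical view of the tree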

2.1 Hierarchical Views

Probably the most natural way to visualize a tree model is to display its hierarchical structure. Let us describe more precisely what it is we want to visualize. In order to describe the topology of a tree, we borrow some terminology from graph theory. A graph is a set of nodes (sometimes called vertices) and edges. A tree is then defined as a connected, acyclic graph. Topologically, decision trees are a special subset of those, namely connected directed acyclic graphs (DAGs) with exactly one node of indegree 0 (the root, which has no parent) and outdegrees other than 1 (i.e., at least two children or none at all).
In order to fully describe a decision tree, additional information is associated with each node. For inner nodes this information represents the splitting rule; for terminal nodes it consists of the prediction. Plots of tree models attempt to make this information visible in addition to displaying the graph aspect of the model. Three different ways to visualize the same classification tree model are shown in Fig. 1.

Fig. 1. Different ways to visualize a classification tree model.

The tree model is based on the Italian olive oil dataset (?), which records the composition of Italian olive oils from different regions of Italy. Each covariate corresponds to the proportion (in 1/10000) of a fatty acid (in order of concentration): oleic, palmitic, linoleic, stearic, palmitoleic, arachidic, linolenic and eicosenoic acid. The response variable is categorical and specifies the region of origin. The goal is to determine how the composition of olive oils varies across the regions of Italy. For illustration purposes we perform a classification using five regions: Sicily, Calabria, Sardinia, Apulia and North (the latter consolidating the regions north of Apulia).
Although the underlying model is the same for all plots in Fig. 1, the visual representation is different in each plot. Visualization of a tree model based on its hierarchical structure has to address the following tasks:
• placement of nodes
• visual representation of nodes
• visual representation of edges
• annotation

Each task can be used to represent additional information associated with the model or data. The visual representation of a node is probably the most obvious way to add such information. In the first (top-left) plot a node consists solely of a tick mark with an annotation describing the split rule for the left child. In the second (top-right) plot a node is represented by a rectangle whose size corresponds to the number of cases of the training data falling into that node; the annotation describes the split variable. Finally, in the third (bottom) plot each node is represented by a rectangle of the same size, but the colors within show the proportions of the classes falling into that node.
Advanced techniques known from area-based plots can be used in hierarchical views as well, if we consider nodes as area-based representations of the underlying data. Fig. 2 illustrates the use of censored zooming in conjunction with tree node size.

Fig. 2. Censored zoom of nodes: the bottom plot is a censored zoom (4×) of the top plot. Nodes that would appear too large are censored at a maximum allowed size and flagged by a red line.

The top plot shows the node representation without zoom, that is, the size of the root node corresponds to all of the data. All subsequent splits partition this data, and hence the node area, until terminal nodes are reached. If plotted truly proportionally, the last two leaves, split by the stearic variable, would be hardly visible. Therefore a minimal node size is enforced, and the fact that this representation is not truly proportional is denoted by a red border.
In order to allow a truly proportional comparison of small nodes, we can enlarge all nodes by a given factor. In the bottom plot a factor of four was used. Now those small nodes can be distinguished along with their class proportions, but large nodes would need to be four times as big as in the first plot, obscuring large portions of the plot and possibly other nodes. Therefore we also enforce a maximal node size. Again, to denote nodes that are not shown proportionally, this time due to upper censoring, we use a red line along the top edge of the node.
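The censoring logic itself is simple. A minimal sketch in R, where node sizes are case counts and areas are clamped to an allowed range (the function name and thresholds are illustrative, not from the original):

    ## Censored zooming sketch: scale node areas by a zoom factor, clamp
    ## them to [min_a, max_a] and flag nodes whose area had to be censored
    ## (those would be drawn with the red markings described above).
    censor_area <- function(n, total, zoom = 1, min_a = 0.005, max_a = 0.25) {
      a <- zoom * n / total                       # area proportional to cases
      data.frame(area     = pmin(pmax(a, min_a), max_a),
                 censored = a < min_a | a > max_a)
    }

    censor_area(c(323, 98, 25, 4), total = 450, zoom = 4)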
The placement of nodes is a task that has been discussed intensely in the graph visualization community. For small trees, simple approaches such as bottom-up space partitioning work well. As trees grow larger, node layout becomes more challenging. For tree model visualization, however, the associated information is in most cases more important than differences in local topology, especially where the structure is imposed by the tree growing algorithm. Therefore interactive approaches, allowing the user to explore the tree model by local magnification while retaining the global context, are recommended for large tree models.
In the above examples, basic lateral placement is performed by an equidistant partition of the available space. Only the first plot uses non-equidistant placement of nodes in the direction of tree depth: the distance between two nodes in this direction is proportional to the impurity decrease and thus, in a sense, to the ‘quality’ of the split. The third plot uses a special placement in that it is rotated by 90 degrees counter-clockwise relative to the usual representation, and all terminal nodes are aligned in order to facilitate easy comparison of the class proportions.
The visual representation of edges is usually restricted to drawing direct or orthogonal lines. Nevertheless, more elaborate representations are possible, such as polygons whose width is proportional to the number of cases following a particular path, creating a visual representation of the ‘flow’ of data through the tree.
Annotations are textual or symbolic representations displayed along the
nodes or edges. In Fig. 1 annotations describe predictions and splitting rules.
Although annotations can be useful, they should be used with caution, because
they can easily clutter the plot and thus distract from the key points to be
conveyed.
Overloading plots with information can offset the benefits of the plot, in particular its ability to provide information at a glance. When the representation of a node is too large, for example including a list of statistics or additional plots, it will consume so much space that only very few levels of the tree can be displayed on a screen. The same applies to a printed version, because the size of a sheet of paper is also limited. Therefore additional tools are necessary to keep track of the overall structure in order not to get lost.
Most of these tools, such as zoom, pan, an overview window or the toggling of different labels, are available in an interactive context only. Especially for analysis, visualization of additional information is required. There are basically two ways of providing such information:
• Integrate the information into the tree visualization.
• Use external linked graphics.
Direct integration is limited by the spatial constraints posed by the fixed dimensions of a computer screen or other output medium. Its advantage is the immediate impact on the viewer and therefore easier usage. It is recommended to use this kind of visualization for properties that are directly tied to the tree. It makes less sense to display a histogram of the underlying dataset directly in a node, because it displays derived information which can be shown more comfortably outside the tree, virtually linked to a specific node. It is more sensible to add information directly related to the tree structure, such as the criterion used for growing the tree.
External linked graphics are more flexible, because they are not displayed directly in the tree structure for each node but are only logically linked to a specific node. Spatial constraints are less of a problem, because a single graphic is displayed instead of one per node. The disadvantage of linked graphics is that they must be interpreted more carefully: the viewer has to bear in mind the logical link used to construct the graphic, as it is not visually attached to its source (a node, in our case).
There is no fixed rule as to what kind of information should be displayed inside or outside the tree structure. A rule of thumb says that any more complex graphic should use the external linked approach, whereas less complex information directly connected with the tree structure should be displayed within the tree visualization.

2.2 Recursive Views

In the introduction we described tree models as recursive partitioning methods. Consequently, it is only natural to display the partitions induced by the tree, resulting in an alternative way of visualizing tree models. In the following we describe visualization methods that are based on the partitioning aspect of the models instead of the hierarchical structure.

Sectioned scatterplots

Splitting rules are formulated in the covariate space, therefore one way to visualize a tree model is to visualize this space along with the induced partition boundaries. For univariate partitioning rules those boundaries lie on hyperplanes parallel to the covariate axes.

Due to this fact we can use a projection as simple as a two-dimensional scatterplot. In this view, all splits featuring either of the two plotted variables are clearly visible. Such a sectioned scatterplot, featuring the first two split variables, is shown in Fig. 3 along with the associated tree.

Fig. 3. Sectioned scatterplot (left) showing the root split and the splits in its children of a classification tree (right).

This is the same model as in Fig. 1. Each region is denoted by a particular color in the scatterplot, and partition boundaries are added based on the tree model.

The first split of the tree uses the eicosenoic variable to separate oils originating in northern Italy and Sardinia from the other regions. It is clearly visible in the sectioned scatterplot that this split is indeed very distinctive.

Both subsequent inner nodes use the linoleic variable: one separates Sardinian oils from the northern parts of Italy, the other separates Apulian oils from Sicily and Calabria. Further splits are no longer visible in this projection, because they feature other variables and are thus parallel to the visible plane. Nevertheless, it is possible to analyze such subsequent splits, especially in an interactive context, by a succession of sectioned scatterplots using drill-down techniques, following a few basic rules.
Fig. 4. Drill-down using a sectioned scatterplot based on a subgroup induced by the tree model.

Sectioned scatterplots should preferably be created using variables that are adjacent in the tree, that is, using the split variables of nodes that are connected by an edge. This ensures that both splits are visible in the projection.
Also, the plotted data should be restricted to the data falling into the node closer to the root. In Fig. 3 we used the entire dataset, since we were interested in showing the splits of the root node and its children. Fig. 4 presents a sectioned scatterplot based on data further down the tree, namely the partition in the bottom-right part of the scatterplot in Fig. 3.
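A sectioned scatterplot in the spirit of Fig. 3 can be sketched with base R graphics. The cut points below (6.5 for eicosenoic, 951 and 1053.5 for linoleic) are those of the tree in Fig. 1; olive is the assumed data frame from before.

    ## Sectioned scatterplot sketch: two split variables with the partition
    ## boundaries of the root split and of its two children overlaid.
    plot(olive$eicosenoic, olive$linoleic, col = as.integer(olive$Region),
         xlab = "eicosenoic", ylab = "linoleic")
    abline(v = 6.5)                          # root split: eicosenoic >= 6.5
    xr <- range(olive$eicosenoic)
    ## child splits on linoleic, drawn only within their parent partitions
    segments(6.5, 951, xr[2], 951)           # within eicosenoic >= 6.5
    segments(xr[1], 1053.5, 6.5, 1053.5)     # within eicosenoic < 6.5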
Sectioned scatterplots are useful for investigating the vicinity of a cut point. Some cut points are placed in a noisy area, others are much more clear-cut, as illustrated in Fig. 3. However, sectioned scatterplots cannot capture more than two covariates used in subsequent splits and thus remain suitable mainly for local analysis of a tree model. In an interactive setting, however, it is possible to quickly ‘drill down’ from Fig. 3 to Fig. 4. Linked highlighting further helps to retain the context, especially with the help of a linked hierarchical view.

A generalization to multiple dimensions is not straightforward. Although a rotating 3D scatterplot with splits shown as hyperplanes represented by meshes proves to be useful, higher dimensions are beyond reach.
Several extensions of sectioned scatterplots are useful in specific situations. In cases where the same variable is used several times in the tree at different depths, it is advisable to vary the opacity or luminance of the partition boundary lines according to their depth, making deeper splits lighter. This provides a visual aid when interpreting the plot.
Another technique involves shading the plot background based on either the depth of the visible partition (depth denoted in shades of gray: depth-shading) or the predicted value (semi-transparent category color or hue of the predicted value: prediction shading). The latter emphasizes misclassified cases, or outliers with a high absolute residual value, because correctly predicted cases blend better into the background.
Scatterplots are primarily useful for continuous variables. If a tree model uses categorical variables, a local treemap can prove useful. Such a plot is in principle similar to a mosaic plot, but categories falling into the same node are grouped together. We discuss treemaps in the following section.

Treemaps

One way of displaying all partitions is to use an area-based plot where each terminal node is represented by a rectangle. Treemaps belong to this plot category. The main idea is to partition the available rectangular plot space recursively in the same way that the tree model partitions the data. Treemaps are therefore data-driven representations of the model.

The rectangular area of the treemap corresponds to the full dataset. In the first step this area is partitioned horizontally according to the proportions of cases passed to each child node. In the next step each such partition is partitioned vertically corresponding to the case proportions in its children. This process is repeated recursively with alternating horizontal and vertical partitioning directions, as illustrated in Fig. 5, until terminal nodes are reached.

Fig. 5. Construction of a treemap consisting of three subsequent binary splits.

In the resulting plot each rectangle corresponds to a terminal node. The area of each rectangle is proportional to the number of cases falling into that terminal node. It is helpful to adjust the spaces between partitions to reflect the depth at which a given partitioning took place, showing splits closer to the root with larger gaps.
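The recursive construction translates directly into code. A minimal sketch, assuming a tree given as nested lists in which every node carries its case count n and inner nodes a list kids (this representation is an assumption for illustration; tiny nodes may degenerate since no minimum size is enforced):

    ## Treemap layout sketch: divide the rectangle (x0,y0)-(x1,y1) among
    ## the children proportionally to case counts, alternating direction
    ## and halving the gap with depth, so earlier splits get larger gaps.
    treemap_rects <- function(node, x0 = 0, y0 = 0, x1 = 1, y1 = 1,
                              horiz = TRUE, gap = 0.02) {
      if (is.null(node$kids))                  # leaf: emit one rectangle
        return(data.frame(x0 = x0, y0 = y0, x1 = x1, y1 = y1))
      prop <- sapply(node$kids, `[[`, "n") / node$n
      out <- NULL
      at <- if (horiz) x0 else y0
      for (i in seq_along(node$kids)) {
        w <- prop[i] * (if (horiz) x1 - x0 else y1 - y0)
        r <- if (horiz)
          treemap_rects(node$kids[[i]], at, y0, at + w - gap, y1, FALSE, gap/2)
        else
          treemap_rects(node$kids[[i]], x0, at, x1, at + w - gap, TRUE, gap/2)
        out <- rbind(out, r)
        at <- at + w
      }
      out
    }

    ## Example: 100 cases split 60/40; the 60 are split again into 45/15.
    tr <- list(n = 100, kids = list(
      list(n = 60, kids = list(list(n = 45), list(n = 15))),
      list(n = 40)))
    r <- treemap_rects(tr)
    plot(NULL, xlim = c(0, 1), ylim = c(0, 1), xlab = "", ylab = "")
    rect(r$x0, r$y0, r$x1, r$y1, col = "grey80")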
Treemaps are useful for assessing the balance of a tree model. In very noisy scenarios trees tend to split off small, reasonably homogeneous subgroups, while leaving a large chunk of cases that are hard to separate in one node. Such behavior is easily detected in treemaps as large terminal nodes.
Moreover, treemaps are suitable for highlighting and brushing, allowing the comparison of groups within terminal nodes. A treemap of the model from Fig. 1 with colors stacked by group is shown in Fig. 6.

Fig. 6. Treemap with stacked bars representing response classes. Color coding and
data are the same as in Fig. 3.

It is clearly visible that the tree model is able to split off large homogeneous groups successfully, but that further splitting is necessary for the nodes visible in the upper-left part of the plot.
Treemaps as described here are an extension of those used in computer science for the visualization of hierarchically stored content. They are also related to mosaic plots. More precisely, a mosaic plot is a treemap of a decomposition tree, that is, a tree whose splits of the same depth use the same categorical splitting variable and have as many children as there are categories in the data.
The main advantage of treemaps is their very efficient use of display space. They allow absolute comparison of node and subgroup sizes while maintaining the context of the tree model. They scale well with both increasing data set size and increasing tree model complexity. What they cannot show is information about the splitting criteria, and they do not allow direct relative comparison of groups within nodes. An alternative visualization technique exists for the latter task.

Spineplots of leaves

Another useful plot for tree model visualization is the spineplot of leaves (SPOL). By not alternating the partitioning direction as in treemaps, but constantly using horizontal partitioning, we obtain a plot showing all terminal nodes in one row.

Due to the fixed height, it is possible to visually compare the sizes of the terminal nodes, which are proportional to the widths of the corresponding bars. Moreover, relative proportions of groups are easily comparable when using highlighting or brushing.
A sample spineplot of leaves is shown in Fig. 7. The displayed data and model are the same as in Fig. 6, as is the color brushing. Each bar corresponds to a leaf, and the width of each bar is proportional to the number of cases in that particular node.

Fig. 7. Spineplot of leaves brushed by response categories with a superimposed tree model. The associated tree is sketched on top for easier identification of the individual leaves.

We can clearly see the relative proportions of groups within each node. In addition, it is possible to add a simple annotation on top of the plot in the form of a dendrogram of the represented tree. As with treemaps, it is advantageous to choose the sizes of the gaps between bars according to the depth of the split.
SPOLs are mainly useful for comparing group proportions within terminal nodes. They are closely related to spineplots, which allow the comparison of groups within the categories of a variable. They differ in that a ‘category’ is here the membership of a particular terminal node, and in that they use different rules for the gaps between bars.
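A quick approximation of a SPOL (without the depth-dependent gaps) is available in base R via spineplot, using the terminal-node membership that rpart records in fit$where; fit and olive are the assumed objects from the earlier sketch.

    ## Spineplot-of-leaves sketch: bar widths proportional to leaf sizes,
    ## stacked by response class; fit$where gives each case's leaf.
    leaf <- factor(fit$where)
    spineplot(leaf, olive$Region, xlab = "terminal node", ylab = "region")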
In this section we have discussed several alternative techniques for visualizing tree models based on the idea of recursive partitioning. The methods shown focus on the visualization of splits, their sequence, and the application of the model to data. One important property of all the visualization techniques presented is their applicability to arbitrary subsets of the data. Although most illustrations used training data and the corresponding fitted tree model, it is also feasible to visualize test data instead. Where a view of the training data highlights the adaptability of the model, the view of test data focuses on stability and overfitting. Moreover, it is possible to compare both views side by side.
This leads us to further important aspects of a tree model: the credibility and quality of the splits and of the entire model. In the next section we briefly discuss tree model construction and present visualization methods that incorporate information about split quality into both existing and new plots.

2.3 Fitting tree models

So far we have discussed methods for visualizing tree models both on their own and together with the data the models are applied to. There is, however, more information associated with each node that waits to be visualized. In order to understand tree models better, we need to know more about the process of fitting them.
Although a tree model is straightforward to interpret and apply, its construction is not trivial. In theory, we would like to consider all possible tree models and pick the one that fits the given data best, based on some loss function. Unfortunately this proves infeasible save for trivial examples, because the computational cost increases exponentially with tree size.
Therefore several other approaches have been suggested for fitting tree models. The most commonly used algorithm, CART (Classification and Regression Trees), was introduced by Breiman et al. (1984). It performs a greedy local optimization as follows: for a given node, consider all possible splits and choose the one which reduces the impurity of the child nodes most relative to the parent node. This decrease of impurity (and hence increase of purity) is assessed using an impurity criterion. The locally optimal split is then used, and the search is performed recursively in each child node.
The growing is stopped when one of the stopping rules is met. The most common stopping rules are a minimal number of cases in a node and a minimal requirement on the impurity decrease. In practice it is common to relax the stopping rules and use pruning methods; however, a discussion of pruning is beyond the scope of this chapter. Nevertheless, visualization can be useful for pruning, especially in an interactive context where pruning parameters can be changed on the fly and reflected in various displays.
Measures of impurity can be arbitrary convex functions, but the commonly used measures are entropy and the Gini index, which have theoretical foundations (cf. ?). It is important to note that this search looks for a local optimum only; it has no way of ‘looking ahead’ and considering multiple splits at once. Nevertheless, it is computationally inexpensive compared to a full search and performs considerably well in practice.
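For reference, the standard definitions (not spelled out in the original): for a node $t$ with class proportions $p_1, \dots, p_K$,

$$i_E(t) = -\sum_{k=1}^{K} p_k \log p_k, \qquad i_G(t) = 1 - \sum_{k=1}^{K} p_k^2,$$

and the impurity decrease of a split of $t$ into $t_L$ and $t_R$, containing $n_L$ and $n_R$ of the $n$ cases, is $\Delta i = i(t) - \big(n_L\, i(t_L) + n_R\, i(t_R)\big)/n$.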
The consequence of committing to a local optimum at each split point is a potential instability of the model. Small changes in the training data can cause a different split to be chosen. Although the alternative split may lead to a very similar decrease of impurity, the resulting partition can be entirely different. This has a big impact on any subsequent splits and thus produces an entirely different tree model. We now present a visualization technique that allows us to learn more about the decisions made at the node level during tree model fitting.

Mountain plots

The basic idea of a mountain plot (?) is to visualize the decrease of impurity over the entire range of the split variable. This is illustrated on a binary classification problem in Fig. 8. In this particular example the binary response denotes whether a patient was able to recover from diagnosed meningitis, and the predictor variable Age refers to the patient's age at the time of the diagnosis.

Fig. 8. Stacked dotplot side-by-side of the split variable (Age) and the target variable (Recover), along with the corresponding mountain plot showing the impurity decrease for each cut point. The optimal cut point is denoted by a solid red line, runner-up splits by dotted red lines.

The top part of the figure shows a stacked dotplot of the split variable grouped by the binary response. The bottom part shows the corresponding mountain plot. The value of the empirical impurity measure is constant between data points and can change only at values taken by the data. The impurity decrease is by definition zero outside the data range.

In the presented example it is clearly visible that there are three alternative splits that come very close to the ‘optimal’ cut point chosen by the greedy algorithm.
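A mountain plot for a continuous split variable takes only a few lines of R; this is a minimal sketch using the Gini index, with type = "s" producing the step function implied by the piecewise-constant impurity decrease. The data set name in the usage comment is an assumption.

    ## Mountain plot sketch: impurity decrease of every candidate binary
    ## cut point of one continuous variable, measured by the Gini index.
    gini <- function(y) { p <- prop.table(table(y)); 1 - sum(p^2) }

    mountain <- function(x, y) {
      n    <- length(y)
      cuts <- sort(unique(x))[-1]         # cut between consecutive values
      dec  <- sapply(cuts, function(s) {
        l <- y[x < s]; r <- y[x >= s]
        gini(y) - (length(l) * gini(l) + length(r) * gini(r)) / n
      })
      plot(cuts, dec, type = "s",
           xlab = "cut point", ylab = "impurity decrease")
      abline(v = cuts[which.max(dec)], col = "red")   # greedy optimum
    }

    ## e.g. mountain(meningitis$Age, meningitis$Recover)  # assumed data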

Fig. 9. Two mountain plots of the variables Rooms and LowStat and the corresponding scatterplots vs. the response variable. The optimal splits are denoted by red lines; the means for each partition are represented by gray lines in the scatterplots.

The competition for the best split is not limited to a single variable. Fig. 9 illustrates a competition between two different variables in a regression tree. The models are based on the Boston housing dataset of Harrison and Rubinfeld (1978).

Although both splits have almost identical impurity decrease maxima, the data show different patterns. The relationship seen in the left part of the plot is probably better modeled by a linear model, whereas on the right-hand side we see a change of behavior around the chosen cut point.
By plotting the mountain plots of candidate variables on the same scale, we can assess the stability of a split. If there is a dominating covariate with a clear optimum, the split will be stable. On the other hand, the presence of competing splits in the range of the optimal split indicates possible instability. Mountain plots also show which regions of competing variables are in the vicinity of the optimum, thus allowing domain knowledge to be taken into account.

The name ‘mountain plot’ derives from the fact that these plots usually resemble the profile of a mountain range. They are mainly useful for assessing the quality of a split along with potential competing splits. This information can be used to interactively influence the tree construction process, or to construct multiple tree models and compare their behavior.

3 Visualizing Forests
So far we have been discussing the visualization of individual tree models. We have seen, however, that there is an inherent volatility in the choice of splits that may affect the stability of a given model. Therefore it is useful to grow multiple trees. In the following we briefly introduce tree ensemble methods and present visualization methods for forests consisting of multiple tree models.

There are two main approaches to generating different tree models, by changing:
• training data: changes in the training data will produce different models if the original tree was unstable. Bootstrapping is a useful technique for assessing the variability of the model fitting process.
• splits: allowing locally suboptimal splits creates different partitions and prevents the greedy algorithm from getting stuck in a local optimum, which may not necessarily be a global optimum.
Model ensemble methods leverage the instability of individual models to improve prediction accuracy by constructing a predictor as an aggregate of multiple individual models. Bagging (Breiman, 1996) uses bootstrapping to obtain many tree models and combines their prediction results by aggregation: majority voting for classification trees and averaging for regression trees. In addition, random forests (Breiman, 2001) add randomness by choosing candidate split variables from a different random set in each node.
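A bare-bones version of bagging for classification can be written in a few lines of R; the 20 trees match the forest used in the next section, and olive/Region remain assumed names.

    ## Bagging sketch: fit trees on bootstrap samples, then aggregate the
    ## class predictions by majority vote.
    library(rpart)
    set.seed(1)
    forest <- lapply(1:20, function(i) {
      idx <- sample(nrow(olive), replace = TRUE)     # bootstrap sample
      rpart(Region ~ ., data = olive[idx, ], method = "class")
    })

    votes  <- sapply(forest, function(f)
      as.character(predict(f, newdata = olive, type = "class")))
    bagged <- apply(votes, 1, function(v) names(which.max(table(v))))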

3.1 Split variables

Bootstrapping models provides a useful tool for analyzing properties of the fitted models and thereby shedding more light on the underlying data. One of the many advantages of tree models is their ability to perform implicit variable selection. Given a dataset, a tree growing algorithm will create a hierarchical structure of splits forming the tree. Only variables used in the splits are evaluated by the model; all other variables are implicitly dropped. In the following we illustrate the visualization of forests on the Wisconsin breast cancer data (?).

For this purpose we generate 20 trees using bootstrapping. In each bootstrap iteration we grow a tree using the regular CART algorithm. Let us first concentrate on the variables used in the models. A global overview is given in the left plot of Fig. 10.
Fig. 10. Left: frequency of the use of individual variables in 20 bootstrapped tree models. Right: cumulated deviance gain of splits featuring the corresponding variable.

Each bar displays how often the corresponding variable was used in the models. The most often used variable is UCS (20 times) and the least often used is Mts, which was used just once. Due to the rather small number of variables to choose from, no variable is omitted by all models.

Clearly this view is very coarse, because it does not take into account which role a variable plays in the models. The number of splits can double with increasing depth, whereas the number of involved cases decreases. Therefore the fact that a variable is used often does not necessarily mean that it is really important, especially if it is used mainly in the fringe for small groups. It is therefore advisable to weight the contribution of each split by a cumulative statistic such as the decrease of impurity.
The cumulative impurity decrease for each variable over the 20 bootstrapped trees is displayed in the right plot of Fig. 10. The variables in each plot are ordered by bar height, representing their importance. We see that UCS is by far the most influential variable, followed by UCH and BNi.
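One way to compute such weighted importances from fitted rpart trees is to read the loss (deviance) reduction of every inner node off the frame component; in rpart's node numbering the children of node k are 2k and 2k+1. This sketch builds on the forest list from the bagging example and is not the chapter's original code.

    ## Cumulated deviance gain per split variable of one rpart tree.
    var_gain <- function(fit) {
      fr <- fit$frame
      id <- as.integer(rownames(fr))            # node numbers
      inner <- fr$var != "<leaf>"
      gain <- sapply(id[inner], function(k)
        fr$dev[id == k] - sum(fr$dev[id %in% c(2 * k, 2 * k + 1)]))
      tapply(gain, factor(as.character(fr$var[inner])), sum)
    }

    ## Aggregate over the forest and draw the weighted barchart
    ## (cf. Fig. 10, right); variables never used contribute zero.
    vars <- setdiff(names(olive), "Region")
    imp  <- setNames(numeric(length(vars)), vars)
    for (f in forest) { g <- var_gain(f); imp[names(g)] <- imp[names(g)] + g }
    barplot(sort(imp, decreasing = TRUE), las = 2)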
When making inferences from the displayed information, we need to be cautious and keep the properties of trees in mind. Variable masking can heavily influence the results of such analyses. Given two highly correlated variables, it is very likely that they produce very similar split results, so the CART algorithm guided by the bootstrap will pick one of them at random. Once that decision is made, the other variable is unlikely to be used anymore. If one of the variables is ‘weaker’, it will hardly appear in any model, even though in the absence of the stronger variable it may still perform better than all the other variables.
In order to analyze this behavior, but also to see how different the tree models are, it is necessary to take both the variable and the individual tree into account. A two-dimensional weighted fluctuation diagram showing trees and split variables is presented in Fig. 11. Variables are plotted on the y-axis, the models on the x-axis. The area of each rectangle is proportional to the cumulative impurity decrease of all splits using a specific variable in that tree model. In general, fluctuation diagrams are useful for detecting patterns and making comparisons in both the x and y directions.
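A crude version of such a fluctuation diagram can be drawn from a trees-by-variables matrix of gains, reusing var_gain and vars from the previous sketch; side lengths are square roots of the gains so that the area encodes the gain.

    ## Fluctuation diagram sketch: one square per (tree, variable) cell
    ## with area proportional to the cumulated deviance gain.
    gm <- t(sapply(forest, function(f) {
      g <- setNames(numeric(length(vars)), vars)
      gg <- var_gain(f); g[names(gg)] <- gg; g
    }))
    xy   <- expand.grid(tree = seq_len(nrow(gm)), var = seq_len(ncol(gm)))
    side <- 0.9 * sqrt(gm[as.matrix(xy)]) / sqrt(max(gm))
    plot(NULL, xlim = c(0, nrow(gm) + 1), ylim = c(0, ncol(gm) + 1),
         xlab = "tree", ylab = "", yaxt = "n")
    symbols(xy$tree, xy$var, squares = side, inches = FALSE, add = TRUE)
    axis(2, at = seq_len(ncol(gm)), labels = colnames(gm), las = 2)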

Fig. 11. Fluctuation diagram of trees and variables, displaying the cumulated deviance gain of splits featuring each combination of tree and split variable.

Focusing on the largest gains, we can distinguish four different model groups. In 15 models UCS is the most influential variable, followed by UCH with 3 models and BNi and BCn with one model each. Looking at the large group of 15 models we can also spot several patterns. In 8 cases UCH is also used, although not contributing as heavily as in its dominant position, but there are another 7 cases where UCH is not used at all. Visually we get the impression that BNi replaces UCH in those cases, which hints at variable masking. We see similar behavior with UCS and UCH, too. This overall impression indicates that bootstrapping indeed produces very different models, and we see a confirmation of the tree model instability for this dataset.
With a large number of trees, an alternative representation based on parallel coordinate plots can be used. Each coordinate corresponds to a tree and each case to a variable. The cumulative gain for each combination of tree and variable is then plotted on the axes. The ordering of the axes is important for obtaining a coherent picture. Possible heuristics include ordering by the value of the most influential variable, and distance measures based on the global weight of each variable.

3.2 Data view

The importance and use of variables in splits is just one aspect of tree models to consider. In Section 2.2 we discussed another way of visualizing trees which allows an assessment of cut points in the data context: sectioned scatterplots. Fortunately, sectioned scatterplots can also be used for the visualization of forests, preferably using semi-transparent partition boundaries.

Fig. 12. Sectioned scatterplot of a forest of 100 trees.

Such a sectioned scatterplot of a forest is shown in Fig. 12. In order to make the classification more difficult, we have increased the granularity of the response variable of the olive oil data to 9 regions. The sectioned scatterplot displays the variables linoleic vs. palmitoleic and the partition boundaries of 100 bootstrapped trees. The use of semi-transparent boundaries allows us to distinguish between occasionally used cut points, which appear as very faint lines, and frequently used cut points, shown in dark blue.
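Overlaying the boundaries of many trees only requires an alpha color. The sketch below is a simplification of the figure: it draws every cut point on the two displayed variables as a full-length line rather than a segment bounded by its parent partition, and it reads the cut points from each tree's splits matrix, which also contains competitor and surrogate splits.

    ## Forest boundary sketch: semi-transparent cut points of all trees
    ## on the plotted variable pair; darker lines mean more agreement.
    plot(olive$palmitoleic, olive$linoleic, col = as.integer(olive$Region),
         xlab = "palmitoleic", ylab = "linoleic")
    for (f in forest) {
      s <- f$splits
      abline(v = s[rownames(s) == "palmitoleic", "index"],
             h = s[rownames(s) == "linoleic",    "index"],
             col = rgb(0, 0, 1, 0.08))
    }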
In contrast to sectioned scatterplots for individual trees, we do not have the convenient ability to drill down, unless several models agree on the same subset. The aim of the visualization technique described in the next section is therefore to show all trees and their splits at a glance.

3.3 Trace plot

The aim of a trace plot is to allow the comparison of arbitrarily many trees with respect to splits, cut points and hierarchical structure. This is not possible with any of the visualization methods described so far.

The basis of the trace plot is a rectangular grid consisting of split variables as columns and node depths as rows. Each cell in this grid represents a possible tree node. In order to distinguish actual split points, each cell contains a glyph representing the possible split points. For a continuous variable this is a horizontal axis on which a split point is represented by a tick mark. Categorical variables are shown as boxes corresponding to the possible split combinations. Any two adjacent inner nodes are connected by an edge between their split points.
A classification tree and its trace plot are shown in Fig. 13. The root node features a split on the variable palmitoleic, which is represented by the rightmost column. Its child nodes use splits on the variables linoleic and oleic, hence the two edges leading from the root node to the next row of splits. There are no further inner nodes as children of the linoleic split, therefore the branch ends there. Analogously, all inner nodes are drawn in the trace plot until terminal nodes are reached.
It is evident that all splits of the tree can be reconstructed from its representation in the trace plot, because every cut point is shown there. Equally, it is possible to reconstruct the hierarchical structure of the tree, due to the presence of the edges.
Moreover, the trace plot removes an ambiguity known from hierarchical views: the order of a node's children is irrelevant for the model, whereas swapping left and right children in the hierarchical view produces quite different hierarchical plots. In a trace plot the order of the children is defined by the grid and is therefore fixed for all trees in the plot.
One important advantage of trace plots is the ability to display multiple tree models simultaneously by superimposing all models on the same grid. A trace plot of 100 bootstrapped classification trees is shown in Fig. 14. It confirms the ability of bootstrapping to produce models that deviate from certain local optima.
In order to prevent overplotting, we use semi-transparent edges. Consequently, frequently used paths are more opaque than infrequently used ones. We can clearly see that the first split always uses the palmitoleic variable. In the next step, however, there are several alternatives for the splits. Some patterns seem to be repeated further down the tree, indicating a rather stable subgroup that can be reached along several different paths through the tree. In this particular example we can recognize substructures that affirm the partial stability of the tree models.
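For completeness, a rough trace plot can be sketched from rpart objects. Extracting each node's primary split relies on rpart internals (the splits matrix lists, per inner node in frame order, the primary split followed by its ncompete competitors and nsurrogate surrogates), and only continuous split variables are handled, so treat this as an illustrative sketch rather than the KLIMT implementation.

    ## Primary split (variable, cut point) of every inner node of a tree.
    node_splits <- function(fit) {
      fr <- fit$frame
      inner <- fr$var != "<leaf>"
      rows  <- ifelse(inner, 1 + fr$ncompete + fr$nsurrogate, 0)
      first <- cumsum(c(1, head(rows, -1)))[inner]   # primary-split rows
      data.frame(node = as.integer(rownames(fr))[inner],
                 var  = as.character(fr$var[inner]),
                 cut  = fit$splits[first, "index"])
    }

    ## Trace plot sketch: columns are variables, rows are depths (0-6);
    ## each cut point is placed at its scaled position in its variable's
    ## column and linked to its parent split by a semi-transparent edge.
    trace_plot <- function(forest, data, response) {
      vars <- setdiff(names(data), response)     # assumes continuous vars
      rng  <- sapply(data[vars], range)
      plot(NULL, xlim = c(0, length(vars)), ylim = c(6, 0),
           xaxt = "n", xlab = "", ylab = "depth")
      axis(1, at = seq_along(vars) - 0.5, labels = vars, las = 2)
      abline(v = 0:length(vars), col = "grey")
      for (f in forest) {
        ns   <- node_splits(f)
        ns$x <- match(ns$var, vars) - 1 +
                (ns$cut - rng[1, ns$var]) / (rng[2, ns$var] - rng[1, ns$var])
        ns$d <- floor(log2(ns$node))
        p    <- match(ns$node %/% 2, ns$node)    # parent split, if any
        ok   <- !is.na(p)
        segments(ns$x[p[ok]], ns$d[p[ok]], ns$x[ok], ns$d[ok],
                 col = rgb(0, 0, 0, 0.1))
        points(ns$x, ns$d, pch = "|", cex = 0.5)
      }
    }

    ## e.g. trace_plot(forest, olive, "Region")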
The remaining instability in this particular example is in most cases due to the sequence in which the subgroups are separated. This is partially because we are dealing with a multi-class problem, so the reduction of impurity can be achieved by splitting off an arbitrary class or a group of classes. Nevertheless, our tree specimen from Fig. 13 is a rather rare one, as we can see in the trace plot in Fig. 14: its trace does not match the main, opaque paths.

Fig. 13. A classification tree and its trace plot.

Fig. 14. Trace plot of 100 bootstrapped trees.
main, opaque paths.

4 Conclusion

Tree models are very rich and versatile. Equally rich is the variety of possible visualization techniques that provide various views of trees, each shedding light on different properties of the models.
Hierarchical views are the most commonly used graphical representations and highlight the sequence of splits. They are easy to interpret even by untrained personnel. Node placement and representation can convey additional information associated with the model or data; the size of a node can be intuitively associated with the size of the data passed into that node. Highlighting and brushing are easily possible in this context, which facilitates interpretation in conjunction with the available data. Hierarchical views often allow for additional annotation and supplemental information, such as split quality. Complementary methods, such as censored or context-preserving local zoom, are available for large trees and data.
A less known group of tree model visualization techniques are those based on the recursive partitioning aspect. A direct view of the partition boundaries in the observation space can be obtained using sectioned scatterplots. The focus here lies on the cut points and their relative positions in the data space. Sectioned scatterplots are limited in terms of the number and types of covariates shown, but prove useful as a drill-down technique for local analysis of subgroups throughout the tree model.
Other methods based on recursive partitioning of the plot space are treemaps and spineplots of leaves. Both allow a concise view of all terminal nodes while retaining hints of the splitting sequence. In conjunction with highlighting and brushing, the main focus here is on the model behavior with respect to the data points. As such, the plots can be created for training and test data separately and then compared. Treemaps are more suitable for absolute comparisons and for large, complex trees, whereas spineplots of leaves can be used for relative comparison of groups within terminal nodes for up to moderately complex trees.
Tree models are potentially unstable, that is, small changes in the data can lead to entirely different trees. In order to analyze the stability of splits, it is possible to visualize the optimality criterion for candidate variables using mountain plots. Competing splits within a variable become clearly visible, and the comparison of mountain plots of multiple candidate variables allows a quick assessment of the magnitude and cause of potential instability.
The instability of a tree model can be used to obtain additional insight into the data and to improve prediction accuracy. Bootstrapping provides a useful method for the analysis of model variation by creating a whole set of tree models. Visualizing the use of covariates in splits as weighted barcharts, with an aggregate impurity criterion as the weight, allows a quick assessment of variable importance. Variable masking can be detected using weighted fluctuation diagrams of variables and trees; this view is also useful for finding groups of related tree models.
Sectioned scatterplots also allow the visualization of partition boundaries for multiple trees. The resulting plot can no longer be used for global drill-down due to the lack of shared subgroups, but it provides a way of analyzing the ‘fuzziness’ of a cut point in conjunction with the data.
Finally, trace plots allow us to visualize the split rules and hierarchical structure of arbitrarily many trees in a single view. They are based on a grid of variables and tree levels (nodes of the same depth), where each cell corresponds to a candidate split variable and thus to a potential tree node. Cells actually used are connected in the same way as in the hierarchical view, thus reflecting the full structure of the tree. Multiple trees can be superimposed on this grid, each leaving its own ‘trace’. The resulting plot shows frequently used paths, common subgroups and alternative splits.
All plots in this chapter have been produced using the R software for statistical computing and the interactive software KLIMT for the visualization and analysis of trees and forests. The visualization methods presented here are suitable both for the presentation of particular findings and for exploratory work. The individual techniques complement each other well by providing different viewpoints on the models and data, and can therefore be used successfully in an interactive framework. Trace plots, for example, provide a very useful overview which can be linked to individual hierarchical views. Subgroups defined by cells in the trace plot can be linked to data-based plots, and its edges to sectioned scatterplots.
The methods presented here were mostly illustrated on classification examples, but they can equally be used for regression trees and, for the most part, for survival trees as well. Nor are the methods limited to binary trees, even though those represent the most commonly used models. The variety of tree models and the further development of ensemble methods still leave room for enhancements and new plots. For exploratory work it is beneficial to have a big toolbox to choose from; for presentation graphics it is important to be able to display the ‘key point’ we want to convey.
