DECISION TREES
Lior Rokach
Department of Industrial Engineering
Tel-Aviv University
Oded Maimon
Department of Industrial Engineering
Tel-Aviv University
maimon@eng.tau.ac.il
Abstract Decision trees are considered to be one of the most popular approaches for representing classifiers. Researchers from various disciplines such as statistics, machine learning, pattern recognition, and data mining have dealt with the issue of growing a decision tree from available data. This chapter presents an updated survey of current methods for constructing decision tree classifiers in a top-down manner. It suggests a unified algorithmic framework for these algorithms and describes various splitting criteria and pruning methodologies.
Keywords: Decision tree, Information Gain, Gini Index, Gain Ratio, Pruning, Minimum
Description Length, C4.5, CART, Oblivious Decision Trees
1. Decision Trees
A decision tree is a classifier expressed as a recursive partition of the instance space. The decision tree consists of nodes that form a rooted tree, meaning it is a directed tree with a node called the "root" that has no incoming edges. All other nodes have exactly one incoming edge. A node with outgoing edges is called an internal or test node. All other nodes are called leaves (also known as terminal or decision nodes). In a decision tree, each internal node splits the instance space into two or more sub-spaces according to a certain discrete function of the input attribute values. In the simplest and most frequent case, each test considers a single attribute, such that the instance space is partitioned according to the attribute's value. In the case of numeric attributes, the condition refers to a range.
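The two node kinds described above can be sketched as a small data structure. The following is a minimal illustration, not an implementation from the chapter: the class names, the attribute "income", and the threshold are invented, and the numeric range condition is simplified to a binary `value <= threshold` test.

```python
class Leaf:
    """Terminal (decision) node: stores the predicted class label."""
    def __init__(self, label):
        self.label = label

class TestNode:
    """Internal (test) node: tests a single attribute.  A nominal attribute
    would yield one outgoing edge per value; a numeric attribute is split on
    a range, here reduced to a binary threshold test."""
    def __init__(self, attribute, threshold, low_child, high_child):
        self.attribute = attribute
        self.threshold = threshold
        self.low_child = low_child    # subtree for value <= threshold
        self.high_child = high_child  # subtree for value >  threshold

    def route(self, instance):
        """Return the sub-space (subtree) this instance falls into."""
        value = instance[self.attribute]
        return self.low_child if value <= self.threshold else self.high_child

# Example: a root splitting the instance space on a hypothetical
# numeric attribute "income".
root = TestNode("income", 50000, Leaf("no_response"), Leaf("response"))
print(root.route({"income": 72000}).label)   # -> response
```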
Each leaf is assigned to one class representing the most appropriate target value. Alternatively, the leaf may hold a probability vector indicating the probability of the target attribute having a certain value. Instances are classified by navigating them from the root of the tree down to a leaf, according to the outcome of the tests along the path. Figure 9.1 describes a decision tree that reasons whether or not a potential customer will respond to a direct mailing. Internal nodes are represented as circles, whereas leaves are denoted as triangles. Note that this decision tree incorporates both nominal and numeric attributes. Given this classifier, the analyst can predict the response of a potential customer (by sorting it down the tree) and understand the behavioral characteristics of the entire population of potential customers regarding direct mailing. Each node is labeled with the attribute it tests, and its branches are labeled with the attribute's corresponding values.
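The classification procedure — following test outcomes from the root down to a leaf — can be sketched as follows. This is an illustrative example only: the tree is represented as nested dictionaries, and the attributes "age" and "gender", their values, and the threshold are invented, not taken from Figure 9.1.

```python
def classify(node, instance):
    """Navigate an instance from the root to a leaf, following the
    outcome of the attribute test at each internal node."""
    while "leaf" not in node:
        value = instance[node["attribute"]]
        if "threshold" in node:                  # numeric test: range condition
            branch = "le" if value <= node["threshold"] else "gt"
        else:                                    # nominal test: one branch per value
            branch = value
        node = node["branches"][branch]
    return node["leaf"]

# A tiny hypothetical tree mixing a numeric and a nominal attribute.
tree = {
    "attribute": "age", "threshold": 30,
    "branches": {
        "le": {"leaf": "no_response"},
        "gt": {
            "attribute": "gender",
            "branches": {"male":   {"leaf": "response"},
                         "female": {"leaf": "no_response"}},
        },
    },
}

print(classify(tree, {"age": 45, "gender": "male"}))   # -> response
```

Note that only the attributes tested along the chosen path are ever inspected, which is why a decision tree can classify an instance without examining all of its attributes.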