Classification With Decision Trees I
Instructor: Qiang Yang
INTRODUCTION
Example
[Figure: training data is fed to a classification algorithm to build a classifier; the classifier is then applied to testing data and to unseen data, e.g. the tuple (Jeff, Professor, 4), to answer "Tenured?"]
Training data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
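A minimal sketch of this train-then-classify workflow, assuming scikit-learn; the numeric encoding of RANK and the choice of DecisionTreeClassifier are illustrative assumptions, not part of the slides:

# Build a classifier from the toy tenure data, then apply it to an unseen case.
from sklearn.tree import DecisionTreeClassifier

# RANK encoded for illustration: Assistant Prof = 0, Associate Prof = 1, Professor = 2.
X_train = [[0, 2], [1, 7], [2, 5], [0, 7]]      # (RANK, YEARS)
y_train = ["no", "no", "yes", "yes"]            # TENURED

clf = DecisionTreeClassifier()                  # the "classification algorithm"
clf.fit(X_train, y_train)                       # training step builds the classifier

print(clf.predict([[2, 4]]))                    # unseen case (Jeff, Professor, 4): Tenured?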
Evaluation Criteria
Accuracy on test set: the rate of correct classification on the testing set. E.g., if 90 out of 100 testing cases are classified correctly, the accuracy is 90%.
Error rate on test set: the percentage of wrong predictions on the test set.
Confusion matrix: for binary class values "yes" and "no", a matrix showing the true positive, true negative, false positive and false negative rates:
                          Predicted class
                          Yes               No
  Actual class  Yes       true positive     false negative
                No        false positive    true negative
Speed and scalability: the time to build the classifier and to classify new cases, and the scalability with respect to the data size.
Robustness: handling noise and missing values
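A small sketch (not from the slides) of how these quantities could be computed; the actual and predicted label lists are made-up examples:

# Accuracy, error rate, and a binary confusion matrix from made-up labels.
actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]

correct  = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)                 # rate of correct classification
error    = 1 - accuracy                          # rate of wrong predictions

tp = sum(a == "yes" and p == "yes" for a, p in zip(actual, predicted))
fn = sum(a == "yes" and p == "no"  for a, p in zip(actual, predicted))
fp = sum(a == "no"  and p == "yes" for a, p in zip(actual, predicted))
tn = sum(a == "no"  and p == "no"  for a, p in zip(actual, predicted))

print(f"accuracy={accuracy:.2f}  error={error:.2f}")
print(f"TP={tp}  FN={fn}  FP={fp}  TN={tn}")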
Evaluation Techniques
— Hold aside one group for testing and use the rest to build the model
— Repeat, holding out a different group as the test set in each iteration (as in cross-validation); a sketch of this loop follows below
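A minimal sketch of the hold-one-group-out loop; the toy data and the trivial majority-class "model" are illustrative assumptions, not from the slides:

# Split the data into k groups; in each iteration, one group is the test set.
data = [("sunny", "N"), ("rain", "P"), ("overcast", "P"), ("sunny", "N"),
        ("rain", "P"), ("overcast", "P"), ("sunny", "P"), ("rain", "N")]
k = 4
folds = [data[i::k] for i in range(k)]

accuracies = []
for i in range(k):
    test  = folds[i]
    train = [row for j, fold in enumerate(folds) if j != i for row in fold]
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)        # "build model" on the rest
    correct = sum(label == majority for _, label in test)
    accuracies.append(correct / len(test))

print("per-fold accuracy:", accuracies)
print("mean accuracy:", sum(accuracies) / k)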
Continuous Classes
DECISION TREE [Quinlan93]
Training Set
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
Example
[Decision tree: the root tests Outlook. Outlook = sunny → test Humidity (high → N, normal → P); Outlook = overcast → P; Outlook = rain → test Windy (true → N, false → P).]
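The same tree written out as a small function, a sketch that makes the branch structure explicit (attribute values are assumed to be the lowercase strings used in the training set):

def classify(outlook, humidity, windy):
    # Root of the tree: test Outlook.
    if outlook == "sunny":
        return "N" if humidity == "high" else "P"    # sunny branch is decided by Humidity
    if outlook == "overcast":
        return "P"                                   # overcast is always P
    return "N" if windy else "P"                     # rain branch is decided by Windy

print(classify("sunny", "high", False))   # N
print(classify("rain", "normal", True))   # N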
Building Decision Tree [Q93]
Choosing the Splitting Attribute
Which attribute to select?
A criterion for attribute selection
Computing information
Example: attribute “Outlook”
“Outlook” = “sunny”:
  info([2,3]) = entropy(2/5, 3/5) = -(2/5) log(2/5) - (3/5) log(3/5) = 0.971 bits
gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits

Information gain for the attributes of the weather data:
  gain("Outlook")     = 0.247 bits
  gain("Temperature") = 0.029 bits
  gain("Humidity")    = 0.152 bits
  gain("Windy")       = 0.048 bits
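A sketch (not part of the slides) that recomputes these numbers directly from the training set above:

import math

# Weather training set: (Outlook, Class).
data = [("sunny","N"),("sunny","N"),("overcast","P"),("rain","P"),("rain","P"),
        ("rain","N"),("overcast","P"),("sunny","N"),("sunny","P"),("rain","P"),
        ("sunny","P"),("overcast","P"),("overcast","P"),("rain","N")]

def entropy(labels):
    # info([...]) = -sum over classes of p_j * log2(p_j)
    n = len(labels)
    return -sum((labels.count(c)/n) * math.log2(labels.count(c)/n) for c in set(labels))

all_labels = [c for _, c in data]
values = set(v for v, _ in data)
# Weighted entropy of the subsets produced by splitting on Outlook.
split_info = sum(len([c for v, c in data if v == val]) / len(data)
                 * entropy([c for v, c in data if v == val]) for val in values)

print(round(entropy([c for v, c in data if v == "sunny"]), 3))   # info([2,3]) = 0.971
print(round(entropy(all_labels) - split_info, 3))                # gain("Outlook") = 0.247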
Continuing to split
The final decision tree
Highly-branching attributes
The gain ratio
Gain ratio: a modification of the information gain that reduces its bias toward highly-branching attributes.
Gain ratio takes the number and size of branches into account when choosing an attribute:
it corrects the information gain by taking the intrinsic information of a split into account (also called the split ratio).
Intrinsic information: the entropy of the distribution of instances into branches
(i.e. how much information we need to tell which branch an instance belongs to).
Gain Ratio
GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)
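A sketch of this computation for the "Outlook" attribute of the weather data (an illustration, not the slides' own code):

import math

# Weather data: (Outlook, Class).
data = [("sunny","N"),("sunny","N"),("overcast","P"),("rain","P"),("rain","P"),
        ("rain","N"),("overcast","P"),("sunny","N"),("sunny","P"),("rain","P"),
        ("sunny","P"),("overcast","P"),("overcast","P"),("rain","N")]

def entropy(fractions):
    return -sum(p * math.log2(p) for p in fractions if p > 0)

labels = [c for _, c in data]
n = len(data)
subsets = {val: [c for v, c in data if v == val] for val in set(v for v, _ in data)}

class_entropy = entropy([labels.count(c)/n for c in set(labels)])
gain = class_entropy - sum(len(s)/n * entropy([s.count(c)/len(s) for c in set(s)])
                           for s in subsets.values())
intrinsic = entropy([len(s)/n for s in subsets.values()])   # entropy of the split sizes

print(round(gain / intrinsic, 3))   # GainRatio for "Outlook", about 0.156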
Computing the gain ratio
[Worked gain-ratio computations for the "Humidity" and "Windy" attributes.]
More on the gain ratio
Gini Index
If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 - Σ_{j=1..n} p_j²
where p_j is the relative frequency of class j in T. gini(T) is minimized if the classes in T are skewed.
After splitting T into two subsets T1 and T2 with sizes N1 and N2, the gini index of the split data is defined as
  gini_split(T) = (N1/N) · gini(T1) + (N2/N) · gini(T2), where N = N1 + N2.
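A short sketch (an illustration, not from the slides) of both formulas on a made-up binary split:

# Gini index of a set of class labels, and the gini index of a two-way split.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c)/n) ** 2 for c in set(labels))

def gini_split(t1, t2):
    n = len(t1) + len(t2)
    return len(t1)/n * gini(t1) + len(t2)/n * gini(t2)

# Made-up example: 14 labels (9 "P", 5 "N") split into two subsets.
t  = ["P"]*9 + ["N"]*5
t1 = ["P"]*6 + ["N"]*1
t2 = ["P"]*3 + ["N"]*4
print(round(gini(t), 3))              # gini of the whole set: 0.459
print(round(gini_split(t1, t2), 3))   # gini of the split data: 0.367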