Classification: Decision Tree, Hunt's Algorithm, ID3, Rule-Based Classifier, C4.5

The document discusses classification techniques in data mining. It begins by defining classification and describing the general classification process of building a model from a training set and testing it on a separate test set. It then covers decision tree based classification, including Hunt's algorithm, ID3, and C4.5, as well as rule-based classifiers. It describes how decision trees work, including their structure and how they are built recursively by splitting records at internal nodes based on attribute tests. Methods for splitting, determining the best split, and impurity measures such as entropy are also summarized.


Classification

■ Definition
■ Decision Tree
■ Hunt's Algorithm
■ ID3
■ Rule Based Classifier
■ C4.5

Classification: Definition
■ Given a set of records (the training set)
  ■ Each record contains a set of attributes
  ■ One of the attributes is the class
■ Find a model for the class attribute as a function of the values of the other attributes
■ Goal: previously unseen records should be assigned to a class as accurately as possible
■ Usually, the given data set is divided into a training set and a test set:
  ■ The training set is used to build the model
  ■ The test set is used to validate it
  ■ The accuracy of the model is determined on the test set
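As a concrete illustration of this process, here is a minimal sketch using scikit-learn; the file name records.csv, the column name "class", and the 70/30 split are assumptions made only for the example.

# Sketch of the general classification process: build a model on a training set,
# then measure its accuracy on a held-out test set (file and column names are assumed).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("records.csv")                    # each record: attribute values plus a "class" column
X = pd.get_dummies(data.drop(columns=["class"]))     # one-hot encode nominal attributes
y = data["class"]                                    # the class attribute

# Divide the given data set into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier()                     # model for the class as a function of the other attributes
model.fit(X_train, y_train)                          # the training set is used to build the model

# Accuracy is determined on the test set (previously unseen records)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))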

General Approach

Classification: Example

Name            Give Birth   Lay Eggs   Can Fly   Live in Water   Have Legs
Chinese Dragon  No           No         Yes       Yes             No

Examples of Classification Task
■ Banking: determining whether a mortgage application is a good or bad credit risk, or whether a particular credit card transaction is fraudulent
■ Education: placing a new student into a particular track with regard to special needs
■ Medicine: diagnosing whether a particular disease is present
■ Law: determining whether a will was actually written by the deceased person or fraudulently by someone else
■ Homeland Security: identifying whether or not certain financial or personal behavior indicates a possible terrorist threat

Classification Model
■ In general, a classification model can be used for the following purposes:
  ■ It can serve as an explanatory tool for distinguishing objects of different classes (descriptive)
  ■ It can be used to predict the class labels of new records (predictive)

Classification Techniques
■ Decision Tree based Methods
■ Rule-based Methods
■ Instance-based Classifiers
■ Memory-based Reasoning
■ Neural Networks
■ Naïve Bayes and Bayesian Belief Networks
■ Etc.

Decision Tree Structure
■ A decision tree is a hierarchical structure of nodes and directed edges.
■ There are three types of nodes in a decision tree (see the sketch below):
  ■ A root node, which has no incoming edges and zero or more outgoing edges
  ■ Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges
  ■ Leaf nodes, each of which has exactly one incoming edge and no outgoing edges; each leaf node also has a class label attached to it
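To make the structure concrete, the following minimal sketch represents the three node types in Python; the class and the example attributes (Refund, Marital Status) are illustrative assumptions, not taken from the slides.

# Illustrative representation of decision tree nodes: a node with children is a root or
# internal node carrying an attribute test; a node without children is a leaf with a class label.
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class TreeNode:
    test_attribute: Optional[str] = None                            # attribute tested at root/internal nodes
    children: Dict[str, "TreeNode"] = field(default_factory=dict)   # outgoing edges, keyed by test outcome
    class_label: Optional[str] = None                               # set only on leaf nodes

    def is_leaf(self) -> bool:
        return not self.children                                    # leaves have no outgoing edges

# Root node tests "Refund"; one child is an internal node, the others are leaves
tree = TreeNode(test_attribute="Refund", children={
    "Yes": TreeNode(class_label="No"),
    "No": TreeNode(test_attribute="Marital Status", children={
        "Married": TreeNode(class_label="No"),
        "Single": TreeNode(class_label="Yes"),
    }),
})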

Decision Tree Based Classification
■ One of the most widely used classification techniques
■ Highly expressive in terms of capturing relationships among discrete variables
■ Relatively inexpensive to construct and extremely fast at classifying new records
■ Easy to interpret
■ Can effectively handle both missing values and noisy data
■ Comparable or better accuracy than other techniques in many applications

Example Decision Tree

Another Example Decision Tree

Decision Tree Classification Task

Hunt’s Algorithm
■ Most decision tree induction algorithms are based on the original ideas proposed in Hunt's Algorithm
■ Let Dt be the set of training records associated with a node and y = {y1, y2, …, yc} be the set of class labels (a code sketch follows below)
  ■ If Dt contains records that all belong to the same class yk, then its decision tree is a leaf node labeled yk
  ■ If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset
  ■ If Dt is an empty set, then its decision tree is a leaf node whose class label is determined from other information, such as the majority class of the records at its parent node
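The three cases above translate directly into a recursive procedure. Below is a minimal, illustrative sketch; the record format (dicts with a "class" key) is assumed, and best_split stands for a helper, not defined here, that picks the attribute test to use.

# Illustrative sketch of Hunt's algorithm (not a full implementation).
from collections import Counter

def majority_class(records):
    return Counter(r["class"] for r in records).most_common(1)[0][0]

def hunt(records, attributes, parent_records=None):
    # Case 3: Dt is empty -> leaf labeled with the majority class of the parent's records
    if not records:
        return {"label": majority_class(parent_records)}
    classes = {r["class"] for r in records}
    # Case 1: all records belong to the same class yk -> leaf labeled yk
    # (also stop when no attributes are left to split on)
    if len(classes) == 1 or not attributes:
        return {"label": majority_class(records)}
    # Case 2: more than one class -> split on an attribute test and recurse on each subset
    attr = best_split(records, attributes)            # assumed helper: chooses the attribute to test
    node = {"test": attr, "children": {}}
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        remaining = [a for a in attributes if a != attr]
        node["children"][value] = hunt(subset, remaining, parent_records=records)
    return node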

Example of Hunt’s Algorithm

Decision Tree Classification Task

Apply Model to Test Data

Tree Induction
■ Determine how to split the records
  ■ Use greedy heuristics to make a series of locally optimal decisions about which attribute to use for partitioning the data
  ■ At each step of the greedy algorithm, a test condition is applied to split the data into subsets with a more homogeneous class distribution
  ■ How to specify the test condition for each attribute?
  ■ How to determine the best split?
■ Determine when to stop splitting
  ■ A stopping condition is needed to terminate the tree-growing process. Stop expanding a node:
    ■ if all the instances belong to the same class
    ■ if all the instances have similar attribute values

Methods for splitting the records
■ Depends on attribute type
  ■ Binary: true/false, yes/no, +/-, etc.
  ■ Nominal: ID number, eye color, zip codes, etc.
  ■ Ordinal: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}, etc.
  ■ Continuous/Ratio: calendar dates, temperatures in Celsius or Fahrenheit, age, etc.
■ Depends on number of ways to split
  ■ 2-way split (binary split)
  ■ Multi-way split

Splitting based on Nominal attributes

■ Multi-way split: use as many partitions as there are distinct values
■ Binary split: divides the values into two subsets; need to find the optimal partitioning
■ Each partition has a subset of values signifying it (see the sketch below)
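As a small illustration (the attribute and its values are assumptions), the sketch below enumerates the candidate partitions for a nominal attribute with three distinct values: the multi-way split uses one partition per value, while a binary split must choose one of the 2^(k-1) - 1 ways of grouping the values into two subsets.

# Candidate splits for a nominal attribute (illustrative values).
from itertools import combinations

values = ["Single", "Married", "Divorced"]

# Multi-way split: one partition per distinct value
multi_way = [{v} for v in values]

# Binary split: fix the first value on the left side so each two-subset
# partitioning is generated exactly once (mirror images are not repeated)
first, rest = values[0], values[1:]
binary = []
for i in range(len(rest)):
    for combo in combinations(rest, i):
        left = {first, *combo}
        binary.append((left, set(values) - left))

print(multi_way)   # [{'Single'}, {'Married'}, {'Divorced'}]
print(binary)      # 3 candidate two-subset partitionings to evaluate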

Splitting based on Ordinal attributes
■ Multi-way split: use as many partitions as there are distinct values
■ Binary split:
  ■ Divides the values into two subsets
  ■ Need to find the optimal partitioning
  ■ Must preserve the order property among attribute values (e.g., {short, medium} vs {tall} preserves order, while {short, tall} vs {medium} does not)

Splitting based on Continuous attributes
■ Different ways of handling (see the sketch below)
  ■ Discretization to form an ordinal categorical attribute
    ■ Static – discretize once at the beginning
    ■ Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), clustering, etc.
  ■ Binary decision: (A < v) or (A ≥ v)
    ■ Consider all possible splits and find the best cut
    ■ Can be more compute intensive
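A minimal sketch of the binary-decision approach (all data values are made up): candidate cut points v are the midpoints between consecutive distinct attribute values, and the cut with the lowest weighted entropy of the two resulting subsets is kept.

# Exhaustive search for the best binary cut (A < v) vs (A >= v) on a continuous attribute.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    best_v, best_impurity = None, float("inf")
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue
        v = (a + b) / 2                                   # candidate cut point between two distinct values
        left = [l for x, l in pairs if x < v]
        right = [l for x, l in pairs if x >= v]
        m = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if m < best_impurity:
            best_v, best_impurity = v, m
    return best_v, best_impurity

# Hypothetical ages with class labels
print(best_cut([23, 25, 31, 42, 51], ["No", "No", "Yes", "Yes", "Yes"]))   # (28.0, 0.0)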

How to Determine The Best Split?

How to Determine The Best Split?

■ Greedy approach:
  ■ Nodes with a purer class distribution are preferred
■ Need a measure of node impurity

Finding The Best Split
1. Compute the impurity measure (P) before splitting
2. Compute the impurity measure (M) after splitting
   ■ Compute the impurity measure of each child node
   ■ Compute the weighted average impurity of the children (M)
3. Choose the attribute test condition that produces the highest gain (a numeric illustration follows below)

   Gain = P – M

   or, equivalently, the lowest impurity measure after splitting (M)
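A quick numeric illustration with made-up values: if the impurity before splitting is P = 0.99 and a candidate split produces children whose weighted average impurity is M = 0.62, its gain is P – M = 0.37; a competing split with M = 0.80 gives a gain of only 0.19, so the first split would be preferred.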

Measure of Impurity: Entropy
■ Entropy at a given node t:

  Entropy(t) = - Σj p(j | t) log2 p(j | t)

  (NOTE: p(j | t) is the relative frequency of class j at node t)

■ Information Gain:

  GAINsplit = Entropy(p) - Σi=1..k (ni / n) Entropy(i)

  where the parent node p is split into k partitions, ni is the number of records in partition i, and n is the total number of records at the parent

■ Choose the split that achieves the greatest reduction in entropy (maximizes GAIN); a code sketch follows below
■ Used in the ID3 and C4.5 decision tree algorithms
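A minimal sketch of these two formulas in code (class labels are assumed to be given as plain Python lists; names are illustrative):

# Entropy and information gain, following the formulas above.
from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(t) = - sum_j p(j|t) * log2 p(j|t)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def information_gain(parent_labels, partitions):
    # GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(partition i)
    n = len(parent_labels)
    weighted = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent_labels) - weighted

# Hypothetical split of 10 records into two partitions
parent = ["Yes"] * 5 + ["No"] * 5
children = [["Yes"] * 4 + ["No"], ["Yes"] + ["No"] * 4]
print(information_gain(parent, children))   # about 0.278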

Entropy Example

■ A worked example compares candidate splits on the "Wind", "Sky", and "Barometer" attributes: for each attribute, the entropy of the resulting partitions is computed and subtracted from the parent entropy to obtain its information gain
■ The gains computed in the example are 0.549, 0.156, and 0.049; the attribute with the highest information gain is chosen for the split


Measures of Node Impurity

■ Entropy: Entropy(t) = - Σj p(j | t) log2 p(j | t)
■ Gini Index: GINI(t) = 1 - Σj [p(j | t)]^2
■ Misclassification error: Error(t) = 1 - maxj p(j | t)
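Alongside the entropy function sketched earlier, the other two measures can be computed the same way (a minimal illustration; function names are assumptions):

# Gini index and misclassification error for a node's class labels.
from collections import Counter

def class_probs(labels):
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def gini(labels):
    # GINI(t) = 1 - sum_j p(j|t)^2
    return 1.0 - sum(p * p for p in class_probs(labels))

def classification_error(labels):
    # Error(t) = 1 - max_j p(j|t)
    return 1.0 - max(class_probs(labels))

print(gini(["Yes", "Yes", "No", "No"]))                    # 0.5, the maximum for two classes
print(classification_error(["Yes", "Yes", "Yes", "No"]))   # 0.25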

Practical Challenges in Classification

■ Over-fitting
  ■ The model performs well on the training set but poorly on the test set
  ■ If the model is too simple, it may not fit the training and test sets well; if the model is too complex, over-fitting may occur and reduce its ability to generalize beyond the training instances
■ Missing Values
■ Data Heterogeneity

Example of over-fitting

How to Address Over-fitting
■ Pre-Pruning (Early Stopping Rule)
  ■ Stop the algorithm before it grows a fully-grown tree
  ■ Typical stopping conditions for a node:
    ■ Stop if all instances belong to the same class
    ■ Stop if all the attribute values are the same
  ■ More restrictive conditions (a library-based sketch follows below):
    ■ Stop if the number of instances is less than some user-specified threshold
    ■ Stop if the class distribution of the instances is independent of the available features
    ■ Stop if expanding the current node does not improve the impurity measure (e.g., Gini or information gain)
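In practice, several of these pre-pruning conditions appear as hyperparameters of library implementations. A hedged sketch assuming scikit-learn (the parameter values are arbitrary):

# Pre-pruning expressed as scikit-learn hyperparameters (values are illustrative).
from sklearn.tree import DecisionTreeClassifier

pre_pruned = DecisionTreeClassifier(
    max_depth=5,                 # never grow below a fixed depth
    min_samples_split=20,        # stop if the number of instances at a node is below a threshold
    min_impurity_decrease=0.01,  # stop if expanding the node does not improve impurity enough
)
# pre_pruned.fit(X_train, y_train)   # X_train, y_train as in the earlier train/test sketch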

How to Address Over-fitting
■ Post-pruning
  ■ Grow the decision tree to its entirety
  ■ Trim the nodes of the decision tree in a bottom-up fashion; if the generalization error improves after trimming, replace the sub-tree with a leaf node (a code sketch follows below)
  ■ The class label of the leaf node is determined from the majority class of instances in the sub-tree
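Cost-complexity pruning is one common way to post-prune in practice. A hedged sketch assuming scikit-learn: the tree is grown to its entirety, candidate pruning levels are enumerated, and the pruned tree that generalizes best on a validation set is kept.

# Post-pruning via cost-complexity pruning (illustrative; a separate validation set is assumed).
from sklearn.tree import DecisionTreeClassifier

def post_prune(X_train, y_train, X_valid, y_valid):
    full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)   # grow to its entirety
    path = full_tree.cost_complexity_pruning_path(X_train, y_train)            # candidate pruning levels
    best_tree, best_score = full_tree, full_tree.score(X_valid, y_valid)
    for alpha in path.ccp_alphas:
        pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
        score = pruned.score(X_valid, y_valid)
        if score >= best_score:                      # keep the trimmed tree if generalization improves
            best_tree, best_score = pruned, score
    return best_tree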

Other Issues
■ Missing values affect decision tree construction in three different ways:
  ■ How impurity measures are computed
  ■ How records with a missing value are distributed to child nodes
  ■ How a test record with a missing value is classified
■ Data Fragmentation
  ■ The number of records gets smaller as you traverse down the tree
  ■ The number of records at the leaf nodes could be too small to make any statistically significant decision
■ Difficult to interpret large-sized trees
  ■ The tree could be large because only a single attribute is used in each test condition
  ■ Oblique decision trees (test conditions that combine several attributes) address this
■ Tree Replication
  ■ The same sub-tree may appear in different parts of a decision tree
  ■ Constructive induction: create new attributes by combining existing attributes

Other Issues

Rule Based Classifier
■ Classify records by using a collection of "if…then…" rules
■ Rule: (Condition) → y
  ■ where
    ■ Condition is a conjunction of attribute tests
    ■ y is the class label
  ■ LHS: rule antecedent or condition
  ■ RHS: rule consequent
■ Examples of classification rules (a code sketch follows below):
  ■ (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
  ■ (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
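A minimal sketch of such a classifier in code, using the two example rules above plus a default class (the record format and the default label are assumptions):

# Illustrative rule-based classifier: each rule is (condition, class label); records are dicts.
rules = [
    (lambda r: r["Blood Type"] == "Warm" and r["Lay Eggs"] == "Yes", "Birds"),
    (lambda r: r["Taxable Income"] < 50_000 and r["Refund"] == "Yes", "No"),   # Evade = No
]

def classify(record, default="Unknown"):
    # Fire the first rule whose antecedent (condition) covers the record
    for condition, label in rules:
        if condition(record):
            return label
    return default                      # no rule covers the record

print(classify({"Blood Type": "Warm", "Lay Eggs": "Yes"}))   # Birds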

Application of Rule Based Classifier

Rule Coverage and Accuracy

Building Classification Rules
■ Direct Method:
  ■ Extract rules directly from data
  ■ e.g., RIPPER, CN2, Holte's 1R
■ Indirect Method:
  ■ Extract rules from other classification models (e.g., decision trees, neural networks, etc.)
  ■ e.g., C4.5 rules (extract rules from a decision tree; a sketch follows below)
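One way to see the indirect method in practice: scikit-learn can print a trained decision tree in if/then form, where every root-to-leaf path corresponds to one rule. This is only a hedged illustration of the idea, not the C4.5rules algorithm itself; model and feature_names are assumed to come from the earlier training sketch.

# Dump a fitted decision tree as if/then text; each path from root to leaf is one rule.
from sklearn.tree import export_text

def tree_to_rules(model, feature_names):
    return export_text(model, feature_names=feature_names)

# print(tree_to_rules(model, feature_names=list(X_train.columns)))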


From Decision Trees to Rules

Rules can be Simplified

Scalable Decision Tree Induction Methods in Data Mining Studies
■ SLIQ (EDBT'96, Mehta et al.)
  ■ builds an index for each attribute; only the class list and the current attribute list reside in memory
■ SPRINT (VLDB'96, J. Shafer et al.)
  ■ constructs an attribute list data structure
■ PUBLIC (VLDB'98, Rastogi & Shim)
  ■ integrates tree splitting and tree pruning: stop growing the tree earlier
■ RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
  ■ separates the scalability aspects from the criteria that determine the quality of the tree
  ■ builds an AVC-list (attribute, value, class label)

Exercise

