Classification: Decision Tree, Hunt's Algorithm, ID3, Rule-Based Classifier, C4.5
■ Definition
■ Decision Tree
■ Hunt's Algorithm
■ ID3
■ Rule-Based Classifier
■ C4.5
Name             Give Birth   Lay Eggs   Can Fly   Live in Water   Have Legs
Chinese Dragon   No           No         Yes       Yes             No
Examples of Classification Task
■ Banking: determining whether a mortgage application is
a good or bad credit risk, or whether a particular credit
card transaction is fraudulent
■ Education: placing a new student into a particular track
with regard to special needs
■ Medicine: diagnosing whether a particular disease is
present
■ Law: determining whether a will was written by the actual
deceased person or fraudulently by someone else
■ Homeland Security: identifying whether or not certain
financial or personal behavior indicates a possible terrorist
threat
Classification Model
■ In general, a classification model can be used for the following purposes:
■ It can serve as an explanatory tool for distinguishing objects of different classes (descriptive).
■ It can be used to predict the class labels of new records (predictive).
Splitting based on Nominal attributes
■ Multi-way split: Use as many partitions as there are distinct values.
■ Binary split: Divides values into two subsets. Need to find the optimal partitioning (a sketch of enumerating the candidate partitions follows this list).
Splitting based on Ordinal attributes
■ Multi-way split: Use as many partitions as there are distinct values.
■ Binary split: Divides values into two subsets. Need to find the optimal partitioning, and the split must preserve the order property among attribute values.
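A nominal attribute with k distinct values admits 2^(k-1) - 1 ways to group those values into two non-empty subsets, so finding the optimal binary split means scoring every candidate grouping. Below is a minimal Python sketch that enumerates the candidates; the attribute values (car types) are purely illustrative:

    from itertools import combinations

    def binary_partitions(values):
        # Pin the first value to the left subset so that each
        # unordered pair of subsets is generated exactly once.
        values = list(values)
        first, rest = values[0], values[1:]
        for r in range(len(rest)):  # left side never absorbs every value
            for combo in combinations(rest, r):
                yield {first, *combo}, set(rest) - set(combo)

    # k = 3 distinct values -> 2**(3-1) - 1 = 3 candidate binary splits:
    for left, right in binary_partitions(["Family", "Sports", "Luxury"]):
        print(sorted(left), "vs", sorted(right))

In a tree-growing loop, each (left, right) pair would be scored with an impurity measure (see below) and the best-scoring partition kept. For ordinal attributes, only the splits that keep the value order contiguous are considered.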
Splitting based on Continuous attributes
■ Different ways of handling:
■ Discretization to form an ordinal categorical attribute
■ Static – discretize once at the beginning
■ Dynamic – ranges can be found by equal-interval bucketing,
equal-frequency bucketing (percentiles), clustering, etc.
■ Binary decision: (A < v) or (A ≥ v)
■ Consider all possible splits and find the best cut (a sketch follows below)
■ Can be more computationally intensive
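One concrete way to implement "consider all possible splits" is to sort the records by the attribute, take the midpoint between each pair of consecutive distinct values as a candidate cut v, and keep the cut with the lowest weighted impurity of the (A < v) and (A ≥ v) children. A minimal sketch using the Gini index; the income values and labels are illustrative:

    from collections import Counter

    def gini(labels):
        # Gini index of a list of class labels: 1 - sum of p_c^2.
        n = len(labels)
        if n == 0:
            return 0.0
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def best_cut(values, labels):
        # Evaluate the midpoint between consecutive distinct values
        # and return the cut minimizing the weighted child impurity.
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best_v, best_score = None, float("inf")
        xs = sorted(set(values))
        for lo, hi in zip(xs, xs[1:]):
            v = (lo + hi) / 2
            left = [y for x, y in pairs if x < v]
            right = [y for x, y in pairs if x >= v]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best_score:
                best_v, best_score = v, score
        return best_v, best_score

    income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
    labels = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
    print(best_cut(income, labels))  # (97.5, 0.3)

This naive version re-partitions the data for every candidate cut, which is where the extra compute cost comes from; more careful implementations make a single sorted sweep while updating running class counts.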
■ Greedy approach:
■ Nodes with a purer class distribution are preferred
■ Need a measure of node impurity:
■ Information Gain:
Gain_split = Entropy(parent) − Σ_{i=1..k} (n_i / n) × Entropy(child_i),
where child i receives n_i of the parent's n records; the split with the largest gain is chosen.
■ Example: two candidate splits with gains of 0.156 and 0.049; the split with gain 0.156 is preferred.
■ Entropy: −Σ_c p_c log2(p_c)
■ Gini Index: 1 − Σ_c p_c²
■ Misclassification error: 1 − max_c p_c
(where p_c is the fraction of records at the node belonging to class c)
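Below is a minimal sketch of the three impurity measures and of the gain computation defined above, taking the class counts at a node as a plain Python list (the example counts are illustrative):

    from math import log2

    def entropy(counts):
        # Entropy = -sum over classes of p_c * log2(p_c), with 0 log 0 = 0.
        n = sum(counts)
        return -sum((c / n) * log2(c / n) for c in counts if c > 0)

    def gini_index(counts):
        # Gini = 1 - sum over classes of p_c^2.
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def misclassification_error(counts):
        # Error = 1 - max over classes of p_c.
        return 1.0 - max(counts) / sum(counts)

    def information_gain(parent, children):
        # Gain_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i).
        n = sum(parent)
        return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

    print(entropy([5, 5]))                             # 1.0 (maximally impure)
    print(gini_index([5, 5]))                          # 0.5
    print(misclassification_error([5, 5]))             # 0.5
    print(information_gain([5, 5], [[4, 1], [1, 4]]))  # ~0.278

All three measures are 0 for a pure node and peak when the classes are evenly mixed, which is why purer children score better under the greedy approach.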
■ Over-fitting
■ The model performs well on the training set but generalizes poorly to unseen test records; a fully-grown tree is especially prone to over-fitting.
■ Typical stopping conditions for a node:
■ Stop if all instances belong to the same class
■ Stop if all the attribute values are the same
■ More restrictive conditions:
■ Stop if the number of instances is less than some
user-specified threshold
■ Stop if the class distribution of instances is independent
of the available features
■ Stop if expanding the current node does not improve
impurity measures (e.g., Gini or information gain)
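A sketch of how these stopping conditions might be checked inside a recursive tree-growing routine; the function names, record encoding, and the default threshold of 5 are illustrative choices, not taken from a particular implementation:

    def all_rows_identical(records):
        # True when every record carries exactly the same attribute values.
        return all(r == records[0] for r in records)

    def should_stop(records, labels, best_gain, min_instances=5):
        # Pre-pruning: return True to turn the current node into a leaf.
        if len(set(labels)) <= 1:         # all instances in the same class
            return True
        if all_rows_identical(records):   # all attribute values are the same
            return True
        if len(records) < min_instances:  # below the user-specified threshold
            return True
        if best_gain <= 0:                # best split does not improve the
            return True                   # impurity measure (Gini / info gain)
        return False

    # A pure node stops immediately, regardless of the other checks:
    print(should_stop([("Yes", 90), ("No", 75)], ["No", "No"], best_gain=0.1))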
■ Indirect Method:
■ Extract rules from other classification models (e.g., decision trees, neural networks, etc.)
■ Example: C4.5rules
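As a minimal sketch of the indirect approach for decision trees, each root-to-leaf path can be turned into one IF-THEN rule. The nested-dict tree encoding and the Refund / Marital Status attributes are illustrative, and this omits the rule simplification and ranking that C4.5rules additionally performs:

    def extract_rules(node, conditions=()):
        # Internal nodes are dicts {"attr": ..., "children": {...}};
        # leaves are class-label strings. Emit one rule per path.
        if isinstance(node, str):
            body = " AND ".join(conditions) if conditions else "TRUE"
            yield f"IF {body} THEN class = {node}"
            return
        for test, child in node["children"].items():
            yield from extract_rules(child, conditions + (f"{node['attr']} {test}",))

    tree = {"attr": "Refund", "children": {
        "= Yes": "No",
        "= No": {"attr": "Marital Status", "children": {
            "= Married": "No",
            "= Single/Divorced": "Yes"}}}}

    for rule in extract_rules(tree):
        print(rule)
    # IF Refund = Yes THEN class = No
    # IF Refund = No AND Marital Status = Married THEN class = No
    # IF Refund = No AND Marital Status = Single/Divorced THEN class = Yes

Because the leaves of a decision tree are mutually exclusive and exhaustive, the extracted rule set classifies every record exactly once.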