1.4.1 Overview
The machine-learning community commonly refers to prediction methods as supervised learning, as opposed to unsupervised learning. Unsupervised learning refers to modeling the distribution of instances in a typical, high-dimensional input space.
Each tuple in the training set is described by a vector of attribute values. The bag schema provides the description of the attributes and their domains. For the purpose of this book, a bag schema is denoted as B(A ∪ y), where A denotes the set of n input attributes, A = {a_1, . . . , a_i, . . . , a_n}, and y represents the class variable or the target attribute.
Attributes (sometimes called fields, variables or features) are typically one of two types: nominal (values are members of an unordered set) or numeric (values are real numbers). When the attribute a_i is nominal, it is useful to denote its domain values by dom(a_i) = {v_i,1, v_i,2, . . . , v_i,|dom(a_i)|}, where |dom(a_i)| stands for its finite cardinality. In a similar way, dom(y) = {c_1, . . . , c_|dom(y)|} represents the domain of the target attribute. Numeric attributes have infinite cardinalities.
The instance space (the set of all possible examples) is defined as the Cartesian product of all the input attribute domains: X = dom(a_1) × dom(a_2) × . . . × dom(a_n). The Universal Instance Space (or the Labeled Instance Space) U is defined as the Cartesian product of all input attribute domains and the target attribute domain, i.e.: U = X × dom(y).
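To make this notation concrete, the following minimal Python sketch enumerates a hypothetical instance space built from small nominal domains; the attribute names and domain values are illustrative assumptions, not taken from the book.

```python
from itertools import product

# Hypothetical nominal domains for three input attributes and a target attribute.
dom_a1 = ["Yes", "No"]
dom_a2 = ["Red", "Green", "Blue"]
dom_a3 = ["Low", "High"]
dom_y = ["c1", "c2"]

# Instance space X = dom(a1) x dom(a2) x dom(a3)
X = list(product(dom_a1, dom_a2, dom_a3))

# Universal (labeled) instance space U = X x dom(y)
U = [(x, y) for x in X for y in dom_y]

print(len(X))  # 2 * 3 * 2 = 12 possible unlabeled examples
print(len(U))  # 12 * 2 = 24 possible labeled examples
```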
The training set is a bag instance consisting of a set of m tuples. Formally, the training set is denoted as S(B) = ((x_1, y_1), . . . , (x_m, y_m)), where x_q ∈ X and y_q ∈ dom(y).
Usually, it is assumed that the training set tuples are generated randomly and independently according to some fixed and unknown joint probability distribution D over U. Note that this is a generalization of the deterministic case in which a supervisor classifies a tuple using a function y = f(x).
This book uses the common notation of bag algebra to present projection (π) and selection (σ) of tuples [Grumbach and Milo (1996)]. For example, given the dataset S presented in Table 1.1, the expression π_{a1,a3} σ_{a1="Yes" AND a4>6} S results in the dataset presented in Table 1.2.
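Since Tables 1.1 and 1.2 are not reproduced here, the sketch below uses a hypothetical dataset S to illustrate how selection (σ) and projection (π) translate into ordinary dataframe operations; the column names and values are assumptions, not the book's actual Table 1.1.

```python
import pandas as pd

# A hypothetical training bag S with four input attributes and a target attribute y.
S = pd.DataFrame({
    "a1": ["Yes", "No", "Yes", "Yes"],
    "a2": [1.2, 0.7, 3.4, 2.2],
    "a3": ["Red", "Blue", "Red", "Green"],
    "a4": [8, 3, 9, 5],
    "y":  ["c1", "c2", "c1", "c2"],
})

# Selection sigma_{a1 = "Yes" AND a4 > 6}(S): keep only the tuples satisfying the predicate.
selected = S[(S["a1"] == "Yes") & (S["a4"] > 6)]

# Projection pi_{a1, a3}: keep only the listed attributes (a bag retains duplicates).
result = selected[["a1", "a3"]]
print(result)
```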
To learn a concept is to infer its general definition from a set of examples. This definition may be either explicitly formulated or left implicit, but either way it assigns each possible example to the concept or not. Thus, a concept can be formally regarded as a function from the set of all possible examples to the Boolean set {True, False}, or equivalently as a function from the instance space to the Boolean set, namely c : X → {−1, 1}. Alternatively, one can refer to a concept c as a subset of X, namely {x ∈ X : c(x) = 1}. A concept class C is a set of concepts.
Other communities, such as the KDD community, prefer to deal with a straightforward extension of concept learning, known as the Classification Problem. In this case we search for a function that maps the set of all possible examples into a predefined set of class labels which are not limited to the Boolean set. Most frequently, the goal of classification inducers is formally defined as follows:
Given a training set S with input attribute set A = {a_1, a_2, . . . , a_n} and a nominal target attribute y from an unknown fixed distribution D over the labeled instance space, the goal is to induce an optimal classifier with minimum generalization error.
An induction algorithm, or more concisely an inducer, is an algorithm that receives a training set and constructs a model that generalizes the relationship between the input attributes and the target attribute. For example, an inducer may take as input specific training tuples with the corresponding class labels, and produce a classifier.
The notation I represents an inducer and I(S) represents a model that was induced by applying I to a training set S. Using I(S) it is possible to predict the target value of a tuple x_q. This prediction is denoted as I(S)(x_q).
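The I(S) notation maps naturally onto the fit/predict pattern of common machine-learning libraries. The sketch below, which uses scikit-learn's DecisionTreeClassifier purely as an example inducer with made-up data, shows how I, I(S), and I(S)(x_q) correspond to an algorithm, a trained model, and a prediction.

```python
from sklearn.tree import DecisionTreeClassifier

# Training set S: feature vectors and class labels (toy values).
X_train = [[0, 1], [1, 1], [1, 0], [0, 0]]
y_train = ["c1", "c1", "c2", "c2"]

inducer = DecisionTreeClassifier()        # I: the induction algorithm
model = inducer.fit(X_train, y_train)     # I(S): the classifier induced from S

x_q = [[1, 1]]                            # an unseen tuple
prediction = model.predict(x_q)           # I(S)(x_q): the predicted target value
print(prediction)
```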
Given the long history and recent growth of the field, it is not surprising that several mature approaches to induction are now available to the practitioner.
Classifiers may be represented differently from one inducer to another. For example, C4.5 [Quinlan (1993)] represents a model as a decision tree, while Naïve Bayes [Duda and Hart (1973)] represents a model in the form of probabilistic summaries. Furthermore, inducers can be deterministic (as in the case of C4.5) or stochastic (as in the case of back propagation).
The classifier generated by the inducer can be used to classify an unseen tuple either by explicitly assigning it to a certain class (Crisp Classifier) or by providing a vector of probabilities representing the conditional probability of the given instance belonging to each class (Probabilistic Classifier). Inducers that can construct Probabilistic Classifiers are known as Probabilistic Inducers. In this case it is possible to estimate the conditional probability P̂_{I(S)}(y = c_j | a_i = x_{q,i}; i = 1, . . . , n) of an observation x_q. Note that the "hat" above the conditional probability indicates that it is an estimation, which distinguishes it from the actual conditional probability.
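As an illustration (not taken from the book), a probabilistic inducer such as scikit-learn's GaussianNB exposes both styles of output: predict gives the crisp class, while predict_proba returns the vector of estimated conditional probabilities P̂_{I(S)}(y = c_j | x_q). The data below are toy values.

```python
from sklearn.naive_bayes import GaussianNB

X_train = [[0.1, 1.0], [0.2, 0.9], [0.9, 0.2], [1.0, 0.1]]
y_train = ["c1", "c1", "c2", "c2"]

model = GaussianNB().fit(X_train, y_train)

x_q = [[0.85, 0.3]]
print(model.predict(x_q))        # crisp output: a single class label
print(model.predict_proba(x_q))  # probabilistic output: one estimated probability per class
print(model.classes_)            # the class order used by predict_proba
```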
The following sections briefly review some of the major approaches to concept learning: decision tree induction, neural networks, genetic algorithms, instance-based learning, statistical methods, Bayesian methods and support vector machines. This review focuses on the methods that receive the greatest attention in this book.
Rule induction algorithms generate a set of if-then rules that jointly represent the target function. The main advantage that rule induction offers is its high comprehensibility. Most rule induction algorithms are based on the separate-and-conquer paradigm [Michalski (1983)]. For that reason these algorithms are capable of finding simple axis-parallel frontiers, are well suited to symbolic domains, and can often dispose easily of irrelevant attributes; but they can have difficulty with non-axis-parallel frontiers, and suffer from the fragmentation problem (i.e., the available data dwindles as induction progresses [Pagallo and Haussler (1990)]) and the small disjuncts problem (i.e., rules covering few training examples have a high error rate [Holte et al. (1989)]).
1.7.1 Overview
Bayesian approaches employ probabilistic concept representations, and range from Naïve Bayes [Domingos and Pazzani (1997)] to Bayesian networks. The basic assumption of Bayesian reasoning is that the relation between attributes can be represented as a probability distribution [Maimon and Last (2000)]. Moreover, if the problem examined is supervised, then the objective is to find the conditional distribution of the target attribute given the input attributes.
The predicted value of the target attribute is the one that maximizes the following calculated probability:

v_MAP(x_q) = argmax_{c_j ∈ dom(y)} P̂(y = c_j) · ∏_{i=1}^{n} P̂(a_i = x_{q,i} | y = c_j)    (1.2)
where P̂(y = c_j) denotes the estimation of the a-priori probability of the target attribute obtaining the value c_j. Similarly, P̂(a_i = x_{q,i} | y = c_j) denotes the conditional probability of the input attribute a_i obtaining the value x_{q,i} given that the target attribute obtains the value c_j. Note that the hat above the conditional probability distinguishes the probability estimation from the actual conditional probability.
A simple estimation for the above probabilities can be obtained using the corresponding frequencies in the training set, namely:

P̂(y = c_j) = |σ_{y=c_j} S| / |S|        P̂(a_i = x_{q,i} | y = c_j) = |σ_{a_i=x_{q,i} AND y=c_j} S| / |σ_{y=c_j} S|
Using the Bayes rule, the above equations can be rewritten as:

v_MAP(x_q) = argmax_{c_j ∈ dom(y)} [ log(P̂(y = c_j)) + ∑_{i=1}^{n} ( log(P̂(y = c_j | a_i = x_{q,i})) − log(P̂(y = c_j)) ) ]
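The following minimal Python sketch implements the frequency-based estimation and the maximization of Equation (1.2) for nominal attributes, without any of the corrections discussed later; it is an illustration of the scheme above, not the book's own code, and the toy training set is an assumption.

```python
from collections import Counter, defaultdict

def train_naive_bayes(S):
    """S is a list of (x, y) pairs, where x is a tuple of nominal attribute values."""
    class_counts = Counter(y for _, y in S)
    cond_counts = defaultdict(int)   # cond_counts[(i, value, c)] = #{tuples with a_i = value and y = c}
    for x, y in S:
        for i, value in enumerate(x):
            cond_counts[(i, value, y)] += 1
    return class_counts, cond_counts, len(S)

def predict_map(x_q, class_counts, cond_counts, m):
    """Return the class c_j maximizing P(y = c_j) * prod_i P(a_i = x_q[i] | y = c_j)."""
    best_class, best_score = None, -1.0
    for c, count_c in class_counts.items():
        score = count_c / m                                  # a-priori probability by frequency
        for i, value in enumerate(x_q):
            score *= cond_counts[(i, value, c)] / count_c    # conditional frequency estimate
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy training set: two nominal input attributes and a class label.
S = [(("Yes", "Red"), "c1"), (("Yes", "Blue"), "c1"),
     (("No", "Red"), "c2"), (("No", "Blue"), "c2")]
print(predict_map(("Yes", "Red"), *train_naive_bayes(S)))    # -> c1
```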
If the "naive" assumption is true, this classifier can easily be shown to be op-
timal (i.e. minimizing the generalization error), in the sense of minimizing
the misclassification rate or zero-one loss (misclassification rate), by a di-
rect application of Bayes' theorem. [Domingos and Pazzani (1997)] showed
that the Naive Bayes can be optimal under zero-one loss even when the in-
dependence assumption is violated by a wide margin. This implies that the
Bayesian classifier has a much greater range of applicability than previously
thought, for instance for learning conjunctions and disjunctions. Moreover,
a variety of empirical research shows surprisingly that this method can per-
form quite well compared t o other methods, even in domains where clear
attribute dependencies exist.
There are two known corrections for the simple frequency-based probability estimation which avoid this phenomenon, in which a single zero frequency forces the entire product to zero. The following sections describe these corrections.
In order to use the above correction, the values of p and k should be selected. It is possible to use p = 1/|dom(y)| and k = |dom(y)|. [Ali and Pazzani (1996)] suggest using k = 2 and p = 1/2 in any case, even if |dom(y)| > 2, in order to emphasize the fact that the estimated event is always compared to the opposite event. [Kohavi et al. (1997)] suggest using k = |dom(y)|/|S| and p = 1/|dom(y)|.
1.7.2.5 No Match
According to [Clark and Niblett (1989)], only zero probabilities are corrected and replaced by the following value: p_a/|S|. [Kohavi et al. (1997)] suggest using p_a = 0.5. They also empirically compared the Laplace correction and the No-Match correction and indicate that there is no significant difference between them. However, both of them are significantly better than not performing any correction at all.
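The sketch below shows how the two corrections are typically implemented; the parameter names p, k and p_a follow the text above, the Laplace form (n_c + k·p)/(n + k) is the usual m-estimate style formula, and the exact formulas should be checked against the original references.

```python
def laplace_estimate(n_c, n, p, k):
    """Laplace-corrected frequency estimate: (n_c + k*p) / (n + k).

    n_c -- number of matching instances, n -- total instances in the condition,
    p   -- prior probability assigned to the event, k -- equivalent sample size.
    With p = 1/|dom(y)| and k = |dom(y)| this reduces to the classical (n_c + 1) / (n + |dom(y)|).
    """
    return (n_c + k * p) / (n + k)

def no_match_estimate(n_c, n, size_S, p_a=0.5):
    """No-Match correction: only zero frequencies are replaced, by p_a / |S|."""
    if n_c == 0:
        return p_a / size_S
    return n_c / n

# Example: 0 of 8 instances of a class have attribute value v, in a training set of 100 tuples.
print(laplace_estimate(0, 8, p=0.5, k=2))   # 0.1   (Ali & Pazzani's choice of p and k)
print(no_match_estimate(0, 8, size_S=100))  # 0.005
```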
The neurons in the input layer correspond to the input attributes and the neurons in the output layer correspond to the target attribute. Neurons in the hidden layer are connected to both input and output neurons and are key to inducing the classifier. Note that the signal flow is one-directional, from the input layer to the output layer, and there are no feedback connections.
[Figure: a feedforward network with an input layer, a hidden layer, an output layer, and transfer functions.]
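As a minimal illustration of the one-directional signal flow described above, a single forward pass through one hidden layer can be written as follows; the weights, layer sizes, and the sigmoid transfer function are arbitrary choices for the sketch, not values from the book.

```python
import numpy as np

def sigmoid(z):
    # A common choice of transfer function.
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, W_hidden, b_hidden, W_output, b_output):
    """Propagate an input vector through one hidden layer to the output layer.

    Signals flow only forward: input layer -> hidden layer -> output layer.
    """
    hidden = sigmoid(W_hidden @ x + b_hidden)
    output = sigmoid(W_output @ hidden + b_output)
    return output

# Toy network: 3 input attributes, 4 hidden neurons, 1 output neuron.
rng = np.random.default_rng(0)
x = np.array([0.2, 0.7, 0.1])
W_h, b_h = rng.normal(size=(4, 3)), np.zeros(4)
W_o, b_o = rng.normal(size=(1, 4)), np.zeros(1)
print(forward_pass(x, W_h, b_h, W_o, b_o))
```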
A more detailed description of these induction paradigms can be found in the above reference and in the following section.
A straightforward estimate of the generalization error is the training error. However, using the training error as-is will typically provide an optimistically biased estimate, especially if the learning algorithm overfits the training data. There are two main approaches for estimating the generalization error: theoretical and empirical. In the context of this book we utilize both approaches.
The VC dimension may, however, be quite different from the number of free parameters, and in many cases it might be very difficult to compute accurately. In such cases it is useful to calculate lower and upper bounds for the VC dimension; for instance, [Schmitt (2002)] has presented such bounds for neural networks.
Definition 1.1 Let C be a concept class defined over the input instance space X with n attributes. Let I be an inducer that considers hypothesis space H. C is said to be PAC-learnable by I using H if for all c ∈ C, distributions D over X, ε such that 0 < ε < 1/2 and δ such that 0 < δ < 1/2, learner I with a probability of at least (1 − δ) will output a hypothesis h ∈ H such that error(h, D) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c), where size(c) represents the encoding length of c in C, assuming some representation for C.
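For the special case of a finite hypothesis space H and a learner that outputs a hypothesis consistent with the training set, Definition 1.1 is often made concrete by the well-known sample-complexity bound, quoted here for illustration rather than taken from the text:

m ≥ (1/ε) · (ln|H| + ln(1/δ))

that is, any training set of at least this size guarantees that, with probability at least 1 − δ, the returned consistent hypothesis has error at most ε.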
t(I, S, c_j, x) = 1 if P̂_{I(S)}(y = c_j | x) > P̂_{I(S)}(y = c* | x) for all c* ∈ dom(y), c* ≠ c_j; 0 otherwise.

Note that the probability of misclassifying the instance x using inducer I and a training set of size m can then be expressed in terms of this indicator.
It is important to note that in the case of zero-one loss there are other definitions of the bias-variance components. These definitions are not necessarily consistent. In fact, there is considerable debate in the literature about what the most appropriate definition should be. For a complete list of these definitions please refer to [Hansen (2000)].
Nevertheless, in the context of regression a single definition of bias and variance has been adopted by the entire community. In this case it is useful to define the bias-variance components with respect to the quadratic loss, where f_R(x) represents the prediction of the regression model and f(x) represents the actual value; the intrinsic variance and bias components are defined accordingly.
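For reference, a commonly used form of this quadratic-loss decomposition, written here in standard notation as a sketch (the exact formulation given by [Hansen (2000)] may differ), is:

E_{S,x}[(f_R(x) − f(x))²] = E_x[Var(f(x) | x)] + E_x[(E_S[f_R(x)] − E[f(x) | x])²] + E_x[E_S[(f_R(x) − E_S[f_R(x)])²]]

where the first term is the intrinsic variance (noise), the second is the squared bias, and the third is the variance of the regression model over training sets S.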
Simpler models tend to have a higher bias error and a smaller variance error than complicated models. [Bauer and Kohavi (1999)] have provided experimental results supporting this argument for Naive Bayes, while [Dietterich and Kong (1995)] have examined the bias-variance issue in decision trees. Figure 1.5 illustrates this argument. The figure shows that there is a trade-off between variance and bias. When the classifier is simple it has a large bias and a small variance. As the classifier becomes more complicated, it has larger variance but smaller bias. The minimum generalization error is obtained somewhere in between, where both bias and variance are small.
[Plot of generalization error, variance, and bias as a function of classifier complexity.]
Fig. 1.5 Bias vs. Variance in the Deterministic Case [Hansen (2000)].
Computational complexity for updating a classifier: given new data, what is the computational complexity required for updating the current classifier such that the new classifier reflects the new data?
Computational complexity for classifying a new instance: generally this type is neglected because it is relatively small. However, in certain methods (like k-Nearest Neighbors) or in certain real-time applications (like anti-missile applications), this type can be critical.
1.9.6 Comprehensibility
The comprehensibility criterion (also known as interpretability) refers to how well humans grasp the induced classifier. While the generalization error measures how well the classifier fits the data, comprehensibility measures the "mental fit" of that classifier.
Many techniques, like neural networks or SVMs (Support Vector Machines), are designed solely to achieve accuracy. However, as their classifiers are represented using large assemblages of real-valued parameters, they are also difficult to understand and are referred to as black-box models.
It is often important for the researcher to be able to inspect an induced classifier. For domains such as medical diagnosis, the users must understand how the system makes its decisions in order to be confident in the outcome.
Data mining can also play an important role in the process of scientific discovery. A system may discover salient features in the input data whose importance was not previously recognized. If the representations formed by the inducer are comprehensible, then these discoveries can be made accessible to human review [Hunter and Klein (1993)].
Comprehensibility can vary between different classifiers created by the same inducer. For instance, in the case of decision trees, the size (number of nodes) of the induced trees is also important. Smaller trees are preferred because they are easier to interpret. However, this is only a rule of thumb; in some pathological cases a large and unbalanced tree can still be easily interpreted [Buja and Lee (2001)].
As the reader can see, the accuracy and complexity factors can be quantitatively estimated, while comprehensibility is more subjective.
Another distinction is that complexity and comprehensibility depend mainly on the induction method and much less on the specific domain considered. On the other hand, the dependence of the error metric on the specific domain cannot be neglected.
This argument rests on the distinction between the set of all mathematically possible domains and the domains that occur in the real world, and are therefore the ones of primary interest [Rao et al. (1995)]. Without doubt there are many domains in the former set that are not in the latter, and average accuracy in the real-world domains can be increased at the expense of accuracy in the domains that never occur in practice. Indeed, achieving this is the goal of inductive learning research. It is still true that some algorithms will match certain classes of naturally occurring domains better than other algorithms, and so achieve higher accuracy than these algorithms, and that this may be reversed in other real-world domains; but this does not preclude an improved algorithm from being as accurate as the best in each of the domain classes.
Indeed, in many application domains the generalization error of even
the best methods is far above 0%, and the question of whether it can
be improved, and if so how, is an open and important one. One part
of answering this question is determining the minimum error achievable
by any classifier in the application domain (known as the optimal Bayes
error). If existing classifiers do not reach this level, new approaches are
needed. Although this problem has received considerable attention (see for
instance [Tumer and Ghosh (1996)]), no generally reliable method has so
far been demonstrated.
The "no free lunch" concept presents a dilemma t o the analyst
approaching a new task: which inducer should be used?
If the analyst is looking for accuracy only, one solution is to try each one
in turn, and by estimating the generalization error, to choose the one that
appears to perform best [Schaffer (1994)l. Another approach, known as
multistrategy learning [Michalski and Tecuci (1994)], attempts t o combine
two or more different paradigms in a single algorithm. Most research in
this area has been concerned with combining empirical approaches with
analytical methods (see for instance o ow ell and Shavlik (1994)l. Ideally, a
multistrategy learning algorithm would always perform as well as the best
of its "parents" obviating the need to try each one and simplifying the
knowledge acquisition task. Even more ambitiously, there is hope that this
combination of paradigms might produce synergistic effects (for instance
by allowing different types of frontiers between classes in different regions
of the example space), leading to levels of accuracy that neither atomic
approach by itself would be able t o achieve.
Unfortunately, this approach has often been only moderately successful. Although it is true that in some industrial applications (as in the case of demand planning) this strategy has proved to reduce the error, in many other cases the resulting algorithms are prone to be cumbersome, and often achieve an error that lies between those of their parents, instead of matching the lowest.
The dilemma of which method to choose becomes even greater if other factors, such as comprehensibility, are taken into consideration. For instance, for a specific domain, a neural network may outperform decision trees in accuracy. However, from the comprehensibility aspect, decision trees are considered better. In other words, in this case even if the researcher knows that the neural network is more accurate, he still faces a dilemma about which method to use.
Despite its popularity, the usage of feature selection methodologies for overcoming the obstacles of high dimensionality has several drawbacks.
A number of linear dimension reducers have been developed over the years. The linear methods of dimensionality reduction include projection pursuit [Friedman and Tukey (1973)], factor analysis [Kim and Mueller (1978)], and principal components analysis [Dunteman (1989)]. These methods are not aimed directly at eliminating irrelevant and redundant features, but are rather concerned with transforming the observed variables into a small number of "projections" or "dimensions". The underlying assumptions are that the variables are numeric and that the dimensions can be expressed as linear combinations of the observed variables (and vice versa). Each discovered dimension is assumed to represent an unobserved factor and thus provide a new way of understanding the data (similar to the curve equation in regression models).
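As an illustration of this kind of linear projection (not tied to any dataset in the book, and using synthetic data generated for the sketch), principal components analysis re-expresses numeric variables as a small number of linear combinations:

```python
import numpy as np
from sklearn.decomposition import PCA

# A toy numeric dataset: 100 observations of 5 correlated variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))                 # two underlying "factors"
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(100, 5))

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                           # each row is now 2 "projections" instead of 5

print(pca.explained_variance_ratio_)               # variance captured by each discovered dimension
print(pca.components_)                             # the linear combinations of the observed variables
```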
The linear dimension reducers have been enhanced by constructive induction systems that use a set of existing features and a set of predefined constructive operators to derive new features [Pfahringer (1994); Ragavan and Rendell (1993)]. These methods are effective for high-dimensionality applications only if the original input feature space can in fact be decreased dramatically.
One way to deal with the above-mentioned disadvantages is to use a very large training set (which should increase exponentially as the number of input features increases). However, the researcher rarely enjoys this privilege, and even if it does happen, the researcher will probably encounter the aforementioned difficulties derived from a high number of instances.
In practice, most training sets are still considered "small", not due to their absolute size but rather because they contain too few instances given the nature of the investigated problem, namely the instance space size, the space distribution and the intrinsic noise. Furthermore, even if a sufficient dataset is available, the researcher will probably encounter the aforementioned difficulties derived from a high number of records.