Data Mining: Practical Machine Learning Tools and Techniques
2
Just apply a learner? NO!
• Scheme/parameter selection
• Treat selection process as part of the learning process to avoid
optimistic performance estimates
• Modifying the input:
• Data engineering to make learning possible or easier
• Modifying the output
• Converting multi-class problems into two-class ones
• Re-calibrating probability estimates
3
Attribute selection
4
Scheme-independent attribute selection
5
Scheme-independent attribute selection
6
Attribute subsets for weather data
7
Searching the attribute space
8
Scheme-specific selection
• Wrapper approach to attribute selection: attributes are
selected with the target scheme in the loop
• Implement a “wrapper” around the learning scheme
• Evaluation criterion: cross-validation performance
• Time consuming in general
• Greedy approach: with k attributes, time is quadratic in k (k²)
• With a prior ranking of attributes, time is linear in k
• Can use a significance test to stop cross-validation for a
subset early if it is unlikely to “win” (race search)
• Can be used with forward selection, backward selection, prior ranking, or
special-purpose schemata search
• Scheme-specific attribute selection is essential for learning
decision tables
• Efficient for decision tables and Naïve Bayes
9
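To make the wrapper idea concrete, here is a minimal sketch of greedy forward selection with cross-validation inside the loop. It assumes scikit-learn-style estimators and numpy arrays X (instances by attributes) and y (class labels); naïve Bayes is used only as an example target scheme and can be swapped for any learner.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def forward_selection(X, y, estimator=None, cv=10):
    # Greedy wrapper: repeatedly add the attribute whose inclusion gives the best
    # cross-validated accuracy of the target scheme; stop when nothing improves.
    estimator = estimator if estimator is not None else GaussianNB()
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        scores = {a: cross_val_score(estimator, X[:, selected + [a]], y, cv=cv).mean()
                  for a in remaining}
        a, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:
            break
        selected.append(a)
        remaining.remove(a)
        best_score = score
    return selected, best_score

This is the quadratic-time greedy search mentioned above: in the worst case it evaluates on the order of k² attribute subsets.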
Attribute discretization
10
Discretization: unsupervised
• Unsupervised discretization: determine intervals without
knowing the class labels
• When clustering, this is the only possible way!
• Two well-known strategies:
• Equal-interval binning
• Equal-frequency binning
(also called histogram equalization)
• Unsupervised discretization is normally inferior to
supervised schemes when applied in classification tasks
• But equal-frequency binning works well with naïve Bayes if the number
of intervals is set to the square root of the size of the dataset
(proportional k-interval discretization)
11
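A minimal sketch of the two binning strategies using only numpy; the attribute values are the temperature readings from the example that follows, and setting k to the square root of the number of instances mirrors proportional k-interval discretization.

import numpy as np

def equal_interval_bins(values, k):
    # k intervals of equal width spanning the observed range
    edges = np.linspace(values.min(), values.max(), k + 1)
    return np.digitize(values, edges[1:-1])

def equal_frequency_bins(values, k):
    # k intervals containing (roughly) equal numbers of values
    edges = np.quantile(values, np.linspace(0, 1, k + 1))
    return np.digitize(values, edges[1:-1])

temps = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])
k = int(np.sqrt(len(temps)))   # proportional k-interval discretization for naive Bayes
print(equal_interval_bins(temps, k))
print(equal_frequency_bins(temps, k))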
Discretization: supervised
12
Example: temperature attribute
Temperature 64 65 68 69 70 71 72 72 75 75 80 81 83 85
Play Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
13
Formula for MDL stopping criterion
15
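For reference, the MDL-based criterion of Fayyad and Irani (1993), cited in the bibliographic notes, accepts a candidate split of a set S of N instances into subsets S1 and S2 only if the information gain of the split satisfies

gain(A, T; S) > log2(N − 1)/N + [log2(3^k − 2) − (k·E(S) − k1·E(S1) − k2·E(S2))]/N

where E denotes the class entropy and k, k1, k2 are the numbers of classes represented in S, S1, and S2.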
Error-based vs. entropy-based
• Question:
could the best discretization ever have two adjacent
intervals with the same class?
• Wrong answer: No. For if so,
• Collapse the two
• Free up an interval
• Use it somewhere else
• (This is what error-based discretization will do)
• Right answer: Surprisingly, yes.
• (and entropy-based discretization can do it)
16
Error-based vs. entropy-based
A 2-class,
2-attribute problem
18
Projections
• Simple transformations can often make a large
difference in performance
• Example transformations (not necessarily for
performance improvement):
• Difference of two date attributes
• Ratio of two numeric (ratio-scale) attributes
• Concatenating the values of nominal attributes
• Encoding cluster membership
• Adding noise to data
• Removing data randomly or selectively
• Obfuscating the data
19
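A brief pandas sketch of three of the transformations listed above; the data frame and its column names are invented purely for illustration.

import pandas as pd

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2024-01-01", "2024-02-15"]),
    "end_date":   pd.to_datetime(["2024-01-20", "2024-03-01"]),
    "income":     [50000.0, 64000.0],
    "expenses":   [42000.0, 70000.0],
    "city":       ["Hamilton", "Dunedin"],
    "segment":    ["A", "B"],
})

df["duration_days"] = (df["end_date"] - df["start_date"]).dt.days   # difference of two date attributes
df["expense_ratio"] = df["expenses"] / df["income"]                 # ratio of two ratio-scale attributes
df["city_segment"]  = df["city"] + "_" + df["segment"]              # concatenated nominal values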
Principal component analysis
• Unsupervised method for identifying the important
directions in a dataset
• We can then rotate the data into the (reduced) coordinate
system that is given by those directions
• PCA is a method for dimensionality reduction
• Algorithm:
1. Find the direction (axis) of greatest variance
2. Find the direction of greatest variance that is perpendicular to the previous
direction, and repeat
• Implementation: find eigenvectors of the covariance matrix
of the data
• Eigenvectors (sorted by eigenvalues) are the directions
• Mathematical details are covered in chapter on “Probabilistic methods”
20
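A compact sketch of the algorithm above, assuming a numpy array X with one instance per row; it finds the eigenvectors of the covariance matrix and rotates the centered data onto the leading directions.

import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)                    # center the data
    cov = np.cov(Xc, rowvar=False)             # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: symmetric matrix, eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # sort directions by decreasing variance
    components = eigvecs[:, order[:n_components]]
    return Xc @ components, eigvals[order]     # rotated (reduced) data and per-direction variances

rng = np.random.default_rng(0)
X_reduced, variances = pca(rng.normal(size=(100, 10)), n_components=3)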
Example: 10-dimensional data
22
Partial least-squares regression
• PCA is often used as a pre-processing step before applying
a learning algorithm
• When linear regression is applied, the resulting model is known as
principal components regression
• Output can be re-expressed in terms of the original attributes because
PCA yields a linear transformation
• PCA is unsupervised and ignores the target attribute
• The partial least-squares transformation differs from PCA
in that it takes the class attribute into account
• Finds directions that have high variance and are strongly correlated
with the class
• Applying PLS as a pre-processing step for linear regression
yields partial least-squares regression
23
An algorithm for PLS
1. Start with standardized input attributes
2. Attribute coefficients of the first PLS direction:
● Compute the dot product between each attribute vector and the
class vector in turn; these dot products give the coefficients
3. Coefficients for the next PLS direction:
● Replace each attribute value by the difference (residual) between the
attribute's value and the prediction of that attribute from a
simple regression based on the previous PLS direction
● Compute the dot product between each attribute's residual
vector and the class vector in turn; these dot products give the coefficients
4. Repeat from step 3
24
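A numpy sketch that follows the steps above for a numeric class y; like the algorithm on the slide it deflates only the attributes, and centering of the class vector is an added assumption.

import numpy as np

def pls_directions(X, y, n_directions):
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # 1. standardized input attributes
    y = y - y.mean()                           # assumption: numeric class, centered
    directions = []
    for _ in range(n_directions):
        w = X.T @ y                            # 2./3. dot product of each attribute (residual) with the class
        t = X @ w                              # scores of the instances along this PLS direction
        directions.append(w)
        b = (t @ X) / (t @ t)                  # simple regression of each attribute on the direction scores
        X = X - np.outer(t, b)                 # keep only the residuals for the next direction
    return np.array(directions).T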
Independent component analysis (ICA)
25
Correlation vs. statistical independence
P(A, B) = P(A)P(B)
26
ICA and Mutual Information
27
ICA & FastICA
H(x) = − ∫ p(x) log p(x) dx
28
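The slides name FastICA; the following is a hedged usage sketch with scikit-learn's FastICA (a recent scikit-learn version is assumed for the whiten argument), where the two source signals and the mixing matrix are invented for illustration.

import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.cos(3 * t))]   # two (roughly) independent source signals
A = np.array([[1.0, 0.5], [0.5, 2.0]])                   # unknown mixing matrix
X = sources @ A.T                                        # observed mixed signals

ica = FastICA(n_components=2, whiten="unit-variance", random_state=0)
recovered = ica.fit_transform(X)                         # estimates of the independent components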
Linear discriminant analysis
29
The LDA classifier
30
Quadratic discriminant analysis (QDA)
31
Fisher’s linear discriminant analysis
• Let us now consider Fisher’s LDA projection for dimensionality
reduction, considering the two-class case first
• We seek a projection vector a that can be used to compute
scalar projections y = aᵀx for input vectors x
• This vector is obtained by computing the means of each class,
μ1 and μ2, and then computing two special matrices
• The between-class scatter matrix is calculated as SB = (μ1 − μ2)(μ1 − μ2)ᵀ
(note the use of the outer product of two vectors here, which
gives a matrix)
• The within-class scatter matrix is SW = Σ_{x in class 1} (x − μ1)(x − μ1)ᵀ + Σ_{x in class 2} (x − μ2)(x − μ2)ᵀ
32
Fisher’s LDA: the solution vector
33
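A sketch of the standard closed-form solution for the two-class case, a ∝ SW⁻¹(μ1 − μ2), built from the scatter matrices defined above; X1 and X2 are numpy arrays holding the instances of each class.

import numpy as np

def fisher_direction(X1, X2):
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: summed outer products of the centered instances of each class
    Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    a = np.linalg.solve(Sw, mu1 - mu2)         # solves SW a = (mu1 - mu2)
    return a / np.linalg.norm(a)               # y = a.T x gives the scalar projection of an instance x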
Multi-class FLDA
• Then, find the projection matrix A that maximizes J(A) = |Aᵀ SB A| / |Aᵀ SW A|
Determinants are analogs of variances computed in multiple dimensions,
along the principal directions of the scatter matrices, and multiplied together
• Solutions for finding A are based on solving a “generalized
eigenvalue problem” for each column of the matrix A.
34
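A short sketch of the multi-class case using SciPy's generalized symmetric eigensolver; SB and SW are the scatter matrices, and SW is assumed to be positive definite.

import numpy as np
from scipy.linalg import eigh

def flda_projection(SB, SW, n_directions):
    # Generalized eigenproblem SB v = lambda SW v; the largest eigenvalues give the columns of A
    eigvals, eigvecs = eigh(SB, SW)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_directions]]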
Fisher’s LDA vs PCA
[Figure: 2-D data from Class A and Class B on axes x1 and x2, comparing the FLDA projection direction with the first PCA direction]
36
Time series
• In time series data, each instance represents a
different time step
• Some simple transformations:
• Shift values from the past/future
• Compute difference (delta) between instances (i.e., “derivative”)
• In some datasets, samples are not regular; instead, time is
given by a timestamp attribute
• Need to normalize by the step size when transforming
• Transformations need to be adapted if attributes
represent different time steps
37
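A pandas sketch of these transformations; the timestamps and values are invented, and the last line shows the normalization by step size for irregularly sampled data.

import pandas as pd

ts = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05"]),
    "value":     [10.0, 12.0, 18.0, 17.0],
})

ts["value_lag1"] = ts["value"].shift(1)              # value from the previous time step
ts["delta"] = ts["value"].diff()                     # difference between instances (the "derivative")
step_days = ts["timestamp"].diff().dt.days           # actual step size from the timestamp attribute
ts["delta_per_day"] = ts["delta"] / step_days        # normalize the delta by the step size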
Sampling
38
Automatic data cleansing
39
Robust regression
40
Example: least median of squares
Number of international phone calls from Belgium,
1950–1973
41
Detecting anomalies
42
One-Class Learning
43
Outlier detection
44
Using artificial data for one-class classification
• Can we apply standard multi-class techniques to obtain
one-class classifiers?
• Yes: generate artificial data to represent the unknown
non-target class
• Can then apply any off-the-shelf multi-class classifier
• Can tune rejection rate threshold if classifier produces probability
estimates
• Too much artificial data will overwhelm the target class!
• But: unproblematic if multi-class classifier produces accurate class
probabilities and is not focused on misclassification error
• Generate uniformly random data?
• Curse of dimensionality – as # attributes increases it becomes
infeasible to generate enough data to get good coverage of the space
45
Generating artificial data
• Idea: generate data that is close to the target class
• T – target class, A – artificial class
• Generate artificial data using appropriate distribution P(X | A)
• Data no longer uniformly distributed ->
must take this distribution into account when computing
membership scores for the one-class model
• Want P(X | T), for any instance X; we know P(X | A)
• Train probability estimator P(T | X) on two classes T and A
• Then, rewrite Bayes' rule:
Pr[X | T] = ((1 − Pr[T]) Pr[T | X]) / (Pr[T] (1 − Pr[T | X])) × Pr[X | A]
• For classification, choose a threshold to tune rejection rate
• How to choose P(X | A)? Apply a density estimator to the target
class and use resulting function to model the artificial class
46
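A sketch of the whole recipe; as the last bullet suggests, a density estimate of the target class (here a single Gaussian, an assumption) doubles as P(X | A), a logistic regression serves as the class probability estimator, and the rearranged Bayes' rule above turns its output into membership scores.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.linear_model import LogisticRegression

def one_class_scores(X_target, X_test, n_artificial=1000, prior_t=0.5, seed=0):
    # Density estimate of the target class, reused as the artificial-class distribution P(X | A)
    density = multivariate_normal(X_target.mean(axis=0), np.cov(X_target, rowvar=False))
    X_art = density.rvs(size=n_artificial, random_state=seed)
    # Class probability estimator P(T | X) trained on target (1) vs. artificial (0) data
    X = np.vstack([X_target, X_art])
    y = np.r_[np.ones(len(X_target)), np.zeros(n_artificial)]
    p_t = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X_test)[:, 1]
    # Rearranged Bayes' rule: Pr[X|T] = (1 - Pr[T]) Pr[T|X] / (Pr[T] (1 - Pr[T|X])) * Pr[X|A]
    return (1 - prior_t) * p_t / (prior_t * (1 - p_t)) * density.pdf(X_test)

Thresholding the returned scores then tunes the rejection rate.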
Transforming multiple classes to binary ones
47
Error-correcting output codes
48
More on ECOCs
• Two optimization criteria for code matrix:
1. Row separation: minimum distance between rows
2. Column separation: minimum distance between columns (and
columns’ complements)
• Why is column separation important? Because if columns are
identical, column classifiers will likely make the same errors
• Even if columns are not identical, error-correction is
weakened if errors are correlated
• 3 classes: only 2³ = 8 possible columns
• (and 4 out of the 8 are complements)
• Cannot achieve row and column separation
• ECOCs only work for problems with > 3 classes
49
Exhaustive ECOCs
51
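A sketch of how an exhaustive code matrix can be generated for k classes (2^(k−1) − 1 columns, no identical or complementary columns) and how a vector of column-classifier outputs is decoded by Hamming distance; fixing the first class's bit to 1 is one common way to exclude complements.

import numpy as np

def exhaustive_code(k):
    # Each column assigns class 0 the bit 1; the other k-1 bits run through all patterns
    # except all-ones, which would make the column trivial.  Result: 2^(k-1) - 1 columns.
    cols = []
    for pattern in range(2 ** (k - 1) - 1):
        cols.append([1] + [(pattern >> j) & 1 for j in range(k - 1)])
    return np.array(cols).T                     # shape: (k classes, 2^(k-1) - 1 bits)

def decode(bit_predictions, code):
    # Choose the class whose code word is closest in Hamming distance to the predicted bits
    return int(np.argmin(np.abs(code - np.asarray(bit_predictions)).sum(axis=1)))

code = exhaustive_code(4)                       # 4 classes -> 7-bit code words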
Ensembles of nested dichotomies
52
Example with four classes
Full set of classes: [a, b, c, d]
Two disjoint subsets: [a, b] and [c, d]
A two-class classifier is learned at each internal node of this tree
54
Ensembles of nested dichotomies
• If there is no reason a priori to prefer any particular
decomposition, then use them all
• Impractical for any non-trivial number of classes
• Consider a subset by taking a random sample of possible
tree structures
• Implement caching of models for efficiency (since a given two-class
problem may occur in multiple trees)
• Average probability estimates over the trees
• Experiments show that this approach yields accurate multiclass
classifiers
• Can even improve the performance of methods that can already
handle multiclass problems!
55
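A sketch of drawing one random nested dichotomy over a class set, as used when sampling tree structures for the ensemble; the split is chosen by a random shuffle and cut, which is one simple (not necessarily uniform) sampling scheme.

import random

def random_dichotomy(classes, rng=None):
    # Leaf: a single class.  Internal node: a random split into two disjoint, non-empty
    # subsets, each refined recursively; a two-class classifier is trained at every internal node.
    rng = rng or random.Random(0)
    if len(classes) == 1:
        return classes[0]
    shuffled = list(classes)
    rng.shuffle(shuffled)
    cut = rng.randint(1, len(shuffled) - 1)
    return (random_dichotomy(shuffled[:cut], rng), random_dichotomy(shuffled[cut:], rng))

print(random_dichotomy(["a", "b", "c", "d"]))   # one possible nesting of the four classes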
Calibrating class probabilities
• Class probability estimation is harder than
classification:
• Classification error is minimized as long as the correct class
is predicted with maximum probability
• Estimates that yield correct classification may be quite poor
with respect to quadratic or informational loss
• But: it is often important to have accurate class
probabilities
• E.g. cost-sensitive prediction using the minimum expected
cost method
56
Visualizing inaccurate probability estimates
• Consider a two-class problem. Probabilities that are correct for
classification may be:
• Too optimistic – too close to either 0 or 1
• Too pessimistic – not close enough to 0 or 1
Reliability diagram
showing overoptimistic
probability estimation
for a two-class problem
57
Calibrating class probabilities
58
Calibrating class probabilities
• Can view calibration as a function estimation problem
• One input – estimated class probability – and one output – the calibrated
probability
• Reasonable assumption in many cases: the function is
piecewise constant and monotonically increasing
• Can use isotonic regression, which estimates a monotonically
increasing piecewise-constant function:
• Minimizes squared error between the observed class
“probabilities” (0/1) and the resulting calibrated class probabilities
• Alternatively, can use logistic regression to estimate the
calibration function
• Note: must use the log-odds of the estimated class probabilities as input
• Advantage: multiclass logistic regression can be used for
calibration in the multiclass case
59
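A sketch of the isotonic-regression option with scikit-learn; p_raw holds uncalibrated probability estimates for the positive class on a held-out calibration set and y_cal the corresponding 0/1 labels (both assumed given).

from sklearn.isotonic import IsotonicRegression

def fit_calibrator(p_raw, y_cal):
    # Monotonically increasing, piecewise-constant fit that minimizes squared error
    # between the 0/1 "probabilities" and the calibrated estimates
    return IsotonicRegression(increasing=True, out_of_bounds="clip").fit(p_raw, y_cal)

# calibrated = fit_calibrator(p_raw, y_cal).predict(p_new)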
Weka implementations
• Attribute selection
• CfsSubsetEval (correlation-based attribute subset evaluator)
• ConsistencySubsetEval (measures class consistency for a given set of
attributes, in the consistencySubsetEval package)
• ClassifierSubsetEval (uses a classifier for evaluating subsets of attributes, in the
classifierBasedAttributeSelection package)
• SVMAttributeEval (ranks attributes according to the magnitude of the
coefficients learned by an SVM, in the SVMAttributeEval package)
• ReliefF (instance-based approach for ranking attributes)
• WrapperSubsetEval (uses a classifier plus cross-validation)
• GreedyStepwise (forward selection and backward elimination search)
• LinearForwardSelection (forward selection with a sliding window of attribute
choices at each step of the search, in the linearForwardSelection package)
• BestFirst (search method that uses greedy hill-climbing with backtracking)
• RaceSearch (uses the race search methodology, in the raceSearch package)
• Ranker (ranks individual attributes according to their evaluation)
60
Weka implementations
61
Weka implementations
• OneClassClassifier
• Implements one-class classification using artificial data (available in
the oneClassClassifier package)
• MultiClassClassifier
• Includes several ways of handling multiclass problems with two-class
classifiers, including error-correcting output codes
• END
• Ensembles of nested dichotomies, in the
ensemblesOfNestedDichotomies package
• Many other preprocessing tools are available:
• Arithmetic operations; time-series operations; obfuscation;
generating cluster membership values; adding noise; various
conversions between numeric, binary, and nominal attributes; and
various data cleansing operations
62
Further Reading and Bibliographic Notes
63
Further Reading and Bibliographic Notes
• Kira and Rendell (1992) used instance-based methods to select features, leading to
a scheme called RELIEF for Recursive Elimination of Features
• Gilad-Bachrach, Navot, and Tishby (2004) show how this scheme can be modified
to work better with redundant attributes
• The correlation-based feature selection method is due to Hall (2000)
• The use of wrapper methods for feature selection is due to John, Kohavi, and
Pfleger (1994) and Kohavi and John (1997)
• Genetic algorithms have been applied within a wrapper framework by Vafaie and
DeJong (1992) and Cherkauer and Shavlik (1996)
• The selective naïve Bayes learning scheme is due to Langley and Sage (1994)
• Guyon, Weston, Barnhill, and Vapnik (2002) present and evaluate the recursive
feature elimination scheme in conjunction with support vector machines
• The method of raced search was developed by Moore and Lee (1994)
• Gütlein, Frank, Hall, and Karwath (2009) show how to speed up scheme-specific
selection for datasets with many attributes using simple ranking-based methods
64
Further Reading and Bibliographic Notes
• Dougherty, Kohavi, and Sahami (1995) show results comparing the entropy-
based discretization method with equal-width binning and the 1R method
• Frank and Witten (1999) describe the effect of using the ordering information
in discretized attributes
• Proportional k-interval discretization for Naive Bayes was proposed by Yang
and Webb (2001)
• The entropy-based method for discretization, including the use of the MDL
stopping criterion, was developed by Fayyad and Irani (1993)
• The bottom-up statistical method using the χ2 test is due to Kerber (1992)
• An extension to an automatically determined significance level is described by
Liu and Setiono (1997)
• Fulton, Kasif, and Salzberg (1995) use dynamic programming for discretization
and present a linear-time algorithm for error-based discretization
• The example used for showing the weakness of error-based discretization is
adapted from Kohavi and Sahami (1996)
65
Further Reading and Bibliographic Notes
66
Further Reading and Bibliographic Notes
• Barnett and Lewis (1994) address the general topic of outliers in data from a
statistical point of view
• Pearson (2005) describes the statistical approach of fitting a distribution to
the target data
• Schölkopf, Williamson, Smola, Shawe-Taylor, and Platt (2000) describe the
use of support vector machines for novelty detection
• Abe, Zadrozny, and Langford (2006), amongst others, use artificial data as a
second class
• Combining density estimation and class probability estimation using artificial
data is suggested for unsupervised learning by Hastie et al. (2009)
• Hempstalk, Frank, and Witten (2008) describe it in the context of one-class
classification
• Hempstalk and Frank (2008) discuss how to fairly compare one-class and
multiclass classification when discriminating against new classes of data
67
Further Reading and Bibliographic Notes
• Vitter (1985) describes the algorithm for reservoir sampling we used, he
called it method R; its computational complexity is O(#instances)
• Rifkin and Klautau (2004) show that the one-vs-rest method for multiclass
classification can work well if appropriate parameter tuning is applied
• Friedman (1996) describes the technique of pairwise classification and
Fürnkranz (2002) further analyzes it
• Hastie and Tibshirani (1998) extend it to estimate probabilities using
pairwise coupling
• Fürnkranz (2003) evaluates pairwise classification as a technique for
ensemble learning
• ECOCs for multi-class classification were proposed by Dietterich and Bakiri
(1995); Ricci and Aha (1998) showed how to apply them to nearest
neighbor classifiers
• Frank and Kramer (2004) introduce ensembles of nested dichotomies
• Dong, Frank, and Kramer (2005) considered balanced nested dichotomies
to reduce training time
68
Further Reading and Bibliographic Notes
69