Unit 5
09/02/2023 1
Introduction
• Traditional Data Mining Categories
• Majority of Objects
• Dependency detection
• Class identification
• Class description
• Exceptions
• Exception/outlier detection
Motivation for Outlier Analysis
• Fraud Detection (credit card, telecommunications, criminal activity in e-commerce)
• Customized Marketing (high/low income buying habits)
• Medical Treatments (unusual responses to various drugs)
• Analysis of performance statistics (professional athletes)
• Weather Prediction
• Financial Applications (loan approval, stock tracking)
What Are Outliers?
• Outlier: a data object that deviates significantly from the normal objects, as if it were generated by a different mechanism
• Ex.: an unusual credit card purchase; in sports: Michael Jordan, Wayne Gretzky, ...
• Outliers are different from noise data
• Noise is random error or variance in a measured variable
• Noise should be removed before outlier detection
• Outliers are interesting: they violate the mechanism that generates the normal data
• Outlier detection vs. novelty detection: a novel pattern may first be reported as an outlier, but may later be merged into the model of normal data
• Applications:
• Credit card fraud detection
• Telecom fraud detection
• Customer segmentation
• Medical analysis
When trying to detect outliers in a dataset, it is very important to keep the
context in mind and try to answer the question: "Why do I want to detect
outliers?" The meaning of your findings will be dictated by the context.
Causes of Outliers
• Poor data quality / contamination
• Low quality measurements, malfunctioning equipment, manual error
• Correct but exceptional data
Why outlier analysis?
Types of Outliers
• Three kinds: global, contextual and collective outliers
• Global outlier (or point anomaly)
• An object Og is a global outlier if it significantly deviates from the rest of the data set
• Ex. Intrusion detection in computer networks
• Issue: Find an appropriate measurement of deviation
• Contextual outlier (or conditional outlier)
• An object Oc is a contextual outlier if it deviates significantly with respect to a selected context
• Ex.: 80 °F in Urbana: an outlier? (it depends on whether it is summer or winter)
• Attributes of data objects should be divided into two groups
• Contextual attributes: defines the context, e.g., time & location
• Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
• Can be viewed as a generalization of local outliers—whose density
significantly deviates from its local area
• Issue: How to define or formulate meaningful context?
What Are Outliers? (Cont’d)
• Collective Outliers
• A subset of data objects collectively deviate significantly
from the whole data set, even if the individual data objects
may not be outliers
• Applications: E.g., intrusion detection:
• When a number of computers keep sending denial-of-service packets to each
other
• Detection of collective outliers:
• Consider not only the behavior of individual objects, but also that of groups of objects
• Need background knowledge of the relationships among the data objects
Challenges of Outlier Detection
• Modeling normal objects and outliers properly
• Hard to enumerate all possible normal behaviors in an application
• The border between normal and outlier objects is often a gray area
• Handling noise: noise may blur the distinction between normal objects and outliers; it may help hide outliers and reduce the effectiveness of outlier detection
• Understandability
• Understand why these are outliers: justification of the detection
OUTLIER DETECTION METHODS
Two ways to categorize outlier detection methods:
1. Based on whether user-labeled examples of outliers can
be obtained:
• supervised
• semi-supervised
• unsupervised methods
2. Based on assumptions about normal data and outliers:
• statistical, proximity-based, and clustering-based methods
Outlier Detection I: Supervised Methods
• Modeling outlier detection as a classification problem
• Samples examined by domain experts used for training & testing
• Methods for learning a classifier for outlier detection effectively:
• Model normal objects & report those not matching the model as
outliers, or
• Model outliers and treat those not matching the model as normal
• Challenges
• Imbalanced classes, i.e., outliers are rare: Boost the outlier class and
make up some artificial outliers
• Catch as many outliers as possible, i.e., recall is more important than
accuracy (i.e., not mislabeling normal objects as outliers)
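The class-imbalance point above can be illustrated with a small sketch. This is not a production method: it uses a toy Gaussian-per-class classifier with an assumed fixed, shared variance (all names here are hypothetical) to show how oversampling the rare outlier class changes the decision for a borderline point, improving recall on outliers.

```python
import math

def classify(x, data_by_class, sigma=2.0):
    """Gaussian class-conditional score with class priors from counts.
    sigma is assumed shared and fixed -- a simplification for this sketch."""
    total = sum(len(v) for v in data_by_class.values())
    best, best_score = None, -math.inf
    for label, pts in data_by_class.items():
        mu = sum(pts) / len(pts)
        prior = len(pts) / total
        # log prior + log Gaussian likelihood (constants dropped)
        score = math.log(prior) - (x - mu) ** 2 / (2 * sigma ** 2)
        if score > best_score:
            best, best_score = label, score
    return best

normal = [0.0] * 100   # abundant normal samples (all at 0 for simplicity)
outliers = [10.0]      # a single labeled outlier

imbalanced = {"normal": normal, "outlier": outliers}
balanced = {"normal": normal, "outlier": outliers * 100}  # oversampled

x = 6.0  # closer to the outlier mean, but the skewed prior hides that
print(classify(x, imbalanced))  # normal  (outlier missed: poor recall)
print(classify(x, balanced))    # outlier (boosted class catches it)
```

The point at 6.0 is nearer the outlier mean, yet the 100:1 prior makes the imbalanced model label it normal; replicating the outlier class flips the decision.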
Outlier Detection II: Unsupervised Methods
• Assume the normal objects are somewhat “Clustered” into multiple
groups, each having some distinct features
• An outlier is expected to be far away from any groups of normal
objects
• Weakness: Cannot detect collective outlier effectively
• Ex. In some intrusion or virus detection, normal activities are diverse
• Many clustering methods can be adapted for unsupervised methods.
Unsupervised Methods (Cont’d)
• Assume the normal objects are somewhat "clustered" into multiple groups, each having
some distinct features
• An outlier is expected to be far away from any groups of normal objects
• Weakness: Cannot detect collective outlier effectively
• Normal objects may not share any strong patterns, but the collective outliers may share
high similarity in a small area
• Ex. In some intrusion or virus detection, normal activities are diverse
• Unsupervised methods may have a high false positive rate but still miss many real
outliers.
• Supervised methods can be more effective, e.g., identify attacking some key resources
• Many clustering methods can be adapted for unsupervised methods
• Find clusters, then outliers: not belonging to any cluster
• Problem 1: Hard to distinguish noise from outliers
• Problem 2: Costly since first clustering: but far less outliers than normal objects
• Newer methods: tackle outliers directly
Outlier Detection III: Semi-Supervised
Methods
• Labels could be on outliers only, normal objects only, or both
• Semi-supervised outlier detection: regarded as an application of semi-supervised learning
• This can be done in two ways:
• If some labeled normal objects are available
• If only some labeled outliers are available
Semi-Supervised Methods (Cont’d)
• Situation: In many applications, the number of labeled data is often small: Labels could
be on outliers only, normal objects only, or both
• Semi-supervised outlier detection: Regarded as applications of semi-supervised learning
• If only some labeled outliers are available, a small number of labeled outliers may not
cover the possible outliers well
• To improve the quality of outlier detection, one can get help from models for normal
objects learned from unsupervised methods
Method 2
Outlier detection based on assumptions about normal
data and outliers
Outlier Detection (1): Statistical Methods
• Statistical methods (also known as model-based methods) assume that the
normal data follow some statistical model (a stochastic model)
• The data not following the model are outliers.
Outlier Detection (2): Proximity-Based Methods
• An object is an outlier if the nearest neighbors of the object are far
away, i.e., the proximity of the object deviates significantly from the
proximity of most of the other objects in the same data set
• Example: model the proximity of an object by its distance to its nearest neighbors
Outlier Detection (3): Clustering-Based Methods
• Normal data belong to large and dense clusters, whereas outliers belong to small
clusters, or do not belong to any clusters.
Example: two clusters
• All points not in R form a large cluster
• The two points in R form a tiny cluster, thus are outliers
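The two-cluster example above can be sketched in code. The grouping rule here is an assumed single-linkage chain with an eps threshold (a simplification, not a specific published algorithm): clusters smaller than a size threshold are reported as outliers, mirroring the tiny cluster R.

```python
import math

def cluster_outliers(points, eps=1.5, min_cluster_size=3):
    """Group points whose pairwise chains stay within eps (single linkage),
    then flag members of tiny clusters as outliers. A minimal sketch."""
    n = len(points)
    labels = [-1] * n
    cid = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        # Flood-fill over the eps-neighborhood graph
        stack, labels[i] = [i], cid
        while stack:
            j = stack.pop()
            for k in range(n):
                if labels[k] == -1 and math.dist(points[j], points[k]) <= eps:
                    labels[k] = cid
                    stack.append(k)
        cid += 1
    sizes = [labels.count(c) for c in range(cid)]
    return [p for p, lbl in zip(points, labels) if sizes[lbl] < min_cluster_size]

data = [(0, 0), (1, 0), (0, 1), (1, 1),   # large, dense cluster
        (10, 10), (10.5, 10.2)]           # tiny cluster -> outliers
print(cluster_outliers(data))  # [(10, 10), (10.5, 10.2)]
```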
Statistical Approaches
• Statistical approaches assume that the objects in a data set are generated by a
stochastic process (a generative model)
• Idea: learn a generative model fitting the given data set, and then identify the
objects in low probability regions of the model as outliers
• Methods are divided into two categories: parametric vs. non-parametric
• Parametric method
• Assumes that the normal data is generated by a parametric distribution
with parameter θ
• The probability density function of the parametric distribution f(x, θ) gives
the probability that object x is generated by the distribution
• The smaller this value, the more likely x is an outlier
• Non-parametric method
• Does not assume an a priori statistical model; determines the model from the
input data
• Not completely parameter-free: the number and nature of the parameters are
flexible and not fixed in advance
• Examples: histogram and kernel density estimation
Parametric Methods I: Detecting Univariate Outliers
Based on a Normal Distribution
• Univariate data: A data set involving only one attribute or variable
• Often assume that data are generated from a normal distribution, learn the
parameters from the input data, and identify the points with low probability as
outliers
• Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
• Use the maximum likelihood method to estimate μ and σ
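The example above can be worked end to end. The sketch below computes the maximum-likelihood estimates of μ and σ for the temperature data and flags points far from the mean; the 2.5σ cutoff is an assumed heuristic for this sketch (μ ± 3σ is another common choice).

```python
import math

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]

# Maximum likelihood estimates of the normal distribution's parameters
n = len(temps)
mu = sum(temps) / n
sigma = math.sqrt(sum((x - mu) ** 2 for x in temps) / n)  # MLE divides by n

# Flag points far from the mean in units of sigma (2.5 is an assumed cutoff)
outliers = [x for x in temps if abs(x - mu) / sigma > 2.5]
print(round(mu, 2), round(sigma, 2), outliers)  # 28.61 1.54 [24.0]
```

Only 24.0 lies nearly three standard deviations from the estimated mean, so it is the one point flagged.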
Parametric Methods (II) - Detection of
Multivariate Outliers
• Multivariate data: A data set involving two or more attributes or variables
Detection of Multivariate Outliers (Cont’d)
• Method 1. Compute the Mahalanobis distance
• Let ō be the mean vector for a multivariate data set. The Mahalanobis distance
from an object o to ō is MDist(o, ō) = (o − ō)ᵀ S⁻¹ (o − ō), where S is the
covariance matrix
• Use Grubbs' test on this measure to detect outliers
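A small sketch of the Mahalanobis computation in pure Python, restricted to 2-D data so the 2×2 matrix inverse can be written out by hand (the data set is invented for illustration). The follow-up Grubbs' test on the distances is left out; here the largest distance simply identifies the most suspicious point.

```python
def mean_vec(data):
    n = len(data)
    return [sum(x[i] for x in data) / n for i in range(2)]

def cov2(data, m):
    # 2x2 sample covariance matrix (divides by n - 1)
    n = len(data)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in data:
        d = [x[0] - m[0], x[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j] / (n - 1)
    return s

def mahalanobis_sq(x, m, s):
    # (x - m)^T S^{-1} (x - m), with the 2x2 inverse written explicitly
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    d = [x[0] - m[0], x[1] - m[1]]
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))

data = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.9), (1.1, 2.2), (1.0, 2.1), (6.0, 7.0)]
m = mean_vec(data)
s = cov2(data, m)
dists = [mahalanobis_sq(x, m, s) for x in data]
# The exceptional point (6.0, 7.0) gets the largest distance
print(max(range(len(data)), key=lambda i: dists[i]))  # 5
```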
Parametric Methods (III) – Using
Mixture of Parametric Distributions
• Assuming data are generated by a single normal distribution can
sometimes be overly simplified.
• Example: the objects between the two clusters cannot
be captured as outliers, since they are close to the
estimated mean.
• To overcome this problem, assume the normal data is
generated by two normal distributions.
Using Mixture of Parametric Distributions (Cont’d)
• Assuming data generated by a normal distribution
could be sometimes overly simplified
• Example (right figure): The objects between the two
clusters cannot be captured as outliers since they are
close to the estimated mean
To overcome this problem, assume the normal data is generated by two normal
distributions. For any object o in the data set, the probability that o is generated
by the mixture of the two distributions is given by
Pr(o | θ1, θ2) = fθ1(o) + fθ2(o),
where fθ1 and fθ2 are the probability density functions of θ1 and θ2
Then use the EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from the data
An object o is an outlier if it does not belong to any cluster
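The EM step can be sketched for the one-dimensional case. This is a simplified, hand-rolled EM for a two-component Gaussian mixture (the min/max initialization and the variance floor are assumptions of this sketch); the object sitting between the two clusters receives the lowest mixture density and is reported as the outlier.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iters=50):
    # Crude initialization: put the components at the data extremes
    mu1, mu2 = min(xs), max(xs)
    var1 = var2 = 1.0
    w = 0.5  # mixing weight of component 1
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r = [w * normal_pdf(x, mu1, var1) /
             (w * normal_pdf(x, mu1, var1) + (1 - w) * normal_pdf(x, mu2, var2))
             for x in xs]
        # M-step: re-estimate weight, means, variances (floored at 1e-3)
        n1 = sum(r)
        n2 = len(xs) - n1
        w = n1 / len(xs)
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / n1
        mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / n2
        var1 = max(sum(ri * (x - mu1) ** 2 for ri, x in zip(r, xs)) / n1, 1e-3)
        var2 = max(sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, xs)) / n2, 1e-3)
    return mu1, var1, mu2, var2, w

xs = [-0.5, -0.2, 0.0, 0.1, 0.3, 0.4, 5.0, 9.6, 9.8, 10.0, 10.1, 10.3, 10.5]
mu1, var1, mu2, var2, w = em_two_gaussians(xs)

# Mixture density Pr(o | theta1, theta2): the point between the clusters
# gets by far the lowest value
density = [w * normal_pdf(x, mu1, var1) + (1 - w) * normal_pdf(x, mu2, var2)
           for x in xs]
print(xs[density.index(min(density))])  # 5.0
```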
Non-Parametric Methods: Detection Using Histogram
• The model of normal data is learned from the input data without any a
priori structure.
• Often makes fewer assumptions about the data, and thus can be
applicable in more scenarios
• Outlier detection using histogram:
Figure shows the histogram of purchase amounts in transactions
A transaction in the amount of $7,500 is an outlier, since only 0.2% transactions have an
amount higher than $5,000
Problem: Hard to choose an appropriate bin size for histogram
Too small bin size → normal objects in empty/rare bins, false positive
Too big bin size → outliers in some frequent bins, false negative
Solution: Adopt kernel density estimation to estimate the probability density
distribution of the data. If the estimated density at an object is high, the object is
likely normal; otherwise, it is likely an outlier.
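The bin-size trade-off above can be demonstrated with a tiny histogram-based detector. The $500 bin width and the 5% rare-bin threshold are assumed for illustration only.

```python
def histogram_outliers(values, bin_width, min_fraction=0.05):
    """Flag values that fall into bins holding less than min_fraction of
    the data. bin_width matters: too small -> false positives (normal
    points land in rare bins); too large -> false negatives (outliers
    hide in frequent bins)."""
    counts = {}
    for v in values:
        b = int(v // bin_width)
        counts[b] = counts.get(b, 0) + 1
    n = len(values)
    return [v for v in values if counts[int(v // bin_width)] / n < min_fraction]

# Transaction amounts: most under $100, one exceptional $7,500 purchase
amounts = [12, 25, 37, 44, 58, 63, 71, 85, 92, 99] * 5 + [7500]
print(histogram_outliers(amounts, bin_width=500))  # [7500]
```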
Proximity-Based Approaches
• Intuition: Objects that are far away from the others are outliers
• Assumption of proximity-based approach: The proximity of an outlier
deviates significantly from that of most of the others in the data set
• Two types of proximity-based outlier detection methods
• Distance-based outlier detection: An object o is an outlier if its
neighborhood does not have enough other points
• Density-based outlier detection: An object o is an outlier if its density is
relatively much lower than that of its neighbors
Proximity-Based Approaches (I) – Distance
based Approaches
• General Idea
• Judge a point based on the distance(s) to its neighbors
• Several variants proposed
• Basic Assumption
• Normal data objects have a dense neighborhood
• Outliers are far apart from their neighbors, i.e., have a less dense
neighborhood
Distance based Approaches (Cont’d)
• DB(ε, π)-Outliers
• Basic model [Knorr and Ng 1997]
• Given a radius ε and a percentage π
• A point p is considered an outlier if at most π percent of all other points have a distance to
p less than ε:
OutlierSet(ε, π) = { p | Card({ q ∈ DB | dist(p, q) < ε }) / Card(DB) ≤ π }
(a range query with radius ε around each point)
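A direct, nested-loop reading of the DB(ε, π) model above, as a sketch (the toy data set and parameter values are assumed for illustration):

```python
import math

def db_outliers(points, eps, pi):
    """DB(eps, pi)-outliers in the spirit of Knorr & Ng (1997): p is an
    outlier if at most a fraction pi of the points lie within eps of p."""
    n = len(points)
    out = []
    for p in points:
        # Range query with radius eps around p (self excluded)
        close = sum(1 for q in points if q is not p and math.dist(p, q) < eps)
        if close / n <= pi:
            out.append(p)
    return out

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (8, 8)]
print(db_outliers(pts, eps=2.0, pi=0.1))  # [(8, 8)]
```

Every point in the dense square has at least four ε-neighbors (well above the π fraction), while (8, 8) has none, so only it is reported.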
Distance based Approaches (Cont’d)
• Deriving intensional knowledge [Knorr and Ng 1999]
• Relies on the DB(ε, π)-outlier model
• Find the minimal subset(s) of attributes that explains the "outlierness" of a point, i.e., in
which the point is still an outlier
Distance based Approaches (Cont’d)
• Nested-loop based [Knorr and Ng 1998]
– Divide the buffer into two parts
– Use the second part to scan/compare all points with the points from the first
part
Distance based Approaches (Cont’d)
• Index-based [Knorr and Ng 1998]
– Compute the distance range join using a spatial index structure
– Exclude a point from further consideration if its ε-neighborhood contains more than
Card(DB) · π points
A Grid-Based Method
• Why is efficiency still a concern? When the complete set of objects cannot be held in
main memory, there is an I/O swapping cost
• The major cost: (1) each object is tested against the whole data set; why not only against
its close neighbors? (2) objects are checked one by one; why not group by group?
• Grid-based method (CELL): the data space is partitioned into a multi-dimensional grid; each
cell is a hypercube with diagonal length r/2
Pruning using the level-1 & level 2 cell properties:
For any possible point x in cell C and any possible point y in a level-1 cell,
dist(x,y) ≤ r
For any possible point x in cell C and any point y such that dist(x,y) ≥ r, y is
in a level-2 cell
Thus we only need to check the objects that cannot be pruned, and even for such an object o, only
need to compute the distance between o and the objects in the level-2 cells (since beyond level-2,
the distance from o is more than r)
Density-based Approaches
• General idea
• Compare the density around a point with the density around its local neighbors
• The relative density of a point compared to its neighbors is computed as an
outlier score
• Approaches essentially differ in how to estimate density
• Basic assumption
• The density around a normal data object is similar to the density around its
neighbors
• The density around an outlier is considerably different from the density around its
neighbors
Density-based Approaches
• Local Outlier Factor (LOF) [Breunig et al. 1999], [Breunig et al. 2000]
• Motivation:
• Distance-based outlier detection models have problems with different densities
• How to compare the neighborhood of points from areas of different densities?
• Example (two clusters C1 and C2, with a point o2 just outside the dense cluster C2)
• DB(ε, π)-outlier model: the parameters ε and π cannot be chosen
so that o2 is an outlier but none of the
points in cluster C1 (e.g., q) is an outlier
• Outliers based on kNN-distance: the kNN-distances of objects in C1 (e.g., q)
are larger than the kNN-distance of o2
Local Outlier Factor: LOF
The LOF (local outlier factor) of an object o is the average ratio of the local reachability
densities of o's k-nearest neighbors to the local reachability density of o
The lower the local reachability density of o, and the higher the local reachability densities
of the kNN of o, the higher the LOF
This captures a local outlier whose local density is relatively low compared to the local
densities of its kNN
Density-based Approaches
• Model
• Reachability distance (introduces a smoothing factor):
reach-dist_k(p, o) = max{ k-distance(o), dist(p, o) }
• Local reachability density (lrd) of point p: the inverse of the average
reachability distance to the kNNs of p:
lrd_k(p) = 1 / ( Σ_{o ∈ kNN(p)} reach-dist_k(p, o) / Card(kNN(p)) )
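The definitions above can be turned into a small, unoptimized LOF sketch (O(n²) distance computation, no index; ties are kept in the k-distance neighborhood; the data set is invented for illustration):

```python
import math

def lof_scores(points, k):
    """Minimal LOF sketch: k-distance neighborhoods, reachability
    distances, local reachability density (lrd), and the LOF score."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    def knn(i):
        # Neighbors within the k-distance of point i (self excluded)
        order = sorted((dist[i][j], j) for j in range(n) if j != i)
        kdist = order[k - 1][0]
        return kdist, [j for d, j in order if d <= kdist]

    kdists, neighbors = zip(*(knn(i) for i in range(n)))

    def reach_dist(i, j):
        # reach-dist_k(i, j) = max{ k-distance(j), dist(i, j) }
        return max(kdists[j], dist[i][j])

    lrd = [len(neighbors[i]) / sum(reach_dist(i, j) for j in neighbors[i])
           for i in range(n)]
    # LOF = average lrd of the neighbors divided by the point's own lrd
    return [sum(lrd[j] for j in neighbors[i]) / (lrd[i] * len(neighbors[i]))
            for i in range(n)]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = lof_scores(pts, k=3)
# Points inside the cluster score near 1; the isolated point scores much higher
print(scores.index(max(scores)))  # 5
```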
Density-based Approaches
• Properties
• LOF ≈ 1: the point lies in a cluster
(a region with homogeneous
density around the point and
its neighbors)
• LOF ≫ 1: the point is an outlier
• Discussion
• Choice of k (MinPts) specifies the reference set
• Originally implements a local approach (resolution depends on the user’s choice for k)
• Outputs a scoring (assigns an LOF value to each point)
Density-based Approaches
• Variants of LOF
• Mining top-n local outliers [Jin et al. 2001]
• Idea:
• Usually, a user is only interested in the top-n outliers
• Do not compute the LOF for all data objects => save runtime
• Method
• Compress data points into micro-clusters using the CFs of BIRCH [Zhang et al. 1996]
• Derive upper and lower bounds of the reachability distances, lrd values, and LOF values for points
within a micro-cluster
• Compute upper and lower bounds of LOF values for micro clusters and sort results w.r.t. ascending
lower bound
• Prune micro clusters that cannot accommodate points among the top-n outliers (n highest LOF
values)
• Iteratively refine remaining micro clusters and prune points accordingly
Density-based Approaches
• Variants of LOF (cont.)
• Connectivity-based outlier factor (COF) [Tang et al. 2002]
• Motivation
• In regions of low density, it may be hard to detect outliers
• Choosing a low value for k is often not appropriate
• Solution
• Treat “low density” and “isolation” differently
Statistical-Based Outlier Detection
(Distribution-based)
• Assumptions:
• Knowledge of data (distribution, mean,
variance)
• Statistical discordancy test
• The data are assumed to fit a working hypothesis; each data object in
the dataset is compared against the working hypothesis and is either
accepted under it or rejected as discordant under an alternative
hypothesis (i.e., an outlier)
Working Hypothesis: H: o_i ∈ F, where i = 1, 2, ..., n
Discordancy Test: is o_i in F within some number of standard deviations?
Alternative Hypotheses:
• Inherent distribution: H': o_i ∈ G, where i = 1, 2, ..., n
• Mixture distribution: H': o_i ∈ (1 − λ)F + λG, where i = 1, 2, ..., n
• Slippage distribution: H': o_i ∈ (1 − λ)F + λF', where i = 1, 2, ..., n
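A discordancy test of this kind can be sketched with the Grubbs statistic G = max|xᵢ − x̄|/s, illustrated on the average-temperature readings used earlier in the unit. Looking up the critical value (which depends on n and the significance level, via a t-distribution table) is deliberately left to the caller here.

```python
import math

def grubbs_statistic(values):
    """G = max_i |x_i - mean| / s, with s the sample standard deviation.
    G is then compared against a critical value for the chosen
    significance level; that table lookup is left to the caller."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    g, suspect = max((abs(x - mean) / s, x) for x in values)
    return g, suspect

values = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
g, suspect = grubbs_statistic(values)
print(suspect, round(g, 2))  # 24.0 2.83
```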
Statistical-Based Outlier detection
(Depth-based)
• Data is organized into layers according to some definition of depth
• Shallow layers are more likely to contain outliers than deep layers
• Can efficiently handle computation for dimensionality k < 4
Statistical-Based Outlier Detection
• Strengths
• Most outlier research has been done in this area, many data
distributions are known
• Weakness
• Almost all of the statistical models are univariate (only handle one
attribute) and those that are multivariate only efficiently handle k<4
• All models assume the distribution is known –this is not always the
case
• Outlier detection is entirely dependent on the distribution chosen
Major Statistical Data Mining Methods
• Regression
• Generalized Linear Model
• Analysis of Variance
• Mixed-Effect Models
• Factor Analysis
• Discriminant Analysis
• Survival Analysis
Statistical Data Mining (1)
• There are many well-established statistical techniques for data
analysis, particularly for numeric data
• applied extensively to data from scientific experiments and data
from economics and the social sciences
Regression
predict the value of a response (dependent)
variable from one or more predictor (independent)
variables where the variables are numeric
forms of regression: linear, multiple, weighted,
polynomial, nonparametric, and robust
Scientific and Statistical Data Mining (2)
• Generalized linear models
• allow a categorical response variable (or some
transformation of it) to be related to a set of predictor
variables
• similar to the modeling of a numeric response variable
using linear regression
• include logistic regression and Poisson regression
Mixed-effect models
For analyzing grouped data, i.e. data that can be classified according to one or
more grouping variables
Typically describe relationships between a response variable and some
covariates in data grouped according to one or more factors
Scientific and Statistical Data Mining (3)
• Regression trees
• Binary trees used for classification and
prediction
• Similar to decision trees: tests are performed at
the internal nodes
• In a regression tree the mean of the objective
attribute is computed and used as the predicted
value
• Analysis of variance
• Analyze experimental data for two or more
populations described by a numeric response
variable and one or more categorical variables
(factors)
Statistical Data Mining (4)
• Factor analysis
• determine which variables are combined to
generate a given factor
• e.g., for many psychiatric data, one can
indirectly measure other quantities (such as test
scores) that reflect the factor of interest
• Discriminant analysis
• predict a categorical response variable,
commonly used in social science
• Attempts to determine several discriminant
functions (linear combinations of the
independent variables) that discriminate among
the groups defined by the response variable
www.spss.com/datamine/factor.htm
Statistical Data Mining (5)
Survival analysis
Predicts the probability that a patient
undergoing a medical treatment
would survive at least to time t (life
span prediction)
Data Mining Applications
• Data mining: A young discipline with broad and diverse applications
• There still exists a nontrivial gap between generic data mining methods and
effective and scalable data mining tools for domain-specific applications
• Some application domains (briefly discussed here)
• Data Mining for Financial data analysis
• Data Mining for Retail and Telecommunication Industries
• Data Mining in Science and Engineering
• Data Mining for Intrusion Detection and Prevention
• Data Mining and Recommender Systems
Data Mining and Recommender Systems
Data Mining for Financial Data Analysis (II)
• Classification and clustering of customers for targeted marketing
• multidimensional segmentation by nearest-neighbor, classification,
decision trees, etc. to identify customer groups or associate a new
customer to an appropriate customer group
• Detection of money laundering and other financial crimes
• integration of data from multiple DBs (e.g., bank transactions, federal/state
crime history DBs)
• Tools: data visualization, linkage analysis, classification, clustering
tools, outlier analysis, and sequential pattern analysis tools (find
unusual access sequences)
Data Mining for Retail & Telcomm. Industries (I)
Data Mining Practice for Retail Industry
• Design and construction of data warehouses
• Multidimensional analysis of sales, customers, products, time, and region
• Analysis of the effectiveness of sales campaigns
• Customer retention: Analysis of customer loyalty
• Use customer loyalty card information to register sequences of purchases of
particular customers
• Use sequential pattern mining to investigate changes in customer consumption
or loyalty
• Suggest adjustments on the pricing and variety of goods
• Product recommendation and cross-reference of items
• Fraud analysis and the identification of unusual patterns
• Use of visualization tools in data analysis
Data Mining in Science and Engineering