Unit 8
Definition
Classification and prediction are two forms of data analysis that can be used to
extract models describing important data classes or to predict future data
trends.
Such analysis can help provide us with a better understanding of the data at
large.
Whereas classification predicts categorical (discrete, unordered) labels,
prediction models continuous-valued functions.
Many classification and prediction methods have been proposed by researchers
in machine learning, pattern recognition, and statistics.
Classification Techniques
Decision Tree Identification
Classification Problem
Predict Play (Yes, No) from Weather.
Weather     Play
Sunny       Yes
Cloudy      Yes/No
Overcast    Yes/No
Decision Tree Identification
Weather     Temperature   Play
Cloudy      Warm          Yes
Cloudy      Chilly        No
Cloudy      Pleasant      Yes
Overcast    Warm
Overcast    Chilly        No
Overcast    Pleasant      Yes
Decision Tree Identification Example:
A top-down technique is used for decision tree identification: starting from the
root, the data is split on one attribute at a time.
The decision tree created is sensitive to the order in which the attributes are
considered.
If an N-item-set does not result in a clear decision, the classification classes
have to be modeled by rough sets.
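The top-down technique above can be sketched in code. The following is a minimal illustration (Python is my choice here; the notes show no code), using a hypothetical reading of the Weather/Temperature table in which attributes are split in a fixed order. Where a node's label set is not pure and no attributes remain, the set of labels is returned, in the spirit of the rough-set remark above.

```python
# A minimal sketch of top-down decision tree identification.
# The fixed `attributes` order illustrates that the resulting tree is
# sensitive to the order in which attributes are considered.

def build_tree(rows, attributes):
    labels = [r["Play"] for r in rows]
    if len(set(labels)) == 1:        # pure node: a clear decision
        return labels[0]
    if not attributes:               # no attributes left: unclear decision,
        return set(labels)           # return the label set (rough-set style)
    attr, rest = attributes[0], attributes[1:]
    tree = {}
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        tree[value] = build_tree(subset, rest)
    return (attr, tree)

# hypothetical training rows, following the notes' example tables
data = [
    {"Weather": "Cloudy",   "Temp": "Warm",     "Play": "Yes"},
    {"Weather": "Cloudy",   "Temp": "Chilly",   "Play": "No"},
    {"Weather": "Cloudy",   "Temp": "Pleasant", "Play": "Yes"},
    {"Weather": "Overcast", "Temp": "Chilly",   "Play": "No"},
    {"Weather": "Overcast", "Temp": "Pleasant", "Play": "Yes"},
]
tree = build_tree(data, ["Weather", "Temp"])
```

Splitting on Temp before Weather would yield a differently shaped tree over the same data, which is exactly the order-sensitivity noted above.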
Clustering Techniques
Clustering partitions the data set into clusters or equivalence classes.
Similarity among members of the same class is greater than similarity among
members of different classes.
Similarity measures: Euclidean distance or other application-specific measures.
Clustering Techniques
Nearest Neighbour Clustering Algorithm:
Given n elements x1, x2, …, xn, and threshold t:
1. j ← 2, k ← 1, cluster1 = {x1}
2. Repeat
I. Find the nearest neighbour of xj among the elements already clustered
II. Let the nearest neighbour be in cluster m
III. If the distance to the nearest neighbour is > t, then create a new cluster
and k ← k + 1; else assign xj to cluster m
IV. j ← j + 1
3. Until j > n
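The steps above can be made concrete. A runnable sketch (Python chosen for illustration), assuming 1-D points and Euclidean distance, with x1 seeding cluster 1:

```python
# Nearest-neighbour clustering: each element joins the cluster of its
# nearest already-clustered neighbour, unless that neighbour is farther
# than the threshold t, in which case a new cluster is created.

def nn_cluster(xs, t):
    clusters = {1: [xs[0]]}       # step 1: k = 1, cluster1 = {x1}
    assignment = [1]              # cluster id of each processed element
    k = 1
    for j in range(1, len(xs)):   # steps 2-3: repeat until j > n
        # steps i-ii: nearest neighbour among already-clustered elements
        nearest = min(range(j), key=lambda i: abs(xs[j] - xs[i]))
        if abs(xs[j] - xs[nearest]) > t:   # step iii: too far -> new cluster
            k += 1
            clusters[k] = [xs[j]]
            assignment.append(k)
        else:                              # else join neighbour's cluster m
            m = assignment[nearest]
            clusters[m].append(xs[j])
            assignment.append(m)
    return clusters

clusters = nn_cluster([1.0, 1.5, 10.0, 10.2, 5.0], t=2.0)
```

Note that the result depends on the order of the input elements, just as the decision tree earlier depended on attribute order.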
Regression
Numeric prediction is the task of predicting continuous (or ordered) values for
a given input.
For example:
We may wish to predict the salary of college graduates with 10 years of work
experience, or the potential sales of a new product given its price.
The most widely used approach for numeric prediction is regression,
a statistical methodology that was developed by Sir Francis Galton (1822-1911), a
mathematician who was also a cousin of Charles Darwin.
Many texts use the terms “regression” and “numeric prediction” synonymously.
Regression analysis can be used to model the relationship between one or more
independent (predictor) variables and a dependent (response) variable (which is
continuous-valued).
In the context of data mining, the predictor variables are the attributes of interest
describing the tuple
The response variable is what we want to predict
Types of Regression
The types of regression are:
Linear Regression
Nonlinear Regression
Linear Regression
Straight-line regression analysis involves a response variable, y, and a single
predictor variable, x.
It is the simplest form of regression, and models y as a linear function of x.
That is,
y=b+wx
Where the variance of y is assumed to be constant, and b and w are regression
coefficients specifying the Y-intercept and slope of the line, respectively.
The regression coefficients, w and b, can also be thought of as weights, so that
we can equivalently write
y=w0+w1x.
Using the method of least squares, the regression coefficients can be estimated
with the following equations:
w1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
w0 = ȳ − w1·x̄
where x̄ is the mean of x1, …, xn and ȳ is the mean of y1, …, yn.
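The least-squares estimates of w0 and w1 can be computed directly from the data. A minimal sketch (Python chosen for illustration):

```python
# Estimate w0 (intercept b) and w1 (slope w) for y = w0 + w1*x
# by the method of least squares.

def fit_line(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # w1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
         / sum((x - x_bar) ** 2 for x in xs)
    w0 = y_bar - w1 * x_bar          # w0 = y_bar - w1 * x_bar
    return w0, w1

# data lying exactly on the line y = 1 + 2x
w0, w1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```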
Nonlinear Regression
In straight-line linear regression, the dependent (response) variable, y, is
modeled as a linear function of a single independent (predictor) variable, x.
What if we could get a more accurate model using a nonlinear model, such as a
parabola or some other higher-order polynomial?
Polynomial regression is often of interest when there is just one predictor variable.
Consider a cubic polynomial relationship given by
y = w0 + w1x + w2x^2 + w3x^3
Nonlinear Regression
In statistics, nonlinear regression is a form of regression analysis in which
observational data are modeled by a function which is a nonlinear combination of the
model parameters and depends on one or more independent variables. The data are
fitted by a method of successive approximations.
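Polynomial regression can also be treated as linear regression on transformed inputs: for y = w0 + w1x + w2x^2 + w3x^3, fit a linear model on the features (1, x, x^2, x^3). A sketch (Python chosen for illustration; the coefficients of a cubic are recovered exactly from four points by Gaussian elimination, with no external libraries assumed):

```python
# Fit y = w0 + w1*x + w2*x^2 + w3*x^3 by solving the linear system
# whose rows are the transformed features (1, x, x^2, x^3).

def fit_cubic(xs, ys):
    n = len(xs)                      # n = 4 points determine a cubic
    # augmented matrix: each row is (1, x, x^2, x^3 | y)
    M = [[x ** p for p in range(n)] + [y] for x, y in zip(xs, ys)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    # back-substitution
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c]
                              for c in range(r + 1, n))) / M[r][r]
    return w   # [w0, w1, w2, w3]

# points generated from y = 1 + 2x + 0.5x^3
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1 + 2 * x + 0.5 * x ** 3 for x in xs]
w = fit_cubic(xs, ys)
```

With more than four points the same feature transformation would be combined with least squares, as in the linear case.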
Clustering
Definition
The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering.
A cluster is a collection of data objects that are similar to one another within the same
cluster and are dissimilar to the objects in other clusters. A cluster of data objects can
be treated collectively as one group and so may be considered as a form of data
compression.
First the set is partitioned into groups based on data similarity (e.g., using
clustering), and then labels are assigned to the relatively small number of groups.
It is also called unsupervised learning. Unlike classification, clustering and
unsupervised learning do not rely on predefined classes and class-labeled training
examples. For this reason, clustering is a form of learning by observation, rather than
learning by examples.
Definition
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection, where outliers (values that are “far
away” from any cluster) may be more interesting than common cases.
Advantages
Advantages of such a clustering-based process:
Adaptable to changes
Helps single out useful features that distinguish different groups.
Applications of Clustering
Market research
Pattern recognition
Data analysis
Image processing
Biology
Geography
Automobile insurance
Outlier detection
K-Means Algorithm
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based
on the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each
cluster;
(5) until no change;
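Steps (1)–(5) can be sketched as follows (Python chosen for illustration), assuming 1-D data and Euclidean distance; for determinism, this sketch takes the first k objects as the initial centers rather than arbitrary ones:

```python
# k-means: alternate between assigning objects to the nearest mean
# and recomputing the means, until the means no longer change.

def k_means(data, k, max_iter=100):
    centers = data[:k]                    # (1) initial cluster centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):             # (2) repeat
        clusters = [[] for _ in range(k)]
        for x in data:                    # (3) (re)assign each object to the
            i = min(range(k),             #     cluster with the nearest mean
                    key=lambda c: abs(x - centers[c]))
            clusters[i].append(x)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:        # (5) until no change
            break
        centers = new_centers             # (4) update the cluster means
    return clusters, centers

clusters, centers = k_means([1.0, 1.1, 0.9, 8.0, 8.2, 7.8], k=2)
```

The result can depend on the choice of initial centers; in practice the algorithm is often run several times with different initializations.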
K-Medoids Algorithm
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects or seeds;
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative
object;
(4) randomly select a nonrepresentative object, o_random;
(5) compute the total cost, S, of swapping representative object, o_j, with
o_random;
(6) if S < 0 then swap o_j with o_random to form the new set of k representative
objects;
(7) until no change;
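The swap-based search above can be sketched as follows (Python chosen for illustration), on 1-D data with Euclidean distance. For determinism, this sketch tries every nonrepresentative object as a candidate instead of picking a random o_random, which is a simplification of step (4):

```python
# k-medoids: keep k actual objects as representatives; accept a swap of a
# representative with a nonrepresentative object whenever it lowers the
# total cost (S < 0), until no improving swap remains.

def total_cost(data, medoids):
    # total distance of every object to its nearest representative
    return sum(min(abs(x - m) for m in medoids) for x in data)

def k_medoids(data, k):
    medoids = data[:k]                 # (1) first k objects as representatives
    improved = True
    while improved:                    # (2)/(7) repeat until no change
        improved = False
        for i in range(k):
            for o in data:             # candidate replacement (deterministic
                if o in medoids:       # stand-in for the random o_random)
                    continue
                trial = medoids[:i] + [o] + medoids[i + 1:]
                # (5)-(6): S < 0 means cost after the swap < cost before
                if total_cost(data, trial) < total_cost(data, medoids):
                    medoids = trial
                    improved = True
    return medoids

medoids = k_medoids([1.0, 1.2, 0.8, 9.0, 9.2, 8.8], k=2)
```

Because the representatives are actual objects rather than means, k-medoids is less sensitive to outliers than k-means.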
Bayesian Classification
Bayesian classification is based on Bayes’ theorem.
Studies comparing classification algorithms have found a simple Bayesian classifier
known as the naïve Bayesian classifier to be comparable in performance with decision
tree and selected neural network classifiers.
Bayesian classifiers have also exhibited high accuracy and speed when applied to
large databases.
Naïve Bayesian classifiers assume that the effect of an attribute value on a given
class is independent of the values of the other attributes. This assumption is called
class conditional independence.
Bayesian belief networks are graphical models, which unlike naïve Bayesian
classifiers, allow the representation of dependencies among subsets of attributes.
Bayes’ Theorem
Let X be a data tuple and let H be a hypothesis, such as that X belongs to a
specified class C. Bayes’ theorem gives the posterior probability of H
conditioned on X:
P(H|X) = P(X|H) · P(H) / P(X)
where P(H) is the prior probability of H, P(X|H) is the probability of X
conditioned on H, and P(X) is the prior probability of X.
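The naïve Bayesian classifier described above scores each class C by P(C) · Π P(xi|C), applying class conditional independence. A minimal sketch (Python chosen for illustration), assuming categorical attributes and a tiny hypothetical weather data set:

```python
# Naive Bayesian classifier for categorical attributes: estimate the
# prior P(C) and the conditionals P(x_a | C) from counts, then pick the
# class maximizing P(C) * product of P(x_a | C).

from collections import Counter, defaultdict

def train(rows, label_key):
    class_counts = Counter(r[label_key] for r in rows)
    cond = defaultdict(Counter)   # (class, attribute) -> value counts
    for r in rows:
        for a, v in r.items():
            if a != label_key:
                cond[(r[label_key], a)][v] += 1
    return class_counts, cond, len(rows)

def classify(x, class_counts, cond, n):
    best, best_p = None, -1.0
    for c, cc in class_counts.items():
        p = cc / n                    # prior P(C)
        for a, v in x.items():        # times each P(x_a | C); an unseen
            p *= cond[(c, a)][v] / cc # value gives 0 in this simple sketch
        if p > best_p:
            best, best_p = c, p
    return best

# hypothetical training rows
rows = [
    {"Weather": "Sunny",  "Temp": "Warm",   "Play": "Yes"},
    {"Weather": "Sunny",  "Temp": "Chilly", "Play": "Yes"},
    {"Weather": "Cloudy", "Temp": "Chilly", "Play": "No"},
    {"Weather": "Cloudy", "Temp": "Warm",   "Play": "Yes"},
]
model = train(rows, "Play")
prediction = classify({"Weather": "Sunny", "Temp": "Warm"}, *model)
```

Real implementations avoid the zero-probability problem for unseen attribute values with a Laplacian correction (adding 1 to each count).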