Classification vs. Prediction
Classification: Definition
Classification is the data analysis task in which a model or classifier is constructed to
predict class (categorical) labels, such as “safe” or “risky” for
the loan application data; “yes” or “no” for the marketing data;
or “treatment A,” “treatment B,” or “treatment C” for the medical
data. These categories can be represented by discrete values,
where the ordering among values has no meaning. For
example, the values 1, 2, and 3 may be used to represent
treatments A, B, and C, where there is no ordering implied
among this group of treatment regimes.
Prediction: Definition
Suppose that the marketing manager wants to
predict how much a given customer will spend
during a sale at AllElectronics. This data
analysis task is an example of numeric
prediction, where the model constructed
predicts a continuous-valued function, or
ordered value, as opposed to a class label. This
model is a predictor. Regression analysis is a
statistical methodology that is most often
used for numeric prediction
General Approach to Classification
Data classification is a two-step process,
consisting of a learning step (where a
classification model is constructed) and a
classification step (where the model is used to
predict class labels for given data). The
process is shown for the loan application data
Because the class label of each training tuple is provided, this step is also known as
supervised learning.
Naïve Bayesian Classification (Example)
P(X|buys computer = yes) = 0.044. Similarly,
P(X|buys computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019. To find the class, Ci ,
that maximizes P(X|Ci )P(Ci ), we compute
P(X|buys computer = yes)P(buys computer = yes) = 0.044 × 0.643 = 0.028
P(X|buys computer = no)P(buys computer = no) = 0.019 × 0.357 = 0.007
Therefore, the naïve Bayesian classifier predicts buys computer = yes for tuple X.
The normalized posterior probabilities are
P(buys computer = yes | X) = 0.028 / (0.028 + 0.007) = 0.80
P(buys computer = no | X) = 0.007 / (0.028 + 0.007) = 0.20
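A minimal sketch of this decision rule in Python, using the likelihoods and priors from the example above (the dictionary layout is an illustrative assumption):

    # Class-conditional likelihoods P(X|Ci) and class priors P(Ci) from the example
    likelihood = {"yes": 0.044, "no": 0.019}
    prior = {"yes": 0.643, "no": 0.357}

    # Unnormalized posterior P(X|Ci) * P(Ci) for each class
    scores = {c: likelihood[c] * prior[c] for c in likelihood}    # yes: 0.028, no: 0.007

    # The predicted class is the one that maximizes P(X|Ci) * P(Ci)
    prediction = max(scores, key=scores.get)                      # "yes"

    # Normalized posteriors (sum to 1)
    total = sum(scores.values())
    posteriors = {c: round(scores[c] / total, 2) for c in scores}  # yes: 0.80, no: 0.20
    print(prediction, posteriors)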
Rule-based classifier
A rule-based classifier uses a set of IF-THEN rules for
classification. An IF-THEN rule is an expression of the form IF condition THEN conclusion.
The “IF” part (or left side) of a rule is known as the rule
antecedent or precondition. The “THEN” part (or right
side) is the rule consequent. In the rule antecedent, the
condition consists of one or more attribute tests (e.g., age =
youth and student = yes) that are logically ANDed.
The rule’s consequent contains a class prediction (in this case, we are
predicting whether a customer will buy a computer). For example,
R1: IF age = youth AND student = yes THEN buys computer = yes.
R1 can also be written as
R1: (age = youth) ∧ (student = yes) ⇒ (buys computer = yes).
If the condition (i.e., all the attribute tests) in a rule antecedent holds true for
a given tuple, we say that the rule antecedent is satisfied (or simply, that
the rule is satisfied) and that the rule covers the tuple.
A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a
class-labeled data set, D, let ncovers be the number of tuples covered by R;
ncorrect be the number of tuples correctly classified by R; and |D| be the number
of tuples in D. We can define the coverage and accuracy of R as
coverage(R) = ncovers / |D|
accuracy(R) = ncorrect / ncovers
Example 8.6 Rule accuracy and coverage. Let’s
go back to our data in Table 8.1. These are
class-labeled tuples from the AllElectronics
customer database. Our task is to predict
whether a customer will buy a computer.
Consider rule R1, which covers 2 of the 14
tuples. It can correctly classify both tuples.
Therefore, coverage(R1) = 2/14 = 14.28% and
accuracy(R1) = 2/2 = 100%.
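A minimal sketch of these two measures in Python; the tuple representation (dictionaries) and the rule below are illustrative assumptions, not the textbook's data structures:

    def rule_r1(t):
        # Antecedent of R1: age = youth AND student = yes
        return t["age"] == "youth" and t["student"] == "yes"

    def coverage_and_accuracy(rule, predicted_class, data, label_key="buys_computer"):
        covered = [t for t in data if rule(t)]                        # tuples covered by the rule
        correct = [t for t in covered if t[label_key] == predicted_class]
        coverage = len(covered) / len(data)                           # n_covers / |D|
        accuracy = len(correct) / len(covered) if covered else 0.0    # n_correct / n_covers
        return coverage, accuracy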
Backpropagation
Backpropagation is a neural network learning
algorithm.
The neural networks field was originally kindled by psychologists and neurobiologists
who sought to develop and test computational analogs of neurons.
The target value may be the known class label of the training tuple (for classification
problems) or a continuous value (for numeric prediction).
For each training tuple, the weights are modified so as to minimize the mean-squared
error between the network’s prediction and the actual target value.
These modifications are made in the “backwards” direction (i.e., from the output layer)
through each hidden layer down to the first hidden layer (hence the name
backpropagation).
Although it is not guaranteed, in general the weights will eventually converge, and the
learning process stops.
The steps are described below.
Initialize the weights: The weights in the network are initialized to small random
numbers (e.g., ranging from −1.0 to 1.0, or −0.5 to 0.5). Each unit has a bias associated with
it. The biases are similarly initialized to small random numbers.
Propagate the inputs forward: First, the training tuple is fed to the network’s input
layer. The inputs pass through the input units, unchanged. That is, for an input unit, j,
its output, Oj, is equal to its input value, Ij.
Next, the net input and output of each unit in the hidden and output layers are
computed. The net input to a unit in the hidden or output layers is computed as a
linear combination of its inputs. To help illustrate this point, a hidden layer or output
layer unit is shown in Figure. Each such unit has a number of inputs to it that are, in
fact, the outputs of the units connected to it in the previous layer. Each connection has
a weight. To compute the net input to the unit, each input connected to the unit is
multiplied by its corresponding weight, and this is summed. Given a unit, j, in a hidden
or output layer, the net input, Ij, to unit j is
Ij = Σi wij Oi + θj,
where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is
the output of unit i from the previous layer; and θj is the bias of unit j. The output of
unit j, Oj, is then computed by applying an activation function (commonly the sigmoid
function) to the net input: Oj = 1 / (1 + e^(−Ij)).
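A minimal sketch of this computation for a single unit, with the inputs, weights, and bias chosen to match the worked backpropagation problem later in these notes:

    import math

    # Net input I_j = sum_i(w_ij * O_i) + theta_j
    def net_input(prev_outputs, weights, bias):
        return sum(w * o for w, o in zip(weights, prev_outputs)) + bias

    # Sigmoid activation used to compute the unit's output O_j from its net input
    def sigmoid(net):
        return 1.0 / (1.0 + math.exp(-net))

    prev_outputs = [0.05, 0.10]   # outputs of the previous (input) layer
    weights = [0.15, 0.20]        # connection weights into unit j
    bias = 0.35                   # bias theta_j of unit j
    I_j = net_input(prev_outputs, weights, bias)   # 0.3775
    O_j = sigmoid(I_j)                             # about 0.5933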
Support vector machines
• Support vector machines were invented by V. Vapnik and
his co-workers in the 1970s in Russia and became known to
the West in 1992.
• SVMs are linear classifiers that find a hyperplane to
separate two classes of data, positive and negative.
• SVMs not only have a rigorous theoretical foundation, but
also perform classification more accurately than most
other methods in applications, especially for
high-dimensional data.
• It is perhaps the best classifier for text classification.
Basic concepts
• Let the set of training examples D be
{(x1, y1), (x2, y2), …, (xr, yr)},
where xi = (xi1, xi2, …, xin) is an input vector in a real-valued
space X ⊆ Rn and yi is its class label (output value), yi ∈ {1, −1}
(1: positive class and −1: negative class).
• SVM finds a linear function of the form (w: weight vector)
f(x) = w · x + b
so that
yi = +1 if w · xi + b ≥ 0
yi = −1 if w · xi + b < 0
The hyperplane
• The hyperplane that separates positive and negative
training data is
w · x + b = 0
• It is also called the decision boundary (surface).
• There are many possible separating hyperplanes; which one should we choose?
Let the data set D be given as (X1, y1),(X2, y2), ... , (X|D|, y|D|), where
each Xi is a training tuple with an associated class label, yi. Each yi
can take one of two values, either +1 or −1 (i.e., yi ∈ {+1, −1}),
corresponding to the classes buys computer = yes and buys computer =
no, respectively.
Margin: we can say that the shortest distance from a hyperplane to one side of its
margin is equal to the shortest distance from the hyperplane to the other side of its
margin, where the “sides” of the margin are parallel to the hyperplane.
A separating hyperplane can be written as
W ·X + b = 0, (9.12)
where W is a weight vector, namely, W = {w1, w2,..., wn}; n is the
number of attributes; and b is a scalar, often referred to as a bias. To
aid in visualization, let’s consider two input attributes, A1 and A2, as
in Figure 9.8(b). Training tuples are 2-D (e.g., X = (x1, x2)), where x1
and x2 are the values of attributes A1 and A2, respectively, for X.
If we think of b as an additional weight, w0, we can rewrite Eq.
(9.12) as
w0 + w1x1 + w2x2 = 0. (9.13)
Thus, any point that lies above the separating hyperplane satisfies
w0 + w1x1 + w2x2 > 0. (9.14)
Similarly, any point that lies below the separating hyperplane satisfies
w0 + w1x1 + w2x2 < 0. (9.15)
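A minimal sketch of this decision rule in Python; the weight values below are illustrative assumptions, not weights learned from data:

    def classify(x1, x2, w0=-0.5, w1=1.0, w2=1.0):
        # Returns +1 for points on or above the hyperplane w0 + w1*x1 + w2*x2 = 0,
        # and -1 for points below it.
        score = w0 + w1 * x1 + w2 * x2
        return 1 if score >= 0 else -1

    print(classify(1.0, 1.0))   # lies above the hyperplane -> +1
    print(classify(0.0, 0.0))   # lies below the hyperplane -> -1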
The Case When the Data Are Linearly Inseparable
We obtain a nonlinear SVM by extending the approach for linear SVMs as follows.
There are two main steps.
In the first step, we transform the original input data into a higher dimensional space
using a nonlinear mapping.
Once the data have been transformed into the new, higher-dimensional space, the second
step searches for a linear separating hyperplane in the new space.
We again end up with a quadratic optimization problem that can be solved using
the linear SVM formulation.
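A hedged sketch of this idea using scikit-learn's SVC (assuming scikit-learn is installed); the tiny XOR-like data set is illustrative and is not linearly separable in the original 2-D space:

    from sklearn.svm import SVC

    # Four points forming an XOR pattern: not separable by any line in 2-D
    X = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y = [-1, 1, 1, -1]

    # The RBF kernel implicitly maps the data to a higher-dimensional space,
    # where a linear separating hyperplane can be found.
    clf = SVC(kernel="rbf", C=10.0, gamma=2.0)
    clf.fit(X, y)
    print(clf.predict([[0, 1], [1, 1]]))   # should print [ 1 -1]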
K-Nearest Neighbor (KNN)
Lazy learning algorithm − KNN is a lazy learning algorithm because it has no
specialized training phase; it simply stores all of the training data and uses it at classification time.
Step 1 − Load the training data and the test data.
Step 2 − Next, we need to choose the value of K, i.e., the number of nearest data points to
consider. K can be any positive integer.
Step 3 − For each point in the test data, do the following.
3.1 − Calculate the distance between the test point and each row of the training data using a
distance metric such as Euclidean, Manhattan, or Hamming distance. Euclidean distance is the
most commonly used.
3.2 − Sort the training rows in ascending order of their distance values.
3.3 − Choose the top K rows from the sorted array.
3.4 − Assign a class to the test point based on the most frequent class among these K rows.
Step 4 − End
Example
The following is an example that illustrates the choice of K and the working of the KNN algorithm.
Consider a test point, shown as a black dot in the accompanying diagram, together with its three
nearest neighbors. Two of the three neighbors belong to the red class, so the black dot is also
assigned to the red class.
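A minimal sketch of these steps in Python; the small 2-D data set and the value K = 3 are illustrative assumptions:

    import math
    from collections import Counter

    def knn_predict(train_X, train_y, test_point, k=3):
        # Step 3.1: Euclidean distance from the test point to every training row
        distances = [(math.dist(test_point, x), label) for x, label in zip(train_X, train_y)]
        # Steps 3.2 and 3.3: sort by distance and keep the K closest rows
        k_nearest = sorted(distances, key=lambda d: d[0])[:k]
        # Step 3.4: assign the most frequent class among the K nearest neighbors
        return Counter(label for _, label in k_nearest).most_common(1)[0][0]

    train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 7), (6, 7)]
    train_y = ["red", "red", "red", "blue", "blue", "blue"]
    print(knn_predict(train_X, train_y, (2, 2), k=3))   # expected: red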
Linear & Non Linear Regression
Regression is a supervised machine learning technique that predicts a continuous-valued
attribute. It analyses the relationship between a target (dependent) variable and its
predictor (independent) variables. Regression is an important tool for data analysis and
can be used for time series modelling, forecasting, and other tasks.
Y = a + b*X + e,
where a is the intercept, b is the slope of the regression line and e is the error. X and Y are
the predictor and target variables, respectively. When X consists of more than one
variable (or feature), the model is termed multiple linear regression.
If a regression equation doesn’t follow the rules for a linear model, then it must be a
nonlinear model. It’s that simple! A nonlinear model is literally not linear.
In a linear model, Y is a linear function of X: the value of Y increases or decreases in a
linear manner as the value of X changes.
The following table shows the survey results from the 7 online stores.

Online Store   Monthly E-commerce Sales (in 1000s)   Online Advertising Dollars (in 1000s)
1              368                                    1.7
2              340                                    1.5
3              665                                    2.8
4              954                                    5
5              331                                    1.3
6              556                                    2.2
7              376                                    1.3
We can see that there is a positive relationship between the monthly e-commerce sales (Y) and
online advertising costs (X).
The positive correlation means that the values of the dependent variable (y) increase when the
values of the independent variable (x) rise.
So, if we want to predict the monthly e-commerce sales from the online advertising costs, the
higher the value of advertising costs, the higher our prediction of sales.
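A minimal sketch fitting Y = a + b*X to the survey data above (assumes NumPy is available):

    import numpy as np

    X = np.array([1.7, 1.5, 2.8, 5.0, 1.3, 2.2, 1.3])   # online advertising dollars (in 1000s)
    Y = np.array([368, 340, 665, 954, 331, 556, 376])   # monthly e-commerce sales (in 1000s)

    b, a = np.polyfit(X, Y, deg=1)     # least-squares slope b and intercept a
    print(f"Y = {a:.1f} + {b:.1f}*X")
    print(a + b * 4.0)                 # predicted sales for $4000 of advertising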
A simple nonlinear (polynomial) regression model takes a form such as
Y = a + b*X^2
In this case the line of best fit is not a straight line, as in linear regression; instead, it
is a curve fitted to the data points.
Implementing polynomial regression can result in over-fitting if you keep trying to reduce
the error by making the curve more complex. Hence, always try to fit a curve that
generalizes well to the problem.
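A minimal sketch of polynomial regression with NumPy; the data points below are an illustrative assumption:

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.1, 4.5, 9.2, 16.8, 25.3])   # roughly quadratic in X

    coeffs = np.polyfit(X, Y, deg=2)    # fits Y = c2*X^2 + c1*X + c0
    curve = np.poly1d(coeffs)
    print(curve(6.0))                   # prediction for X = 6
    # Raising the degree (e.g., deg=4) would pass through all five points exactly,
    # illustrating the over-fitting risk described above.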
Logistic Regression
Logistic regression is used when the dependent variable is binary (True or False, 0
or 1, success or failure). The model's output ranges from 0 to 1 and is interpreted as a
probability, so logistic regression is popularly used for classification problems. Unlike
linear regression, it does not require a linear relationship between the dependent and
independent variables.
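A minimal sketch of logistic regression with scikit-learn (assumed available); the toy data mapping hours studied to pass/fail is illustrative:

    from sklearn.linear_model import LogisticRegression

    X = [[0.5], [1.0], [1.5], [2.0], [3.0], [4.0], [5.0], [6.0]]   # hours studied
    y = [0, 0, 0, 0, 1, 1, 1, 1]                                   # 0 = fail, 1 = pass

    model = LogisticRegression()
    model.fit(X, y)
    print(model.predict([[2.5]]))          # predicted class label (0 or 1)
    print(model.predict_proba([[2.5]]))    # class probabilities, each between 0 and 1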
Problem Using Backpropagation
Input values
X1=0.05
X2=0.10
Initial weights
w1 = 0.15   w5 = 0.40
w2 = 0.20   w6 = 0.45
w3 = 0.25   w7 = 0.50
w4 = 0.30   w8 = 0.55
Bias Values
b1=0.35 b2=0.60
Target Values
T1=0.01
T2=0.99
Forward Pass
To find the net input of H1, we multiply the inputs by their weights and add the bias:
H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775
To calculate the final output of H1, we apply the sigmoid function:
out_H1 = 1 / (1 + e^(−0.3775)) = 0.5933
Similarly, the net input to H2 is
H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925
Applying the sigmoid function again gives out_H2 = 1 / (1 + e^(−0.3925)) = 0.5969.
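A minimal sketch of this forward pass in Python, using the inputs, weights, and biases given above:

    import math

    x1, x2 = 0.05, 0.10
    w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
    b1 = 0.35

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    H1 = x1 * w1 + x2 * w2 + b1                  # 0.3775
    H2 = x1 * w3 + x2 * w4 + b1                  # 0.3925
    out_H1, out_H2 = sigmoid(H1), sigmoid(H2)    # about 0.5933 and 0.5969
    print(H1, H2, out_H1, out_H2)
    # The output-layer values y1 and y2 would be computed the same way from out_H1
    # and out_H2 using w5..w8 and b2, and then compared with the targets T1 and T2.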