7. Simple Classification
Naïve Rule
Classify all records as the majority class. Not a real method; it is introduced so that it can serve as a benchmark against which to measure other results.
[Pivot of Charges (Y/N) by Size (S/L) for the fraud example: 60% of firms are truthful, so the naïve rule classifies every firm as truthful.]
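A minimal sketch of the benchmark, assuming the ten firms (six truthful, four fraudulent, as in the worked example later) are held in a pandas DataFrame; the column name is illustrative:

```python
import pandas as pd

# Ten-firm fraud example: 6 truthful, 4 fraudulent (column name is illustrative)
firms = pd.DataFrame({"outcome": ["truthful"] * 6 + ["fraud"] * 4})

# Naive rule: predict the majority class for every record
majority_class = firms["outcome"].mode()[0]            # 'truthful'
predictions = [majority_class] * len(firms)

# Benchmark error rate = share of records not in the majority class (here 40%)
error_rate = (firms["outcome"] != majority_class).mean()
print(majority_class, error_rate)
```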
Nave Bayes
For a given new record to be classified, find other records like it (i.e., with the same values for the predictors). What is the prevalent class among those records? Assign that class to your new record.
Usage
Requires categorical variables; numerical variables must be binned and converted to categorical. Can be used with very large data sets. Example: spell check, where the computer attempts to assign your misspelled word to an established class (i.e., a correctly spelled word).
Relies on finding other records that share the same predictor values as the record to be classified. We want the probability of belonging to class C, given specified values of the predictors: the conditional probability P(Y = C | X1 = x1, ..., Xp = xp).
Two predictors: prior pending legal charges (yes/no) and size of firm (small/large). Classify based on the majority in each cell.
[Pivot of Charges (Y/N) by Size (S/L): classifying by the majority in each cell gives an error rate of 20%.]
Goal: classify (as fraudulent or as truthful) a small firm with charges filed. There are 2 firms like that, one fraudulent and the other truthful, so P(fraud | charges = y, size = small) = 1/2 = 0.50. Note: the calculation is limited to the two firms matching those characteristics.
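A sketch of this "exact" calculation in pandas; the encoding of the ten firms is illustrative but reproduces the cell counts used in the example:

```python
import pandas as pd

# Ten-firm example: prior charges (y/n), size (small/large), outcome (fraud/truthful)
firms = pd.DataFrame({
    "charges": ["y", "y", "y", "y", "n", "n", "n", "n", "n", "n"],
    "size":    ["small", "small", "large", "large", "small", "small",
                "small", "large", "large", "large"],
    "outcome": ["fraud", "truthful", "fraud", "fraud", "truthful", "truthful",
                "truthful", "truthful", "truthful", "fraud"],
})

# Keep only the records that exactly match the new firm (charges = y, size = small)
matches = firms[(firms["charges"] == "y") & (firms["size"] == "small")]

# P(fraud | charges = y, size = small), estimated from the matching records only
p_fraud = (matches["outcome"] == "fraud").mean()
print(len(matches), p_fraud)   # 2 matching firms, probability 0.50
```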
Problem
Even with large data sets, it may be hard to find other records that exactly match your record in terms of predictor values.
Assume independence of the predictor variables (within each class) and use the multiplication rule. This lets us find the same probability that the record belongs to class C, given its predictor values, without limiting the calculation to records that share all those same values.
Main idea: instead of looking at combinations of predictors (a crossed pivot table), look at each predictor separately. How can this be done? A probability trick!
Based on Bayes rule. Then make a simplifying assumption, and get a powerful classifier!
Conditional Probability
A = the event X = a; B = the event Y = b. P(A | B) denotes the probability of A given B (the conditional probability that A occurs given that B occurred).
P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0
P(B | A) = P(A | B) P(B) / P(A)
P(Fraud | Charge) P(Charge) = P(Charge | Fraud) P(Fraud), so P(Fraud | Charge) = P(Charge | Fraud) P(Fraud) / P(Charge)
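A quick numerical check, using the counts from the worked fraud example later in the section (4 of 10 firms have charges filed; 3 of the 4 fraudulent firms have charges):

```python
# Bayes rule: P(Fraud | Charge) = P(Charge | Fraud) * P(Fraud) / P(Charge)
p_charge_given_fraud = 3 / 4   # 3 of the 4 fraudulent firms have charges filed
p_fraud = 4 / 10               # 4 of 10 firms are fraudulent
p_charge = 4 / 10              # 4 of 10 firms have charges filed

p_fraud_given_charge = p_charge_given_fraud * p_fraud / p_charge
print(p_fraud_given_charge)    # 0.75, matching the direct count: 3 of 4 charged firms are fraudulent
```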
We want to estimate P(Y = 1 | X1, ..., Xp), but we don't have enough examples of each possible profile x1, ..., xp in the training set. If we had P(X1, ..., Xp | Y = 1) instead, we could decompose it into P(X1 | Y = 1) P(X2 | Y = 1) ... P(Xp | Y = 1).
This is true if we can assume independence between X1, ..., Xp within each class. That means we could use single pivot tables! If the dependence is not extreme, it will still work reasonably well.
Independence Assumption
With the independence assumption, P(A ∩ B) = P(A) P(B). We can thus calculate
P(X1, ..., Xp | Y = 1) = P(X1 | Y = 1) P(X2 | Y = 1) ... P(Xp | Y = 1)
P(X1, ..., Xp | Y = 0) = P(X1 | Y = 0) P(X2 | Y = 0) ... P(Xp | Y = 0)
P(X1, ..., Xp) = P(X1, ..., Xp | Y = 1) P(Y = 1) + P(X1, ..., Xp | Y = 0) P(Y = 0)
1. All predictors must be categorical. From the training set, create a pivot table of Y against each separate X. From these we obtain P(X), P(X | Y = 1), P(X | Y = 0).
2. For a to-be-predicted observation with predictors X1, X2, ..., Xp, the software computes the probability of belonging to Y = 1 using the formula
   P(Y = 1 | X1, ..., Xp) = P(X1 | Y = 1) P(X2 | Y = 1) ... P(Xp | Y = 1) P(Y = 1) / P(X1, ..., Xp)
   Each probability in the formula is estimated from a pivot table, and P(Y = 1) is estimated as the proportion of 1s in the training set.
3. Use a cutoff to determine the classification of the observation. Default: cutoff = 0.5 (classify to the group that is most likely).
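A sketch of these three steps in pandas; the DataFrame `train` and its column names are assumed for illustration, and any set of categorical predictors would work:

```python
import pandas as pd

def naive_bayes_prob(train, y_col, new_record, positive=1):
    """Estimate P(Y = positive | X1, ..., Xp) from per-predictor pivot tables."""
    # Step 1: from the training set, build one pivot table of Y against each predictor
    cond_tables = {
        x: pd.crosstab(train[x], train[y_col], normalize="columns")  # P(X = value | class)
        for x in new_record
    }
    priors = train[y_col].value_counts(normalize=True)               # P(Y = class)

    # Step 2: multiply P(Xj | Y = c) over the predictors, times the prior P(Y = c)
    numerators = {}
    for c in priors.index:
        p = priors[c]
        for x, value in new_record.items():
            p *= cond_tables[x].loc[value, c]
        numerators[c] = p

    # P(X1, ..., Xp) = sum over classes of P(X1, ..., Xp | Y = c) P(Y = c)
    return numerators[positive] / sum(numerators.values())

# Step 3 (illustrative usage): classify with the default cutoff of 0.5
# prob = naive_bayes_prob(train, "fraud", {"charges": "y", "size": "small"}, positive=1)
# label = 1 if prob >= 0.5 else 0
```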
Note that the probability estimate does not differ greatly from the exact one. All records are used in the calculations, not just those matching the predictor values; this makes the calculations practical in most circumstances. The method relies on the assumption of independence between predictor variables within each class.
Independence Assumption
Not strictly justified (variables are often correlated with one another), but often good enough.
Worked example: exact versus naïve Bayes estimates for the ten-firm fraud data.

Counts of (truthful, fraudulent) firms by Charges and Size:

            Small   Large   Total
Charges Y   (1,1)   (0,2)   (1,3)
Charges N   (3,0)   (2,1)   (5,1)
Total       (4,1)   (2,3)   (6,4)

Exact estimates from the matching cells: P(F | C=y, S=small) = 0.5 and P(F | C=n, S=small) = 0.

Naïve Bayes pieces for the fraud class: P(C=y | F) = 0.75, P(C=n | F) = 0.25, P(S=small | F) = 0.25, P(S=large | F) = 0.75, P(F) = 0.40. For example, P(C=y, S=small | F) P(F) = 0.75 * 0.25 * 0.40 = 0.075; the other cells are 0.225 (y, large), 0.025 (n, small), and 0.075 (n, large).

For the truthful class: P(C=y | T) = 0.17, P(C=n | T) = 0.83, P(S=small | T) = 0.67, P(S=large | T) = 0.33, P(T) = 0.60, giving P(C, S | T) P(T) of 0.067 (y, small), 0.034 (y, large), 0.334 (n, small), and 0.164 (n, large).

Combining: P(F | C, S) = P(C, S | F) P(F) / P(C, S) = P(C | F) P(S | F) P(F) / P(C, S), where P(C, S) = P(C, S | F) P(F) + P(C, S | T) P(T). For a small firm with charges filed this gives 0.075 / (0.075 + 0.067), about 0.53, close to the exact value of 0.50.
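The same numbers can be checked with a few lines of arithmetic, using the values from the tables above:

```python
# Naive Bayes estimate for a small firm with charges filed
p_f = 0.40                            # P(F): 4 of 10 firms are fraudulent
num_fraud    = (3/4) * (1/4) * p_f    # P(C=y|F) * P(S=small|F) * P(F) = 0.075
num_truthful = (1/6) * (4/6) * 0.60   # P(C=y|T) * P(S=small|T) * P(T) ~ 0.067

p_cs = num_fraud + num_truthful       # P(C=y, S=small) under the independence assumption
print(num_fraud / p_cs)               # ~0.53, close to the exact estimate of 0.50
```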
The good: simple; can handle a large number of predictors; high performance accuracy when the goal is ranking; pretty robust to the independence assumption!
The bad: need to categorize continuous predictors; predictors with rare categories lead to zero probabilities (if such a category is important, this is a problem); gives biased probabilities of class membership; no insight about the importance/role of each predictor.
XLMiner output (sheet NNB-Output1) for an example with two binary predictors, Online and CreditCard, and outcome accept:
Prior: P(accept = 1) = 0.095
Conditional probabilities for class 1: P(Online = 0 | 1) = 0.374, P(Online = 1 | 1) = 0.626, P(CreditCard = 0 | 1) = 0.699, P(CreditCard = 1 | 1) = 0.301. The corresponding table for class 0 is also reported.
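A table like this could be reproduced in pandas along the following lines; the DataFrame and column names in the usage comment are hypothetical, and the numbers above come from the XLMiner run:

```python
import pandas as pd

def nb_summary(train, y_col, x_cols):
    """Prior and per-predictor conditional probability tables, in the style shown above."""
    prior = train[y_col].value_counts(normalize=True)                  # e.g. P(accept = 1)
    cond = {
        x: pd.crosstab(train[y_col], train[x], normalize="index")      # P(X = value | class)
        for x in x_cols
    }
    return prior, cond

# Hypothetical usage on the training partition:
# prior, cond = nb_summary(train_df, "accept", ["Online", "CreditCard"])
# prior[1]               -> about 0.095 in the run shown above
# cond["Online"].loc[1]  -> P(Online = 0 | accept = 1), P(Online = 1 | accept = 1)
```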
Scored validation data (sheet NNB-ValidScore1), first records:
Row Id   Predicted Class   Actual Class   Online   CreditCard
2        0                 0              0        0
3        0                 0              0        0
7        0                 0              1        0
8        0                 0              0        1
11       0                 0              0        0
13       0                 0              0        0
14       0                 0              1        0
15       0                 0              0        0
16       0                 0              1        1
K-Nearest Neighbors
Basic Idea
For a given record to be classified, identify nearby records. "Near" means records with similar predictor values X1, X2, ..., Xp. Classify the record as whatever the predominant class is among the nearby records (the neighbors).
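A minimal from-scratch sketch of the idea; Euclidean distance is assumed, and in practice the predictors should be on comparable scales (or standardized) before computing distances:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify x_new as the predominant class among its k nearest training records."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # distance to every training record
    nearest = np.argsort(dists)[:k]                   # indices of the k closest records
    votes = Counter(y_train[i] for i in nearest)      # count classes among the neighbors
    return votes.most_common(1)[0][0]                 # majority class wins

# Example usage: X_train is an (n, p) array of predictor values, y_train the known classes
# label = knn_classify(X_train, y_train, np.array([60.0, 18.4]), k=3)
```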
Choosing k
k = 1 means use the single nearest record; k = 5 means use the 5 nearest records.
Typically choose that value of k which has lowest error rate in validation data
[Scatter plot of predictors X1 and X2 illustrating classification of a new record by its k = 3 nearest neighbors.]
Low values of k (1, 3, ...) capture local structure in the data (but also noise). High values of k provide more smoothing and less noise, but may miss local structure. Note: the extreme case of k = n (i.e., the entire data set) is the same thing as the naïve rule (classify all records according to the majority class).
Data: 24 households classified as owning or not owning riding mowers. Predictors: Income, Lot Size.
Income   Lot_Size   Ownership
60.0     18.4       owner
85.5     16.8       owner
64.8     21.6       owner
61.5     20.8       owner
87.0     23.6       owner
110.1    19.2       owner
108.0    17.6       owner
82.8     22.4       owner
69.0     20.0       owner
93.0     20.8       owner
51.0     22.0       owner
81.0     20.0       owner
75.0     19.6       non-owner
52.8     20.8       non-owner
64.8     17.2       non-owner
43.2     20.4       non-owner
84.0     17.6       non-owner
49.2     17.6       non-owner
59.4     16.0       non-owner
66.0     18.4       non-owner
47.4     16.4       non-owner
33.0     18.8       non-owner
51.0     14.0       non-owner
63.0     14.8       non-owner
XLMiner Output
For each record in the validation data (6 records), XLMiner finds neighbors among the training data (18 records). The record is scored for k = 1, 2, ..., 18. The best k seems to be k = 8; k = 9, k = 10, and k = 14 also share the low error rate, but it is best to choose the lowest such k.
Value of k   % Error Training   % Error Validation
1            0.00               33.33
2            16.67              33.33
3            11.11              33.33
4            22.22              33.33
5            11.11              33.33
6            27.78              33.33
7            22.22              33.33
8            22.22              16.67   <-- best k
9            22.22              16.67
10           22.22              16.67
11           16.67              33.33
12           16.67              16.67
13           11.11              33.33
14           11.11              16.67
15           5.56               33.33
16           16.67              33.33
17           11.11              33.33
18           50.00              50.00
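The same search could be done in scikit-learn along these lines; this is a sketch that assumes an 18/6 train/validation split of the riding-mower data, with standardization since Income and Lot_Size are on different scales:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def best_k(X_train, y_train, X_valid, y_valid, k_max=18):
    """Return the smallest k with the lowest validation error rate, plus all error rates."""
    scaler = StandardScaler().fit(X_train)                 # scale using training data only
    Xt, Xv = scaler.transform(X_train), scaler.transform(X_valid)

    errors = {}
    for k in range(1, k_max + 1):
        knn = KNeighborsClassifier(n_neighbors=k).fit(Xt, y_train)
        errors[k] = 1 - knn.score(Xv, y_valid)             # validation error rate
    lowest = min(errors.values())
    return min(k for k, e in errors.items() if e == lowest), errors

# Hypothetical usage on an 18/6 split of the riding-mower data:
# k, errors = best_k(X_train, y_train, X_valid, y_valid)
```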
For a numerical response, instead of a majority vote determining the class, use the average of the neighbors' response values. This may be a weighted average, with the weight decreasing with distance.
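scikit-learn's KNeighborsRegressor covers both variants; a minimal sketch (training data and new records are left as hypothetical placeholders):

```python
from sklearn.neighbors import KNeighborsRegressor

# Plain average of the k nearest neighbors' response values
knn_avg = KNeighborsRegressor(n_neighbors=5)

# Weighted average, with weight decreasing as distance increases
knn_wtd = KNeighborsRegressor(n_neighbors=5, weights="distance")

# Hypothetical usage, given training predictors X_train and a numerical response y_train:
# knn_wtd.fit(X_train, y_train)
# y_pred = knn_wtd.predict(X_new)
```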
Advantages
Simple. No assumptions required about normal distributions, etc. Effective at capturing complex interactions among variables without having to define a statistical model.
Shortcomings
A very large training set is needed as the number of predictors p grows. This is because the expected distance to the nearest neighbor increases with p (with a large vector of predictors, all records end up far away from each other). In addition, in a large training set it takes a long time to compute the distances to all the neighbors and then identify the nearest one(s). Together these issues constitute the curse of dimensionality.
Remedies: reduce the dimension of the predictors (e.g., with PCA), or use computational shortcuts that settle for "almost nearest" neighbors.
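One way to apply the first remedy in scikit-learn is to put PCA in front of the classifier; this is a sketch, and the number of components is an assumption that would be tuned in practice:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Reduce the predictors to a few principal components before the distance calculations
knn_reduced = make_pipeline(
    StandardScaler(),                  # PCA and k-NN are both scale-sensitive
    PCA(n_components=5),               # assumed number of components
    KNeighborsClassifier(n_neighbors=8),
)

# Hypothetical usage: knn_reduced.fit(X_train, y_train); knn_reduced.predict(X_new)
```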
Summary
Naïve rule: a benchmark. Naïve Bayes and K-NN are two variations on the same theme: classify a new record according to the class of similar records. No statistical models are involved. These methods pay attention to complex interactions and local structure. Computational challenges remain.