Predictive Modeling: Week 3
Decision Tree Algorithm
The problem
– Classification
– Prediction
Why decision tree?
• The game will be away, at 9 pm, and Joe will play
center on offense…
• A classification problem
• Generalizing the learned rule to new examples
Definition
Demo
Continuous Attribute?
[Chart: predicted demand over time, looking back six months, Jan through Aug]
Question: Can we predict the new model M-class sales based on the
data in the table?
• Trends
• Seasonality
• Cyclical elements
Some Important Questions
Introduction
• In this chapter we employ Regression Analysis
to examine the relationship among
quantitative variables.
• The technique is used to predict the value of
one variable (the dependent variable, y) based
on the values of other variables (the independent
variables x1, x2, …, xk).
The Model
• The first order linear model:
  y = β0 + β1x + ε
  where
  y = dependent variable
  x = independent variable
  β0 = y-intercept
  β1 = slope of the line (rise/run)
  ε = error variable
• β0 and β1 are unknown and are therefore estimated from the data.
Estimating the Coefficients
Let us compare two lines through the four data points (1, 2), (2, 4), (3, 1.5), and (4, 3.2); the second line is the horizontal line y = 2.5.
Sum of squared differences (first line) = (2 - 1)² + (4 - 2)² + (1.5 - 3)² + (3.2 - 4)² = 7.89
Sum of squared differences (horizontal line) = (2 - 2.5)² + (4 - 2.5)² + (1.5 - 2.5)² + (3.2 - 2.5)² = 3.99
The smaller the sum of squared differences, the better the fit of the line to the data.
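These two sums of squared differences can be reproduced with a few lines of Python. A minimal sketch, assuming the first line is ŷ = x and the second is the horizontal line ŷ = 2.5 (consistent with the differences shown above):

```python
# Four observed data points (x, y)
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_sq_diff(predict, pts):
    """Sum of squared differences between observed and predicted y values."""
    return sum((y - predict(x)) ** 2 for x, y in pts)

line1 = lambda x: float(x)   # assumed first line: y-hat = x
line2 = lambda x: 2.5        # the horizontal line: y-hat = 2.5

print(sum_sq_diff(line1, points))  # about 7.89
print(sum_sq_diff(line2, points))  # about 3.99
```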
To calculate the estimates of the coefficients that minimize the differences between the data points and the line, use the formulas:
  b1 = cov(X, Y) / s_x²
  b0 = ȳ - b1·x̄
The regression equation that estimates the equation of the first order linear model is:
  ŷ = b0 + b1x
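The formulas translate directly into code. A minimal NumPy sketch (function and variable names are illustrative) that computes b1 from the sample covariance and the sample variance of x, then b0 from the means:

```python
import numpy as np

def least_squares_coefficients(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance cov(X, Y)
    var_x = np.var(x, ddof=1)             # sample variance s_x^2
    b1 = cov_xy / var_x                   # slope estimate
    b0 = y.mean() - b1 * x.mean()         # intercept estimate
    return b0, b1

# Applied to the four points from the line-comparison example
b0, b1 = least_squares_coefficients([1, 2, 3, 4], [2, 4, 1.5, 3.2])
print(b0, b1)
```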
• Example: the relationship between odometer
reading and a used car’s selling price.
  – Independent variable: x (odometer reading)
  – Dependent variable: y (selling price)
• Solution
  – Solving by hand
    • To calculate b0 and b1 we need to calculate several statistics first:
      x̄ = 36,009.45;  s_x² = Σ(x_i - x̄)² / (n - 1) = 43,528,688
      ȳ = 5,411.41;   cov(X, Y) = Σ(x_i - x̄)(y_i - ȳ) / (n - 1) = -1,356,256
      where n = 100.
    • b1 = cov(X, Y) / s_x² = -1,356,256 / 43,528,688 = -.0312
    • b0 = ȳ - b1·x̄ = 5,411.41 - (-.0312)(36,009.45) = 6,533
    • ŷ = b0 + b1x = 6,533 - .0312x
– Using the computer (see file Xm17-01.xls)
  Tools > Data analysis > Regression > [Shade the y range and the x range] > OK
  Partial data (Odometer, Price): (37388, 5318), (44758, 5061), (45833, 5008), …
  SUMMARY OUTPUT (partial): Multiple R = 0.806308
  [Scatter plot of Price against Odometer with the fitted regression line]
  ŷ = 6,533 - .0312x
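The same model can be fit outside Excel. A minimal scipy sketch; the lists below are placeholders for the full odometer and price columns from the data file, so the printed values only match the slide's output when all 100 observations are supplied:

```python
from scipy.stats import linregress

# Placeholder lists: in practice load all 100 (odometer, price) pairs from the data file.
odometer = [37388, 44758, 45833]   # ... remaining observations omitted
price = [5318, 5061, 5008]

result = linregress(odometer, price)
print("intercept b0:", result.intercept)   # about 6,533 on the full data set
print("slope b1:", result.slope)           # about -0.0312 on the full data set
print("Multiple R:", abs(result.rvalue))   # about 0.806 on the full data set
```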
• Sum of squares for errors
  – This is the sum of squared differences between the points and the regression line.
  – It can serve as a measure of how well the line fits the data:
    SSE = Σ_{i=1..n} (y_i - ŷ_i)²
  – A shortcut formula:
    SSE = (n - 1)[s_Y² - cov(X, Y)² / s_x²]
  – For the example (calculated before):
    s_Y² = Σ(y_i - ȳ)² / (n - 1) = 6,434,890 / 99 = 64,999
    SSE = (n - 1)[s_Y² - cov(X, Y)² / s_x²] = 99[64,999 - (-1,356,256)² / 43,528,688] = 2,251,363
  – Standard error of estimate:
    s_ε = sqrt(SSE / (n - 2)) = sqrt(2,251,363 / 98) = 151.6
  – It is hard to assess the model based on s_ε even when compared with the mean value of y: s_ε = 151.6, ȳ = 5,411.4.
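A hedged sketch of the same calculation in Python: given the fitted coefficients, SSE and the standard error of estimate can be computed directly from the residuals.

```python
import numpy as np

def sse_and_standard_error(x, y, b0, b1):
    """SSE = sum of (y_i - y-hat_i)^2 and the standard error of estimate s_eps."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (b0 + b1 * x)         # y_i - y-hat_i
    sse = np.sum(residuals ** 2)
    s_eps = np.sqrt(sse / (len(x) - 2))   # s_eps = sqrt(SSE / (n - 2))
    return sse, s_eps
```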
• Testing the slope
• Coefficient of determination:
  R² = [cov(X, Y)]² / (s_x² · s_y²),  or equivalently  R² = 1 - SSE / Σ(y_i - ȳ)²
– To understand the significance of this coefficient
note:
Two data points (x1, y1) and (x2, y2) of a certain sample are shown.
[Diagram: the two points, the fitted line, and the mean line ȳ]
Total variation in y = Variation explained by the regression line + Unexplained variation (error):
(y1 - ȳ)² + (y2 - ȳ)² = [(ŷ1 - ȳ)² + (ŷ2 - ȳ)²] + [(y1 - ŷ1)² + (y2 - ŷ2)²]
Variation in y = SSR + SSE
R² = 1 - SSE / Σ(y_i - ȳ)² = [Σ(y_i - ȳ)² - SSE] / Σ(y_i - ȳ)² = SSR / Σ(y_i - ȳ)²
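The decomposition gives a one-line recipe for R²: compute SSE and the total variation Σ(y_i - ȳ)², then take their ratio. A minimal sketch:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: R^2 = 1 - SSE/SST = SSR/SST."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    sse = np.sum((y - y_hat) ** 2)      # unexplained variation
    sst = np.sum((y - y.mean()) ** 2)   # total variation in y
    return 1 - sse / sst
```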
Regression Diagnostics - I
• The three conditions required for the validity
of the regression analysis are:
– The error variable is normally distributed.
– The error variance is constant for all values of x.
– The errors are independent of each other.
• How can we diagnose violations of these
conditions?
• Residual Analysis
  – By examining the residuals (or standardized residuals), we can identify violations of the required conditions.
  – Example - continued
    • Nonnormality
      – Use Excel to obtain the standardized residual histogram.
      – Examine the histogram and look for a bell-shaped distribution with a mean close to zero.
RESIDUAL OUTPUT (a partial list of standard residuals):
Observation   Residual         Standard Residual
1             -50.45749927     -0.334595895
2             -77.82496482     -0.516076186
3             -97.33039568     -0.645421421
4             223.2070978       1.480140312
5             238.4730715       1.58137268

For each residual we calculate the standard deviation as follows:
  s_ri = s_ε · sqrt(1 - h_i),  where  h_i = 1/n + (x_i - x̄)² / Σ(x_j - x̄)²
  Standardized residual_i = Residual_i / Standard deviation_i

[Histogram of the standardized residuals: approximately bell shaped, with mean close to zero]

We can also apply the Lilliefors test or the χ² test of normality.
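The standardized residuals can be reproduced outside Excel as well. A minimal sketch for simple linear regression, using the leverage formula h_i shown above; it also flags observations with |standard residual| > 2, the rule of thumb used later for spotting outliers:

```python
import numpy as np

def standardized_residuals(x, y, b0, b1):
    """Residuals divided by their estimated std deviation s_eps * sqrt(1 - h_i)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    residuals = y - (b0 + b1 * x)
    s_eps = np.sqrt(np.sum(residuals ** 2) / (n - 2))
    h = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)   # leverage h_i
    std_res = residuals / (s_eps * np.sqrt(1 - h))
    suspected_outliers = np.abs(std_res) > 2   # |standard residual| > 2 rule of thumb
    return std_res, suspected_outliers
```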
• Nonindependence of the error variables
  – When the data are collected over time, they constitute a time series.
  – If the errors are independent, a plot of the residuals over time should show no pattern.
  – When a pattern is detected, the errors are said to be autocorrelated.
  – Autocorrelation can be detected by graphing the residuals against time (see the sketch below).
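A quick way to look for autocorrelation is simply to plot the residuals in time order. A minimal matplotlib sketch:

```python
import matplotlib.pyplot as plt

def plot_residuals_over_time(residuals):
    """If the errors are independent, this plot should show no visible pattern."""
    periods = range(1, len(residuals) + 1)
    plt.plot(periods, residuals, "o")    # one residual per time period
    plt.axhline(0.0, color="gray")       # reference line at zero
    plt.xlabel("Time period")
    plt.ylabel("Residual")
    plt.show()
```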
• Outliers
– An outlier is an observation that is unusually small
or large.
– Several possibilities need to be investigated when
an outlier is observed:
• There was an error in recording the value.
• The point does not belong in the sample.
• The observation is valid.
– Identify outliers from the scatter diagram.
– It is customary to suspect an observation is an
outlier if its |standard residual| > 2
[Two scatter diagrams: one illustrating an outlier, the other an influential observation]
… but some outliers may be very influential.
• Procedure for regression diagnostics
– Develop a model that has a theoretical basis.
– Gather data for the two variables in the model.
– Draw the scatter diagram to determine whether a
linear model appears to be appropriate.
– Check the required conditions for the errors.
– Assess the model fit.
– If the model fits the data, use the regression
equation.
The Bayes Classifier
• Use Bayes Rule!
P(class | features) = P(features | class) · P(class) / P(features)
where P(features | class) is the likelihood, P(class) is the prior, and P(features) is the normalization constant.
• Why did this help? Well, we think that we might be able to
specify how features are “generated” by the class label
Another Example of the Naïve Bayes
Classifier
[Table: the weather data, with counts and probabilities of each attribute value (outlook, temperature, humidity, windy) given play = yes and play = no]
A new day:
outlook = sunny, temperature = cool, humidity = high, windy = true, play = ?
• Likelihood of yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
• Likelihood of no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
• Therefore, the prediction is No
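The same calculation can be scripted. A minimal sketch with the conditional probabilities hard-coded from the weather table (names like p_given_yes are illustrative):

```python
# Conditional probabilities P(value | class) read off the weather table,
# plus the class priors P(yes) = 9/14 and P(no) = 5/14.
p_given_yes = {"outlook=sunny": 2/9, "temp=cool": 3/9, "humidity=high": 3/9, "windy=true": 3/9}
p_given_no = {"outlook=sunny": 3/5, "temp=cool": 1/5, "humidity=high": 4/5, "windy=true": 3/5}

def naive_bayes_score(cond_probs, prior):
    """Product of the conditional probabilities times the class prior."""
    score = prior
    for p in cond_probs.values():
        score *= p
    return score

score_yes = naive_bayes_score(p_given_yes, 9 / 14)   # about 0.0053
score_no = naive_bayes_score(p_given_no, 5 / 14)     # about 0.0206
print("prediction:", "yes" if score_yes > score_no else "no")   # -> no
```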
The Naive Bayes Classifier for Data Sets
with Numerical Attribute Values
The weather data with numerical temperature and humidity (values and counts per class):
  outlook: sunny 2 yes / 3 no, overcast 4 yes / 0 no, rainy 3 yes / 2 no
  temperature: yes values 83, 70, 68, 64, 69, 75, 75, 72, 81; no values 85, 80, 65, 72, 71
  humidity: yes values 86, 96, 80, 65, 70, 80, 70, 90, 75; no values 85, 90, 70, 95, 91
  windy: false 6 yes / 2 no, true 3 yes / 3 no
  play: 9 yes, 5 no
Probabilities and summary statistics (first value for play = yes, second for play = no):
  outlook: sunny 2/9 | 3/5, overcast 4/9 | 0/5, rainy 3/9 | 2/5
  temperature: mean 73 | 74.6, std dev 6.2 | 7.9
  humidity: mean 79.1 | 86.2, std dev 10.2 | 9.7
  windy: false 6/9 | 2/5, true 3/9 | 3/5
  play: 9/14 | 5/14
• Let x1, x2, …, xn be the values of a numerical attribute
in the training data set.
  mean: μ = (1/n) Σ_{i=1..n} x_i
  variance: σ² = (1/(n - 1)) Σ_{i=1..n} (x_i - μ)²
  normal density: f(w) = 1/(sqrt(2π) · σ) · e^(-(w - μ)² / (2σ²))
• For example,
  f(temperature = 66 | Yes) = 1/(sqrt(2π) · 6.2) · e^(-(66 - 73)² / (2 · 6.2²)) = 0.0340
• Likelihood of Yes = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
• Likelihood of No = 3/5 × 0.0291 × 0.038 × 3/5 × 5/14 = 0.000136
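A minimal sketch of the numerical case: the Gaussian density f(w) is evaluated at the observed value of each numeric attribute and multiplied into the likelihood exactly as the categorical fractions were. The means and standard deviations come from the table above; the new day is assumed to be sunny and windy with temperature 66 and humidity 90, which matches the factors in the likelihoods above.

```python
import math

def gaussian_density(w, mean, std):
    """Normal density f(w) for the given mean and standard deviation."""
    return math.exp(-((w - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# f(temperature = 66 | yes), mean 73, std dev 6.2 -> about 0.0340
print(gaussian_density(66, 73, 6.2))

# Likelihood scores for the assumed new day (sunny, temperature 66, humidity 90, windy true)
score_yes = (2/9) * gaussian_density(66, 73, 6.2) * gaussian_density(90, 79.1, 10.2) * (3/9) * (9/14)
score_no = (3/5) * gaussian_density(66, 74.6, 7.9) * gaussian_density(90, 86.2, 9.7) * (3/5) * (5/14)
print(score_yes, score_no)   # roughly 0.000036 and 0.00014
```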