
Predictive Modeling

Week 3
Decision Tree Algorithm
The problem

• Given a set of training cases/objects and their attribute values, try to determine the target attribute value of new examples.
  – Classification
  – Prediction
Why decision trees?

• Decision trees are powerful and popular tools for classification and prediction.
• Decision trees represent rules, which can be understood by humans and used in knowledge systems such as databases.
Key requirements

• Attribute-value description: each object or case must be expressible in terms of a fixed collection of properties or attributes (e.g., hot, mild, cold).
• Predefined classes (target values): the target function has discrete output values (boolean or multiclass).
• Sufficient data: enough training cases should be provided to learn the model.
A simple example

• You want to guess the outcome of next week's game between the MallRats and the Chinooks.

• Available knowledge / attributes:
  – Was the game at Home or Away?
  – Was the starting time 5pm, 7pm, or 9pm?
  – Did Joe play center or forward?
  – Was the opponent's center tall or not?
  – …
Basketball data
What we know

• The game will be away, at 9pm, and Joe will play center on offense…

• A classification problem
• Generalizing the learned rule to new examples
Definition

 A decision tree is a classifier in the form of a tree structure:
   Decision node: specifies a test on a single attribute
   Leaf node: indicates the value of the target attribute
   Arc/edge: one outcome of the split on an attribute
   Path: a conjunction of tests leading to the final decision

 Decision trees classify instances or examples by starting at the root of the tree and moving through it until a leaf node is reached.
Illustration

(1) Which attribute to start with? (the root)
(2) Which node to proceed to next?
(3) When to stop / come to a conclusion?

Random split

• If attributes are chosen at random, the tree can grow huge.
• Such trees are hard to understand.
• Larger trees are typically less accurate than smaller trees.
Principled criterion

• Selection of an attribute to test at each node: choose the most useful attribute for classifying examples.
• Information gain
  – measures how well a given attribute separates the training examples according to their target classification
  – is used to select among the candidate attributes at each step while growing the tree
Entropy

• A measure of the homogeneity of a set of examples.

• Given a set S of positive and negative examples of some target concept (a 2-class problem), the entropy of set S relative to this binary classification is

    E(S) = -p(P) log2 p(P) - p(N) log2 p(N)

• Suppose S has 25 examples, 15 positive and 10 negative [15+, 10-]. Then the entropy of S relative to this classification is

    E(S) = -(15/25) log2(15/25) - (10/25) log2(10/25) ≈ 0.971
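As an illustrative sketch (not part of the original slides), this calculation is a few lines of Python; the function name and the counts-based interface are my own choices:

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    # Skip empty classes: the limit of p*log2(p) as p -> 0 is 0.
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

print(entropy([15, 10]))  # the slide's S = [15+, 10-]: ~0.971 bits
print(entropy([10, 0]))   # a "certain" outcome: 0 bits
print(entropy([10, 10]))  # maximal uncertainty for 2 classes: 1 bit
```

The last two calls preview the intuitions on the next slide.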


Some intuitions

• The entropy is 0 if the outcome is "certain".
• The entropy is maximal if we have no knowledge of the system (i.e., every outcome is equally likely).

[Figure: entropy of a 2-class problem as a function of the proportion of one of the two classes.]
Information Gain

• Information gain measures the expected reduction in entropy, or uncertainty:

    Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

  – Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v: S_v = {s ∈ S | A(s) = v}.
  – The first term in the equation for Gain is just the entropy of the original collection S.
  – The second term is the expected value of the entropy after S is partitioned using attribute A.

• It is simply the expected reduction in entropy caused by partitioning the examples according to this attribute.
• It is the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A.
Examples

• Before partitioning, the entropy is
  – H(10/20, 10/20) = -10/20 log(10/20) - 10/20 log(10/20) = 1
• Using the "where" attribute, divide into 2 subsets:
  – Entropy of the first set: H(home) = -6/12 log(6/12) - 6/12 log(6/12) = 1
  – Entropy of the second set: H(away) = -4/8 log(4/8) - 4/8 log(4/8) = 1
• Expected entropy after partitioning:
  – 12/20 · H(home) + 8/20 · H(away) = 1
Decision Tree Example

Expand Factory (cost = $1.5M):
  – 40% chance of a good economy → profit = $6M
  – 60% chance of a bad economy → profit = $2M

Don't Expand Factory (cost = $0):
  – 40% chance of a good economy → profit = $3M
  – 60% chance of a bad economy → profit = $1M

NPV(Expand) = (0.4 × 6 + 0.6 × 2) - 1.5 = $2.1M
NPV(No Expand) = 0.4 × 3 + 0.6 × 1 = $1.8M

$2.1M > $1.8M, therefore you should expand the factory.
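The same expected-value arithmetic as a small sketch (the variable names are mine):

```python
p_good, p_bad = 0.4, 0.6                      # economy scenarios

npv_expand = (p_good * 6 + p_bad * 2) - 1.5   # profits in $M, minus the $1.5M cost
npv_no_expand = p_good * 3 + p_bad * 1        # no expansion cost

print(npv_expand, npv_no_expand)              # 2.1 1.8 -> expand the factory
```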


S.no  Outlook   Temperature  Humidity  Wind    Play
1     Sunny     Hot          High      Weak    No
2     Sunny     Hot          High      Strong  No
3     Overcast  Hot          High      Weak    Yes
4     Rain      Mild         High      Weak    Yes
5     Rain      Cool         Normal    Weak    Yes
6     Rain      Cool         Normal    Strong  No
7     Overcast  Cool         Normal    Strong  Yes
8     Sunny     Mild         High      Weak    No
9     Sunny     Cool         Normal    Weak    Yes
10    Rain      Mild         Normal    Weak    Yes
11    Sunny     Mild         Normal    Strong  Yes
12    Overcast  Mild         High      Strong  Yes
13    Overcast  Hot          Normal    Weak    Yes
14    Rain      Mild         High      Strong  No

Data Mining and Predictive Analytics, by Daniel Larose and Chantal Larose, John Wiley & Sons, Inc., Hoboken, NJ, 2015.
• Using the "when" attribute, divide into 3 subsets:
  – Entropy of the first set: H(5pm) = -1/4 log(1/4) - 3/4 log(3/4)
  – Entropy of the second set: H(7pm) = -9/12 log(9/12) - 3/12 log(3/12)
  – Entropy of the third set: H(9pm) = -0/4 log(0/4) - 4/4 log(4/4) = 0
• Expected entropy after partitioning:
  – 4/20 · H(1/4, 3/4) + 12/20 · H(9/12, 3/12) + 4/20 · H(0/4, 4/4) = 0.65
• Information gain: 1 - 0.65 = 0.35
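As a sketch, the whole comparison can be scripted; the class counts below are read off the worked examples above (wins/losses per subset), and the helper names are my own:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, subsets):
    """Parent entropy minus the size-weighted entropies of the subsets."""
    n = sum(parent)
    remainder = sum(sum(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - remainder

parent = [10, 10]                        # 20 games: 10 wins, 10 losses
where = [[6, 6], [4, 4]]                 # home, away
when = [[1, 3], [9, 3], [0, 4]]          # 5pm, 7pm, 9pm

print(information_gain(parent, where))   # 0.0
print(information_gain(parent, when))    # ~0.35 -> test "when" first
```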
Decision

• Knowing the "when" attribute values provides larger information gain than "where".
• Therefore the "when" attribute should be chosen for testing before the "where" attribute.
• Similarly, we can compute the information gain for other attributes.
• At each node, choose the attribute with the largest information gain.
• Stopping rule: a node becomes a leaf when
  – every attribute has already been included along this path through the tree, or
  – the training examples associated with this leaf node all have the same target attribute value (i.e., their entropy is zero).
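Putting the selection rule and the stopping rules together, here is a minimal recursive sketch of this greedy procedure (ID3-style). The dict-based example representation and the helper names are my own assumptions, not code from the course:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def build_tree(examples, labels, attributes):
    """Greedy induction: split on the attribute with the largest information gain."""
    if len(set(labels)) == 1:
        return labels[0]                              # pure node: entropy is zero
    if not attributes:
        return Counter(labels).most_common(1)[0][0]   # attributes exhausted: majority vote

    def gain(attr):
        remainder = 0.0
        for v in set(ex[attr] for ex in examples):
            sub = [lab for ex, lab in zip(examples, labels) if ex[attr] == v]
            remainder += len(sub) / len(labels) * entropy(sub)
        return entropy(labels) - remainder

    best = max(attributes, key=gain)                  # largest information gain
    tree = {best: {}}
    rest = [a for a in attributes if a != best]
    for v in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == v]
        tree[best][v] = build_tree([examples[i] for i in idx],
                                   [labels[i] for i in idx], rest)
    return tree

# Tiny usage example with two attributes
examples = [{"where": "home", "when": "7pm"}, {"where": "away", "when": "9pm"}]
labels = ["win", "lose"]
print(build_tree(examples, labels, ["where", "when"]))
```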

Demo
Continuous attributes?

• Each non-leaf node is a test, and its edges partition the attribute's values into subsets (easy for a discrete attribute).
• For a continuous attribute:
  – Partition the continuous values of attribute A into a discrete set of intervals, or
  – Create a new boolean attribute A_c by looking for a threshold c:

      A_c = true   if A < c
            false  otherwise

How do we choose c?
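One standard answer (C4.5 does essentially this) is to sort the values, evaluate candidate thresholds at boundaries between consecutive distinct values, and keep the one with the largest information gain. A self-contained sketch, with names of my own choosing:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return the threshold c (and its gain) for the boolean test A < c."""
    pairs = sorted(zip(values, labels))
    parent = entropy([lab for _, lab in pairs])
    best_c, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal attribute values
        c = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v < c]
        right = [lab for v, lab in pairs if v >= c]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if parent - weighted > best_gain:
            best_c, best_gain = c, parent - weighted
    return best_c, best_gain

# e.g., temperatures with play / no-play labels
print(best_threshold([64, 65, 68, 69, 70, 71], ["yes", "no", "yes", "yes", "yes", "no"]))
```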
Evaluation

• Training accuracy
  – How many training instances can be correctly classified based on the available data?
  – It is high when the tree is deep/large, or when there is little conflict among the training instances.
  – However, higher training accuracy does not mean better generalization.
• Testing accuracy
  – Given a number of new instances, how many of them can we correctly classify?
  – Estimated via cross validation.
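As a hedged illustration with scikit-learn (assuming it is available; `cross_val_score` and `DecisionTreeClassifier` are its standard public APIs, and the iris data is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5-fold cross validation: fit on 4 folds, score on the held-out fold each time
scores = cross_val_score(tree, X, y, cv=5)
print(scores.mean())  # an estimate of testing accuracy, not training accuracy
```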
What is forecasting?

Forecasting is a tool used for predicting future demand based on past demand information.

Why is forecasting important?

Demand for products and services is usually uncertain. Forecasting can be used for…
• Strategic planning (long range planning)
• Finance and accounting (budgets and cost controls)
• Marketing (future sales, new products)
• Production and operations
What is forecasting all about?

We try to predict the future by looking back at the past.

[Figure: demand for Mercedes E-Class over Jan–Aug; actual demand (past sales) for the six months we look back on, followed by predicted demand.]
Some general characteristics of forecasts

• Forecasts are always wrong.
• Forecasts are more accurate for groups or families of items.
• Forecasts are more accurate for shorter time periods.
• Every forecast should include an error estimate.
• Forecasts are no substitute for calculated demand.
Key issues in forecasting

1. A forecast is only as good as the information included in the forecast (past data).
2. History is not a perfect predictor of the future (i.e., there is no such thing as a perfect forecast).

REMEMBER: Forecasting is based on the assumption that the past predicts the future! When forecasting, think carefully about whether or not the past is strongly related to what you expect to see in the future…
Example: Mercedes E-class vs. M-class sales

Month  E-class Sales  M-class Sales
Jan    23,345         -
Feb    22,034         -
Mar    21,453         -
Apr    24,897         -
May    23,561         -
Jun    22,684         -
Jul    ?              ?

Question: Can we predict the new model M-class sales based on the data in the table?

Answer: Maybe... We need to consider how much the two markets have in common.
What should we consider when looking at past demand data?

• Trends
• Seasonality
• Cyclical elements
Some important questions

• What is the purpose of the forecast?
• Which systems will use the forecast?
• How important is the past in estimating the future?

Answers will help determine time horizons, techniques, and level of detail for the forecast.
Simple Linear Regression

Introduction

• In this chapter we employ regression analysis to examine the relationship among quantitative variables.
• The technique is used to predict the value of one variable (the dependent variable, y) based on the values of other variables (the independent variables x1, x2, …, xk).
The Model

• The first order linear model:

    y = β0 + β1 x + ε

  where
    y  = dependent variable
    x  = independent variable
    β0 = y-intercept
    β1 = slope of the line (rise/run)
    ε  = error variable

  β0 and β1 are unknown and therefore are estimated from the data.
Estimating the Coefficients

• The estimates are determined by
  – drawing a sample from the population of interest,
  – calculating sample statistics, and
  – producing a straight line that cuts into the data.

[Figure: a scatter of sample points. The question is: which straight line fits best?]
The best line is the one that minimizes the sum of squared vertical differences between the points and the line.

Let us compare two lines for the points (1,2), (2,4), (3,1.5), and (4,3.2):
– First line: sum of squared differences = (2 - 1)² + (4 - 2)² + (1.5 - 3)² + (3.2 - 4)² = 7.89
– Second line (horizontal at y = 2.5): sum of squared differences = (2 - 2.5)² + (4 - 2.5)² + (1.5 - 2.5)² + (3.2 - 2.5)² = 3.99

The smaller the sum of squared differences, the better the fit of the line to the data.
To calculate the estimates of the coefficients that minimize the differences between the data points and the line, use the formulas:

    b1 = cov(X, Y) / s_x²
    b0 = ȳ - b1 x̄

The regression equation that estimates the equation of the first order linear model is:

    ŷ = b0 + b1 x
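A short sketch of these two formulas with NumPy (the function and variable names are mine; `ddof=1` gives the sample covariance and variance used here):

```python
import numpy as np

def fit_line(x, y):
    """b1 = cov(X, Y) / s_x^2 and b0 = ybar - b1 * xbar."""
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y.mean() - b1 * x.mean(), b1

# Tiny usage example with the four points compared above
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 1.5, 3.2])
b0, b1 = fit_line(x, y)
print(b0, b1)  # yhat = b0 + b1 * x
```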
• Example: the relationship between odometer reading and a used car's selling price.
  – A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
  – A random sample of 100 cars is selected, and the data recorded.
  – Find the regression line.

  Independent variable: x = odometer reading
  Dependent variable: y = selling price

    Car  Odometer  Price
    1    37388     5318
    2    44758     5061
    3    45833     5008
    4    30862     5795
    5    31705     5784
    6    34010     5359
    …    …         …
• Solution
  – Solving by hand:
    To calculate b0 and b1 we first need several statistics (with n = 100):

      x̄ = 36,009.45;   s_x² = Σ(x_i - x̄)² / (n - 1) = 43,528,688
      ȳ = 5,411.41;    cov(X, Y) = Σ(x_i - x̄)(y_i - ȳ) / (n - 1) = -1,356,256

    so that

      b1 = cov(X, Y) / s_x² = -1,356,256 / 43,528,688 = -0.0312
      b0 = ȳ - b1 x̄ = 5,411.41 - (-0.0312)(36,009.45) = 6,533

      ŷ = b0 + b1 x = 6,533 - 0.0312 x
– Using the computer (see file Xm17-01.xls):
  Tools > Data analysis > Regression > [Shade the y range and the x range] > OK

  SUMMARY OUTPUT

  Regression Statistics
    Multiple R          0.8063
    R Square            0.6501
    Adjusted R Square   0.6466
    Standard Error      151.57
    Observations        100

  ANOVA
                df    SS         MS
    Regression  1     4,183,528  4,183,528
    Residual    98    2,251,362  22,973.09
    Total       99    6,434,890

                Coefficients  Standard Error  t Stat
    Intercept   6533.383      84.51232        77.30687
    Odometer    -0.03116      0.002309        -13.4947

    ŷ = 6,533 - 0.0312 x

  [Figure: scatter of Price vs. Odometer (19,000–49,000 miles) with the fitted line; the worksheet also lists the raw Odometer/Price data.]
[Figure: the fitted line ŷ = 6,533 - 0.0312x plotted over the observed odometer range (19,000–49,000 miles); there is no data near x = 0.]

The intercept is b0 = 6,533. The slope is b1 = -0.0312: for each additional mile on the odometer, the price decreases by an average of $0.0312.

Do not interpret the intercept as the "price of cars that have not been driven": the sample contains no cars with odometer readings near zero.
• Sum of squares for errors
  – This is the sum of squared differences between the points and the regression line.
  – It can serve as a measure of how well the line fits the data:

      SSE = Σ_{i=1}^{n} (y_i - ŷ_i)²  =  (n - 1) [ s_Y² - cov(X, Y)² / s_x² ]

  – This statistic plays a role in every statistical technique we employ to assess the model.
• Standard error of estimate
  – The mean error is equal to zero.
  – If s_ε is small, the errors tend to be close to zero (close to the mean error), and the model fits the data well.
  – Therefore, we can use s_ε as a measure of the suitability of using a linear model.
  – An unbiased estimator of σ_ε² is given by s_ε²:

      Standard error of estimate:  s_ε = sqrt( SSE / (n - 2) )
• Example
  – Calculate the standard error of estimate for the example above, and describe what it tells you about the model fit.
• Solution

      s_Y² = Σ(y_i - ȳ)² / (n - 1) = 6,434,890 / 99 = 64,999   (calculated before)

      SSE = (n - 1) [ s_Y² - cov(X, Y)² / s_x² ]
          = 99 × (64,999 - (-1,356,256)² / 43,528,688) = 2,251,363

      s_ε = sqrt( SSE / (n - 2) ) = sqrt( 2,251,363 / 98 ) = 151.6

  Thus, it is hard to assess the model based on s_ε alone, even when compared with the mean value of y: s_ε = 151.6, ȳ = 5,411.4.
• Testing the slope
  – When no linear relationship exists between two variables, the regression line should be horizontal.

[Figure: two scatter plots.
  Linear relationship: different inputs (x) yield different outputs (y); the slope is not equal to zero.
  No linear relationship: different inputs (x) yield the same output (y); the slope is equal to zero.]
• Coefficient of determination
  – When we want to measure the strength of the linear relationship, we use the coefficient of determination:

      R² = cov(X, Y)² / (s_x² s_y²)    or    R² = 1 - SSE / Σ(y_i - ȳ)²
– To understand the significance of this coefficient, note that the overall variability in y is explained in part by the regression model, while the rest remains unexplained (the error).
Two data points (x1, y1) and (x2, y2) of a certain sample are shown.

[Figure: two points with their deviations from ȳ decomposed by the regression line.]

Total variation in y = variation explained by the regression line + unexplained variation (error):

    (y1 - ȳ)² + (y2 - ȳ)² = (ŷ1 - ȳ)² + (ŷ2 - ȳ)² + (y1 - ŷ1)² + (y2 - ŷ2)²
Variation in y = SSR + SSE

• R² measures the proportion of the variation in y that is explained by the variation in x:

    R² = 1 - SSE / Σ(y_i - ȳ)²  =  [ Σ(y_i - ȳ)² - SSE ] / Σ(y_i - ȳ)²  =  SSR / Σ(y_i - ȳ)²

• R² takes on any value between zero and one.
  – R² = 1: perfect match between the line and the data points.
  – R² = 0: there is no linear relationship between x and y.
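The first form of R² as a sketch (the slide's numbers plugged in):

```python
s2_x, s2_y = 43_528_688, 64_999    # sample variances of x and y
cov_xy = -1_356_256                # sample covariance

r2 = cov_xy**2 / (s2_x * s2_y)
print(round(r2, 4))  # ~0.6501: about 65% of the variation in y is explained by x
```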
– Find the coefficient of determination for the example; what does this statistic tell you about the model?
• Solution
  – Solving by hand:

      R² = cov(X, Y)² / (s_x² s_y²) = (-1,356,256)² / (43,528,688 × 64,999) = 0.6501

  – Using the computer. From the regression output we have:

      Regression Statistics
      Multiple R           0.8063
      R Square             0.6501
      Adjusted R Square    0.6466
      Standard Error       151.57
      Observations         100

  65% of the variation in the selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
Coefficient of correlation

• The coefficient of correlation is used to measure the strength of association between two variables.
• The coefficient values range between -1 and 1.
  – If r = -1 (negative association) or r = +1 (positive association), every point falls on the regression line.
  – If r = 0, there is no linear pattern.
• The coefficient can be used to test for a linear relationship between two variables.
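As a one-line sketch, r is the covariance scaled by both standard deviations; note that r² equals the R² computed above:

```python
import math

cov_xy, s2_x, s2_y = -1_356_256, 43_528_688, 64_999
r = cov_xy / math.sqrt(s2_x * s2_y)
print(round(r, 4))  # ~ -0.8063: a strong negative association
```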
Regression Diagnostics - I

• The three conditions required for the validity of the regression analysis are:
  – the error variable is normally distributed,
  – the error variance is constant for all values of x,
  – the errors are independent of each other.
• How can we diagnose violations of these conditions?

• Residual Analysis
  – By examining the residuals (or standardized residuals), we can identify violations of the required conditions.
  – Example (continued): Nonnormality.
    – Use Excel to obtain the standardized residual histogram.
    – Examine the histogram and look for a bell-shaped diagram with mean close to zero.
RESIDUAL OUTPUT (a partial list of standard residuals):

    Observation   Residuals     Standard Residuals
    1             -50.4575      -0.3346
    2             -77.8250      -0.5161
    3             -97.3304      -0.6454
    4             223.2071       1.4801
    5             238.4731       1.5814

For each residual we calculate the standard deviation as follows:

    s_{r_i} = s_ε sqrt(1 - h_i),   where
    h_i = 1/n + (x_i - x̄)² / Σ_j (x_j - x̄)²

    Standardized residual_i = residual_i / standard deviation

[Figure: histogram of the standardized residuals, roughly bell shaped with mean close to zero.]

We can also apply the Lilliefors test or the χ² test of normality.
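A sketch of the standardized-residual computation with NumPy (the array and function names are mine):

```python
import numpy as np

def standardized_residuals(x, y, b0, b1, s_e):
    """residual_i / (s_e * sqrt(1 - h_i)), with leverage h_i as on the slide."""
    resid = y - (b0 + b1 * x)
    dx2 = (x - x.mean())**2
    h = 1 / len(x) + dx2 / dx2.sum()
    return resid / (s_e * np.sqrt(1 - h))

# A histogram of these values should look roughly bell shaped with mean near zero;
# observations with absolute value > 2 are suspected outliers (see below).
```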
• Nonindependence of error variables
  – A time series is constituted if data were collected over time.
  – Examining the residuals over time, no pattern should be observed if the errors are independent.
  – When a pattern is detected, the errors are said to be autocorrelated.
  – Autocorrelation can be detected by graphing the residuals against time.
• Outliers
  – An outlier is an observation that is unusually small or large.
  – Several possibilities need to be investigated when an outlier is observed:
    • There was an error in recording the value.
    • The point does not belong in the sample.
    • The observation is valid.
  – Identify outliers from the scatter diagram.
  – It is customary to suspect an observation is an outlier if its |standard residual| > 2.
[Figure: an outlier vs. an influential observation. Some outliers may be very influential: the outlier causes a shift in the regression line.]
• Procedure for regression diagnostics
  – Develop a model that has a theoretical basis.
  – Gather data for the two variables in the model.
  – Draw the scatter diagram to determine whether a linear model appears to be appropriate.
  – Check the required conditions for the errors.
  – Assess the model fit.
  – If the model fits the data, use the regression equation.
The Bayes Classifier

• Use Bayes' rule!

    P(class | features) = P(features | class) · P(class) / P(features)
                          (likelihood × prior / normalization constant)

• Why did this help? Well, we think that we might be able to specify how features are "generated" by the class label.
Another Example of the Naïve Bayes Classifier

The weather data, with counts and probabilities:

  Outlook              Temperature        Humidity            Windy              Play
            yes   no             yes  no            yes  no           yes  no    yes   no
  sunny      2     3    hot       2    2   high      3    4   false    6    2     9     5
  overcast   4     0    mild      4    2   normal    6    1   true     3    3
  rainy      3     2    cool      3    1
  sunny     2/9   3/5   hot      2/9  2/5  high     3/9  4/5  false   6/9  2/5  9/14  5/14
  overcast  4/9   0/5   mild     4/9  2/5  normal   6/9  1/5  true    3/9  3/5
  rainy     3/9   2/5   cool     3/9  1/5

A new day:

  outlook  temperature  humidity  windy  play
  sunny    cool         high      true   ?

• Likelihood of yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 ≈ 0.0053
• Likelihood of no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 ≈ 0.0206
• Therefore, the prediction is No.
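The same arithmetic as a sketch (conditional probabilities copied from the table above; the dict layout is my own):

```python
# P(attribute value | class), read off the weather table
p_yes = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9}  # outlook, temp, humidity, windy
p_no = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5}
prior_yes, prior_no = 9/14, 5/14

new_day = ["sunny", "cool", "high", "true"]

like_yes, like_no = prior_yes, prior_no
for v in new_day:
    like_yes *= p_yes[v]   # naive assumption: attributes independent given the class
    like_no *= p_no[v]

print(like_yes, like_no)   # ~0.0053 vs ~0.0206 -> predict No
```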
The Naive Bayes Classifier for Data Sets with Numerical Attribute Values

• One common practice to handle numerical attribute values is to assume normal distributions for numerical attributes.

The numeric weather data with summary statistics:

  Outlook              Temperature         Humidity            Windy              Play
            yes   no          yes    no          yes    no           yes  no    yes   no
  sunny      2     3           83    85           86    85   false    6    2     9     5
  overcast   4     0           70    80           96    90   true     3    3
  rainy      3     2           68    65           80    70
                               64    72           65    95
                               69    71           70    91
                               75                 80
                               75                 70
                               72                 90
                               81                 75
  sunny     2/9   3/5  mean    73   74.6  mean  79.1  86.2   false  6/9  2/5  9/14  5/14
  overcast  4/9   0/5  s.d.   6.2   7.9   s.d.  10.2   9.7   true   3/9  3/5
  rainy     3/9   2/5
• Let x1, x2, …, xn be the values of a numerical attribute in the training data set.

    μ = (1/n) Σ_{i=1}^{n} x_i
    σ² = (1/(n-1)) Σ_{i=1}^{n} (x_i - μ)²
    f(w) = (1 / (√(2π) σ)) e^{-(w-μ)² / (2σ²)}

• For example,

    f(temperature = 66 | Yes) = (1 / (√(2π) · 6.2)) e^{-(66-73)² / (2 · 6.2²)} = 0.0340

• Likelihood of Yes = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 ≈ 0.000036
• Likelihood of No = 3/5 × 0.0291 × 0.038 × 3/5 × 5/14 ≈ 0.000136
