Introduction To Statistics..Final
Data
• Raw values
• Facts
Statistical Analysis
• Mathematical tools
• Mathematical techniques
Information
• Measures
• Summaries
Decision Making
Components of Statistics
Descriptive Statistics:
Condenses sample data into a few summary descriptive measures.
When large quantities of data have been gathered, there is a need to
organise, summarise and extract the essential information contained
within this data for communication to management.
Summary measures allow a user to identify profiles, patterns,
relationships and trends within the data.
Inferential Statistics
Generalises sample findings to the broader population.
Descriptive statistics only describes the behaviour of a random
variable in a sample.
However, management is mainly concerned about the behaviour
and characteristics of random variables in the population from which
the sample was drawn.
They are therefore interested in the ‘bigger population picture’.
Inferential statistics is that area of statistics that allows managers
to understand the population picture of a random variable based on
the sample evidence.
Statistical Modelling
• Data is the lifeblood of statistical analysis. It must therefore be relevant, ‘clean’ and in
the correct format for statistical analysis.
Data Relevancy
• The random variables and the data selected for analysis must be problem specific. The
right choice of variables must be made to ensure that the statistical analysis
addresses the business problem under investigation.
Data Cleaning
• Data, when initially captured, is often ‘dirty’. Data must be checked for typographic
errors, out-of-range values and outliers. When dirty data is used in statistical analysis,
the results will produce poor-quality information for management decision making.
Data Enrichment
• Data can often be made more relevant to the management problem by transforming it
into more meaningful measures. This is known as data enrichment. For example, the
variables ‘turnover’ and ‘store size’ can be combined to create a new variable –
‘turnover per square metre’ – that is more relevant to analyse and compare stores’
performances.
Data Visualisation
Population and Sample
POPULATION
A population consists of all the items or
individuals about which you want to draw a
conclusion. The population is the “large
group”
SAMPLE
A sample is the portion of a population
selected for analysis. The sample is the “small
group”
Population vs. Sample
[Figure: a population and the sample drawn from it]
Probability Sample:
Simple Random Sample
• Every individual or item from the frame has an
equal chance of being selected
Examples
• Wrong sampling practice: the 1936 US Presidential Election. The Literary
Digest mailed questionnaires to about 10,000,000 people (roughly 2.4 million
responded). The resulting sample was heavily biased, and the magazine's
prediction was wrong.
Graphical Statistics
Before you do anything with your data,
look at it
Types of Variables
Variables
• Categorical (defined categories)
  Examples: Marital Status, Political Party, Eye Color
• Numerical
  • Discrete (counted items)
    Examples: Number of Children, Defects per hour
  • Continuous (measured characteristics)
    Examples: Weight, Voltage
Levels of Measurement
A nominal scale classifies data into distinct
categories in which no ranking is implied.
Levels of Measurement (con’t.)
An ordinal scale classifies data into distinct categories in which
ranking is implied
Interval and Ratio Scales
Visualizing Categorical Data:
The Bar Chart
In a bar chart, each category is shown as a bar whose length represents
the amount, frequency or percentage of values falling into that category.
These values come from the summary table of the variable.
Banking Preference
Visualizing Categorical Data:
The Pie Chart
The pie chart is a circle broken up into slices that represent categories.
The size of each slice of the pie varies according to the percentage in
each category.
Banking Preference
Banking Preference?                   %
ATM                                 16%
Automated or live telephone          2%
Drive-through service at branch     17%
In person at branch                 41%
Internet                            24%
[Figure: pie chart of the percentages above]
Visualizing Numerical Data:
The Histogram
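[Figure: frequency table (Class, Frequency, Relative Frequency, Percentage) and the
corresponding histogram; horizontal axis: class midpoints 5, 15, 25, 35, 45, 55, More;
vertical axis: Frequency]
(In a percentage histogram the vertical axis would be defined to show the percentage of
observations per class)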
Visualizing Two Numerical
Variables: Scatter Plot
Visualizing Two Numerical
Variables: Time Series Plot
Year    Number of Franchises
1996    43
1997    54
1998    60
1999    73
2000    82
2001    95
2002    107
2003    99
2004    95
[Figure: time series plot "Number of Franchises, 1996-2004"; horizontal axis: Year;
vertical axis: Number of Franchises]
Examples
Construct a bar chart and a pie chart for the data below. Draw conclusions.
Appliances       % of electricity consumption
AC               18
Dryers           5
Washers          24
Computers        1
Cooking          2
Dishes           2
Freezers         2
Lighting         16
Fridges          9
Heating          7
Water heating    8
TV etc           6
DESCRIPTIVE STATISTICS
The central tendency is the extent to which all the data values
group around a typical or central value.
Measures of Central Tendency:
The Mean
• The arithmetic mean (often just called the “mean”) is
the most common measure of central tendency
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
where n is the sample size and X_1, ..., X_n are the observed values.
Measures of Central Tendency:
The Mean (continued)
[Figure: two dot plots on an axis from 11 to 20]
Values 11, 12, 13, 14, 15: Mean = (11 + 12 + 13 + 14 + 15)/5 = 65/5 = 13
Values 11, 12, 13, 14, 20: Mean = (11 + 12 + 13 + 14 + 20)/5 = 70/5 = 14
Measures of Central Tendency:
The Median
[Figure: two dot plots on an axis from 11 to 20]
Median = 13 in both data sets
Measures of Central Tendency:
Locating the Median
• The location of the median when the values are in numerical order (smallest to largest):
$$\text{Median position} = \frac{n+1}{2} \text{ position in the ordered data}$$
• If the number of values is odd, the median is the middle number
• If the number of values is even, the median is the average of the two middle numbers
Measures of Central Tendency:
The Mode
• Value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical (nominal) data
• There may be no mode
• There may be several modes
[Figure: two dot plots. On the axis from 0 to 14: Mode = 9. On the axis from 0 to 6: No Mode.]
Measures of Central Tendency:
Review Example
House Prices:
$2,000,000
$  500,000
$  300,000
$  100,000
$  100,000
Sum $3,000,000
Mean: ($3,000,000/5) = $600,000
Median: middle value of ranked data = $300,000
Mode: most frequent value = $100,000
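The review example can be checked with a few lines of Python. This is a minimal sketch using the standard-library `statistics` module; the variable names are illustrative only.

```python
import statistics

# House prices from the review example (in dollars).
house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(house_prices)      # sum of the values divided by their count
median_price = statistics.median(house_prices)  # middle value of the ranked data
mode_price = statistics.mode(house_prices)      # most frequently occurring value

print(f"Mean:   ${mean_price:,.0f}")
print(f"Median: ${median_price:,.0f}")
print(f"Mode:   ${mode_price:,.0f}")
```

The single $2,000,000 house pulls the mean well above the median, which is why the median is often preferred for house prices.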
Measures of Central Tendency:
Which Measure to Choose?
The mean is generally used, unless extreme values
(outliers) exist.
The median is often used when outliers exist, because
it is not sensitive to extreme values. For example, median
home prices may be reported for a region.
In some situations it makes sense to report both the
mean and the median.
Measures of Central Tendency:
Summary
Central Tendency
• Arithmetic Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
• Median: middle value in the ordered array
• Mode: most frequently observed value
Measures of Variation
Variation
Example:
[Figure: dot plot on an axis from 0 to 14; smallest value 1, largest value 13]
Range = 13 - 1 = 12
Measures of Variation:
Why The Range Can Be Misleading
Ignores the way in which data are distributed
[Figure: two dot plots on an axis from 7 to 12 with differently distributed values]
Range = 12 - 7 = 5 in both cases
Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
Measures of Variation:
The Sample Variance
• Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$
Where $\bar{X}$ = arithmetic mean
n = sample size
Xi = ith value of the variable X
Measures of Variation:
The Sample Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Is the square root of the variance
• Has the same units as the original data
• Sample standard deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$$
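The two formulas can be traced step by step in code. The sketch below applies them to the house price data used elsewhere in these slides and cross-checks the result against Python's `statistics` module; the function names are illustrative, not part of any library.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std_dev(values):
    """Square root of the sample variance (same units as the data)."""
    return math.sqrt(sample_variance(values))

house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]
s2 = sample_variance(house_prices)
s = sample_std_dev(house_prices)
print(f"Sample variance:           {s2:,.0f}")
print(f"Sample standard deviation: {s:,.0f}")

# Cross-check against the standard library, which also divides by n - 1.
assert math.isclose(s2, statistics.variance(house_prices))
assert math.isclose(s, statistics.stdev(house_prices))
```

Dividing by n - 1 rather than n is what distinguishes the sample formulas from the population formulas.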
Measures of Variation:
Comparing Standard Deviations
Smaller standard deviation
Locating Extreme Outliers:
Z-Score
$$Z = \frac{X - \bar{X}}{S}$$
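As a small illustration of the Z-score formula, the sketch below standardises each value of the house price data used elsewhere in these slides. The helper name `z_scores` is an assumption made for this example.

```python
import statistics

def z_scores(values):
    """Return (X - mean) / S for every value, using the sample standard deviation S."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]

house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]
scores = z_scores(house_prices)
for price, z in zip(house_prices, scores):
    print(f"{price:>9,}  Z = {z:+.3f}")
```

The largest price lies 1.75 standard deviations above the mean, so the further a Z-score is from zero, the more extreme the value.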
General Descriptive Stats Using
Microsoft Excel Functions
General Descriptive Stats Using
Microsoft Excel Data Analysis Tool
1. Select Data.
2. Select Data Analysis.
3. Select Descriptive Statistics and click OK.
General Descriptive Stats Using
Microsoft Excel
Excel output
Microsoft Excel descriptive statistics output, using the house price data:
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000

House Prices
Mean                  600000
Standard Error        357770.8764
Median                300000
Mode                  100000
Standard Deviation    800000
Sample Variance       640,000,000,000
Kurtosis              4.1301
Skewness              2.0068
Range                 1900000
Minimum               100000
Maximum               2000000
Sum                   3000000
Count                 5
Numerical Descriptive Measures for
a Population
Descriptive statistics discussed previously described a sample, not the population.
Important population parameters are the population mean, variance, and standard
deviation.
Numerical Descriptive Measures
for a Population: The mean µ
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$
Where μ = population mean
N = population size
Xi = ith value of the variable X
Numerical Descriptive Measures For A
Population: The Variance σ²
• Average of squared deviations of values from the mean
• Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
• Population standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
Sample statistics versus population
parameters
Quartiles
• Quartiles split the ranked data into 4 segments with an
equal number of values per segment
Q1 Q2 Q3
The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger
Q2 is the same as the median (50% of the observations
are smaller and 50% are larger)
Only 25% of the observations are greater than the third
quartile
Quartile Measures:
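Locating Quartiles
Find a quartile by determining the value in the appropriate position in the ranked data, where
First quartile position: Q1 = (n+1)/4 ranked value
Second quartile position: Q2 = (n+1)/2 ranked value
Third quartile position: Q3 = 3(n+1)/4 ranked value
where n is the number of observed values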
Quartile Measures:
Calculation Rules
• When calculating the ranked position use the following rules
• If the result is a whole number then it is the ranked position to use
• If the result is a fractional half (e.g. 2.5, 7.5, 8.5, etc.) then average the two
corresponding data values.
• If the result is not a whole number or a fractional half then round the result to
the nearest integer to find the ranked position.
Quartile Measures
Calculating The Quartiles: Example
(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so Q1 = (12+13)/2 = 12.5
• Measures like Q1, Q3, and IQR that are not influenced by
outliers are called resistant measures
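The calculation rules above can be sketched in code. The positions (n+1)/4, (n+1)/2 and 3(n+1)/4 follow the Q1 example; the nine data values are an assumed sample, chosen only so that the second and third ranked values are 12 and 13 as in the example, and the function names are illustrative.

```python
import math

def value_at_position(ranked, position):
    """Apply the three ranked-position rules to a 1-based position."""
    if position == int(position):
        # Whole number: use that ranked position directly.
        return ranked[int(position) - 1]
    lower_index = math.floor(position) - 1
    if position - math.floor(position) == 0.5:
        # Fractional half: average the two corresponding data values.
        return (ranked[lower_index] + ranked[lower_index + 1]) / 2
    # Otherwise: round to the nearest integer position.
    return ranked[math.floor(position + 0.5) - 1]

def quartiles(values):
    """Return (Q1, Q2, Q3); assumes at least two values."""
    ranked = sorted(values)
    n = len(ranked)
    q1 = value_at_position(ranked, (n + 1) / 4)
    q2 = value_at_position(ranked, (n + 1) / 2)
    q3 = value_at_position(ranked, 3 * (n + 1) / 4)
    return q1, q2, q3

# ASSUMED sample (n = 9), not taken from the slides.
data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1, q2, q3 = quartiles(data)
print("Q1 =", q1, " Q2 =", q2, " Q3 =", q3, " IQR =", q3 - q1)
```

Statistical packages use several slightly different quartile conventions, so their output may differ from these rules for small samples.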
Calculating The Interquartile
Range
Example:
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
(25% of the observations fall in each of the four segments)
Interquartile range = Q3 – Q1 = 57 – 30 = 27
The Five-Number Summary
The five numbers that help describe the center, spread
and shape of data are:
• Xsmallest
• First Quartile (Q1)
• Median (Q2)
• Third Quartile (Q3)
• Xlargest
Five Number Summary and
The Boxplot
• The Boxplot: A Graphical display of the data based
on the five-number summary:
Xsmallest -- Q1 -- Median -- Q3 -- Xlargest
Example:
Five Number Summary:
Shape of Boxplots
• If data are symmetric around the median then the box and
central line are centered between the endpoints
Distribution Shape and
The Boxplot
[Figure: three boxplots, each marked with Q1, Q2 and Q3, illustrating different distribution shapes]
Measures Of The Relationship
Between Two Numerical Variables
• The Covariance
• The Coefficient of Correlation
The Covariance
• The covariance measures the strength of the linear
relationship between two numerical variables (X & Y)
$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$
• Only concerned with the strength of the relationship
• No causal effect is implied
Interpreting Covariance
• Covariance between two variables:
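cov(X,Y) > 0: X and Y tend to move in the same direction
cov(X,Y) < 0: X and Y tend to move in opposite directions
cov(X,Y) = 0: X and Y have no linear relationship (independent variables have
zero covariance, but zero covariance alone does not prove independence)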
Coefficient of Correlation
• Measures the relative strength of the linear
relationship between two numerical variables
• Sample coefficient of correlation:
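$$r = \frac{\text{cov}(X, Y)}{S_X S_Y}$$
where
$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$
$$S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} \qquad S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}}$$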
Features of the
Coefficient of Correlation
• The population coefficient of correlation is referred to as ρ.
• The sample coefficient of correlation is referred to as r.
• Both ρ and r have the following features:
• Unit free
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship
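The covariance and correlation formulas can be illustrated with a short sketch. The data below are hypothetical values invented for this example, and the function names are not taken from any library.

```python
import math

def sample_covariance(xs, ys):
    """Sum of cross-products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_std_dev(values):
    """The sample variance is the covariance of a variable with itself."""
    return math.sqrt(sample_covariance(values, values))

def correlation(xs, ys):
    """r = cov(X, Y) / (S_X * S_Y); unit free and between -1 and 1."""
    return sample_covariance(xs, ys) / (sample_std_dev(xs) * sample_std_dev(ys))

# HYPOTHETICAL data for illustration only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # perfect positive linear relationship
y_down = [10, 8, 6, 4, 2]  # perfect negative linear relationship
y_mixed = [2, 1, 4, 3, 5]  # positive but imperfect relationship

for label, y in [("up", y_up), ("down", y_down), ("mixed", y_mixed)]:
    print(f"{label:>5}: cov = {sample_covariance(x, y):+.2f}, r = {correlation(x, y):+.2f}")
```

Rescaling Y (for example from dollars to cents) changes the covariance but leaves r unchanged, which is why r is the better measure of relative strength.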
Scatter Plots of Sample Data
with Various Coefficients of
Correlation
[Figure: five scatter plots of Y against X, with r = -1, r = -.6, r = +1, r = +.3 and r = 0]
The Coefficient of Correlation Using
Microsoft Excel Function
The Coefficient of Correlation Using
Microsoft Excel Data Analysis Tool
1. Select Data
2. Choose Data Analysis
3. Choose Correlation and click OK
The Coefficient of Correlation
Using Microsoft Excel
• Simple event
• An event described by a single characteristic
• e.g., A day in January from all days in 2019
• Joint event
• An event described by two or more characteristics
• e.g. A day in January that is also a Wednesday from all days in 2019
• Complement of an event A (denoted A’)
• All events that are not part of event A
• e.g., All days from 2019 that are not in January
Sample Space
The Sample Space is the collection of all
possible events (outcomes)
e.g. All 6 faces of a die:
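e.g. All days in 2019, classified by month and weekday (contingency table):
            Jan.    Not Jan.    Total
Wed.           5          47       52
Not Wed.      26         287      313
Total         31         334      365
• Decision Trees: the same sample space drawn as a tree, with the number of
sample space outcomes at the end of each branch
All Days In 2019
   Jan.: Wed. 5; Not Wed. 26
   Not Jan.: Wed. 47; Not Wed. 287
Definition: Simple (Marginal)
Probability
• Simple Probability refers to the probability of a
simple event.
• ex. P(Jan.)
• ex. P(Wed.)
P(Jan.) = 31 / 365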
Definition: Joint Probability
• Joint Probability refers to the probability of an
occurrence of two or more events (joint event).
• ex. P(Jan. and Wed.)
• ex. P(Not Jan. and Not Wed.)
Collectively Exhaustive Events
• Collectively exhaustive events
• One of the events must occur
• The set of events covers the entire sample space
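e.g. A = Weekday; B = Weekend; C = January; D = Spring.
Events A, B, C and D are collectively exhaustive (but not mutually exclusive:
a weekday can be in January or in spring). Events A and B are collectively
exhaustive and also mutually exclusive.
Joint and marginal probabilities in a contingency table:
Event    B1              B2              Total
A1       P(A1 and B1)    P(A1 and B2)    P(A1)
• Events A and B are independent when the probability of
one event is not affected by the fact that the other event
has occurred:
$$P(A \mid B) = P(A)$$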
Multiplication Rules
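$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$
$$P(A) = P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + \cdots + P(A \mid B_k)P(B_k)$$
• Where B1, B2, …, Bk are k mutually exclusive and collectively exhaustive events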
Bayes’ Theorem
$$P(B_i \mid A) = \frac{P(A \mid B_i)P(B_i)}{P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + \cdots + P(A \mid B_k)P(B_k)}$$
• where:
Bi = ith event of k mutually exclusive and collectively
exhaustive events
A = new event that might impact P(Bi)
Bayes’ Theorem Example
• A drilling company has estimated a 40% chance of
striking oil for their new well.
• A detailed test has been scheduled for more
information. Historically, 60% of successful wells
have had detailed tests, and 20% of unsuccessful
wells have had detailed tests.
• Given that this well has been scheduled for a
detailed test, what is the probability
that the well will be successful?
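Bayes’ Theorem Example
(continued)
Let S = successful well, U = unsuccessful well and D = detailed test, so that
P(S) = 0.4, P(U) = 0.6, P(D | S) = 0.6 and P(D | U) = 0.2.
Event    Prior    Conditional    Joint                Revised
S        0.4      0.6            (0.4)(0.6) = 0.24    0.24/0.36 = 0.667
U        0.6      0.2            (0.6)(0.2) = 0.12    0.12/0.36 = 0.333
Sum = 0.36
Given the detailed test, the revised probability of a successful well is 0.667.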
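The drilling example can be reproduced with a short sketch of Bayes' theorem. The function name `bayes_posterior` is illustrative; the inputs are the probabilities stated in the example (prior 0.4 and 0.6, detailed-test rates 0.6 and 0.2).

```python
def bayes_posterior(priors, likelihoods):
    """Return P(B_i | A) for each event B_i, given P(B_i) and P(A | B_i)."""
    joints = [prior * likelihood for prior, likelihood in zip(priors, likelihoods)]
    total = sum(joints)  # P(A), the denominator of Bayes' theorem
    return [joint / total for joint in joints]

priors = [0.4, 0.6]       # P(successful), P(unsuccessful)
likelihoods = [0.6, 0.2]  # P(detailed test | successful), P(detailed test | unsuccessful)

posterior_success, posterior_failure = bayes_posterior(priors, likelihoods)
print(f"P(successful | detailed test)   = {posterior_success:.3f}")
print(f"P(unsuccessful | detailed test) = {posterior_failure:.3f}")
```

The same function can be applied to other problems with mutually exclusive and collectively exhaustive events by changing the two input lists.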
Exercises
• All athletes at the Olympic games are tested for performance-
enhancing steroid drug use. The imperfect test gives positive results
(indicating drug use) for 90% of all steroid-users but also (and
incorrectly) for 2% of those who do not use steroids. Suppose that 5%
of all registered athletes use steroids. If an athlete is tested positive,
what is the probability that he/she uses steroids?
Chapter Summary
In this chapter we discussed:
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
Thus
$$\mu = \bar{x} \pm z \frac{\sigma}{\sqrt{n}} = 78 \pm 1.96(1.2124) = 78 \pm 2.376$$
This gives a lower and an upper confidence limit about µ:
lower 95% confidence limit of 78 – 2.376 = 75.624 ($75.62)
upper 95% confidence limit of 78 + 2.376 = 80.376 ($80.38).
Management Interpretation
• We can be 95% confident that the true mean value of all grocery
purchases by grocery shoppers in PicknPay lies between $75.62 and
$80.38.
The Precision of a Confidence
Interval
• The width of a confidence interval is a measure of its precision. The
narrower the confidence interval, the more precise is the interval
estimate, and vice versa.
• The width of a confidence interval is influenced by:
• the specified confidence level
• the sample size
• the population standard deviation.
Hwange Coal Mine
• A human resources director at the Hwange Coal mine wishes to
estimate the true mean employment period of all coalminers. From a
random sample of 144 coalminers’ records, the sample mean
employment period was found to be 88.4 months. Employment periods are
assumed to be normally distributed with a population standard deviation
of 21.5 months. Find the 95% confidence interval estimate for the actual
mean employment period (in months) for all miners employed in coal
mines.
Solution
• Given x̄ = 88.4 months, σ = 21.5 months and n = 144 miners. Find the standard
error of the sample mean.
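One way to carry the solution through is sketched below, using the interval formula x̄ ± z·σ/√n. The function name is illustrative, and the z-value is taken from Python's `statistics.NormalDist` (about 1.96 for 95% confidence) rather than from a printed table.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population sigma is known."""
    standard_error = sigma / math.sqrt(n)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    margin = z * standard_error
    return standard_error, sample_mean - margin, sample_mean + margin

# Hwange Coal Mine values: sample mean 88.4, sigma 21.5, n 144.
se, lower, upper = z_confidence_interval(88.4, 21.5, 144)
print(f"Standard error: {se:.4f} months")
print(f"95% confidence interval: {lower:.2f} to {upper:.2f} months")
```

Raising the confidence level or lowering the sample size widens the interval, which matches the remarks on precision above.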
HYPOTHESIS A statement about the value of a population parameter developed for the purpose of testing.
HYPOTHESIS TESTING A procedure based on sample evidence and probability theory to determine whether the
hypothesis is a reasonable statement.
TEST STATISTIC A value, determined from sample information, used to determine whether to reject the null hypothesis.
CRITICAL VALUE The dividing point between the region where the null hypothesis is rejected and the region where it is
not rejected.
Important Things to Remember about H0 and H1
• If we conclude 'do not reject H0', this does not necessarily mean that the null
hypothesis is true; it only suggests that there is not sufficient evidence to reject H0.
Rejecting the null hypothesis, then, suggests that the alternative hypothesis may be
true.
• Equality is always part of H0 (e.g. “=”, “≥”, “≤”).
• “≠”, “<” and “>” are always part of H1.
Keywords                 Inequality symbol    Part of
No more than             ≤                    H0
At least                 ≥                    H0
Has increased            >                    H1
Is there difference?     ≠                    H1
MEAN
PROPORTION
Testing for a Population Mean with a Known Population
Standard Deviation- Example
Step 2: Select the level of significance.
α = 0.01 as stated in the problem
Step 3: Select the test statistic.
Use Z-distribution since σ is known
Step 5: Make a decision and interpret the result.
Because 1.55 does not fall in the rejection region, H0 is not rejected.
We conclude that the population mean is not different from 200. So we
would report to the vice president of manufacturing that the sample
evidence does not show that the production rate at the plant has
changed from 200 per week.
Testing for a Population Mean with a Known Population
Standard Deviation- Another Example
Suppose in the previous problem the vice president wants to know whether
there has been an increase in the number of units assembled. To put it
another way, can we conclude, because of the improved production
methods, that the mean number of desks assembled in the last 50 weeks
was more than 200?
Recall: σ=16, n=50, α=.01
Step 1: State the null hypothesis and the alternate hypothesis.
H0: µ ≤ 200
H1: µ > 200
(note: keyword in the problem “an increase”)
Step 4: Formulate the decision rule.
Reject H0 if Z > Zα
Step 5: Make a decision and interpret the result.
Because 1.55 does not fall in the rejection region, H0 is not
rejected. We conclude that the average number of desks
assembled in the last 50 weeks is not more than 200
• Type I Error:
• Rejecting the null hypothesis when it is actually true.
• The probability of a Type I error is denoted by the Greek letter “α”
• Also known as the significance level of a test
• Type II Error:
• “Accepting” the null hypothesis when it is actually false.
• The probability of a Type II error is denoted by the Greek letter “β”
• p-VALUE is the probability of observing a sample value
as extreme as, or more extreme than, the value
observed, given that the null hypothesis is true.
EXAMPLE p-Value
Recall the last problem where the hypothesis and decision rules were
set up as:
H0: µ ≤ 200
H1: µ > 200
Reject H0 if Z > Zα
where Z = 1.55 and Zα = 2.33
Reject H0 if p-value < α
0.0606 is not < 0.01, so H0 is not rejected
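The p-value in this example is the upper-tail area of the standard normal distribution beyond Z = 1.55. The following sketch computes it with Python's `statistics.NormalDist`; the function name is illustrative.

```python
from statistics import NormalDist

def upper_tail_p_value(z):
    """P(Z >= z) under the standard normal distribution (one-tailed, upper)."""
    return 1 - NormalDist().cdf(z)

z = 1.55      # computed test statistic from the example
alpha = 0.01  # significance level from the example

p_value = upper_tail_p_value(z)
critical_value = NormalDist().inv_cdf(1 - alpha)  # Z-alpha for an upper-tail test
reject_h0 = p_value < alpha

print(f"p-value        = {p_value:.4f}")
print(f"critical value = {critical_value:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Comparing the p-value with α and comparing Z with the critical value always lead to the same decision.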
Because -1.818 does not fall in the rejection region, H0 is not rejected at the .01
significance level. We have not demonstrated that the cost-cutting measures
reduced the mean cost per claim to less than $60. The difference of $3.58 ($56.42 -
$60) between the sample mean and the population mean could be due to sampling
error.
Tests Concerning Proportion using
the z-Distribution
• A Proportion is the fraction or percentage that indicates the part of the population or sample having a particular trait of interest.
• The sample proportion is denoted by p and is found by x/n
• It is assumed that the binomial assumptions discussed in Chapter 6 are met:
(1) the sample data collected are the result of counts;
(2) the outcome of an experiment is classified into one of two mutually exclusive categories—a “success” or a “failure”;
(3) the probability of a success is the same for each trial; and (4) the trials are independent
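• Both nπ and n(1 − π) are at least 5.
• When the above conditions are met, the normal distribution can be used as an approximation to the binomial distribution
• The test statistic is computed as follows:
$$z = \frac{p - \pi}{\sqrt{\pi(1 - \pi)/n}}$$
where π is the hypothesized population proportion, p the sample proportion and n the sample size.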
Test Statistic for Testing a Single Population
Proportion - Example
EXAMPLE
Suppose prior elections in a certain state indicated it is
necessary for a candidate for governor to receive at least
80 percent of the vote in the northern section of the state
to be elected. The incumbent governor is interested in
assessing his chances of returning to office and plans to
conduct a survey of 2,000 registered voters in the northern
section of the state. Using the hypothesis-testing
procedure, assess the governor’s chances of reelection.
H0: π ≥ .80
H1: π < .80
(note: keyword in the problem “at least”)
Step 4: Formulate the decision rule.
Reject H0 if Z < -Zα
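A lower-tail test of this kind can be sketched as follows. The survey result and the significance level are not given in the text above, so the count of 1,550 supporters and α = 0.05 are assumptions made purely for illustration; only n = 2,000 and π = 0.80 come from the example.

```python
import math
from statistics import NormalDist

def one_proportion_z(successes, n, pi0):
    """z = (p - pi0) / sqrt(pi0 * (1 - pi0) / n), with p = successes / n."""
    p = successes / n
    return (p - pi0) / math.sqrt(pi0 * (1 - pi0) / n)

n = 2000      # sample size from the example
pi0 = 0.80    # hypothesized proportion from the example
x = 1550      # ASSUMED number of voters supporting the governor
alpha = 0.05  # ASSUMED significance level

# Normal approximation check: n*pi0 and n*(1 - pi0) should both be at least 5.
assert n * pi0 >= 5 and n * (1 - pi0) >= 5

z = one_proportion_z(x, n, pi0)
critical_value = NormalDist().inv_cdf(alpha)  # negative value for a lower-tail test
reject_h0 = z < critical_value

print(f"z = {z:.3f}, critical value = {critical_value:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

With these assumed inputs the sample proportion is 0.775, far enough below 0.80 for H0 to be rejected; a different survey count could lead to the opposite decision.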
EXAMPLE
The U-Scan facility was recently installed at the Byrne Road Food-Town location. The store manager would like to know if the
mean checkout time using the standard checkout method is longer than using the U-Scan. She gathered the following sample
information. The time is measured from when the customer enters the line until their bags are in the cart. Hence the time
includes both waiting in line and checking out.
Reject H0 if Z > Zα
Z > 2.33
Step 5: Compute the value of z and make a decision
$$z = \frac{\bar{X}_s - \bar{X}_u}{\sqrt{\dfrac{\sigma_s^2}{n_s} + \dfrac{\sigma_u^2}{n_u}}} = \frac{5.5 - 5.3}{\sqrt{\dfrac{0.40^2}{50} + \dfrac{0.30^2}{100}}} = \frac{0.2}{0.064} = 3.13$$
The computed value of 3.13 is larger than the critical value of 2.33. Our decision is to reject the null hypothesis. The difference of .20 minutes between the mean checkout times of the standard method and the U-Scan method is too large to have occurred by chance. We conclude the U-Scan method is faster.
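The two-sample z computation can be checked with a few lines of code. This sketch uses the values that appear in the formula (means 5.5 and 5.3, standard deviations 0.40 and 0.30, sample sizes 50 and 100); the function name is illustrative. Without rounding the denominator to 0.064 the statistic comes out slightly below the reported 3.13.

```python
import math

def two_sample_z(mean1, sigma1, n1, mean2, sigma2, n2):
    """z statistic for the difference of two means with known standard deviations."""
    standard_error = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

# Standard checkout (s) versus U-Scan (u), values taken from the worked formula.
z = two_sample_z(5.5, 0.40, 50, 5.3, 0.30, 100)
critical_value = 2.33  # upper-tail critical value used in the example

print(f"z = {z:.2f}")
print("Reject H0" if z > critical_value else "Do not reject H0")
```

The small difference from 3.13 does not affect the decision.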
Two-Sample Tests about Proportions
We investigate whether two samples came from populations with an equal proportion of successes. The
two samples are pooled using the following formula.
$$p_c = \frac{X_1 + X_2}{n_1 + n_2}$$
where X1 and X2 are the numbers of successes in the two samples and n1 and n2 are the sample sizes.
The value of the test statistic is computed from the following formula.
$$z = \frac{p_1 - p_2}{\sqrt{\dfrac{p_c(1 - p_c)}{n_1} + \dfrac{p_c(1 - p_c)}{n_2}}}$$
EXAMPLE
Manelli Perfume Company recently developed a new fragrance that it
plans to market under the name Heavenly. A number of market
studies indicate that Heavenly has very good market potential. The
Sales Department at Manelli is particularly interested in whether there
is a difference in the proportions of younger and older women who
would purchase Heavenly if it were marketed. Samples are collected
from each of these independent groups. Each sampled woman was
asked to smell Heavenly and indicate whether she likes the fragrance
well enough to purchase a bottle.
Step 1: State the null and alternate hypotheses.
(Keyword: “Is there a difference”)
For two population proportions:
H0: π1 = π2
H1: π1 ≠ π2
For two population means:
H0: µ1 = µ2
H1: µ1 ≠ µ2
Step 5: Compute the value of t and make a decision
Two-Sample Tests of Hypothesis:
Dependent Samples
Dependent samples are samples that are paired or
related in some fashion.
For example:
• If you wished to buy a car you would look at the same car at two (or more) different
dealerships and compare the prices.
• If you wished to measure the effectiveness of a new diet you would weigh the dieters
at the start and at the finish of the program.
$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$
Where
$\bar{d}$ is the mean of the differences
$s_d$ is the standard deviation of the differences
n is the number of pairs (differences)
EXAMPLE
Nickel Savings and Loan wishes to compare the two companies it
uses to appraise the value of residential homes. Nickel Savings
selected a sample of 10 residential properties and scheduled both
firms for an appraisal. The results are reported in $000.
At the .05 significance level, can we conclude there is a difference in
the mean appraised values of the homes?
Hypothesis Testing Involving Paired
Observations - Example
Step 1: State the null and alternate hypotheses.
H0: µd = 0
H1: µd ≠ 0
Reject H0 if t > 2.262 or t < -2.262
The computed value of t (3.305) is greater than the higher critical value (2.262), so our decision is to reject the null hypothesis.
We conclude that there is a difference in the mean appraised values of the homes.
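The paired t computation can be sketched as below. The appraisal table itself is not reproduced in these notes, so the ten pairs of values are an assumed data set, chosen to be consistent with the reported t of about 3.305; the names `firm_a`, `firm_b` and `paired_t` are hypothetical.

```python
import math
import statistics

def paired_t(first, second):
    """t = d_bar / (s_d / sqrt(n)), computed on the pairwise differences."""
    diffs = [a - b for a, b in zip(first, second)]
    d_bar = statistics.mean(diffs)  # mean of the differences
    s_d = statistics.stdev(diffs)   # sample standard deviation of the differences
    return d_bar / (s_d / math.sqrt(len(diffs)))

# ASSUMED appraisals in $000 for the same 10 homes (illustrative, not from the slides).
firm_a = [235, 210, 231, 242, 205, 230, 231, 210, 225, 249]
firm_b = [228, 205, 219, 240, 198, 223, 227, 215, 222, 245]

t = paired_t(firm_a, firm_b)
critical_value = 2.262  # two-tailed critical value quoted in the example

print(f"t = {t:.3f}")
print("Reject H0" if abs(t) > critical_value else "Do not reject H0")
```

Because each home is appraised by both firms, the test works on the ten differences rather than on two independent samples.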
Analysis of Variance
The F Distribution
Examples:
• Two Barth shearing machines are set to produce steel bars of the same length. The bars, therefore, should have the
same mean length. We want to ensure that in addition to having the same mean length they also have similar
variation.
• The mean rate of return on two types of common stock may be the same, but there may be more variation in the
rate of return in one than the other. A sample of 10 technology and 10 utility stocks shows the same mean rate of
return, but there is likely more variation in the technology stocks.
• A study by the marketing department for a large newspaper found that men and women spent about the same
amount of time per day reading the paper. However, the same report indicated there was nearly twice as much
variation in time spent per day among the men than the women.
Test for Equal Variances - Example
Lammers Limos offers limousine service from the city hall in Toledo, Ohio, to Metro Airport in
Detroit. Sean Lammers, president of the company, is considering two routes. One is via
U.S. 25 and the other via I-75. He wants to study the time it takes to drive to the airport using
each route and then compare the results. He collected the following sample data, which is
reported in minutes.
Using the .10 significance level, is there a difference in the variation in the driving times for the two
routes?
Step 1: The hypotheses are:
H0: σ1² = σ2²
H1: σ1² ≠ σ2²
Step 2: The significance level is .10.
Step 3: The test statistic is the F distribution.
Step 4: State the decision rule.
Reject H0 if F > Fα/2,v1,v2
F > F.10/2,7-1,8-1
F > F.05,6,7
F > 3.87
Test for Equal Variances - Example
Step 5: Compute the value of F and make a decision
The decision is to reject the null hypothesis, because the computed F value (4.23) is larger than the critical value (3.87).
We conclude that there is a difference in the variation of the travel times along the two routes.
Comparing Means of Two or More
Populations
The F distribution is also used for testing whether two or more sample means came from the same or equal populations.
Assumptions:
• The sampled populations follow the normal distribution.
• The populations have equal standard deviations.
• The samples are randomly selected and are independent.
The Null Hypothesis is that the population means are the same. The Alternative Hypothesis is that at least one of the means is different.
H0: µ1 = µ2 =…= µk
H1: The means are not all equal
Reject H0 if F > Fα,k-1,n-k
Comparing Means of Two or More Populations –
Example
F > F.05,k-1,(b-1)(k-1)
F > F.05,5-1,(5-1)(4-1)
F > F.05,4,12
F > 3.26
• The variable being studied is referred to as the response variable.
• One way to study interaction is by plotting factor means in a graph called an interaction plot.
Of the three questions, we are most interested in the test for
interactions. To put it another way, does a particular
route/driver combination result in significantly faster (or
slower) driving times? Also, the results of the hypothesis test
for interaction affect the way we analyze the route and driver
questions.
Example – ANOVA with Replication
Route
Driver
One-way ANOVA for Each Driver
H0: Route travel times are equal.
Conclusion:
The route travel times are all equal for Deans
and Ormson (at the 0.05 significance level).
The route travel times are not all equal for Filbeck,
Snaverly and Zollaco.
Linear Regression and
Correlation
Regression Analysis - Introduction
• Recall in Chapter 4 the idea of showing the relationship between two variables with a scatter diagram was introduced.
• In that case we showed that, as the age of the buyer increased, the amount spent for the vehicle also increased.
• In this chapter we carry this idea further. Numerical measures to express the strength of relationship between two variables are developed.
• In addition, an equation is used to express the relationship between variables, allowing us to estimate one variable on the basis of another.
EXAMPLES
1. Is there a relationship between the amount Healthtex spends per month on advertising and its sales in the month?
2. Can we base an estimate of the cost to heat a home in January on the number of square feet in the home?
3. Is there a relationship between the miles per gallon achieved by large pickup trucks and the size of the engine?
4. Is there a relationship between the number of hours that students studied for an exam and the score earned?
• Correlation Analysis is the study of the relationship between variables. It is also defined as a group of techniques to measure the association between two
variables.
• Scatter Diagram is a chart that portrays the relationship between the two variables. It is the usual first step in correlation analysis.
• The Independent Variable provides the basis for estimation. It is the predictor variable.
Scatter Diagram Example
The sales manager of Copier Sales of America, which has a large sales force throughout the United States and Canada, wants to
determine whether there is a relationship between the number of sales calls made in a month and the number of copiers
sold that month. The manager selects a random sample of 10 representatives and determines the number of sales calls each
representative made last month and the number of copiers sold.
The Coefficient of Correlation, r
The Coefficient of Correlation (r) is a measure of the strength of the relationship between two variables.
• It shows the direction and strength of the linear relationship between two interval or ratio-scale variables
• Negative values indicate an inverse relationship and positive values indicate a direct relationship.
Correlation Coefficient - Example
EXAMPLE
Using the Copier Sales of America data, for which a
scatterplot is shown below, compute the
correlation coefficient and coefficient of
determination.
Using the formula:
Reject H0 if:
The computed t (3.297) is within the rejection region, therefore, we will reject H 0. This means the correlation in the population
is not zero. From a practical standpoint, it indicates to the sales manager that there is correlation with respect to the
number of sales calls made and the number of copiers sold in the population of salespeople.
Regression Analysis
In regression analysis we use the independent variable (X) to estimate the dependent variable (Y).
• The relationship between the variables is linear.
• Both variables must be at least interval scale.
• The least squares criterion is used to determine the equation.
REGRESSION EQUATION An equation that expresses the linear relationship between two variables.
LEAST SQUARES PRINCIPLE Determining a regression equation by minimizing the sum of the squares of the vertical distances
between the actual Y values and the predicted values of Y.
$$b = \frac{n(\sum XY) - (\sum X)(\sum Y)}{n(\sum X^2) - (\sum X)^2}$$
$$a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}$$
Linear Regression Model
Regression Equation - Example
Recall the example involving Copier Sales of America.
The sales manager gathered information on the
number of sales calls made and the number of
copiers sold for a random sample of 10 sales
representatives. Use the least squares method to
determine a linear equation to express the
relationship between the two variables.
What is the expected number of copiers sold by a
representative who made 20 calls?
Step 1 – Find the slope (b) of the line
Step 2 – Find the y-intercept (a)
The regression equation is:
$$\hat{Y} = a + bX$$
$$\hat{Y} = 18.9476 + 1.1842X$$
$$\hat{Y} = 18.9476 + 1.1842(20)$$
$$\hat{Y} = 42.6316$$
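The least squares formulas for b and a can be applied directly in code. The sample table is not reproduced in these notes, so the ten (calls, copiers sold) pairs below are an assumed data set, chosen to be consistent with the reported slope of 1.1842; the function name is illustrative.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line Y-hat = a + b*X using the sums-based formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = sum_y / n - b * sum_x / n                                 # y-intercept
    return a, b

# ASSUMED data for 10 sales representatives (illustrative, not from the slides).
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

a, b = least_squares(calls, sold)
predicted = a + b * 20  # expected copiers sold for 20 calls

print(f"Y-hat = {a:.4f} + {b:.4f} X")
print(f"Predicted copiers sold for 20 calls: {predicted:.4f}")
```

With full floating-point precision the intercept is about 18.9474; the 18.9476 shown above results from using the slope rounded to 1.1842 in the hand calculation.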
Assumptions Underlying Linear
Regression
• For each value of X, there is a group of Y values, and these Y values are
normally distributed.
• The means of these normal distributions of Y values all lie on the straight
line of regression.
• The standard deviations of these normal distributions are equal.
• The Y values are statistically independent. This means that in the
selection of a sample, the Y values chosen for a particular X value do not
depend on the Y values for any other X values.
Time Series and
Forecasting
Time Series and its Components
A TIME SERIES is a collection of data recorded
over a period of time (weekly, monthly,
quarterly). Analysing this history helps
management make current decisions and plans
based on long-term forecasting. It usually
assumes that past patterns will continue
into the future.
Year t Sales ($ mil.)
2005 1 7
2006 2 10
2007 3 9
2008 4 11
2009 5 13
Seasonal Variation and Seasonal Index
SEASONAL INDEX
• A number, usually expressed in percent, that expresses the relative value of a season with respect to
the average for the year (100%)
• Ratio-to-moving-average method
• The method most commonly used to compute the typical seasonal
pattern
• It eliminates the trend (T), cyclical (C), and irregular (I) components from
the time series
Seasonal Index – An Example
EXAMPLE
The table below shows the quarterly sales for Toys
International for the years 2001 through 2006. The
sales are reported in millions of dollars. Determine a
quarterly seasonal index using the ratio-to-moving-
average method.
Given the deseasonalized linear equation for Toys International sales as Ŷ = 8.109 + 0.0899t, generate the seasonally adjusted forecast for each
of the quarters of 2010.
Ŷ = 8.109 + 0.0899(28)
Ŷ × SI = 10.62648 × 1.519