0% found this document useful (0 votes)
13 views221 pages

Introduction To Statistics..Final

The document provides an introduction to statistics, defining key concepts such as data, information, and the components of statistics including descriptive and inferential statistics. It discusses various data types, measurement scales, and the importance of data quality and collection methods in statistical analysis. Additionally, it emphasizes the role of statistical modeling and tools in decision-making across different management areas.

Uploaded by

joyropafadzo3
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
INTRODUCTION TO STATISTICS

Statistics: a set of mathematically based tools and techniques to transform raw (unprocessed) data into a few summary measures that represent useful and usable information to support effective decision making (Wegner 2015).
These summary measures are used to describe profiles (patterns) of data, test relationships between sets of data and identify trends in data over time.
Data v Information
Data: In simple terms data refers to raw values/facts.
For example
Daily stock prices
Amount of rainfall (mm) in each province
Assignment marks for Statistics students

Information refers to processed data


Information must be timely, accurate, relevant, adequate and easily
accessible.
• Statistical methods can be applied in any management area where data exists, e.g. Human Resources, Marketing, Finance and Operations, among others.

Data (raw values, facts) → Statistical analysis (mathematical tools, mathematical techniques) → Information (measures, summaries) → Decision making
Components of Statistics
Descriptive Statistics:
Condenses sample data into a few summary descriptive measures.
When large quantities of data have been gathered, there is a need to
organise, summarise and extract the essential information contained
within this data for communication to management.
Summary measures allow a user to identify profiles, patterns,
relationships and trends within the data.

Inferential Statistics
Generalises sample findings to the broader population.
Descriptive statistics only describes the behaviour of a random variable in a sample.
However, management is mainly concerned about the behaviour and characteristics of random variables in the population from which the sample was drawn.
Managers are therefore interested in the ‘bigger population picture’.
Inferential statistics is the area of statistics that allows managers to understand the population picture of a random variable based on the sample evidence.
Statistical Modelling

Build models of relationships between random variables.


Statistical modelling constructs equations between variables that are
related to each other.
These equations (called models) are then used to estimate or predict
values of one of these variables based on values of related variables.
They are extremely useful in forecasting decisions.
Statistics and computers
When doing the analysis or modelling, the following packages can be used:
Microsoft Excel
Stata
EViews
R
SPSS
Python
A Closer Look at Data
Data is the raw material of statistical analysis.
Data quality is therefore at the core of any analysis.
The quality of the output is usually determined by the quality of the input (garbage in, garbage out: GIGO).
Data quality is determined by the data source, the data type and the collection method.
Statistical Method
The choice of the most appropriate statistical method to use
depends firstly on the management problem to be addressed and
secondly on the type of data available.
Certain statistical methods are valid for certain data types only.
An incorrect choice of statistical method for a given data type can produce invalid statistical findings.
Data Types

a) Qualitative random variables


Generate categorical (non-numeric) response data.
The data is represented by categories only.
The following are examples of qualitative random variables with categories
as data:
The gender of a consumer is either male, female or transgender
An employee’s highest qualification is either a PhD, Masters, degree, diploma or certificate
A company operates in either the financial, retail, mining or industrial sector.
A consumer’s choice of mobile phone service provider is either Econet,
Telecel or Netone
b) Quantitative random variables
Generate numeric response data.
These are real numbers that can be manipulated using arithmetic
operations (add, subtract, multiply and divide).
The following are examples of quantitative random variables with real numbers as data:
• the age of an employee (e.g. 46 years; 28 years; 32 years)
• the weight or height of an object, or a distance
• the price of a product in different stores (e.g. $6.75; $7.45; $7.20; $6.99)
Numeric data can be further classified as either discrete or continuous.
i) Continuous Data
Continuous data is any number that can occur in an interval.
For example:
• The assembly time for a part can be between 27 minutes and 31 minutes (e.g. assembly time = 28.4 min).
• A passenger’s hand luggage can have a mass between 0.5 kg and 10 kg (e.g. 2.4 kg).
• The volume of fuel in a car tank can be between 0 litres and 55 litres (e.g. 42.38 litres).
ii) Discrete Data
Discrete data is whole number (or integer) data.
For example:
• The number of students in a class (e.g. 24; 37; 41; 46).
• The number of cars sold by a dealer in a month (e.g. 14; 27; 21; 16).
• The number of machine breakdowns in a shift (e.g. 4; 0; 6; 2).
Data Measurement Scales
A) Nominal Data
Nominal data is associated with categorical data.
If all the categories of a qualitative random variable are of equal importance,
then this categorical data is termed ‘nominal-scaled’.
Examples of nominal-scaled categorical data are:
• gender (1 = male; 2 = female)
• city of residence (1 = Harare; 2 = Moscow; 3 = Cape Town; 4 = Kyiv)
• home language (1 = Manyika; 2 = Karanga; 3 = Korekore; 4 = Zulu; 5 = Sotho)
• mode of commuter transport (1 = car; 2 = train; 3 = bus; 4 = taxi; 5 = bicycle)
• engineering profession (1 = chemical; 2 = electrical; 3 = civil; 4 = mechanical)
• race (1 = Black; 2 = White; 3 = Red; 4 = Other)
Nominal data is the weakest form of data to analyse since the codes assigned to
the various categories have no numerical properties.
Nominal data can only be counted (or tabulated).
This limits the range of statistical methods that can be applied to nominal-scaled
data to only a few techniques.
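A minimal sketch of what "counting" nominal data looks like in practice. The response list below is hypothetical (invented for illustration); only the transport codes come from the example above.

```python
from collections import Counter

# Codes for "mode of commuter transport" as listed in the notes.
labels = {1: "car", 2: "train", 3: "bus", 4: "taxi", 5: "bicycle"}

# Hypothetical survey responses (invented for this sketch).
responses = [1, 3, 3, 4, 1, 2, 3, 5, 4, 3]

# Counting (tabulating) is the valid operation for nominal codes.
counts = Counter(responses)
for code, label in labels.items():
    print(f"{label:<8}{counts[code]:>3}")

# The most frequent category is meaningful; an "average code" would not be,
# because the codes carry no numerical properties.
modal_code, modal_count = counts.most_common(1)[0]
print("Most common:", labels[modal_code], modal_count)
```

The sketch deliberately avoids computing a mean of the codes, since the codes are labels rather than quantities.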
B) Ordinal data
Ordinal data is also associated with categorical data, but has an implied ranking
between the different categories of the qualitative random variable.
Each consecutive category possesses either more or less of a given characteristic than the previous category.
Examples of ordinal-scaled categorical data are:
• size of clothing (1 = small; 2 = medium; 3 = large; 4 = extra large)
• product usage level (1 = light; 2 = moderate; 3 = heavy)
• income category (1 = lower; 2 = middle; 3 = upper)
• company size (1 = micro; 2 = small; 3 = medium; 4 = large)
• response to a survey question: ‘Rank your top three TV programmes in order of preference’ (1 = first choice; 2 = second choice; 3 = third choice)
Rank (ordinal) data is stronger than nominal data because the data possesses the
numeric property of order (but the distances between the ranks are not equal). It is
therefore still numerically weak data, but it can be analysed by more statistical
methods (i.e. from the field of non-parametric statistics) than nominal data.
C) Interval data

• Interval data is associated with numeric data and quantitative random variables.
• It is generated mainly from rating scales, which are used in survey questionnaires to measure respondents’ attitudes, motivations, preferences and perceptions.
• It uses bipolar adjectives (e.g. very low to extremely high, strongly disagree to strongly agree) with respect to a statement or an opinion.
Interval data possesses the two properties of rank-order (same as ordinal data) and distance in terms of ‘how much more or how much less’ an object possesses of a given characteristic.
However, it has no true zero point.
Therefore it is not meaningful to compare the ratio of interval-scaled values with one another. For example, it is not valid to conclude that a rating of 4 is twice as important as a rating of 2, that a rating of 1 is only one-third as important as a rating of 3, or that 20 °C is twice as hot as 10 °C.
0 °C does not mean the absence of temperature.
Interval data (rating scales) possesses sufficient numeric properties to be
treated as numeric data for the purpose of statistical analysis.
A much wider range of statistical techniques can therefore be applied to
interval data compared with nominal and ordinal data.
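The temperature point can be checked with a short sketch. It assumes the Kelvin scale as the ratio-scaled counterpart of Celsius (Kelvin is not mentioned in these notes; it is used here only because it has a true zero).

```python
# Celsius is an interval scale: 0 °C is an arbitrary reference point,
# not the absence of temperature. Kelvin has a true zero (ratio scale).
def celsius_to_kelvin(c):
    return c + 273.15

t_low, t_high = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = t_high - t_low

# The naive ratio of Celsius readings suggests "twice as hot" ...
naive_ratio = t_high / t_low

# ... but on a scale with a true zero the ratio is only about 1.035.
kelvin_ratio = celsius_to_kelvin(t_high) / celsius_to_kelvin(t_low)

print(difference, naive_ratio, round(kelvin_ratio, 4))
```

The difference of 10 degrees is the same on both scales, while the ratio changes with the choice of zero, which is why ratios of interval data are not interpreted.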
D) Ratio data
Ratio data consists of all real numbers associated with quantitative random variables.
Examples of ratio-scaled data are: employee ages (years), customer income (R), distance travelled (km), door height (cm), product mass (g), volume of liquid in a container (ml), machine speed (rpm), tyre pressure (psi), product prices ($), length of service (months) and number of shopping trips per month (0; 1; 2; 4; etc.).
Ratio data has all the properties of numbers (order, distance and an absolute origin of zero) that allow such data to be manipulated using all arithmetic operations (addition, subtraction, multiplication and division).
The zero origin property means that ratios can be computed (5 is half of 10, 4 is one-quarter of 16, 36 is twice as great as 18, etc.).
Ratio data is the strongest data for statistical analysis. Compared to the other data types (nominal, ordinal and interval), the greatest amount of statistical information can be extracted from it.
Also, more statistical methods can be applied to ratio data than to any other data type.
Data Sources
• Internal v External
• In a business set up, internal data is sourced from within a company.
• It is data that is generated during the normal course of business activities. As such, it is relatively
inexpensive to gather, readily available from company databases and potentially of good quality
(since it is recorded using internal business systems). Examples of internal data sources are: sales
vouchers, credit notes, accounts receivable, accounts payable and asset registers for financial data,
production cost records, stock sheets and downtime records for production data, time sheets,
wages and salaries schedules and absenteeism records for human resource data, product sales
records and advertising expenditure budgets for marketing data.
• External data sources exist outside an organisation. They are mainly business associations,
government agencies, universities and various research institutions. The cost and reliability of
external data is dependent on the source.
• A wide selection of external databases exists and, in many cases, can be accessed via the internet, either free of charge or for a fee. A few examples of external data sources are: ZSE, VFEX, ZIMSTATS, press statements, and CCZ and government (GVT) reports.
Primary v Secondary
• Primary data is data that is recorded for the first time at source and with a specific purpose in mind.
• Primary data can be either internal (if it is recorded directly from an internal business process, such as
machine speed settings, sales invoices, stock sheets and employee attendance records) or external (e.g.
obtained through surveys such as human resource surveys, economic surveys and consumer surveys
(market research)).
• The main advantage of primary-sourced data is its high quality (i.e. relevancy and accuracy). This is due
to generally greater control over its collection and the focus on only data that is directly relevant to the
management problem.
• The main disadvantage of primary-sourced data is that it can be time consuming and expensive to
collect, particularly if sourced using surveys. Internal company databases, however, are relatively quick
and cheap to access for primary data.
• Secondary data is data that already exists in a processed format.
• It was previously collected and processed by others for a purpose other than the problem at hand.
• It can be internally sourced (e.g. a monthly stock report or a quarterly absenteeism report) or externally
sourced (e.g. economic time series on trade, exports, employment statistics from Zimstats)
• Secondary data has two main advantages. First, its access time is relatively short (especially if the data is
accessible through the internet), and second it is generally less expensive to acquire than primary data.
Data Collection Methods
• Surveys (personal, telephone, e-survey). The questionnaire is the core instrument of a survey.
• Experiments
• Focus group discussions (FGDs)
• Observation
Data Preparation

• Data is the lifeblood of statistical analysis. It must therefore be relevant, ‘clean’ and in
the correct format for statistical analysis.
Data Relevancy
• The random variables and the data selected for analysis must be problem specific. The
right choice of variables must be made to ensure that the statistical analysis
addresses the business problem under investigation.
Data Cleaning
• Data, when initially captured, is often ‘dirty’. Data must be checked for typographic
errors, out-of-range values and outliers. When dirty data is used in statistical analysis,
the results will produce poor-quality information for management decision making.
Data Enrichment
• Data can often be made more relevant to the management problem by transforming it
into more meaningful measures. This is known as data enrichment. For example, the
variables ‘turnover’ and ‘store size’ can be combined to create a new variable –
‘turnover per square metre’ – that is more relevant to analyse and compare stores’
performances.
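The cleaning and enrichment steps can be sketched as follows. The store records are hypothetical (names and figures invented for illustration), and the plausibility rule used for cleaning is an assumption rather than a prescribed check.

```python
# Hypothetical store records; all values are invented for this sketch.
stores = [
    {"store": "A", "turnover": 120000.0, "size_m2": 400.0},
    {"store": "B", "turnover": 90000.0, "size_m2": 250.0},
    {"store": "C", "turnover": 150000.0, "size_m2": -600.0},  # out-of-range size
]

# Data cleaning: keep only records whose values are in a plausible range
# (assumed rule: turnover not negative, floor size strictly positive).
clean = [s for s in stores if s["turnover"] >= 0 and s["size_m2"] > 0]

# Data enrichment: combine 'turnover' and 'store size' into a new variable,
# 'turnover per square metre'.
for s in clean:
    s["turnover_per_m2"] = s["turnover"] / s["size_m2"]

for s in clean:
    print(s["store"], s["turnover_per_m2"])
```

Store B has the lower turnover but the higher turnover per square metre, which is the kind of comparison the enriched variable makes possible.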
Data Visualisation
Population and Sample
POPULATION
A population consists of all the items or
individuals about which you want to draw a
conclusion. The population is the “large
group”

SAMPLE
A sample is the portion of a population
selected for analysis. The sample is the “small
group”

Population vs. Sample
Population: all the items or individuals about which you want to draw conclusion(s).
Sample: a portion of the population of items or individuals.
Probability Sample:
Simple Random Sample
• Every individual or item from the frame has an
equal chance of being selected

• Samples are obtained from a table of random numbers or from computer random number generators.
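A simple random sample drawn with a computer random number generator might look like the sketch below. The frame of 500 numbered employees and the seed are assumptions made only so the example is concrete and repeatable.

```python
import random

# Hypothetical sampling frame: 500 numbered employees.
frame = list(range(1, 501))

# A fixed seed is used only so that the sketch gives the same draw each run.
rng = random.Random(42)

# sample() draws without replacement; every item in the frame has an
# equal chance of being selected.
sample = rng.sample(frame, 10)
print(sorted(sample))
```

In real work the seed would normally be omitted or recorded, depending on whether the draw needs to be reproducible.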
Examples
• Wrong sampling practice. 1936 US Presidential Election. The Literary Digest mailed about 10,000,000 questionnaires and received more than two million replies, but the sample was heavily biased and the prediction was wrong.
• Good sampling practice. 1980 trial of Chrysler Corporation vs. United States Environmental Protection Agency. A very clean sample of n=10 cars provided very strong evidence.
Graphical Statistics
Before you do anything with your data,
look at it

In Excel: INSERT → CHARTS


Data Analysis Toolpak → Histogram
Recall: Types of Variables
• Categorical (qualitative) variables have values that can only be placed into categories, such as “yes” and “no”; major; architectural style; etc.
• Numerical (quantitative) variables have values that represent quantities.
  • Discrete variables arise from a counting process
  • Continuous variables arise from a measuring process
Types of Variables
• Categorical (defined categories). Examples: marital status, political party, eye color
• Numerical
  • Discrete (counted items). Examples: number of children, defects per hour
  • Continuous (measured characteristics). Examples: weight, voltage
Levels of Measurement
A nominal scale classifies data into distinct categories in which no ranking is implied.

Categorical Variable             Categories
Personal computer ownership      Yes / No
Type of stocks owned             Growth / Value / Other
Internet provider                AT&T, Verizon, Time Warner Cable
Levels of Measurement (con’t.)
An ordinal scale classifies data into distinct categories in which ranking is implied.

Categorical Variable              Ordered Categories
Student class designation         Freshman, Sophomore, Junior, Senior
Product satisfaction              Satisfied, Neutral, Unsatisfied
Faculty rank                      Professor, Associate Professor, Assistant Professor, Instructor
Standard & Poor’s bond ratings    AAA, AA, A, BBB, BB, B, CCC, CC, C, DDD, DD, D
Student grades                    A, B, C, D, F
Levels of Measurement (con’t.)
• An interval scale is an ordered scale in which the difference between measurements is a meaningful quantity but the measurements do not have a true zero point.
• A ratio scale is an ordered scale in which the difference between the measurements is a meaningful quantity and the measurements have a true zero point.
[Figure: Interval and Ratio Scales]
Visualizing Categorical Data:
The Bar Chart
• In a bar chart, a bar shows each category, the length of which represents the amount, frequency or percentage of values falling into a category, taken from the summary table of the variable.

Banking Preference?                 %
ATM                                 16%
Automated or live telephone         2%
Drive-through service at branch     17%
In person at branch                 41%
Internet                            24%

[Figure: horizontal bar chart “Banking Preference”, horizontal axis from 0% to 45%]
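As a rough illustration of how the summary table drives the bar lengths, the sketch below prints a plain-text bar chart. The helper name `text_bar_chart` is hypothetical; a spreadsheet or plotting package would normally be used instead.

```python
# Summary table from the banking-preference example (percentages).
preference = {
    "ATM": 16,
    "Automated or live telephone": 2,
    "Drive-through service at branch": 17,
    "In person at branch": 41,
    "Internet": 24,
}

def text_bar_chart(table):
    """Return one line per category; the bar length equals the percentage."""
    width = max(len(name) for name in table)
    return [f"{name:<{width}} | {'#' * pct} {pct}%" for name, pct in table.items()]

lines = text_bar_chart(preference)
print("\n".join(lines))
```

Each bar is simply the category's percentage drawn to scale, which is all a bar chart encodes.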
Visualizing Categorical Data:
The Pie Chart
• The pie chart is a circle broken up into slices that represent categories. The size of each slice of the pie varies according to the percentage in each category.

Banking Preference?                 %
ATM                                 16%
Automated or live telephone         2%
Drive-through service at branch     17%
In person at branch                 41%
Internet                            24%

[Figure: pie chart “Banking Preference” with one slice per category]
Visualizing Numerical Data:
The Histogram

Class                  Frequency   Relative Frequency   Percentage
10 but less than 20    3           .15                  15
20 but less than 30    6           .30                  30
30 but less than 40    5           .25                  25
40 but less than 50    4           .20                  20
50 but less than 60    2           .10                  10
Total                  20          1.00                 100

[Figure: “Histogram: Age Of Students”; vertical axis: Frequency (0 to 8); horizontal axis: 5, 15, 25, 35, 45, 55, More]
(In a percentage histogram the vertical axis would be defined to show the percentage of observations per class)
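The relative frequency and percentage columns follow directly from the class frequencies. A minimal sketch, using only the frequencies given in the table:

```python
# Class frequencies from the "Age of Students" table.
classes = [
    "10 but less than 20",
    "20 but less than 30",
    "30 but less than 40",
    "40 but less than 50",
    "50 but less than 60",
]
frequency = [3, 6, 5, 4, 2]

total = sum(frequency)

# Relative frequency = class frequency / total number of observations.
relative = [f / total for f in frequency]

# Percentage = 100 * class frequency / total.
percentage = [100 * f / total for f in frequency]

for name, f, r, p in zip(classes, frequency, relative, percentage):
    print(f"{name:<22}{f:>3}{r:>7.2f}{p:>6.0f}")
```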
Visualizing Two Numerical
Variables: Scatter Plot

Volume per day   Cost per day
23               125
26               140
29               146
33               160
38               167
42               170
50               188
55               195
60               200

[Figure: scatter plot “Cost per Day vs. Production Volume”; horizontal axis: Volume per Day (20 to 70); vertical axis: Cost per Day (0 to 250)]
Visualizing Two Numerical
Variables: Time Series Plot

Year   Number of Franchises
1996   43
1997   54
1998   60
1999   73
2000   82
2001   95
2002   107
2003   99
2004   95

[Figure: time series plot “Number of Franchises, 1996-2004”; horizontal axis: Year (1994 to 2006); vertical axis: Number of Franchises (0 to 120)]
Examples
Construct a bar chart and a pie chart. Make conclusions.

Appliances      % of electricity consumption
AC              18
Dryers          5
Washers         24
Computers       1
Cooking         2
Dishes          2
Freezers        2
Lighting        16
Fridges         9
Heating         7
Water heating   8
TV etc          6
Examples
#2.39, p.58 “Cost of baseball games”. Dataset BBcost2011 (BBcost2015). Construct a histogram.
#2.54, p.62 “Stock performance”. Construct a time series plot. Is there a pattern?
DESCRIPTIVE STATISTICS
• The central tendency is the extent to which all the data values group around a typical or central value.
• The variation is the amount of dispersion or scattering of values.
• The shape is the pattern of the distribution of values from the lowest value to the highest value.
Measures of Central Tendency:
The Mean
• The arithmetic mean (often just called the “mean”) is the most common measure of central tendency
• For a sample of size n:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

where $\bar{X}$ is pronounced “x-bar”, $X_i$ is the ith observed value and n is the sample size.
Measures of Central Tendency:
The Mean (continued)
• The most common measure of central tendency
• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)

Data: 11 12 13 14 15, Mean = 13
$$\frac{11 + 12 + 13 + 14 + 15}{5} = \frac{65}{5} = 13$$
Data: 11 12 13 14 20, Mean = 14
$$\frac{11 + 12 + 13 + 14 + 20}{5} = \frac{70}{5} = 14$$
Measures of Central Tendency:
The Median
• In an ordered array, the median is the “middle” number (50% above, 50% below)
[Figure: two dot plots on an axis from 11 to 20; Median = 13 in both]
• Not affected by extreme values
Measures of Central Tendency:
Locating the Median
• The location of the median when the values are in numerical order (smallest to largest):

$$\text{Median position} = \frac{n+1}{2} \text{ position in the ordered data}$$

• If the number of values is odd, the median is the middle number
• If the number of values is even, the median is the average of the two middle numbers

Note that $\frac{n+1}{2}$ is not the value of the median, only the position of the median in the ranked data
Measures of Central Tendency:
The Mode
• Value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical (nominal) data
• There may be no mode
• There may be several modes
[Figure: two dot plots, one on an axis from 0 to 14 and one on an axis from 0 to 6, illustrating a data set with Mode = 9 and a data set with no mode]
Measures of Central Tendency:
Review Example
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000 (Sum = $3,000,000)
• Mean: ($3,000,000/5) = $600,000
• Median: middle value of ranked data = $300,000
• Mode: most frequent value = $100,000
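The review example can be reproduced with Python's standard `statistics` module. This is only a sketch of the calculation; the notes themselves use Excel.

```python
import statistics

# House prices from the review example.
prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean = statistics.mean(prices)      # sum of values / number of values
median = statistics.median(prices)  # middle value of the ranked data
mode = statistics.mode(prices)      # most frequent value

# Position (not value) of the median in the ranked data: (n + 1) / 2.
n = len(prices)
median_position = (n + 1) / 2

print(mean, median, mode, median_position)
```

The single $2,000,000 house pulls the mean to twice the median, which illustrates why the median is often preferred when outliers are present.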
Measures of Central Tendency:
Which Measure to Choose?
• The mean is generally used, unless extreme values (outliers) exist.
• The median is often used when outliers exist, since it is not sensitive to extreme values. For example, median home prices may be reported for a region.
• In some situations it makes sense to report both the mean and the median.
Measures of Central Tendency:
Summary
Central Tendency
• Arithmetic mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
• Median: middle value in the ordered array
• Mode: most frequently observed value
Measures of Variation
Variation: Range, Variance, Standard Deviation, Coefficient of Variation
• Measures of variation give information on the spread or variability or dispersion of the data values.
[Figure: two distributions with the same center but different variation]
Measures of Variation:
The Range
• Simplest measure of variation
• Difference between the largest and the smallest values:

Range = Xlargest - Xsmallest

Example:
[Figure: dot plot on an axis from 0 to 14, with data values running from 1 to 13]
Range = 13 - 1 = 12
Measures of Variation:
Why The Range Can Be Misleading
• Ignores the way in which data are distributed
[Figure: two dot plots on an axis from 7 to 12 with differently distributed values]
Range = 12 - 7 = 5 in both cases

• Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
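The outlier example can be verified with a short sketch. The median is added here only for contrast; it is not part of the slide.

```python
import statistics

# The two data sets from the outlier example; only the last value differs.
a = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
b = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

# Range = largest value - smallest value.
range_a = max(a) - min(a)
range_b = max(b) - min(b)

# For contrast: the median is unchanged by the single extreme value.
median_a = statistics.median(a)
median_b = statistics.median(b)

print(range_a, range_b, median_a, median_b)
```

One changed observation moves the range from 4 to 119, while the middle of the data stays where it was.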
Measures of Variation:
The Sample Variance
Low variation: more points close to the mean
High variation: more points far from the mean
So, measure the distance to the mean
Measures of Variation:
The Sample Variance
• Average (approximately) of squared deviations of values from the mean
• Sample variance:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Where $\bar{X}$ = arithmetic mean
n = sample size
$X_i$ = ith value of the variable X
Measures of Variation:
The Sample Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Is the square root of the variance
• Has the same units as the original data
• Sample standard deviation:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
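A step-by-step sketch of the sample variance and standard deviation formulas, applied to the house price data used elsewhere in these notes:

```python
import math

# House prices from the review example.
prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

n = len(prices)
mean = sum(prices) / n

# Squared deviations of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in prices]

# Sample variance: divide by n - 1, not n.
sample_variance = sum(squared_deviations) / (n - 1)

# Sample standard deviation: square root of the variance,
# expressed in the same units as the data.
sample_std = math.sqrt(sample_variance)

print(sample_variance, sample_std)
```

The results, 640,000,000,000 and 800,000, agree with the Excel output for the same data.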
Measures of Variation:
Comparing Standard Deviations
[Figure: two distributions compared, one with a smaller standard deviation and one with a larger standard deviation]
Locating Extreme Outliers:
Z-Score

$$Z = \frac{X - \bar{X}}{S}$$

For a bell-shaped (Normal) distribution,
|Z| < 1 for about 68% of data
|Z| < 2 for about 95% of data
|Z| < 3 for about 99.7% of data

So, values X with large |Z| can be outliers.

Locating Extreme Outliers:
Z-Score
• Suppose the mean math SAT score is 490, with a standard deviation of 100.
• Compute the Z-score for a test score of 620.

$$Z = \frac{X - \bar{X}}{S} = \frac{620 - 490}{100} = \frac{130}{100} = 1.3$$

A score of 620 is 1.3 standard deviations above the mean and would not be considered an outlier.
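The SAT example as a small sketch. The cut-off of |Z| > 3 used to flag an outlier is an assumption for illustration; the notes only say that values with large |Z| can be outliers.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std

z = z_score(620, 490, 100)

# Assumed rule of thumb for this sketch: flag |Z| > 3 as a possible outlier.
is_outlier = abs(z) > 3

print(z, is_outlier)
```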
General Descriptive Stats Using
Microsoft Excel Functions

House Prices (cells A2:A6): $2,000,000; $500,000; $300,000; $100,000; $100,000

Descriptive Statistics   Value              Excel formula
Mean                     $ 600,000          =AVERAGE(A2:A6)
Standard Error           $ 357,770.88       =D6/SQRT(D14)
Median                   $ 300,000          =MEDIAN(A2:A6)
Mode                     $ 100,000.00       =MODE(A2:A6)
Standard Deviation       $ 800,000          =STDEV(A2:A6)
Sample Variance          640,000,000,000    =VAR(A2:A6)
Kurtosis                 4.1301             =KURT(A2:A6)
Skewness                 2.0068             =SKEW(A2:A6)
Range                    $ 1,900,000        =D12 - D11
Minimum                  $ 100,000          =MIN(A2:A6)
Maximum                  $ 2,000,000        =MAX(A2:A6)
Sum                      $ 3,000,000        =SUM(A2:A6)
Count                    5                  =COUNT(A2:A6)
General Descriptive Stats Using
Microsoft Excel Data Analysis Tool
1. Select Data.
2. Select Data Analysis.
3. Select Descriptive Statistics and click OK.
4. Enter the cell range.
5. Check the Summary Statistics box.
6. Click OK.
Excel output
Microsoft Excel descriptive statistics output, using the house price data:
House Prices: $2,000,000; 500,000; 300,000; 100,000; 100,000

House Prices
Mean                 600000
Standard Error       357770.8764
Median               300000
Mode                 100000
Standard Deviation   800000
Sample Variance      640,000,000,000
Kurtosis             4.1301
Skewness             2.0068
Range                1900000
Minimum              100000
Maximum              2000000
Sum                  3000000
Count                5
Numerical Descriptive Measures for
a Population
• Descriptive statistics discussed previously described a sample, not the population.
• Summary measures describing a population, called parameters, are denoted with Greek letters.
• Important population parameters are the population mean, variance, and standard deviation.
Numerical Descriptive Measures
for a Population: The mean µ
• The population mean is the sum of the values in the population divided by the population size, N

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$

Where μ = population mean
N = population size
$X_i$ = ith value of the variable X
Numerical Descriptive Measures For A
Population: The Variance σ²
• Average of squared deviations of values from the mean
• Population variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

Where μ = population mean
N = population size
$X_i$ = ith value of the variable X
Numerical Descriptive Measures For
A Population: The Standard
Deviation σ
• Most commonly used measure of variation
• Shows variation about the mean
• Is the square root of the population variance
• Has the same units as the original data
• Population standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
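The only computational difference between the population and sample versions is the divisor (N versus n - 1). The sketch below shows both, treating the five house prices as if they were an entire population purely for illustration.

```python
import statistics

# Illustration only: the five house prices treated as a whole population.
values = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mu = statistics.mean(values)

# Population formulas divide by N.
sigma2 = statistics.pvariance(values)
sigma = statistics.pstdev(values)

# Sample formulas divide by n - 1, so they give larger results.
s2 = statistics.variance(values)
s = statistics.stdev(values)

print(mu, sigma2, sigma, s2, s)
```

With the same five values, dividing by 5 instead of 4 gives a variance of 512,000,000,000 rather than 640,000,000,000.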
Sample statistics versus population
parameters

Measure              Population Parameter   Sample Statistic
Mean                 μ                      X̄
Variance             σ²                     S²
Standard Deviation   σ                      S
Quartiles
• Quartiles split the ranked data into 4 segments with an equal number of values per segment
[Figure: ranked data split into four 25% segments by Q1, Q2 and Q3]
• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
• Q2 is the same as the median (50% of the observations are smaller and 50% are larger)
• Only 25% of the observations are greater than the third quartile, Q3
Quartile Measures:
Locating Quartiles
Find a quartile by determining the value in the appropriate position in the ranked data, where
First quartile position: Q1 = (n+1)/4 ranked value
Second quartile position: Q2 = (n+1)/2 ranked value
Third quartile position: Q3 = 3(n+1)/4 ranked value
where n is the number of observed values
Quartile Measures:
Calculation Rules
• When calculating the ranked position use the following rules
• If the result is a whole number then it is the ranked position to use

• If the result is a fractional half (e.g. 2.5, 7.5, 8.5, etc.) then average the two
corresponding data values.

• If the result is not a whole number or a fractional half then round the result to
the nearest integer to find the ranked position.

Quartile Measures
Calculating The Quartiles: Example
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so Q1 = (12+13)/2 = 12.5
Q2 is in the (9+1)/2 = 5th position of the ranked data, so Q2 = median = 16
Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data, so Q3 = (18+21)/2 = 19.5
Q1 and Q3 are measures of non-central location
Q2 = median, is a measure of central tendency
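The position rules above can be written as a short function. This is a sketch of the (n+1)-based rule from these notes, not of the interpolation method used by Excel or other packages, which can give slightly different quartiles. The function name `quartile` is hypothetical, and the sketch assumes at least three sorted values.

```python
def quartile(sorted_data, k):
    """k-th quartile (k = 1, 2, 3) using the k(n+1)/4 position rule.

    Assumes sorted_data is in ascending order and has at least 3 values.
    """
    n = len(sorted_data)
    # Position = k(n+1)/4, split into whole part and remainder (in quarters).
    whole, remainder = divmod(k * (n + 1), 4)
    if remainder == 0:
        # Whole-number position: use that ranked value.
        return sorted_data[whole - 1]
    if remainder == 2:
        # Fractional half: average the two neighbouring ranked values.
        return (sorted_data[whole - 1] + sorted_data[whole]) / 2
    # Otherwise round the position to the nearest integer.
    position = whole if remainder == 1 else whole + 1
    return sorted_data[position - 1]

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1, q2, q3 = (quartile(data, k) for k in (1, 2, 3))
iqr = q3 - q1
five_number_summary = (min(data), q1, q2, q3, max(data))
print(five_number_summary, iqr)
```

For the nine values in the example this gives Q1 = 12.5, Q2 = 16 and Q3 = 19.5, matching the worked calculation.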
Interquartile Range
(IQR)
• The IQR is Q3 - Q1 and measures the spread in the middle 50% of the data
• The IQR is also called the midspread because it covers the middle 50% of the data
• The IQR is a measure of variability that is not influenced by outliers or extreme values
• Measures like Q1, Q3, and IQR that are not influenced by outliers are called resistant measures
Calculating The Interquartile
Range
Example:
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70 (25% of the data lies in each of the four segments)
Interquartile range = 57 - 30 = 27
The Five-Number Summary
The five numbers that help describe the center, spread and shape of data are:
• Xsmallest
• First Quartile (Q1)
• Median (Q2)
• Third Quartile (Q3)
• Xlargest
Five Number Summary and
The Boxplot
• The Boxplot: a graphical display of the data based on the five-number summary:
Xsmallest -- Q1 -- Median -- Q3 -- Xlargest
Example:
[Figure: boxplot with 25% of the data in each of the four segments between Xsmallest, Q1, Median, Q3 and Xlargest]
Five Number Summary:
Shape of Boxplots
• If data are symmetric around the median then the box and central line are centered between the endpoints
[Figure: boxplot marked Xsmallest, Q1, Median, Q3, Xlargest]
• A boxplot can be shown in either a vertical or horizontal orientation
Distribution Shape and
The Boxplot
[Figure: three boxplots (left-skewed, symmetric, right-skewed), each marked Q1, Q2, Q3]
Measures Of The Relationship
Between Two Numerical Variables
• Scatter plots allow you to visually examine the relationship between two numerical variables. We now discuss two quantitative measures of such relationships:
• The Covariance
• The Coefficient of Correlation
The Covariance
• The covariance measures the strength of the linear relationship between two numerical variables (X & Y)
• The sample covariance:

$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

• Only concerned with the strength of the relationship
• No causal effect is implied
Interpreting Covariance
• Covariance between two variables:
cov(X,Y) > 0: X and Y tend to move in the same direction
cov(X,Y) < 0: X and Y tend to move in opposite directions
cov(X,Y) = 0: X and Y have no linear relationship (this alone does not establish independence)
• The covariance has a major flaw:
• It is not possible to determine the relative strength of the relationship from the size of the covariance
Coefficient of Correlation
• Measures the relative strength of the linear relationship between two numerical variables
• Sample coefficient of correlation:

$$r = \frac{\operatorname{cov}(X, Y)}{S_X S_Y}$$

where

$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}, \quad S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}, \quad S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$$
Features of the
Coefficient of Correlation
• The population coefficient of correlation is referred to as ρ.
• The sample coefficient of correlation is referred to as r.
• Both ρ and r have the following features:
• Unit free
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship
Scatter Plots of Sample Data
with Various Coefficients of
Correlation
Figure: five scatter plots of Y against X, with r = -1, r = -0.6, r = +1, r = +0.3 and r = 0.
The Coefficient of Correlation Using
Microsoft Excel Function

Test #1 Score    Test #2 Score
78               82
92               88
86               91
83               90
95               92
85               85
91               89
76               81
88               96
79               77

Correlation Coefficient: 0.7332, obtained with =CORREL(A2:A11,B2:B11)
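The covariance and correlation formulas can also be checked outside Excel. The following is a minimal sketch that applies the sample formulas to the ten pairs of test scores from the slide; the function names are illustrative only, not part of any library.

```python
import math

# Test scores as listed on the slide.
test1 = [78, 92, 86, 83, 95, 85, 91, 76, 88, 79]
test2 = [82, 88, 91, 90, 92, 85, 89, 81, 96, 77]

def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """r = cov(X, Y) / (S_X * S_Y); S_X squared equals cov(X, X)."""
    s_x = math.sqrt(sample_covariance(x, x))
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

cov_xy = sample_covariance(test1, test2)
r = sample_correlation(test1, test2)
print(round(cov_xy, 4), round(r, 4))
```

The covariance (about 27) depends on the units of the scores, whereas r (about 0.73) is unit free, which is why r is the easier number to interpret.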
The Coefficient of Correlation Using
Microsoft Excel Data Analysis Tool
1. Select Data
2. Choose Data Analysis
3. Choose Correlation and click OK
4. Input the data range and select appropriate options
5. Click OK to get the output

Excel function: =CORREL
Alternatively, use the “Covariance” and “Correlation” options of the Data Analysis Tool.


Probability Theory
In this chapter, you learn:

• Basic probability concepts


• Calculating probabilities of events
• Joint events
• Conditional probability
• Decision trees
Basic Probability
Concepts
• Probability – the chance that an uncertain event will occur (always between 0 and 1)
• Impossible Event – an event that has no chance of occurring (probability = 0)
• Certain Event – an event that is sure to occur (probability = 1)
Example
Find the probability of selecting a male taking statistics
from the population described in the following table:

          Taking Stats    Not Taking Stats    Total
Male                84                 145      229
Female              76                 134      210
Total              160                 279      439

Probability of male taking stats = (number of males taking stats) / (total number of people) = 84 / 439 = 0.191
Events
Each possible outcome of a variable is an event.

• Simple event
• An event described by a single characteristic
• e.g., A day in January from all days in 2019
• Joint event
• An event described by two or more characteristics
• e.g. A day in January that is also a Wednesday from all days in 2019
• Complement of an event A (denoted A’)
• All events that are not part of event A
• e.g., All days from 2019 that are not in January

Sample Space
The Sample Space is the collection of all
possible events (outcomes)
e.g. All 6 faces of a die:

e.g. All 52 cards of a bridge deck:


Organizing & Visualizing Events
• Contingency Tables -- For All Days in 2019

            Jan.    Not Jan.    Total
Wed.           5          47       52
Not Wed.      26         287      313
Total         31         334      365

• Decision Trees -- the same sample space (all days in 2019), split first by month and then by weekday, with the number of outcomes on each branch:

All Days in 2019
    Jan.
        Wed.          5
        Not Wed.     26
    Not Jan.
        Wed.         47
        Not Wed.    287
Definition: Simple (Marginal)
Probability
• Simple Probability refers to the probability of a
simple event.
• ex. P(Jan.)
• ex. P(Wed.)

            Jan.    Not Jan.    Total
Wed.           5          47       52
Not Wed.      26         287      313
Total         31         334      365

P(Jan.) = 31 / 365
P(Wed.) = 52 / 365
Definition: Joint Probability
• Joint Probability refers to the probability of an
occurrence of two or more events (joint event).
• ex. P(Jan. and Wed.)
• ex. P(Not Jan. and Not Wed.)

            Jan.    Not Jan.    Total
Wed.           5          47       52
Not Wed.      26         287      313
Total         31         334      365

P(Jan. and Wed.) = 5 / 365
P(Not Jan. and Not Wed.) = 287 / 365


Mutually Exclusive Events
• Mutually exclusive events
• Events that cannot occur simultaneously

Example: Randomly choosing a day from 2019

A = day in January; B = day in February

• Events A and B are mutually exclusive

Collectively Exhaustive Events
• Collectively exhaustive events
• One of the events must occur
• The set of events covers the entire sample space

Example: Randomly choose a day from 2019

A = Weekday; B = Weekend;
C = January; D = Spring;

• Events A, B, C and D are collectively exhaustive (but not mutually exclusive – a weekday can be in January or in Spring)
• Events A and B are collectively exhaustive and also mutually exclusive
Computing Joint and
Marginal Probabilities

• The probability of a joint event, A and B:

P(A and B) = (number of outcomes satisfying A and B) / (total number of elementary outcomes)

• Computing a marginal (or simple) probability:

P(A) = P(A and B1) + P(A and B2) + … + P(A and Bk)

• where B1, B2, …, Bk are k mutually exclusive and collectively exhaustive events
Joint Probability
Example
P(Jan. and Wed.) = (number of days that are in Jan. and are Wed.) / (total number of days in 2019) = 5 / 365

            Jan.    Not Jan.    Total
Wed.           5          47       52
Not Wed.      26         287      313
Total         31         334      365


Marginal Probability
Example
P(Wed.) = P(Jan. and Wed.) + P(Not Jan. and Wed.) = 5/365 + 47/365 = 52/365

            Jan.    Not Jan.    Total
Wed.           5          47       52
Not Wed.      26         287      313
Total         31         334      365


Marginal & Joint Probabilities In
A Contingency Table

             Event B1         Event B2         Total
Event A1     P(A1 and B1)     P(A1 and B2)     P(A1)
Event A2     P(A2 and B1)     P(A2 and B2)     P(A2)
Total        P(B1)            P(B2)            1

The interior cells are joint probabilities; the row and column totals are marginal (simple) probabilities.


Probability Summary
So Far
• Probability is the numerical measure of the likelihood that an event will occur
• The probability of any event must be between 0 (impossible) and 1 (certain), inclusive:

0 ≤ P(A) ≤ 1 for any event A

• The sum of the probabilities of all mutually exclusive and collectively exhaustive events is 1:

P(A) + P(B) + P(C) = 1

if A, B, and C are mutually exclusive and collectively exhaustive
General Addition Rule

General Addition Rule:


P(A or B) = P(A) + P(B) - P(A and B)

If A and B are mutually exclusive, then P(A and B) = 0, so the rule can be simplified:

P(A or B) = P(A) + P(B)

for mutually exclusive events A and B
General Addition Rule
Example
P(Jan. or Wed.) = P(Jan.) + P(Wed.) - P(Jan. and Wed.)
= 31/365 + 52/365 - 5/365 = 78/365

Don’t count the five Wednesdays in January twice!

            Jan.    Not Jan.    Total
Wed.           5          47       52
Not Wed.      26         287      313
Total         31         334      365
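One way to double-check a contingency table like this is to count the days directly. The sketch below is an illustration only: it walks through every day of 2019 with Python's `datetime` module, tallies the four cells, and then applies the marginal, joint and general addition rules. The variable names are arbitrary.

```python
from datetime import date, timedelta
from fractions import Fraction

# Tally every day of 2019 by (is it in January?, is it a Wednesday?).
counts = {}
day = date(2019, 1, 1)
while day.year == 2019:
    key = (day.month == 1, day.weekday() == 2)  # weekday() == 2 is Wednesday
    counts[key] = counts.get(key, 0) + 1
    day += timedelta(days=1)

total = sum(counts.values())

# Joint, marginal, and "or" probabilities as exact fractions.
p_jan_and_wed = Fraction(counts[(True, True)], total)
p_jan = Fraction(counts[(True, True)] + counts[(True, False)], total)
p_wed = Fraction(counts[(True, True)] + counts[(False, True)], total)
p_jan_or_wed = p_jan + p_wed - p_jan_and_wed  # general addition rule

print(counts)
print(p_jan, p_wed, p_jan_or_wed)
```

Note that `Fraction` reduces results to lowest terms, so the joint probability 5/365 is displayed as 1/73.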


Computing Conditional
Probabilities
• A conditional probability is the probability of one event, given that another event has occurred:

P(A | B) = P(A and B) / P(B)    (the conditional probability of A given that B has occurred)

P(B | A) = P(A and B) / P(A)    (the conditional probability of B given that A has occurred)

where P(A and B) = joint probability of A and B
P(A) = marginal or simple probability of A
P(B) = marginal or simple probability of B
Conditional Probability
Example
• Of the cars on a used car lot, 70% have air conditioning (AC) and 40% have a GPS. 20% of the cars have both.
• What is the probability that a car has a GPS, given that it has AC? I.e., we want to find P(GPS | AC)


Conditional Probability
Example (continued)
• Of the cars on a used car lot, 70% have air conditioning (AC), 40% have a GPS, and 20% of the cars have both.

          GPS    No GPS    Total
AC        0.2       0.5      0.7
No AC     0.2       0.1      0.3
Total     0.4       0.6      1.0

P(GPS | AC) = P(GPS and AC) / P(AC) = 0.2 / 0.7 = 0.2857
Conditional Probability
Example (continued)
• Given AC, we only consider the top row (70% of the cars). Within that row, the cars with a GPS make up 20% of all cars, and 0.2 out of 0.7 is about 28.57%.

          GPS    No GPS    Total
AC        0.2       0.5      0.7
No AC     0.2       0.1      0.3
Total     0.4       0.6      1.0

P(GPS | AC) = P(GPS and AC) / P(AC) = 0.2 / 0.7 = 0.2857
Using Decision Trees
Given AC or no AC (each second-level branch carries a conditional probability):

All Cars
    Has AC, P(AC) = 0.7
        Has GPS: conditional probability .2/.7, so P(AC and GPS) = 0.2
        Does not have GPS: conditional probability .5/.7, so P(AC and GPS’) = 0.5
    Does not have AC, P(AC’) = 0.3
        Has GPS: conditional probability .2/.3, so P(AC’ and GPS) = 0.2
        Does not have GPS: conditional probability .1/.3, so P(AC’ and GPS’) = 0.1
Using Decision Trees
(continued)
Given GPS or no GPS (each second-level branch carries a conditional probability):

All Cars
    Has GPS, P(GPS) = 0.4
        Has AC: conditional probability .2/.4, so P(GPS and AC) = 0.2
        Does not have AC: conditional probability .2/.4, so P(GPS and AC’) = 0.2
    Does not have GPS, P(GPS’) = 0.6
        Has AC: conditional probability .5/.6, so P(GPS’ and AC) = 0.5
        Does not have AC: conditional probability .1/.6, so P(GPS’ and AC’) = 0.1
Independence
• Two events are independent if and only if:

P(A | B) = P(A)

• Events A and B are independent when the probability of one event is not affected by the fact that the other event has occurred
Multiplication Rules

• Multiplication rule for two events A and B:

P(A and B) = P(A | B) P(B)

Note: If A and B are independent, then P(A | B) = P(A) and the multiplication rule simplifies to

P(A and B) = P(A) P(B)
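To connect these rules to the used-car example, here is a small sketch that computes P(GPS | AC) and then checks the independence condition. The helper name `conditional` is made up for illustration.

```python
import math

# Probabilities from the used-car example.
p_ac = 0.7      # P(AC)
p_gps = 0.4     # P(GPS)
p_both = 0.2    # P(AC and GPS)

def conditional(p_joint, p_given):
    """P(A | B) = P(A and B) / P(B)."""
    return p_joint / p_given

p_gps_given_ac = conditional(p_both, p_ac)

# Independence would require P(GPS | AC) == P(GPS).
independent = math.isclose(p_gps_given_ac, p_gps)

# If the events were independent, the joint probability would be P(AC) * P(GPS).
joint_if_independent = p_ac * p_gps

print(round(p_gps_given_ac, 4), independent, round(joint_if_independent, 2))
```

Because P(GPS | AC) is about 0.2857 while P(GPS) is 0.4, having AC and having a GPS are not independent in this example; under independence the joint probability would have been 0.28 rather than 0.2.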


Bayes Theorem
Bayes’ Theorem
• Bayes’ Theorem is used to revise previously
calculated probabilities based on new information.

• Developed by Thomas Bayes in the 18th Century.

• It is an extension of conditional probability.


Example
There exists a test for a certain viral infection. It is 95% reliable for infected patients and 99% reliable for healthy ones. That is, if a patient has the virus (event V), the test shows a positive result (event S) with probability P{S | V} = 0.95, and if the patient does not have the virus, the test shows a negative result with probability P{not S | not V} = 0.99. Suppose that 4% of all the patients are infected with the virus, P{V} = 0.04. When the test shows a positive result, find the (conditional) probability that the patient is infected.
Bayes’ Theorem

P(B | A) = P(A | B) P(B) / P(A)

How to find P(A)?

Marginal Probability
(The Law of Total Probability)

• Marginal probability for event A:

P(A) = P(A | B1) P(B1) + P(A | B2) P(B2) + … + P(A | Bk) P(Bk)

• where B1, B2, …, Bk are k mutually exclusive and collectively exhaustive events
Bayes’ Theorem

$$P(B_i \mid A) = \frac{P(A \mid B_i)\,P(B_i)}{P(A \mid B_1)\,P(B_1) + P(A \mid B_2)\,P(B_2) + \cdots + P(A \mid B_k)\,P(B_k)}$$

• where:
Bi = ith event of k mutually exclusive and collectively exhaustive events
A = new event that might impact P(Bi)
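As a worked illustration, the viral-infection example stated earlier can be run through this formula. Here V' denotes "no virus", so P(V') = 0.96, and P(S | V') = 1 − 0.99 = 0.01 is the false-positive rate implied by the 99% reliability for healthy patients.

```latex
\begin{align*}
P(S) &= P(S \mid V)\,P(V) + P(S \mid V')\,P(V') \\
     &= (0.95)(0.04) + (0.01)(0.96) = 0.038 + 0.0096 = 0.0476 \\[4pt]
P(V \mid S) &= \frac{P(S \mid V)\,P(V)}{P(S)} = \frac{0.038}{0.0476} \approx 0.798
\end{align*}
```

So even with a fairly reliable test, roughly one positive result in five comes from a healthy patient, because healthy patients are so much more numerous.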
Bayes’ Theorem Example
• A drilling company has estimated a 40% chance of
striking oil for their new well.
• A detailed test has been scheduled for more
information. Historically, 60% of successful wells
have had detailed tests, and 20% of unsuccessful
wells have had detailed tests.
• Given that this well has been scheduled for a
detailed test, what is the probability
that the well will be successful?
Bayes’ Theorem Example
(continued)

• Let S = successful well, U = unsuccessful well
• P(S) = 0.4, P(U) = 0.6 (prior probabilities)
• Define the detailed test event as D
• Conditional probabilities:
P(D|S) = 0.6, P(D|U) = 0.2
• Goal is to find P(S|D)
Bayes’ Theorem Example
(continued)

Apply Bayes’ Theorem:

$$P(S \mid D) = \frac{P(D \mid S)\,P(S)}{P(D \mid S)\,P(S) + P(D \mid U)\,P(U)} = \frac{(0.6)(0.4)}{(0.6)(0.4) + (0.2)(0.6)} = \frac{0.24}{0.24 + 0.12} = 0.667$$

So the revised probability of success, given that this well has been scheduled for a detailed test, is 0.667
Bayes’ Theorem Example
(continued)

• Given the detailed test, the revised probability of a successful well has risen to 0.667 from the original estimate of 0.4

Event               Prior Prob.    Conditional Prob.    Joint Prob.           Revised Prob.
S (successful)      0.4            0.6                  (0.4)(0.6) = 0.24     0.24/0.36 = 0.667
U (unsuccessful)    0.6            0.2                  (0.6)(0.2) = 0.12     0.12/0.36 = 0.333
                                                        Sum = 0.36
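The prior, conditional, joint and revised columns of this table can be reproduced with a few lines of code. The sketch below is one possible way to organise the calculation; the function name `bayes_table` and the dictionary layout are illustrative choices, not a standard API.

```python
def bayes_table(priors, likelihoods):
    """priors: {event: P(event)}; likelihoods: {event: P(evidence | event)}.

    Returns the joint probabilities, their sum (the marginal probability of
    the evidence, by the law of total probability) and the revised probabilities.
    """
    joints = {event: priors[event] * likelihoods[event] for event in priors}
    evidence = sum(joints.values())
    revised = {event: joints[event] / evidence for event in priors}
    return joints, evidence, revised

# Drilling example: S = successful well, U = unsuccessful well, D = detailed test.
priors = {"S": 0.4, "U": 0.6}
likelihoods = {"S": 0.6, "U": 0.2}   # P(D | S), P(D | U)

joints, p_d, revised = bayes_table(priors, likelihoods)
for event in priors:
    print(event, round(joints[event], 2), round(revised[event], 3))
print("P(D) =", round(p_d, 2))
```

The same function can be applied to other examples with mutually exclusive and collectively exhaustive events by changing the two dictionaries.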
Exercises
• All athletes at the Olympic games are tested for performance-enhancing steroid drug use. The imperfect test gives positive results (indicating drug use) for 90% of all steroid users but also (incorrectly) for 2% of those who do not use steroids. Suppose that 5% of all registered athletes use steroids. If an athlete tests positive, what is the probability that he/she uses steroids?
Chapter Summary
In this chapter we discussed:

• Basic probability concepts


• Sample spaces and events, contingency tables, simple probability, and
joint probability
• Basic probability rules
• General addition rule, addition rule for mutually exclusive events, rule
for collectively exhaustive events
• Conditional probability
• Statistical independence, marginal probability, decision trees, and the
multiplication rule
• Bayes Formula and the Law of Total Probability
Confidence Interval Estimation
• The role of inferential statistics is to use sample evidence to identify
population measures.
• One approach is to calculate the most likely value for the population
parameter based on the sample statistic.
• This is known as an estimation approach.
• This section will consider two estimation methods for the true population
parameter. They are point estimation, and confidence interval estimation.
• The most common population parameters to estimate are the central location measures of the single population mean, µ, and the single population proportion, π, estimated from the sample mean, x̄, and the sample proportion, p, respectively.
Point Estimation
• A point estimate is made when the value of a sample statistic is taken to be the true value of the population parameter.
• Thus a sample mean, x̄, is used as a point estimate of its population mean, µ, while a sample proportion, p, is used to represent the true value of its population proportion, π.
• Here are two examples:
• A supermarket survey of a random sample of 75 shoppers found that their average shopping time was 28.4 minutes (x̄ = 28.4), so a point estimate of the actual average shopping time of all supermarket shoppers is 28.4 minutes (µ = 28.4).
• Suppose 55 out of 350 (15.7%) randomly surveyed coffee drinkers prefer decaffeinated coffee. Then a point estimate of the actual proportion of all coffee drinkers who prefer decaffeinated coffee is 0.157 (π = 0.157 or 15.7%).
• A sample point estimate is an unreliable measure of a population parameter, as the probability that it will exactly equal the true value is extremely small (almost zero).
• Also, there is no indication of how near or how far a single sample statistic lies from its population measure (i.e. no indication of sampling error).
• For these reasons, a point estimate is seldom used on its own to estimate a population parameter.
• It is better to offer a range of values within which the population parameter is expected to fall, so that the reliability of the estimate can be measured.
• This is the purpose of interval estimation.
Confidence Interval Estimation
• An interval estimate is a range of values defined around a sample
statistic.
• The population parameter is expected to lie within this interval with a
specified level of confidence (or probability).
• It is therefore called a confidence interval.
• Confidence intervals will be constructed for the single population mean, µ, and the single population proportion, π, using their respective sample statistics, x̄ and p.
CI for Single population mean (µ)
• The population parameter to be estimated is µ.
• The appropriate sample statistic to estimate µ is the sample mean, x̄.
• Typical questions that imply the use of confidence intervals are as
follows:
• Construct 95% confidence limits for the average years of experience of
all registered financial advisors.
• Estimate, with 99% confidence, the actual average mass of all frozen
chickens supplied by Irvines.
• Find the 90% confidence interval estimate for the average duration of
all telesales calls.
PicknPay Survey
• A survey of a random sample of 300 grocery shoppers in PicknPay
found that the mean value of their grocery purchases was USD78.
Assume that the population standard deviation of grocery purchase
values is USD21.
• Find the 95% confidence limits for the average value of a grocery
purchase by all grocery shoppers in PicknPay
Solution
• Sample size: n = 300
• Sample mean: x̄ = 78
• In addition, the level of confidence and a measure of the sampling error are required.
• The sampling error is represented by the standard error of the sample mean:
SE = σ/√n = 21/√300 = 1.2124
• It requires that the population standard deviation, σ, is known.
• The 95% confidence level refers to z-limits that bound a symmetrical area of 95% around the mean (centre) of the standard normal distribution.
• From the z-table (Appendix 1), a 95% confidence level corresponds to z-limits of ±1.96. Thus an area of 95% is found under the standard normal distribution between z = −1.96 and z = +1.96.
• These z-limits represent the 95% confidence interval in z terms. To express the confidence limits in the same unit of measure as the random variable under study (i.e. value of grocery purchases), the z-limits must be transformed into x-limits using the following z transformation formula:

z = (x̄ – µ) / (σ/√n)

Thus µ = x̄ ± z(σ/√n)
= 78 ± 1.96(1.2124)
= 78 ± 2.376
This gives a lower and an upper confidence limit about µ:
lower 95% confidence limit of 78 – 2.376 = 75.624 ($75.62)
upper 95% confidence limit of 78 + 2.376 = 80.376 ($80.38).
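The calculation above can be sketched in code as follows. This is an illustrative helper (the name `z_confidence_interval` is made up) that assumes, as the slide does, that the population standard deviation σ is known; the z-limit is taken from Python's `statistics.NormalDist` instead of a printed z-table.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence):
    """Confidence limits for a population mean when sigma is known."""
    # Two-sided z-limit, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / sqrt(n)   # z times the standard error
    return sample_mean - margin, sample_mean + margin

# PicknPay survey: n = 300, sample mean = 78, sigma = 21, 95% confidence.
lower, upper = z_confidence_interval(78, 21, 300, 0.95)
print(round(lower, 2), round(upper, 2))
```

Using the exact z-limit rather than the rounded 1.96 changes the limits only in the third decimal place.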
Management Interpretation

• We can be 95% confident that the true mean value of all grocery purchases by grocery shoppers in PicknPay lies between $75.62 and $80.38.
The Precision of a Confidence
Interval
• The width of a confidence interval is a measure of its precision. The
narrower the confidence interval, the more precise is the interval
estimate, and vice versa.
• The width of a confidence interval is influenced by:
• the specified confidence level
• the sample size
• the population standard deviation.
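To see how each of these three factors moves the width, the sketch below compares a baseline interval with one changed factor at a time. The baseline reuses the PicknPay figures (σ = 21, n = 300, 95%); the alternative values (99% confidence, n = 1200, σ = 42) are hypothetical and chosen only to make the effect visible.

```python
from math import sqrt
from statistics import NormalDist

def interval_width(sigma, n, confidence):
    """Width of a z-based confidence interval: 2 * z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * sigma / sqrt(n)

base = interval_width(sigma=21, n=300, confidence=0.95)

# Hypothetical variations, one factor changed at a time.
higher_confidence = interval_width(sigma=21, n=300, confidence=0.99)
larger_sample = interval_width(sigma=21, n=1200, confidence=0.95)
larger_sigma = interval_width(sigma=42, n=300, confidence=0.95)

print(round(base, 3), round(higher_confidence, 3),
      round(larger_sample, 3), round(larger_sigma, 3))
```

A higher confidence level or a larger standard deviation widens the interval, while a larger sample narrows it; quadrupling n halves the width.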
Hwange Coal Mine
• A human resources director at the Hwange Coal mine wishes to estimate the true mean employment period of all coalminers. From a random sample of 144 coalminers’ records, the sample mean employment period was found to be 88.4 months. The employment period is assumed to be normally distributed with a population standard deviation of 21.5 months. Find the 95% confidence interval estimate for the actual mean employment period (in months) of all miners employed at the mine.
Solution
• Given x̄ = 88.4 months, σ = 21.5 months and n = 144 miners, find the standard error of the sample mean:
• SE = σ/√n = 21.5/√144 = 1.792 months
• From the z-table, the 95% confidence level equates to z-limits of ±1.96.
• Thus the lower limit is 88.4 – 1.96(1.792) = 88.4 – 3.51 = 84.89 months
• and the upper limit is 88.4 + 1.96(1.792) = 88.4 + 3.51 = 91.91 months.
• Thus the 95% confidence interval is defined as 84.89 ≤ µ ≤ 91.91 months.
Management Interpretation
• We can be 95% confident that the average employment period of all coalminers lies between 84.89 and 91.91 months.
Practice Question
Dunlop Zimbabwe
• A tyre manufacturer, Dunlop Zimbabwe, found that the sample mean tread life of 81 radial tyres tested was 52 345 km. Radial tyre tread life is assumed to be normally distributed with a population standard deviation of 6 144 km.
a) Estimate, with 99% confidence, the true mean tread life of all radial tyres manufactured.
b) Interpret the results.
CI for Population Proportion (π)
• If the data of a random variable is nominal/ordinal scaled, the appropriate measure of central location is a proportion. In the same way that the population mean can be estimated from a sample mean, a population proportion, π, can be estimated from its sample proportion, p.
• The following statistical measures are required to construct a confidence interval estimate about the true population proportion, π:
• the sample proportion, p, where p = x/n (x = number of ‘success’ outcomes, n = sample size)
• the z-limits corresponding to a specified level of confidence
• the standard error of the sample proportion, p, which is calculated using: SE = √(p(1 – p)/n)
Example
• A recent survey amongst 240 randomly selected street vendors in
Harare showed that 84 of them felt that city of Harare by-laws still
hampered their trading.
• Find the 90% confidence interval for the true proportion, π, of all
Harare street vendors who believe that local by-laws still hamper their
trading.
Solution
• From the data, x = 84 (number of ‘success’ outcomes) and n = 240 (sample size), so the sample proportion is p = 84/240 = 0.35.
• The standard error is SE = √(0.35 × 0.65/240) = 0.0308, and the 90% confidence level corresponds to z-limits of ±1.645.
Thus:
• lower limit: 0.35 – 1.645(0.0308) = 0.35 – 0.0507 = 0.299
• upper limit: 0.35 + 1.645(0.0308) = 0.35 + 0.0507 = 0.401
• The 90% confidence interval estimate for π is given by 0.299 ≤ π ≤ 0.401.
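The street-vendor interval can be reproduced with a short sketch. The function name is illustrative, and the standard error is assumed to be the usual √(p(1 − p)/n), which matches the 0.0308 used in the solution.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence):
    """z-based confidence limits for a population proportion."""
    p = successes / n
    se = sqrt(p * (1 - p) / n)                       # standard error of p
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.645 for 90%
    return p - z * se, p + z * se

# Harare street vendors: 84 of 240, 90% confidence.
lower, upper = proportion_confidence_interval(84, 240, 0.90)
print(round(lower, 3), round(upper, 3))
```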
Hypothesis Testing
Hypothesis refers to an intelligent claim
that can be statistically tested: “Those
who do not attend Statistics lectures fail
the exam”
Hypothesis and Hypothesis Testing

HYPOTHESIS A statement about the value of a population parameter developed for the purpose of testing.

HYPOTHESIS TESTING A procedure based on sample evidence and probability theory to determine whether the
hypothesis is a reasonable statement.

TEST STATISTIC A value, determined from sample information, used to determine whether to reject the null hypothesis.

CRITICAL VALUE The dividing point between the region where the null hypothesis is rejected and the region where it is
not rejected.
Important Things to Remember about H0 and H1

• H0: null hypothesis and H1: alternate hypothesis
• H0 and H1 are mutually exclusive and collectively exhaustive
• H0 is always presumed to be true
• H1 has the burden of proof
• A random sample (n) is used to “reject H0”
• If we conclude 'do not reject H0', this does not necessarily mean that the null hypothesis is true; it only suggests that there is not sufficient evidence to reject H0. Rejecting the null hypothesis, then, suggests that the alternative hypothesis may be true.
• Equality is always part of H0 (e.g. “=”, “≥”, “≤”).
• “≠”, “<” and “>” are always part of H1
• In actual practice, the status quo is set up as H0
• If the claim is “boastful” the claim is set up as H1 (we apply the Missouri rule – “show me”). Remember, H1 has the burden of proof
• In problem solving, look for key words and convert them into symbols. Some key words include: “improved, better than, as effective as, different from, has changed, etc.”

Keywords                                                Inequality Symbol    Part of:
Larger (or more) than                                   >                    H1
Smaller (or less)                                       <                    H1
No more than                                            ≤                    H0
At least                                                ≥                    H0
Has increased                                           >                    H1
Is there difference?                                    ≠                    H1
Has not changed                                         =                    H0
Has “improved”, “is better than”, “is more effective”   See text above       H1
Hypothesis Setups for Testing a Mean (µ) or a Proportion (π)

MEAN

PROPORTION
Testing for a Population Mean with a Known Population
Standard Deviation- Example

EXAMPLE
Jamestown Steel Company manufactures and assembles desks and other office equipment. The weekly production of the Model A325 desk at the Fredonia Plant follows the normal probability distribution with a mean of 200 and a standard deviation of 16. Recently, new production methods have been introduced and new employees hired. The VP of manufacturing would like to investigate whether there has been a change in the weekly production of the Model A325 desk.

Step 1: State the null hypothesis and the alternate hypothesis.
H0: µ = 200
H1: µ ≠ 200
(note: keyword in the problem “has changed”)

Step 2: Select the level of significance.
α = 0.01 as stated in the problem

Step 3: Select the test statistic.
Use Z-distribution since σ is known

Step 4: Formulate the decision rule.
Reject H0 if |Z| > Zα/2, where

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{203.5 - 200}{16/\sqrt{50}} = 1.55 \qquad \text{and} \qquad Z_{.01/2} = 2.58$$

1.55 is not > 2.58

Step 5: Make a decision and interpret the result.
Because 1.55 does not fall in the rejection region, H0 is not rejected. We conclude that the population mean is not different from 200. So we would report to the vice president of manufacturing that the sample evidence does not show that the production rate at the plant has changed from 200 per week.
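The five steps for the Jamestown Steel example can be mirrored in a few lines of code. This is a sketch under the same assumptions as the slide (σ known, two-tailed test at α = 0.01); the function name `z_statistic` is illustrative, and the critical value is computed with `statistics.NormalDist` rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(sample_mean, mu0, sigma, n):
    """Test statistic for a population mean when sigma is known."""
    return (sample_mean - mu0) / (sigma / sqrt(n))

# Jamestown Steel: sample mean 203.5 over n = 50 weeks, H0: mu = 200, sigma = 16.
z = z_statistic(203.5, 200, 16, 50)

alpha = 0.01
critical = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
reject_h0 = abs(z) > critical

print(round(z, 2), round(critical, 2), reject_h0)
```

For a one-tailed test the only change is the critical value, `NormalDist().inv_cdf(1 - alpha)`, and the comparison `z > critical` (or `z < -critical` for a left-tailed test).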
Testing for a Population Mean with a Known Population
Standard Deviation- Another Example

Suppose in the previous problem the vice president wants to know whether there has been an increase in the number of units assembled. To put it another way, can we conclude, because of the improved production methods, that the mean number of desks assembled in the last 50 weeks was more than 200?
Recall: σ = 16, n = 50, α = .01

Step 1: State the null hypothesis and the alternate hypothesis.
H0: µ ≤ 200
H1: µ > 200
(note: keyword in the problem “an increase”)

Step 2: Select the level of significance.
α = 0.01 as stated in the problem

Step 3: Select the test statistic.
Use Z-distribution since σ is known

Step 4: Formulate the decision rule.
Reject H0 if Z > Zα, where Zα = 2.33 for α = .01

Step 5: Make a decision and interpret the result.
Because 1.55 does not fall in the rejection region, H0 is not rejected. We conclude that the average number of desks assembled in the last 50 weeks is not more than 200.
Type of Errors and p-value in Hypothesis Testing

• Type I Error:
• Rejecting the null hypothesis when it is actually true.
• The probability of a Type I error is denoted by the Greek letter “α”
• Also known as the significance level of a test
• Type II Error:
• “Accepting” (failing to reject) the null hypothesis when it is actually false.
• The probability of a Type II error is denoted by the Greek letter “β”
• p-VALUE is the probability of observing a sample value as extreme as, or more extreme than, the value observed, given that the null hypothesis is true.
• In testing a hypothesis, we can also compare the p-value to the significance level (α).
• Decision rule using the p-value:
Reject H0 if p-value < significance level

EXAMPLE p-Value
Recall the last problem where the hypothesis and decision rules were set up as:
H0: µ ≤ 200
H1: µ > 200
Reject H0 if Z > Zα
where Z = 1.55 and Zα = 2.33

Reject H0 if p-value < α
0.0606 is not < 0.01
Conclude: Fail to reject H0
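The p-value of 0.0606 quoted above is the area under the standard normal curve to the right of Z = 1.55. A minimal sketch of that lookup, using the standard library instead of a z-table:

```python
from statistics import NormalDist

z = 1.55        # test statistic from the one-tailed example
alpha = 0.01    # significance level

# Right-tailed test: p-value is the probability of a Z value at least this large.
p_value = 1 - NormalDist().cdf(z)
reject_h0 = p_value < alpha

print(round(p_value, 4), reject_h0)
```

For a two-tailed test the p-value would be doubled, since extreme values in either direction count against H0.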
Testing for the Population Mean:
Population Standard Deviation Unknown

When the population standard deviation (σ) is unknown, the sample standard deviation (s) is used in its place and the t-distribution is used as the test statistic, which is computed using the formula:

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}} \qquad \text{with } n - 1 \text{ degrees of freedom}$$

EXAMPLE
The McFarland Insurance Company Claims Department reports the mean cost to process a claim is $60. An industry comparison showed this amount to be larger than most other insurance companies, so the company instituted cost-cutting measures. To evaluate the effect of the cost-cutting measures, the Supervisor of the Claims Department selected a random sample of 26 claims processed last month. The sample information is reported below.
At the .01 significance level, is it reasonable to conclude that the mean cost to process a claim is now less than $60?
Testing for the Population Mean: Population Standard
Deviation Unknown - Example

Step 1: State the null hypothesis and the alternate hypothesis.
H0: µ ≥ $60
H1: µ < $60

Step 2: Select the level of significance.
α = 0.01 as stated in the problem

Step 3: Select the test statistic.
Use t-distribution since σ is unknown

Step 4: Formulate the decision rule.
Reject H0 if t < -tα,n-1 (for α = .01 and n – 1 = 25 degrees of freedom, the critical value is -2.485)

Step 5: Make a decision and interpret the result.
Because -1.818 does not fall in the rejection region, H0 is not rejected at the .01 significance level. We have not demonstrated that the cost-cutting measures reduced the mean cost per claim to less than $60. The difference of $3.58 ($56.42 - $60) between the sample mean and the population mean could be due to sampling error.
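The t statistic in Step 5 can be sketched as below. Note the assumptions: the sample standard deviation is not shown in this text, so `s = 10.04` is a value back-calculated from the reported t of -1.818, the sample mean of $56.42 and n = 26, and should be treated as illustrative. The critical value is the standard t-table entry for 25 degrees of freedom and a one-tailed α of 0.01, entered by hand because the standard library has no t-distribution.

```python
from math import sqrt

def t_statistic(sample_mean, mu0, s, n):
    """Test statistic for a population mean when sigma is unknown."""
    return (sample_mean - mu0) / (s / sqrt(n))

# Assumption: s = 10.04 is implied by the reported t, not given in the text.
t = t_statistic(sample_mean=56.42, mu0=60, s=10.04, n=26)

critical = -2.485            # t-table value: df = 25, one-tailed alpha = 0.01
reject_h0 = t < critical     # left-tailed test

print(round(t, 3), reject_h0)
```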
Tests Concerning Proportion using
the z-Distribution
• A Proportion is the fraction or percentage that indicates the part of the population or sample having a particular trait of interest.
• The sample proportion is denoted by p and is found by x/n
• It is assumed that the binomial assumptions discussed in Chapter 6 are met:
(1) the sample data collected are the result of counts;
(2) the outcome of an experiment is classified into one of two mutually exclusive categories: a “success” or a “failure”;
(3) the probability of a success is the same for each trial; and
(4) the trials are independent
• Both nπ and n(1 – π) are at least 5.
• When the above conditions are met, the normal distribution can be used as an approximation to the binomial distribution
• The test statistic is computed as follows:

$$z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}}$$
Test Statistic for Testing a Single Population
Proportion - Example
EXAMPLE
Suppose prior elections in a certain state indicated it is necessary for a candidate for governor to receive at least 80 percent of the vote in the northern section of the state to be elected. The incumbent governor is interested in assessing his chances of returning to office and plans to conduct a survey of 2,000 registered voters in the northern section of the state. Using the hypothesis-testing procedure, assess the governor’s chances of reelection.

Step 1: State the null hypothesis and the alternate hypothesis.
H0: π ≥ .80
H1: π < .80
(note: keyword in the problem “at least”)

Step 2: Select the level of significance.
α = 0.05

Step 3: Select the test statistic.
Use Z-distribution since the assumptions are met and nπ and n(1 – π) are both ≥ 5

Step 4: Formulate the decision rule.
Reject H0 if Z < -Zα

Step 5: Make a decision and interpret the result.
The computed value of z (-2.80) is in the rejection region, so the null hypothesis is rejected at the .05 level. The evidence at this point does not support the claim that the incumbent governor will return to the governor’s mansion for another four years.
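A short sketch of the test statistic for this example follows. The survey result itself does not appear in this text, so the count of 1,550 supporters out of 2,000 is a hypothetical figure, chosen because it reproduces the reported z of about -2.80. The decision is checked at both the .05 and .01 levels.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z(successes, n, pi0):
    """z statistic for testing a single population proportion against pi0."""
    p = successes / n
    return (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

# Hypothetical survey outcome: 1550 of 2000 voters support the governor.
z = proportion_z(successes=1550, n=2000, pi0=0.80)

# Left-tailed test: reject H0 if z falls below the negative critical value.
decisions = {}
for alpha in (0.05, 0.01):
    critical = NormalDist().inv_cdf(alpha)   # about -1.645 and -2.326
    decisions[alpha] = z < critical

print(round(z, 2), decisions)
```

With these assumed figures the null hypothesis is rejected at either significance level.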
Two-Sample Tests of
Hypothesis
Comparing two populations – Some Examples

1. Is there a difference in the mean value of residential real estate sold by male agents and female agents in south Florida?
2. Is there a difference in the mean number of defects produced on the day and the afternoon shifts at Kimble Products?
3. Is there a difference in the mean number of days absent between young workers (under 21 years of age) and older workers (more than 60 years of age) in the fast-food industry?
4. Is there a difference in the proportion of Ohio State University graduates and University of Cincinnati graduates who pass the state Certified Public Accountant Examination on their first attempt?
5. Is there an increase in the production rate if music is piped into the production area?
Comparing Two Population Means
• No assumptions about the shape of the populations are required.
• The samples are from independent populations.
• The formula for computing the test statistic (z) is:

Use if sample sizes ≥ 30 or if σ1 and σ2 are known:

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

Use if sample sizes ≥ 30 and if σ1 and σ2 are unknown:

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

EXAMPLE
The U-Scan facility was recently installed at the Byrne Road Food-Town location. The store manager would like to know if the mean checkout time using the standard checkout method is longer than using the U-Scan. She gathered the following sample information. The time is measured from when the customer enters the line until their bags are in the cart. Hence the time includes both waiting in line and checking out.

Step 1: State the null and alternate hypotheses.
(keyword: “longer than”)
H0: µS ≤ µU
H1: µS > µU

Step 2: Select the level of significance.
The .01 significance level is stated in the problem.
Example 1 continued

Step 3: Determine the appropriate test statistic.
Because both population standard deviations are known, we can use the z-distribution as the test statistic.

Step 4: Formulate a decision rule.
Reject H0 if Z > Zα, i.e. if Z > 2.33

Step 5: Compute the value of z and make a decision.

$$z = \frac{\bar{X}_s - \bar{X}_u}{\sqrt{\dfrac{\sigma_s^2}{n_s} + \dfrac{\sigma_u^2}{n_u}}} = \frac{5.5 - 5.3}{\sqrt{\dfrac{0.40^2}{50} + \dfrac{0.30^2}{100}}} = \frac{0.2}{0.064} = 3.13$$

The computed value of 3.13 is larger than the critical value of 2.33. Our decision is to reject the null hypothesis. The difference of .20 minutes between the mean checkout time using the standard method and the mean checkout time using the U-Scan is too large to have occurred by chance. We conclude the U-Scan method is faster.
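The two-sample z statistic can be computed directly from the figures in Step 5. The sketch below uses an illustrative function name. Without intermediate rounding the statistic comes out at about 3.12; the slide's 3.13 results from rounding the denominator to 0.064 first. The conclusion is unaffected.

```python
from math import sqrt

def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2):
    """z statistic for the difference between two means, sigmas known."""
    standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    return (mean1 - mean2) / standard_error

# Standard checkout (s) versus U-Scan (u), values from the worked example.
z = two_sample_z(mean1=5.5, mean2=5.3, sigma1=0.40, sigma2=0.30, n1=50, n2=100)

critical = 2.33              # one-tailed critical value at alpha = .01
reject_h0 = z > critical

print(round(z, 3), reject_h0)
```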
Two-Sample Tests about Proportions
We investigate whether two samples came from populations with an equal proportion of successes. The two samples are pooled using the following formula:

$$p_c = \frac{X_1 + X_2}{n_1 + n_2}$$

where X1 and X2 are the numbers of successes in samples of sizes n1 and n2.

The value of the test statistic is computed from the following formula:

$$z = \frac{p_1 - p_2}{\sqrt{\dfrac{p_c(1 - p_c)}{n_1} + \dfrac{p_c(1 - p_c)}{n_2}}}$$

EXAMPLE
Manelli Perfume Company recently developed a new fragrance that it plans to market under the name Heavenly. A number of market studies indicate that Heavenly has very good market potential. The Sales Department at Manelli is particularly interested in whether there is a difference in the proportions of younger and older women who would purchase Heavenly if it were marketed. Samples are collected from each of these independent groups. Each sampled woman was asked to smell Heavenly and indicate whether she likes the fragrance well enough to purchase a bottle.

Step 1: State the null and alternate hypotheses.
(keyword: “there is a difference”)
H0: π1 = π2
H1: π1 ≠ π2

Step 2: Select the level of significance.
The .05 significance level is stated in the problem.

Step 3: Determine the appropriate test statistic.
We will use the z-distribution
Two Sample Tests of Proportions -
Example
Step 4: Formulate the decision rule.

Reject H0 if Z > Zα/2 or Z < -Zα/2

Z > Z.05/2 or Z < -Z.05/2

Z > 1.96 or Z < -1.96

Let p1 = young women p2 = older women


Step 5: Select a sample and make a decision
• The computed value of -2.21 is in the area of rejection. Therefore, the null hypothesis is rejected at the .05 significance level. To put it another way, we reject the null hypothesis that the proportion of young women who would purchase Heavenly is equal to the proportion of older women who would purchase Heavenly.
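The pooled-proportion test can be sketched as follows. The survey counts are not given in the text, so the counts below are assumptions chosen purely for illustration (they produce a z value close to the -2.21 reported above); the function name is also hypothetical.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Return (pooled proportion, z) for testing H0: pi1 = pi2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    # Standard error under H0, using the pooled proportion for both samples.
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return pooled, (p1 - p2) / std_error

# ASSUMED counts (not from the text): 19 of 100 younger women and
# 62 of 200 older women say they would buy the fragrance.
pooled, z = two_proportion_z(19, 100, 62, 200)

# Two-tailed decision rule at the .05 level (critical value from the text).
reject_h0 = abs(z) > 1.96
print(f"pooled = {pooled:.2f}, z = {z:.2f}, reject H0: {reject_h0}")
```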
Comparing Population Means with Unknown Population Standard
Deviations (the Pooled t-test)

The t distribution is used as the test statistic if one or more of the samples have fewer than 30 observations. The required assumptions are:
1. Both populations must follow the normal distribution.
2. The populations must have equal standard deviations.
3. The samples are from independent populations.

Finding the value of the test statistic requires two steps.
1. Pool the sample standard deviations.
2. Use the pooled standard deviation in the formula.

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

EXAMPLE
Owens Lawn Care, Inc., manufactures and assembles lawnmowers that are shipped to dealers throughout the United States and Canada. Two different procedures have been proposed for mounting the engine on the frame of the lawnmower. The question is: Is there a difference in the mean time to mount the engines on the frames of the lawnmowers?

To evaluate the two methods, it was decided to conduct a time and motion study. A sample of five employees was timed using the Welles method and six using the Atkins method. The results, in minutes, are shown below:

Is there a difference in the mean mounting times? Use the .10 significance level.
Comparing Population Means with Unknown Population Standard Deviations (the
Pooled t-test) - Example

Step 1: State the null and alternate hypotheses.
(Keyword: “Is there a difference”)

H0: µ1 = µ2

H1: µ1 ≠ µ2

Step 2: State the level of significance.

The 0.10 significance level is stated in the problem.

Step 3: Find the appropriate test statistic.

Because the population standard deviations are not known but are assumed to be equal, we use the pooled t-test.

Step 4: State the decision rule.

Reject H0 if t > tα/2,n1+n2-2 or t < -tα/2,n1+n2-2

t > t.05,9 or t < -t.05,9

t > 1.833 or t < -1.833

Step 5: Compute the value of t and make a decision

t = -0.662

The decision is not to reject the null hypothesis, because -0.662 falls in the region between -1.833 and 1.833.

We conclude that there is no difference in the mean times to mount the engine on the frame using the two methods.
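The two-step pooled calculation can be sketched in a few lines. The table of mounting times did not survive in the text, so the sample values below are assumptions for illustration only (five Welles times and six Atkins times, chosen so that t is close to the -0.662 quoted above); `pooled_t` is a hypothetical helper name.

```python
import math
from statistics import mean, variance

def pooled_t(sample1, sample2):
    """Return (pooled variance, t, degrees of freedom) for the pooled t-test."""
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    # Step 1: pool the two sample variances, weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    # Step 2: use the pooled variance in the t formula.
    t = (mean(sample1) - mean(sample2)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return pooled_var, t, df

# ASSUMED mounting times in minutes (not from the text).
welles = [2, 4, 9, 3, 2]
atkins = [3, 7, 5, 8, 4, 3]

pooled_var, t, df = pooled_t(welles, atkins)

# Two-tailed rule at the .10 level with 9 df (critical value from the text).
reject_h0 = abs(t) > 1.833
print(f"s_p^2 = {pooled_var:.3f}, t = {t:.3f}, df = {df}, reject H0: {reject_h0}")
```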
Comparing Population Means with Unequal Population Standard
Deviations

Compute the following t-statistic if it is not reasonable to assume the population standard deviations are equal.

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

The sample standard deviations s1 and s2 are used in place of the respective population standard deviations.

In addition, the degrees of freedom are adjusted downward by a rather complex approximation formula. The effect is to reduce the number of degrees of freedom in the test, which will require a larger value of the test statistic to reject the null hypothesis.

$$df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$

EXAMPLE
Personnel in a consumer testing laboratory are evaluating the absorbency of paper towels. They wish to compare a set of store brand towels to a similar group of name brand ones. For each brand they dip a ply of the paper into a tub of fluid, allow the paper to drain back into the vat for two minutes, and then evaluate the amount of liquid the paper has taken up from the vat. A random sample of 9 store brand paper towels absorbed the following amounts of liquid in milliliters.

8 8 3 1 9 7 5 5 12

An independent random sample of 12 name brand towels absorbed the following amounts of liquid in milliliters:

12 11 10 6 8 9 9 10 11 9 8 10

Use the .10 significance level and test if there is a difference in the mean amount of liquid absorbed by the two types of paper towels.
Comparing Population Means with Unequal Population Standard
Deviations - Example

The following dot plot provided by MINITAB shows the variances to be unequal.

The following output provided by MINITAB shows the descriptive statistics.
Comparing Population Means with Unequal Population Standard
Deviations - Example

Step 1: State the null and alternate hypotheses.

H0: µ1 = µ2

H1: µ1 ≠ µ2

Step 2: State the level of significance.


The .10 significance level is stated
in the problem.

Step 3: Find the appropriate test statistic.


We will use unequal variances t-test

Step 4: State the decision rule.

Reject H0 if

t > tα/2,d.f. or t < -tα/2,d.f.

t > t.05,10 or t < - t.05, 10

t > 1.812 or t < -1.812

Step 5: Compute the value of t and make a decision


The computed value of t is less than the lower critical value, so our decision is to reject the null hypothesis. We conclude that the mean absorption rate for the two towels is not the same.
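The unequal-variances test can be reproduced from the towel data listed in the example. This is a sketch, assuming the usual approximation for the adjusted degrees of freedom, rounded down to a whole number; the variable names are illustrative. Unrounded, t comes out near -2.47 with about 10.9 degrees of freedom, which rounds down to the 10 used in the decision rule.

```python
import math
from statistics import mean, variance

# Absorption in milliliters, as listed in the example.
store = [8, 8, 3, 1, 9, 7, 5, 5, 12]
name = [12, 11, 10, 6, 8, 9, 9, 10, 11, 9, 8, 10]

# Each sample's contribution to the variance of the difference in means.
v1 = variance(store) / len(store)
v2 = variance(name) / len(name)

t = (mean(store) - mean(name)) / math.sqrt(v1 + v2)

# Adjusted degrees of freedom, rounded down.
df_exact = (v1 + v2) ** 2 / (v1**2 / (len(store) - 1) + v2**2 / (len(name) - 1))
df = math.floor(df_exact)

# Two-tailed rule at the .10 level with 10 df (critical value from the text).
reject_h0 = abs(t) > 1.812
print(f"t = {t:.3f}, df = {df}, reject H0: {reject_h0}")
```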
Two-Sample Tests of Hypothesis:
Dependent Samples
Dependent samples are samples that are paired or related in some fashion.

For example:
• If you wished to buy a car you would look at the same car at two (or more) different dealerships and compare the prices.
• If you wished to measure the effectiveness of a new diet you would weigh the dieters at the start and at the finish of the program.

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

Where
d̄ is the mean of the differences
sd is the standard deviation of the differences
n is the number of pairs (differences)

EXAMPLE
Nickel Savings and Loan wishes to compare the two companies it uses to appraise the value of residential homes. Nickel Savings selected a sample of 10 residential properties and scheduled both firms for an appraisal. The results, reported in $000, are shown in the table.

At the .05 significance level, can we conclude there is a difference in the mean appraised values of the homes?
Hypothesis Testing Involving Paired
Observations - Example
Step 1: State the null and alternate hypotheses.

H0: µd = 0

H1: µd ≠ 0

Step 2: State the level of significance.


The .05 significance level is stated in the problem.

Step 3: Find the appropriate test statistic.


We will use the t-test

Step 4: State the decision rule.

Reject H0 if

t > tα/2,n-1 or t < -tα/2,n-1

t > t.025,9 or t < - t.025, 9

t > 2.262 or t < -2.262

Step 5: Compute the value of t and make a decision

The computed value of t (3.305) is greater than the higher critical value (2.262), so our decision is to reject the null hypothesis.
We conclude that there is a difference in the mean appraised values of the homes.
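A paired test works on the differences within each pair, not on the two samples separately. The sketch below assumes illustrative appraisal values, because the table is not reproduced in the text; they were chosen so that t lands near the 3.305 quoted above. `paired_t` is a hypothetical helper.

```python
import math
from statistics import mean, stdev

def paired_t(first, second):
    """Return (t, degrees of freedom) for paired observations."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    # t = mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# ASSUMED appraisals in $000 for the same 10 homes (not from the text).
firm_a = [235, 210, 231, 242, 205, 230, 231, 210, 225, 249]
firm_b = [228, 205, 219, 240, 198, 223, 227, 215, 222, 245]

t, df = paired_t(firm_a, firm_b)

# Two-tailed rule at the .05 level with 9 df (critical value from the text).
reject_h0 = abs(t) > 2.262
print(f"t = {t:.3f}, df = {df}, reject H0: {reject_h0}")
```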
Analysis of Variance
The F Distribution

Uses of the F Distribution


• To test whether two samples are from populations having equal variances.
• To compare several population means simultaneously. The simultaneous comparison of several population means is called analysis of variance (ANOVA).
Assumption:
In both of the uses above, the populations must
follow a normal distribution, and the data must be
at least interval-scale.

Characteristics of the F Distribution


1. There is a “family” of F Distributions. A particular
member of the family is determined by two
parameters: the degrees of freedom in the numerator
and the degrees of freedom in the denominator.
2. The F distribution is continuous
3. F cannot be negative.
4. The F distribution is positively skewed.
5. It is asymptotic. As F → ∞ the curve approaches the X-axis but never touches it.
Comparing Two Population Variances
The F distribution is used to test the hypothesis that the variance of one normal population equals the variance of
another normal population.

Examples:
• Two Barth shearing machines are set to produce steel bars of the same length. The bars, therefore, should have the
same mean length. We want to ensure that in addition to having the same mean length they also have similar
variation.

• The mean rate of return on two types of common stock may be the same, but there may be more variation in the rate of return in one than the other. A sample of 10 technology and 10 utility stocks shows the same mean rate of return, but there is likely more variation in the technology stocks.

• A study by the marketing department for a large newspaper found that men and women spent about the same
amount of time per day reading the paper. However, the same report indicated there was nearly twice as much
variation in time spent per day among the men than the women.
Test for Equal Variances - Example
Lammers Limos offers limousine service from the city hall in Toledo, Ohio, to Metro Airport in Detroit. Sean Lammers, president of the company, is considering two routes. One is via U.S. 25 and the other via I-75. He wants to study the time it takes to drive to the airport using each route and then compare the results. He collected the following sample data, which is reported in minutes.
Using the .10 significance level, is there a difference in the variation in the driving times for the two routes?

Step 1: The hypotheses are:
H0: σ1² = σ2²
H1: σ1² ≠ σ2²

Step 2: The significance level is .10.

Step 3: The test statistic is the F distribution.

Step 4: State the decision rule.
Reject H0 if F > Fα/2,v1,v2
F > F.10/2,7-1,8-1
F > F.05,6,7
F > 3.87
Test for Equal Variances - Example
Step 5: Compute the value of F and make a decision

The decision is to reject the null hypothesis, because the computed F value (4.23) is larger than the critical value (3.87).

We conclude that there is a difference in the variation of the travel times along the two routes.
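The F statistic for two variances is the ratio of the sample variances, with the larger one conventionally placed in the numerator. The driving times are not reproduced in the text, so the values below are assumptions for illustration (7 and 8 observations, chosen so that F is close to the 4.23 quoted above); `variance_ratio` is a hypothetical helper.

```python
from statistics import variance

def variance_ratio(sample_a, sample_b):
    """Return (F, numerator df, denominator df), larger variance on top."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1

# ASSUMED driving times in minutes (not from the text).
us_25 = [52, 67, 56, 45, 70, 54, 64]
i_75 = [59, 60, 61, 51, 56, 63, 57, 65]

f_value, df_num, df_den = variance_ratio(us_25, i_75)

# Critical value F.05,6,7 = 3.87 is taken from the text.
reject_h0 = f_value > 3.87
print(f"F = {f_value:.2f} with ({df_num}, {df_den}) df, reject H0: {reject_h0}")
```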
Comparing Means of Two or More
Populations
The F distribution is also used for testing whether two or more sample means came from the same or equal populations.

Assumptions:
• The sampled populations follow the normal distribution.
• The populations have equal standard deviations.
• The samples are randomly selected and are independent.

The Null Hypothesis is that the population means are the same. The Alternative Hypothesis is that at least one of the means is different.

H0: µ1 = µ2 =…= µk
H1: The means are not all equal
Reject H0 if F > Fα,k-1,n-k
Comparing Means of Two or More Populations –
Example

EXAMPLE
Recently a group of four major carriers joined in hiring Brunner Marketing Research, Inc., to survey recent passengers regarding their level of satisfaction with a recent flight. The survey included questions on ticketing, boarding, in-flight service, baggage handling, pilot communication, and so forth.

Twenty-five questions offered a range of possible answers: excellent, good, fair, or poor. A response of excellent was given a score of 4, good a 3, fair a 2, and poor a 1. These responses were then totaled, so the total score was an indication of the satisfaction with the flight. Brunner Marketing Research, Inc., randomly selected and surveyed passengers from the four airlines.

Is there a difference in the mean satisfaction level among the four airlines?
Use the .01 significance level.

Step 1: State the null and alternate hypotheses.
H0: µE = µA = µT = µO
H1: The means are not all equal

Step 2: State the level of significance.
The .01 significance level is stated in the problem.

Step 3: Find the appropriate test statistic.
Use the F statistic

Step 4: State the decision rule.
Reject H0 if: F > Fα,k-1,n-k
F > F.01,4-1,22-4
F > F.01,3,18
F > 5.09

Step 5: Compute the value of F and make a decision

The computed value of F is 8.99, which is greater than the critical value of 5.09, so the null hypothesis is rejected.
Conclusion: The mean scores are not the same for the four airlines; at this point we can only conclude there is a difference in the treatment means. We cannot determine which treatment groups differ or how many treatment groups differ.
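The F statistic in a one-way ANOVA is the mean square for treatments divided by the mean square for error. The airline scores are not reproduced in the text, so the sketch below uses a small made-up data set purely to show the arithmetic; `one_way_anova` is a hypothetical helper and its output is unrelated to the 8.99 quoted for the airline example.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, treatment df, error df) for a list of samples."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation between the group means (treatments).
    ss_treatment = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation within each group (error).
    ss_error = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_treatment = ss_treatment / (k - 1)
    ms_error = ss_error / (n - k)
    return ms_treatment / ms_error, k - 1, n - k

# MADE-UP scores for three groups (illustration only).
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
f_value, df_treatment, df_error = one_way_anova(groups)
print(f"F = {f_value:.2f} with ({df_treatment}, {df_error}) df")
```

The computed F is then compared with the critical value Fα,k-1,n-k from an F table, exactly as in Step 4.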


Two-Way Analysis of Variance
• For the two-factor ANOVA we test whether there is a significant difference between the treatment effect and whether there is a difference in the blocking effect. Let Br be the block totals (r for rows).

• Let SSB represent the sum of squares for the blocks where:

$$SSB = k\sum(\bar{x}_b - \bar{x}_G)^2$$

EXAMPLE
WARTA, the Warren Area Regional Transit Authority, is expanding bus service from the suburb of Starbrick into the central business district of Warren. There are four routes being considered from Starbrick to downtown Warren: (1) via U.S. 6, (2) via the West End, (3) via the Hickory Street Bridge, and (4) via Route 59. WARTA conducted several tests to determine whether there was a difference in the mean travel times along the four routes. Because there will be many different drivers, the test was set up so each driver drove along each of the four routes. The next slide shows the travel time, in minutes, for each driver-route combination. At the .05 significance level, is there a difference in the mean travel time along the four routes? If we remove the effect of the drivers, is there a difference in the mean travel time?
Two-Way Analysis of Variance - Example

Step 1: State the null and alternate hypotheses.


H0: µu = µw = µh = µr
H1: Not all treatment means are the same

Step 2: State the level of significance.


The .05 significance level is stated in the problem.

Step 3: Find the appropriate test statistic.


Because we are comparing means of more than two groups, use the F statistic

Step 4: State the decision rule.


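Reject H0 if F > Fα,v1,v2

For the routes (treatments, k = 4):
F > F.05,k-1,(b-1)(k-1)
F > F.05,4-1,(5-1)(4-1)
F > F.05,3,12
F > 3.49

For the drivers (blocks, b = 5):
F > F.05,b-1,(b-1)(k-1)
F > F.05,5-1,(5-1)(4-1)
F > F.05,4,12
F > 3.26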

Using Excel to perform the calculations, we conclude:


(1) The mean time is not the same for all drivers
(2) The mean times for the routes are not all the same
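The blocked design splits the total variation into treatment, block, and error parts. The WARTA travel times are not reproduced in the text, so the sketch below uses a small made-up 3-by-3 table only to show the arithmetic; `blocked_anova` is a hypothetical helper, and rows are taken to be blocks (drivers) and columns treatments (routes).

```python
from statistics import mean

def blocked_anova(table):
    """Return (F for treatments, F for blocks); rows are blocks, columns treatments."""
    b, k = len(table), len(table[0])
    grand = mean([x for row in table for x in row])
    row_means = [mean(row) for row in table]
    col_means = [mean(col) for col in zip(*table)]
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_blocks = k * sum((m - grand) ** 2 for m in row_means)   # SSB
    ss_treat = b * sum((m - grand) ** 2 for m in col_means)    # SST
    ss_error = ss_total - ss_blocks - ss_treat                 # SSE
    ms_error = ss_error / ((b - 1) * (k - 1))
    f_treat = (ss_treat / (k - 1)) / ms_error
    f_blocks = (ss_blocks / (b - 1)) / ms_error
    return f_treat, f_blocks

# MADE-UP travel times: 3 drivers (rows) by 3 routes (columns).
times = [
    [10, 12, 14],
    [11, 14, 14],
    [12, 13, 17],
]
f_routes, f_drivers = blocked_anova(times)
print(f"F(routes) = {f_routes:.2f}, F(drivers) = {f_drivers:.2f}")
```

Each F is compared with its own critical value: k-1 numerator degrees of freedom for the routes, b-1 for the drivers, and (b-1)(k-1) in the denominator for both.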
Two-way ANOVA with Interaction

• In the previous section, we studied the separate or independent effects of two variables, routes into the city and drivers, on mean travel time.
• There is another effect that may influence travel time. This is called an interaction effect between route and driver on travel time. For example, is it possible that one of the drivers is especially good driving one or more of the routes?
• The combined effect of driver and route may also explain differences in mean travel time.
• To measure interaction effects it is necessary to have at least two observations in each cell.
• When we use a two-way ANOVA to study interaction, we call the two variables factors instead of blocks.
• Interaction occurs if the combination of two factors has some effect on the variable under study, in addition to each factor alone.
• The variable being studied is referred to as the response variable.
• One way to study interaction is by plotting factor means in a graph called an interaction plot.

Is there really an interaction between routes and drivers?
Are the travel times for the drivers the same?
Are the travel times for the routes the same?

Of the three questions, we are most interested in the test for interactions. To put it another way, does a particular route/driver combination result in significantly faster (or slower) driving times? Also, the results of the hypothesis test for interaction affect the way we analyze the route and driver questions.
Example – ANOVA with Replication

Suppose the WARTA blocking experiment discussed earlier is repeated by measuring two more travel times for each driver and route combination with the data shown in the Excel worksheet.

The ANOVA now has three sets of hypotheses to test:

1. H0: There is no interaction between drivers and routes.
   H1: There is interaction between drivers and routes.

2. H0: The driver means are the same.
   H1: The driver means are not the same.

3. H0: The route means are the same.
   H1: The route means are not the same.
ANOVA Table
[ANOVA table with rows for Route and Driver]
One-way ANOVA for Each Driver
H0: Route travel times are equal.

Conclusion:
The route travel times are all equal for Deans and Ormson (at the 0.05 significance level).

The route travel times are not all equal for Filbeck, Snaverly and Zollaco.
Linear Regression and
Correlation
Regression Analysis - Introduction
• Recall in Chapter 4 the idea of showing the relationship between two variables with a scatter diagram was introduced.
• In that case we showed that, as the age of the buyer increased, the amount spent for the vehicle also increased.
• In this chapter we carry this idea further. Numerical measures to express the strength of relationship between two variables are developed.
• In addition, an equation is used to express the relationship between variables, allowing us to estimate one variable on the basis of another.

EXAMPLES
1. Is there a relationship between the amount Healthtex spends per month on advertising and its sales in the month?
2. Can we base an estimate of the cost to heat a home in January on the number of square feet in the home?
3. Is there a relationship between the miles per gallon achieved by large pickup trucks and the size of the engine?
4. Is there a relationship between the number of hours that students studied for an exam and the score earned?

• Correlation Analysis is the study of the relationship between variables. It is also defined as a group of techniques to measure the association between two variables.

• Scatter Diagram is a chart that portrays the relationship between the two variables. It is the usual first step in correlation analysis.

• The Dependent Variable is the variable being predicted or estimated.

• The Independent Variable provides the basis for estimation. It is the predictor variable.
Scatter Diagram Example

The sales manager of Copier Sales of America, which has a large sales force throughout the United States and Canada, wants to
determine whether there is a relationship between the number of sales calls made in a month and the number of copiers
sold that month. The manager selects a random sample of 10 representatives and determines the number of sales calls each
representative made last month and the number of copiers sold.
The Coefficient of Correlation, r
The Coefficient of Correlation (r) is a measure of the strength of the relationship between two variables.

• It shows the direction and strength of the linear relationship between two interval or ratio-scale variables

• It can range from -1.00 to +1.00.

• Values of -1.00 or +1.00 indicate perfect and strong correlation.

• Values close to 0.0 indicate weak correlation.

• Negative values indicate an inverse relationship and positive values indicate a direct relationship.
Correlation Coefficient - Example
EXAMPLE
Using the Copier Sales of America data, for which a scatterplot is shown below, compute the correlation coefficient and coefficient of determination.

Using the formula for r, the correlation coefficient is 0.759.

How do we interpret a correlation of 0.759?


First, it is positive, so we see there is a direct relationship between the number of sales
calls and the number of copiers sold. The value of 0.759 is fairly close to 1.00, so we
conclude that the association is strong. However, does this mean that more sales calls
cause more sales? No, we have not demonstrated cause and effect here, only that the
two variables—sales calls and copiers sold—are related.
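The coefficient of correlation can be computed from the sums of X, Y, XY, X², and Y². The copier data table is not reproduced in the text, so the ten pairs below are assumptions for illustration, chosen so that r is close to the 0.759 quoted above; `pearson_r` is a hypothetical helper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from the computational (sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator

# ASSUMED data for 10 representatives (not from the text).
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

r = pearson_r(calls, sold)
r_squared = r**2  # coefficient of determination
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```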
Testing the Significance of
the Correlation Coefficient – Copier Sales Example

H0: ρ = 0 (the correlation in the population is 0)

H1: ρ ≠ 0 (the correlation in the population is not 0)

Reject H0 if:

t > tα/2,n-2 or t < -tα/2,n-2

t > t0.025,8 or t < -t0.025,8

t > 2.306 or t < -2.306

The computed t (3.297) is within the rejection region; therefore, we reject H0. This means the correlation in the population is not zero. From a practical standpoint, it indicates to the sales manager that there is correlation between the number of sales calls made and the number of copiers sold in the population of salespeople.
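The computed t of 3.297 follows from r = 0.759 and n = 10 using the standard test statistic for a correlation coefficient, t = r√(n−2) / √(1−r²). A minimal sketch (the function name is illustrative):

```python
import math

def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

# r and n are taken from the Copier Sales example.
t = correlation_t(0.759, 10)

# Two-tailed rule at the .05 level with 8 df (critical value from the text).
reject_h0 = abs(t) > 2.306
print(f"t = {t:.3f}, reject H0: {reject_h0}")
```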
Regression Analysis
In regression analysis we use the independent variable (X) to estimate the dependent variable (Y).
• The relationship between the variables is linear.
• Both variables must be at least interval scale.
• The least squares criterion is used to determine the equation.

REGRESSION EQUATION An equation that expresses the linear relationship between two variables.

LEAST SQUARES PRINCIPLE Determining a regression equation by minimizing the sum of the squares of the vertical distances
between the actual Y values and the predicted values of Y.

$$b = \frac{n(\sum XY) - (\sum X)(\sum Y)}{n(\sum X^2) - (\sum X)^2}$$

$$a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}$$
Linear Regression Model
Regression Equation - Example

Recall the example involving Copier Sales of America. The sales manager gathered information on the number of sales calls made and the number of copiers sold for a random sample of 10 sales representatives. Use the least squares method to determine a linear equation to express the relationship between the two variables. What is the expected number of copiers sold by a representative who made 20 calls?

Step 1 – Find the slope (b) of the line

Step 2 – Find the y-intercept (a)

The regression equation is:

$$\hat{Y} = a + bX$$
$$\hat{Y} = 18.9476 + 1.1842X$$
$$\hat{Y} = 18.9476 + 1.1842(20)$$
$$\hat{Y} = 42.6316$$
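The slope and intercept formulas translate directly into code. The copier data table is not reproduced in the text, so the ten pairs below are assumptions for illustration, chosen so that the fitted line is close to the equation quoted above; `least_squares` is a hypothetical helper.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line Y-hat = a + bX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)  # slope
    a = sum_y / n - b * sum_x / n                               # y-intercept
    return a, b

# ASSUMED data for 10 representatives (not from the text).
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

a, b = least_squares(calls, sold)
predicted = a + b * 20  # expected copiers sold for 20 calls
print(f"Y-hat = {a:.4f} + {b:.4f}X, prediction at X = 20: {predicted:.4f}")
```

With full precision the intercept is about 18.9474; the 18.9476 quoted above arises when the slope is first rounded to 1.1842.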
Assumptions Underlying Linear
Regression
For each value of X, there is a group of Y values, and:
• These Y values are normally distributed. The means of these normal distributions of Y values all lie on the straight line of regression.
• The standard deviations of these normal distributions are equal.
• The Y values are statistically independent. This means that in the selection of a sample, the Y values chosen for a particular X value do not depend on the Y values for any other X values.
Time Series and
Forecasting
Time Series and its Components
TIME SERIES is a collection of data recorded over a period of time (weekly, monthly, quarterly). It is an analysis of history that management can use to make current decisions and plans based on long-term forecasting. It usually assumes that past patterns will continue into the future.

Components of a Time Series

1. Secular Trend – the smooth long-term direction of a time series

2. Cyclical Variation – the rise and fall of a time series over periods longer than one year

3. Seasonal Variation – patterns of change in a time series within a year which tend to repeat each year

4. Irregular Variation – classified into:
   Episodic – unpredictable but identifiable
   Residual – also called chance fluctuation and unidentifiable
The Moving Average Method
• Useful in smoothing a time series to see its trend
• Basic method used in measuring seasonal fluctuation
• Applicable when the time series follows a fairly linear trend with a definite rhythmic pattern
Linear Trend – Using the Least Squares
Method: An Example
The sales of Jensen Foods, a small grocery chain located in southwest Texas, since 2005 are:

Year    t    Sales ($ mil.)
2005    1    7
2006    2    10
2007    3    9
2008    4    11
2009    5    13
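The least squares trend line for the Jensen Foods data can be worked out with the same slope and intercept formulas used for regression, with the coded year t as the independent variable. A minimal sketch; the variable names are illustrative, and the 2010 forecast is simply the fitted line evaluated at t = 6.

```python
# Jensen Foods data from the table above.
t_values = [1, 2, 3, 4, 5]
sales = [7, 10, 9, 11, 13]   # $ millions

n = len(t_values)
sum_t, sum_y = sum(t_values), sum(sales)
sum_ty = sum(t * y for t, y in zip(t_values, sales))
sum_t2 = sum(t * t for t in t_values)

# Least squares slope and intercept of the trend line Y-hat = a + bt.
b = (n * sum_ty - sum_t * sum_y) / (n * sum_t2 - sum_t**2)
a = sum_y / n - b * sum_t / n

forecast_2010 = a + b * 6    # 2010 corresponds to t = 6
print(f"Y-hat = {a:.1f} + {b:.1f}t, forecast for t = 6: {forecast_2010:.1f}")
```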
Seasonal Variation and Seasonal Index

• One of the components of a time series
• Seasonal variations are fluctuations that coincide with certain seasons and are repeated year after year
• Understanding seasonal fluctuations helps plan for sufficient goods and materials on hand to meet varying seasonal demand
• Analysis of seasonal fluctuations over a period of years helps in evaluating current sales

SEASONAL INDEX
• A number, usually expressed in percent, that expresses the relative value of a season with respect to the average for the year (100%)
• Ratio-to-moving-average method
  • The method most commonly used to compute the typical seasonal pattern
  • It eliminates the trend (T), cyclical (C), and irregular (I) components from the time series
Seasonal Index – An Example
EXAMPLE
The table below shows the quarterly sales for Toys International for the years 2001 through 2006. The sales are reported in millions of dollars. Determine a quarterly seasonal index using the ratio-to-moving-average method.

Step (1) Organize time series data in column form
Step (2) Compute the 4-quarter moving totals
Step (3) Compute the 4-quarter moving averages
Step (4) Compute the centered moving averages by getting the average of two 4-quarter moving averages
Step (5) Compute ratio by dividing actual sales by the centered moving averages
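Steps (2) to (5) can be sketched as below. The Toys International table is not reproduced in the text, so the eight quarterly values are made up purely to show the mechanics; `ratio_to_moving_average` is a hypothetical helper. The remaining work (averaging the ratios for each quarter and scaling them into a seasonal index) is not shown.

```python
def ratio_to_moving_average(sales, period=4):
    """Return (centered moving averages, ratios of actual sales to them)."""
    # Steps 2-3: moving totals divided by the period give moving averages.
    moving_avgs = [sum(sales[i:i + period]) / period
                   for i in range(len(sales) - period + 1)]
    # Step 4: average each adjacent pair so the result lines up with a quarter.
    centered = [(moving_avgs[i] + moving_avgs[i + 1]) / 2
                for i in range(len(moving_avgs) - 1)]
    # Step 5: the first centered value corresponds to the third observation.
    offset = period // 2
    ratios = [sales[i + offset] / c for i, c in enumerate(centered)]
    return centered, ratios

# MADE-UP quarterly sales for two years (illustration only).
quarterly_sales = [4, 2, 6, 8, 4, 2, 6, 8]
centered, ratios = ratio_to_moving_average(quarterly_sales)
print(centered)
print(ratios)
```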
Seasonal Index – An Example
Actual versus Deseasonalized Sales for Toys
International
Deseasonalized Sales = Sales / Seasonal Index
Seasonal Index – An Example Using Excel

Given the deseasonalized linear equation for Toys International sales as Ŷ = 8.109 + 0.0899t, generate the seasonally adjusted forecast for each of the quarters of 2010.

Ŷ = 8.109 + 0.0899(28)
Ŷ × SI = 10.62648 × 1.519
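The forecast for one quarter is the trend value multiplied by that quarter's seasonal index. A minimal sketch using the coefficients and the index quoted above; the function name is illustrative. With the rounded coefficients the trend value is about 10.626, slightly below the 10.62648 shown, presumably because the original calculation carried more decimal places.

```python
def seasonal_forecast(t, seasonal_index, intercept=8.109, slope=0.0899):
    """Return (deseasonalized trend value, seasonally adjusted forecast)."""
    trend = intercept + slope * t
    return trend, trend * seasonal_index

# t = 28 and SI = 1.519 are the values used in the example above.
trend, forecast = seasonal_forecast(28, 1.519)
print(f"trend = {trend:.4f}, forecast = {forecast:.4f} ($ millions)")
```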
