Commerce 1DA3 Notes-2


Chapter 1: Introduction To Statistics

Have You Ever Wondered…

● if the number of children in the household has a relationship with
the household’s annual income
● whether or not air quality has a relationship with the number of
hospital visits due to respiratory problems
● if the general public is able to tell Pepsi from Coke just by tasting
it
● how wide is the pay gap between male and female CEOs in
Canada
● what factors contribute to the likelihood of a company filing for
bankruptcy in 5 years

Statistical Analysis

● Data is all around us


● We need statistics to extract useful information from data
● Good statistical analysis includes:
○ Finding the right data
○ Categorizing and visualizing the data
○ Applying relevant statistical tools
○ Visualizing and interpreting the results in a meaningful way
○ Solving problems using the data analysis

Chapter 2: Data
What Is Data?

● Data values or observations are information collected regarding
some subject
● Data can be numbers, names, etc., and tell us the “Who and What”
● Data are useless without their context
What is Data?

● The rows of a data table correspond to individual cases about
Whom (or about Which – if they are not people) we record some
characteristics.
● The characteristics recorded about each individual or case are
called variables.
● These are usually shown as the columns of a data table and
identify What has been measured

What is Data?

● Data tables are cumbersome for complex data sets, so often two
or more separate data tables are linked together in a relational
database
● Each data table included in the database is a relation because it is
about a specific set of cases with information about each of these
cases for all of the variables
● Example: A typical relational database is provided consisting of
three relations: customer data, item data, and transaction data

Types of Variables

● Categorical (also known as qualitative): names categories,
indicates whether or not a case falls into a certain category.
● Quantitative: measures numerical values with/without units. Tells
us about the quantity of something.
○ Some quantitative variables have units (purchase amount),
and some are unitless (click count).
● Some variables, such as age, can be treated as either categorical
or quantitative, depending on how they are recorded and used.

Types of Variables

● Counting is the core of statistics. We usually count things to get
insight into the world.
● Example: Counting the cases in each category, or how many of
something were observed
Types of Variables

● Identifier: identifies cases in databases (datasets). An identifier is
unique
○ Is a type of categorical variable
○ Does not have units
○ Helps to combine different datasets and makes relational
databases possible
○ Is not analyzed

Types of Variables

● Nominal: Categorical variables used only to name categories
(e.g., school attended)
● Ordinal: A variable (categorical or quantitative) whose values can
be ordered (e.g., satisfaction level, purchase amount)
● Time Series: data that are gathered at regular intervals over time
(e.g., the daily temperature in September)
● Cross-sectional data: when data for several variables are
measured at the same point in time, the data is called
cross-sectional. For example, determining sales revenue, number
of customers, and expenses for the last month of business.

Data Collection

● Primary data: collected by the researcher/analyst.
● Secondary data: collected by another party (such as Statistics
Canada) and then obtained by the researcher/analyst.
● When and Where the data were collected is important

Chapter 3: Survey and Sampling


Key Words:

● Population vs. Sample


● Sampling
● Sample Statistics (Ex. Sample mean)
● Population Parameter (Ex. Population mean)
● Sample data collected is either:
○ Cross-sectional data
○ Time series data

Key Words:

● Structured Data: Well-defined length and format
● Unstructured Data: No pre-defined format (doctors’ notes, reports,
video data)

Sampling

● Why do we take samples?


○ Insight into behaviours of a population
○ Population is big
○ Observing the whole population is impossible or costly or too
time consuming.
○ Only a sample of the population is observable
● Statistics helps us draw insight about the population by observing
and analyzing the sample

Features of Sampling

Feature 1: Examine a part of the whole

● We may use sample surveys: questions designed to give us
answers about some characteristics of the sample.
● A sample may be biased. A biased sample over- or under-emphasizes
certain characteristics of the population
● A biased sample gives us a biased understanding of the
characteristics of the population.
● Individuals (cases) for samples must be selected randomly!
Feature 2: Randomize

● Randomizing protects us by giving us a representative sample
even for effects we were unaware of.
● Randomization seems fair because nobody can guess the
outcome before it happens and because usually some underlying
set of outcomes will be equally likely.
● Sampling Variability (sampling error): Sample to sample
differences
○ Example: average height of McMaster undergraduate
Business students by drawing samples from different
sections of 1DA3!

Feature 3: Sample size is important!

● The size of the sample determines what we can conclude from the
data regardless of the size of the population

How big a sample do we need?

● It depends on what we are estimating
● A sample that is too small may not be representative of the
population
● Naturally, we prefer a sample that is a good representative of the
population and is as small as possible!

Population and Parameters

● Census: a sample that includes observations from the entire
population.
● A census is usually not the best idea, why?
○ Difficult or impractical or cumbersome to perform one
○ Population characteristics may change. We can’t perform
censuses often.
● Models use mathematics to represent reality
● Parameters: Key numbers in models that represent reality
● Population parameter: A parameter used in a model for a
population
Population and Parameters

● Since we are taking samples, we need to estimate population
parameters through the sample data
● Sample Statistic: Anything calculated from a sample
● Representative: A sample statistic that estimates the
corresponding population parameter accurately
● The goal is to use statistics computed from the sample to estimate
the population parameters

Simple Random Sample

● A sample drawn so that every possible sample of the size we plan
to draw has an equal chance of being selected is called a simple
random sample, usually abbreviated SRS.
● To select a sample at random, a sampling frame is first defined.
● Sampling frame: a list of individuals (or cases, record, etc.) from
which the sample will be drawn.
● Once we have a sampling frame, we can assign a sequential
number to each individual in the sampling frame and draw random
numbers to identify those to be sampled.
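
The numbering-and-drawing procedure above can be sketched in a few lines of Python. This is only an illustration: the sampling frame of 200 student IDs and the function name `simple_random_sample` are made up for the example.

```python
import random

def simple_random_sample(sampling_frame, sample_size, seed=None):
    """Draw a simple random sample (without replacement) from a sampling frame."""
    rng = random.Random(seed)
    # Assign a sequential number to each individual in the sampling frame.
    numbered = dict(enumerate(sampling_frame, start=1))
    # Draw distinct random numbers; every subset of this size is equally likely.
    chosen_numbers = rng.sample(sorted(numbered), sample_size)
    return [numbered[k] for k in chosen_numbers]

# Hypothetical sampling frame: a list of 200 student IDs.
frame = [f"student_{i:03d}" for i in range(1, 201)]
sample = simple_random_sample(frame, sample_size=10, seed=42)
print(sample)
```

Fixing the `seed` only makes the draw reproducible; leaving it as `None` gives a different sample each run, which is the sample-to-sample variability described earlier.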

Other Random Sample Designs

Stratified Sampling:

● Slice the population into homogeneous groups, called strata. Use
simple random sampling within each stratum to select members.
Combine the results at the end.
● Reduced sample variability is one of the most important benefits of
stratified sampling.

Cluster Sampling:

● Split the population into parts or clusters that each represent the
population. Select one or a few clusters at random and perform a
census within each selected cluster.
● If each cluster fairly represents the population, cluster sampling
will generate an unbiased sample.
Systematic Sampling:

● A systematic approach is used to select individuals. Start from a
randomized individual and follow the approach to create the
sample.
● Example: Pick every 10th individual from a list of employees to
create a sample of 30 individuals.
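
A minimal sketch of the example above, assuming a hypothetical list of 300 employees (the list size is an assumption; it is what makes "every 10th individual" yield 30 people):

```python
import random

def systematic_sample(frame, step, seed=None):
    """Pick a random starting individual, then take every `step`-th one."""
    rng = random.Random(seed)
    start = rng.randrange(step)  # random start among the first `step` individuals
    return frame[start::step]

# Hypothetical list of 300 employees.
employees = [f"employee_{i:03d}" for i in range(1, 301)]
sample = systematic_sample(employees, step=10, seed=1)
print(len(sample))  # 30
```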

Multistage Sampling:

● Sampling schemes that combine several methods are called
multistage samples.

The Valid Survey

● A survey that can yield the information you need about the
population in which you are interested is a valid survey
● To help ensure a valid survey, you need to ask four questions:
○ What do I want to know?
○ Who are the right respondents?
○ What are the right questions?
○ What will be done with the results?

The Valid Survey

● Nonresponse bias: When individuals don’t respond to questions.
● Voluntary response bias: In volunteer surveys, individuals with
the strongest feelings on either side of an issue are more likely to
respond; those who don’t care may not bother.
● Measurement errors: When a question does not take into
account all possible answers
● It is important not to confuse inaccuracy with bias! Both create
errors but the errors are different.
Chapter 4: Displaying and Describing Categorical Data

Displaying Data

● Data visualization is an important part of any statistical or data
analysis.
● It summarizes huge amounts of data into easy to follow, easy to
digest graphs and plots. (2.5 billion GB of data is generated every
day)
● Visualization plays an important role in telling the story of the data.

Charts

● Bar Charts: A bar chart displays the distribution of one
categorical variable, showing the counts for each category next to
each other for easy comparison.

● Pie Charts: Pie charts show the whole group as a circle (“pie”)
sliced into pieces. The size of each piece is proportional to the
fraction of the whole in each category. The pie chart for Loblaw
data is displayed below.

Frequency Tables

● A frequency table organizes data by recording counts and
category names as in the table below (Frequency table of the
number of Loblaw stores in eastern Canada).
● A relative frequency table displays the proportions or
percentages that lie in each category rather than the counts
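
As a rough illustration, a frequency table and a relative frequency table can be built by counting category labels. The province codes below are invented for the example and are not the Loblaw data.

```python
from collections import Counter

def frequency_table(values):
    """Map each category to (count, proportion of all cases)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (count, count / total) for category, count in counts.items()}

# Hypothetical categorical data: province of each store in a small sample.
provinces = ["ON", "QC", "ON", "NS", "QC", "ON", "NB", "ON"]
table = frequency_table(provinces)
for category, (count, proportion) in sorted(table.items()):
    print(f"{category}: {count} ({proportion:.1%})")
```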

Frequency Distribution

● Groups data into categories and records the number of (counts the
number of) observations in each category
Contingency Tables

● A Contingency Table shows how the values of one variable are
contingent on the value of another variable (2 variables)
○ Example: Data was collected on the use of social networks
in different countries. The table shows how social network
use varies by country

● The marginal distribution of a variable in a contingency table is
the total count that occurs without reference to the value of the
other variable(s).

● Each cell of a contingency table gives the count for a combination
of values of both variables. (e.g. Country, and social network use).
● We may display the data as percentages: as a row percent,
column percent, or total percent, which show percentages with
respect to the row count, column count, or total count, respectively.
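
The three kinds of percentages differ only in the denominator. A small sketch with made-up counts (the countries and numbers are hypothetical, not the survey data):

```python
# Hypothetical counts: rows are countries, columns are social network use.
table = {
    "Country A": {"Uses": 30, "Does not use": 20},
    "Country B": {"Uses": 10, "Does not use": 40},
}

# Marginal totals for rows and columns, plus the grand total.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}
col_totals = {}
for cells in table.values():
    for col, count in cells.items():
        col_totals[col] = col_totals.get(col, 0) + count
grand_total = sum(row_totals.values())

def percents(row, col):
    """Row, column, and total percent for one cell of the table."""
    count = table[row][col]
    return {
        "row": 100 * count / row_totals[row],
        "column": 100 * count / col_totals[col],
        "total": 100 * count / grand_total,
    }

print(percents("Country A", "Uses"))
```

For the cell (Country A, Uses) the same count of 30 is 60% of its row, 75% of its column, and 30% of the whole table.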

● A segmented bar chart divides a bar proportionally into segments
corresponding to the percentage in each group.
○ We could display the Super Bowl viewer data as a
segmented bar, which treats each bar as the “whole” and
divides it proportionally into segments corresponding to the
percentage in each group.

Conditional Distribution
● Conditional Distributions: We may want to restrict variables in a
distribution to show the distribution for just those cases that satisfy
a specified condition. This is called a conditional distribution.
(e.g., social networking use given the country of focus is Egypt)

Simpson’s Paradox

● Simpson’s Paradox results from inappropriately combining
percentages of different groups.
● The paradox appears when a certain trend appears in several
different groups of data, but disappears or reverses when these
groups are combined.

Simpson’s Paradox

● Treatment for kidney stones (small vs. large stones):
○ Treatment A is more comprehensive and involves open
surgical procedures
○ Treatment B is less comprehensive and involves small
punctures
● Of the 350 patients in each treatment group (small and large
stones combined), the number of successes is:
○ Treat. A: 273, resulting in a 78% success rate (273/350=78%)
○ Treat. B: 289, resulting in an 83% success rate (289/350=83%)
● Which treatment is suggested for a patient with a kidney stone
of unknown size?

Stone Size      Treatment A               Treatment B
Small Stones    Group 1: 93% (81/87)      Group 2: 87% (234/270)
Large Stones    Group 3: 73% (192/263)    Group 4: 69% (55/80)
Both            78% (273/350)             83% (289/350)
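
The reversal can be checked directly from the counts in the table. The sketch below only recomputes the success rates; the dictionary layout and function names are choices made for this illustration.

```python
# (successes, patients) for each treatment and stone size, from the table above.
results = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def success_rate(successes, patients):
    return successes / patients

def combined_rate(treatment):
    """Success rate after pooling small and large stones."""
    successes = sum(s for s, _ in results[treatment].values())
    patients = sum(p for _, p in results[treatment].values())
    return success_rate(successes, patients)

for size in ("small", "large"):
    rate_a = success_rate(*results["A"][size])
    rate_b = success_rate(*results["B"][size])
    print(f"{size} stones: A = {rate_a:.0%}, B = {rate_b:.0%}")

print(f"combined: A = {combined_rate('A'):.0%}, B = {combined_rate('B'):.0%}")
```

Treatment A has the higher success rate within each stone size, yet Treatment B looks better once the groups are pooled, because most of A's patients had large stones (263 of 350) while most of B's had small ones (270 of 350).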

Reasons For The Simpson’s Paradox

● Size of the groups: when the effect of the difference in groups is
ignored, the groups with a higher sample size have a greater
influence on the combined results, proportionate to their size!
● The lurking variable (confounding variable) that influences the
results when two groups with significantly different behaviors are
combined!

Simpson’s Paradox

What does it mean for data analysis:

● Analysis should be comprehensive and nuanced
● Content knowledge is important: investigate further if data is
showing you results that are counterintuitive.
● Understand the limitations of data: if data is not detailed enough, it
may give misleading results.
● Shows the trade-off between bias and variance (accuracy):
○ Too much aggregation → more accuracy, but may result in
more bias
○ Too much disaggregation → less bias, but may result in
smaller groups of data, which means less accuracy.

Simpson’s Paradox Can Be Avoided By:

● Reviewing frequency tables
● Reviewing correlation among variables
● Investigating any lurking (confounding) variables that may result in
significant differences between groups
● A comprehensive and deep level of content knowledge (domain
knowledge)

Chapter 5: Displaying and Describing Quantitative Data

Displaying Quantitative Data

Frequency Table
Histogram

Example of Histogram

Visualizing Quantitative Data

● Stem-and-Leaf Diagrams

Displaying Data Distributions

● Stem-and-Leaf Display

● Before making a histogram or a stem-and-leaf display, the
Quantitative Data Condition must be satisfied: the data values
are of a quantitative variable whose units are known.
● Histograms: A histogram plots the bin counts as the height of
bars and it describes the overall "shape" of data.

Displaying Data Distributions

A stem-and-leaf display for thirty-six months of stock price
changes data is shown below together with a histogram

How do histograms work?

1) Decide how wide to make the bins. As a rule of thumb, if there
are n data points, use about $\log_2 n$ bins
2) Determine the count for each bin
3) Decide where to place values that land on the endpoint of a bin.
For example, does a value of $5 go into the $0 to $5 bin or the $5
to $10 bin? The standard rule is to place such values in the higher
bin.
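
The three steps can be sketched as follows. The purchase amounts are invented for the example, and keeping a value equal to the overall upper limit in the last bin is an extra assumption that the notes do not address.

```python
import math

def bin_counts(values, low, width, n_bins):
    """Count values per bin; a value on a bin endpoint goes in the higher bin.

    Assumes every value is at least `low`.
    """
    counts = [0] * n_bins
    for v in values:
        index = math.floor((v - low) / width)
        # Assumption: a value equal to the overall upper limit stays in the last bin.
        index = min(index, n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical purchase amounts in dollars.
purchases = [1.50, 4.99, 5.00, 7.25, 9.99, 10.00, 12.40, 14.00]

suggested_bins = round(math.log2(len(purchases)))  # rule of thumb: about log2(n)
print(suggested_bins)                                    # 3
print(bin_counts(purchases, low=0, width=5, n_bins=3))  # [2, 3, 3]
```

The $5.00 purchase is counted in the $5 to $10 bin and the $10.00 purchase in the $10 to $15 bin, following the endpoint rule in step 3.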

Stem and Leaf Display

● A stem and leaf display is like a histogram, but it also gives the
individual values
● These are easy to make by hand for data sets that aren’t too large,
so they’re a great way to look at a small batch of values quickly

Describing Data

● When describing a distribution, attention should be paid to


○ Its shape
○ Its center
○ Its spread
Shape

● We describe the shape of a distribution in terms of its modes, its
symmetry, and whether it has any gaps or outlying values.
○ Negatively Skewed: Skewed to the left
○ Symmetric Distribution: Bars are symmetric
○ Positively Skewed: Skewed to the right

Shape

● Modes: Peaks or humps seen in a histogram are called the
modes of a distribution
● A distribution whose histogram has one main peak is called
unimodal, two peaks – bimodal (see figure), three or more peaks
– multimodal.
○ Unimodal: One main Peak
○ Bimodal: Two Peaks
○ Multimodal: Three or more peaks
Shape

● Uniform Distribution: A distribution whose histogram doesn’t
appear to have any clear mode and in which all the bars are
approximately the same height

● Symmetric Distribution: A distribution is symmetric if the halves
on either side of the center look, at least approximately, like mirror
images.

● Tails: The thinner ends of a distribution


○ If one tail stretches out farther than the other, the distribution
is said to be skewed to the side of the longer tail. The
distribution below is skewed to the right.
Outliers:
● Values that stand away from the body of the distribution
○ Always be careful to point out the outliers in a distribution
● Outliers can affect every statistical method we will study
(finding dist.)
● They can be the most informative part of your data (model adj.)
● They may be an error in the data (find the error and correct it)
● They should be discussed in any conclusions drawn about the data
● Characterizing the shape of a distribution is often a judgment
call

Centre

● Mean: a natural summary and the centre point of a unimodal
and symmetric distribution
● To find the mean of the variable y, add all the values of the variable
and divide that sum by the number of data values, n.

If we have the data 𝑦1,𝑦2,𝑦3, ...,𝑦10, then the mean is

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_{10}}{10} = \frac{1}{10}\sum_{i=1}^{10} y_i$$


Centre

● The mean of the sample is referred to as ȳ (pronounced “y-bar”).

Centre

● If a distribution is skewed, contains gaps, or contains outliers,
then it might be better to use the median, the value that splits the
histogram into two equal areas
● The median is found by counting in from the ends of the data until
we reach the middle value
● The median is said to be resistant because it isn’t affected by
unusual observations (like outliers) or by the shape of the
distribution.
● If a distribution is roughly symmetric, we’d expect the mean and
median to be close.

Centre

Calculating the median:

1. Sort the data in ascending order.
2. If the number of observations is odd, the median is the middle
value
3. If the number of observations is even, the median is the average
of the two middle values.
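
The three steps translate almost line for line into code. This is a sketch for illustration; Python's built-in `statistics.median` gives the same results.

```python
def median(values):
    """Median by sorting and taking the middle value(s)."""
    ordered = sorted(values)               # step 1: ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                         # step 2: odd count, middle value
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2  # step 3: average of two

print(median([6, 4, 1, 9, 7]))    # 6
print(median([6, 4, 1, 9]))       # 5.0
print(median([6, 4, 1, 900, 7]))  # 6, unchanged by the extreme value
```

The last line illustrates resistance: replacing 9 with 900 changes the mean a great deal but leaves the median at 6.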

Centre

● Relationship between mean, median and mode in symmetric or
skewed data (typical ordering, from left to right):
○ Negatively skewed: Mean < Median < Mode
○ Normal (no skew): mean, median and mode lie on top of each other
○ Positively skewed: Mode < Median < Mean
Spread

● We need to determine how spread out the data are because the
more the data vary, the less a measure of centre can tell us.
● One simple measure of spread is the range, defined as the
difference between the extremes (max and min)
● Range = Max - Min

Spread

● Range is a weak measure because it only considers the two
endpoints of the data.
● The range is a single value and it is not resistant to unusual
observations and outliers. Why?
○ If the data contain outliers, the min or the max will be an
outlier
● Example: What is the range of the data 6, 4, 1, 9, 7?
○ Range = 9 - 1 = 8

Spread

● The quartiles are the values that frame the middle 50% of the data.
One-quarter of the data lies below the lower quartile, Q1, and
one-quarter lies above the upper quartile, Q3.
● The interquartile range (IQR) is defined to be the difference
between the two quartiles: IQR = Q3 - Q1
Spread
● Variance
● What is variance?
● Variance: Average of squared deviations between data points and
the mean
○ Variance Unit of Measurement: (Unit of data)^2
● For sample values 𝑦1,𝑦2,...,𝑦𝑛 the sample variance (𝑠^2) is
calculated as

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$

Spread
For population values 𝑦1,𝑦2,...,𝑦𝑁 the population variance (𝜎^2) [𝜎 is
the Greek letter sigma] is calculated as

$$\sigma^2 = \frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}$$

where 𝜇 is the population mean.

Spread
● Standard Deviation: Standard deviation represents, on average,
how far data points are from the mean
○ Standard Deviation Unit of Measurement: Same unit as
data
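● What are standard deviations for the sample and
population as calculated in the previous slide?
○ The square roots of the corresponding variances:
$s = \sqrt{s^2}$ and $\sigma = \sqrt{\sigma^2}$
Spread
● Coefficient of Variation (CV)
○ What is the CV for a dataset: Measure of relative spread
● What is the CV for a sample and a population?
○ The standard deviation divided by the mean: $s / \bar{y}$ for a
sample and $\sigma / \mu$ for a population (often reported as a
percentage)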
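
A short sketch of these measures of spread, using Python's standard `statistics` module and the small dataset from the range example (6, 4, 1, 9, 7). Treating the same five values as both a sample and a population is done only to show how the two divisors differ.

```python
import statistics

data = [6, 4, 1, 9, 7]

mean = statistics.mean(data)             # 5.4
sample_var = statistics.variance(data)   # divides by n - 1
pop_var = statistics.pvariance(data)     # divides by N
sample_sd = statistics.stdev(data)       # square root of the sample variance
pop_sd = statistics.pstdev(data)         # square root of the population variance
cv_sample = sample_sd / mean             # relative spread, unitless

print(f"mean = {mean}")
print(f"sample variance = {sample_var:.2f}, population variance = {pop_var:.2f}")
print(f"sample SD = {sample_sd:.2f}, population SD = {pop_sd:.2f}")
print(f"sample CV = {cv_sample:.1%}")
```

The squared deviations sum to 37.2, so the sample variance is 37.2 / 4 = 9.3 and the population variance is 37.2 / 5 = 7.44.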

Reporting the Shape, Centre, and Spread

● If the shape is skewed, the median and IQR should be reported.


● If the shape is unimodal and symmetric, the mean and standard
deviation and possibly the median and IQR should be reported.
● If there are unusual observations point them out and report the
mean and standard deviation with and without the values.
● Always pair the median with the IQR and the mean with the
standard deviation.

5-number Summary and Boxplots

● The five-number summary of a distribution reports its median,
quartiles, and extremes (maximum and minimum).
● It provides a good overall summary of the distribution of data.
○ Max
○ Q3
○ Median
○ Q1
○ Min
Percentiles

● A percentile shows where a given percentage of the data lies.
● Suppose the numbers of passengers on 12 flights from
Hamilton to Ottawa are 24, 18, 31, 27, 15, 16, 26, 15, 24, 26, 25,
30.
● Step 1. We first put the data in ascending order, getting 15, 15, 16,
18, 24, 24, 25, 26, 26, 27, 30, 31.
● Step 2 (option 1): Suppose we want to calculate the 80th
percentile of this data. Since there are 12 data values, we first
calculate 80% of 12, which is 9.6. Since 9.6 is not an integer, we
round it up to 10 and the 80th percentile is the 10th data value, or
27.
● Step 2 (option 2): Suppose we want to calculate the 50th
percentile of this data. Since there are 12 data values, we first
calculate 50% of 12, which is 6. Since 6 is an integer, we do not
round it up. Instead we take the average of the sixth and seventh
data values (24+25)/2=24.5.
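
The rule in Steps 1 and 2 can be written as a small function. This sketch follows the convention described above (other textbooks and software use slightly different percentile conventions) and assumes `p` is a whole number strictly between 0 and 100.

```python
def percentile(values, p):
    """p-th percentile using the rule in these notes (p an integer, 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)                     # Step 1: ascending order
    # p% of n, split into whole part k and remainder (integer arithmetic).
    k, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        # p% of n is an integer k: average the k-th and (k+1)-th values.
        return (ordered[k - 1] + ordered[k]) / 2
    # Otherwise round up: take the (k+1)-th value, which has index k.
    return ordered[k]

passengers = [24, 18, 31, 27, 15, 16, 26, 15, 24, 26, 25, 30]
print(percentile(passengers, 80))  # 27
print(percentile(passengers, 50))  # 24.5
```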

Percentile

● What percentile is the median of a dataset?


○ The 50th percentile
● What are some examples of percentiles used to describe a data
point?
○ Credit scores
○ Standardized Test
● What is the Pth percentile of a dataset?
5-number Summary and Boxplots

Once we have a five-number summary of a variable, we can display that
information in a boxplot. To make a boxplot (with IQR = Q3 - Q1):

1) Draw a single vertical axis spanning the extent of the data
2) Draw short horizontal lines at the lower and upper quartiles
and at the median. Then connect them with vertical lines to
form a box
3) Erect (but don’t show in the final plot) “fences” around the main
part of the data, placing the upper fence 1.5 IQRs above the upper
quartile and the lower fence 1.5 IQRs below the lower quartile.
● IQR = Q3 − Q1 = 0.297
● Upper fence: Q3 + 1.5 IQR = 1.972 + 0.4455 = 2.4175 ≈ 2.42
● Lower fence: Q1 − 1.5 IQR
4) Draw lines (whiskers) from each end of the box up and
down to the most extreme data values found within the
fences.
5) Add any outliers by displaying data values that lie beyond
the fences with special symbols. Erase the fences.
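
A minimal sketch of the fence calculation in step 3 and the outlier check in step 5. Q3 = 1.972 and IQR = 0.297 come from the example above; Q1 is derived as Q3 - IQR, and the five data values are hypothetical.

```python
def fences(q1, q3):
    """Lower and upper fences: 1.5 IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Values that lie beyond the fences."""
    lower, upper = fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

q3 = 1.972
q1 = q3 - 0.297          # derived from IQR = Q3 - Q1
lower, upper = fences(q1, q3)
print(round(lower, 4), round(upper, 4))

# Hypothetical data values, only to show how outliers are flagged.
print(find_outliers([1.1, 1.5, 1.8, 2.0, 2.6], q1, q3))
```

With these quartiles the fences are roughly 1.23 and 2.42, so 1.1 and 2.6 would be drawn as individual outlier points.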
5-Number Summary and Boxplots

● The centre of a boxplot shows the middle half of the data between
the quartiles – the height of the box equals the IQR.
● If the median is roughly centred between the quartiles, then the
middle half of the data is roughly symmetric. If it is not centred, the
distribution is skewed.
● The whiskers show skewness as well if they are not roughly the
same length.
● The outliers are displayed individually to keep them out of the way
in judging skewness and to display them for special attention

Boxplot

Question: Draw a boxplot that has two outliers on the right hand
side. Show Q1, Q2 and Q3. Show the IQR. Show the max and min
values.
Comparing Groups

● In attempting to understand the data, we may want to look for
patterns, differences, and trends over different time periods. We
may want to split the data in half and display histograms for each
half. Histograms for the six-month split NYSE data are shown below

Comparing Groups

● Histograms work well for comparing two groups, but boxplots tend
to offer better results when side-by-side comparison of several
groups is sought.
● Below the NYSE data is displayed in monthly boxplots.
Standardizing

Example: Compare two companies (from the “top” 100 companies) with
respect to the variables New Jobs (jobs created) and Average Pay.

▪ Starbucks created over 2000 jobs and has an average salary of
$44,790 while Wrigley created 16 jobs and has an average salary
of $56,351.
▪ For all 100 companies, the mean number of new jobs created
was 305.9 and the average salary was $73,229.42.
▪ Which company did better?

Example (Continued): To compare the two companies based on these
variables, we find the mean and standard deviation for all 100
companies.

● To quantify how well each of the companies did and to combine
the two scores, we’ll determine how many standard deviations they
each are from the variable means.

Standardizing
● To find how many standard deviations a value is from the
mean we calculate a standardized value or z-score.
● z-Score Formula:

$$z = \frac{y - \bar{y}}{s}$$

● For example, a z-score of 2.0 indicates that a data value is two
standard deviations above the mean and a z-score of −2 indicates
that the data value is two standard deviations below the mean.
● A rule of thumb for identifying outliers is z > 3 or z < −3

Standardizing

Example (continued): Computing the z-scores for these variables for
Starbucks and Wrigley we obtain the results summarized below.

Analysis of Relative Location

● In the following dataset find the z-score of all sample values (ȳ
= 6 and s = 3.16). This procedure is called standardizing the data.
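
The dataset for this exercise is not reproduced in these notes. The sketch below uses a hypothetical dataset (2, 4, 6, 8, 10), chosen only because it has the same mean (6) and sample standard deviation (about 3.16), to show what standardizing does.

```python
import statistics

def z_scores(values):
    """Standardize: subtract the sample mean, divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

data = [2, 4, 6, 8, 10]  # hypothetical values with mean 6 and s of about 3.16
standardized = z_scores(data)
for value, z in zip(data, standardized):
    print(f"y = {value:2d}  z = {z:+.2f}")
```

After standardizing, the values have mean 0 and standard deviation 1, and each z-score reads directly as "standard deviations above or below the mean".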
Standardizing Data

Analysis of Relative Location

● If we know the data come from a symmetric, bell-shaped
distribution, it is usually approximated by the normal distribution.
● The normal distribution is often used as an approximation for many
real-world applications.
● The empirical rule: Given the sample mean x̄, the sample standard
deviation s, and a relatively symmetric and bell-shaped distribution,
○ Approximately 68% of all observations fall in the interval x̄ ± s
○ Approximately 95% of all observations fall in the interval x̄ ± 2s
○ Approximately 99.7% (almost 100%) of all observations fall in
the interval x̄ ± 3s

Outlier Identification:

● Upper fence: Q3 + 1.5 * IQR
● Lower fence: Q1 - 1.5 * IQR

Outliers (Relative Location)

Transforming Skewed Data


Example: Below we display the skewed distribution of total
compensation for the CEOs of the 500 largest companies.

Transforming Skewed Data

● When a distribution is skewed, it can be hard to summarize the
data simply with a centre and spread, and hard to decide whether
the most extreme values are outliers or just part of the
stretched-out tail.
● One way to make a skewed distribution more symmetric is to
re-express, or transform, the data by applying a simple function
to all the data values.
● If the distribution is skewed to the right, we often transform using
logarithms or square roots; if it is skewed to the left, we may
square the data values.

Example: Below we display the transformed distribution of total
compensation for the CEOs of the 500 largest companies. A simple log
function is used to transform data values.

● This histogram is much more symmetric, and we see that a typical
log (base 10) compensation is between 6.0 and 7.0, or $1 million
and $10 million in the original terms.
Chapter 6: Scatterplots, Association and Correlation

Scatterplots

● A scatterplot, which plots one quantitative variable against another,
can be an effective display for data
● Scatterplots are the ideal way to picture associations between two
quantitative variables.

Scatterplots

● Scatterplots indicate whether two variables are related and, if they
are related, the nature of their relationship.
● For example, a scatterplot could be used to determine whether:
○ Gas prices vary with average monthly temperature
○ Cholesterol varies with dietary intake
○ Job satisfaction varies with number of employees working in
the company
○ The quality of user experience varies with average annual
profit made by the company

Scatterplots

● Scatterplots show the relationship between two variables

Scatterplots

● The direction of the association is important


● A pattern that runs from the upper left to the lower right is said to
be negative
● A pattern running from the lower left to the upper right is called
positive
● Look for direction: What’s the sign - positive, negative, or
neither?

Scatterplots

● The second thing to look for in a scatterplot is its form
● If there is a straight line relationship, it will appear as a cloud or
swarm of points stretched out in a generally consistent, straight
form. This is called linear form.
● Sometimes the relationship curves gently, while still increasing or
decreasing steadily; sometimes it curves sharply up then down
● Look for form: Is it straight, curved, something exotic, or no
pattern?

Scatterplots

● The third feature to look for in a scatterplot is the strength of the
relationship
● Do the points appear tightly clustered in a single stream or do the
points seem to be so variable and spread out that we can barely
discern any trend or pattern?
● Look for strength: How much scatter?

Scatterplots

● Finally, always look for the unexpected


● An outlier is an unusual observation, standing away from the
overall pattern of the scatterplot
● Look for unusual features: Are there unusual observations or
subgroups?

Measures of Association

● Using scatterplots we can visually talk about the strength and/or
direction of a linear relationship between two variables

Understanding Correlation
● Correlation Coefficient ( r ):

What is the correlation coefficient?

A measure that evaluates the direction and the strength of a linear
association between x and y

● In order to measure the correlation coefficient, both variables
need to be quantitative

What is the unit of correlation coefficient?

r has no unit of measure

What are possible values of the correlation coefficient and what do
they indicate?

[-1, +1]

● -1: Perfect negative linear association
● +1: Perfect positive linear association
● 0: no linear association

Understanding Correlation

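The sample correlation coefficient (𝑟) is computed as

$$r = \frac{\sum z_x z_y}{n - 1}$$

where $z_x$ and $z_y$ are the z-scores of x and y, and n is the
sample size.

Alternative Formulas for Correlation:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$$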

Covariance

● An alternative to the correlation coefficient is the covariance
between two variables. The covariance is calculated as

$$\mathrm{Cov}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$$

● Unit: (Unit of x)(Unit of y)
● Unlike the correlation coefficient, covariance depends on the “unit
of measurement” for both variables.
○ No upper or lower bound on its value
○ Its sign measures the direction of the association
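
The dependence on units can be seen by computing the covariance of the same data in two different units. The temperatures and sales figures below are invented for the illustration.

```python
def covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: daily temperature (Celsius) and number of AC units sold.
temperature_c = [18, 22, 25, 30]
sales = [20, 26, 33, 41]

# The same temperatures expressed in Fahrenheit.
temperature_f = [c * 9 / 5 + 32 for c in temperature_c]

cov_c = covariance(temperature_c, sales)
cov_f = covariance(temperature_f, sales)
print(f"covariance with Celsius:    {cov_c:.2f}")
print(f"covariance with Fahrenheit: {cov_f:.2f}")
```

Both values are positive, so the direction is the same, but the Fahrenheit covariance is 1.8 times larger even though the underlying relationship has not changed. That is why covariance alone does not measure strength.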

Measures of Association

Question: Compute the sample correlation coefficient for the
following dataset by completing the table (x̄ = 6, ȳ = 11, s_x
= 3.39, s_y = 4.36)


Association

● Scatterplots are the ideal way to picture associations between
two quantitative variables.
● Association: Is a change in the value of one variable associated
with a change in the value of the other variable?
● We may wonder if there is any association between the following
variables and, if there is one, whether it is positive or negative
and how strong it is.
○ Temperature and sales of AC devices
○ Population of a city and the number of parks in it
○ Number of employees in a company and annual profit
○ Price of item and its weekly sales

Assigning Roles to Variables In Scatterplots

● To make a scatterplot of two quantitative variables, assign one to
the y-axis and the other to the x-axis
● Since we are investigating two variables, we call this branch of
Statistics bivariate analysis.
● Each point is placed on a scatterplot at a position that corresponds
to values of the two variables
● The point’s horizontal location is specified by its x-value, and its
vertical location is specified by its y-value
● Together, these values are known as coordinates and written
(x, y)

Assigning Roles to Variables in Scatterplots

● One variable plays the role of the explanatory or predictor
variable, while the other takes on the role of the response
variable.
● We place the explanatory variable on the x-axis and the response
variable on the y-axis.
○ Horizontal Axis: Explanatory, predictor, independent
○ Vertical Axis: Response, dependent
● The x- and y-variables are sometimes referred to as the
independent and dependent variables, respectively.

Example
Understanding Correlation

● The ratio of the sum of the products $z_x z_y$ over every point in
the scatterplot to n – 1 is called the correlation coefficient:
$r = \sum z_x z_y / (n - 1)$
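
A sketch of this definition in code: standardize each variable, multiply the paired z-scores, and divide the sum by n - 1. The data pairs are hypothetical and the function name `correlation` is chosen for the example.

```python
import statistics

def correlation(xs, ys):
    """Correlation as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

# Hypothetical data pairs.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

r = correlation(x, y)
print(f"r = {r:.3f}")

# Swapping x and y, or changing centre and scale, leaves r unchanged.
print(f"{correlation(y, x):.3f}")
print(f"{correlation([10 * v + 3 for v in x], y):.3f}")
```

For these pairs r is about 0.85, a fairly strong positive linear association, and the last two lines print the same value because correlation treats x and y symmetrically and has no units.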

Alternative Formulas for Correlation:

Understanding Correlation (An Example)

Finding the Correlation Coefficient

● Suppose the data pairs are:


Understanding Correlation

Correlation measures the strength of the linear association between two
quantitative variables.

Before you use correlation, you must check three conditions:

● Quantitative Variables Condition: Correlation applies only to
quantitative variables
● Linearity Condition: Correlation measures the strength only of
the linear association
● Outlier Condition: Unusual observations can distort the
correlation

Understanding Correlation

Correlation Properties:

● The sign of a correlation coefficient gives the direction of the
association
● Correlation is always between −1 and +1
● Correlation treats x and y symmetrically
● Correlation has no units
● Correlation is not affected by changes in the center or scale of
either variable.
● Correlation measures the strength of the linear association
between the two variables.
● Correlation is sensitive to unusual observations

Correlation Coefficient

Understanding Correlation

Correlation Table

Correlation tables are compact and give a lot of summary
information at a glance. They show the correlations between
pairs of variables in a data set arranged in a table.

▪ A correlation table for some variables collected on a sample of
Amazon books.
Straightening Scatterplots

● Sometimes scatterplots do not show a linear pattern. There are
ways to straighten the points.
● When the plot of the original values is curved, the plot of the
logarithm of the values often looks straighter, so the correlation
becomes a more appropriate measure of association.
● Simple transformations such as the logarithm, square root, and
reciprocal can sometimes straighten a scatterplot’s form.
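
As a rough illustration of straightening, the sketch below builds a hypothetical exponentially growing variable and compares the correlation before and after taking logarithms. The data and the helper name `pearson_r` are assumptions made for the example.

```python
import math
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical curved relationship: y grows exponentially with x
xs = list(range(1, 11))
ys = [math.exp(0.5 * x) for x in xs]

r_raw = pearson_r(xs, ys)                         # curved plot, weaker r
r_log = pearson_r(xs, [math.log(y) for y in ys])  # straightened plot
print(round(r_raw, 3), round(r_log, 3))
```

The relationship is just as strong in both cases. The raw correlation is lower only because correlation measures linear association, and the raw plot is curved.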

Correlation ≠ Causation

● Two variables may be correlated, but that does not mean
there is a causal effect between them.
● Some examples include:
○ There is a positive correlation between the sales of ice
cream and the number of deaths by drowning.
○ Number of pirates in the world vs. global warming
● Two variables with a strong correlation may both be
connected to a third “lurking” variable that is not visible.
● The third variable may be simultaneously affecting both variables.
● Therefore, correlation does not mean causation
Possible Correlation Reasons

A few Notes

● Don’t correlate categorical variables
● Make sure the association is linear
● Beware of outliers and multiple clusters
● The correlation between just two data points is
meaningless.
● Don’t confuse correlation with causation

Selection

● Only look at a fraction of your data and ignore the rest


Chapter 7: Introduction To Linear Regression

The Linear Model

Example: The scatterplot below shows monthly advertising expenditures
against sales over 5 years.

The Linear Model

● Looking at the example, the analyst might be faced with
questions like:
○ What is the expected sales volume if the monthly advertising
expenditure is $0.3 million (where there are no data points)?
○ What is the expected sales volume if the monthly advertising
expenditure is $0.9 million (where there is more than one point)?
○ What level of monthly advertising expenditure can create a
sales volume of $35 million?
○ If the current sales volume is $25 million, how much should the
monthly advertising expenditure be increased to double the
sales?
○ What is the expected relationship between the monthly
advertising expenditure and the sales volume?

The Linear Model

Example (continued): We see that the points don’t all line up, but that a
straight line can summarize the general pattern. We call this line a linear
model. This line can be used to predict sales for a given level of
advertising expenses.
The Linear Regression Model

A few other examples of situations where a linear regression
model can be used:

● Estimate a person’s income based on education and years of
experience.
● Predict the selling price of a house based on its size and location.
● Predict auto sales based on consumer income, interest rates and
price discounts.
● Predict hydro consumption based on different electricity pricing
systems.
● Predict a firm’s sales based on its advertising.

The Linear Model

● The regression line: The line that best fits all the points.
● What do we mean by “best fits”?
○ The line is used to predict values of the dependent variable
for values of the independent variable.
● Note: The value predicted by the line is usually not equal to the
value of the data. There are some errors (residuals).
● Even though we know that the line is not a “perfect” prediction, we
can still work with the linear model and accept some level of error.
● Unless the points form a perfect line, we will always have some
errors.
The Linear Model

● Remember there is always a difference between the predicted
value of the dependent variable (y-hat) and the actual value of
the dependent variable (𝑦) (if there is a point).
● This difference is called the “residual”.
● A linear model can be written in the form
\[ \hat{y} = b_0 + b_1 x \]
where \(b_0\) is the intercept and \(b_1\) is the slope.
The Linear Model

● The difference between the observed value, y, and the predicted
value, y(hat), is called the residual and is denoted e:
\[ e = y - \hat{y} \]

● Residuals are shown in the picture below

The Linear Model

Example (continued): For advertising expenses of $1.42 million, the
actual sales are $28.1 million and the predicted sales are $32.9 million.
The residual is $28.1 million − $32.9 million = −$4.8 million of sales.
The Linear Model

The Line of “Best Fit”

● Some residuals will be positive and some negative, so adding up
all the residuals is not a good assessment of how well the line fits
the data
● If we consider the sum of the squares of the residuals, then the
smaller the sum, the better the fit
● The line of best fit is the line for which the sum of the squared
residuals is smallest. It is often called the least squares line.
● Question: Why do we square the residuals?
○ So that + and - residuals don’t cancel each other out

Correlation and The Line

● The scatterplot of real data won’t fall exactly on a line, so we
denote the model of predicted values by the equation
\[ \hat{y} = b_0 + b_1 x \]

● The “hat” on the y is used to represent an approximate value
● The approximate value (the predicted value) is used to predict the
dependent variable (𝑦) based on the value of the independent
variable (𝑥). In this example the dependent variable is the sales
volume (𝑦) and the independent variable is monthly advertising
expenditure (𝑥).

Correlation and the Line

For our example of sales and advertising expenses, the line shown
with the scatterplot has the equation
\[ \widehat{\text{Sales}} = 21.1 + 8.31 \times \text{Advertising} \]
(both variables in $ million).
Correlation and the Line

a) What is the slope? How can you interpret the slope in this
question?

The slope is 8.31. For every additional $1 million of advertising
expenditure, we expect sales to increase by $8.31 million.

b) What is the intercept? How can you interpret the intercept?

The intercept is 21.1. Expected sales when advertising expenditure = 0
are $21.1 million.
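
Using the slope (8.31) and intercept (21.1) quoted above, a short sketch reproduces the predicted sales and the residual from the $1.42 million example. The function and variable names are chosen here for illustration only.

```python
INTERCEPT = 21.1  # expected sales ($ million) at zero advertising
SLOPE = 8.31      # extra sales ($ million) per extra $1 million of advertising

def predicted_sales(advertising):
    """Predicted sales ($ million) for a given advertising spend ($ million)."""
    return INTERCEPT + SLOPE * advertising

actual = 28.1                      # observed sales at $1.42 million advertising
predicted = predicted_sales(1.42)
residual = actual - predicted      # observed minus predicted

print(round(predicted, 1), round(residual, 1))
```

The rounded values agree with the $32.9 million prediction and the −$4.8 million residual given in the example.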

The Linear Model

Method of ordinary least squares (OLS):

● Some residuals will be positive and some negative, so adding up
all the residuals is not a good assessment of how well the line fits
the data
● If we consider the sum of the squares of the residuals, then the
smaller the sum, the better the fit
● The regression line is the line for which the sum of the squared
residuals is smallest. It is often called the least squares line.
● Question: Why do we square the residuals?
○ So that positive and negative residuals don’t cancel each other out

The Linear Model

● The OLS method chooses the line for which the error sum of
squares (SSE) is minimized.
● SSE is the sum of the squared differences between the observed
values 𝑦 and their predicted values y(hat).
● The OLS method produces the straight line that is “closest” to the
data.
● The OLS method minimizes SSE, which is
\[ SSE = \sum (y - \hat{y})^2 = \sum e^2 \]

Correlation and the Line

We can find the slope of the least squares line using the correlation
and the standard deviations as follows:
\[ b_1 = r \frac{s_y}{s_x} \]

● The slope gets its sign from the correlation. If the correlation is
positive, the scatterplot runs from lower left to upper right and the
slope of the line is positive. (remember, standard deviation is
always a positive number).
● The slope gets its units from the ratio of the two standard
deviations, so the units of the slope are a ratio of the units of the
variables.

Correlation and the Line

● To find the intercept of our line, we use the means. If our line
estimates the data, then it should predict 𝑦(bar) for the x-value
𝑥(bar). Thus we get the following relationship for 𝑦(bar) from our
line:
\[ \bar{y} = b_0 + b_1 \bar{x} \]
● We can now solve this equation for the intercept to obtain the
formula for the intercept:
\[ b_0 = \bar{y} - b_1 \bar{x} \]

Correlation and the Line

● In summary, to find the least squares line:

● The slope of the line is found using the formula
\[ b_1 = r \frac{s_y}{s_x} \]
where 𝑟 is the correlation coefficient and \(s_x\) and \(s_y\) are the
standard deviations of the independent and dependent
variables, respectively.
● The intercept of the line is found using the formula
\[ b_0 = \bar{y} - b_1 \bar{x} \]
● The least squares line has the formula
\[ \hat{y} = b_0 + b_1 x \]
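
A minimal sketch of the summary above, assuming the usual textbook formulas slope = r·s_y/s_x and intercept = ȳ − slope·x̄. The data set and the name `least_squares_line` are hypothetical.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (slope, intercept, r) using slope = r * s_y / s_x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar  # the line passes through (x_bar, y_bar)
    return slope, intercept, r

# Hypothetical data (not from the notes)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b1, b0, r = least_squares_line(xs, ys)
print(round(b1, 3), round(b0, 3), round(r, 3))
```

For this data the line is ŷ = 2.2 + 0.6x. Plugging in x̄ = 3 returns ȳ = 4, as the intercept formula requires.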

Correlation and the Line

Least squares lines are commonly called regression lines. We’ll need to
check the same conditions for regression as we did for correlation.
● Quantitative Variables Condition
● Linearity Condition
● Outlier Condition

Standardizing Data

Correlation and the Line

● (recall) Previously, we said that the correlation coefficient can also
be expressed using standardized values (z).

● These z-scores are also useful in interpreting regression models
because they have the simple properties that their means are zero
and their standard deviations are 1.
● In other words, we have
\[ \bar{z}_x = \bar{z}_y = 0, \qquad s_{z_x} = s_{z_y} = 1 \]
Correlation and the Line

● If we consider finding the least squares line for the standardized
variables zx and zy, the formula for the slope simplifies to
\[ b_1 = r \frac{s_{z_y}}{s_{z_x}} = r \cdot \frac{1}{1} = r \]

● The intercept formula can be rewritten as
\[ b_0 = \bar{z}_y - b_1 \bar{z}_x = 0 - r \cdot 0 = 0 \]

● So if we are working with standardized values, the slope will
be equal to 𝑟 and the intercept is 0. The formula for the line
will be
\[ \hat{z}_y = r z_x \]

Correlation and the Line

From the least squares line formula for standardized values,
\[ \hat{z}_y = r z_x \]
we can conclude, for example:

● If the value of x is one SD above the mean, the predicted
standardized value of y will be equal to r.
● If the value of x is at the mean, the predicted standardized value
of y will be equal to 0.
● If the two variables x and y have no linear relationship with each
other (r = 0) and the value of x is one SD above the mean, the
predicted standardized value of y will be equal to 0.
Correlation and the Line

● For our data on advertising costs and sales, the correlation
coefficient 𝑟 is 0.693. We can now express the relationship for the
standardized variables as
\[ \hat{z}_{\text{Sales}} = 0.693 \, z_{\text{Advertising}} \]

● That means, for every standard deviation above (or below) the
mean we are in advertising expenses, we’ll predict that the sales
are 0.693 standard deviations above (or below)
their mean.

Regression to the Mean

▪ Regression to the mean means that each predicted value y(hat) is,
in standard deviation units, closer to its mean than the corresponding
𝑥 is to its own mean.

▪ The reason can be seen from the best fit line for the standardized
values, 𝑍(hat)y, which is
\[ \hat{z}_y = r z_x \]

● For example, if x is 2 SDs above its mean, the predicted y is never
more than 2 SDs from its mean, since r can’t be bigger than 1. Unless
the correlation is perfect (r = ±1), the predicted value y(hat) is
closer to its mean than 2 SDs.

Regression Lines

● Between two variables, we have only one correlation
coefficient 𝑟.
● However, there are two regression lines: one that has x as the
explanatory variable and one that has y as the explanatory
variable.

Regression

Question: We have the following dataset for the number of salespeople
working in an organization and the corresponding sales numbers.

Regression

1. What are the mean and standard deviation of the number of
salespeople and of the sales volume?
2. If we employ 8 salespeople, how many standard deviations is
that above or below the mean?
3. Are these two variables correlated? If yes, is that a positive
correlation or a negative one? Is it a strong correlation?
4. If the relationship can be modelled by a regression line, what
is the equation of this line?
5. If the number of salespeople is increased by 1 person, how
much do you expect sales volume to change (increase or
decrease)?
6. By employing 15 salespeople in the organization, what
value of sales do you expect the organization will achieve?
7. If the number of people working is two standard deviations
above the mean, how many standard deviations above or
below the mean do you expect sales to be?
8. What value of sales does the answer to question 7 correspond
to?
Learning from the Residuals

● Remember residuals are calculated as
\[ e = y - \hat{y} \]

● The residuals are the part of the data that has not been
modeled.

● Residuals help us see whether the model makes sense
● A scatterplot of residuals against predicted values should show
nothing interesting: no patterns, no direction, no shape (otherwise
something is wrong!)
● If nonlinearities, outliers, or clusters in the residuals are seen, then
we must try to determine what the regression model missed.

Learning from the Residuals

The plot of the Amazon residuals is given below. It does not
appear that there is anything interesting occurring.
● The variation in the residuals is the key to assessing how well a
model fits.

Learning from the Residuals

● The standard deviation of the residuals, se, gives us a measure of
how much the points are spread around the regression line.
● We estimate the standard deviation of the residuals as
\[ s_e = \sqrt{\frac{\sum e^2}{n - 2}} \]

● The standard deviation around the line should be the same
wherever we apply the model. This is called the Equal Spread
Condition.

Variations in the model and 𝑅^2

● All regression models fall somewhere between the two extremes of
zero correlation and perfect correlation of plus or minus 1.
● We consider the square of the correlation coefficient r to get r^2,
which is a value between 0 and 1.
● r^2 gives the fraction of the data’s variation accounted for by the
model and 1 – r^2 is the fraction of the original variation left in the
residuals.
● r^2 by tradition is written R^2 and called “R squared”
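
The sketch below ties the residual ideas together on a small hypothetical data set. It computes the residuals, the residual standard deviation s_e (assuming the usual n − 2 divisor for simple regression), and R² as one minus the share of variation left in the residuals. The data and the fitted coefficients are illustrative and are not taken from the advertising example.

```python
import math

# Hypothetical data and its least squares line y_hat = 2.2 + 0.6 x
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

predicted = [b0 + b1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

y_bar = sum(ys) / len(ys)
sse = sum(e ** 2 for e in residuals)        # variation left in the residuals
sst = sum((y - y_bar) ** 2 for y in ys)     # total variation in y
r_squared = 1 - sse / sst                   # fraction explained by the model
s_e = math.sqrt(sse / (len(xs) - 2))        # spread of points around the line

print(round(r_squared, 3), round(s_e, 3))
```

Here R² comes out to 0.6, the square of the correlation for this data, so 40% of the variation in y remains in the residuals.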

Variations in the model and 𝑅^2

● Most of the variation in the data is accounted for by the linear
model when the data points are very close to a line, i.e. when 𝑟 is
close to -1 or 1.

Results: Simple Regression

● se: you want it to be as small as possible
● R^2: you want it to be as large as possible

Variations in the model and R^2


● Thus 48.1% of the variation in sales (𝑦) is accounted for (or
explained) by the advertising expenses (𝑥), and 1 − 0.481 = 0.519
or 51.9% of the variability in sales has been left in the residuals
(not explained by the advertising expenses (𝑥)).

Variations in the model and R^2

Question: An economist wants to investigate whether debt payment disparity
between different cities is due to differences in average income in those
cities. Assume variable 𝑥 represents income and variable 𝑦 represents
debt payments. She studies sample data containing information on
26 cities and finds the sample correlation coefficient between 𝑥 and 𝑦 to
be 0.38.

Nonlinear Relationships

A regression model works well if the relationship between the two
variables is linear. What should be done if the relationship is nonlinear?
(Example: the relationship between the number of cell phones (000s) and
HDI for countries.)
Nonlinear Relationships

● To use linear regression models, transform or re-express one or
both variables by a function such as:
○ Logarithm
○ Square root
○ Reciprocal
Chapter 8: Randomness and Probability

Probability

● The (theoretical) probability of event A can be computed with the
following equation (when all outcomes are equally likely):
\[ P(A) = \frac{\text{number of outcomes in } A}{\text{total number of possible outcomes}} \]

● Example:
○ Picking a student at random, what is the probability that her/his
birthday is in the month of September?
○ If you draw a card from a standard deck of cards, what is the
probability of drawing a face card?

Probability Rules

Rule 1:

● If the probability of an event occurring is 0, the event can’t
occur.
○ Picking a student at random, prob(he/she is from outer
space).
● If the probability is 1, the event always occurs.
○ Picking a student at random, prob(he/she is older than 3)
● Probability is a number between 0 and 1: for any event A,
\( 0 \le P(A) \le 1 \).
Probability Rules

Rule 2: the Probability Assignment rule

● The probability of the set of all possible outcomes must be 1:
\[ P(S) = 1 \]
where 𝑆 represents the set of all possible outcomes and is called
the sample space.

Probability Rules

Rule 3: the Complement rule

● The probability of an event occurring is 1 minus the probability that
it doesn’t occur:
\[ P(A) = 1 - P(A^c) \]
Complement of an Event

● The event A and its Complement A^c

Probability Rules

Rule 4: the Multiplication rule

● For two independent events A and B, the probability that both A
and B occur is the product of the probabilities of the two events:
\[ P(A \text{ and } B) = P(A) \times P(B) \]
provided that A and B are independent.

● The above equation can also be used to determine whether two events
are independent: it holds if and only if they are independent.
Probability Rules

Rule 5: the Addition rule

● Two events are disjoint (or mutually exclusive) if they have no
outcomes in common.
● The Addition Rule allows us to add the probabilities of disjoint
events to get the probability that either event occurs:
\[ P(A \text{ or } B) = P(A) + P(B) \]
where A and B are disjoint.

Disjoint Events

When A and B are disjoint:

When A and B are not disjoint:


Probability Rules

Rule 6: the General Addition rule

▪ The General Addition Rule calculates the probability that either of two
events occurs. It does not require that the events be disjoint:
\[ P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \]
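
The rules so far can be checked by brute force on a standard 52-card deck, which also answers the face-card example from earlier. This is only a sketch: the `prob` helper and the choice of events are assumptions for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
deck = set(product(RANKS, SUITS))           # sample space S: 52 equally likely cards

def prob(event):
    """P(event) = outcomes in the event / outcomes in the sample space."""
    return Fraction(len(event), len(deck))

face = {c for c in deck if c[0] in {"J", "Q", "K"}}
hearts = {c for c in deck if c[1] == "hearts"}
spades = {c for c in deck if c[1] == "spades"}

p_face = prob(face)                          # 12/52 = 3/13
print(p_face)

# Rule 2: P(S) = 1, and Rule 3: complement
assert prob(deck) == 1
assert prob(deck - face) == 1 - p_face
# Rule 5: addition for disjoint events (a card cannot be a heart and a spade)
assert prob(hearts | spades) == prob(hearts) + prob(spades)
# Rule 6: general addition (face cards and hearts overlap)
assert prob(face | hearts) == prob(face) + prob(hearts) - prob(face & hearts)
# Rule 4: multiplication holds here because rank and suit are independent
assert prob(face & hearts) == prob(face) * prob(hearts)
```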

Joint Probability

Events may be placed in a contingency table such as the one in the
example below.

Example: As part of a Pick Your Prize Promotion, a store invited
customers to choose which of three prizes they’d like to win. The
responses could be placed in the following contingency table:

Marginal Probability

● Marginal probability depends only on totals found in the margins
of the table.
Joint Probability

● Joint probabilities give the probability of two events occurring
together.

Conditional Probability

● Each row or column shows a conditional distribution given one
event.

● In the table above, the probability that a selected customer wants a
bike given that we have selected a woman is:
● P(bike|woman) = 30/251 = 0.120.

Conditional Probability

● In general, when we want the probability of an event from a
conditional distribution, we write P(B|A) and pronounce it “the
probability of B given A.”
● A probability that takes into account a given condition is called a
conditional probability.
Conditional Probability

● Question: Based on the following contingency table, specify the
following probabilities.

a. P(bike and woman) (This is a joint probability)
b. P(woman) (This is a marginal probability)
c. 𝑃(𝑏𝑖𝑘𝑒|𝑤𝑜𝑚𝑎𝑛) (This is a conditional probability)
d. Is the event “choosing skis” independent of the event “Man”?
Conditional Probability

Rule 7: the General Multiplication rule

● The General Multiplication Rule calculates the probability that
both of two events occur. It does not require that the events be
independent:
\[ P(A \text{ and } B) = P(A) \times P(B \mid A) \]

Independent vs. Disjoint

● Events are independent if the occurrence of one event does not
influence (and is not influenced by) the occurrence of the other(s).
● If two events are independent, P(A and B) = P(A)P(B).
● Independent events should not be confused with disjoint events.
Two events are disjoint if they cannot both happen at the same time
(example: heads and tails are two disjoint events when tossing a coin).

Independent Events

● Using the General Multiplication Rule, show why if A and B are
independent, then P(A and B)=P(A)P(B).
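
One way to write out the argument, taking independence to mean that knowing A occurred does not change the probability of B, i.e. \(P(B \mid A) = P(B)\):

```latex
\begin{aligned}
P(A \text{ and } B) &= P(A)\,P(B \mid A) && \text{(General Multiplication Rule)} \\
                    &= P(A)\,P(B)        && \text{(independence: } P(B \mid A) = P(B)\text{)}
\end{aligned}
```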
Chapter 9: Random Variables and Probability Distributions

Random Variable

What is a random variable?

● A measure that represents the outcome of a random event

● A random variable (usually denoted by X) could be
○ discrete, in which case it assumes a countable number of
distinct values,
○ or continuous, in which case it assumes an uncountable
number of distinct values.

Random Outcomes of Random Variables

● Yes, the outcome of a random variable is random
○ The weight of a randomly selected student is random
○ The number of accidents on HWY-401 during 24 hours is
random
Probability Distribution

● Every random variable is associated with a probability distribution.
● Probability distributions capture the randomness inherent in
random variables.

Discrete Probability Distribution

● It is common to describe a discrete random variable using its
probability mass function (pmf).
● The probability mass function is a list of all possible outcomes of
the random variable and their corresponding probabilities.

Discrete Probability Distribution


● Question: Using a histogram, draw the pmf of random variable 𝑋,
where 𝑋 represents the number of courses a full-time student took
in Fall 2018.

Discrete Probability Distribution

Cumulative Distribution Function

● Discrete random variables can also be defined by their cumulative
distribution function (abbr. as CDF).
● For any value x of the random variable 𝑋, the CDF is defined as
𝑃(𝑋 ≤ x).
● For instance, suppose you throw a fair die and the random variable
𝑋 is the number showing. Then, for example, 𝑃(𝑋 ≤ 3) = 3/6 = 0.5.
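
For the fair die, the pmf and the CDF can be tabulated with a running sum. This is a minimal sketch using exact fractions, and the variable names are illustrative.

```python
from fractions import Fraction
from itertools import accumulate

outcomes = [1, 2, 3, 4, 5, 6]
pmf = {x: Fraction(1, 6) for x in outcomes}   # P(X = x) for a fair die

# CDF: P(X <= x) is the running total of the pmf up to and including x
cdf = dict(zip(outcomes, accumulate(pmf[x] for x in outcomes)))

for x in outcomes:
    print(x, pmf[x], cdf[x])
```

The CDF climbs in steps of 1/6 and reaches 1 at the largest outcome.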
Cumulative Distribution Function

● Question: Using a histogram, draw the CDF of random variable
𝑋, where 𝑋 represents the number of courses a full-time student
took in Fall 2018

Expected Value of a Random Variable

How would you interpret the expected value of a random variable?

● A weighted average of the outcomes of a random variable
● For discrete random variables: weight = probability, so
\[ E(X) = \sum_x x \, P(X = x) \]

● Note: 𝐸(𝑋) should not be confused with the most probable value
of the random variable. It may not even be one of the possible
values of the random variable.
Expected Value of a Random Variable

Expected Value (or why insurance could be a profitable business)

Question: The probability model for a particular life insurance policy is
shown. Find the expected annual payout on a policy. How much do we
expect this company to pay out per year, in the long run?
Variance and Standard Deviation of a Random Variable

● Variance talks about how the values are dispersed around the
expected value, whether they are closely clustered or scattered
around it.
● It is a measure of dispersion.

Variance and Standard Deviation of a Random Variable

● The variance of X, denoted Var(X), is calculated as:
\[ Var(X) = \sum_x \bigl(x - E(X)\bigr)^2 \, P(X = x) \]

Variance and Standard Deviation of a Random Variable

● Whatever the unit of measurement for 𝑋 is, the variance is
measured in that unit squared (if 𝑋 is in $, the variance is in $²).
The standard deviation of 𝑋 is a measure of dispersion with the
same unit as 𝑋. The standard deviation of random variable 𝑋 is
\[ SD(X) = \sqrt{Var(X)} \]
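
As a small worked case, the sketch below computes the expected value, variance, and standard deviation of a fair die roll from its pmf. The die stands in for a real probability model, because the insurance table from the slides is not reproduced here.

```python
import math
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}   # fair die

# E(X): probability-weighted average of the outcomes
expected = sum(x * p for x, p in pmf.items())
# Var(X): probability-weighted average of squared deviations from E(X)
variance = sum((x - expected) ** 2 * p for x, p in pmf.items())
# SD(X): square root of the variance, in the same unit as X
sd = math.sqrt(variance)

print(expected, variance, round(sd, 3))
```

The expected value is 7/2 = 3.5, which is not a face of the die. This illustrates that E(X) need not be a possible outcome.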
Var(X) and SD(X)

Example: Find the variance and standard deviation of the annual
payout.

● Remember variance and standard deviation are measures of
the average spread of data around its mean.
● In the last column, we have the distance (deviation) from the
mean.

Var(X) and SD(X)

Discrete Probability (Uniform Distribution)

● If X is a random variable with possible outcomes 1, 2, ..., n and
P(X = i) = 1/n for each i, then we say X has a discrete Uniform
distribution U[1, ..., n].
● Example: Tossing a die is described by the Uniform model U[1, 2,
3, 4, 5, 6], with P(X = i) = 1/6.
● Question: Is X a random variable?
Discrete Probability (Uniform Distribution)

Binomial Distribution

● An online vendor has 25 daily visitors on average. Each visitor
makes a purchase with probability 0.2. On any given day, what is
the probability that
○ No customer makes a purchase?
○ Exactly 4 customers make a purchase?
○ Between 10 and 17 customers make a purchase?
○ Etc.
● Each visitor is either a purchasing customer or a non-purchasing
customer
● Visitors are independent!
● The probability of purchase is known!
Discrete Probability (Bernoulli Trial)

Definition: A Bernoulli Trial is a trial with the following characteristics:

1. There are only two possible outcomes (success and failure) for
each trial.
2. The probability of success, denoted p, is the same for each trial.
The probability of failure is q = 1 – p.
3. The trials are independent.

Discrete Probability (Bernoulli Trial)

Discrete Probability (Bernoulli Trial)

● So each event has
○ only two outcomes
○ each outcome has a probability and the probabilities are
complementary.
● Therefore, \( p + q = 1 \)
● Or \( q = 1 - p \)
Introduction: Binomial Distribution

Question 1: Let’s throw a die 3 times. What is the probability of rolling
the number 5 every time?

● Probability of success 𝑝 = 1/6.
● Probability of failure 𝑞 = 5/6.
● Let’s define:
○ A to be the event that number 5 is rolled the first time
○ B to be the event that number 5 is rolled the second time
○ C to be the event that number 5 is rolled the third time
● The question is, what is 𝑃(𝐴 𝑎𝑛𝑑 𝐵 𝑎𝑛𝑑 𝐶)? (Note the joint
probability!)

Introduction: Binomial Distribution

Question 2: Now, let’s throw the die 3 times. What is the probability of
rolling the number 5 exactly 2 times?

● Again we have probability of success 𝑝 = 1/6.
● Probability of failure 𝑞 = 5/6.
Introduction: Binomial Distribution

Binomial Distribution

Introduction: Binomial Distribution

● Question 3: Now what if we want to find the probability of rolling
the number 5 exactly twice in 10 rolls?
● What are some examples of desired events? (SSFFFFFFFF),
(FFFFFFFFSS), (FFSSFFFFFF), ...
Introduction: Binomial Distribution

Binomial Distribution

● The Binomial distribution predicts the number of successes in a
series of Bernoulli trials.
● n = number of trials
● p = probability of success (and q = 1 − p = probability of failure)
● X = number of successes in n trials
● What is P(X = k)?
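
A sketch of the standard binomial formula, P(X = k) = C(n, k) · p^k · q^(n − k), applied to the die and online vendor questions from these notes. The function name `binomial_pmf` is chosen here for illustration.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """P(X = k): k successes in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Rolling a 5 exactly twice in 3 rolls, and exactly twice in 10 rolls
p_two_in_3 = binomial_pmf(2, 3, 1 / 6)
p_two_in_10 = binomial_pmf(2, 10, 1 / 6)

# Online vendor: n = 25 visitors, each purchasing with probability 0.2
p_no_buyers = binomial_pmf(0, 25, 0.2)
p_five_buyers = binomial_pmf(5, 25, 0.2)
p_10_to_17 = sum(binomial_pmf(k, 25, 0.2) for k in range(10, 18))

# Mean and standard deviation of the vendor's daily purchases (np, sqrt(npq))
mean_buyers = 25 * 0.2
sd_buyers = sqrt(25 * 0.2 * 0.8)

print(round(p_two_in_3, 4), round(p_five_buyers, 4), mean_buyers, sd_buyers)
```

The C(n, k) factor counts the orderings of successes and failures, such as SSFFFFFFFF and FFSSFFFFFF, each of which has the same probability p^k · q^(n − k).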
Binomial Distribution

● The mean and standard deviation of the Binomial distribution are
\[ \mu = np, \qquad \sigma = \sqrt{npq} \]

Binomial Distribution

● An online vendor has 25 daily visitors on average. Each visitor
makes a purchase with probability 0.2. On any given day, what is
the probability that exactly 5 customers make a purchase?

Binomial Distribution

● An online vendor has 25 daily visitors on average. Each visitor


makes a purchase with probability 0.2. On any given day, what is
the probability that,
Binomial Distribution

Binomial Distribution

Question: Based on historical data, 85% of claims received at a
department of a small insurance company can clear the preliminary audit
stage without requiring additional documents. Suppose the department
usually receives 10 claims every day. What is the probability that on any
given day exactly 8 claims will successfully pass the preliminary audit
without requiring more documents?
Binomial Distribution

● What is the probability that the number is 7? (x=7)
● What is the probability that the number is 6? (x=6)
● What is the probability that the number is 5? (x=5)
● What is the probability that the number is 4? (x=4)
● What is the probability that the number is 3? (x=3)
● The collection of all probabilities P(X=x) gives the binomial
distribution for this problem

Binomial Distribution

Continuous Random Variables

● A continuous random variable is a random variable that may take
on any value in some interval [a, b].
● The distribution of the probabilities can be shown with a curve, f(x),
called a probability density function (pdf).
Continuous Random Variable

● Therefore, for a continuous r.v. 𝑋, it is only meaningful to find the
probability that the value of 𝑋 falls within some specified interval.
Therefore,

● This is because,

Probability Density Function

● For a continuous r.v. 𝑋, the counterpart of a pmf is used.
● It is called the probability density function (abbr. as PDF), denoted
by 𝑓(𝑥).
● Unlike the pmf, the PDF does not provide probabilities
directly.
● The area under the PDF 𝑓(𝑥) between the two values 𝑎
and 𝑏 represents the probability 𝑃(𝑎 ≤ 𝑋 ≤ 𝑏).
● The entire area under 𝑓(𝑥) equals 1.
Continuous Random Variables

Density functions must satisfy these requirements:

1. They must be non-negative for every possible value.
2. The area under the curve must exactly equal 1.

● Probabilities like 𝑃(𝑋 ≤ 10) or 𝑃(8 ≤ 𝑋 ≤ 10) can be
read from the density function by calculating the area under the
density curve 𝑓(𝑥).

Probability Density Function

● Suppose the plot is the PDF of continuous r.v. 𝑋. Highlight the
probabilities.

Continuous Random Variable (Uniform)


Cumulative Distribution Function

● Continuous random variables can also be defined by their
cumulative distribution function (abbr. as CDF).
● For any value 𝑥 of the continuous random variable 𝑋, the CDF,
denoted by 𝐹(𝑥), is defined as
\[ F(x) = P(X \le x) \]
Cumulative Distribution Function

● Suppose the plot is the PDF of continuous r.v. 𝑋. Highlight the
probabilities.

The Normal Distribution

● It is the familiar bell-shaped distribution
● It is the most extensively used continuous probability distribution in
statistics
● It closely approximates the probability distribution for a wide range
of different random variables
● It is the cornerstone of statistical inference
● The possible applications of the normal distribution are endless. A
few examples include:
○ Advertising expenditure of firms
○ Revenues generated by firms
○ Return on an investment
Normal Distribution

Characteristics of the normal distribution: (PDF)

Normal Distribution

● How do we calculate these probabilities?
○ Using integrals
○ Using software

Standard Normal Distribution

What is the standard normal distribution?

● It is a special case of the normal distribution with a mean equal
to zero and a standard deviation equal to one.
● The letter Z denotes a random variable that is
normal and has 𝐸(𝑍) = 0 and 𝑉𝑎𝑟(𝑍) = 𝑆𝐷(𝑍) = 1
● Each value of this random variable (Z) is actually a z-score.
Standard Normal Distribution

Z-Score (reminder)

Remember the formula
\[ z = \frac{x - \mu}{\sigma} \]

What does a z-score of 2.2 imply?

● A z-score of 2.2 implies that the point is 2.2 standard deviations to
the right of the mean

What does a z-score of -1.8 imply?

● A z-score of -1.8 implies that the point is 1.8 standard deviations to
the left of the mean
Standardization

Standardization

Convert the following x values to z-scores


Standardization

Suppose you have the z-score and want to find the x-value. Rearranging
the z-score formula gives
\[ x = \mu + z\sigma \]

The empirical rule

The 68-95-99.7 Rule (the Empirical Rule)

● In a normal distribution, about 68% of the values fall within one
standard deviation of the mean, about 95% of the values fall within
two standard deviations of the mean, and about 99.7% of the
values fall within three standard deviations of the mean.
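
The three percentages can be checked numerically with Python's standard library `statistics.NormalDist`. This is a quick sketch, and the dictionary name `within` is illustrative.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)   # standard normal distribution

# P(-k <= Z <= k) for k = 1, 2, 3 standard deviations
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} SD: {p:.4f}")
```

The results are approximately 0.6827, 0.9545, and 0.9973, which the rule rounds to 68%, 95%, and 99.7%.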

Other Applications (an example)

● Sometimes, stock markets follow an uptrend (or downtrend) within
2 standard deviations of the mean. This is called moving within the
linear regression channel.
● Here is a chart of the Australian index (the All Ordinaries) from
2003 to Sep 2006.
zTable

● Provides cumulative probabilities (e.g., 𝑃(𝑍 ≤ 𝑧))
● Ex. P(Z ≤ -2.24) = 0.0125
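
When a printed z table is not at hand, the same cumulative probabilities can be obtained in software. The sketch below uses `statistics.NormalDist` from Python's standard library. The 0.975 probability in the reverse lookup is an illustrative choice, not a value from the notes.

```python
from statistics import NormalDist

Z = NormalDist()                 # standard normal: mean 0, SD 1

# Table lookup: cumulative probability to the left of z
p_left = Z.cdf(-2.24)            # P(Z <= -2.24)

# Area to the right uses the complement rule
p_right = 1 - Z.cdf(-2.24)       # P(Z > -2.24)

# Reverse lookup: the z-score with a given cumulative probability
z_975 = Z.inv_cdf(0.975)         # z such that P(Z <= z) = 0.975

print(round(p_left, 4), round(p_right, 4), round(z_975, 2))
```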


zTable

Using the z table provided on the previous slide, find the following
probabilities.

zTable

● We can also find the z-score if we are given the probability.
For instance,
Adding Normal Variable

Adding and Subtracting Normally Distributed Variables

● An important property of Normal models is that the sum or
difference of independent Normal random variables is also
Normal
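
A brief sketch of this property using `statistics.NormalDist`, whose `+` and `-` operators combine two independent Normal variables. The means and standard deviations below are hypothetical. The means add or subtract, while the variances add in both cases.

```python
from statistics import NormalDist

# Two hypothetical independent Normal random variables
X = NormalDist(mu=10, sigma=3)
Y = NormalDist(mu=4, sigma=4)

total = X + Y        # mean 10 + 4, variance 3^2 + 4^2 = 25
difference = X - Y   # mean 10 - 4, variance still 3^2 + 4^2 = 25

print(total.mean, total.stdev)
print(difference.mean, difference.stdev)
```

The standard deviation of the difference is 5, not 3 − 4, because the variances add even when the variables are subtracted.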

Adding Normal Variables


Subtracting Normal Variables

Normal Random Variable


Chapter 10: Sampling Distributions
Background (Example)

● A credit card company’s marketing specialist had the idea of offering
double air miles to customers with an airline-affiliated card if
they increased their spending by at least $800 in the month
following the offer.
● Before starting the actual program, in order to forecast the cost
and revenue of the offer, the finance department of the credit card
company needed to know what percentage of customers would
actually qualify for the double miles.
● Two marketing specialists (Alice and Bob) have been tasked with
finding out the true proportion of customers who would qualify
for double miles.

Background (Example)

● Alice decided to send the offer to a random sample of 1000
customers to find out. In that sample, she found that 211 (21.1%)
of the cardholders increased their spending by more than the
required $800.
● Bob sent the offer to a different random sample of 1000 customers.
In his sample, 202 (20.2%) of the cardholders
increased their spending by more than the required $800.
● Each sample gave a different result!
● What is the true proportion (percentage) of customers who would
increase their spending by $800?

Background (Example)

● If instead of two specialists we had, say, 100 of them, and each
collected a sample and found the proportion of customers who’d
increase their spending following the offer, what are your thoughts
on the distribution (shape, center, spread) of these 100 different
proportion values?
● What would be the shape of this distribution?
● What would affect the center (mean) of this distribution?
● What would affect the spread (std. dev.) of this distribution?
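
One way to explore these questions is a small simulation. The sketch below assumes, purely for illustration, that the true proportion is 0.20 (the notes do not give the true value) and lets 100 hypothetical specialists each draw a random sample of 1000 customers. The function name and the seed are arbitrary.

```python
import random
from statistics import mean, stdev


def simulate_proportions(p, n, num_samples, seed=0):
    """Draw num_samples samples of size n and return each sample proportion."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < p for _ in range(n)) / n
        for _ in range(num_samples)
    ]


# ASSUMPTION: true proportion of 0.20, chosen only for this illustration.
props = simulate_proportions(p=0.20, n=1000, num_samples=100)

print("center:", round(mean(props), 3))
print("spread:", round(stdev(props), 4))
```

In runs like this, the 100 proportions cluster around the assumed true proportion, and changing `n` changes how tightly they cluster.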

Parameter vs. Statistic

● Recall: a parameter belongs to the population, whereas a statistic
belongs to a sample
● A parameter is a constant, although its value may be unknown.
● A statistic is a random variable, whose value depends on the
chosen random sample.
● There is only one population but many different samples of the
same size can be drawn.
● A statistic used to make inferences about the population parameter
is called an estimator or point estimator.

Sample Proportions

● If events have only two outcomes, we can call them “success” and
“failure”
● The proportion of “success” in a sample is called the “sample
proportion”.
● Example: We would like to estimate the proportion of smokers
over the age of 25 in a city. We select 100 people aged 25+ from
the city and measure the proportion of them who smoke. This
proportion will be the sample proportion.
● The sample proportion will most likely differ from the
population proportion (true proportion).

Sample Proportions
What are some examples of proportions?

● Proportion of credit card holders who increase spending after
a promotion
● Percentage of passengers who book first class
● Probability (proportion) of pregnant women requiring a C-section

Why is the distribution of sample proportions important?

Sample Proportions

● The result of the simulation can be summarized in a table, as
below. (𝑝 = 0.25 and 𝑛 = 30)

Sample Proportions
● The result of the simulation can be summarized in a table, as
below. (𝑝 = 0.25 and 𝑛 = 70)

Sample Proportions

● Remember: the difference between sample proportions, referred to
as sampling error, is not really an error. It’s just the variability
you’d expect to see from one sample to another. A better term might
be sampling variability.
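
Sampling variability for the two settings above (𝑝 = 0.25 with 𝑛 = 30 and 𝑛 = 70) can be reproduced with a short simulation. This is a sketch, not the simulation used in the slides: the number of repetitions (2000) and the seed are arbitrary, and the comparison value √(𝑝(1 − 𝑝)/𝑛) is the standard formula for the standard deviation of a sample proportion.

```python
import random
from math import sqrt
from statistics import stdev


def sample_proportions(p, n, num_samples, rng):
    """Return the sample proportion of successes for each simulated sample."""
    return [
        sum(rng.random() < p for _ in range(n)) / n
        for _ in range(num_samples)
    ]


rng = random.Random(1)
p = 0.25
spread = {}
for n in (30, 70):
    props = sample_proportions(p, n, 2000, rng)
    spread[n] = stdev(props)
    # Simulated spread next to the theoretical value.
    print(n, round(spread[n], 3), round(sqrt(p * (1 - p) / n), 3))
```

The larger sample size gives the smaller spread, which is the pattern the two tables are meant to show.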

Sample Proportions

● Sample proportions vary from sample to sample, so they have a
distribution of their own.
● Sample proportions appear to be centered on the
population proportion.
● Depending on sample size, their dispersion (spread) around the
population proportion is different.
● The distribution of sample proportions can be modeled by a
normal distribution if:
○ All samples are independent
○ Sample size is large enough

Sample Proportions

● Based on what we observed, when certain conditions hold (more
on that later!), we can model the distribution of sample proportions
using a normal distribution.
● The sample proportions (p̂), when modeled by the normal
distribution, will have the following mean and standard deviation:
𝜇(p̂) = 𝑝 and SD(p̂) = √(𝑝𝑞/𝑛), where 𝑞 = 1 − 𝑝

Sample Proportions

● In order to model the distribution of sample proportions using the
normal distribution, we need to make sure a number of
assumptions and conditions are satisfied.

1. Independence Assumption: The sampled values must be
independent of each other
2. Sample Size Assumption: The sample size, n, must be large
enough
3. Randomization Condition: If your data come from an experiment,
subjects should have been randomly assigned to treatments
4. 10% Condition: If the sample is drawn without replacement,
then the sample size, n, must be no larger than 10% of the
population.
5. Success/Failure Condition: The sample size must be big enough
so that both the number of “successes,” np, and the number of
“failures,” nq, are expected to be at least 10.
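
The two numeric conditions (10% and Success/Failure) are easy to check in code. Independence and randomization depend on how the data were collected and cannot be verified from the numbers alone. The helper below is a hypothetical sketch; its name, return format, and the example population size are assumptions made for illustration.

```python
def check_proportion_conditions(n, p, population_size=None):
    """Check the numeric conditions for a Normal model of the sample proportion."""
    q = 1 - p
    checks = {
        # Success/Failure Condition: expect at least 10 successes and 10 failures.
        "success_failure": n * p >= 10 and n * q >= 10,
    }
    if population_size is not None:
        # 10% Condition: the sample is no larger than 10% of the population.
        checks["ten_percent"] = n <= 0.10 * population_size
    return checks


# Hypothetical city of 50,000 people, 15% smokers, sample of 100.
print(check_proportion_conditions(n=100, p=0.15, population_size=50_000))
# With n = 30 and p = 0.25 only 7.5 successes are expected, so the check fails.
print(check_proportion_conditions(n=30, p=0.25))
```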

Sample Proportions

Question: Suppose we already know that 15% of the population of a
city smokes. Assuming that the population is homogeneous across
the city,

a. What are the mean and standard deviation of the sample
proportion derived from a random sample of 100 individuals?
b. What happens to the expected value (average) and the standard
deviation of p̂ if 𝑛 becomes larger?
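
A possible way to work this question numerically is sketched below, assuming the standard results that the sample proportion has mean 𝑝 and standard deviation √(𝑝(1 − 𝑝)/𝑛). The function name and the second sample size (400) are illustrative choices.

```python
from math import sqrt


def phat_mean_sd(p, n):
    """Mean and standard deviation of the sample proportion for sample size n."""
    return p, sqrt(p * (1 - p) / n)


mean_100, sd_100 = phat_mean_sd(0.15, 100)   # part (a)
mean_400, sd_400 = phat_mean_sd(0.15, 400)   # part (b): a larger sample

print(mean_100, round(sd_100, 4))
print(mean_400, round(sd_400, 4))
```

The mean does not depend on 𝑛, while quadrupling the sample size halves the standard deviation.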
Sampling Distribution

Question: Rogers provides cable, phone, and internet services to
customers, some of whom subscribe to packages including several
services. Nationwide, suppose that 30% of Rogers customers are
“package subscribers” and subscribe to all three types of service. A
representative in Toronto wonders if the proportion in his region is the
same as the national proportion. If the same proportion holds in his
region and he takes a survey of 100 customers at random from his
subscriber list:
a) What proportion of customers would you expect to be package
subscribers?

b) What is the standard deviation of the sample proportions?

c) What shape would you expect the sampling distribution of the
proportion to have?

d) What is the probability that a sample from this population shows a
sample proportion that is at least 0.49? Would you be surprised to draw
a sample with that proportion or one more extreme? What conclusion would
you draw based on this sample?
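
The calculations for parts (a), (b), and (d) can be sketched as follows, assuming the Normal model applies (here np = 30 and nq = 70, both at least 10). Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.30, 100

expected = p                      # (a) expected sample proportion
sd = sqrt(p * (1 - p) / n)        # (b) standard deviation of the sample proportion
model = NormalDist(mu=expected, sigma=sd)   # (c) approximately Normal

z = (0.49 - p) / sd               # how unusual a sample proportion of 0.49 is
prob_at_least_049 = 1 - model.cdf(0.49)     # (d)

print(round(sd, 4), round(z, 2), prob_at_least_049)
```

A sample proportion more than 4 standard deviations above 0.30 would be very surprising, which would suggest that the proportion in the region is not the same as the national one.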
Sample Mean

Some Notations…

● Let 𝑌 be the random variable representing a certain characteristic
of the population (e.g., starting salary of business graduates from
Canadian universities). Then
Sampling Distribution of Ȳ

● Suppose we have the sampling distribution of Ȳ. What can be
derived from this information?

● As you can see, it is extremely useful to find the sampling
distribution of Ȳ

Sample mean

● The distribution of the sample mean, ȳ, when modeled by the
normal distribution, has the following parameters:
𝜇(ȳ) = 𝜇 and SD(ȳ) = σ/√𝑛

● where 𝜇 is the mean of the population, σ is the standard deviation
of the population, and 𝑛 is the sample size.

Central Limit Theorem

● Central Limit Theorem (CLT): The sampling distribution of any
mean becomes approximately Normal as the sample size grows.
● This is true regardless of the shape of the population distribution!
● The Central Limit Theorem talks about the sample means and
sample proportions of many different random samples drawn from
the same population.
● For large enough sample sizes (at least 30), the sampling distribution
of sample means can be approximated by a Normal model. The
larger the sample, the better the approximation will be.
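
The theorem can be illustrated with a simulation. The sketch below assumes a strongly right-skewed population (exponential with mean 1 and standard deviation 1, chosen only for illustration) and draws 2000 samples of size 30. If the Normal approximation is reasonable, the sample means should center on 1, have a spread near 1/√30, and about 95% of them should fall within two standard deviations of the center.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(42)
n, num_samples = 30, 2000

# ASSUMPTION: skewed population, exponential with mean 1 and standard deviation 1.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

center = mean(sample_means)
spread = stdev(sample_means)
theoretical_sd = 1 / sqrt(n)
share_within_2sd = sum(
    abs(m - 1) <= 2 * theoretical_sd for m in sample_means
) / num_samples

print(round(center, 3), round(spread, 3), round(theoretical_sd, 3))
print(round(share_within_2sd, 3))
```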

Sample Mean Distribution

● Based on the Central Limit Theorem (CLT), the sample mean (ȳ) is
approximately normally distributed, regardless of the
distribution of the population, if the sample size (n) is sufficiently large.
● If the population is Normally distributed, the sample mean (ȳ) is
normally distributed, regardless of the sample size.

Sample Mean

Assumptions and Conditions

● Independence Assumption: The sampled values must be
independent of each other
● Randomization Condition: The data values must be sampled
randomly.
● 10% Condition: When the sample is drawn without
replacement, the sample size, n, should be no more than 10%
of the population
● Large Enough Sample Condition: If the population is
unimodal and symmetric, even a fairly small sample is okay.
For skewed distributions, larger sample sizes are required for
distribution of means to be approximately Normal

Sample Mean

Question: A potato chips manufacturer produces bags of chips with
weights that are normally distributed with mean 60 grams and standard
deviation 2 grams.
a) If you select one bag of chips randomly, what is the probability that it
weighs less than 59 grams?

b) If you select 20 bags of chips randomly, what is the probability that the
average weight of this sample is less than 59 grams?

c) If you select 10 bags of chips randomly, what is the probability that the
average weight of this sample is greater than 59 grams?
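
The three parts can be worked with `statistics.NormalDist`, as sketched below. Because the population is Normal, the sample mean is Normal for any sample size, with the same mean and standard deviation σ/√𝑛. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 60, 2

# a) One bag: use the population distribution directly.
p_one = NormalDist(mu, sigma).cdf(59)

# b) Mean of 20 bags: same mean, standard deviation sigma / sqrt(20).
p_mean_20 = NormalDist(mu, sigma / sqrt(20)).cdf(59)

# c) Mean of 10 bags: probability that the average exceeds 59 grams.
p_mean_10 = 1 - NormalDist(mu, sigma / sqrt(10)).cdf(59)

print(round(p_one, 4), round(p_mean_20, 4), round(p_mean_10, 4))
```

Averages vary less than individual bags, so an average below 59 grams is much less likely than a single bag below 59 grams.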
Central Limit Theorem

Question: A potato chips manufacturer produces bags of chips which
weigh on average 60 grams with a standard deviation of 2 grams.

a) If you select one bag of chips randomly, what is the probability that it
weighs less than 59 grams?

b) If you select 35 bags of chips randomly, what is the probability that the
average weight of this sample is less than 59 grams?
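
Here the shape of the population is not given, so part (a) cannot be answered from the mean and standard deviation alone. For part (b), a sample of 35 meets the "at least 30" guideline, so the Central Limit Theorem suggests an approximate answer. The sketch below shows that approximation; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 60, 2, 35

# CLT approximation: the sample mean is roughly Normal with SD sigma / sqrt(n).
sd_mean = sigma / sqrt(n)
z = (59 - mu) / sd_mean
p_mean_below_59 = NormalDist(mu, sd_mean).cdf(59)

print(round(sd_mean, 3), round(z, 2), round(p_mean_below_59, 4))
```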
Example

Question: According to a survey from Statistics Canada, in 2011, the
average earnings in Vancouver for educational services was $42,800.
Suppose the standard deviation of earnings in Educational Services for
the whole of Vancouver was $12,500, and we want the standard
deviation of average earnings in our survey to be at most $1000. What is
the smallest sample that will satisfy this criterion?
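
One way to find the sample size is to solve σ/√𝑛 ≤ 1000 for 𝑛, which gives 𝑛 ≥ (σ/1000)², and then round up to a whole number. A short sketch with illustrative variable names:

```python
from math import ceil, sqrt

sigma = 12_500      # population standard deviation of earnings
target_sd = 1_000   # largest acceptable standard deviation of the sample mean

# sigma / sqrt(n) <= target_sd  is equivalent to  n >= (sigma / target_sd) ** 2
n_min = ceil((sigma / target_sd) ** 2)

print(n_min, round(sigma / sqrt(n_min), 1))
```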
Sampling Distributions

● We now have the sampling distribution of (a) sample proportions,
and (b) sample means.

Standard Error

● Population parameters are usually unknown.
● Therefore, we usually use sample characteristics (statistics) to
“estimate” population parameters.
● Whenever we estimate the standard deviation of a sampling
distribution, we call it a standard error (SE).
Standard Error

● For a sample proportion, p̂, the standard error is: SE(p̂) = √(p̂q̂/𝑛), where q̂ = 1 − p̂

Standard Error
Example

Question: A study commissioned by a clothing manufacturer measured
the “waist sizes” of a random sample of 250 men. The mean and
standard deviation of the waist sizes for all 250 men are 36.33 inches
and 4.019 inches, respectively. What do you expect the standard
deviation of the sampling distribution of the mean to be?
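
Since the population standard deviation is unknown, the sample standard deviation stands in for it, giving a standard error. The sketch below assumes the usual formula SE(ȳ) = s/√𝑛; variable names are illustrative.

```python
from math import sqrt

s = 4.019   # sample standard deviation of waist size, in inches
n = 250     # sample size

se_mean = s / sqrt(n)   # estimated standard deviation of the sample mean
print(round(se_mean, 4))
```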
Sampling Distribution

Question: When a truckload of apples arrives at a packing plant, a
random sample of 150 is selected and examined for bruises,
discolouration, and other defects. The whole truckload will be rejected if
more than 5% of the sample is unsatisfactory. Suppose that in fact 8% of
the apples on the truck do not meet the desired standard. What’s the
probability that the shipment will be accepted anyway?
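
A possible solution sketch: the shipment is accepted when the sample proportion of unsatisfactory apples is at most 5%. With 𝑝 = 0.08 and 𝑛 = 150, the expected counts (12 unsatisfactory, 138 satisfactory) are both at least 10, so the Normal model is assumed to apply. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.08, 150
threshold = 0.05

# Success/Failure Condition: both expected counts should be at least 10.
expected_bad, expected_good = n * p, n * (1 - p)

sd = sqrt(p * (1 - p) / n)                    # SD of the sample proportion
z = (threshold - p) / sd
p_accept = NormalDist(mu=p, sigma=sd).cdf(threshold)   # P(sample proportion <= 0.05)

print(round(sd, 4), round(z, 2), round(p_accept, 4))
```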
