Commerce 1DA3 Notes-2
Statistical Analysis
Chapter 2: Data
What Is Data?
● Data tables are cumbersome for complex data sets, so often two
or more separate data tables are linked together in a relational
database
● Each data table included in the database is a relation because it is
about a specific set of cases with information about each of these
cases for all of the variables
● Example: A typical relational database is provided consisting of
three relations: customer data, item data, and transaction data
Types of Variables
Data Collection
Key Words:
Sampling
Features of Sampling
● The size of the sample determines what we can conclude from the
data regardless of the size of the population
Stratified Sampling:
Cluster Sampling:
● Split the population into parts or clusters that each represent the
population. Select one or a few clusters at random and perform a
census within each selected cluster.
● If each cluster fairly represents the population, cluster sampling
will generate an unbiased sample.
Systematic Sampling:
Multistage Sampling:
● A survey that can yield the information you need about the
population in which you are interested is a valid survey
● To help ensure a valid survey, you need to ask four questions:
○ What do I want to know?
○ Who are the right respondents?
○ What are the right questions?
○ What will be done with the results?
Displaying Data
Charts
● Pie Charts: Pie charts show the whole group as a circle (“pie”)
sliced into pieces. The size of each piece is proportional to the
fraction of the whole in each category. The pie chart for Loblaw
data is displayed below.
Frequency Tables
Frequency Distribution
● Groups data into categories and records the count of observations
in each category
Contingency Tables
Contingency Distribution
● Conditional Distributions: We may want to restrict variables in a
distribution to show the distribution for just those cases that satisfy
a specified condition. This is called a conditional distribution.
(e.g., social networking use given the country of focus is Egypt)
Simpson’s Paradox
Frequency Table
Histogram
Example of Histogram
● Stem-And-Leaf Diagrams
1) Decide how wide to make the bins of the histogram – if there are
n data points, use about log2(n) bins (rounded to a whole number)
2) Determine the count for each bin
3) Decide where to place values that land on the endpoint of a bin.
For example, does a value of $5 go into the $0 to $5 bin or the $5
to $10 bin? The standard rule is to place such values in the higher
bin.
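The three binning steps above can be sketched in a few lines of Python. This is a minimal sketch, not the course's method: the purchase amounts, the function name `bin_counts`, the bin width of $5, and the choice to round log2(n) up are all assumptions made for illustration.

```python
import math

def bin_counts(values, low, width, n_bins):
    """Count values per equal-width bin; a value on an endpoint goes to the higher bin."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - low) // width)   # an endpoint value lands in the higher bin
        index = min(index, n_bins - 1)    # keep the largest value inside the last bin
        counts[index] += 1
    return counts

# Hypothetical purchase amounts in dollars (not from the course data)
amounts = [1, 3, 5, 5, 7, 9, 10, 12, 14, 15, 18, 19]

n_bins = math.ceil(math.log2(len(amounts)))   # step 1: log2(n), rounded up (assumption)
counts = bin_counts(amounts, low=0, width=5, n_bins=n_bins)   # steps 2 and 3
print(n_bins, counts)
```

Here the two $5 values are counted in the $5 to $10 bin, matching the endpoint rule in step 3.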
● A stem and leaf display is like a histogram, but it also gives the
individual values
● These are easy to make by hand for data sets that aren’t too large,
so they’re a great way to look at a small batch of values quickly
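A stem-and-leaf display is simple enough to build in code as well as by hand. The sketch below assumes two-digit whole numbers, with the tens digit as the stem and the units digit as the leaf; the ages and the function name `stem_and_leaf` are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative whole numbers by tens digit (stem); leaves are the units digits."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)

# Hypothetical ages (not from the course data)
ages = [23, 25, 31, 34, 34, 38, 42, 47, 51]

display = stem_and_leaf(ages)
for stem in sorted(display):
    leaves = " ".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```

Each row has the same length as the corresponding histogram bar would have, but the individual values can still be read off.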
Describing Data
Shape
Centre
Mode - Median - Mean
● We need to determine how spread out the data are because the
more the data vary, the less a measure of centre can tell us.
● One simple measure of spread is the range, defined as the
difference between the extremes (max and min)
● Range = Max - Min
Spread
● The quartiles are the values that frame the middle 50% of the data.
One-quarter of the data lies below the lower quartile, Q1, and
one-quarter lies above the upper quartile, Q3.
● The interquartile range (IQR) is defined to be the difference
between the two quartiles: IQR = Q3 − Q1
● Variance
● What is variance?
● Variance: Average of squared deviations between data points and
the mean
○ Variance Unit of Measurement: (Unit of data)^2
● For sample values 𝑦1,𝑦2,...,𝑦𝑛 the sample variance (𝑠^2) is
calculated as
𝑠^2 = Σ(𝑦𝑖 − 𝑦(bar))^2 / (𝑛 − 1)
● For population values 𝑦1,𝑦2,...,𝑦𝑁 the population variance (𝜎^2)
[𝜎 is the Greek letter sigma] is calculated as
𝜎^2 = Σ(𝑦𝑖 − 𝜇)^2 / 𝑁, where 𝜇 is the population mean
● Standard Deviation: Standard deviation represents, on average,
how far data points are from the mean
○ Standard Deviation Unit of Measurement: Same unit as
data
● What are standard deviations for the sample and
population as calculated in the previous slide?
● Coefficient of Variation (CV)
○ The CV of a dataset is a measure of relative spread: the
standard deviation expressed as a fraction of the mean
● What is the CV for a sample and a population?
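The measures of spread in this section can be checked with Python's standard `statistics` module. This is a minimal sketch on hypothetical values; the CV is computed here as standard deviation divided by mean, which is the usual textbook definition and is assumed rather than taken from the slides.

```python
import statistics

data = [2, 4, 6, 8, 10]   # hypothetical values

data_range = max(data) - min(data)            # Range = Max - Min
mean = statistics.mean(data)

sample_var = statistics.variance(data)        # divides by n - 1
population_var = statistics.pvariance(data)   # divides by N
sample_sd = statistics.stdev(data)            # square root of the sample variance
population_sd = statistics.pstdev(data)       # square root of the population variance

sample_cv = sample_sd / mean                  # relative spread (assumed definition)
population_cv = population_sd / mean

print(data_range, sample_var, population_var)
print(round(sample_sd, 3), round(population_sd, 3))
print(round(sample_cv, 3), round(population_cv, 3))
```

Note that the variances are in squared units of the data, the standard deviations are in the same units as the data, and the CV has no units.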
Percentile
3) Erect (but don’t show in the final plot) “fences” around the main
part of the data, placing the upper fence 1.5 IQRs above the upper
quartile and the lower fence 1.5 IQRs below the lower quartile.
● 𝐼𝑄𝑅 = 𝑄3 − 𝑄1 = 0.297
● Upper fence: 𝑄3 + 1.5𝐼𝑄𝑅 = 1.972 + 0.4455 = 2.4175 ≈ 2.42
● Lower fence: 𝑄1 − 1.5𝐼𝑄𝑅 = 1.675 − 0.4455 = 1.2295 ≈ 1.23
(using 𝑄1 = 𝑄3 − 𝐼𝑄𝑅 = 1.675)
4) Draw lines (whiskers) from each end of the box up and
down to the most extreme data values found within the
fences.
● The centre of a boxplot shows the middle half of the data between
the quartiles – the height of the box equals the IQR.
● If the median is roughly centred between the quartiles, then the
middle half of the data is roughly symmetric. If it is not centred, the
distribution is skewed.
● The whiskers show skewness as well if they are not roughly the
same length.
● The outliers are displayed individually to keep them out of the way
in judging skewness and to display them for special attention
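The fence, whisker, and outlier rules can be sketched as follows. The data and the helper name `fences` are hypothetical, and the quartiles come from `statistics.quantiles` with the "inclusive" method, which is only one of several quartile conventions and may differ slightly from the one used in the course.

```python
import statistics

def fences(q1, q3):
    """Return (lower fence, upper fence), each 1.5 IQRs beyond a quartile."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data with one unusually large value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Quartile conventions differ between textbooks and software;
# the "inclusive" method is just one choice.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
lower_fence, upper_fence = fences(q1, q3)

inside = [y for y in data if lower_fence <= y <= upper_fence]
outliers = [y for y in data if y < lower_fence or y > upper_fence]
whisker_low, whisker_high = min(inside), max(inside)   # whiskers end at these values

print(q1, median, q3)
print(lower_fence, upper_fence)
print(whisker_low, whisker_high, outliers)
```

The whiskers stop at the most extreme values inside the fences, so the value 30 is plotted on its own as an outlier rather than stretching the upper whisker.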
Boxplot
Question: Draw a boxplot that has two outliers on the right hand
side. Show Q1, Q2 and Q3. Show the IQR. Show the max and min
values.
Comparing Groups
● Histograms work well for comparing two groups, but boxplots tend
to offer better results when side-by-side comparison of several
groups is sought.
● Below the NYSE data is displayed in monthly boxplots.
Standardizing
Example: Compare two companies (from the “top” 100 companies) with
respect to the variables New Jobs (jobs created) and Average Pay.
● To find how many standard deviations a value is from the
mean we calculate a standardized value or z-score.
● z-Score Formula: 𝑧 = (𝑦 − 𝑦(bar)) / 𝑠
● In the following dataset find the z-score of all sample values (𝑦(bar)
= 6 and 𝑠 = 3.16). This procedure is called standardizing the data.
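Standardizing can be sketched as below. The dataset from the slide is not reproduced in these notes, so the values here are a hypothetical stand-in, chosen only because their mean is 6 and their sample standard deviation is about 3.16; the function name `z_scores` is also an assumption.

```python
import statistics

def z_scores(values):
    """Standardize values: subtract the sample mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(y - mean) / sd for y in values]

# Hypothetical stand-in data with mean 6 and s of about 3.16
data = [2, 4, 6, 8, 10]

standardized = z_scores(data)
print([round(z, 3) for z in standardized])
```

After standardizing, the z-scores have mean 0 and standard deviation 1, whatever the original units were.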
Standardizing Data
Outlier Identification: a value is flagged as an outlier if it lies
beyond either fence
● Upper fence: Q3 + 1.5 * IQR
● Lower fence: Q1 - 1.5 * IQR
● Given the sample mean 𝑥(bar), the sample standard deviation 𝑠,
and a relatively symmetric and bell-shaped distribution,
○ Approximately 68% of all observations fall in the interval
𝑥(bar) ± 𝑠
○ Approximately 95% of all observations fall in the interval
𝑥(bar) ± 2𝑠
○ Approximately 99.7% (almost 100%) of all observations fall in
the interval 𝑥(bar) ± 3𝑠
Scatterplots
Measures of Association
Understanding Correlation
● Correlation Coefficient (r): always lies in the interval [-1, +1]
N: sample size
Covariance
Association
Example
● The correlation coefficient r is the sum of the products z_x z_y
over every point in the scatterplot, divided by n – 1:
r = Σ z_x z_y / (n – 1)
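The z-score definition of the correlation coefficient translates directly into code. This is a minimal sketch; the advertising and sales figures and the function name `correlation` are hypothetical.

```python
import statistics

def correlation(xs, ys):
    """Correlation as the sum of z_x * z_y over all points, divided by n - 1."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical data (not from the course example)
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

r = correlation(advertising, sales)
print(round(r, 4))
```

Because both variables are standardized first, r has no units and does not change if either variable is rescaled.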
Correlation Properties:
Correlation Coefficient
Correlation Table
Correlation ≠ Causation
A few Notes
Selection
Example (continued): We see that the points don’t all line up, but that a
straight line can summarize the general pattern. We call this line a linear
model. This line can be used to predict sales for a given level of
advertising expenses.
The Linear Regression Model
● The regression line: The line that best fits all the points.
● What do we mean by “best fits”?
○ The line is used to predict values of the dependent variable
for values of the independent variable.
● Note: The value predicted by the line is usually not equal to the
value of the data. There are some errors (residuals).
● Even though we know that the line is not a “perfect” prediction, we
can still work with the linear model and accept some level of error.
● Unless the points form a perfect line, we will always have some
errors.
The Linear Model
For our example of sales and advertising expenses, the line shown
with the scatterplot has the equation that follows
Correlation and the Line
a) What is the slope? How can you interpret the slope in this
question?
● + and -
● The OLS method chooses the line whereby the error sum of
squares (SSE) is minimized.
● SSE is the sum of the squared differences between the observed
values 𝑦 and their predicted values y(hat).
● The OLS method finds the straight line that is “closest” to the
data.
● The OLS method minimizes SSE, which is
SSE = Σ(𝑦 − y(hat))^2
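As a rough illustration of what minimizing SSE means, the sketch below computes SSE for two candidate lines on hypothetical data. The points, the function name `sse`, and both candidate lines are assumptions made for illustration; for these five points the least squares line works out to an intercept of 2.2 and a slope of 0.6.

```python
def sse(xs, ys, intercept, slope):
    """Sum of squared differences between observed y and predicted y(hat)."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data (not from the course example)
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

sse_ols = sse(advertising, sales, intercept=2.2, slope=0.6)     # least squares line
sse_other = sse(advertising, sales, intercept=2.0, slope=0.7)   # a nearby line

print(round(sse_ols, 2), round(sse_other, 2))
```

Any other straight line through these points gives a larger SSE than the least squares line, which is the sense in which that line is "closest" to the data.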
We can find the slope of the least squares line using the correlation r
and the standard deviations s_x and s_y as follows:
slope = r × (s_y / s_x)
● The slope gets its sign from the correlation. If the correlation is
positive, the scatterplot runs from lower left to upper right and the
slope of the line is positive. (remember, standard deviation is
always a positive number).
● The slope gets its units from the ratio of the two standard
deviations, so the units of the slope are a ratio of the units of the
variables.
● To find the intercept of our line, we use the means. If our line
estimates the data, then it should predict 𝑦(bar) for the x-value
𝑥(bar). Thus we get the following relationship for 𝑦(bar) from our
line:
𝑦(bar) = intercept + slope × 𝑥(bar)
● We can now solve this equation for the intercept to obtain the
formula for the intercept:
intercept = 𝑦(bar) − slope × 𝑥(bar)
● In summary, to find
Least squares lines are commonly called regression lines. We’ll need to
check the same conditions for regression as we did for correlation.
● Quantitative Variables Condition
● Linearity Condition
● Outlier Condition
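The recipe described in this section (slope from the correlation and the two standard deviations, intercept from the two means) can be sketched as follows. The data and the function name `least_squares_line` are hypothetical, and the sketch assumes the three conditions above have already been checked.

```python
import statistics

def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    # Correlation: sum of products of deviations, scaled by (n - 1) and both SDs
    r = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / ((n - 1) * sd_x * sd_y)
    slope = r * sd_y / sd_x                # sign comes from r, units are (y units)/(x units)
    intercept = mean_y - slope * mean_x    # the line passes through (x(bar), y(bar))
    return intercept, slope

# Hypothetical data (not from the course example)
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

intercept, slope = least_squares_line(advertising, sales)
predicted_at_mean = intercept + slope * statistics.mean(advertising)
print(round(intercept, 3), round(slope, 3), round(predicted_at_mean, 3))
```

The last line confirms that the fitted line predicts the mean of y at the mean of x.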
Standardizing Data
● That means, for every standard deviation above (or below) the
mean we are in advertising expenses, we’ll predict that the sales
are 0.693 standard deviations above (or below) their mean.
▪ The reason can be seen from the standardized best fit line
𝑍(hat)y, which is
𝑍(hat)y = r × 𝑍x, with r = 0.693 in this example
Regression Lines
Regression
● The residuals are the part of the data that has not been
modelled.
The plot of the Amazon residuals is given below. It does not
appear that there is anything interesting occurring.
● The variation in the residuals is the key to assessing how well a
model fits.
Nonlinear Relationships
Probability
● Example:
○ Picking a student at random, the probability that her/his
birthday is in the month of September.
○ If you draw a card from a standard deck of cards, what is the
probability of drawing a face card?
Probability Rules
Rule 1:
Disjoint Events
▪ The General Addition Rule calculates the probability that either of two
events occurs. It does not require that the events be disjoint.
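The General Addition Rule, P(A or B) = P(A) + P(B) − P(A and B), can be verified by counting cards in a standard 52-card deck. The choice of events (drawing a heart, drawing a face card) and the helper name `prob` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))   # 52 equally likely (rank, suit) pairs

def prob(event):
    """Probability of an event as (cards satisfying it) / (cards in the deck)."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_heart(card):
    return card[1] == "hearts"

def is_face(card):
    return card[0] in {"J", "Q", "K"}

p_heart = prob(is_heart)
p_face = prob(is_face)
p_both = prob(lambda card: is_heart(card) and is_face(card))

p_either_rule = p_heart + p_face - p_both                         # General Addition Rule
p_either_count = prob(lambda card: is_heart(card) or is_face(card))  # direct count

print(p_face, p_either_rule, p_either_count)
```

The two events are not disjoint (three cards are both hearts and face cards), so simply adding P(heart) and P(face) would count those three cards twice.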
Joint Probability
Marginal Probability
Conditional Probability
Independent Events
Random Variable
● Note: 𝐸(𝑋) should not be confused with the most probable value
of the random variable. It may not even be one of the possible
values of the random variable.
Expected Value of a Random Variable
● Variance talks about how the values are dispersed around the
expected value, whether they are closely clustered or scattered
around it.
● It is a measure of dispersion.
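A fair six-sided die makes both points concrete. The sketch below uses the standard definitions E(X) = Σ x·P(x) and Var(X) = Σ (x − E(X))²·P(x); these formulas and the fair-die example are assumptions here, since the slides' own formulas are not reproduced in these notes.

```python
from fractions import Fraction

# Probability model for one roll of a fair die
outcomes = {face: Fraction(1, 6) for face in range(1, 7)}

expected = sum(x * p for x, p in outcomes.items())
variance = sum((x - expected) ** 2 * p for x, p in outcomes.items())

print(expected)               # 7/2
print(variance)               # 35/12
print(expected in outcomes)   # False: 3.5 is not a face of the die
```

The expected value 3.5 can never actually be rolled, which is why E(X) should be read as a long-run average rather than a likely outcome.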
Binomial Distribution
1. There are only two possible outcomes (success and failure) for
each trial.
2. The probability of success, denoted p, is the same for each trial.
The probability of failure is q = 1 – p.
3. The trials are independent.
Introduction: Binomial Distribution
Question 2: Now, let’s throw the die 3 times, what is the probability of
rolling the number 5 exactly 2 times?
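Question 2 can be worked with the standard binomial formula, P(X = k) = C(n, k) p^k (1 − p)^(n − k). The sketch assumes a fair die, so p = 1/6 for rolling a 5; the function name `binomial_pmf` is hypothetical.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_five = Fraction(1, 6)                    # fair die assumed
p_two_fives = binomial_pmf(2, 3, p_five)   # exactly two 5s in three throws

print(p_two_fives, float(p_two_fives))     # 5/72, about 0.069
```

The factor C(3, 2) = 3 counts the three positions the single non-5 throw can occupy.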
Binomial Distribution
● This is because,
● Probabilities like 𝑃(𝑋 ≤ 10) or 𝑃(8 ≤ 𝑋 ≤ 10) can be found from
the density function by calculating the area under the density
curve 𝑓(𝑥).
Normal Distribution
Z-Score (reminder)
● z-score 2.2 implies that the point is 2.2 standard deviations to the
right of the mean
● z-score -1.8 implies that the point is 1.8 standard deviations to the
left of the mean
Standardization
Suppose you have the z-score and want to find the x-value
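Going from a z-score back to an x-value just reverses the standardization: x = mean + z × standard deviation. A minimal sketch, with a hypothetical mean of 60 and standard deviation of 2:

```python
def x_to_z(x, mean, sd):
    """Standardize: how many standard deviations x lies from the mean."""
    return (x - mean) / sd

def z_to_x(z, mean, sd):
    """Reverse the standardization to recover the x-value."""
    return mean + z * sd

mean, sd = 60.0, 2.0   # hypothetical values

x_right = z_to_x(2.2, mean, sd)    # 2.2 standard deviations to the right of the mean
x_left = z_to_x(-1.8, mean, sd)    # 1.8 standard deviations to the left of the mean
print(round(x_right, 2), round(x_left, 2))
```

Applying `x_to_z` to either result returns the original z-score.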
● In the normal distributions, about 68% of the values fall within one
standard deviation of the mean, about 95% of the values fall within
two standard deviations of the mean, and about 99.7% of the
values fall within three standard deviations of the mean.
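The 68-95-99.7 figures can be reproduced without a z table by using the standard Normal cumulative distribution function, written here with `math.erf`. The function names are hypothetical; the identity Φ(z) = 0.5 × (1 + erf(z / √2)) is a standard result.

```python
import math

def standard_normal_cdf(z):
    """P(Z < z) for a standard Normal random variable."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_within(k):
    """Probability that a Normal value lies within k standard deviations of its mean."""
    return standard_normal_cdf(k) - standard_normal_cdf(-k)

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The same `standard_normal_cdf` function gives the left-tail area that a z table lists for any z-score.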
Using the z table provided on the previous slide, find the following
probabilities.
zTable
Background (Example)
● If, instead of two specialists, we had, say, 100 of them, and each
collected a sample and found the proportion of customers who’d
increase their spending following the offer, what are your thoughts
on the distribution (shape, centre, spread) of these 100 different
proportion values?
● What would be the shape of this distribution?
● What would affect the centre (mean) of this distribution?
● What would affect the spread (std. dev.) of this distribution?
Sample Proportions
● If events have only two outcomes, we can call them “success” and
“failure”
● The proportion of “success” in a sample is called the “sample
proportion”.
● Examples: We would like to estimate the proportion of smokers
over the age of 25 in a city. We select 100 people from the city
(25+) and measure the proportion of them who smoke. This
proportion will be the sample proportion.
● The sample proportion is most probably different from the
population proportion (true proportion).
What are some examples of proportions?
● The result of the simulation can be summarized in a table, as
below. (𝑝 = 0.25 and 𝑛 = 70)
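A simulation of this kind can be sketched as below, using the slide's 𝑝 = 0.25 and 𝑛 = 70. The number of simulated samples (2000) and the random seed are assumptions, and the comparison value sqrt(p(1 − p)/n) is the standard formula for the standard deviation of a sample proportion rather than something stated on the slide.

```python
import math
import random
import statistics

random.seed(1)                       # fixed seed so the run is repeatable
p, n, n_samples = 0.25, 70, 2000     # n_samples is an assumption

proportions = []
for _ in range(n_samples):
    # One sample: n independent trials, each a success with probability p
    successes = sum(1 for _ in range(n) if random.random() < p)
    proportions.append(successes / n)

simulated_mean = statistics.mean(proportions)
simulated_sd = statistics.stdev(proportions)
theoretical_sd = math.sqrt(p * (1 - p) / n)

print(round(simulated_mean, 3), round(simulated_sd, 3), round(theoretical_sd, 3))
```

The sample proportions vary from sample to sample, but they centre near the true proportion and their spread is close to the theoretical value.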
● Remember:
a. The difference between sample proportions, referred to as sampling
error, is not really an error. It’s just the variability you’d expect to
see from one sample to another. A better term might be sampling
variability.
Some Notations…
Sample mean
b) If you select 20 bags of chips randomly, what is the probability that the
average weight of this sample is less than 59 grams?
c) If you select 10 bags of chips randomly, what is the probability that the
average weight of this sample is greater than 59 grams?
Central Limit Theorem
a) If you select one bag of chips randomly, what is the probability that it
weighs less than 59 grams?
b) If you select 35 bags of chips randomly, what is the probability that the
average weight of this sample is less than 59 grams?
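Questions of this type can be set up as in the sketch below. The population mean (60 g) and standard deviation (3 g) are hypothetical, since the values from the slide are not reproduced in these notes. Part (a) additionally assumes individual bag weights follow a Normal model; part (b) relies on the Central Limit Theorem, with the sample mean having standard deviation σ / √n.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X < x) for a Normal model with the given mean and standard deviation."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

# Hypothetical population values: the notes do not state them
mu, sigma = 60.0, 3.0   # grams

# a) One bag weighs less than 59 g (assumes Normal bag weights)
p_single = normal_cdf(59, mu, sigma)

# b) The mean of 35 bags is less than 59 g
n = 35
standard_error = sigma / math.sqrt(n)         # SD of the sample mean
p_mean = normal_cdf(59, mu, standard_error)

print(round(p_single, 4), round(p_mean, 4))
```

The sample mean varies much less than a single bag does, so the probability in (b) is far smaller than in (a).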
Example
Standard Error
Example