Unit 8 Regression
Learning Outcomes:
1. Students should be able to estimate the correlation coefficient for a given data set
2. Students should be able to estimate the line of best fit for a given data set
3. Students should be able to determine whether a regression model is significant
Pre-requisites:
1. Students must be able to plot points on the Cartesian coordinate system
2. Students should have a basic understanding of statistics and measures of central tendency
Key Concepts: Regression, Correlation, Pearson’s r
LEVEL 2: AI INQUIRED (AI APPLY) TEACHER INSTRUCTION MANUAL
The value shows how good the correlation is (not how steep the line is), and if it is positive or negative.
The table below is a crosstab that shows, by age, whether somebody has an unlisted phone number.
Each cell of the table shows the number of observations with that combination of values of the two
variables.
We can see, for example, there are 185 people aged 18 to 34 years who do not have an unlisted phone
number.
Column percentages are also shown (these are percentages within the columns, so that each column’s
percentages add up to 100%); for example, 24% of all the people without an unlisted phone number are
aged 18 to 34 years.
The age distribution for people without unlisted numbers is different from that for people with
unlisted numbers. In other words, the crosstab reveals a relationship between the two: people
with unlisted phone numbers are more likely to be younger.
Thus, we can also say that the variables used to create this table are correlated. If there were
no relationship between these two categorical variables, we would say that they were not
correlated.
In this example, the two variables can both be viewed as being ordered. Consequently, we can
potentially describe the pattern as a positive or negative correlation (negative in the table
shown). However, where the variables are not both ordered, we can only refer to the strength of the
correlation without discussing its direction (i.e., whether it is positive or negative).
1.1.2 Scatterplots
A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two different numeric
variables. The position of each dot on the horizontal and vertical axis indicates values for an individual
data point. Scatter plots are used to observe relationships between variables.
Example
This is a scatter plot showing the amount of sleep needed per day by age.
As seen above, as you grow older, you need less sleep (but still probably more than you’re currently
getting).
Answer: This is a negative correlation. As we move along the x-axis toward the greater numbers,
the points move down which means the y-values are decreasing, making this a negative correlation.
1.2 Pearson’s r
The Pearson correlation coefficient is used to measure the strength of a linear association between
two variables, where the value r = 1 means a perfect positive correlation and the value r = -1 means a
perfect negative correlation. So, for example, you could use this test to find out whether people's
height and weight are correlated (the taller the people are, the heavier they're likely to be).
Requirements for Pearson's correlation coefficient are as follows: the scale of measurement should be
interval or ratio.
How can we determine the strength of association based on the Pearson correlation coefficient?
The stronger the association of the two variables, the closer the Pearson correlation coefficient, r, will
be to either +1 or -1 depending on whether the relationship is positive or negative, respectively.
Achieving a value of +1 or -1 means that all your data points are included on the line of best fit – there
are no data points that show any variation away from this line. Values for r between +1 and -1 (for
example, r = 0.8 or -0.4) indicate that there is variation around the line of best fit. The closer the value
of r to 0 the greater the variation around the line of best fit. Different relationships and their correlation
coefficients are shown in the diagram below:
Remember that these values are guidelines and whether an association is strong or not will also depend on
what you are measuring.
Example 1
In the example below of 6 people of different ages and weights, let us calculate the value of
Pearson's r.
Solution:
To calculate the Pearson correlation coefficient, we first calculate the following values:
Assumptions
There are four "assumptions" that underpin a Pearson's correlation. If any of these four assumptions is not
met, analysing your data using a Pearson's correlation might not lead to a valid result.
Assumption #1: The two variables should be measured at the continuous level. Examples of such continuous
variables include height (measured in feet and inches), temperature (measured in °C), salary (measured in
dollars/INR), revision time (measured in hours), intelligence (measured using IQ score), reaction time (measured
in milliseconds), test performance (measured from 0 to 100), sales (measured in number of transactions per
month), and so forth.
Assumption #2: There needs to be a linear relationship between your two variables. Whilst there are a number
of ways to check whether a linear relationship exists, we suggest creating a scatterplot using Stata, where
you can plot your two variables against each other. You can then visually inspect the scatterplot to check for
linearity. Your scatterplot may look something like one of the following:
Assumption #3: There should be no significant outliers. Outliers are simply single data points within your data
that do not follow the usual pattern (e.g. in a study of 100 students' IQ scores, where the mean score was 108
with only a small variation between students, one student had a score of 156, which is very unusual, and may
even put her in the top 1% of IQ scores globally). The following scatterplots highlight the potential impact of
outliers:
Pearson's r is sensitive to outliers, which can have a great impact on the line of best fit and the Pearson
correlation coefficient, leading to very different conclusions regarding your data. Therefore, it is best if there are
no outliers or they are kept to a minimum. Fortunately, you can use Stata to detect possible outliers
using scatterplots.
Assumption #4: Your variables should be approximately normally distributed. To assess the statistical
significance of the Pearson correlation, you need bivariate normality, but this assumption is difficult to
assess, so a simpler method is more commonly used.
Let there be two variables x and y. If y depends on x, the relationship can be described by a simple regression of y on x.
We call y the dependent variable and x the independent variable.
The regression line of y on x is also called the line of best fit, because it is obtained by the method of
least squares. This method is the most suitable for estimating the value
of the dependent variable y from the value of the independent variable x.
Least Squares Method
$\sum e_i^2 = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - a - b x_i)^2$
Here:
$\hat{y}_i = a + b x_i$ denotes the estimated value of $y_i$ for a given value $x_i$ of the variable x.
$e_i = y_i - \hat{y}_i$ is the difference between the observed and estimated values, called the error or residual. The
regression line of y on x, along with the estimation errors, is as follows:
Minimizing the sum of squared errors with respect to a and b gives the following two equations, which we refer to as the Normal Equations.
$\sum y_i = na + b \sum x_i$
$\sum x_i y_i = a \sum x_i + b \sum x_i^2$
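We get the least squares estimates of a and b by solving the above two equations simultaneously.
$b = \dfrac{\mathrm{Cov}(x, y)}{S_x^2} = \dfrac{r S_x S_y}{S_x^2} = \dfrac{r S_y}{S_x}$
$a = \bar{y} - b\bar{x}$
Substituting these estimates into y = a + bx gives the regression line of y on x:
$\dfrac{y - \bar{y}}{S_y} = r \, \dfrac{x - \bar{x}}{S_x}$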
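The least squares estimates can be checked numerically. The sketch below solves the two normal equations for a straight line, Σy = na + bΣx and Σxy = aΣx + bΣx², on a small invented data set. The function name `least_squares` and the data are illustrative assumptions, not part of this manual.

```python
def least_squares(xs, ys):
    """Solve the two normal equations for the intercept a and slope b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Eliminating a from the two normal equations gives b directly
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # The first normal equation then gives a = (sum_y - b * sum_x) / n
    a = (sum_y - b * sum_x) / n
    return a, b

xs = [1, 2, 3, 4, 5]   # hypothetical x values
ys = [2, 4, 5, 4, 5]   # hypothetical y values
a, b = least_squares(xs, ys)
print(f"a = {a:.2f}, b = {b:.2f}")

# Both normal equations should hold for the estimates
n = len(xs)
lhs1, rhs1 = sum(ys), n * a + b * sum(xs)
lhs2 = sum(x * y for x, y in zip(xs, ys))
rhs2 = a * sum(xs) + b * sum(x * x for x in xs)
```

For this data the estimates are a = 2.2 and b = 0.6, and the fitted line passes through the point of means (3, 4), in agreement with a = ȳ − b x̄.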
Sometimes variable x depends on variable y. In such cases, the line of regression of x
on y is:
$x = \hat{a} + \hat{b} y$
Regression Equation
$\dfrac{x - \bar{x}}{S_x} = r \, \dfrac{y - \bar{y}}{S_y}$
Question
Solution
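(i) The two regression lines intersect at the point [x¯, y¯]. Therefore, we replace x
and y with x¯ and y¯ and solve the two equations simultaneously:
7x¯ – 3y¯ = 18
4x¯ – y¯ = 11
This gives x¯ = 3 and y¯ = 1.
(ii) We know that r2 is the product of the two regression coefficients. Taking 7x – 3y = 18 as the line of y on x (slope 7/3) and 4x – y = 11 as the line of x on y (slope 1/4),
r2 = 7/12
Therefore, r = √(7/12) ≈ 0.76. The positive root is taken because both regression coefficients are positive.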
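The arithmetic in this solution can be checked with a short sketch. It assumes, as the value r2 = 7/12 implies, that 7x – 3y = 18 is the regression line of y on x and 4x – y = 11 is the regression line of x on y. The variable names are illustrative.

```python
from fractions import Fraction

# Coefficients of the two lines written as a*x + b*y = c
a1, b1, c1 = 7, -3, 18   # 7x - 3y = 18, assumed to be the line of y on x
a2, b2, c2 = 4, -1, 11   # 4x - y = 11, assumed to be the line of x on y

# (i) The lines intersect at the point of means; solve by Cramer's rule
det = a1 * b2 - a2 * b1
x_bar = Fraction(c1 * b2 - c2 * b1, det)
y_bar = Fraction(a1 * c2 - a2 * c1, det)

# (ii) Regression coefficients: slope of y on x, and slope of x on y
b_yx = Fraction(-a1, b1)   # y = (7/3)x - 6
b_xy = Fraction(-b2, a2)   # x = (1/4)y + 11/4
r_squared = b_yx * b_xy    # product of the two regression coefficients
r = float(r_squared) ** 0.5  # positive root, since both slopes are positive

print(x_bar, y_bar, r_squared, round(r, 4))
```

If the two lines were assigned the other way round, the product of the slopes would be 12/7, which is greater than 1 and therefore cannot be a value of r2.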
Regression lines are very useful for forecasting procedures. The purpose of the line is to describe the
interrelation of a dependent variable (Y variable) with one or many independent variables (X variable). By using
the equation obtained from the regression line an analyst can forecast future behaviours of the dependent
variable by inputting different values for the independent ones. Regression lines are widely used in the financial
sector and in business in general.
Financial analysts employ linear regressions to forecast stock prices, commodity prices and to perform
valuations for many different securities. On the other hand, companies employ regressions for the purpose of
forecasting sales, inventories and many other variables that are crucial for strategy and planning.
Y = a + bX + u
Here a is the intercept, b is the slope and u is the regression residual (error term).
Example: 1
Data was collected on the “depth of dive” and the “duration of dive” of penguins. The following linear model is
a fairly good summary of the data:
Where:
Interpretation of the slope: If the duration of the dive increases by 1 minute, we predict the depth of the
dive will increase by approximately 2.915 yards.
Interpretation of the intercept: If the duration of the dive is 0 minutes, then we predict the depth of the dive
is 0.015 yards.
Comments: The interpretation of the intercept doesn’t make sense in the real world. It isn’t reasonable for
the duration of a dive to be near t = 0, because that’s too short for a dive. If data with x-values near zero
wouldn’t make sense, then usually the interpretation of the intercept won’t seem realistic in the real world.
It is, however, acceptable (even required) to interpret this as a coefficient in the model.
Example: 2
Reinforced concrete buildings have steel frames. One of the main factors affecting the durability of these
buildings is carbonation of the concrete (caused by a chemical reaction that changes the pH of the concrete),
which then corrodes the steel reinforcing the building.
Data is collected on specimens of the core taken from such buildings, where the following are measured:
Interpretation of the slope: If the depth of the carbonation increases by 1 mm, then the model predicts that the
strength of the concrete will decrease by approximately 2.8 MPa.
Interpretation of the intercept: If the depth of the carbonation is 0, then the model predicts that the strength
of the concrete is approximately 24.5 MPa.
Comments: Notice that it isn’t necessary to fully understand the units in which the variables are measured in
order to correctly interpret these coefficients. While it is good to understand data thoroughly, it is also important
to understand the structure of linear models. In this model, notice that the strength decreases as the
carbonation increases, which is shown by the negative slope coefficient. When you interpret a negative slope,
notice that you must say that, as the explanatory variable increases, then the response variable decreases.
Example: 3
When cigarettes are burned, one by-product in the smoke is carbon monoxide. Data is collected to determine
whether the carbon monoxide emission can be predicted by the nicotine level of the cigarette.
It is determined that the relationship is approximately linear when we predict carbon monoxide, C,
from the nicotine level, N.
Both variables are measured in milligrams.
The formula for the model is C = 3.0 + 10.3 N.
Interpretation of the slope: If the amount of nicotine goes up by 1 mg, then we predict the amount of carbon
monoxide in the smoke will increase by 10.3 mg.
Interpretation of the intercept: If the amount of nicotine is zero, then we predict that the amount of carbon
monoxide in the smoke will be about 3.0 mg.
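Both interpretations can be read directly off the model when it is written as a function. This is a minimal sketch of the model C = 3.0 + 10.3 N from this example; the function name `predict_co` is an illustrative assumption.

```python
def predict_co(nicotine_mg):
    """Predicted carbon monoxide (mg) for a given nicotine level (mg)."""
    intercept = 3.0   # predicted CO when nicotine is 0 mg
    slope = 10.3      # change in predicted CO per 1 mg increase in nicotine
    return intercept + slope * nicotine_mg

at_zero = predict_co(0.0)                        # the intercept
increase = predict_co(1.5) - predict_co(0.5)     # effect of 1 mg more nicotine

print(f"prediction at N = 0 mg: {at_zero:.1f} mg")
print(f"increase for 1 mg more nicotine: {increase:.1f} mg")
```

The difference between two predictions that are 1 mg of nicotine apart is always the slope, whichever starting value is chosen.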
Correlation is a statistical technique which tells us how strongly a pair of variables are linearly related and
change together. It does not tell us why or how the relationship arises; it only says that the relationship exists.
Causation goes a step further than correlation. It says that any change in the value of one variable will cause a
change in the value of another variable, which means one variable makes the other happen. It is also referred
to as cause and effect.
Two or more variables are considered to be related, in a statistical context, if their values change so that as the
value of one variable increases or decreases so does the value of the other variable (it may be in the same or
opposite direction).
For example,
For the two variables "hours worked" and "income earned" there is a relationship between the two
such that the increase in hours worked is associated with an increase in income earned as well.
If we consider the two variables "price" and "purchasing power", as the price of goods increases a
person's ability to buy these goods decreases (assuming a constant income).
Therefore:
Correlation is a statistical measure (expressed as a number) that describes the size and direction of a
relationship between two or more variables.
A correlation between variables, however, does not automatically mean that the change in one variable
is the cause of change in the values of the other variable.
Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a
causal relationship between the two events. This is also referred to as cause and effect.
Theoretically, the difference between the two types of relationships is easy to identify: an action or
occurrence can cause another (e.g. smoking causes an increase in the risk of developing lung cancer), or it
can correlate with another (e.g. smoking is correlated with alcoholism, but it does not cause alcoholism). In
practice, however, it remains difficult to clearly establish cause and effect, compared to establishing correlation.
Example 1
Suppose a study of speeding violations and drivers who use cell phones produced the following fictional data:
Cell Phone User   Speeding violation in the last year   No speeding violation in the last year   Total
Yes               25                                    280                                      305
No                45                                    405                                      450
Total             70                                    685                                      755
The total number of people in the sample is 755. The row totals are 305 and 450. The column totals are 70 and
685. Notice that 305 + 450 = 755 and 70 + 685 = 755.
3. Find P (Person had no violation in the last year AND was a cell phone user)
Number of cell phone users with no violation / Total number in study = 280/755
4. Find P (Person is a cell phone user OR person had no violation in the last year)
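The AND and OR probabilities can be computed directly from the cell counts. In the sketch below, the counts for cell phone users (25 and 280) are inferred from the stated row and column totals and from the 280 used in question 3; the dictionary layout and variable names are illustrative.

```python
from fractions import Fraction

# Cell counts; the "user" row is inferred from the totals given in the text
counts = {
    ("user", "violation"): 25,
    ("user", "no violation"): 280,
    ("non-user", "violation"): 45,
    ("non-user", "no violation"): 405,
}
total = sum(counts.values())

users = counts[("user", "violation")] + counts[("user", "no violation")]
no_violation = counts[("user", "no violation")] + counts[("non-user", "no violation")]
both = counts[("user", "no violation")]

# P(A AND B): the single cell where both conditions hold
p_and = Fraction(both, total)
# P(A OR B) = P(A) + P(B) - P(A AND B), so the shared cell is not counted twice
p_or = Fraction(users + no_violation - both, total)

print(f"P(user AND no violation) = {both}/{total}")
print(f"P(user OR no violation)  = {users + no_violation - both}/{total}")
```

Subtracting the shared cell matters here: adding 305 and 685 without the correction would give more people than the 755 in the study.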
This table shows a random sample of 100 hikers and the areas of hiking they prefer.
Sex      The Coastline   Near Lakes and Streams   On Mountain Peaks   Total
Female   18              16                       ___                 45
Sex      The Coastline   Near Lakes and Streams   On Mountain Peaks   Total
Female   18              16                       11                  45
Male     16              25                       14                  55
Total    34              41                       25                  100
2. Find the probability that a person is male given that the person prefers hiking near lakes and streams
Hint:
Let M = being male, and let L = prefers hiking near lakes and streams.
1. What word tells you this is conditional?
2. Fill in the blanks and calculate the probability: P(___|___) = ___.
3. Is the sample space for this problem all 100 hikers? If not, what is it?
Answer
1. The word “given” tells you that this is a conditional.
2. P(M|L) = 25/41
3. No, the sample space for this problem is the 41 hikers who prefer lakes and streams.
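The restriction of the sample space can be made explicit in code. This sketch uses the completed hikers table from this example; the dictionary keys are shortened labels chosen for illustration.

```python
from fractions import Fraction

# Completed table of hiking preferences by sex
hikers = {
    "Female": {"coastline": 18, "lakes": 16, "peaks": 11},
    "Male": {"coastline": 16, "lakes": 25, "peaks": 14},
}

# Conditioning on L shrinks the sample space to hikers who prefer lakes and streams
lakes_total = sum(row["lakes"] for row in hikers.values())

# P(M | L) = (males who prefer lakes) / (all who prefer lakes)
p_m_given_l = Fraction(hikers["Male"]["lakes"], lakes_total)

# For comparison, the unconditional P(M) uses all 100 hikers
all_hikers = sum(sum(row.values()) for row in hikers.values())
p_m = Fraction(sum(hikers["Male"].values()), all_hikers)

print(f"P(M | L) = {p_m_given_l}, P(M) = {p_m}")
```

The two probabilities differ because the denominators differ: 41 hikers for the conditional probability and 100 for the unconditional one.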
2. Reading
2.1 Correlation
Correlation is a measure of how closely two variables move together. Pearson’s correlation coefficient is a
common measure of correlation, and it ranges from +1 for two variables that are perfectly in sync with each
other, to 0 when they have no correlation, to -1 when the two variables are moving opposite to each other.
For linear regression, one way of calculating the slope of the regression line uses Pearson’s correlation, so it
is worth understanding what correlation is.
Y = a + bx
The correlation coefficient that indicates the strength of the relationship between two variables can be
found using the following formula:
$r_{xy} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
Where:
$r_{xy}$ – the correlation coefficient of the linear relationship between the variables x and y
In order to calculate the correlation coefficient using the formula above, you must undertake the following
steps:
2. Calculate the means (averages) x̅ for the x-variable and ȳ for the y-variable.
3. For the x-variable, subtract the mean from each value of the x-variable (let’s call this new variable
“a”). Do the same for the y-variable (let’s call this variable “b”).
4. Multiply each a-value by the corresponding b-value and find the sum of these multiplications (the
final value is the numerator in the formula).
6. Find the square root of the value obtained in the previous step (this is the denominator in the
formula).
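The steps listed above can be followed one at a time in code. This is a minimal sketch that assumes the standard Pearson formula, in which the denominator is the square root of the product of the two sums of squared deviations. The x and y values are invented for illustration.

```python
from math import sqrt

# Hypothetical sample of paired x and y values
x = [2, 4, 6, 8, 10]
y = [1, 3, 7, 9, 15]

# Calculate the mean of each variable
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Subtract the mean from each value (the "a" and "b" variables in the steps)
a = [xi - mean_x for xi in x]
b = [yi - mean_y for yi in y]

# Numerator: sum of the products of corresponding a- and b-values
numerator = sum(ai * bi for ai, bi in zip(a, b))

# Denominator: square root of (sum of squared a-values * sum of squared b-values)
denominator = sqrt(sum(ai ** 2 for ai in a) * sum(bi ** 2 for bi in b))

r = numerator / denominator
print(f"numerator = {numerator}, denominator = {denominator:.3f}, r = {r:.4f}")
```

For this sample the numerator is 68 and the two sums of squares are 40 and 120, which gives r of roughly 0.98.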
You can see that the manual calculation of the correlation coefficient is an extremely tedious process,
especially if the data sample is large. However, there are many software tools that can help you save time
when calculating the coefficient. The ‘CORREL’ function of MS Excel returns the correlation coefficient of two cell
ranges.
Example of Correlation
X is an investor who invests money in the share market. His portfolio primarily tracks the performance of
the S&P 500 (a stock market index that measures the performance of 500 large
companies in the USA).
X wants to add the stock of Apple Inc. Before adding Apple to his portfolio, he wants to assess the
correlation between the stock and the S&P 500 to ensure that adding the stock won’t increase the
systematic risk of his portfolio.
To find the coefficient, X gathers the following prices from the last five years (Step 1):
Using the formula above, X can determine the correlation between the prices of the S&P 500 Index and
Apple Inc.
Next, X calculates the average prices of each security for the given periods (Step 2):
After the calculation of the average prices, we can find the other values. A summary of the
calculations is given in the table below:
The coefficient indicates that the prices of the S&P 500 and Apple Inc. have a high positive correlation. This
means that their respective prices tend to move in the same direction. Therefore, adding Apple to his portfolio
would, in fact, increase the level of systematic risk.
2.2 Regression
With correlation, we determined how much two sets of numbers changed together. With regression, we
use one set of numbers to make a prediction of the value in the other set. Correlation is part of what we
need for regression. But we also need to know how much each set of numbers changes individually, via the
standard deviation, and where we should put the line, i.e. the intercept.
The regression that we are calculating is very similar to correlation. So you might ask, why do we have both
regression and correlation? It turns out that regression and correlation give related but distinct information.
Correlation gives you a measurement that can be interpreted independently of the scale of the two
variables. Correlation is always bounded by ±1. The closer the correlation is to ±1 the closer the two
variables are to a perfectly linear relationship.
The regression slope by itself does not tell you that. The regression slope tells you the expected
change in the dependent variable y when the independent variable x changes one unit. That
information cannot be calculated from the correlation alone.
A consequence of those two points is that correlation is a unit-less value, while the slope of the regression line has
units. If, for instance, you owned a large business and were doing an analysis of the amount of revenue in
each region compared to the number of salespeople in that region, you would get a unit-less result with
correlation, and with regression, you would get a result in units of money per salesperson.
Regression Equations
With linear regression, we are trying to solve for the equation of a line, which is shown below.
Y = a + bx
The values that we need to solve for are ‘b’, the slope of the line, and ‘a’, the intercept of the line. The hardest
part of calculating the slope ‘b’ is finding the correlation between x and y, which we have already done. The
only modification that needs to be made to that correlation is multiplying it by the ratio of the standard
deviation of y to the standard deviation of x, which we also already calculated when finding the correlation. The equation for the slope
is
$b = r \, \dfrac{S_y}{S_x}$
Once we have the slope, getting the intercept is easy. Assuming that you are using the standard equations
for correlation and standard deviation, the line goes through the averages of x and y (x̄,ȳ), so the equation for the
intercept is
$a = \bar{y} - b\bar{x}$
The best way to determine whether this is a simple linear regression problem is to plot Marks against Hours.
If the plot looks like the one below, it may be inferred that a linear model can be used for this problem.
The data represented in the above plot would be used to find a line such as the following, which
represents the best-fit line. The slope of the best-fit line would be the value of “m”.
The value of m (slope of the line) can be determined using an objective function, which in general is a combination of a loss
function and a regularization term. For simple linear regression, the objective function is
the Mean Squared Error (MSE). MSE is the average of the squared differences between the target
variable (actual marks) and the predicted values (marks calculated using the above equation). The best-fit line
is obtained by minimizing the objective function (the mean squared error).
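The idea that the best-fit line minimizes the MSE can be checked numerically. The sketch below assumes the line is written as marks = m × hours + c, where the intercept name c is an assumption, and uses invented data. It compares the MSE of the least squares line with the MSE of a few nearby lines.

```python
def mse(xs, ys, m, c):
    """Mean of the squared differences between actual and predicted values."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def best_fit(xs, ys):
    """Slope m and intercept c of the least squares line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return m, mean_y - m * mean_x

hours = [1, 2, 3, 4, 5]        # hypothetical hours studied
marks = [20, 40, 50, 40, 50]   # hypothetical marks obtained

m, c = best_fit(hours, marks)
best_mse = mse(hours, marks, m, c)

# MSE of lines with a slightly different slope or intercept
nearby = [
    mse(hours, marks, m + 1.0, c),
    mse(hours, marks, m - 1.0, c),
    mse(hours, marks, m, c + 5.0),
    mse(hours, marks, m, c - 5.0),
]
print(f"m = {m:.1f}, c = {c:.1f}, MSE = {best_mse:.1f}")
print("MSE of nearby lines:", [round(v, 1) for v in nearby])
```

Every nearby line has a larger MSE than the least squares line, which is what minimizing the objective function means.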
A statistics instructor at a university would like to examine the relationship (if any) between the number of
optional homework problems students do during the semester and their final course grade. She randomly
selects 12 students for study and asks them to keep track of the number of these problems completed during
the course of the semester. At the end of the class each student’s total is recorded along with their final grade.
The data is available in the following table:
51 62 3162
58 68 3944
62 66 4092
65 66 4290
68 67 4556
76 72 5472
77 73 5621
78 72 5616
78 78 6084
84 73 6132
85 76 6460
91 75 6825
7) Use the regression equation to predict a student’s final course grade if 75 optional homework
assignments are done.
8) Use the regression equation to compute the number of optional homework assignments that need to be
completed if a student expects a course grade of 85.
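Teachers may wish to check students' answers to questions 7 and 8 with a short script. The sketch below assumes that the first column of the data table is the number of optional homework problems completed (x), the second column is the final course grade (y), and the third column is the product xy. The column headers are not shown in the table, so this reading is an assumption.

```python
# Data from the table, under the assumed column meanings
problems = [51, 58, 62, 65, 68, 76, 77, 78, 78, 84, 85, 91]  # x
grades = [62, 68, 66, 66, 67, 72, 73, 72, 78, 73, 76, 75]    # y

n = len(problems)
sum_x = sum(problems)
sum_y = sum(grades)
sum_xy = sum(x * y for x, y in zip(problems, grades))  # total of the third column
sum_x2 = sum(x * x for x in problems)

# Least squares slope and intercept of the regression line of y on x
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Question 7: predicted grade for 75 completed problems
predicted_grade = a + b * 75

# Question 8: number of problems for which the line predicts a grade of 85
problems_needed = (85 - a) / b

print(f"grade = {a:.2f} + {b:.3f} * problems")
print(f"predicted grade for 75 problems: {predicted_grade:.1f}")
print(f"problems needed for a grade of 85: {problems_needed:.0f}")
```

The answer to question 8 lies well outside the observed range of 51 to 91 problems, so it is an extrapolation and is worth discussing with students as such.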
Problem 2
The following data set of the heights and weights of a random sample of 15 male students is acquired. Is
there any apparent relationship between the two variables?
1    5 ft 6 in   60 kg
2    5 ft 4 in   55 kg
3    5 ft 8 in   78 kg
5    5 ft 4 in   53 kg
6    5 ft 7 in   56 kg
7    5 ft 3 in   54 kg
9    5 ft 6 in   74 kg
10   5 ft 3 in   65 kg
11   5 ft 9 in   76 kg
14   5 ft 4 in   63 kg
15   5 ft 7 in   62 kg
Would you expect the same relationship (if any) to exist between the heights and weights of the opposite
sex?
Problem 3
From the following data of hours worked in a factory (x) and output units (y), determine the regression line
of y on x, the linear correlation coefficient and determine the type of correlation.
Hours (X) 80 79 83 84 78 60 82 85 79 84 80 62
Production (Y) 300 302 315 330 300 250 300 340 315 330 310 240
Problem 4
The height (in cm) and weight (in kg) of 10 basketball players on a team are as below:
Height (X) 186 189 190 192 193 193 198 201 203 205
Weight (Y) 85 85 86 90 87 91 93 103 100 101
Calculate: