Unit 8 Regression

LEVEL 2: AI INQUIRED (AI APPLY) TEACHER INSTRUCTION MANUAL
Unit 8
Regression
Title: Regression
Approach: Problem Solving, Discussion, Team Activity, Case studies
Summary: Artificial Intelligence / Machine Learning has become prevalent in almost every aspect
of our life, society and business. People across different disciplines are trying to apply AI to be more
accurate and to have better control of the future. For example, economists are using AI to predict
future market prices to make a profit, doctors use AI to classify whether a tumour is malignant or
benign, meteorologists use AI to predict the weather, HR recruiters use AI to check the resume of
applicants to verify if the applicant meets the minimum criteria for the job, banks are using AI to
check paying capacity of the customers before loan disbursement.
The AI / ML algorithm that every AI learner starts with is a linear regression (and correlation)
algorithm. So let us learn the foundations of linear regression to build a solid base for the learning
of AI and ML.
Linear regression is a method for modelling the relationship between one or more independent
variables and a dependent variable. It is the foundation block of Machine Learning and Artificial
Intelligence. It is a form of predictive modelling technique that depicts the relationship between a
dependent (target) and the independent variables (predictors).
This technique is used for forecasting, time series modelling and finding the cause - effect
relationship between the variables.
Objectives:
1. To understand the difference between correlation and regression.
2. To understand what the Pearson correlation coefficient (r) measures.
3. To understand how regression analysis is used to predict outcome.
4. To understand the main features and characteristics of the Pearson r.
Learning Outcomes:
1. Students should be able to estimate the correlation coefficient for a given data set
2. Students should be able to estimate the line of best fit for a given data set
3. Students should be able to determine whether a regression model is significant
Pre-requisites:
1. Students must be able to plot points on the Cartesian coordinate system
2. They should have basic understanding of statistics and central tendencies
Key Concepts: Regression, Correlation, Pearson’s r
1. Regression and Correlation

Regression can be defined as a method or an algorithm in Machine Learning that models a target value
based on independent predictors. It is essentially a statistical tool used in finding out the relationship
between a dependent variable and an independent variable. This method comes into play in forecasting
and in finding cause-and-effect relationships between variables.
Regression techniques differ based on:
1. The number of independent variables
2. The type of relationship between the independent and dependent variable
Regression is basically performed when the dependent variable is of a continuous data type. The
independent variables, however, could be of any data type — continuous, nominal/categorical etc.
Regression methods find the most accurate line describing the relationship between the dependent
variable and predictors with least error. In regression, the dependent variable is the function of the
independent variable and the coefficient and the error term.
Correlation is a measure of the strength of a linear relationship between two quantitative variables
(e.g. price, sales).
• Correlation is positive when the values increase together
• Correlation is negative when one value decreases as the other increases
A correlation is assumed to be linear, i.e. following a line.
Correlation can take any value from -1 to 1:
• 1 is a perfect positive correlation
• 0 is no correlation (the values don't seem linked at all)
• -1 is a perfect negative correlation
The value shows how good the correlation is (not how steep the line is), and if it is positive or negative.
1.1 Crosstabs and Scatterplots

1.1.1 Crosstabs
Cross tabs help us establish a relationship between two variables. This relationship is exhibited in a tabular form.
The table below is a crosstab that shows, by age, whether somebody has an unlisted phone number.
• Each cell of the table shows the number of observations with that combination of values of the two
variables
• We can see, for example, that there are 185 people aged 18 to 34 years who do not have an unlisted phone
number.
• Column percentages are also shown (these are percentages within the columns, so that each column’s
percentages add up to 100%); for example, 24% of all the people without an unlisted phone number are
aged 18 to 34 years.
• The age distribution for people without unlisted numbers is different from that for people with
unlisted numbers. In other words, the crosstab reveals a relationship between the two variables: people
with unlisted phone numbers are more likely to be younger.
• Thus, we can also say that the variables used to create this table are correlated. If there were
no relationship between these two categorical variables, we would say that they were not
correlated.
In this example, the two variables can both be viewed as being ordered. Consequently, we can
potentially describe the pattern as a positive or negative correlation (negative in the table
shown). However, where the variables are not both ordered, we can refer only to the strength of the
correlation, not to its direction (i.e., whether it is positive or negative).
1.1.2 Scatterplots
A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two different numeric
variables. The position of each dot on the horizontal and vertical axis indicates values for an individual
data point. Scatter plots are used to observe relationships between variables.
Example
This is a scatter plot showing the amount of sleep needed per day by age.
As seen above, as you grow older, you need less sleep (but still probably more than you’re currently
getting).
Question: What type of correlation is shown here?
Answer: This is a negative correlation. As we move along the x-axis toward the greater numbers,
the points move down which means the y-values are decreasing, making this a negative correlation.
1.2 Pearson’s r
The Pearson correlation coefficient is used to measure the strength of a linear association between
two variables, where the value r = 1 means a perfect positive correlation and the value r = -1 means a
perfect negative correlation. So, for example, you could use this test to find out whether people's
height and weight are correlated (the taller the people are, the heavier they're likely to be).
Requirements for Pearson's correlation coefficient are as follows:
• Scale of measurement should be interval or ratio
• Variables should be approximately normally distributed
• The association should be linear
• There should be no outliers in the data
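Equation
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$$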
What does this test do?

The Pearson product-moment correlation coefficient (or Pearson correlation coefficient, for short) is a
measure of the strength of a linear association between two variables and is denoted by ‘r’. Basically,
a Pearson product-moment correlation attempts to draw a line of best fit through the data of two
variables, and the Pearson correlation coefficient, r, indicates how far away all these data points are from
this line of best fit (i.e., how well the data points fit this new model/line of best fit).
What values can the Pearson correlation coefficient take?
The Pearson correlation coefficient, r, can take a range of values from +1 to -1. A value of 0 indicates
that there is no association between the two variables. A value greater than 0 indicates a positive
association; that is, as the value of one variable increases, so does the value of the other variable. A
value less than 0 indicates a negative association; that is, as the value of one variable increases, the
value of the other variable decreases. This is shown in the diagram below:
How can we determine the strength of association based on the Pearson correlation coefficient?
The stronger the association of the two variables, the closer the Pearson correlation coefficient, r, will
be to either +1 or -1 depending on whether the relationship is positive or negative, respectively.
Achieving a value of +1 or -1 means that all your data points are included on the line of best fit – there
are no data points that show any variation away from this line. Values for r between +1 and -1 (for
example, r = 0.8 or -0.4) indicate that there is variation around the line of best fit. The closer the value
of r to 0 the greater the variation around the line of best fit. Different relationships and their correlation
coefficients are shown in the diagram below:
Are there guidelines to interpreting the Pearson's correlation coefficient?

Yes, the following guidelines have been proposed:
Strength of Association   Coefficient, r (Positive)   Coefficient, r (Negative)
Small                     0.1 to 0.3                  -0.1 to -0.3
Medium                    0.3 to 0.5                  -0.3 to -0.5
Large                     0.5 to 1.0                  -0.5 to -1.0
Remember that these values are guidelines and whether an association is strong or not will also depend on
what you are measuring.
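The guideline bands above can be turned into a small lookup, sketched here in Python. This is only an illustration: the function name `classify_strength` is hypothetical, the label "negligible" for |r| below 0.1 is an assumption (the table does not name that band), and a value that falls exactly on a boundary such as 0.3 is assumed to belong to the higher band.

```python
def classify_strength(r):
    """Label a Pearson r using the guideline bands (small / medium / large)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")

    size = abs(r)
    if size >= 0.5:
        strength = "large"
    elif size >= 0.3:      # assumption: a boundary value goes to the higher band
        strength = "medium"
    elif size >= 0.1:
        strength = "small"
    else:
        strength = "negligible"  # assumption: the guideline table has no band below 0.1

    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return strength, direction


for value in (0.35, -0.8, 0.05):
    print(value, classify_strength(value))
```

As the text notes, these bands are guidelines only; what counts as a strong association still depends on what is being measured.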
Example 1
In the example below of 6 people with different age and different weight, let us try calculating the value of the
Pearson r.
Solution:
For the Calculation of the Pearson Correlation Coefficient, we will first calculate the following values:
Here the total number of people is 6 so, n=6
Now the calculation of the Pearson r is as follows:
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$$
$$r = \frac{6 \times 13937 - 202 \times 409}{\sqrt{[6 \times 7280 - 202^2][6 \times 28365 - 409^2]}}$$
$$r = \frac{83622 - 82618}{\sqrt{[43680 - 40804][170190 - 167281]}}$$
$$r = \frac{1004}{\sqrt{2876 \times 2909}}$$
$$r = \frac{1004}{\sqrt{8366284}}$$
$$r = \frac{1004}{2892.452938}$$
$$r \approx 0.35$$
Thus the value of the Pearson correlation coefficient is 0.35
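The same arithmetic can be checked with a few lines of Python. This is a minimal sketch that starts from the sums used in the calculation above (n = 6, ∑x = 202, ∑y = 409, ∑xy = 13937, ∑x² = 7280, ∑y² = 28365); the function name `pearson_r_from_sums` is hypothetical.

```python
from math import sqrt

# Sums taken from the worked example above (6 people).
n = 6
sum_x, sum_y = 202, 409
sum_xy = 13937
sum_x2, sum_y2 = 7280, 28365


def pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson r computed with the raw-sums formula."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(r, 2))
```

Working from the sums keeps the check independent of the individual data values, so students can compare each intermediate figure with their hand calculation.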
Assumptions
There are four "assumptions" that underpin a Pearson's correlation. If any of these four assumptions are not
met, analysing your data using a Pearson's correlation might not lead to a valid result.
Assumption # 1: The two variables should be measured at the continuous level. Examples of such continuous
variables include height (measured in feet and inches), temperature (measured in °C), salary (measured in
dollars/INR), revision time (measured in hours), intelligence (measured using IQ score), reaction time (measured
in milliseconds), test performance (measured from 0 to 100), sales (measured in number of transactions per
month), and so forth.
Assumption # 2: There needs to be a linear relationship between your two variables. Whilst there are a number
of ways to check whether a linear relationship exists, we suggest creating a scatterplot using Stata, where
you can plot your two variables against each other. You can then visually inspect the scatterplot to check for
linearity. Your scatterplot may look something like one of the following:
Assumption #3: There should be no significant outliers. Outliers are simply single data points within your data
that do not follow the usual pattern (e.g. in a study of 100 students' IQ scores, where the mean score was 108
with only a small variation between students, one student had a score of 156, which is very unusual, and may
even put her in the top 1% of IQ scores globally). The following scatterplots highlight the potential impact of
outliers:
Pearson's r is sensitive to outliers, which can have a great impact on the line of best fit and the Pearson
correlation coefficient, leading to very different conclusions regarding your data. Therefore, it is best if there are
no outliers or they are kept to a minimum. Fortunately, you can use Stata to detect possible outliers
using scatterplots.
Assumption # 4: Your variables should be approximately normally distributed. In order to assess the statistical
significance of the Pearson correlation, you need to have bivariate normality, but this assumption is difficult to
assess, so a simpler method is more commonly used.
1.3 Regression – Finding The line

An analysis that involves more than one variable and models how one of them depends on the others
is called Regression Analysis. It generally focuses on predicting the value of the variable that is
dependent on the other.
Let there be two variables x and y. If y depends on x, then the result comes in the form of a simple regression.
Furthermore, we name the variables x and y as:
y – Regressand or Dependent Variable or Explained Variable
x – Independent Variable or Predictor or Explanator
Therefore, if we use a simple linear regression model where y depends on x, then the regression line
of y on x is:
y = a + bx
Regression Coefficient
The two constants a and b are regression parameters. Furthermore, we denote the
coefficient b as $b_{yx}$ and call it the regression coefficient of y on x.
The regression line of y on x can also be described as the line of best fit, because it is the line
obtained by the method of least squares. This method is well suited to estimating the value
of y from x, i.e. the value of a dependent variable from an independent variable.
Least Squares Method
The least squares method chooses a and b to minimize the sum of squared errors:
$$\sum e_i^2 = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - a - b x_i)^2$$
Here:
• $y_i$ is the actual (observed) value
• $\hat{y}_i = a + b x_i$ denotes the estimated value of $y_i$ for a given value $x_i$
• $e_i = y_i - \hat{y}_i$ is the difference between the observed and estimated values, called the error or residual
The regression line of y on x along with the estimation errors are as follows:
Minimizing the sum of squared errors with respect to a and b gives the following two equations, which we refer
to as the Normal Equations:
$$\sum y_i = na + b \sum x_i$$
$$\sum x_i y_i = a \sum x_i + b \sum x_i^2$$
We get the least squares estimates of a and b by solving the above two equations for a and b:
$$b = \frac{\mathrm{Cov}(x, y)}{S_x^2} = \frac{r S_x S_y}{S_x^2} = \frac{r S_y}{S_x}$$
The estimate of a, after the estimation of b, is:
$$a = \bar{y} - b\bar{x}$$
Substituting the estimates of a and b into y = a + bx gives:
$$\frac{y - \bar{y}}{S_y} = r\,\frac{x - \bar{x}}{S_x}$$
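To make the normal equations concrete, the sketch below solves them as a 2×2 linear system for a and b, then confirms that the fitted values satisfy both equations. The five data points are hypothetical and chosen only for illustration; they do not come from this manual.

```python
# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Normal equations:
#   sum_y  = n * a     + b * sum_x
#   sum_xy = a * sum_x + b * sum_x2
# Solve the 2x2 system with Cramer's rule.
det = n * sum_x2 - sum_x ** 2
b = (n * sum_xy - sum_x * sum_y) / det       # slope
a = (sum_y * sum_x2 - sum_x * sum_xy) / det  # intercept

print("a =", a, "b =", b)

# Both normal equations should hold (up to rounding) for the fitted a and b.
print(abs(sum_y - (n * a + b * sum_x)) < 1e-9)
print(abs(sum_xy - (a * sum_x + b * sum_x2)) < 1e-9)
```

The intercept found this way equals the mean of y minus b times the mean of x, which is the estimate of a given above.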
Sometimes, it might so happen that variable x depends on variable y. In such cases, the line of regression of x
on y is:
$$x = \hat{a} + \hat{b}y$$
Regression Equation
The standard form of the regression equation of variable x on y is:
$$\frac{x - \bar{x}}{S_x} = r\,\frac{y - \bar{y}}{S_y}$$
Question
The regression equations for variables x and y are 7x – 3y – 18 = 0 and 4x – y – 11 = 0.
1. What is the arithmetic mean (AM) of x and of y?
2. Find the correlation coefficient between x and y.
Solution
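(i) The two regression lines intersect at the point $(\bar{x}, \bar{y})$. Therefore, we replace x
and y with $\bar{x}$ and $\bar{y}$:
$$7\bar{x} - 3\bar{y} = 18$$
$$4\bar{x} - \bar{y} = 11$$
Hence, on solving these two equations we get $\bar{x} = 3$ and $\bar{y} = 1$.
(ii) Taking 7x – 3y – 18 = 0 as the regression line of y on x gives y = (7/3)x – 6, so $b_{yx} = 7/3$.
Taking 4x – y – 11 = 0 as the regression line of x on y gives x = (1/4)y + 11/4, so $b_{xy} = 1/4$.
(The opposite assignment would give $b_{yx} b_{xy} = 12/7 > 1$, which is impossible.)
We know that $r^2 = b_{yx} b_{xy}$, so
$$r^2 = \frac{7}{3} \times \frac{1}{4} = \frac{7}{12}$$
Therefore,
$$r = \sqrt{7/12} \approx 0.7638$$
(r is positive as both the regression coefficients are positive)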
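The solution above can be double-checked numerically. The sketch below uses exact fractions and assumes the assignment that makes the product of the two regression coefficients at most 1 (7x – 3y – 18 = 0 as the line of y on x, and 4x – y – 11 = 0 as the line of x on y). All variable names are illustrative.

```python
from fractions import Fraction
from math import sqrt

# (i) Means: solve 7x - 3y = 18 and 4x - y = 11 with Cramer's rule.
a1, b1, c1 = 7, -3, 18
a2, b2, c2 = 4, -1, 11
det = a1 * b2 - a2 * b1
mean_x = Fraction(c1 * b2 - c2 * b1, det)
mean_y = Fraction(a1 * c2 - a2 * c1, det)

# (ii) Regression coefficients read off the two lines.
b_yx = Fraction(7, 3)  # y on x: y = (7/3)x - 6
b_xy = Fraction(1, 4)  # x on y: x = (1/4)y + 11/4

r_squared = b_yx * b_xy
r = sqrt(r_squared)    # positive root, since both coefficients are positive

print(mean_x, mean_y, r_squared, round(r, 4))
```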
1.4 Regression – Describing the line

Definition: In statistics, a regression line is a line that best describes the behaviour of a set of data. In other
words, it’s a line that best fits the trend of a given data.
What Does Regression Line Mean?
Regression lines are very useful for forecasting procedures. The purpose of the line is to describe the
interrelation of a dependent variable (Y variable) with one or many independent variables (X variable). By using
the equation obtained from the regression line an analyst can forecast future behaviours of the dependent
variable by inputting different values for the independent ones. Regression lines are widely used in the financial
sector and in business in general.
Financial analysts employ linear regressions to forecast stock prices, commodity prices and to perform
valuations for many different securities. On the other hand, companies employ regressions for the purpose of
forecasting sales, inventories and many other variables that are crucial for strategy and planning.
The simple regression line formula is:
(Y = a + bX + u)
The multiple regression formula looks like this:
(Y = a + b1X1 + b2X2 + b3X3 + … + btXt + u)
where:
Y is the dependent variable
X is the independent variable (X1, …, Xt in the multiple case)
a is the intercept
b is the slope
u is the regression residual (error term)
Example: 1
Data was collected on the “depth of dive” and the “duration of dive” of penguins. The following linear model is
a fairly good summary of the data:
Where:
• t is the duration of the dive in minutes
• d is the depth of the dive in yards
The equation for the model is: d = 0.015 + 2.915t
Interpretation of the slope: If the duration of the dive increases by 1 minute, we predict the depth of the
dive will increase by approximately 2.915 yards.
Interpretation of the intercept: If the duration of the dive is 0 minutes, then we predict the depth of the dive
is 0.015 yards.
Comments: The interpretation of the intercept doesn’t make sense in the real world. It isn’t reasonable for
the duration of a dive to be near t = 0, because that’s too short for a dive. If data with x-values near zero
wouldn’t make sense, then usually the interpretation of the intercept won’t seem realistic in the real world.
It is, however, acceptable (even required) to interpret this as a coefficient in the model.
Example: 2
Reinforced concrete buildings have steel frames. One of the main factors affecting the durability of these
buildings is carbonation of the concrete (caused by a chemical reaction that changes the pH of the concrete),
which then corrodes the steel reinforcing the building.
Data is collected on specimens of the core taken from such buildings, where the following are measured:
• Depth of the carbonation (in mm) is called d
• Strength of the concrete (in MPa) is called s
It is found that the model is s = 24.5 – 2.8d
Interpretation of the slope: If the depth of the carbonation increases by 1 mm, then the model predicts that the
strength of the concrete will decrease by approximately 2.8 MPa.
Interpretation of the intercept: If the depth of the carbonation is 0, then the model predicts that the strength
of the concrete is approximately 24.5 MPa.
Comments: Notice that it isn’t necessary to fully understand the units in which the variables are measured in
order to correctly interpret these coefficients. While it is good to understand data thoroughly, it is also important
to understand the structure of linear models. In this model, notice that the strength decreases as the
carbonation increases, which is shown by the negative slope coefficient. When you interpret a negative slope,
notice that you must say that, as the explanatory variable increases, then the response variable decreases.
Example: 3
When cigarettes are burned, one by-product in the smoke is carbon monoxide. Data is collected to determine
whether the carbon monoxide emission can be predicted by the nicotine level of the cigarette.
• It is determined that the relationship is approximately linear when we predict carbon monoxide, C,
from the nicotine level, N
• Both variables are measured in milligrams
• The formula for the model is C = 3.0 + 10.3N
Interpretation of the slope: If the amount of nicotine goes up by 1 mg, then we predict the amount of carbon
monoxide in the smoke will increase by 10.3 mg.
Interpretation of the intercept: If the amount of nicotine is zero, then we predict that the amount of carbon
monoxide in the smoke will be about 3.0 mg.
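The slope and intercept interpretations in the three examples can be checked by evaluating each model directly. The sketch below codes the three fitted equations as plain functions; the function names are illustrative and not part of the original examples.

```python
def penguin_depth(t_minutes):
    """Example 1: predicted dive depth in yards from dive duration in minutes."""
    return 0.015 + 2.915 * t_minutes


def concrete_strength(d_mm):
    """Example 2: predicted concrete strength in MPa from carbonation depth in mm."""
    return 24.5 - 2.8 * d_mm


def carbon_monoxide(n_mg):
    """Example 3: predicted carbon monoxide in mg from nicotine in mg."""
    return 3.0 + 10.3 * n_mg


# Intercept: the prediction when the explanatory variable is 0.
print(penguin_depth(0), concrete_strength(0), carbon_monoxide(0))

# Slope: the change in the prediction when the explanatory variable rises by 1 unit.
print(penguin_depth(3) - penguin_depth(2))
print(concrete_strength(3) - concrete_strength(2))
print(carbon_monoxide(1) - carbon_monoxide(0))
```

The second difference is negative, which matches the statement that strength decreases as carbonation depth increases.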
1.5 Correlation is not Causation

Correlation and causation are terms which are often misunderstood and used interchangeably.
Understanding both statistical terms is very important, not only for drawing conclusions but, more importantly,
for drawing correct conclusions. In this section we will understand why correlation does not imply
causation.
Correlation is a statistical technique which tells us how strongly a pair of variables are linearly related and
change together. It does not tell us the why and how behind the relationship; it just says the relationship exists.
Example: Correlation between ice cream sales and sunglasses sold.
As the sales of ice cream increase, so do the sales of sunglasses.
Causation takes a step further than correlation. It says any change in the value of one variable will cause a
change in the value of another variable, which means one variable makes the other happen. It is also referred
to as cause and effect.
Two or more variables are considered to be related, in a statistical context, if their values change so that as the
value of one variable increases or decreases so does the value of the other variable (it may be in the same or
opposite direction).
For example,
• For the two variables "hours worked" and "income earned" there is a relationship between the two
such that an increase in hours worked is associated with an increase in income earned.
• If we consider the two variables "price" and "purchasing power", as the price of goods increases a
person's ability to buy these goods decreases (assuming a constant income).
Therefore:
• Correlation is a statistical measure (expressed as a number) that describes the size and direction of a
relationship between two or more variables.
• A correlation between variables, however, does not automatically mean that the change in one variable
is the cause of the change in the values of the other variable.
• Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a
causal relationship between the two events. This is also referred to as cause and effect.
Theoretically, the difference between the two types of relationships is easy to identify: an action or
occurrence can cause another (e.g. smoking causes an increase in the risk of developing lung cancer), or it
can correlate with another (e.g. smoking is correlated with alcoholism, but it does not cause alcoholism). In
practice, however, it remains difficult to clearly establish cause and effect, compared to establishing correlation.
1.6 Contingency Tables – Examples

A contingency table provides a way of portraying data that can facilitate calculating probabilities. The table
helps in determining conditional probabilities quite easily. The table displays sample values in relation to two
different variables that may be dependent or contingent on one another. Later on, we will use contingency
tables again, but in another manner.
Example 1
Suppose a study of speeding violations and drivers who use cell phones produced the following fictional data:
Cell Phone User   Speeding violation in the last year   No speeding violation in the last year   Total
Yes               25                                    280                                      305
No                45                                    405                                      450
Total             70                                    685                                      755
The total number of people in the sample is 755. The row totals are 305 and 450. The column totals are 70 and
685. Notice that 305 + 450 = 755 and 70 + 685 = 755.
Calculate the following probabilities using the table:
1. Find P (Person is a cell phone user)
Number of cell phone users / Total number in study = 305 / 755
2. Find P (person had no violation in the last year)
Number of no violations / Total number in study = 685 / 755
3. Find P (Person had no violation in the last year AND was a cell phone user)
Number of cell phone users with no violation / Total number in study = 280/755
4. Find P (Person is a cell phone user OR person had no violation in the last year)
(305 / 755 + 685 / 755) − 280 / 755= 710 / 755
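The four probabilities above can be reproduced from the cell counts of the table. The following is a minimal sketch using exact fractions; the dictionary layout and key names are illustrative choices, not part of the original example.

```python
from fractions import Fraction

# Cell counts from the fictional cell phone / speeding table above.
counts = {
    ("user", "violation"): 25,
    ("user", "no_violation"): 280,
    ("non_user", "violation"): 45,
    ("non_user", "no_violation"): 405,
}
total = sum(counts.values())

users = counts[("user", "violation")] + counts[("user", "no_violation")]
no_violation = counts[("user", "no_violation")] + counts[("non_user", "no_violation")]

p_user = Fraction(users, total)
p_no_violation = Fraction(no_violation, total)
p_user_and_no_violation = Fraction(counts[("user", "no_violation")], total)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_user_or_no_violation = p_user + p_no_violation - p_user_and_no_violation

print(p_user, p_no_violation, p_user_and_no_violation, p_user_or_no_violation)
```

Note that `Fraction` prints each result in lowest terms, so 305/755 is shown as 61/151; the values are the same.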

Example 2
This table shows a random sample of 100 hikers and the areas of hiking they prefer.
Hiking Area Preference
Sex      The Coastline   Near Lakes and Streams   On Mountain Peaks   Total
Female   18              16                       ___                 45
Male     ___             ___                      14                  55
Total    ___             41                       ___                 ___
1. Complete the above table
Hiking Area Preference
Sex      The Coastline   Near Lakes and Streams   On Mountain Peaks   Total
Female   18              16                       11                  45
Male     16              25                       14                  55
Total    34              41                       25                  100
2. Find the probability that a person is male given that the person prefers hiking near lakes and streams
Hint:
Let M = being male, and let L = prefers hiking near lakes and streams.
1. What word tells you this is conditional?
2. Fill in the blanks and calculate the probability: P(___|___) = ___.
3. Is the sample space for this problem all 100 hikers? If not, what is it?
Answer
1. The word “given” tells you that this is a conditional.
2. P(M|L) = 25/41
3. No, the sample space for this problem is the 41 hikers who prefer lakes and streams.
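The conditional probability in this answer can be computed directly from the completed table by restricting the sample space to the "lakes and streams" column. The sketch below assumes the completed counts shown above; the dictionary keys are illustrative names.

```python
from fractions import Fraction

# Completed hiking table (rows: sex, columns: preferred hiking area).
hikers = {
    "female": {"coastline": 18, "lakes_streams": 16, "mountain_peaks": 11},
    "male": {"coastline": 16, "lakes_streams": 25, "mountain_peaks": 14},
}

# Reduced sample space: only hikers who prefer lakes and streams.
prefers_lakes = sum(row["lakes_streams"] for row in hikers.values())

# P(M | L) = (male hikers who prefer lakes) / (all hikers who prefer lakes)
p_male_given_lakes = Fraction(hikers["male"]["lakes_streams"], prefers_lakes)

print(prefers_lakes, p_male_given_lakes)
```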
2. Reading
2.1 Correlation
Correlation is a measure of how closely two variables move together. Pearson’s correlation coefficient is a
common measure of correlation, and it ranges from +1 for two variables that are perfectly in sync with each
other, to 0 when they have no correlation, to -1 when the two variables are moving opposite to each other.
For linear regression, one way of calculating the slope of the regression line uses Pearson’s correlation, so it
is worth understanding what correlation is.
The equation for a line is
y = a + bx
where a = the intercept and b = the slope.
How to Find the Correlation?
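The correlation coefficient that indicates the strength of the relationship between two variables can be
found using the following formula:
$$r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
Where: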
rxy – the correlation coefficient of the linear relationship between the variables x and y
xi – the values of the x-variable in a sample
x̅ – the mean of the values of the x-variable
yi – the values of the y-variable in a sample
ȳ – the mean of the values of the y-variable
In order to calculate the correlation coefficient using the formula above, you must undertake the following
steps:
1. Obtain a data sample with the values of x-variable and y-variable.
2. Calculate the means (averages) x̅ for the x-variable and ȳ for the y-variable.
3. For the x-variable, subtract the mean from each value of the x-variable (let’s call this new variable
“a”). Do the same for the y-variable (let’s call this variable “b”).
4. Multiply each a-value by the corresponding b-value and find the sum of these multiplications (the
final value is the numerator in the formula).
5. Square each a-value and calculate the sum of the results. Do the same for the b-values.
6. Multiply the two sums from step 5 and find the square root of the product (this is the denominator
in the formula).
7. Divide the value obtained in step 4 by the value obtained in step 6.
You can see that the manual calculation of the correlation coefficient is an extremely tedious process,
especially if the data sample is large. However, there are many software tools that can help you save time
when calculating the coefficient. The ‘CORREL’ function of MS Excel returns the correlation coefficient of two cell
ranges.
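For readers who prefer code to a spreadsheet, the manual steps listed above translate almost line for line into Python. This is a minimal sketch, not a library routine: the function name `correlation_by_steps` is hypothetical, and the sample data is made up purely to show a perfect positive and a perfect negative relationship.

```python
from math import sqrt


def correlation_by_steps(xs, ys):
    """Pearson correlation following the manual steps (means, deviations, sums)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 values")

    # Step 2: means of each variable.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Step 3: deviations from the mean (the "a" and "b" variables).
    a = [x - mean_x for x in xs]
    b = [y - mean_y for y in ys]

    # Step 4: numerator, the sum of the products of paired deviations.
    numerator = sum(ai * bi for ai, bi in zip(a, b))

    # Steps 5 and 6: denominator, the square root of the product of the
    # two sums of squared deviations.
    denominator = sqrt(sum(ai ** 2 for ai in a) * sum(bi ** 2 for bi in b))

    # Final step: divide the numerator by the denominator.
    return numerator / denominator


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
print(correlation_by_steps(xs, [2, 4, 6, 8, 10]))   # y rises with x
print(correlation_by_steps(xs, [10, 8, 6, 4, 2]))   # y falls as x rises
```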
Example of Correlation
• X is an investor who invests money in the share market. His portfolio primarily tracks the performance of
the S&P 500 (a stock market index that measures the performance of 500 large
companies in the USA).
• X wants to add the stock of Apple Inc. Before adding Apple to his portfolio, he wants to assess the
correlation between the stock and the S&P 500 to ensure that adding the stock won’t increase the
systematic risk of his portfolio.
• To find the coefficient, X gathers the following prices from the last five years (Step 1):
Using the formula above, X can determine the correlation between the prices of the S&P 500 Index and
Apple Inc.
• Next, X calculates the average prices of each security for the given periods (Step 2):
• After the calculation of the average prices, we can find the other values. A summary of the
calculations is given in the table below:
• Using the obtained numbers, X can calculate the coefficient:
The coefficient indicates that the prices of the S&P 500 and Apple Inc. have a high positive correlation. This
means that their respective prices tend to move in the same direction. Therefore, adding Apple to his portfolio
would, in fact, increase the level of systematic risk.
2.2 Regression
With correlation, we determined how much two sets of numbers changed together. With regression, we want
to use one set of numbers to make a prediction on the value in the other set. Correlation is part of what we
need for regression. But we also need to know how much each set of numbers change individually, via the
standard deviation, and where we should put the line, i.e. the intercept.
The regression that we are calculating is very similar to correlation. So you might ask, why do we have both
regression and correlation? It turns out that regression and correlation give related but distinct information.
• Correlation gives you a measurement that can be interpreted independently of the scale of the two
variables. Correlation is always bounded by ±1. The closer the correlation is to ±1 the closer the two
variables are to a perfectly linear relationship.
• The regression slope by itself does not tell you that. The regression slope tells you the expected
change in the dependent variable y when the independent variable x changes one unit. That
information cannot be calculated from the correlation alone.
A consequence of those two points is that correlation is a unit-less value, while the slope of the regression line has
units. If for instance, you owned a large business and were doing an analysis on the amount of revenue in
each region compared to the number of salespeople in that region, you would get a unit-less result with
correlation, and with regression, you would get a result that was the amount of money per person.
Regression Equations
With linear regression, we are trying to solve for the equation of a line, which is shown below.
y = a + bx
The values that we need to solve for are ‘b’, the slope of the line, and ‘a’, the intercept of the line. The hardest
part of calculating the slope ‘b’ is finding the correlation between x and y, which we have already done. The
only modification that needs to be made to that correlation is multiplying it by the ratio of the standard
deviations of y and x, which we also already calculated when finding the correlation. The equation for the slope
is:
$$b = r\,\frac{S_y}{S_x}$$
Once we have the slope, getting the intercept is easy. With the standard equations for correlation and
standard deviation, the regression line passes through the point of averages (x̄, ȳ), so the equation for the
intercept is:
$$a = \bar{y} - b\bar{x}$$
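The route described here (correlation first, then slope, then intercept) can be sketched in a few lines of Python using the standard `statistics` module. The hours and marks below are hypothetical values invented for illustration, and population standard deviations (`pstdev`) are assumed for both variables so that the ratio is consistent.

```python
from statistics import mean, pstdev

# Hypothetical data, for illustration only.
hours = [2, 4, 6, 8, 10]
marks = [35, 50, 55, 70, 80]

mean_x, mean_y = mean(hours), mean(marks)
s_x, s_y = pstdev(hours), pstdev(marks)

# Correlation: population covariance divided by the product of the standard deviations.
covariance = mean([(x - mean_x) * (y - mean_y) for x, y in zip(hours, marks)])
r = covariance / (s_x * s_y)

# Slope: correlation scaled by the ratio of standard deviations (y over x).
b = r * s_y / s_x

# Intercept: forces the line through the point of averages.
a = mean_y - b * mean_x

print("r =", round(r, 4), "slope =", round(b, 4), "intercept =", round(a, 4))
print("predicted marks for 7 hours:", round(a + b * 7, 2))
```

Note that r has no units, while the slope here is in marks per hour, which is the distinction drawn in the Regression section above.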
Simple Linear Model for Predicting Marks

Let’s consider the problem of predicting the marks of a student based on the number of hours he/she put in
towards preparation. Although at the outset, it may look like a problem which can be modelled using simple
linear regression, it could turn out to be a multiple linear regression problem depending on multiple input
features. Alternatively, it may also turn out to be a non-linear problem. However, for the sake of example,
let’s consider this as a simple linear regression problem.
• Let’s assume for the sake of understanding that the marks of a student (M) do depend on the
number of hours (H) he/she has put in towards preparation.
The following formula can represent the model:
Marks = function (No. of hours)
=> Marks = m*Hours + c
The best way to determine whether it is a simple linear regression problem is to do a plot of Marks vs Hours.
If the plot comes like below, it may be inferred that a linear model can be used for this problem.
Plot representing a simple linear model for predicting marks
The data represented in the above plot would be used to find out a line such as the following which
represents a best-fit line. The slope of the best-fit line would be the value of “m”.
Plot representing a simple linear model with a regression line
The values of m (the slope of the line) and c (the intercept) can be determined by minimizing an objective function,
which in general is a combination of a loss function and a regularization term. For simple linear regression, the
objective function is the Mean Squared Error (MSE), with no regularization term. MSE is the mean of the squared
differences between the target variable (actual marks) and the predicted values (marks calculated using the above
equation). The best-fit line is obtained by minimizing this objective function.
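To see what "minimizing the mean squared error" means in practice, the sketch below evaluates the MSE of several candidate lines on a small hypothetical data set. For this made-up data the least squares line happens to be Marks = 5.5*Hours + 25, so every other candidate should score worse. The function name `mean_squared_error` and the data are illustrative assumptions.

```python
def mean_squared_error(m, c, hours, marks):
    """Mean of squared differences between actual marks and m*hours + c."""
    squared_errors = [(y - (m * h + c)) ** 2 for h, y in zip(hours, marks)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical data, for illustration only.
hours = [2, 4, 6, 8, 10]
marks = [35, 50, 55, 70, 80]

# Least squares line for this data: slope 5.5, intercept 25.
best_mse = mean_squared_error(5.5, 25.0, hours, marks)
print("best-fit line MSE:", best_mse)

# Nearby candidate lines all give a larger MSE.
candidates = [(5.0, 25.0), (6.0, 25.0), (5.5, 20.0), (5.5, 30.0)]
for m, c in candidates:
    print(m, c, mean_squared_error(m, c, hours, marks))
```

Comparing a handful of candidates is only a demonstration; in practice the minimizing values of m and c are obtained directly from the least squares formulas.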
2.3 Practice Exercise

Problem 1
A statistics instructor at a university would like to examine the relationship (if any) between the number of
optional homework problems students do during the semester and their final course grade. She randomly
selects 12 students for study and asks them to keep track of the number of these problems completed during
the course of the semester. At the end of the class each student’s total is recorded along with their final grade.
The data is available in the following table:
Final Course Grade vs. the Number of Optional Homework Problems Completed

No. of problems    Final course    Problems × Grade
completed          grade
      51               62               3162
      58               68               3944
      62               66               4092
      65               66               4290
      68               67               4556
      76               72               5472
      77               73               5621
      78               72               5616
      78               78               6084
      84               73               6132
      85               76               6460
      91               75               6825
ΣPrb = 873         ΣGrd = 848      ΣPrb × Grd = 62254
1) For this setting, identify the response variable.
2) For this setting, identify the predictor variable.
3) Compute the linear correlation coefficient, r, for this data set.
4) Classify the direction and strength of the correlation.
5) Test the hypothesis for a significant linear correlation.
6) What is the valid prediction range for this setting?
7) Use the regression equation to predict a student’s final course grade if 75 optional homework
assignments are done.
8) Use the regression equation to compute the number of optional homework assignments that need to be
completed if a student expects a course grade of 85.
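For teachers who want to check the arithmetic for parts 3, 7 and 8, the following Python sketch applies the sum-based Pearson formula r = (nΣxy − ΣxΣy) / √([nΣx² − (Σx)²][nΣy² − (Σy)²]) to the table above. The variable names are illustrative choices, and part 8 is handled by simply inverting the regression equation of grade on problems, as the question asks; this is one possible approach rather than the only one.

```python
from math import sqrt

# Data from the Problem 1 table.
problems = [51, 58, 62, 65, 68, 76, 77, 78, 78, 84, 85, 91]
grades = [62, 68, 66, 66, 67, 72, 73, 72, 78, 73, 76, 75]

n = len(problems)
sum_x = sum(problems)                                  # 873 in the table
sum_y = sum(grades)                                    # 848 in the table
sum_xy = sum(x * y for x, y in zip(problems, grades))  # 62254 in the table
sum_x2 = sum(x * x for x in problems)
sum_y2 = sum(y * y for y in grades)

# Pearson correlation coefficient from the raw sums.
numerator = n * sum_xy - sum_x * sum_y
r = numerator / sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

# Regression line of grade on problems: grade = intercept + slope * problems.
slope = numerator / (n * sum_x2 - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n

# Part 7: predicted grade for 75 completed problems.
predicted_grade = intercept + slope * 75

# Part 8: invert the same equation for a target grade of 85.
problems_for_85 = (85 - intercept) / slope

print(f"r = {r:.3f}")
print(f"grade = {intercept:.2f} + {slope:.3f} * problems")
print(f"predicted grade for 75 problems = {predicted_grade:.1f}")
print(f"problems needed for a grade of 85 = {problems_for_85:.1f}")
```

Note that a grade of 85 is higher than any grade observed in the sample (the maximum is 78), so the answer to part 8 is an extrapolation and ties in with the question about the valid prediction range.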
Problem 2
The following data set of the heights and weights of a random sample of 15 male students was collected. Is
there any apparent relationship between the two variables?

S. No.   Height        Weight
1        5 ft 6 in     60 kg
4        5 ft 9 in     82 kg
8        5 ft 5 in     65 kg
12       5 ft 10 in    79 kg
13       5 ft 6 in     75 kg

Would you expect the same relationship (if any) to exist between the heights and weights of female
students?
Problem 3
From the following data of hours worked in a factory (x) and output units (y), determine the regression line
of y on x and the linear correlation coefficient, and state the type of correlation.

Hours (X)        80   79   83   84   78   60   82   85   79   84   80   62
Production (Y)  300  302  315  330  300  250  300  340  315  330  310  240
Problem 4
The height (in cm) and weight (in kg) of 10 basketball players on a team are given below:

Height (X)  186  189  190  192  193  193  198  201  203  205
Weight (Y)   85   85   86   90   87   91   93  103  100  101

Calculate:
i) The regression line of y on x.
ii) The coefficient of correlation.
iii) The estimated weight of a player who measures 208 cm.
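One way to check the three parts of Problem 4 is the deviation-from-mean form of the formulas, sketched below in Python. The function name regression_and_correlation is a hypothetical helper introduced for this illustration; it returns the intercept a, the slope b and the correlation coefficient r for the line y = a + b·x.

```python
from math import sqrt

# Data from the Problem 4 table.
heights = [186, 189, 190, 192, 193, 193, 198, 201, 203, 205]
weights = [85, 85, 86, 90, 87, 91, 93, 103, 100, 101]


def regression_and_correlation(x, y):
    """Return (a, b, r) for the regression line y = a + b*x of y on x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx                 # slope
    a = y_mean - b * x_mean       # intercept: the line passes through the means
    r = sxy / sqrt(sxx * syy)     # Pearson correlation coefficient
    return a, b, r


a, b, r = regression_and_correlation(heights, weights)
estimated_weight = a + b * 208

print(f"weight = {a:.2f} + {b:.3f} * height")
print(f"r = {r:.3f}")
print(f"estimated weight at 208 cm = {estimated_weight:.1f} kg")
```

With this data the slope comes out at roughly 1.02 kg per cm and r at roughly 0.94, a strong positive correlation. The 208 cm player is taller than anyone in the sample (the maximum is 205 cm), so the estimate of about 105 kg is an extrapolation and should be treated with caution.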