
Topic 4 ETC1000

The document discusses models of relationships between data, focusing on understanding causal relationships through inputs and outputs. It introduces the simple linear regression model, which describes the relationship between one input and one output using a linear equation, and explores tools like scatter plots, covariance, and correlation to analyze these relationships. The document emphasizes the importance of building models to explain why data behaves in certain ways, using examples such as the relationship between education and income.

Uploaded by

deepfriedbaby814
Copyright
© © All Rights Reserved

Topic 4

Models of Relationships between Data


So far in the earlier topics, we have focused on describing data – understanding what is happening with a set
of data, both in terms of categorical and numerical data.

Now we move to trying to explain why data behaves the way it does. What causes some people to earn more
than others? What causes some companies to be more profitable than others? What causes countries to have
better economic growth in some years than in others? These are why questions. In order to answer why
questions with evidence and data, we need to build models. These next two topics will discuss the most
commonly used model for understanding the causal relationships between variables.

The simple idea of a model is that we have inputs which cause an output. The output is the quantity or the
variable that we want to influence. The inputs are the factors that can make a difference to that variable, and
typically the inputs are things that we can have some control over. In contrast, the output is often not directly
controllable. The idea is that if you can understand the connection between the inputs and the output, and you
have a goal of influencing the output in a positive way, you achieve that goal by changing the levels of your
inputs. For example, you are a company wanting to make profit. Profit is your output, but you cannot simply
walk into work one day and say, let's make more profit. You have no direct control over that output. What you
can control is the products you sell, the quality of staff you have, the amount of marketing you do, your business
costs, these and other inputs that produce sales and in turn produce profit.

A model helps you understand the relationship between those inputs and the output. In this example, it will
help you to know where to focus your efforts and then in turn help you to decide the best levels of those inputs
to get the maximum profit.

A model is a set of equations that measures the relationship between inputs and output or outputs. In this Topic
we will introduce a very simple but powerful model, the linear regression model.

4.1 The Simple Linear Regression Model

This is the simplest model of all, where we have one input X and one output Y, and X and Y are related by a
linear equation.

We refer to this line capturing the relationship between X & Y as a regression line. We describe the line by
the equation,

\[ Y = \beta_0 + \beta_1 X \]

This should be familiar from your earlier studies of maths in high school. You would probably have seen this
equation written as: \( Y = mX + c \). Take a look at Figure 4.1.

Figure 4.1: The Equation of a Regression Line

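In the linear equation, \( \beta_0 \) is known as the intercept – the predicted value of Y when X = 0. That is easy to see
if we insert X = 0 into the equation:

\[ Y = \beta_0 + \beta_1 X = \beta_0 + \beta_1 \times 0 = \beta_0 \]

\( \beta_1 \) is the slope, the predicted change in Y for a 1 unit change in X. We can see that by combining the next two
equations. Start with the value of Y at some value of X:

\[ Y_{\text{old}} = \beta_0 + \beta_1 X \]

Now add 1 to X and we get:

\[ Y_{\text{new}} = \beta_0 + \beta_1 (X + 1) \]

Subtract the first equation from the second:

\[ Y_{\text{new}} - Y_{\text{old}} = \beta_0 + \beta_1 (X + 1) - (\beta_0 + \beta_1 X) = \beta_1 \]

So, \( \beta_1 \) is the predicted change in Y for a one unit change in X – the slope of the line in the figure above.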
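
The two interpretations above can also be checked numerically. The following is a minimal Python sketch; the function name `predict` and the coefficient values are hypothetical, chosen only for illustration and not taken from any data in this topic.

```python
def predict(x, b0, b1):
    """Return the value on the line Y = b0 + b1 * X."""
    return b0 + b1 * x

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = 2.0, 0.5

# Intercept: the predicted Y when X = 0.
intercept_check = predict(0, b0, b1)

# Slope: the change in predicted Y when X rises by one unit (here from 10 to 11).
slope_check = predict(11, b0, b1) - predict(10, b0, b1)

print(intercept_check)  # 2.0
print(slope_check)      # 0.5
```

Whatever starting value of X is used, the one-unit difference is the slope, because the line has the same steepness everywhere.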

Before we go into the details of this model and how you would quantify it, it is useful to develop some
exploratory tools that allow us to work out whether a potential X and Y are even related to each other. We
will explore that question by the use of a visual tool, the scatter diagram, and a quantitative measure, the
correlation.

4.2 Visualising Patterns in Bivariate Data

A scatter plot is a graph that allows us to visually see how two numerical variables relate to each other.

An example: Suppose we are interested in whether investing in education improves one’s income or not. Let
us turn back to the data discussed earlier. Here we have income, education and a bunch of other variables. A
snapshot of the data is shown in Figure 4.2.

Figure 4.2: Data Snapshot – US Income Data

One way of exploring the relationship between income and education is to draw a scatter plot. That is, we plot
matched pairs of income and education for each of the observations in our data. We will do this for just a
sample of 50 people to make the graph clearer. The result is shown in Figure 4.3.

Notice that we have put education on the X-axis and income on the Y-axis. In general, the variable or
characteristic that we have some control over goes on the X-axis (the Input), and the variable we want to
influence goes on the Y-axis (the Output). We think X might do a good job of predicting Y, or X might cause
Y.

Figure 4.3: Scatter Plot – Income and Education (Random Sample of 50 Observations)

The scatter plot in Figure 4.3 shows a general pattern: someone with more education tends to earn more than
someone with less education. As we move along the X-axis, education increases, and so does the income. Note
this is not true for all people in the sample – there are some with high levels of education who don’t earn very
much – perhaps they took Arts degrees. But our overall pattern suggests that in most cases, there does
seem to be a relationship between education and income, and this relationship is positive.

The scatter plot in general is used for exploring a few key questions:
• Is there a relationship between X and Y?
• What direction is the relationship?
• Does the relationship appear to be linear or non-linear?

As you gain more experience with scatter plots, you will learn to judge the answers to each of these questions,
so you can learn some basics about the relationship between X and Y.

Scatter plots are very useful for visualising relationships, but they have two limitations:

1. How we interpret them is somewhat subjective. One person may see a relationship between variables
X and Y while another person may not.

2. They don’t allow us to quantify the strength of the relationship between X and Y. We would like some
measure that tells us how closely X can predict Y, for example.

In the next two sections we consider ways of quantifying the relationship between two variables.

4.3 How Much Connection Is There Between X & Y?

We can quantify how (i.e. in what direction) two variables move together by a summary measure called the
covariance.

The Formula:
The sample covariance of two variables, X and Y, is given by the formula:

\[ Cov(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) \]

Note you have seen something a bit like this before. What is the covariance of X with X? It is just the variance,
Cov(X, X) = Var(X)! The covariance is an extension of the variance.

Roughly speaking, the covariance is the average of the products of paired deviations from the mean for the
two variables, X and Y. To see what that means, notice that inside the summation operator, the formula
involves subtracting the mean of X from each X data point, subtracting the mean of Y from each Y data point,
and multiplying each 'de-meaned' pair together.
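
As a rough illustration of the formula, here is a minimal Python sketch that computes the sample covariance step by step. The function name `sample_covariance` and the small data set are assumptions made up for this example; they are not the US income data used in this topic.

```python
import statistics

def sample_covariance(xs, ys):
    """Sum of the paired de-meaned products, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Made-up data for illustration only (years of education, income in $000s).
education = [10, 12, 12, 14, 16, 18]
income = [25, 30, 28, 45, 52, 60]

cov_xy = sample_covariance(education, income)
cov_xx = sample_covariance(education, education)

print(cov_xy)                                    # positive: the two move together
print(cov_xx, statistics.variance(education))    # Cov(X, X) matches Var(X)
```

The second print line checks the point made above: the covariance of a variable with itself is its sample variance.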

Figure 4.4: Scatter Plot and Covariance

To see how this works visually, first consider where the mean of X and the mean of Y lie in a scatter plot of
some illustrative data shown in Figure 4.4. The means are shown by the red lines in Figure 4.4.

Drawing lines for the mean of X and the mean of Y, we split the data into 4 quadrants. The values we sum –
the de-meaned pairs – will be positive or negative depending on which quadrant the data lies in, and the sign
of the covariance will be positive or negative accordingly.

In the scatter plot shown in Figure 4.4, there is clearly a positive relationship between X and Y.

• For the data points in the lower left quadrant: \( X_i - \bar{X} \) is negative and \( Y_i - \bar{Y} \) is negative, so
\( (X_i - \bar{X})(Y_i - \bar{Y}) \) will be positive in each case.

• For the data points in the upper right quadrant: \( X_i - \bar{X} \) is positive and \( Y_i - \bar{Y} \) is positive, so
\( (X_i - \bar{X})(Y_i - \bar{Y}) \) will again be positive in each case.

Summing mostly positive values together will give you a covariance that is positive.

Consider an example with a negative relationship. In this case, the data will mostly be in the upper left and
lower right quadrants. Thus negative values of \( X_i - \bar{X} \) will be matched with positive values of \( Y_i - \bar{Y} \) and
the positive values of \( X_i - \bar{X} \) will be matched with negative values of \( Y_i - \bar{Y} \). This will give a negative
covariance.
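
The quadrant argument can be sketched in Python by counting how many de-meaned products are positive and how many are negative. The function name `quadrant_signs` and both data sets are made up for illustration; this is not how Excel computes the covariance, only a way to see where its sign comes from.

```python
import statistics

def quadrant_signs(xs, ys):
    """Count de-meaned products that are positive, negative, or zero."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    counts = {"positive": 0, "negative": 0, "zero": 0}
    for x, y in zip(xs, ys):
        product = (x - x_bar) * (y - y_bar)
        if product > 0:
            counts["positive"] += 1   # lower left or upper right quadrant
        elif product < 0:
            counts["negative"] += 1   # upper left or lower right quadrant
        else:
            counts["zero"] += 1       # point lies on one of the mean lines
    return counts

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys_up = [2, 1, 5, 3, 6, 7]      # tends to rise with X
ys_down = [7, 6, 3, 5, 1, 2]    # tends to fall with X

up_counts = quadrant_signs(xs, ys_up)
down_counts = quadrant_signs(xs, ys_down)

print(up_counts)    # {'positive': 4, 'negative': 2, 'zero': 0}
print(down_counts)  # {'positive': 2, 'negative': 4, 'zero': 0}
```

Counting signs only suggests the direction. The covariance itself also depends on how large each product is, so a few points far from the means can outweigh many points close to them.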

In Excel: We can calculate the covariance automatically in Excel. The formula is:
=COVARIANCE.S(range of X data, range of Y data)
(e.g. =COVARIANCE.S(A2:A5086,B2:B5086)).

Interpreting:
What does the covariance mean? The covariance reflects both the strength and direction of the linear
association between two variables. However, its size is not easy to interpret: it is measured in the units of X
multiplied by the units of Y (just as the variance of X is measured in squared units of X), so it changes
whenever either variable is rescaled. The main thing we learn from the covariance is
about the direction of a relationship – do the two variables tend to move in the same direction or in opposite
directions?

A positive covariance indicates that when X is big, then it is more likely that Y will be big. Similarly, when
X is small there is a higher probability of Y being small – i.e. X and Y tend to move together. This is captured
with positive covariance.

e.g. Income and education: When someone has completed a lot of education (e.g. a degree as opposed
to completion of VCE), chances are they will earn higher income. So, we would say education and
income have a positive covariance.

e.g. Interest rates and the price of bonds: When interest rates increase, the price of bonds decreases.
Conversely, the prices of bonds increase when interest rates decrease. So, we would say that interest
rates and bond prices have a negative covariance.

4.4 Correlation: A Better Measure of Connection between X & Y

As noted, the problem with using the covariance to measure the relationship between two variables is that we
can only interpret the direction of the relationship, not the strength of it. The solution? We calculate a
standardised measure, the correlation.

The correlation is a standardised measure: its values range from -1 (perfect negative correlation) to +1 (perfect
positive correlation). It is not influenced by the scale of the variables (i.e. it will not change if we start
measuring income in terms of thousands of dollars).

Interpreting:
Figure 4.5 illustrates some scatter plots and the resulting correlations.

Perfect Correlation: In Figures 4.5a and 4.5b we have cases of perfect positive and negative correlation
respectively. In the case of perfect positive correlation, this means that when X changes, Y will change exactly
proportionately with the change in X. Perfect correlation is pretty rare, as there are almost always other factors
at work, so changes in X are not reflected perfectly in changes in Y.

Zero Correlation: In Figure 4.5c we have two variables which are uncorrelated. Here, when one variable
changes, there is no tendency for the other variable to change in a predictable way.

Some Correlation: The most common case is some degree of correlation. Figure 4.5d shows a case where the
correlation between X and Y is quite strong at 0.85, but not perfect. That is, given a value for X, say a high
value, we would be pretty sure that Y will also be high but we do not know what its precise value will be.

Figure 4.5: Scatter Plots and Correlation

(a) Perfect Positive Correlation (b) Perfect Negative Correlation (c) No Correlation (d) Correlation of 0.85

How do we calculate the correlation? The formula is:

\[ Corr(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}} \]

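This looks a bit like the formula for the covariance – the numerator is very similar. Let us do a bit of
manipulation and you will see they are actually very closely related.

\[
\begin{aligned}
Corr(X, Y) &= \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}} \\
&= \frac{\frac{1}{n-1} \sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\frac{1}{n-1} \sum (X_i - \bar{X})^2 \cdot \frac{1}{n-1} \sum (Y_i - \bar{Y})^2}} \\
&= \frac{Cov(X, Y)}{S_X S_Y}
\end{aligned}
\]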
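
The relationship between the correlation and the covariance can be sketched in Python as follows. The function names and the small data set are made up for illustration. The sketch computes the correlation as the covariance divided by the two sample standard deviations, and then checks that switching income from dollars to thousands of dollars changes the covariance but not the correlation.

```python
import statistics

def sample_covariance(xs, ys):
    """Sum of the paired de-meaned products, divided by n - 1."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the sample standard deviations."""
    return sample_covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

# Made-up data for illustration only.
education = [10, 12, 12, 14, 16, 18]                          # years
income_dollars = [25000, 30000, 28000, 45000, 52000, 60000]   # $ per year
income_thousands = [v / 1000 for v in income_dollars]         # $000s per year

r_dollars = correlation(education, income_dollars)
r_thousands = correlation(education, income_thousands)
cov_dollars = sample_covariance(education, income_dollars)
cov_thousands = sample_covariance(education, income_thousands)

print(r_dollars, r_thousands)       # the same value in either unit
print(cov_dollars, cov_thousands)   # differ by a factor of 1000
```

This is why the correlation is easier to interpret than the covariance: its value depends on how closely the points follow a line, not on the units chosen.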

These three tools – the scatter diagram, the covariance and the correlation – are all used to explore possible linear
relationships between two variables X and Y. Once we find a possible relationship, we can construct a model
– the linear regression model.

4.5 The Theory of the Simple Linear Regression Model

We write the Simple Linear Regression Model as:

\[ Y_i = \beta_0 + \beta_1 X_i + e_i \]

The subscript 𝑖 denotes the 𝑖th observation. This might be for the 𝑖th person, as in our previous example of
income and education, or if we are looking at data for a country's GDP and average education of its workers
then 𝑖 would stand for the country, or it could be the 𝑖th firm, household, etc.

The key components of the model are:

• \( Y_i \) is called the dependent variable (the thing you want to influence).

• \( X_i \) is called the independent or explanatory variable (the thing you can control).

• \( \beta_0 \) and \( \beta_1 \) are the population parameters which we are most interested in and want to estimate.

• \( e_i \) is the error – how far the observed value of \( Y_i \) is from the regression line.

We need to include \( e_i \) in our model because not all data points lie on the line – e.g. education does not entirely
determine a person's income. This can be seen in Figure 4.6.

Figure 4.6: The Regression Line and Regression Errors

The way we have set up the model so far is really only valid when we have data on the whole population of
interest (e.g. all Australians, all countries, all firms, etc.). But in reality, we almost never have information on
the whole population. Rather, we would have data on a sample. That is fine as long as our sample is
representative of the population (e.g. a non-representative sample would be one where we select a few firms
in a particular industry rather than a representative sample of all industries). In that case we can use the sample
data to estimate the true population model. We just make some slight, but important, changes in notation.

We denote the true population model as:

\[ Y_i = \beta_0 + \beta_1 X_i + e_i \]

An estimate of the true model using our sample data is:

\[ Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{e}_i \]

All we have done is replace the \( \beta \) parameters with \( \hat{\beta} \) values. That is, because the true population values are
unknown, we replace them with estimates of the intercept and slope based on our sample.

Now, the key issue is how we find out or choose the values of \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \). We could use our eye and determine
the particular values of the intercept and the slope which seem to give the best-looking line through the data points.
But this is necessarily imprecise. Instead, what we do is use a formula to find the “best” line. The best line is
the one that makes the errors as small as possible. To be more precise, we find the values of the intercept and
the slope that minimise the “sum of squared errors” given by the formula below:

\[ SSE(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^{n} \hat{e}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \]

For this reason, the regression line we estimate with a sample of data is known as the “least squares” line.
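
To make the idea of minimising the sum of squared errors concrete, here is a minimal Python sketch. It relies on a standard result that this topic does not state: with one input, the minimising slope is the sum of de-meaned products divided by the sum of squared de-meaned X values, and the minimising intercept is the mean of Y minus the slope times the mean of X. The function names and the data are made up for illustration.

```python
import statistics

def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line for one input."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sse(intercept, slope, xs, ys):
    """Sum of squared errors between observed Y and the line's predictions."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data for illustration only (years of education, income in $000s).
xs = [10, 12, 12, 14, 16, 18]
ys = [25, 30, 28, 45, 52, 60]

b0_hat, b1_hat = fit_least_squares(xs, ys)
best_sse = sse(b0_hat, b1_hat, xs, ys)

# Nudging either estimate away from the fitted values should raise the SSE.
nudged_sse = min(
    sse(b0_hat + 1.0, b1_hat, xs, ys),
    sse(b0_hat - 1.0, b1_hat, xs, ys),
    sse(b0_hat, b1_hat + 0.1, xs, ys),
    sse(b0_hat, b1_hat - 0.1, xs, ys),
)

print(b0_hat, b1_hat)
print(best_sse, nudged_sse)
```

Every nudged line has a larger sum of squared errors than the fitted one, which is what "best" means here. Fitting a line by eye gives no such guarantee.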

4.6 Estimating a Simple Linear Regression in Excel

What a linear regression does is find the line of best fit between the two variables, Y and X. To illustrate how
to fit such a model in Excel, we use our previous example where we relate a person's education and their
income. A snapshot of the data is shown in Figure 4.7 and some of the data is graphed in Figure 4.8 with
the regression line. The question is, how did we obtain this fitted line in Excel?

Figure 4.7: Data Snapshot – US Income Data

Figure 4.8: Scatter Plot – Income and Education with Trend Line

To estimate a linear regression in Excel, we use Excel's 'Data Analysis' tools. Go to the 'Data' tab and select
'Data Analysis' and select 'Regression'. Then fill out the dialogue box as shown in Figure 4.9. After pressing
'OK' the output is shown in Figure 4.10.

Figure 4.9: Excel Dialogue Box to Run a Simple Linear Regression

Figure 4.10: Regression Results – Income on Education

This technique, linear regression, will come up often in later parts of the unit so make sure you are familiar
with how to calculate a regression in Excel. In following sections, we will discuss the meaning of the various
pieces of the results. For now, the main thing is the equation for a line that is produced:

Predicted Income = -48428.4 + 5734.3 Education.

How is such a line useful? It allows us to quantify the average relationship between X and Y. In this case, we
learn that an extra year of education will, on average, lead to an increased income of $5,734 per year. Notice
we are careful here to interpret the meaning of this slope estimate, including the use of the right units for both
X (years) and Y ($ per year).

We can also use the line to predict income, for a given year of education: if you have 12 years of education
(finishing Secondary School, say), the prediction is:

Predicted Income = -48428.4 + 5734.3 x 12 = $20,383 per year.
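
The same arithmetic can be written as a small Python function, which is handy when predictions are needed for several education levels. This is only a sketch: the coefficients are the ones reported above, while the function and variable names are assumptions made for this example.

```python
# Coefficients as reported in this topic's simple regression output.
INTERCEPT = -48428.4
SLOPE = 5734.3   # dollars per year of income, for each extra year of education

def predicted_income(education_years):
    """Predicted annual income from the fitted simple regression line."""
    return INTERCEPT + SLOPE * education_years

income_12 = predicted_income(12)
income_13 = predicted_income(13)

print(round(income_12))                  # 20383
print(round(income_13 - income_12, 1))   # 5734.3
```

The second line of output confirms the slope interpretation: one more year of education raises the predicted income by the slope estimate.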

4.7 Multiple Linear Regression Model

Our simple regression model is a start, but clearly inadequate – an outcome almost always has several factors
contributing to it. So, we need a model with one output and several inputs.

In the multiple regression model we have a model of the form:

\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + e_i \]

Here \( Y_i \) is the dependent variable of interest – Income in our example in the previous section. \( X_{1i} \) is the first
independent variable, for example Education, and \( X_{2i} \) is the second independent variable, such as Age. We
now include \( k \) different independent variables. Note that because we now have lots of X variables, we need to
number them. We do this by adding a subscript. The subscript \( i \) still denotes the fact that this is an observation
for a particular person and we also include 1, 2, …, \( k \) to denote the number of the particular variable.

We can estimate a multiple linear regression model in Excel just the same way as we estimated a simple linear
regression model. The only difference is that instead of selecting a single column in the ‘Input X Range’ we
just select a number of columns.
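
For readers who want to see what a regression tool does with several X columns, the following Python sketch estimates a multiple regression by solving the least squares "normal equations" with Gaussian elimination. This is an illustrative alternative to the Excel steps, not a description of Excel's internals. The function names and the data are made up, and Y is generated exactly from known coefficients so the estimates can be checked against them.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("inputs are perfectly collinear; no unique fit")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for j in range(col, n + 1):
                m[row][j] -= factor * m[col][j]
    # Back-substitution from the last row upwards.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][j] * x[j] for j in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x

def fit_multiple_regression(rows, ys):
    """Least squares estimates [b0, b1, ..., bk] for rows of k inputs."""
    design = [[1.0] + list(row) for row in rows]   # leading 1 gives the intercept
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Made-up (education, age) pairs. Y is generated exactly from known
# coefficients, so the estimates should reproduce them.
rows = [(10, 25), (12, 40), (12, 31), (14, 52), (16, 28), (18, 45)]
true_b = [-100.0, 4.5, 3.0]
ys = [true_b[0] + true_b[1] * e + true_b[2] * a for e, a in rows]

estimates = fit_multiple_regression(rows, ys)
print([round(b, 6) for b in estimates])
```

Real data contain errors, so the estimates would not match any "true" values exactly. The point of the sketch is only that adding a column of X data adds one more coefficient to estimate.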

In our case we are going to estimate the model using two of the variables in the table of data: Education and
Age. So in this example, \( k = 2 \). The results are shown in Figure 4.11.

Figure 4.11: Regression Results – Income on Education and Age

The first thing to focus on is the coefficients. These are the estimated intercept and slopes from the line. What
is evident is that the coefficients on both variables are positive. More education and being older both lead to higher
income. We will interpret the magnitude of these coefficients in the next section.

4.8 Interpreting the Model

We want to understand what the model we have estimated tells us about the nature of the relationship between
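the X’s and Y. More specifically, what do the estimates \( \hat{\beta}_0 \), \( \hat{\beta}_1 \) and \( \hat{\beta}_2 \) actually tell us?

\( \hat{\beta}_0 \) is known as the intercept – the estimated value of \( Y \) when \( X_1 = 0 \) and \( X_2 = 0 \). \( \hat{\beta}_1 \) and \( \hat{\beta}_2 \) are the slopes
of \( Y \) with respect to \( X_1 \) and \( X_2 \) – they estimate the change in \( Y \) for a 1 unit change in each of the variables. Let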
us explain by looking at our results in Figure 4.11.

Interpreting the intercept \( \hat{\beta}_0 \):

In the results in Figure 4.11 \( \hat{\beta}_0 \) is a negative number: -117120. What the model is saying is that if \( X_1 = 0 \) and
\( X_2 = 0 \) (i.e. Education = 0 and Age = 0) then the income that this person would earn is -$117,120 per year.
This is not particularly meaningful. There is no one in our data who is zero years old – and there is never likely
to be! There is also no one with zero years of education. And it is not realistic to earn a negative income! In
the case of this particular regression, the intercept is not informative. But in some cases, as we will see below,
the intercept is quite useful.

Interpreting \( \hat{\beta}_1 \):
In our results \( \hat{\beta}_1 = 4541 \) is an estimate of the effect of Education on Income: it tells us how much Income
would change if Education were one year higher, holding Age constant. In particular, we can say “take two
people of the same age, one of whom has one more year of education than the other. The person with one extra
year of education can expect to earn, on average, $4,541 more per year than the person with lower education
of the same age.”

Interpreting \( \hat{\beta}_2 \):
In our results \( \hat{\beta}_2 = 3369 \) is an estimate of the effect of Age on Income: it tells us how much Income would
change if Age were 1 year higher, holding years of Education constant. In particular, we can say “take two
people with the same years of education, one of whom is 1 year older than the other. The person who is 1 year
older can expect to earn, on average, $3,369 more per year than the younger person with the same education.”
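
The "holding the other variable constant" comparisons can be reproduced with a few lines of Python. The coefficients are the estimates reported above; the function name and the particular ages and education levels are assumptions chosen only for illustration.

```python
# Coefficients as reported in this topic's multiple regression output.
B0, B_EDUCATION, B_AGE = -117120, 4541, 3369

def predicted_income(education_years, age_years):
    """Predicted annual income from the fitted two-input model."""
    return B0 + B_EDUCATION * education_years + B_AGE * age_years

# Same age, one extra year of education: the gap equals the Education slope.
education_gap = predicted_income(13, 30) - predicted_income(12, 30)

# Same education, one extra year of age: the gap equals the Age slope.
age_gap = predicted_income(12, 31) - predicted_income(12, 30)

print(education_gap)  # 4541
print(age_gap)        # 3369
```

Because the model is linear, these gaps are the same whichever age or education level the two people share.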

Being able to specify quantities like these can be very important in policy and planning. For example, it
suggests that if people could be encouraged to stay at school for longer, then they could earn more income
(and the government could receive more tax revenue!).

This way of thinking about regression shows that it is a very powerful and useful way of exploring the way
certain variables influence the outcome of interest, allowing us to consider the effects of several factors in the
same integrated model.

4.9 Obtaining Predictions from the Model

One of the reasons why you might want to construct a regression model in the first place is for prediction. That
is, we might want to know what the average income is for a person with certain characteristics (e.g. someone
with 12 years of education who is 30 years old).

This is straightforward. What we have obtained from constructing the model is a linear equation which relates
the Y variable to the various X variables. That is, we have an equation of the form below where \( \hat{\beta}_0 \), \( \hat{\beta}_1 \), etc.
are the coefficients that we have estimated. To construct predictions, we simply insert the values for
\( X_1, X_2, \ldots, X_k \) that we are interested in and we denote our estimate as \( \hat{Y} \).

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_k X_k \]

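In the example previously we had,

\[
\begin{aligned}
\widehat{Income} &= \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 \\
&= \hat{\beta}_0 + \hat{\beta}_1 \, Education + \hat{\beta}_2 \, Age \\
&= -117120 + 4541 \, Education + 3369 \, Age
\end{aligned}
\]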

To obtain an estimate for the income of an individual with 12 years of education and who is 30 years old,
we simply plug in these values for Age and Education into the above equation,

\[ \widehat{Income} = -117120 + 4541 \times 12 + 3369 \times 30 = 38442 \]

That is, a person who has 12 years of education and is 30 years old is expected to earn $38,442 per year.
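
The prediction rule works the same way for any number of inputs, which the following minimal Python sketch tries to show. The estimates are the ones reported above; the function name `predict` and its argument layout are assumptions made for this example.

```python
def predict(intercept, slopes, inputs):
    """Prediction: the intercept plus the sum of each slope times its input."""
    if len(slopes) != len(inputs):
        raise ValueError("need exactly one input value per slope")
    return intercept + sum(b * x for b, x in zip(slopes, inputs))

# Estimates reported in this topic, in the order (Education, Age).
intercept = -117120
slopes = [4541, 3369]

# A person with 12 years of education who is 30 years old.
income_hat = predict(intercept, slopes, [12, 30])
print(income_hat)  # 38442
```

The input values must be listed in the same order as the slopes. Swapping Education and Age would give a different, and wrong, prediction.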
