
Notes - Correlation and Regression

The document discusses correlation and linear regression, focusing on methods to determine the existence and strength of linear relationships between paired quantitative variables. It covers concepts such as the linear correlation coefficient, hypothesis testing for correlation, and regression analysis for predicting values of a dependent variable based on an independent variable. Additionally, it highlights common errors in correlation interpretation and provides examples related to job satisfaction and burnout.


Correlation and Linear Regression

Roderick D. Balce

•In this lesson, we again consider paired sample data, but the objective is
fundamentally different from that of the paired-samples t test and repeated-measures
ANOVA.
•Here, we introduce methods for determining whether a correlation, or
association, between two variables exists and whether the correlation is linear.
For linear correlations, we can identify an equation that best fits the data and we
can use that equation to predict the value of one variable given the value of the
other variable.
Learning Outcomes
• Describe the direction and strength of linear
relationship between quantitative variables.
• Use paired data to find the value of the linear
correlation coefficient r.
• Perform a hypothesis test for correlation.
• Determine and interpret R2 value.
• Use the regression equation generated from
paired data to predict the value of the
dependent variable given the value of the
independent variable.
Basic Concepts
A correlation exists between two variables
when the values of one variable are
associated with the values of the other.

The linear correlation coefficient r measures
the strength of the linear relationship
between the paired quantitative x- and y-
values in a sample.

•Correlation - a measure of the statistical association between two numerical
variables.
•Linear correlation coefficient r - a numerical measure of the direction and
strength of the linear relationship between two paired variables
representing quantitative data. (It is sometimes referred to as the Pearson product-
moment correlation coefficient in honor of Karl Pearson, who originally
developed it.)
•Here, we use paired sample data (sometimes called bivariate data) to find the
value of r, then we use that value to conclude that there is (or is not) a linear
correlation between the two variables.
Linear Correlation
[Scatterplots: linear relationships (left) vs. curvilinear relationships (right)]

•A scatterplot is the best place to start. A scatterplot is a graph of the paired (x,
y) sample data with a horizontal x-axis and a vertical y-axis. Each individual (x,
y) pair is plotted as a single point.
•If you are asked to “describe the association” in a scatterplot, you must discuss
these three things:
1. FORM (linear or non-linear). In this lesson we consider only linear
relationships, which means that when graphed, the points approximate a
straight-line pattern.
2. DIRECTION (positive or negative)
3. STRENGTH (weak, moderate, strong)
Linear Correlation
[Scatterplots: strong relationships (left) vs. weak relationships (right)]

•Strength reflects the spread of the data points relative to the straight line.
•Since our eyes are not good judges of the strength of a linear
association, we need a NUMERICAL MEASURE.
Strength and Direction of Linear
Correlation
r value Interpretation
1 Perfect positive linear relationship
0 No linear relationship
-1 Perfect negative linear relationship

•r can be any value from –1 to +1. The closer r is to 1 or -1, the stronger the
linear association. The closer to –1, the stronger the negative linear relationship
and the closer to 1, the stronger the positive linear relationship
•The closer to 0, the weaker the relationship. And if r equals zero, then there is
no linear association between the two variables.
Scatter Plots of Data with Various
Correlation Coefficients
[Six scatterplots illustrating r = -1, r = -0.625, r = 0 (top row) and
r = 1, r = 0.351, r = 0 (bottom row)]

•0.8 < |r| < 1: strong
•0.5 < |r| ≤ 0.8: moderate
•0 < |r| ≤ 0.5: weak
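These cutoffs can be applied mechanically once r is computed. A minimal Python sketch (the helper name `r_strength` is our own; values falling exactly on a cutoff are assigned to the weaker class here):

```python
def r_strength(r):
    """Classify the strength of a linear correlation from |r|,
    using the cutoffs in these notes (0.5 and 0.8)."""
    a = abs(r)
    if a > 0.8:
        return "strong"
    if a > 0.5:
        return "moderate"
    return "weak"

# Direction comes from the sign of r; strength from its magnitude.
print(r_strength(-0.9))   # strong (and negative, from the sign)
print(r_strength(0.65))   # moderate
print(r_strength(0.2))    # weak
```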
Properties of the
Linear Correlation Coefficient r
1. –1 ≤ r ≤ 1
2. If all values of either variable are converted
to a different scale, the value of r does not
change.
3. The value of r is not affected by the choice
of x and y. Interchange all x- and y-values
and the value of r will not change.
4. r measures strength of a linear relationship.
5. r is very sensitive to outliers; they can
dramatically affect its value.

•Ranges from –1 to 1
•Treats x and y symmetrically; that is, x and y can be interchanged (unlike in
regression analysis)
Hypothesis Test for Correlation
Requirements:
1. The sample of paired (x, y) data is a simple
random sample of quantitative data.
2. Visual examination of the scatterplot must
confirm that the points approximate a
straight-line pattern.
3. The outliers must be removed if they are
known to be errors.

•The effects of any other outliers should be considered by calculating r
with and without the outliers included.
•Requirements 2 and 3 above are simplified attempts at checking this
formal requirement: the pairs of (x, y) data must have a bivariate
normal distribution. This assumption basically requires that for any fixed value
of x, the corresponding values of y have a distribution that is approximately
normal, and for any fixed value of y, the values of x have a distribution that is
approximately normal. This requirement is usually difficult to check, so for
now, we will use Requirements 2 and 3 as listed here.
Hypothesis Test for Correlation

Notation:

n = number of pairs of sample data

r = linear correlation coefficient for a sample of paired data

ρ = linear correlation coefficient for a population of paired data

•Because the linear correlation coefficient is calculated using sample data, it is a
sample statistic represented by r. If we had every pair of x and y values from an
entire population, the result would be a population parameter, represented by the
Greek letter rho (ρ).
Hypothesis Test for Correlation
Hypotheses:
H0: ρ = 0 (There is no linear correlation.)
H1: ρ ≠ 0 (There is a linear correlation.)

For one-tailed tests:
H1: ρ > 0 (claim of a positive linear correlation)
H1: ρ < 0 (claim of a negative linear correlation)

•We wish to determine whether there is a significant linear correlation
between two variables.
•One-tailed tests can occur with a claim of a positive linear correlation or
a claim of a negative linear correlation. In such cases, the hypotheses will
be as shown here.
Hypothesis Test for Correlation

Decision and Conclusion:

If |r| > critical value or the p-value ≤ α, reject
H0 and conclude that there is enough
evidence to support the claim of a linear
correlation.

If |r| ≤ critical value or the p-value > α, fail to
reject H0 and conclude that there is not
enough evidence to support the claim of a
linear correlation.
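One standard way to carry out this test (not shown explicitly in the notes) converts r to a t statistic, t = r·√(n−2)/√(1−r²), with n−2 degrees of freedom. A self-contained sketch using hypothetical paired data; the critical value 2.306 is the two-tailed t value for df = 8 at α = 0.05:

```python
import math

def corr_test(x, y, t_crit):
    """Compute r for paired data and test H0: rho = 0 against
    H1: rho != 0 using t = r*sqrt(n-2)/sqrt(1-r^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    return r, t, abs(t) > t_crit  # True means reject H0

# Hypothetical data with a clear positive linear trend (n = 10, df = 8)
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 5, 6, 7, 9, 12, 13, 14, 17, 19]
r, t, reject = corr_test(x, y, t_crit=2.306)
print(round(r, 3), round(t, 2), reject)
```

Equivalently, statistical software reports a p-value for this t statistic, which is compared with α exactly as described above.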
Example: Job Satisfaction
A cross-sectional study was conducted to
explore the relationships between job
satisfaction (JS) and burnout (BO) among 200
medical technologists working in government
hospitals. At α=0.05, is there a linear
correlation between job satisfaction and
burnout?

Hypotheses:
H0: ρ = 0
H1: ρ ≠ 0
Scatterplot
Assumption Check and Correlation
Analysis
Shapiro-Wilk Test for Bivariate Normality
Shapiro-Wilk p
BO - JS 0.991 0.229

Pearson's Correlation
Pearson's r p
BO - JS -0.650 < .001

Decision and Conclusion:
_______________________________________
Common Errors
Involving Correlation
1. Causation: It is wrong to conclude that
correlation implies causality.
2. Averages: Averages suppress individual
variation and may inflate the correlation
coefficient.
3. Linearity: There may be some relationship
between x and y even when there is no
linear correlation.

•Here are three of the most common errors made when interpreting results
involving correlation:
•Assuming that correlation implies causality. Know that correlation does not
imply causality. We should not make any conclusion that includes a statement
about a cause-effect relationship between the two variables. Just because two
variables are correlated does not mean that one variable causes the other variable
to change.
•Using data based on averages
•Ignoring the possibility of a nonlinear relationship.
Regression Analysis
Used to predict the value of a dependent
variable (ŷ) based on the value of at least one
independent variable (x).
Simple linear regression
– only one independent
variable
Multiple regression
– two or more independent
variables

•Once we have identified two variables that are correlated, we would like to
model this relationship using one variable as
a predictor or explanatory variable (x) to explain the other variable,
the response or outcome variable (y or y-hat). Dependent variable: the
variable we wish to explain; Independent variable: the variable used to explain
the dependent variable
•A correlation analysis provides information on the strength and direction of
the linear relationship between two variables, while a simple linear regression
analysis estimates parameters in a linear equation that can be used
to predict values of one variable based on the other.
•Simple regression is bivariate (2 variables: 1 dependent and 1 independent);
multiple regression is multivariate. In the simple linear case, regression
involves the same pair of variables as correlation analysis.
Regression Analysis
• Regression Line
The graph of the regression equation is
called the regression line (or line of best
fit, or least squares line).

• Regression Equation
Given a collection of paired data, the
regression equation
ŷ = b0 + b1x
algebraically describes the relationship
between the two variables.

•Here, we find the equation of the straight line that best fits the paired sample
data and that algebraically describes the relationship between two variables. The
best-fitting straight line is called a regression line and its equation is called the
regression equation.
•The typical equation of a straight line y = mx + b is expressed in the form y-hat
= b0 + b1x
Notation for
Regression Equation
                                     Population Parameter   Sample Statistic
y-intercept of regression equation          β0                    b0
Slope of regression equation                β1                    b1
Equation of the regression line        y = β0 + β1x          ŷ = b0 + b1x

•The simple linear regression model is y = β0 + β1x + ε, where ε is a random
variable called the error term.
•b0 = y-intercept (estimated value of y when x is 0)
•b1 = slope or regression coefficient (estimated change in y relative to a 1-unit
change in x; the sign tells you whether the outcome y will increase or decrease
due to an increase in the predictor x)
b1 > 0: there is a positive correlation between x and y (the greater x, the
greater y)
b1 < 0: there is a negative correlation between x and y (the greater x, the
smaller y)
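The sample estimates b0 and b1 come from the standard least-squares formulas b1 = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)² and b0 = ȳ − b1·x̄ (standard formulas, not derived in these notes). A minimal sketch:

```python
def least_squares(x, y):
    """Estimate intercept b0 and slope b1 by ordinary least squares."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    b1 = sxy / sxx           # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1

# Toy data lying exactly on y = 5 + 2x, so the fit recovers b0 = 5, b1 = 2
x = [1, 2, 3, 4, 5]
y = [7, 9, 11, 13, 15]
b0, b1 = least_squares(x, y)
print(b0, b1)  # 5.0 2.0
```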
Coefficient of Determination, R2

Measures the proportion of the variation in the


dependent variable (y) that is explained by the
independent variable (x) for a linear regression
model.

Mathematically related to the correlation


coefficient.

Values range from 0 to 1.

•In order to find out how well the regression model can predict or
explain the dependent variable, we can refer to the coefficient of
determination.
•In simple linear regression where there is only 1 independent variable, R2 = r2
•A quantitative measure of the explanatory power of a model: an R2 close to
zero indicates a model with very little explanatory power; an R2 close to one
indicates a model with more explanatory power and a better fit.
•For example, if the R2 is 0.9, it indicates that 90% of the variation in the
dependent variable is explained by the independent variable.
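In the simple linear case this is just arithmetic: squaring r gives R². Checking the figures from the job-satisfaction example in these notes (r = -0.650):

```python
# R^2 = r^2 in simple linear regression.
r = -0.650
r_squared = r ** 2            # proportion of variation in y explained by x
unexplained = 1 - r_squared   # proportion left unexplained
print(round(r_squared, 3), round(unexplained, 3))  # 0.423 0.577
```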
Example: Job Satisfaction
What proportion of the variation in job
satisfaction can be explained by the variation in
burnout score?
Model Summary - JS
Model R R² Adjusted R² RMSE
H₀ 0.000 0.000 0.000 29.319
H₁ 0.650 0.423 0.420 22.334

R2 = ______
Explained variation: _______
Unexplained variation: _______

•We conclude that burnout explains 42.3% of the variation in job satisfaction
and therefore, 57.7% of the variation in job satisfaction cannot be explained by
the variation in burnout score.
•Adjusted R-squared is a modified version of R-squared that has been adjusted
for the number of predictors in the model.
•Root Mean Square Error (RMSE) is the standard deviation of the
residuals (prediction errors)
Predicting Values of ŷ

•Use the regression equation for predictions only if:
•the graph of the regression line on the scatterplot confirms that
the regression line fits the points reasonably well.
•the linear correlation coefficient r indicates that there is a linear
correlation between the two variables.
•the data do not go much beyond the scope of the available
sample data. (Predicting too far beyond the scope of the available
sample data is called extrapolation, and it could result in bad
predictions.) Here, the observed BO scores range from 31 to 71.
•If the regression equation does not appear to be useful for making predictions,
the best predicted value of a variable is its point estimate, which is its sample
mean.
Predicting Values of ŷ
Coefficients
Model            Unstandardized    SE       Standardized    t         p
H₀   Intercept   132.165           2.073                    63.751    < .001
H₁   Intercept   235.459           8.724                    26.990    < .001
     BO          -2.112            0.175    -0.650          -12.039   < .001

Regression equation:

ŷ = ____________ + ____________ x

•The main purpose of regression analysis is to calculate estimates of the slope
and intercept.
•y-hat = b0 + b1x
•b0 = H1 intercept/constant (unstandardized) = 235.459 (estimated value of y
when x is 0)
•b1 = unstandardized regression coefficient or the slope of the independent
variable = -2.112 (estimated change in y relative to a 1-unit change in x; the sign
tells you whether the outcome y will increase or decrease due to an increase in
the predictor x; for each unit increase in BO, JS will decrease by 2.112 units)
•The linear model can thus be expressed as y-hat= 235.459 – 2.112x
•Unstandardized coefficients are 'raw' coefficients produced by regression
analysis when the analysis is performed on original, unstandardized variables.
Unlike standardized coefficients, which are normalized unit-less coefficients, an
unstandardized coefficient has units and a 'real life' scale.
•Unstandardized coefficients are those produced by the
linear regression model using the independent variables measured in their
original scales.
Example: Job Satisfaction
If more employees are measured for burnout,
what do you predict their job satisfaction will
be?

Burnout score (x) Job satisfaction (y)


a. 25 ______
b. 50 ______
c. 70 ______
d. 85 ______

•y-hat = 235.459 – 2.112x, or JS = 235.459 – 2.112(BO)
•y-hat = 235.459 – 2.112(25) = 182.659
•y-hat = 235.459 – 2.112(50) = 129.859
•y-hat = 235.459 – 2.112(70) = 87.619
•y-hat = 235.459 – 2.112(85) = 55.939
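The worked predictions above can be reproduced with a one-line function built from the fitted equation. Note that BO = 25 and BO = 85 fall outside the observed range of burnout scores (31 to 71), so those two predictions are extrapolations:

```python
def predict_js(bo):
    """Predicted job satisfaction from the fitted equation in these notes:
    y-hat = 235.459 - 2.112 * BO."""
    return 235.459 - 2.112 * bo

for bo in (25, 50, 70, 85):
    print(bo, round(predict_js(bo), 3))
```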
Practice Test
• Using the JASP outputs below, determine the
following:
– Correlation between job satisfaction (JS) and
turnover intention or intention to quit (TOI)
• r = _______
• P-value = ________
• Decision and conclusion: ________________________
_____________________________________________

r = -0.749; P-value < .001. Reject the H0; there is
enough evidence to support the claim that there is a linear
correlation between JS and TOI.
Regression Analysis (x=JS; y=TOI)
• R2 = _________ Unexplained variation (%) = ______
• ŷ = ________________________
• Predict the TOI for the following JS values:
62 130 55

R2 = 0.561; Unexplained variation = 43.9%
ŷ = 35.108 – 0.168x
x = 62 → ŷ = 24.692; x = 130 → ŷ = 13.268; x = 55 → ŷ = 25.868
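These answers can be verified the same way: R² is the square of r = -0.749, and each prediction comes from plugging the JS value into ŷ = 35.108 − 0.168x:

```python
r = -0.749
print(round(r ** 2, 3))  # 0.561, so unexplained variation is 43.9%

def predict_toi(js):
    """Predicted turnover intention from the fitted equation:
    y-hat = 35.108 - 0.168 * JS."""
    return 35.108 - 0.168 * js

for js in (62, 130, 55):
    print(js, round(predict_toi(js), 3))  # 24.692, 13.268, 25.868
```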
