Notes - Correlation and Regression
Regression
Roderick D. Balce
•In this lesson, we again consider paired sample data, but the objective is
fundamentally different from that of the paired-samples t test and repeated-measures
ANOVA.
•Here, we introduce methods for determining whether a correlation, or
association, exists between two variables and whether that correlation is linear.
For linear correlations, we can identify an equation that best fits the data and
use that equation to predict the value of one variable given the value of the
other variable.
Learning Outcomes
• Describe the direction and strength of linear
relationship between quantitative variables.
• Use paired data to find the value of the linear
correlation coefficient r.
• Perform a hypothesis test for correlation.
• Determine and interpret R2 value.
• Use the regression equation generated from
paired data to predict the value of the
dependent variable given the value of the
independent variable.
Basic Concepts
A correlation exists between two variables
when the values of one variable are
associated with the values of the other.
•A scatterplot is the best place to start. A scatterplot is a graph of the paired (x,
y) sample data with a horizontal x-axis and a vertical y-axis. Each individual (x,
y) pair is plotted as a single point.
•If you are asked to “describe the association” in a scatterplot, you must discuss
these three things:
1. FORM (linear or non-linear). In this lesson we consider only linear
relationships, which means that when graphed, the points approximate a
straight-line pattern.
2. DIRECTION (positive or negative)
3. STRENGTH (weak, moderate, strong)
Linear Correlation
[Scatterplots contrasting strong and weak linear relationships]
•r can be any value from –1 to +1. The closer r is to –1 or +1, the stronger the
linear association: the closer to –1, the stronger the negative linear relationship,
and the closer to +1, the stronger the positive linear relationship.
•The closer r is to 0, the weaker the relationship; if r equals zero, there is
no linear association between the two variables.
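The computation of r can be sketched in plain Python. This is a minimal illustration of the defining formula r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² Σ(y − ȳ)²), not the statistical software used to produce the output in these notes:

```python
import math

def pearson_r(x, y):
    """Linear correlation coefficient r for paired sample data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of cross-products and squared deviations about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Perfectly increasing paired data gives r = +1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
# Perfectly decreasing paired data gives r = -1
print(pearson_r([1, 2, 3], [3, 2, 1]))  # -1.0
```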
Scatter Plots of Data with Various
Correlation Coefficients
[Six scatterplots: top row r = -1, r = -.625, r = 0; bottom row r = 1, r = .351, r = 0]
Notation:
ρ = population linear correlation coefficient
Hypotheses:
H0: ρ = 0 (there is no linear correlation)
H1: ρ ≠ 0 (there is a linear correlation)
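One common way to carry out this test uses the t statistic t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom. A minimal Python sketch, where the sample size n = 30 is hypothetical and chosen only for illustration (it is not the sample size behind the output below):

```python
import math

def correlation_t(r, n):
    """t statistic for testing H0: rho = 0; degrees of freedom = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical n = 30, with r = -0.650 as in the BO - JS example
t = correlation_t(-0.650, 30)
print(round(t, 2))  # about -4.53, far beyond typical critical values, so reject H0
```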
Scatterplot
Assumption Check and Correlation Analysis

Shapiro-Wilk Test for Bivariate Normality
          Shapiro-Wilk    p
BO - JS   0.991           0.229

Pearson's Correlation
          Pearson's r     p
BO - JS   -0.650          < .001
•Here are three of the most common errors made when interpreting results
involving correlation:
•Assuming that correlation implies causality. Know that correlation does not
imply causality. We should not make any conclusion that includes a statement
about a cause-effect relationship between the two variables. Just because two
variables are correlated does not mean that one variable causes the other variable
to change.
•Using data based on averages
•Ignoring the possibility of a nonlinear relationship.
Regression Analysis
Used to predict the value of a dependent
variable (ŷ) based on the value of at least one
independent variable (x).
Simple linear regression
– only one independent
variable
Multiple regression
– two or more independent
variables
•Once we have identified two variables that are correlated, we would like to
model this relationship, using one variable as
a predictor or explanatory variable (x) to explain the other variable,
the response or outcome variable (y or ŷ). Dependent variable: the
variable we wish to explain. Independent variable: the variable used to explain
the dependent variable.
•A correlation analysis provides information on the strength and direction of
the linear relationship between two variables, while a simple linear regression
analysis estimates the parameters of a linear equation that can be used
to predict values of one variable based on the other.
•Simple regression is bivariate (2 variables: 1 dependent and 1 independent);
multiple regression is multivariate. Simple linear regression is directly tied to
correlation: in the simple case, R² = r².
Regression Line
The graph of the regression equation is
called the regression line (or line of best
fit, or least squares line).
Regression Equation
Given a collection of paired data, the
regression equation is
ŷ = b0 + b1x
•Here, we find the equation of the straight line that best fits the paired sample
data and that algebraically describes the relationship between two variables. The
best-fitting straight line is called a regression line and its equation is called the
regression equation.
•The typical equation of a straight line, y = mx + b, is here expressed in the form
ŷ = b0 + b1x.
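The least-squares estimates are b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b0 = ȳ − b1x̄. A small Python sketch of this calculation, for illustration only:

```python
def least_squares(x, y):
    """Intercept b0 and slope b1 of the least-squares regression line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx            # slope
    b0 = mean_y - b1 * mean_x  # intercept
    return b0, b1

# Data lying exactly on the line y = 1 + 2x recovers b0 = 1, b1 = 2
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```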
Notation for Regression Equation

                         Population     Sample
                         Parameter      Statistic
y-intercept of
regression equation      β0             b0
Slope of regression
equation                 β1             b1
Equation of the
regression line          y = β0 + β1x   ŷ = b0 + b1x
•In order to find out how well the regression model can predict or
explain the dependent variable, we can refer to the coefficient of
determination.
•In simple linear regression where there is only 1 independent variable, R2 = r2
•A quantitative measure of the explanatory power of a model: an R² close to
zero indicates a model with very little explanatory power; an R² close to one
indicates a model with greater explanatory power and a better fit.
•For example, if the R2 is 0.9, it indicates that 90% of the variation in the
dependent variable is explained by the independent variable.
Example: Job Satisfaction
What proportion of the variation in job
satisfaction can be explained by the variation in
burnout score?
Model Summary - JS
Model R R² Adjusted R² RMSE
H₀ 0.000 0.000 0.000 29.319
H₁ 0.650 0.423 0.420 22.334
R2 = ______
Explained variation: _______
Unexplained variation: _______
•We conclude that burnout explains 42.3% of the variation in job satisfaction
and therefore, 57.7% of the variation in job satisfaction cannot be explained by
the variation in burnout score.
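That arithmetic can be checked directly from Pearson's r = −0.650, since R² = r² in simple linear regression. A quick Python sketch:

```python
r = -0.650                    # Pearson's r for burnout (BO) vs job satisfaction (JS)
r_squared = r ** 2            # R² = r² in simple linear regression: 0.4225 ≈ 0.423
explained = r_squared         # proportion of variation explained (about 42.3%)
unexplained = 1 - r_squared   # proportion unexplained (about 57.7%)
```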
•Adjusted R-squared is a modified version of R-squared that has been adjusted
for the number of predictors in the model.
•Root Mean Square Error (RMSE) is the standard deviation of the
residuals (prediction errors).
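A sketch of RMSE as described. Note that this version divides by n; some software instead divides by the residual degrees of freedom (n − 2), which gives slightly larger values in small samples:

```python
import math

def rmse(y, y_hat):
    """Root mean square error: the spread of the residuals (prediction errors)."""
    residuals = [yi - yhi for yi, yhi in zip(y, y_hat)]
    return math.sqrt(sum(e ** 2 for e in residuals) / len(residuals))

print(rmse([1, 2, 3], [1, 2, 3]))  # 0.0 (perfect predictions)
print(rmse([3, -3], [0, 0]))       # 3.0
```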
Predicting Values of ŷ
Regression equation:
ŷ = ____________________
R² = 0.561    Unexplained variation = 43.9%
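Prediction itself is just substitution into ŷ = b0 + b1x. A sketch with hypothetical coefficients (b0 = 10.0 and b1 = 2.5 are made up for illustration; they are not the coefficients from the example above):

```python
def predict(b0, b1, x):
    """Predicted value y-hat = b0 + b1 * x."""
    return b0 + b1 * x

# Hypothetical coefficients, for illustration only
y_hat = predict(10.0, 2.5, 4.0)
print(y_hat)  # 20.0
```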