The Truth About Linear Regression

The document discusses misconceptions about linear regression, emphasizing that correlation does not imply causation and the complexities of interpreting regression coefficients. It highlights issues such as collinearity, omitted variables, and errors in variable measurement that can distort results. Additionally, it addresses the challenges of setting up a regression model and the importance of theoretical justification for interactions among variables.

“The Truth About Linear Regression”

Correlation doesn't imply causation, but it does waggle its eyebrows suggestively and
gesture furtively while mouthing 'look over there'.
xkcd #552

Content from Cosma Shalizi, Advanced Data Analysis from an Elementary Point of View
SML310: Research Projects in Data Science, Fall 2018
Michael Guerzhoy

“Lies about Linear Regression”
• Because a variable has a significant regression
coefficient, it must influence the response
• Because a variable has an insignificant regression
coefficient, it must not influence the response
• If the input variables change, we can predict how
much the response will change by plugging the new
values into the regression

Collinearity
• Two predictor variables are correlated (e.g., weight
and height)
• We will be uncertain about the coefficients for both
weight and height
• Could make weight matter less and height matter more and
vice versa
• We cannot say “a 1cm increase in height is associated
with a 0.1 increase in GPA”
• Ways to handle
• Remove redundant variables (dangerous)
• PCA (possibly discussed later)
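
The uncertainty described above can be seen in a small simulation. This is a minimal sketch under assumed settings, not part of the original slides: the sample size, the correlation of 0.95, and the helper names `fit_two_predictors` and `slope_spread` are all hypothetical choices made for illustration. It refits the same two-predictor regression on many simulated datasets and measures how much the estimated coefficient of the first predictor varies.

```python
import random
import statistics

def fit_two_predictors(x1, x2, y):
    """OLS slopes for y ~ 1 + x1 + x2, from the centered normal equations."""
    m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

def slope_spread(rho, n=50, reps=300, seed=0):
    """Std. dev. of the estimated x1 slope across simulated datasets."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # x2 has unit variance and correlation rho with x1 (assumed setup)
        x2 = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in x1]
        # Both true slopes are 1 in this made-up example
        y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        estimates.append(fit_two_predictors(x1, x2, y)[0])
    return statistics.stdev(estimates)

spread_uncorrelated = slope_spread(rho=0.0)
spread_collinear = slope_spread(rho=0.95)
print("spread of x1 slope, uncorrelated predictors:", spread_uncorrelated)
print("spread of x1 slope, correlated predictors:  ", spread_collinear)
```

With strongly correlated predictors the individual slope estimates swing much more from sample to sample, even though the data-generating process is otherwise identical.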

Omitted variables
• Variables that are not measured, but predict the
response
• Will bias the coefficient estimates of the included
variables they are correlated with
• Can make correlation look like causation
• Examples?

Omitted variables
• The amount of ice cream consumed in a day is
correlated with the number of drownings
• Lurking variable: the weather
• Including omitted variables: “controlling for the
variables”
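
A small simulation can illustrate the ice cream example. This is a sketch with made-up numbers, not real data: temperature is assumed to drive both ice cream consumption and drownings, and ice cream is given no effect on drownings at all. The variable names and effect sizes are hypothetical.

```python
import random
import statistics

rng = random.Random(1)
n = 2000
temp = [rng.gauss(0, 1) for _ in range(n)]              # standardized daily temperature
ice_cream = [2.0 * t + rng.gauss(0, 1) for t in temp]   # driven by temperature
drownings = [0.5 * t + rng.gauss(0, 1) for t in temp]   # driven by temperature only

def centered(v):
    m = statistics.fmean(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z, y = centered(ice_cream), centered(temp), centered(drownings)

# Regression of drownings on ice cream alone (temperature omitted)
naive_slope = dot(x, y) / dot(x, x)

# Regression of drownings on ice cream and temperature (temperature controlled for)
det = dot(x, x) * dot(z, z) - dot(x, z) ** 2
controlled_slope = (dot(z, z) * dot(x, y) - dot(x, z) * dot(z, y)) / det

print("ice cream slope, weather omitted:   ", naive_slope)
print("ice cream slope, weather controlled:", controlled_slope)
```

When the weather is left out, ice cream picks up a clearly positive coefficient; once temperature is included, the ice cream coefficient falls to roughly zero, which is what the simulated data-generating process implies.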

Omitted variables: tricky cases
• Ronald Fisher (one of the founders of the field of
Statistics) remained unconvinced by observational
studies that showed association between smoking
and lung cancer because of possible lurking
variables
• Suggested genetics might cause both smoking and lung
cancer
• Suggested illness might cause people to take up smoking
• (Accepted funding from tobacco companies; seemed to
be ideologically opposed to public health campaigns in
general)
• Is widely considered to have been wrong

Omitted variables: tricky cases
• The gender wage gap
• Do you control for having children? How?
• Do you control for the field of employment?
• The Harvard admissions lawsuit
• E.g., do you control for the interview score?
• Harvard and the plaintiffs submitted statistical analyses, arguing
(among other things) for different controls
• Generally, more controls lead to a smaller estimated effect size
• Sometimes, controlling for a variable can be inappropriate because
the variable and the outcome basically measure the same thing
• Whether it’s important or trivial that one of the variables predicts the
outcome well depends on the situation
• Obviously, both of those are complex issues to which one
slide deck cannot do justice
• And most of the issues are not necessarily statistical

Errors in variables
• Input variables measured imprecisely
• The relationship between family income and school
performance is often explored
• But what’s measured is the reported family income
• Tends to obscure the true relationship (and push
the coefficients (effect sizes) toward 0)
• Makes sense: more noise means it’s harder to detect the
trend
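
The shrinkage toward 0 can be checked with a short simulation. This is a sketch under the classical measurement-error assumption (noise independent of the true value), in which the estimated slope shrinks by roughly the factor var(true) / (var(true) + var(noise)). The numbers below, including a true slope of 2 and equal variances for income and reporting noise, are hypothetical.

```python
import random
import statistics

rng = random.Random(2)
n = 5000
true_income = [rng.gauss(0, 1) for _ in range(n)]
performance = [2.0 * x + rng.gauss(0, 1) for x in true_income]  # assumed true slope: 2
reported_income = [x + rng.gauss(0, 1) for x in true_income]    # noisy self-report

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

slope_true = slope(true_income, performance)
slope_reported = slope(reported_income, performance)
print("slope using true income:    ", slope_true)
print("slope using reported income:", slope_reported)
```

With equal variances for the true value and the noise, the attenuation factor is about 1/2, so the slope on reported income comes out near 1 rather than 2.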

Significant coefficients
• Any coefficient that is not exactly zero will be
statistically significant if the sample size is
large enough, however small the effect
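
To see this numerically, the sketch below fits the same simple regression at two sample sizes and reports the t-statistic of the slope. The true slope of 0.05 (a tiny effect relative to noise with standard deviation 1), the sample sizes, and the helper name `slope_t_statistic` are assumptions chosen for illustration.

```python
import math
import random
import statistics

def slope_t_statistic(n, true_slope=0.05, seed=3):
    """t-statistic of the OLS slope for simulated data y = true_slope * x + noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [true_slope * a + rng.gauss(0, 1) for a in x]
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b = sxy / sxx
    a0 = my - b * mx
    # Residual sum of squares and the usual standard error of the slope
    rss = sum((yi - a0 - b * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return b / se

t_small = slope_t_statistic(100)
t_large = slope_t_statistic(200_000)
print("t-statistic with n = 100:    ", t_small)
print("t-statistic with n = 200000: ", t_large)
```

The effect is identical in both runs; only the sample size changes. The t-statistic grows roughly like the square root of n, so a large enough sample makes even a negligible effect "significant".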

Setting up a regression model
• Identify variables that could conceivably influence
the response
• Are there lurking variables?
• Can you theoretically justify interactions?
• Would you want to have a hypothesis that involves the
presence of interactions?
• Do model checking

• Example of an interaction?
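
One hypothetical illustration, not taken from the slides: suppose the effect of study hours on an exam score depends on whether a student is tutored. The sketch below simulates such data with made-up coefficients. With a binary variable, the interaction coefficient in score ~ hours + tutored + hours × tutored equals the difference between the slopes fitted separately in the two groups, so that difference is computed directly.

```python
import random
import statistics

rng = random.Random(4)
n = 2000
hours = [rng.uniform(0, 10) for _ in range(n)]
tutored = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
# Assumed model: score = 50 + 2*hours + 5*tutored + 3*hours*tutored + noise
score = [50 + 2 * h + 5 * t + 3 * h * t + rng.gauss(0, 5)
         for h, t in zip(hours, tutored)]

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def group_slope(group):
    xs = [h for h, t in zip(hours, tutored) if t == group]
    ys = [s for s, t in zip(score, tutored) if t == group]
    return slope(xs, ys)

slope_untutored = group_slope(0)
slope_tutored = group_slope(1)
interaction = slope_tutored - slope_untutored
print("slope of score on hours, untutored:", slope_untutored)
print("slope of score on hours, tutored:  ", slope_tutored)
print("estimated interaction:             ", interaction)
```

Here an extra hour of study is worth more to tutored students than to untutored ones, so no single "effect of hours" describes both groups.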
