6 - Dummy-Variable Regression


Sociology 740 Lecture Notes
John Fox

5. Dummy-Variable Regression

Copyright © 2009 by John Fox

1. Introduction
• One of the limitations of multiple-regression analysis is that it accommodates
only quantitative explanatory variables.
• Dummy-variable regressors can be used to incorporate qualitative
explanatory variables into a linear model, substantially expanding the
range of application of regression analysis.

2. Goals:
• To show how dummy regressors can be used to represent the categories
of a qualitative explanatory variable in a regression model.
• To introduce the concept of interaction between explanatory variables,
and to show how interactions can be incorporated into a regression
model by forming interaction regressors.
• To introduce the principle of marginality, which serves as a guide to
constructing and testing terms in complex linear models.
• To show how incremental F-tests are employed to test terms in dummy
regression models.

3. A Dichotomous Explanatory Variable
• The simplest case: one dichotomous and one quantitative explanatory
variable.
• Assumptions:
– Relationships are additive: the partial effect of each explanatory
variable is the same regardless of the specific value at which the other
explanatory variable is held constant.
– The other assumptions of the regression model hold.
• The motivation for including a qualitative explanatory variable is the
same as for including an additional quantitative explanatory variable:
– to account more fully for the response variable, by making the errors
smaller; and
– to avoid a biased assessment of the impact of an explanatory variable,
as a consequence of omitting another explanatory variable that is
related to it.
• Figure 1 represents idealized examples, showing the relationship
between education and income among women and men.
– In both cases, the within-gender regressions of income on education
are parallel. Parallel regressions imply additive effects of education
and gender on income.
– In (a), gender and education are unrelated to each other: If we ignore
gender and regress income on education alone, we obtain the same
slope as is produced by the separate within-gender regressions;
ignoring gender inflates the size of the errors, however.
– In (b), gender and education are related, and therefore if we regress
income on education alone, we arrive at a biased assessment of
the effect of education on income. The overall regression of income
on education has a negative slope even though the within-gender
regressions have positive slopes.

[Figure 1: panels (a) and (b), each plotting Income against Education, with separate regression lines for Men and Women.]
Figure 1. In both cases the within-gender regressions of income on education
are parallel: in (a) gender and education are unrelated; in (b) women
have higher average education than men.
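
The bias described for Figure 1 (b) can be illustrated numerically. The following is a minimal sketch on made-up, noise-free data; the parameter values and education ranges are assumptions chosen only for illustration and do not come from these notes. The within-gender slopes are positive, yet the pooled regression that ignores gender has a negative slope.

```python
import numpy as np

# Hypothetical, noise-free data in the spirit of Figure 1(b):
# women have more education on average, men have the higher intercept.
alpha, beta, gamma = 10.0, 1.0, 20.0        # assumed illustrative values

educ_women = np.arange(12.0, 17.0)          # 12, 13, ..., 16
educ_men = np.arange(8.0, 13.0)             # 8, 9, ..., 12
income_women = alpha + beta * educ_women
income_men = (alpha + gamma) + beta * educ_men

# Within-gender slopes (for degree 1, np.polyfit returns the slope first).
slope_women = np.polyfit(educ_women, income_women, 1)[0]
slope_men = np.polyfit(educ_men, income_men, 1)[0]

# Pooled regression of income on education, ignoring gender.
educ = np.concatenate([educ_women, educ_men])
income = np.concatenate([income_women, income_men])
slope_pooled = np.polyfit(educ, income, 1)[0]

print(f"within-gender slopes: {slope_women:.3f}, {slope_men:.3f}")
print(f"pooled slope:         {slope_pooled:.3f}")
```

With these particular numbers the sign reversal is exaggerated on purpose; with real data the bias from omitting gender may be much smaller and need not change the sign of the slope.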

• We could perform separate regressions for women and men. This
approach is reasonable, but it has its limitations:
– Fitting separate regressions makes it difficult to estimate and test for
gender differences in income.
– Furthermore, if we can assume parallel regressions, then we can more
efficiently estimate the common education slope by pooling sample
data from both groups.

3.1 Introducing a Dummy Regressor
• One way of formulating the common-slope model is
$$Y_i = \alpha + \beta X_i + \gamma D_i + \varepsilon_i$$
where $D$, called a dummy-variable regressor or an indicator variable, is
coded 1 for men and 0 for women:
$$D_i = \begin{cases} 1 & \text{for men} \\ 0 & \text{for women} \end{cases}$$
– Thus, for women the model becomes
$$Y_i = \alpha + \beta X_i + \gamma(0) + \varepsilon_i = \alpha + \beta X_i + \varepsilon_i$$
– and for men
$$Y_i = \alpha + \beta X_i + \gamma(1) + \varepsilon_i = (\alpha + \gamma) + \beta X_i + \varepsilon_i$$
• These regression equations are graphed in Figure 2.
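
As a rough illustration of the common-slope model, the sketch below simulates data from $Y_i = \alpha + \beta X_i + \gamma D_i + \varepsilon_i$ and estimates the parameters by ordinary least squares. The sample size, parameter values, and variable names are assumptions made for the example; none of them come from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
alpha, beta, gamma = 10.0, 2.0, 5.0        # assumed "true" values

education = rng.uniform(8, 16, size=n)     # quantitative regressor X
male = rng.integers(0, 2, size=n)          # dummy regressor D: 1 = men, 0 = women
income = alpha + beta * education + gamma * male + rng.normal(0, 1, size=n)

# Design matrix: constant, X, and D.
X = np.column_stack([np.ones(n), education, male])
coef, *_ = np.linalg.lstsq(X, income, rcond=None)
a_hat, b_hat, g_hat = coef

print(f"intercept for women (alpha):       {a_hat:.3f}")
print(f"common education slope (beta):     {b_hat:.3f}")
print(f"men-women intercept gap (gamma):   {g_hat:.3f}")
print(f"intercept for men (alpha + gamma): {a_hat + g_hat:.3f}")
```

Because both groups contribute to the single slope estimate, this pooled fit uses all $n$ observations to estimate the education slope, which is the efficiency argument made above for parallel regressions.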
[Figure 2: Y plotted against X, with parallel regression lines for D = 1 and D = 0.]
Figure 2. The parameters in the additive dummy-regression model.

3.2 Regressors vs. Explanatory Variables
• This is our initial encounter with an idea that is fundamental to many
linear models: the distinction between explanatory variables and
regressors.
– Here, gender is a qualitative explanatory variable, with categories
male and female.
– The dummy variable $D$ is a regressor, representing the explanatory
variable gender.
– In contrast, the quantitative explanatory variable education and the
regressor $X$ are one and the same.
• We will see later that an explanatory variable can give rise to several
regressors, and that some regressors are functions of more than one
explanatory variable.

3.3 How and Why Dummy Regression Works
• Interpretation of parameters in the additive dummy-regression model:
– $\gamma$ gives the difference in intercepts for the two regression lines.
Because these regression lines are parallel, $\gamma$ also represents the
constant separation between the lines: the expected income
advantage accruing to men when education is held constant.
If men were disadvantaged relative to women, then $\gamma$ would be
negative.
– $\alpha$ gives the intercept for women, for whom $D = 0$.
– $\beta$ is the common within-gender education slope.
• Figure 3 reveals the fundamental geometric ‘trick’ underlying the coding
of a dummy regressor:
– We are, in fact, fitting a regression plane to the data, but the dummy
regressor is defined only at the values zero and one.

[Figure 3: Y plotted against X and D, showing the regression plane.]
Figure 3. The regression ‘plane’ underlying the additive dummy-regression
model.

• Essentially similar results are obtained if we code $D$ zero for men and
one for women (Figure 4):
– The sign of $\gamma$ is reversed, but its magnitude remains the same.
– The coefficient $\alpha$ now gives the income intercept for men.
– It is therefore immaterial which group is coded one and which is coded
zero.
• This method can be applied to any number of quantitative variables, as
long as we are willing to assume that the slopes are the same in the
two categories of the dichotomous explanatory variable (i.e., parallel
regression surfaces):
$$Y_i = \alpha + \beta_1 X_{i1} + \cdots + \beta_k X_{ik} + \gamma D_i + \varepsilon_i$$
– For $D = 0$ we have
$$Y_i = \alpha + \beta_1 X_{i1} + \cdots + \beta_k X_{ik} + \varepsilon_i$$
– and for $D = 1$
$$Y_i = (\alpha + \gamma) + \beta_1 X_{i1} + \cdots + \beta_k X_{ik} + \varepsilon_i$$

[Figure 4: Y plotted against X, with parallel regression lines for D = 0 and D = 1.]
Figure 4. Parameters corresponding to alternative coding $D = 0$ for men
and $D = 1$ for women.
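
The claim that the choice of coding is immaterial can be checked directly. The sketch below, on simulated data with assumed parameter values, fits the additive model twice, once with men coded 1 and once with women coded 1, and compares the two sets of estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
education = rng.uniform(8, 16, size=n)
male = rng.integers(0, 2, size=n)          # D = 1 for men, 0 for women
income = 10 + 2 * education + 5 * male + rng.normal(0, 1, size=n)

def fit(dummy):
    """Least-squares fit of income on a constant, education, and a dummy."""
    X = np.column_stack([np.ones(n), education, dummy])
    coef, *_ = np.linalg.lstsq(X, income, rcond=None)
    return coef, X @ coef

(a1, b1, g1), fitted1 = fit(male)          # men coded 1
(a2, b2, g2), fitted2 = fit(1 - male)      # women coded 1

print(f"gamma, men coded 1:   {g1:.3f}")
print(f"gamma, women coded 1: {g2:.3f}")           # same magnitude, opposite sign
print(f"intercepts: {a1:.3f} (women), {a2:.3f} (men)")
print("identical fitted values:", np.allclose(fitted1, fitted2))
```

Under the second coding the intercept equals $\alpha + \gamma$ from the first coding, the education slope is unchanged, and the fitted values are the same.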

4. Polytomous Explanatory Variables
• Recall the regression of the rated prestige of 102 Canadian occupations
on their income and education levels.
– I have classified 98 of the occupations into three categories: (1)
professional and managerial; (2) ‘white-collar’; and (3) ‘blue-collar’.
– The three-category classification can be represented in the regression
equation by introducing two dummy regressors:

Category                     $D_1$  $D_2$
Professional & Managerial      1      0
White Collar                   0      1
Blue Collar                    0      0

– The regression model is then
$$Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \gamma_1 D_{i1} + \gamma_2 D_{i2} + \varepsilon_i$$
where $X_1$ is income and $X_2$ is education.
– This model describes three parallel regression planes, which can differ
in their intercepts (see Figure 5):
Professional: $Y_i = (\alpha + \gamma_1) + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$
White Collar: $Y_i = (\alpha + \gamma_2) + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$
Blue Collar: $Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$
$\alpha$ gives the intercept for blue-collar occupations.
$\gamma_1$ represents the constant vertical difference between the parallel
regression planes for professional and blue-collar occupations (fixing
the values of education and income).
$\gamma_2$ represents the constant vertical distance between the regression
planes for white-collar and blue-collar occupations.
– Blue-collar occupations are coded 0 for both dummy regressors,
so ‘blue collar’ serves as a baseline category with which the other
occupational categories are compared.
– The choice of a baseline category is usually arbitrary, for we would
fit the same three regression planes regardless of which of the three
categories is selected for this role.
• Because the choice of baseline is arbitrary, we want to test the null
hypothesis of no partial effect of occupational type,
$$H_0: \gamma_1 = \gamma_2 = 0$$
but the individual hypotheses $H_0: \gamma_1 = 0$ and $H_0: \gamma_2 = 0$ are of less
interest.
– The hypothesis $H_0: \gamma_1 = \gamma_2 = 0$ can be tested by the
incremental-sum-of-squares approach.

[Figure 5: Y plotted against X1 and X2, showing three parallel regression planes.]
Figure 5. The additive dummy-regression model showing three parallel
regression planes.

4.1 How Many Dummy Regressors Are Needed?
• It may seem more natural to code three dummy regressors:

Category                     $D_1$  $D_2$  $D_3$
Professional & Managerial      1      0      0
White Collar                   0      1      0
Blue Collar                    0      0      1

– Then, for the $j$th occupational type, we would have
$$Y_i = (\alpha + \gamma_j) + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$$
• The problem with this procedure is that there are too many parameters:
– We have used four parameters ($\alpha, \gamma_1, \gamma_2, \gamma_3$) to represent only three
group intercepts.
– We could not find unique values for these four parameters even if we
knew the three population regression lines.
– Likewise, we cannot calculate unique least-squares estimates for the
model, since the set of three dummy variables is perfectly collinear
(together with the constant regressor):
$D_3 = 1 - D_1 - D_2$.
• For a polytomous explanatory variable with $m$ categories, we code $m - 1$
dummy regressors.
– One simple scheme is to select the last category as the baseline,
and to code $D_{ij} = 1$ when observation $i$ falls in category $j$, and 0
otherwise:

Category   $D_1$  $D_2$  ···  $D_{m-1}$
1            1      0    ···     0
2            0      1    ···     0
·            ·      ·            ·
·            ·      ·            ·
·            ·      ·            ·
m − 1        0      0    ···     1
m            0      0    ···     0
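
The collinearity problem and the $m - 1$ coding scheme can both be seen in a small design matrix. The sketch below uses a handful of hypothetical observations and a helper function, `dummy_code`, that is an illustrative assumption rather than part of any library. It compares the rank of the design matrix with three dummy regressors to the rank with two.

```python
import numpy as np

categories = ["professional", "white collar", "blue collar"]   # baseline last
# A few hypothetical observations, identified by occupational type.
occ_type = np.array(["professional", "blue collar", "white collar",
                     "white collar", "professional", "blue collar"])

def dummy_code(values, levels, drop_baseline=True):
    """Return 0/1 dummy columns; the last level is the baseline if dropped."""
    used = levels[:-1] if drop_baseline else levels
    return np.column_stack([(values == level).astype(float) for level in used])

ones = np.ones((len(occ_type), 1))
X_three = np.hstack([ones, dummy_code(occ_type, categories, drop_baseline=False)])
X_two = np.hstack([ones, dummy_code(occ_type, categories)])

D1, D2, D3 = X_three[:, 1], X_three[:, 2], X_three[:, 3]
print("D3 == 1 - D1 - D2:", np.array_equal(D3, 1 - D1 - D2))
print("three dummies: columns =", X_three.shape[1],
      "rank =", np.linalg.matrix_rank(X_three))
print("two dummies:   columns =", X_two.shape[1],
      "rank =", np.linalg.matrix_rank(X_two))
```

With three dummies plus the constant there are four columns but only rank 3, so the least-squares coefficients are not unique; dropping the baseline column restores full column rank.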
– When there is more than one qualitative explanatory variable with
additive effects, we can code a set of dummy regressors for each.
– To test the hypothesis that the effects of a qualitative explanatory
variable are nil, delete its dummy regressors from the model and
compute an incremental F-test.
• The regression of prestige on income and education:
$$\hat{Y} = \underset{(3.116)}{-7.621} + \underset{(0.000219)}{0.001241}\,X_1 + \underset{(0.336)}{4.292}\,X_2 \qquad R^2 = .81400$$
– Inserting dummy variables for type of occupation into the regression
equation produces the following results:
$$\hat{Y} = \underset{(5.2275)}{-0.6229} + \underset{(0.000221)}{0.001013}\,X_1 + \underset{(0.641)}{3.673}\,X_2 + \underset{(3.867)}{6.039}\,D_1 - \underset{(2.514)}{2.737}\,D_2 \qquad R^2 = .83486$$
– The three fitted regression equations are:
Professional: $\hat{Y} = 5.416 + 0.001013 X_1 + 3.673 X_2$
White collar: $\hat{Y} = -3.360 + 0.001013 X_1 + 3.673 X_2$
Blue collar: $\hat{Y} = -0.623 + 0.001013 X_1 + 3.673 X_2$
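
The incremental F-test for type of occupation can be worked through from the two $R^2$ values reported above. The sketch below assumes the usual $R^2$ form of the incremental F-statistic, $F_0 = \frac{n - k - 1}{q} \times \frac{R_1^2 - R_0^2}{1 - R_1^2}$, where $k$ counts the regressors in the larger model and $q$ the regressors deleted; the variable names are illustrative. It also recomputes the group intercepts from the fitted coefficients.

```python
# Values reported above for the 98 classified occupations.
n = 98
r2_without_type = 0.81400   # prestige on income and education only
r2_with_type = 0.83486      # after adding the dummy regressors D1 and D2
k = 4                       # regressors in the larger model: X1, X2, D1, D2
q = 2                       # dummy regressors deleted under the null hypothesis

f0 = ((n - k - 1) / q) * (r2_with_type - r2_without_type) / (1 - r2_with_type)
print(f"F0 = {f0:.3f} on {q} and {n - k - 1} degrees of freedom")

# Group intercepts implied by the fitted dummy-regression coefficients.
intercept, g1, g2 = -0.6229, 6.039, -2.737
professional = intercept + g1      # D1 = 1, D2 = 0
white_collar = intercept + g2      # D1 = 0, D2 = 1
blue_collar = intercept            # baseline: D1 = D2 = 0
print(f"professional {professional:.3f}, white collar {white_collar:.3f}, "
      f"blue collar {blue_collar:.3f}")
```

The p-value is not computed here; it would require the F distribution with 2 and 93 degrees of freedom from a statistics library.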

– To test the null hypothesis of no partial effect of type of occupation,
$$H_0: \gamma_1 = \gamma_2 = 0$$
calculate the incremental F-statistic
$$F_0 = \frac{n - k - 1}{q} \times \frac{R_1^2 - R_0^2}{1 - R_1^2} = \frac{98 - 4 - 1}{2} \times \frac{.83486 - .81400}{1 - .83486} = 5.874$$
with 2 and 93 degrees of freedom, for which $p = .0040$.
Here $R_1^2$ and $R_0^2$ are the squared multiple correlations for the models
with and without the dummy regressors, $k = 4$ is the number of regressors
in the larger model, and $q = 2$ is the number of dummy regressors deleted.

5. Modeling Interactions
• Two explanatory variables interact in determining a response variable
when the partial effect of one depends on the value of the other.
– Additive models specify the absence of interactions.
– If the regressions in different categories of a qualitative explanatory
variable are not parallel, then the qualitative explanatory variable
interacts with one or more of the quantitative explanatory variables.
– The dummy-regression model can be modified to reflect interactions.
• Consider the hypothetical data in Figure 6 (and contrast these examples
with those shown in Figure 1, where the effects of gender and education
were additive):
– In (a), gender and education are independent, since women and men
have identical education distributions.
– In (b), gender and education are related, since women, on average,
have higher levels of education than men.
– In both (a) and (b), the within-gender regressions of income on
education are not parallel: the slope for men is larger than the slope
for women.
Because the effect of education varies by gender, education and
gender interact in affecting income.
– It is also the case that the effect of gender varies by education.
Because the regressions are not parallel, the relative income advantage
of men changes with education.
Interaction is a symmetric concept: the effect of education varies
by gender, and the effect of gender varies by education.

[Figure 6: panels (a) and (b), each plotting Income against Education, with separate non-parallel regression lines for Men and Women.]
Figure 6. In both cases, gender and education interact in determining
income. In (a) gender and education are independent; in (b) women on
average have more education than men.

• These examples illustrate another important point: Interaction and
correlation of explanatory variables are empirically and logically distinct
phenomena.
– Two explanatory variables can interact whether or not they are related
to one another statistically.
– Interaction refers to the manner in which explanatory variables
combine to affect a response variable, not to the relationship between
the explanatory variables themselves.

5.1 Constructing Interaction Regressors
• We could model the data in the example by fitting separate regressions
of income on education for women and men.
– A combined model facilitates a test of the gender-by-education
interaction, however.
– A properly formulated unified model that permits different intercepts
and slopes in the two groups produces the same fit as separate
regressions.
• The following model accommodates different intercepts and slopes for
women and men:
$$Y_i = \alpha + \beta X_i + \gamma D_i + \delta(X_i D_i) + \varepsilon_i$$
– Along with the dummy regressor $D$ for gender and the quantitative
regressor $X$ for education, I have introduced the interaction regressor
$XD$.
– The interaction regressor is the product of the other two regressors:
$XD$ is a function of $X$ and $D$, but it is not a linear function, avoiding
perfect collinearity.
– For women,
$$Y_i = \alpha + \beta X_i + \gamma(0) + \delta(X_i \cdot 0) + \varepsilon_i = \alpha + \beta X_i + \varepsilon_i$$
– and for men,
$$Y_i = \alpha + \beta X_i + \gamma(1) + \delta(X_i \cdot 1) + \varepsilon_i = (\alpha + \gamma) + (\beta + \delta) X_i + \varepsilon_i$$
• These regression equations are graphed in Figure 7:
– $\alpha$ and $\beta$ are the intercept and slope for the regression of income on
education among women.
– $\gamma$ gives the difference in intercepts between the male and female
groups.
– $\delta$ gives the difference in slopes between the two groups.

[Figure 7: Y plotted against X, with non-parallel regression lines for D = 1 and D = 0.]
Figure 7. The parameters in the dummy-regression model with interaction.
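
The statement that the unified interaction model reproduces the separate within-group regressions can be verified numerically. The sketch below simulates data with assumed parameter values, fits the model with regressors $X$, $D$, and $XD$, and compares the implied group intercepts and slopes with those from two separate simple regressions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 120
education = rng.uniform(8, 16, size=n)             # X
male = rng.integers(0, 2, size=n)                  # D: 1 = men, 0 = women
income = (5 + 1.5 * education + 2 * male + 1.0 * education * male
          + rng.normal(0, 1, size=n))              # assumed parameter values

# Combined model: constant, X, D, and the interaction regressor X*D.
X = np.column_stack([np.ones(n), education, male, education * male])
(alpha, beta, gamma, delta), *_ = np.linalg.lstsq(X, income, rcond=None)

# Separate simple regressions within each group (slope, then intercept).
slope_w, int_w = np.polyfit(education[male == 0], income[male == 0], 1)
slope_m, int_m = np.polyfit(education[male == 1], income[male == 1], 1)

print(f"women: combined ({alpha:.3f}, {beta:.3f})  separate ({int_w:.3f}, {slope_w:.3f})")
print(f"men:   combined ({alpha + gamma:.3f}, {beta + delta:.3f})  "
      f"separate ({int_m:.3f}, {slope_m:.3f})")
```

The advantage of the combined fit is that $\delta$ appears as a single coefficient, so the interaction can be tested directly.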

To test for interaction, we can test the hypothesis $H_0: \delta = 0$.
• In the additive, no-interaction model, $\gamma$ represented the unique partial
effect of gender, while the slope $\beta$ represented the unique partial effect
of education.
– In the interaction model, $\gamma$ is no longer interpretable as the unqualified
income difference between men and women of equal education:
$\gamma$ is now the income difference at $X = 0$.
– Likewise, in the interaction model, $\beta$ is not the unqualified partial effect
of education, but rather the effect of education among women.
The effect of education among men ($\beta + \delta$) does not appear directly
in the model.

5.2 The Principle of Marginality
• The separate partial effects, or main effects, of education and gender
are marginal to the education-by-gender interaction.
• In general, we neither test nor interpret main effects of explanatory
variables that interact.
– If we can rule out interaction either on theoretical or empirical grounds,
then we can proceed to test, estimate, and interpret main effects.
• It does not generally make sense to specify and fit models that include
interaction regressors but that delete main effects that are marginal to
them.
– Such models, which violate the principle of marginality, are
interpretable, but they are not broadly applicable.
– Consider the model
$$Y_i = \alpha + \beta X_i + \delta(X_i D_i) + \varepsilon_i$$
As shown in Figure 8 (a), this model describes regression lines
for women and men that have the same intercept but (potentially)
different slopes, a specification that is peculiar and of no substantive
interest.
– Similarly, the model
$$Y_i = \alpha + \gamma D_i + \delta(X_i D_i) + \varepsilon_i$$
graphed in Figure 8 (b), constrains the slope for women to 0, which is
needlessly restrictive.

[Figure 8: panels (a) and (b), each plotting Y against X, with regression lines for D = 1 and D = 0.]
Figure 8. Two models that violate the principle of marginality, by including
the interaction regressor $XD$ but (a) omitting $D$ or (b) omitting $X$.

5.3 Interactions With Polytomous Explanatory Variables
• The method of modeling interactions by forming product regressors
is easily extended to polytomous explanatory variables, to several
qualitative explanatory variables, and to several quantitative explanatory
variables.
• For example, for the Canadian occupational prestige regression:
$$\begin{aligned}
Y_i = {} & \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \gamma_1 D_{i1} + \gamma_2 D_{i2} \\
& + \delta_{11} X_{i1} D_{i1} + \delta_{12} X_{i1} D_{i2} \\
& + \delta_{21} X_{i2} D_{i1} + \delta_{22} X_{i2} D_{i2} + \varepsilon_i
\end{aligned}$$
– We require one interaction regressor for each product of a dummy
regressor with a quantitative explanatory variable.
The regressors $X_1 D_1$ and $X_1 D_2$ capture the interaction between
income and occupational type;
$X_2 D_1$ and $X_2 D_2$ capture the interaction between education and
occupational type.
– The model permits different intercepts and slopes for the three types
of occupations:
Professional: $Y_i = (\alpha + \gamma_1) + (\beta_1 + \delta_{11}) X_{i1} + (\beta_2 + \delta_{21}) X_{i2} + \varepsilon_i$
White Collar: $Y_i = (\alpha + \gamma_2) + (\beta_1 + \delta_{12}) X_{i1} + (\beta_2 + \delta_{22}) X_{i2} + \varepsilon_i$
Blue Collar: $Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$
– Blue-collar occupations, coded 0 for both dummy regressors, serve
as the baseline for the intercepts and slopes of the other occupational
types.
– Fitting this model to the Canadian occupational prestige data produces
the following results:
$$\begin{aligned}
\hat{Y} = {} & \underset{(7.057)}{2.276} + \underset{(0.000556)}{0.003522}\,X_1 + \underset{(0.927)}{1.713}\,X_2 \\
& + \underset{(13.72)}{15.35}\,D_1 - \underset{(17.54)}{33.54}\,D_2 \\
& - \underset{(0.000599)}{0.002903}\,X_1 D_1 - \underset{(0.000894)}{0.002072}\,X_1 D_2 \\
& + \underset{(1.289)}{1.388}\,X_2 D_1 + \underset{(1.757)}{4.291}\,X_2 D_2 \\
R^2 = {} & .8747
\end{aligned}$$
– The regression equation for each group:
Professional: $\widehat{\text{Prestige}} = 17.63 + 0.000619 \times \text{Income} + 3.101 \times \text{Education}$
White-Collar: $\widehat{\text{Prestige}} = -31.26 + 0.001450 \times \text{Income} + 6.004 \times \text{Education}$
Blue-Collar: $\widehat{\text{Prestige}} = 2.276 + 0.003522 \times \text{Income} + 1.713 \times \text{Education}$
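
The three group equations follow from the fitted coefficients by setting the dummy regressors to the values for each occupational type. The short sketch below carries out that arithmetic; the dictionary keys and the helper `group_equation` are illustrative names, not part of the original analysis.

```python
# Coefficients of the fitted interaction model reported above.
b = {
    "const": 2.276, "X1": 0.003522, "X2": 1.713,
    "D1": 15.35, "D2": -33.54,
    "X1D1": -0.002903, "X1D2": -0.002072,
    "X2D1": 1.388, "X2D2": 4.291,
}

def group_equation(d1, d2):
    """Intercept, income slope, and education slope for dummy values (d1, d2)."""
    intercept = b["const"] + b["D1"] * d1 + b["D2"] * d2
    income_slope = b["X1"] + b["X1D1"] * d1 + b["X1D2"] * d2
    education_slope = b["X2"] + b["X2D1"] * d1 + b["X2D2"] * d2
    return intercept, income_slope, education_slope

codes = {"Professional": (1, 0), "White collar": (0, 1), "Blue collar": (0, 0)}
equations = {name: group_equation(d1, d2) for name, (d1, d2) in codes.items()}

for name, (a, b_inc, b_edu) in equations.items():
    print(f"{name:13s} intercept {a:8.3f}  income {b_inc:.6f}  education {b_edu:.3f}")
```

For example, the white-collar intercept is $2.276 - 33.54 = -31.26$ and the white-collar education slope is $1.713 + 4.291 = 6.004$.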

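5.4 Hypothesis Tests for Main Effects and Interactions
• To test the null hypothesis of no interaction between income and type,
$H_0: \delta_{11} = \delta_{12} = 0$, we need to delete the interaction regressors $X_1 D_1$ and
$X_1 D_2$ from the full model and calculate an incremental F-test.
– Likewise, to test the null hypothesis of no interaction between
education and type, $H_0: \delta_{21} = \delta_{22} = 0$, we delete the interaction
regressors $X_2 D_1$ and $X_2 D_2$ from the full model.
– These tests, and tests for the main effects of occupational type,
income, and education, are detailed in the following tables:

Model  Terms                                                      Parameters                                                                                                Regression Sum of Squares  df
1      Income, Education, Type, Income × Type, Education × Type   $\alpha, \beta_1, \beta_2, \gamma_1, \gamma_2, \delta_{11}, \delta_{12}, \delta_{21}, \delta_{22}$   24,794                      8
2      Income, Education, Type, Income × Type                     $\alpha, \beta_1, \beta_2, \gamma_1, \gamma_2, \delta_{11}, \delta_{12}$                               24,556                      6
3      Income, Education, Type, Education × Type                  $\alpha, \beta_1, \beta_2, \gamma_1, \gamma_2, \delta_{21}, \delta_{22}$                               23,842                      6
4      Income, Education, Type                                    $\alpha, \beta_1, \beta_2, \gamma_1, \gamma_2$                                                          23,666                      4
5      Income, Education                                          $\alpha, \beta_1, \beta_2$                                                                              23,074                      2
6      Income, Type, Income × Type                                $\alpha, \beta_1, \gamma_1, \gamma_2, \delta_{11}, \delta_{12}$                                         23,488                      5
7      Education, Type, Education × Type                          $\alpha, \beta_2, \gamma_1, \gamma_2, \delta_{21}, \delta_{22}$                                         22,710                      5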

Source             Models Contrasted  Sum of Squares  df  F      p
Income             3 − 7              1132             1  28.35  <.0001
Education          2 − 6              1068             1  26.75  <.0001
Type               4 − 5               592             2   7.41  .0011
Income × Type      1 − 3               952             2  11.92  <.0001
Education × Type   1 − 2               238             2   2.98  .056
Residuals                             3553            89
Total                               28,347            97

Source             Models  Null hypothesis $H_0$
Income             3 − 7   $\beta_1 = 0 \mid \delta_{11} = \delta_{12} = 0$
Education          2 − 6   $\beta_2 = 0 \mid \delta_{21} = \delta_{22} = 0$
Type               4 − 5   $\gamma_1 = \gamma_2 = 0 \mid \delta_{11} = \delta_{12} = \delta_{21} = \delta_{22} = 0$
Income × Type      1 − 3   $\delta_{11} = \delta_{12} = 0$
Education × Type   1 − 2   $\delta_{21} = \delta_{22} = 0$

• Although the analysis-of-variance table shows the tests for the main
effects of education, income, and type before the education-by-type and
income-by-type interactions, the logic of interpretation is to examine the
interactions first:
– Conforming to the principle of marginality, the test for each main
effect is computed assuming that the interactions that are higher-order
relatives of that main effect are 0.
– Thus, for example, the test for the income main effect assumes that
the income-by-type interaction is absent (i.e., that $\delta_{11} = \delta_{12} = 0$), but
not that the education-by-type interaction is absent ($\delta_{21} = \delta_{22} = 0$).
• The degrees of freedom for the several sources of variation add to the
total degrees of freedom, but, because the regressors in different sets
are correlated, the sums of squares do not add to the total sum of
squares.
– What is important is that sensible hypotheses are tested, not that the
sums of squares add to the total sum of squares.

6. A Caution Concerning Standardized Coefficients
• An unstandardized coefficient for a dummy regressor is interpretable as
the expected response-variable difference between a particular category
and the baseline category for the dummy-regressor set.
• If a dummy-regressor coefficient is standardized, then this
straightforward interpretation is lost.
• Furthermore, because a 0/1 dummy regressor cannot be increased
by one standard deviation, the usual interpretation of a standardized
regression coefficient also does not apply.
– A similar point applies to interaction regressors.

7. Summary
• A dichotomous explanatory variable can be entered into a regression
equation by formulating a dummy regressor, coded 1 for one category
of the variable and 0 for the other category.
• A polytomous explanatory variable can be entered into a regression by
coding a set of 0/1 dummy regressors, one fewer than the number of
categories of the variable.
– The ‘omitted’ category, coded 0 for all dummy regressors in the set,
serves as a baseline.
• Interactions can be incorporated by coding interaction regressors, taking
products of dummy regressors with quantitative explanatory variables.
– The model permits “different slopes for different folks”, that is,
regression surfaces that are not parallel.
• The principle of marginality specifies that a model including a high-order
term (such as an interaction) should normally also include the
lower-order relatives of that term (the main effects that ‘compose’ the
interaction).
– The principle of marginality also serves as a guide to constructing
incremental F-tests for the terms in a model that includes interactions.
• It is not sensible to standardize dummy regressors or interaction
regressors.
