6 - Dummy-Variable Regression
5. Dummy-Variable Regression
gender and regress income on education alone, we obtain the same
slope as is produced by the separate within-gender regressions;
ignoring gender inflates the size of the errors, however.
We could perform separate regressions for women and men. This
approach is reasonable, but it has its limitations:
– Fitting separate regressions makes it difficult to estimate and test for
gender differences in income.
– Furthermore, if we can assume parallel regressions, then we can more
efficiently estimate the common education slope by pooling sample
data from both groups.

3.1 Introducing a Dummy Regressor
One way of formulating the common-slope model is
\[ Y_i = \alpha + \beta X_i + \gamma D_i + \varepsilon_i \]
where $D$, called a dummy-variable regressor or an indicator variable, is
coded 1 for men and 0 for women:
\[ D_i = \begin{cases} 1 & \text{for men} \\ 0 & \text{for women} \end{cases} \]
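The common-slope model can be fit by ordinary least squares once the dummy
regressor is added as a column of the design matrix. The following is a
minimal sketch, assuming NumPy is available; the data are hypothetical and
noise-free, and the values of `alpha`, `beta`, and `gamma` are illustrative
choices rather than estimates from any real data set.

```python
import numpy as np

# Illustrative parameter values (assumptions, not estimates from real data).
alpha, beta, gamma = 10.0, 2.0, 5.0

# Hypothetical data: years of education, and D = 1 for men, 0 for women.
education = np.array([8.0, 10.0, 12.0, 14.0, 16.0, 9.0, 11.0, 13.0, 15.0, 17.0])
D = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
income = alpha + beta * education + gamma * D  # no error term in this sketch

# Design matrix: constant, quantitative regressor X, dummy regressor D.
X = np.column_stack([np.ones_like(education), education, D])
coef, *_ = np.linalg.lstsq(X, income, rcond=None)
a_hat, b_hat, g_hat = coef
print(a_hat, b_hat, g_hat)
```

Because the data contain no error, the least-squares coefficients reproduce
the intercept for women, the common slope, and the constant separation
between the two parallel lines.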
[Figure: $Y$ against $X$, with lines labelled $D = 1$ and $D = 0$.]

3.2 Regressors vs. Explanatory Variables
This is our initial encounter with an idea that is fundamental to many
linear models: the distinction between explanatory variables and
regressors.
– Here, gender is a qualitative explanatory variable, with categories
male and female.
– The dummy variable $D$ is a regressor, representing the explanatory
variable gender.
– In contrast, the quantitative explanatory variable education and the
regressor $X$ are one and the same.
We will see later that an explanatory variable can give rise to several
regressors, and that some regressors are functions of more than one
explanatory variable.
negative.
– $\alpha$ gives the intercept for women, for whom $D = 0$.
– $\beta$ is the common within-gender education slope.
Figure 3 reveals the fundamental geometric ‘trick’ underlying the coding
of a dummy regressor:
– We are, in fact, fitting a regression plane to the data, but the dummy
regressor $D$ is defined only at the values zero and one.
Figure 3. The regression ‘plane’ underlying the additive dummy-regression
model.
© 2009 by John Fox, Sociology 740
Essentially similar results are obtained if we code $D$ zero for men and
one for women (Figure 4):
– The sign of $\gamma$ is reversed, but its magnitude remains the same.
– The coefficient $\alpha$ now gives the income intercept for men.
– It is therefore immaterial which group is coded one and which is coded
zero.
This method can be applied to any number of quantitative variables, as
long as we are willing to assume that the slopes are the same in the
two categories of the dichotomous explanatory variable (i.e., parallel
regression surfaces):
\[ Y_i = \alpha + \beta_1 X_{i1} + \cdots + \beta_k X_{ik} + \gamma D_i + \varepsilon_i \]
– For $D = 0$ we have
\[ Y_i = \alpha + \beta_1 X_{i1} + \cdots + \beta_k X_{ik} + \varepsilon_i \]
– and for $D = 1$
\[ Y_i = (\alpha + \gamma) + \beta_1 X_{i1} + \cdots + \beta_k X_{ik} + \varepsilon_i \]
Figure 4. Parameters corresponding to alternative coding $D = 0$ for men
and $D = 1$ for women.
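The effect of reversing the coding can be checked numerically. The sketch
below, which assumes NumPy and uses hypothetical data, fits the same
common-slope model twice: once with the dummy regressor coded 1 for men and
once coded 1 for women. The helper `fit` is an illustrative name, not part
of any library.

```python
import numpy as np

def fit(y, *columns):
    """Least-squares coefficients for a constant plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Hypothetical data with a small amount of made-up error.
x = np.array([8.0, 10.0, 12.0, 14.0, 9.0, 11.0, 13.0, 15.0])
men = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
error = np.array([0.5, -0.5, -0.5, 0.5, 0.3, -0.3, -0.3, 0.3])
y = 10.0 + 2.0 * x + 5.0 * men + error

a_m, b_m, g_m = fit(y, x, men)        # D = 1 for men, 0 for women
a_w, b_w, g_w = fit(y, x, 1.0 - men)  # D = 1 for women, 0 for men

print(g_m, g_w)  # same magnitude, opposite sign
print(a_m, a_w)  # intercept for women, then intercept for men
```

The slope estimate is identical under both codings, the dummy coefficient
changes sign only, and the intercept switches from the women's line to the
men's line, so the two fits describe the same pair of parallel lines.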
4. Polytomous Explanatory Variables
Recall the regression of the rated prestige of 102 Canadian occupations
on their income and education levels.
– I have classified 98 of the occupations into three categories: (1)
professional and managerial; (2) ‘white-collar’; and (3) ‘blue-collar’.
– The three-category classification can be represented in the regression
equation by introducing two dummy regressors:

    Category                     $D_1$   $D_2$
    Professional & Managerial      1       0
    White Collar                   0       1
    Blue Collar                    0       0

– The regression model is then
\[ Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \gamma_1 D_{i1} + \gamma_2 D_{i2} + \varepsilon_i \]
where $X_1$ is income and $X_2$ is education.
– This model describes three parallel regression planes, which can differ
in their intercepts (see Figure 5):
\[ \begin{aligned}
\text{Professional:} \quad & Y_i = (\alpha + \gamma_1) + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i \\
\text{White Collar:} \quad & Y_i = (\alpha + \gamma_2) + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i \\
\text{Blue Collar:} \quad & Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i
\end{aligned} \]
– $\alpha$ gives the intercept for blue-collar occupations.
– $\gamma_1$ represents the constant vertical difference between the parallel
regression planes for professional and blue-collar occupations (fixing
the values of education and income).
– $\gamma_2$ represents the constant vertical distance between the regression
planes for white-collar and blue-collar occupations.
– Blue-collar occupations are coded 0 for both dummy regressors,
so ‘blue collar’ serves as a baseline category with which the other
occupational categories are compared.
interest.
– The hypothesis $H_0\colon \gamma_1 = \gamma_2 = 0$ can be tested by the
incremental-sum-of-squares approach.
4.1 How Many Dummy Regressors Are Needed?
It may seem more natural to code three dummy regressors:

    Category                     $D_1$   $D_2$   $D_3$
    Professional & Managerial      1       0       0
    White Collar                   0       1       0
    Blue Collar                    0       0       1

– Then, for the $j$th occupational type, we would have
\[ Y_i = (\alpha + \gamma_j) + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i \]
The problem with this procedure is that there are too many parameters:
– We have used four parameters ($\alpha, \gamma_1, \gamma_2, \gamma_3$) to represent only three
group intercepts.
– We could not find unique values for these four parameters even if we
knew the three population regression lines.
– Likewise, we cannot calculate unique least-squares estimates for the
model, since the set of three dummy variables is perfectly collinear:
$D_3 = 1 - D_1 - D_2$.
For a polytomous explanatory variable with $m$ categories, we code $m - 1$
dummy regressors.
– One simple scheme is to select the last category as the baseline,
and to code $D_{ij} = 1$ when observation $i$ falls in category $j$, and 0
otherwise:

    Category   $D_1$   $D_2$   ···   $D_{m-1}$
    1            1       0     ···      0
    2            0       1     ···      0
    ·            ·       ·              ·
    ·            ·       ·              ·
    ·            ·       ·              ·
    $m - 1$      0       0     ···      1
    $m$          0       0     ···      0
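The baseline coding scheme is simple to express in code. The sketch below
is one possible implementation in plain Python; the function name
`dummy_code` and the example category labels are illustrative choices. It
also checks the collinearity that arises when a dummy regressor is coded
for every category.

```python
def dummy_code(values, levels):
    """Code m - 1 dummy regressors; the last entry of `levels` is the baseline."""
    non_baseline = levels[:-1]
    return [[1 if v == level else 0 for level in non_baseline] for v in values]

levels = ["professional", "white collar", "blue collar"]
occupations = ["blue collar", "professional", "white collar", "professional"]

rows = dummy_code(occupations, levels)  # one [D1, D2] pair per observation
print(rows)

# If a third dummy D3 were coded for the baseline category as well, then
# D3 = 1 - D1 - D2 for every observation: D3 is an exact linear function of
# the constant regressor and the other two dummies.
d3 = [1 if v == levels[-1] else 0 for v in occupations]
redundant = all(d == 1 - r[0] - r[1] for d, r in zip(d3, rows))
print(redundant)
```

Observations in the baseline category receive zeros on every dummy
regressor, which is what makes the remaining coefficients comparisons with
the baseline.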
– When there is more than one qualitative explanatory variable with
additive effects, we can code a set of dummy regressors for each.
– To test the hypothesis that the effects of a qualitative explanatory
variable are nil, delete its dummy regressors from the model and
compute an incremental $F$-test.
The regression of prestige on income and education (standard errors in
parentheses):
\[ \hat{Y} = \underset{(3.116)}{-7.621} + \underset{(0.000219)}{0.001241}\,X_1 + \underset{(0.336)}{4.292}\,X_2, \qquad R^2 = .81400 \]
– Inserting dummy variables for type of occupation into the regression
equation produces the following results:
\[ \hat{Y} = \underset{(5.2275)}{-0.6229} + \underset{(0.000221)}{0.001013}\,X_1 + \underset{(0.641)}{3.673}\,X_2 + \underset{(3.867)}{6.039}\,D_1 - \underset{(2.514)}{2.737}\,D_2, \qquad R^2 = .83486 \]
– The three fitted regression equations are:
\[ \begin{aligned}
\text{Professional:} \quad & \hat{Y} = 5.416 + 0.001013\,X_1 + 3.673\,X_2 \\
\text{White collar:} \quad & \hat{Y} = -3.360 + 0.001013\,X_1 + 3.673\,X_2 \\
\text{Blue collar:} \quad & \hat{Y} = -0.623 + 0.001013\,X_1 + 3.673\,X_2
\end{aligned} \]
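The three group intercepts follow directly from the coefficients of the
dummy-regression fit: the baseline intercept is the constant, and each
other group adds its own dummy coefficient. The sketch below reproduces
that arithmetic in plain Python using the reported coefficients; the
function `fitted_prestige` and the income and education values passed to it
are illustrative assumptions.

```python
# Coefficients as reported for the additive model with occupational-type dummies.
a, b1, b2, g1, g2 = -0.6229, 0.001013, 3.673, 6.039, -2.737

intercepts = {
    "professional": a + g1,  # D1 = 1, D2 = 0
    "white collar": a + g2,  # D1 = 0, D2 = 1
    "blue collar": a,        # baseline: D1 = D2 = 0
}

def fitted_prestige(group, income, education):
    """Fitted prestige for an occupation of the given type (illustrative helper)."""
    return intercepts[group] + b1 * income + b2 * education

for group, value in intercepts.items():
    print(group, round(value, 3))

# At any fixed income and education, the professional/blue-collar gap is g1.
gap = fitted_prestige("professional", 5000, 12) - fitted_prestige("blue collar", 5000, 12)
print(gap)
```

Because the planes are parallel, the gap between two groups does not depend
on the income and education values chosen.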
– To test the null hypothesis of no partial effect of type of occupation,
\[ H_0\colon \gamma_1 = \gamma_2 = 0 \]
calculate the incremental $F$-statistic
\[ F_0 = \frac{n - k - 1}{q} \times \frac{R_1^2 - R_0^2}{1 - R_1^2}
       = \frac{98 - 4 - 1}{2} \times \frac{.83486 - .81400}{1 - .83486} = 5.874 \]
with 2 and 93 degrees of freedom, for which $p = .0040$.

5. Modeling Interactions
Two explanatory variables interact in determining a response variable
when the partial effect of one depends on the value of the other.
– Additive models specify the absence of interactions.
– If the regressions in different categories of a qualitative explanatory
variable are not parallel, then the qualitative explanatory variable
interacts with one or more of the quantitative explanatory variables.
– The dummy-regression model can be modified to reflect interactions.
Consider the hypothetical data in Figure 6 (and contrast these examples
with those shown in Figure 1, where the effects of gender and education
were additive):
– In (a), gender and education are independent, since women and men
have identical education distributions.
– In (b), gender and education are related, since women, on average,
have higher levels of education than men.
Because the regressions are not parallel, the relative income advantage
of men changes with education.
Interaction is a symmetric concept: the effect of education varies
by gender, and the effect of gender varies by education.
[Figure: Income (vertical axis) against Education (horizontal axis), two panels.]
These examples illustrate another important point: Interaction and
correlation of explanatory variables are empirically and logically distinct
phenomena.
– Two explanatory variables can interact whether or not they are related
to one another statistically.
– Interaction refers to the manner in which explanatory variables
combine to affect a response variable, not to the relationship between
the explanatory variables themselves.

5.1 Constructing Interaction Regressors
We could model the data in the example by fitting separate regressions
of income on education for women and men.
– A combined model facilitates a test of the gender-by-education
interaction, however.
– A properly formulated unified model that permits different intercepts
and slopes in the two groups produces the same fit as separate
regressions.
The following model accommodates different intercepts and slopes for
women and men:
\[ Y_i = \alpha + \beta X_i + \gamma D_i + \delta (X_i D_i) + \varepsilon_i \]
– Along with the dummy regressor $D$ for gender and the quantitative
regressor $X$ for education, I have introduced the interaction regressor
$XD$.
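The claim that the unified model produces the same fit as separate
regressions can be checked on a small example. In the sketch below
(assuming NumPy; the data and the helper name `ols` are hypothetical), the
interaction model is fit to pooled data and compared with two within-group
simple regressions. With $D = 0$ for women, the model implies the line
$\alpha + \beta X$ for women and $(\alpha + \gamma) + (\beta + \delta) X$
for men.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for design matrix X and response y."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Hypothetical data: D = 0 for women, D = 1 for men, with made-up errors.
x = np.array([8.0, 10.0, 12.0, 14.0, 16.0, 8.0, 10.0, 12.0, 14.0, 16.0])
d = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
error = np.array([1.0, -1.0, 0.5, -0.5, 0.0, -0.8, 0.6, 0.4, -0.2, 0.0])
y = 5.0 + 1.0 * x + 2.0 * d + 1.5 * x * d + error

ones = np.ones_like(x)

# Unified model: constant, X, D, and the interaction regressor X * D.
alpha, beta, gamma, delta = ols(np.column_stack([ones, x, d, x * d]), y)

# Separate simple regressions within each group.
women = d == 0
a_w, b_w = ols(np.column_stack([ones[women], x[women]]), y[women])
a_m, b_m = ols(np.column_stack([ones[~women], x[~women]]), y[~women])

print(alpha, beta, "vs", a_w, b_w)                  # women's line
print(alpha + gamma, beta + delta, "vs", a_m, b_m)  # men's line
```

The two approaches give the same pair of fitted lines; the advantage of the
unified model is that $\delta$ appears as a single coefficient that can be
tested.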
To test for interaction, we can test the hypothesis $H_0\colon \delta = 0$.
In the additive, no-interaction model, $\gamma$ represented the unique partial
effect of gender, while the slope $\beta$ represented the unique partial effect
of education.
– In the interaction model, $\gamma$ is no longer interpretable as the unqualified
income difference between men and women of equal education; $\gamma$
is now the income difference at $X = 0$.
– Likewise, in the interaction model, $\beta$ is not the unqualified partial effect
of education, but rather the effect of education among women.
The effect of education among men ($\beta + \delta$) does not appear directly
in the model.

5.2 The Principle of Marginality
The separate partial effects, or main effects, of education and gender
are marginal to the education-by-gender interaction.
In general, we neither test nor interpret main effects of explanatory
variables that interact.
– If we can rule out interaction either on theoretical or empirical grounds,
then we can proceed to test, estimate, and interpret main effects.
It does not generally make sense to specify and fit models that include
interaction regressors but that delete main effects that are marginal to
them.
– Such models, which violate the principle of marginality, are
interpretable, but they are not broadly applicable.
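One way to see why a main effect is not interpreted on its own when an
interaction is present: in the interaction model the dummy coefficient is
the group difference at $X = 0$, so it changes whenever the origin of $X$
is moved, while the interaction coefficient does not. The sketch below
(assuming NumPy; the data, the helper `fit_interaction`, and the shift `c`
are hypothetical) refits the model after subtracting a constant from $X$.

```python
import numpy as np

def fit_interaction(x, d, y):
    """Return (alpha, beta, gamma, delta) for Y = a + b X + g D + dl (X D)."""
    X = np.column_stack([np.ones_like(x), x, d, x * d])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Hypothetical data: D = 0 for women, D = 1 for men, with made-up errors.
x = np.array([8.0, 10.0, 12.0, 14.0, 16.0, 8.0, 10.0, 12.0, 14.0, 16.0])
d = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
error = np.array([1.0, -1.0, 0.5, -0.5, 0.0, -0.8, 0.6, 0.4, -0.2, 0.0])
y = 5.0 + 1.0 * x + 2.0 * d + 1.5 * x * d + error

c = 12.0  # arbitrary new origin for X
_, beta_0, gamma_0, delta_0 = fit_interaction(x, d, y)
_, beta_c, gamma_c, delta_c = fit_interaction(x - c, d, y)

print(gamma_0, gamma_c)  # differ: group difference at X = 0 versus at X = c
print(delta_0, delta_c)  # agree: difference in slopes
```

After the shift, the dummy coefficient equals the original $\gamma$ plus
$\delta c$, which is the group difference evaluated at $X = c$; the slope
and interaction coefficients are unchanged.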
5.3 Interactions With Polytomous Explanatory Variables
The method of modeling interactions by forming product regressors
is easily extended to polytomous explanatory variables, to several
qualitative explanatory variables, and to several quantitative explanatory
variables.
For example, for the Canadian occupational prestige regression:
\[ \begin{aligned}
Y_i = {} & \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \gamma_1 D_{i1} + \gamma_2 D_{i2} \\
& + \delta_{11} X_{i1} D_{i1} + \delta_{12} X_{i1} D_{i2} \\
& + \delta_{21} X_{i2} D_{i1} + \delta_{22} X_{i2} D_{i2} + \varepsilon_i
\end{aligned} \]
– We require one interaction regressor for each product of a dummy
regressor with a quantitative explanatory variable.
– The regressors $X_1 D_1$ and $X_1 D_2$ capture the interaction between
income and occupational type;
$X_2 D_1$ and $X_2 D_2$ capture the interaction between education and
occupational type.
– The model permits different intercepts and slopes for the three types
of occupations:
\[ \begin{aligned}
\text{Professional:} \quad & Y_i = (\alpha + \gamma_1) + (\beta_1 + \delta_{11}) X_{i1} + (\beta_2 + \delta_{21}) X_{i2} + \varepsilon_i \\
\text{White Collar:} \quad & Y_i = (\alpha + \gamma_2) + (\beta_1 + \delta_{12}) X_{i1} + (\beta_2 + \delta_{22}) X_{i2} + \varepsilon_i \\
\text{Blue Collar:} \quad & Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i
\end{aligned} \]
– Blue-collar occupations, coded 0 for both dummy regressors, serve
as the baseline for the intercepts and slopes of the other occupational
types.
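The full set of regressors for this model can be assembled observation by
observation. The following plain-Python sketch shows one way to build a row
of the design matrix, in the same order as the coefficients in the model
equation; the function name `design_row`, the category labels, and the
example values are illustrative assumptions.

```python
def design_row(income, education, occupation_type):
    """Regressors for one observation, ordered as
    [1, X1, X2, D1, D2, X1*D1, X1*D2, X2*D1, X2*D2]."""
    d1 = 1 if occupation_type == "professional" else 0
    d2 = 1 if occupation_type == "white collar" else 0
    return [
        1, income, education, d1, d2,
        income * d1, income * d2,        # income-by-type interaction regressors
        education * d1, education * d2,  # education-by-type interaction regressors
    ]

print(design_row(5000, 12, "professional"))
print(design_row(5000, 12, "blue collar"))
```

For a baseline (blue-collar) occupation every dummy and product regressor is
zero, so only the constant, income, and education columns contribute.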
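– Fitting this model to the Canadian occupational prestige data produces
the following results (standard errors in parentheses):
\[ \begin{aligned}
\hat{Y} = {} & \underset{(7.057)}{2.276} + \underset{(0.000556)}{0.003522}\,X_1 + \underset{(0.927)}{1.713}\,X_2 \\
& + \underset{(13.72)}{15.35}\,D_1 - \underset{(17.54)}{33.54}\,D_2 \\
& - \underset{(0.000599)}{0.002903}\,X_1 D_1 - \underset{(0.000894)}{0.002072}\,X_1 D_2 \\
& + \underset{(1.289)}{1.388}\,X_2 D_1 + \underset{(1.757)}{4.291}\,X_2 D_2
\end{aligned} \]
\[ R^2 = .8747 \]
– The regression equation for each group:
\[ \begin{aligned}
\text{Professional:} \quad & \widehat{\text{Prestige}} = 17.63 + 0.000619 \times \text{Income} + 3.101 \times \text{Education} \\
\text{White-Collar:} \quad & \widehat{\text{Prestige}} = -31.26 + 0.001450 \times \text{Income} + 6.004 \times \text{Education} \\
\text{Blue-Collar:} \quad & \widehat{\text{Prestige}} = 2.276 + 0.003522 \times \text{Income} + 1.713 \times \text{Education}
\end{aligned} \]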
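The group-specific equations are obtained by adding each group's dummy and
interaction coefficients to the baseline intercept and slopes. The sketch
below reproduces that arithmetic in plain Python from the reported
coefficients; the dictionary keys and the function name `group_equation`
are illustrative.

```python
# Coefficients as reported for the interaction model.
coef = {
    "const": 2.276, "X1": 0.003522, "X2": 1.713,
    "D1": 15.35, "D2": -33.54,
    "X1D1": -0.002903, "X1D2": -0.002072,
    "X2D1": 1.388, "X2D2": 4.291,
}

def group_equation(d1, d2):
    """Return (intercept, income slope, education slope) for given dummy values."""
    intercept = coef["const"] + coef["D1"] * d1 + coef["D2"] * d2
    income_slope = coef["X1"] + coef["X1D1"] * d1 + coef["X1D2"] * d2
    education_slope = coef["X2"] + coef["X2D1"] * d1 + coef["X2D2"] * d2
    return intercept, income_slope, education_slope

professional = group_equation(1, 0)
white_collar = group_equation(0, 1)
blue_collar = group_equation(0, 0)

for name, eq in [("Professional", professional),
                 ("White-Collar", white_collar),
                 ("Blue-Collar", blue_collar)]:
    print(name, round(eq[0], 2), round(eq[1], 6), round(eq[2], 3))
```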
– These tests, and tests for the main effects of occupational type,
income, and education, are detailed in the following tables:

    Model  Terms                              Parameters                                                         Regression Sum of Squares   df
    4      Income, Education, Type            $\alpha, \beta_1, \beta_2, \gamma_1, \gamma_2$                     23,666                      4
    5      Income, Education                  $\alpha, \beta_1, \beta_2$                                         23,074                      2
    6      Income, Type, Income × Type        $\alpha, \beta_1, \gamma_1, \gamma_2, \delta_{11}, \delta_{12}$    23,488                      5
    7      Education, Type, Education × Type  $\alpha, \beta_2, \gamma_1, \gamma_2, \delta_{21}, \delta_{22}$    22,710                      5

    Source             Models Contrasted   Sum of Squares   df   $F$     $p$
    Income             3 − 7               1132             1    28.35   <.0001
    Education          2 − 6               1068             1    26.75   <.0001
    Type               4 − 5               592              2    7.41    .0011
    Income × Type      1 − 3               952              2    11.92   <.0001
    Education × Type   1 − 2               238              2    2.98    .056
    Residuals                              3553             89
    Total                                  28,347           97

Although the analysis-of-variance table shows the tests for the main
effects of education, income, and type before the education-by-type and
income-by-type interactions, the logic of interpretation is to examine the
interactions first:
– Conforming to the principle of marginality, the test for each main
effect is computed assuming that the interactions that are higher-order
relatives of that main effect are 0.
– Thus, for example, the test for the income main effect assumes that
the income-by-type interaction is absent (i.e., that $\delta_{11} = \delta_{12} = 0$), but
not that the education-by-type interaction is absent ($\delta_{21} = \delta_{22} = 0$).

    Source             Models Contrasted   $H_0$
    Income             3 − 7               $\beta_1 = 0 \mid \delta_{11} = \delta_{12} = 0$
    Education          2 − 6               $\beta_2 = 0 \mid \delta_{21} = \delta_{22} = 0$
    Type               4 − 5               $\gamma_1 = \gamma_2 = 0 \mid \delta_{11} = \delta_{12} = \delta_{21} = \delta_{22} = 0$
    Income × Type      1 − 3               $\delta_{11} = \delta_{12} = 0$
    Education × Type   1 − 2               $\delta_{21} = \delta_{22} = 0$
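The $F$ statistics in the analysis-of-variance table can be reproduced from
the tabled sums of squares: each is the mean square for the source divided
by the residual mean square. The plain-Python sketch below uses the rounded
values from the tables, so results may differ from the printed statistics
in the last digit; the variable names are illustrative.

```python
# Sums of squares and degrees of freedom as tabled.
residual_ss, residual_df = 3553.0, 89
total_ss, total_df = 28347.0, 97
sources = {
    "Income": (1132.0, 1),
    "Education": (1068.0, 1),
    "Type": (592.0, 2),
    "Income x Type": (952.0, 2),
    "Education x Type": (238.0, 2),
}

residual_ms = residual_ss / residual_df
f_stats = {name: (ss / df) / residual_ms for name, (ss, df) in sources.items()}
for name, f in f_stats.items():
    print(name, round(f, 2))

# Each incremental sum of squares is a difference of regression sums of
# squares; for Type it is model 4 minus model 5.
type_ss = 23666.0 - 23074.0

# Degrees of freedom add to the total; sums of squares do not.
df_sum = sum(df for _, df in sources.values()) + residual_df
ss_sum = sum(ss for ss, _ in sources.values()) + residual_ss
print(df_sum == total_df, ss_sum == total_ss)
```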
The degrees of freedom for the several sources of variation add to the
total degrees of freedom, but, because the regressors in different sets
are correlated, the sums of squares do not add to the total sum of
squares.
– What is important is that sensible hypotheses are tested, not that the
sums of squares add to the total sum of squares.

6. A Caution Concerning Standardized Coefficients
An unstandardized coefficient for a dummy regressor is interpretable as
the expected response-variable difference between a particular category
and the baseline category for the dummy-regressor set.
If a dummy-regressor coefficient is standardized, then this
straightforward interpretation is lost.
Furthermore, because a 0/1 dummy regressor cannot be increased
by one standard deviation, the usual interpretation of a standardized
regression coefficient also does not apply.
– A similar point applies to interaction regressors.
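A small numerical sketch of why the one-standard-deviation reading fails
for a dummy regressor: the standard deviation of a 0/1 variable is
$\sqrt{p(1 - p)}$, where $p$ is the proportion coded 1, and this never
exceeds 0.5. The data below are hypothetical.

```python
import statistics

# Hypothetical dummy regressor: 3 of 10 observations coded 1.
dummy = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

p = sum(dummy) / len(dummy)
sd = statistics.pstdev(dummy)  # population SD; equals sqrt(p * (1 - p))
print(p, sd)

# Moving one standard deviation up from 0 gives a value strictly between
# 0 and 1, which the dummy regressor can never take.
one_sd_step = 0 + sd
print(one_sd_step in (0, 1))
```

The only change a dummy regressor can undergo is from 0 to 1, which is why
the unstandardized coefficient, a difference between categories, is the
quantity with a direct interpretation.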