Statistics I
WWW.SYSTAT.COM
For more information about SYSTAT Software Inc. products, please visit our WWW
site at http://www.systat.com or contact
Marketing Department
SYSTAT Software Inc.
501 Canal Boulevard, Suite F
Richmond, CA 94804-2028
Tel: (800) 797-7401, (866) 797-8288
Fax: (800) 797-7406
ISBN 81-88341-04-5
Contents

1   Introduction to Statistics . . . . . . . . . . . . . . . . . . . . . . .  I-1
2   Bootstrapping and Sampling . . . . . . . . . . . . . . . . . . . . . . .  I-17
3   Classification and Regression Trees . . . . . . . . . . . . . . . . . .  I-31
4   Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . .  I-53
5   Conjoint Analysis  . . . . . . . . . . . . . . . . . . . . . . . . . . .  I-87
6   Correlations, Similarities, and Distance Measures  . . . . . . . . . . .  I-115
7   Correspondence Analysis  . . . . . . . . . . . . . . . . . . . . . . . .  I-147
8   Crosstabulation  . . . . . . . . . . . . . . . . . . . . . . . . . . . .  I-157
9   Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . .  I-205
10  Design of Experiments  . . . . . . . . . . . . . . . . . . . . . . . . .  I-227
11  Discriminant Analysis  . . . . . . . . . . . . . . . . . . . . . . . . .  I-275
12  Factor Analysis  . . . . . . . . . . . . . . . . . . . . . . . . . . . .  I-327
13  Linear Models  . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  I-365
17  Logistic Regression  . . . . . . . . . . . . . . . . . . . . . . . . . .  I-549
18  Loglinear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . .  I-617
Index  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  649
List of Examples
Additive Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-82
Analysis of Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . I-462
ANOVA Assumptions and Contrasts . . . . . . . . . . . . . . . . . . I-442
Automatic Stepwise Regression . . . . . . . . . . . . . . . . . . . . I-417
Basic Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-216
Binary Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-566
Binary Logit with Interactions . . . . . . . . . . . . . . . . . . . . . . I-569
Binary Logit with Multiple Predictors . . . . . . . . . . . . . . . . . . . I-568
Box-Behnken Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-264
Box-Cox Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-103
Box-Hunter Fractional Factorial Design . . . . . . . . . . . . . . . . I-256
By-Choice Data Format . . . . . . . . . . . . . . . . . . . . . . . . . . I-598
Canonical Correlation Analysis. . . . . . . . . . . . . . . . . . . . . . I-544
Canonical Correlations: Using Text Output . . . . . . . . . . . . . . . I-26
Central Composite Response Surface Design . . . . . . . . . . . . . I-269
Choice Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-96
Classification Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-45
Cochran's Test of Linear Trend . . . . . . . . . . . . . . . . . . . . . . I-194
Conditional Logistic Regression . . . . . . . . . . . . . . . . . . . . . I-588
Confidence Interval on a Median . . . . . . . . . . . . . . . . . . . . . I-25
Confidence Intervals for One-Way Table Percentages . . . . . . . . . I-199
Contrasts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-313
Correspondence Analysis (Simple) . . . . . . . . . . . . . . . . . . . . I-151
Covariance Alternatives to Repeated Measures. . . . . . . . . . . . . I-532
Crossover and Changeover Designs. . . . . . . . . . . . . . . . . . . . I-520
Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-321
Deciles of Risk and Model Diagnostics . . . . . . . . . . . . . . . . . . . I-574
Chapter 1
Introduction to Statistics

Leland Wilkinson
Statistics and state have the same root. Statistics are the numbers of the state. More
generally, they are any numbers or symbols that formally summarize our observations
of the world. As we all know, summaries can mislead or elucidate. Statistics also
refers to the introductory course we all seem to hate in college. When taught well,
however, it is this course that teaches us how to use numbers to elucidate rather than
to mislead.
Statisticians specialize in many areas: probability, exploratory data analysis,
modeling, social policy, decision making, and others. While they may philosophically
disagree, statisticians nevertheless recognize at least two fundamental tasks:
description and inference. Description involves characterizing a batch of data in
simple but informative ways. Inference involves generalizing from a sample of data
to a larger population of possible data. Descriptive statistics help us to observe more
acutely, and inferential statistics help us to formulate and test hypotheses.
Any distinctions, such as this one between descriptive and inferential statistics, are
potentially misleading. Let's look at some examples, however, to see some
differences between these approaches.
Descriptive Statistics
Descriptive statistics may be single numerical summaries of a batch, such as an
average. Or, they may be more complex tables and graphs. What distinguishes
descriptive statistics is their reference to a given batch of data rather than to a more
general population or class. While there are exceptions, we usually examine
descriptive statistics to understand the structure of a batch. A closely related field is
called exploratory data analysis. Both exploratory and descriptive methods may lead
us to formulate laws or test hypotheses, but their focus is on the data at hand.
Consider, for example, the following batch. These are numbers of arrests by sex in
1985 for selected crimes in the United States. The source is the FBI Uniform Crime
Reports. What can we say about differences between the patterns of arrests of men and
women in the United States in 1985?
CRIME          MALES     FEMALES
murder         12904        1815
rape           28865         303
robbery       105401        8639
assault       211228       32926
burglary      326959       26753
larceny       744423      334053
auto           97835       10093
arson          13129        2003
battery       416735       75937
forgery        46286       23181
fraud         151773      111825
embezzle        5624        3184
vandal        181600       20192
weapons       134210       10970
vice           29584       67592
sex            74602        6108
drugs         562754       90038
gambling       21995        3879
family         35553        5086
dui          1208416      157131
drunk         726214       70573
disorderly    435198       99252
vagrancy       24592        3001
runaway        53808       72473
State laws vary on the definitions of some of these crimes. Agencies may modify arrest
statistics for political purposes. Know where your batch came from before you use it.
                    MALES        FEMALES
N of cases             24             24
Minimum          5624.000        303.000
Maximum       1208416.000     334053.000
Sum           5649688.000    1237007.000
Mean           235403.667      51541.958
Standard Dev   305947.056      74220.864
How about the average (mean) number of arrests for a crime? For males, this was
235,403 and for females, 51,542. Does the mean make any sense to you as a summary
statistic? Another statistic in the table, the standard deviation, measures how much
these numbers vary around the average. The standard deviation is the square root of the
average squared deviation of the observations from their mean. It, too, has problems in
this instance. First of all, both the mean and standard deviation should represent what
you could observe in your batch, on average: the mean number of fish in a pond, the
mean number of children in a classroom, the mean number of red blood cells per cubic
millimeter. Here, we would have to say, "the mean murder-rape-robbery- ... -runaway
type of crime." Second, even if the mean made sense descriptively, we might question
its use as a typical crime-arrest statistic. To see why, we need to examine the shape of
these numbers.
Stem-and-Leaf Plots
Let's look at a display that compresses these data a little less drastically. The
stem-and-leaf plot is like a tally. We pick a most significant digit or digits and tally the next digit
to the right. By using trailing digits instead of tally marks, we preserve extra digits in
the data. Notice the shape of the tally. There are mostly smaller numbers of arrests and
a few crimes (such as larceny and driving under the influence of alcohol) with larger
numbers of arrests. Another way of saying this is that the data are positively skewed
toward larger numbers for both males and females.
Stem and Leaf Plot of variable: MALES, N = 24

Minimum:        5624.000
Lower hinge:   29224.500
Median:       101618.000
Upper hinge:  371847.000
Maximum:     1208416.000

   0 H   011222234579
   1 M   0358
   2     1
   3 H   2
   4     13
   5     6
   6
   7     24
         * * * Outside Values * * *
  12     0

Stem and Leaf Plot of variable: FEMALES, N = 24

Minimum:        303.000
Lower hinge:   4482.500
Median:       21686.500
Upper hinge:  74205.000
Maximum:     334053.000

   0 H   00000000011
   0 M   2223
   0
   0 H   6777
   0     99
   1     1
   1
   1     5
         * * * Outside Values * * *
   3     3
The Median
When data are skewed like this, the mean gets pulled from the center of the majority
of numbers toward the extreme with the few. A statistic that is not as sensitive to
extreme values is the median. The median is the value above which half the data fall.
More precisely, if you sort the data, the median is the middle value or the average of
the two middle values. Notice that for males the median is 101,618, and for females,
21,686. Both are considerably smaller than the means and more typical of the majority
of the numbers. This is why the median is often used for representing skewed data,
such as incomes, populations, or reaction times.
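With 24 crimes in the batch, each of these medians is the average of the 12th and 13th
sorted values. For the male arrests, for example,

(97,835 + 105,401) / 2 = 101,618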
We still have the same representativeness problem that we had with the mean,
however. Even if the medians corresponded to real data values in this batch (which
they don't because there is an even number of observations), it would be hard to
characterize what they would represent.
Sorting
Most people think of means, standard deviations, and medians as the primary
descriptive statistics. They are useful summary quantities when the observations
represent values of a single variable. We purposely chose an example where they are
less appropriate, however, even when they are easily computable. There are better
ways to reveal the patterns in these data. Let's look at sorting as a way of uncovering
structure.
I was talking once with an FBI agent who had helped to uncover the Chicago
machine's voting fraud scandal some years ago. He was a statistician, so I was curious
what statistical methods he used to prove the fraud. He replied, "We sorted the voter
registration tape alphabetically by last name. Then we looked for duplicate names and
addresses." Sorting is one of the most basic and powerful data analysis techniques. The
stem-and-leaf plot, for example, is a sorted display.
We can sort on any numerical or character variable. It depends on our goal. We
began this chapter with a question: Are there differences between the patterns of arrests
of men and women in the United States in 1985? How about sorting the male and
female arrests separately? If we do this, we will get a list of crimes in order of
decreasing frequency within sex.
MALES          FEMALES
dui            larceny
larceny        dui
drunk          fraud
drugs          disorderly
disorderly     drugs
battery        battery
burglary       runaway
assault        drunk
vandal         vice
fraud          assault
weapons        burglary
robbery        forgery
auto           vandal
sex            weapons
runaway        auto
forgery        robbery
family         sex
vice           family
rape           gambling
vagrancy       embezzle
gambling       vagrancy
arson          arson
murder         murder
embezzle       rape
You might want to connect similar crimes with lines. The number of crossings would
indicate differences in ranks.
Standardizing
This ranking is influenced by prevalence. The most frequent crimes occur at the top of
the list in both groups. Comparisons within crimes are obscured by this influence. Men
committed almost 100 times as many rapes as women, for example, yet rape is near the
bottom of both lists. If we are interested in contrasting the sexes on patterns of crime
while holding prevalence constant, we must standardize the data. There are several
ways to do this. You may have heard of standardized test scores for aptitude tests.
These are usually produced by subtracting means and then dividing by standard
deviations. Another method is simply to divide by row or column totals. For the crime
data, we will divide by totals within rows (each crime). Doing so gives us the
proportion of each arresting crime committed by men or women. The total of these two
proportions will thus be 1.
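As a concrete sketch of this step in SYSTAT BASIC (the file name CRIME and the
variable names MALES and FEMALES are assumptions made here for illustration; they
are not named in the text):

BASIC
USE CRIME
LET TOTAL = MALES + FEMALES
LET MPROP = MALES / TOTAL
LET FPROP = FEMALES / TOTAL
LET DIFF = MPROP - FPROP
SAVE CRIMESTD
RUN

MPROP and FPROP are the row-standardized proportions, and DIFF anticipates the
male-minus-female contrast described next.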
Now, a contrast between men and women on this standardized value should reveal
variations in arrest patterns within crime type. By subtracting the female proportion
from the male, we will highlight primarily male crimes with positive values and female
crimes with negative. Next, sort these differences and plot them in a simple graph. The
following shows the result:
Now we can see clear contrasts between males and females in arrest patterns. The
predominantly aggressive crimes appear at the top of the list. Rape now appears where
it belongs: an aggressive, rather than sexual, crime. A few crimes dominated by
females are at the bottom.
Inferential Statistics
We often want to do more than describe a particular sample. In order to generalize,
formulate a policy, or test a hypothesis, we need to make an inference. Making an
inference implies that we think a model describes a more general population from
which our data have been randomly sampled. Sometimes it is difficult to imagine a
population from which you have gathered data. A population can be all possible
voters, all possible replications of this experiment, or all possible moviegoers.
When you make inferences, you should have a population in mind.
What Is a Population?
We are going to use inferential methods to estimate the mean age of the unusual
population contained in the 1980 edition of Who's Who in America. We could enter all
73,500 ages into a SYSTAT file and compute the mean age exactly. If it were practical,
this would be the preferred method. Sometimes, however, a sampling estimate can be
more accurate than an entire census. For example, biases are introduced into large
censuses from refusals to comply, keypunch or coding errors, and other sources. In
these cases, a carefully constructed random sample can yield less-biased information
about the population.
This is an unusual population because it is contained in a book and is therefore finite.
We are not about to estimate the mean age of the rich and famous. After all, Spy
magazine used to have a regular feature listing all of the famous people who are not in
Who's Who. And bogus listings may escape the careful fact checking of the Who's Who
research staff. When we get our estimate, we might be tempted to generalize beyond
the book, but we would be wrong to do so. For example, if a psychologist measures
opinions in a random sample from a class of college sophomores, his or her
conclusions should begin with the statement, "College sophomores at my university
think ..." If the word "people" is substituted for "college sophomores," it is the
experimenter's responsibility to make clear that the sample is representative of the
larger group on all attributes that might affect the results.
There are several tempting ways to pick names from the book that are not random
(with each of them, some names would have little or no chance of being chosen):
- Close your eyes, flip the pages of the book, and point to a name (Tversky and others
  have done research that shows that humans cannot behave randomly).
- Randomly pick the first letter of the last name and randomly choose from the
  names beginning with that letter (there are more names beginning with C, for
  example, than with I).
The way to pick randomly from a book, file, or any finite population is to assign a
number to each name or case and then pick a sample of numbers randomly. You can
use SYSTAT to generate a random number between 1 and 73,500, for example, with
the expression:
1 + INT(73500*URN)
There are too many pages in Who's Who to use this method, however. As a short cut,
I randomly generated a page number and picked a name from the page using the
random number generator. This method should work well provided that each page has
approximately the same number of names (between 19 and 21 in this case). The sample
is shown below:
AGE    SEX        AGE    SEX
 60    male        38    female
 74    male        44    male
 39    female      49    male
 78    male        62    male
 66    male        76    female
 63    male        51    male
 45    male        51    male
 56    male        75    male
 65    male        65    female
 51    male        41    male
 52    male        67    male
 59    male        50    male
 67    male        55    male
 48    male        45    male
 36    female      49    male
 34    female      58    male
 68    male        47    male
 50    male        55    male
 51    male        67    male
 47    male        58    male
 81    male        76    male
 56    male        70    male
 49    male        69    male
 58    male        46    male
 58    male        60    male
Specifying a Model
To make an inference about age, we need to construct a model for our population:
a = μ + ε

This model says that the age ( a ) of someone we pick from the book can be described
by an overall mean age ( μ ) plus an amount of error ( ε ) specific to that person and due
to random factors that are too numerous and insignificant to describe systematically.
Notice that we use Greek letters to denote things that we cannot observe directly and
Roman letters for those that we do observe. Of the unobservables in the model, μ is
called a parameter, and ε, a random variable. A parameter is a constant that helps to
describe a population. Parameters indicate how a model is an instance of a family of
models for similar populations. A random variable varies like the tossing of a coin.
There are two more parameters associated with the random variable ε but not
appearing in the model equation. One is its mean ( με ), which we have rigged to be 0,
and the other is its standard deviation ( σε or simply σ ). Because a is simply the sum
of μ (a constant) and ε (a random variable), its standard deviation is also σ.
In specifying this model, we assume the following:
- The model is true for every member of the population.
- The error, plus or minus, that helps determine one population member's age is
  independent of (cannot be predicted from) the error for any other member.
- The errors, taken together, follow a single distribution, with mean 0 and standard
  deviation σ.
Estimating a Model
Because we have not sampled the entire population, we cannot compute the parameter
values directly from the data. We have only a small sample from a much larger
population, so we can estimate the parameter values only by using some statistical
method on our sample data. When our three assumptions are appropriate, the sample
mean will be a good estimate of the population mean. Without going into all of the
details, the sample estimate will be, on average, close to the values of the mean in the
population.
We can use various methods in SYSTAT to estimate the mean. One way is to
specify our model using Linear Regression. Select AGE and add it to the Dependent
list. With commands:
REGRESSION
MODEL AGE=CONSTANT
This model says that AGE is a function of a constant value ( μ ). The rest is error ( ε ).
Another method is to compute the mean from the Basic Statistics routines. The result
is shown below:
                     AGE
N OF CASES            50
MEAN              56.700
STANDARD DEV      11.620
STD. ERROR         1.643
Our best estimate of the mean age of people in Who's Who is 56.7 years.
Confidence Intervals
Our estimate seems reasonable, but it is not exactly correct. If we took more samples
of size 50 and computed estimates, how much would we expect them to vary? First, it
should be plain without any mathematics to see that the larger our sample, the closer
will be our sample estimate to the true value of μ in the population. After all, if we
could sample the entire population, the estimates would be the true values. Even so, the
variation in sample estimates is a function only of the sample size and the variation of
the ages in the population. It does not depend on the size of the population (number of
people in the book). Specifically, the standard deviation of the sample mean is the
standard deviation of the population divided by the square root of the sample size. This
standard error of the mean is listed on the output above as 1.643. On average, we
would expect our sample estimates of the mean age to vary by plus or minus a little
more than one and a half years, assuming samples of size 50.
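As a check on the standard error reported in the output above,

11.620 / √50 ≈ 1.643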
If we knew the shape of the sampling distribution of mean age, we would be able to
complete our description of the accuracy of our estimate. There is an approximation
that works quite well, however. If the sample size is reasonably large (say, greater than
25), then the mean of a simple random sample is approximately normally distributed.
This is true even if the population distribution is not normal, provided the sample size
is large.
[Figure: the normal approximation to the sampling distribution of mean age, a normal curve centered near 56.7 (horizontal axis: Mean Age, 50 to 65; vertical axis: Density).]
We have drawn the graph so that the central area comprises 95% of all the area under
the curve (from about 53.5 to 59.9). From this normal approximation, we have built a
95% symmetric confidence interval that gives us a specific idea of the variability of
our estimate. If we did this entire procedure again (sample 50 names, compute the
mean and its standard error, and construct a 95% confidence interval using the normal
approximation), then we would expect that 95 intervals out of a hundred so
constructed would cover the real population mean age. Remember, the population mean
age is not necessarily at the center of the interval that we just constructed, but we do
expect the interval to be close to it.
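As a check, the interval in the figure can be recovered from the estimates above:

56.700 ± 1.96 × 1.643, or roughly 53.5 to 59.9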
Hypothesis Testing
From the sample mean and its standard error, we can also construct hypothesis tests on
the mean. Suppose that someone believed that the average age of those listed in Who's
Who is 61 years. After all, we might have picked an unusual sample just through the
luck of the draw. Let's say, for argument, that the population mean age is 61 and the
standard deviation is 11.62. How likely would it be to find a sample mean age of 56.7?
If it is very unlikely, then we would reject this null hypothesis that the population mean
is 61. Otherwise, we would fail to reject it.
There are several ways to represent an alternative hypothesis against this null
hypothesis. We could make a simple alternative value of 56.7 years. Usually, however,
we make the alternative composite; that is, it represents a range of possibilities that do
not include the value 61. Here is how it would look:

H0: μ = 61 (null hypothesis)
HA: μ ≠ 61 (alternative hypothesis)
We would reject the null hypothesis if our sample value for the mean were outside of
a set of values that a population value of 61 could plausibly generate. In this context,
plausible means more probable than a conventionally agreed upon critical level for
our test. This value is usually 0.05. A result that would be expected to occur fewer than
five times in a hundred samples is considered significant and would be a basis for
rejecting our null hypothesis.
Constructing this hypothesis test is mathematically equivalent to sliding the normal
distribution in the above figure to center over 61. We then look at the sample value 56.7
to see if it is outside of the middle 95% of the area under the curve. If so, we reject the
null hypothesis.
[Figure: the same normal curve slid to center over a mean age of 61, with the sample value 56.7 marked in its lower tail (horizontal axis: Mean Age, 50 to 65; vertical axis: Density).]
The following t test output shows a p value (probability) of 0.012 for this test. Because
this value is lower than 0.05, we would reject the null hypothesis that the mean age is
61. This is equivalent to saying that the value of 61 does not appear in the 95%
confidence interval.
Ho: Mean  =    61.000
Mean      =    56.700
SD        =    11.620
df        =        49
95.00% CI =    53.398 to 60.002
t         =    -2.617
Prob      =     0.012
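The t statistic in this output can be reproduced from the quantities above:

t = (56.700 - 61.000) / 1.643 ≈ -2.617, with df = n - 1 = 49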
The mathematical duality between confidence intervals and hypothesis testing may
lead you to wonder which is more useful. The answer is that it depends on the context.
Scientific journals usually follow a hypothesis testing model because their null
hypothesis value for an experiment is usually 0 and the scientist is attempting to reject
the hypothesis that nothing happened in the experiment. Any rejection is usually taken
to be interesting, even when the sample size is so large that even tiny differences from
0 will be detected.
Those involved in making decisions (epidemiologists, business people,
engineers) are often more interested in confidence intervals. They focus on the size
and credibility of an effect and care less whether it can be distinguished from 0. Some
statisticians, called Bayesians, go a step further and consider statistical decisions as a
form of betting. They use sample information to modify prior hypotheses. See Box and
Tiao (1973) or Berger (1985) for further information on Bayesian statistics.
Checking Assumptions
Now that we have finished our analyses, we should check some of the assumptions we
made in doing them. First, we should examine whether the data look normally
distributed. Although sample means will tend to be normally distributed even when the
population isn't, it helps to have a normally distributed population, especially when we
do not know the population standard deviation. The stem-and-leaf plot gives us a quick
idea:
Stem and leaf plot of variable: AGE, N = 50

Minimum:       34.000
Lower hinge:   49.000
Median:        56.000
Upper hinge:   66.000
Maximum:       81.000

  3     4
  3     689
  4     14
  4 H   556778999
  5     0011112
  5 M   556688889
  6     0023
  6 H   55677789
  7     04
  7     5668
  8     1
There is another plot, called a dot histogram (dit) plot, which looks like a stem-and-leaf
plot. We can use different symbols to denote males and females in this plot, however,
to see if there are differences in these subgroups. Although there are not enough
females in the sample to be sure of a difference, it is nevertheless a good idea to
examine it. The dot histogram reveals four of the six females to be younger than
everyone else.
A better test of normality is to plot the sorted age values against the corresponding
values of a mathematical normal distribution. This is called a normal probability plot.
If the data are normally distributed, then the plotted values should fall approximately on
a straight line. Our data plot fairly straight. Again, different symbols are used for the
males and females. The four young females appear in the bottom left corner of the plot.
Does this possible difference in ages by gender invalidate our results? No, but it
suggests that we might want to examine the gender differences further to see whether
or not they are significant.
References
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. 2nd ed. New York:
Springer Verlag.
Box, G. E. P. and Tiao, G. C. (1973). Bayesian inference in statistical analysis. Reading,
Mass.: Addison-Wesley.
Chapter 2
Bootstrapping and Sampling

Leland Wilkinson and Laszlo Engelman
Statistical Background
Bootstrap (Efron and Tibshirani, 1993) is the most recent and most powerful of a
variety of strategies for producing estimates of parameters in samples taken from
unknown probability distributions. Efron and LePage (1992) summarize the problem
most succinctly. We have a set of real-valued observations x1, ..., xn independently
sampled from an unknown probability distribution F. We are interested in estimating
some parameter θ by using the information in the sample data with an estimator
θ̂ = t(x). Some measure of the estimate's accuracy is as important as the estimate
itself; we want a standard error of θ̂ and, even better, a confidence interval on the true
value θ.
Classical statistical methods provide a powerful way of handling this problem when
F is known and θ is simple, when θ, for example, is the mean of the normal
distribution. Focusing on the standard error of the mean, we have

se\{\bar{x}; F\} = \sqrt{\frac{\sigma^2(F)}{n}}

where σ²(F) is the variance of F. Estimating σ²(F) from the sample by

\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}

we have:

se(\bar{x}) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n(n - 1)}}
Parametric methods often work fairly well even when the distribution is contaminated
or only approximately known because the central limit theorem shows that sums of
independent random variables with finite variances tend to be normal in large samples
even when the variables themselves are not normal. But problems arise for estimates
more complicated than a mean: medians, sample correlation coefficients, or
eigenvalues, especially in small or medium-sized samples and even, in some cases, in
large samples.
Strategies for approaching this problem nonparametrically have involved using
the empirical distribution F̂ to obtain information needed for the standard error
estimate. One approach is Tukey's jackknife (Tukey, 1958), which is offered in
SAMPLE=JACKKNIFE. Tukey proposed computing n subsets of (x1, ..., xn), each
consisting of all of the cases except the ith deleted case (for i = 1, ..., n). He
produced standard errors as a function of the n estimates from these subsets.
Another approach has involved subsampling, usually via simple random samples.
This option is offered in SAMPLE=SIMPLE. A variety of researchers in the 1950s and
1960s explored these methods empirically (for example, Block, 1960; see Noreen,
1989, for others). This method amounts to a Monte Carlo study in which the sample is
treated as the population. It is also closely related to methodology for permutation tests
(Fisher, 1935; Dwass, 1957; Edginton, 1980).
The bootstrap (Efron, 1979) has been the focus of most recent theoretical research.
Here F̂ is the empirical distribution, which places probability 1/n on each observed
value xi. Plugging F̂ into the formula for the standard error of the mean, we have

se\{\bar{x}, \hat{F}\} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n^2}}
The computer algorithm for getting the samples for generating F̂ is to sample from
(x1, ..., xn) with replacement. Efron and other researchers have shown that the
general procedure of generating samples and computing estimates yields data
on which we can make useful inferences. For example, instead of computing only θ̂
and its standard error, we can do histograms, densities, order statistics (for symmetric
and asymmetric confidence intervals), and other computations on our estimates. In
other words, there is much to learn from the bootstrap sample distributions of the
estimates themselves.
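For reference, the usual way to turn B bootstrap estimates θ̂*1, ..., θ̂*B into a standard
error (the standard formula from Efron and Tibshirani, 1993, which this chapter does
not write out) is

\widehat{se}_{boot} = \sqrt{\frac{1}{B - 1} \sum_{b=1}^{B} (\hat{\theta}^*_b - \bar{\theta}^*)^2}, \qquad \bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^*_b

This is what the standard deviations of the saved estimates in the examples below amount to.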
There are some concerns, however. The naive bootstrap computed this way (with
SAMPLE=BOOT and STATS for computing means and standard deviations) is not
especially good for long-tailed distributions. It is also not suited for time-series or
stochastic data. See LePage and Billard (1992) for recent research on and solutions to
some of these problems. There are also several simple improvements to the naive
bootstrap. One is the pivot, or bootstrap-t method, discussed in Efron and Tibshirani
(1993). This is especially useful for confidence intervals on the mean of an unknown
distribution. Efron (1982) discusses other applications. There are also refinements
based on correction for bias in the bootstrap sample itself (DiCiccio and Efron, 1996).
In general, however, the naive bootstrap can help you get better estimates of
standard errors and confidence intervals than many large-sample approximations, such
as Fisher's z transformation for Pearson correlations or Wald tests for coefficients in
nonlinear models. And in cases in which no good approximations are available (see
some of the examples below), the bootstrap is the only way to go.
Bootstrapping in SYSTAT
Bootstrap Main Dialog Box
No dialog box exists for performing bootstrapping; therefore, you must use SYSTAT's
command language. To do a bootstrap analysis, simply add the sample type to the
command that initiates model estimation (usually ESTIMATE).
Using Commands
The syntax is:
ESTIMATE / SAMPLE=BOOT(m,n)
SIMPLE(m,n)
JACK
The arguments m and n stand for the number of samples and the sample size of each
sample. The parameter n is optional and defaults to the number of cases in the file.
The BOOT option generates samples with replacement, SIMPLE generates samples
without replacement, and JACK generates a jackknife set.
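For instance, here is a minimal sketch patterned after Example 1 below (the file name
MYDATA and the variables Y and X are hypothetical, used only for illustration):

USE MYDATA
GLM
MODEL Y = CONSTANT + X
SAVE BCOEF / COEF
ESTIMATE / SAMPLE=BOOT(1000)

This requests 1,000 bootstrap samples of the default size (the number of cases in
MYDATA) and saves the 1,000 sets of estimated coefficients into the file BCOEF for
later summary and plotting.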
Usage Considerations
Types of data. Bootstrapping works on procedures with rectangular data only.
Print options. It is best to set PRINT=NONE; otherwise, you will get 16 miles of output.
If you want to watch, however, set PRINT=LONG and have some fun.
Quick Graphs. Bootstrapping produces no Quick Graphs. You use the file of bootstrap
estimates and produce the graphs you want. See the examples.
Saving files. If you are doing this for more than entertainment (watching output fly by),
save your data into a file before you use the ESTIMATE / SAMPLE command. See the
examples.
BY groups. By all means. Are you a masochist?
Case frequencies. Yes, FREQ=<variable> works. This feature does not use extra
memory.
Case weights. Use case weighting if it is available in a specific module.
Examples
A few examples will serve to illustrate bootstrapping. They cover only a few of the
statistical modules, however. We will focus on the tools you can use to manipulate
output and get the summary statistics you need for bootstrap estimates.
Example 1
Linear Models
This example involves the famous Longley (1967) regression data. These real data
were collected by James Longley at the Bureau of Labor Statistics to test the limits of
regression software. The predictor variables in the data set are highly collinear, and
several coefficients of variation are extremely large. The input is:
USE LONGLEY
GLM
MODEL TOTAL=CONSTANT+DEFLATOR..TIME
SAVE BOOT / COEF
ESTIMATE / SAMPLE=BOOT(2500,16)
OUTPUT TEXT1
USE LONGLEY
MODEL TOTAL=CONSTANT+DEFLATOR..TIME
ESTIMATE
USE BOOT
STATS
STATS X(1..6)
OUTPUT *
BEGIN
DEN X(1..6) / NORM
DEN X(1..6)
END
Notice that we save the coefficients into the file BOOT. We request 2500 bootstrap
samples of size 16 (the number of cases in the file). Then we fit the Longley data with
a single regression to compare the result to our bootstrap. Finally, we use the bootstrap
file and compute basic statistics on the bootstrap estimated regression coefficients. The
OUTPUT command is used to save this part of the output to a file. We should not use it
earlier in the program unless we want to save the output for the 2500 regressions. To
view the bootstrap distributions, we create histograms on the coefficients to see their
distribution.
The resulting output is:
Variables in the SYSTAT Rectangular file are:
   DEFLATOR   GNP   UNEMPLOY   ARMFORCE   TOTAL   POPULATN   TIME

Dep Var: TOTAL   N: 16   Multiple R: 0.998

Effect         Coefficient     Std Error    Tolerance         t    P(2 Tail)
CONSTANT     -3482258.635    890420.384            .     -3.911        0.004
DEFLATOR           15.062        84.915        0.007      0.177        0.863
GNP                -0.036         0.033        0.001     -1.070        0.313
UNEMPLOY           -2.020         0.488        0.030     -4.136        0.003
ARMFORCE           -1.033         0.214        0.279     -4.822        0.001
POPULATN           -0.051         0.226        0.003     -0.226        0.826
TIME             1829.151       455.478        0.001      4.016        0.003

Analysis of Variance
Source        Sum-of-Squares    df    Mean-Square    F-ratio        P
Regression       1.84172E+08     6    3.06954E+07    330.285    0.000
Residual          836424.056     9      92936.006

Durbin-Watson D Statistic      2.559
First Order Autocorrelation   -0.348

         N of cases     Minimum       Maximum        Mean    Standard Dev
X(1)           2500    -816.248      1312.052      20.648         128.301
X(2)           2500      -0.846         0.496      -0.049           0.064
X(3)           2500     -12.994         7.330      -2.214           0.903
X(4)           2500      -8.864         2.617      -1.118           0.480
X(5)           2500      -2.591      3142.235       1.295          62.845
X(6)           2499   -5050.438     12645.703    1980.382         980.870
[Figure: histograms of the 2,500 bootstrap estimates of each regression coefficient, X(1) through X(6), with normal curves superimposed.]
The bootstrapped standard errors are all larger than the normal-theory standard errors.
The most dramatically different are the ones for the POPULATN coefficient (62.845
versus 0.226). It is well known that multicollinearity leads to large standard errors for
regression coefficients, but the bootstrap makes this even clearer.
Normal curves have been superimposed on the histograms, showing that the
coefficients are not normally distributed. We have run a relatively large number of
samples (2500) to reveal these long-tailed distributions. Were these data to be analyzed
formally, it would take a huge number of samples to get useful standard errors.
Beaton, Rubin, and Barone (1976) used a randomization technique to highlight this
problem. They added a uniform random extra digit to Longley's data so that their data
sets rounded to Longleys values and found in a simulation that the variance of the
simulated coefficient estimates was larger in many cases than the miscalculated
solutions from the poorer designed regression programs.
Example 2
Spearman Rank Correlation
This example involves law school data from Efron and Tibshirani (1993). They use
these data to illustrate the usefulness of the bootstrap for calculating standard errors on
the Pearson correlation. There are similar calculations for a 95% confidence interval
on the Spearman correlation.
The bootstrap estimates are saved into a temporary file. The file format is
CORRELATION, meaning that 1000 correlation matrices will be saved, stacked on top
of each other in the file. Consequently, we need BASIC to sift through and delete every
odd line (the diagonal of the matrix). We also have to remember to change the file type
to RECTANGULAR so that we can sort and do other things later. Another approach
would have been to use the rectangular form of the correlation output:
SPEARMAN LSAT*GPA
Next, we reuse the new file and sort the correlations. Finally, we print the nearest
values to the percentiles. Following is the input:
CORR
GRAPH NONE
USE LAW
RSEED=54321
SAVE TEMP
SPEARMAN LSAT GPA / SAMPLE=BOOT(1000,15)
BASIC
USE TEMP
TYPE=RECTANGULAR
IF CASE<>2*INT(CASE/2) THEN DELETE
SAVE BLAW
RUN
USE BLAW
SORT LSAT
IF CASE=975 THEN PRINT "95% CI Upper:",LSAT
IF CASE=25 THEN PRINT "95% CI Lower:",LSAT
OUTPUT TEXT2
RUN
OUTPUT *
DENSITY LSAT
95% CI Lower:        0.476
95% CI Upper:        0.953
The histogram of the entire file shows the overall shape of the distribution. Notice its
asymmetry.
[Figure: histogram of the 1,000 bootstrap Spearman correlations (LSAT), with counts on the vertical axis; the distribution is asymmetric.]
Example 3
Confidence Interval on a Median
We will use the STATS module to compute a 95% confidence interval on the median
(Efron, 1979). The input is:
STATS
GRAPH NONE
USE OURWORLD
SAVE TEMP
STATS LIFE_EXP / MEDIAN,SAMPLE=BOOT(1000,57)
BASIC
USE TEMP
SAVE TEMP2
IF STATISTC$<>"Median" THEN DELETE
RUN
USE TEMP2
SORT LIFE_EXP
IF CASE=975 THEN PRINT "95% CI Upper:",LIFE_EXP
IF CASE=25 THEN PRINT "95% CI Lower:",LIFE_EXP
OUTPUT TEXT3
RUN
OUTPUT *
DENSITY LIFE_EXP
95% CI Lower:       63.000
95% CI Upper:       71.000
[Figure: histogram of the 1,000 bootstrap medians of LIFE_EXP (horizontal axis 50 to 80), with counts on the vertical axis.]
Keep in mind that we are using the naive bootstrap method here, trusting the
unmodified distribution of the bootstrap sample to set percentiles. Looking at the
bootstrap histogram, we can see that the distribution is skewed and irregular. There are
improvements that can be made in these estimates. Also, we have to be careful about
how we interpret a confidence interval on a median.
Example 4
Canonical Correlations: Using Text Output
Most statistics can be bootstrapped by saving into SYSTAT files, as shown in the
examples. Sometimes you may want to search through bootstrap output for a single
number and compute standard errors or graphs for that statistic. The following example
uses SETCOR to compute the distribution of the two canonical correlations relating the
species to measurements in the Fisher Iris data. The same correlations are computed in
the DISCRIM procedure. Following is the input:
SETCOR
USE IRIS
MODEL SPECIES=SEPALLEN..PETALWID
CATEGORY SPECIES
OUTPUT TEMP
ESTIMATE / SAMPLE=BOOT(500,150)
OUTPUT *
BASIC
GET TEMP
INPUT A$,B$
LET R1=.
LET R2=.
LET FOUND=.
IF A$="Canonical" AND B$="correlations" ,
THEN LET FOUND=CASE
IF LAG(FOUND,2)<>. THEN FOR
LET R1=VAL(A$)
LET R2=VAL(B$)
NEXT
IF R1=. AND R2=. THEN DELETE
SAVE CC
RUN
EXIT
USE CC
DENSITY R1 R2 / DIT
Notice how the BASIC program searches through the output file TEMP.DAT for the
words "Canonical correlations" at the beginning of a line. Two lines later, the actual
numbers are in the output, so we use the LAG function to check when we are at that
point after having located the string. Then we convert the printed values back to
numbers with the VAL() function. If you are concerned with precision, use a larger
format for the output. Finally, we delete unwanted rows and save the results into the
file CC. From that file, we plot the two canonical correlations. For fun, we do a dot
histogram (dit) plot.
Notice the stripes in the plot on the left. These reveal the three-digit rounding we
incurred by using the standard FORMAT=3.
Computation
Computations are done by the respective statistical modules. Sampling is done on the
data.
Algorithms
Bootstrapping and other sampling is implemented via a one-pass algorithm that does
not use extra storage for the data. Samples are generated using the SYSTAT uniform
random number generator. It is always a good idea to reset the seed when running a
problem so that you can be certain where the random number generator started if it
becomes necessary to replicate your results.
Missing Data
Cases with missing data are handled by the specific module.
References
Beaton, A. E., Rubin, D. B., and Barone, J. L. (1976). The acceptability of regression
solutions: Another look at computational accuracy. Journal of the American Statistical
Association, 71, 158-168.
Block, J. (1960). On the number of significant findings to be expected by chance.
Psychometrika, 25, 369-380.
DiCiccio, T. J. and Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11,
189-228.
Dwass, M. (1957). Modified randomization sets for nonparametric hypotheses. Annals of
Mathematical Statistics, 29, 181-187.
Edginton, E. S. (1980). Randomization tests. New York: Marcel Dekker.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of
Statistics, 7, 1-26.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Vol. 38 of
CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia, Penn.:
SIAM.
Efron, B. and LePage, R. (1992). Introduction to bootstrap. In R. LePage and L. Billard
(eds.), Exploring the Limits of Bootstrap. New York: John Wiley & Sons, Inc.
Efron, B. and Tibshirani, R. J. (1993). An introduction to the bootstrap. New York:
Chapman & Hall.
Fisher, R. A. (1935). The design of experiments. London: Oliver & Boyd.
Longley, J. W. (1967). An appraisal of least squares for the electronic computer from the
point of view of the user. Journal of the American Statistical Association, 62, 819-841.
Noreen, E. W. (1989). Computer intensive methods for testing hypotheses: An introduction.
New York: John Wiley & Sons, Inc.
Tukey, J. W. (1958). Bias and confidence in not quite large samples. Annals of
Mathematical Statistics, 29, 614.
Chapter 3
Classification and Regression Trees

Leland Wilkinson
The TREES module computes classification and regression trees. Classification trees
include those models in which the dependent variable (the predicted variable) is
categorical. Regression trees include those in which it is continuous. Within these
types of trees, the TREES module can use categorical or continuous predictors,
depending on whether a CATEGORY statement includes some or all of the predictors.
For any of the models, a variety of loss functions is available. Each loss function
is expressed in terms of a goodness-of-fit statistic, the proportion of reduction in
error (PRE). For regression trees, this statistic is equivalent to the multiple R². Other
loss functions include the Gini index, twoing (Breiman et al., 1984), and the phi
coefficient.
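Written out (the notation here is mine, not the chapter's), a PRE statistic compares the
loss for the fitted tree with the loss for a model containing no splits:

PRE = 1 - \frac{\text{loss for the fitted tree}}{\text{loss for a model with no splits}}

For a regression tree fit by least squares, this is the familiar R² = 1 - SS(residual) / SS(total).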
TREES produces graphical trees called mobiles (Wilkinson, 1995). At the end of
each branch is a density display (box plot, dot plot, histogram, etc.) showing the
distribution of observations at that point. The branches balance (like a Calder mobile)
at each node so that the branch is level, given the number of observations at each end.
The physical analogy is most obvious for dot plots, in which the stacks of dots (one
for each observation) balance like marbles in bins.
TREES can also produce a SYSTAT BASIC program to code new observations
and predict the dependent variable. This program can be saved to a file and run from
the command window or submitted as a program file.
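The generated program is ordinary SYSTAT BASIC built from IF ... THEN statements.
As a purely hypothetical sketch of the kind of code involved (the file and variable names
and the cutoffs below are invented for illustration; they are not output of TREES):

BASIC
USE NEWAPPS
LET ADMIT = 1
IF GPA < 3.47 AND MCATV < 555 THEN LET ADMIT = 0
SAVE SCORED
RUN

Each case in NEWAPPS falls down the branches of the fitted tree and receives a
predicted value (here, ADMIT) in the saved file SCORED.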
Statistical Background
Trees are directed graphs beginning with one node and branching to many. They are
fundamental to computer science (data structures), biology (classification),
psychology (decision theory), and many other fields. Classification and regression
trees are used for prediction. In the last two decades, they have become popular as
alternatives to regression, discriminant analysis, and other procedures based on
algebraic models. Tree-fitting methods have become so popular that several
commercial programs now compete for the attention of market researchers and others
looking for software.
Different commercial programs produce different results with the same data,
however. Worse, some programs provide no documentation or supporting materials to
explain their algorithms. The result is a marketplace of competing claims, jargon, and
misrepresentation. Reviews of these packages (for example, Levine, 1991; Simon,
1991) use words like "sorcerer," "magic formula," and "wizardry" to describe the
algorithms and express frustration at vendors' scant documentation. Some vendors, in
turn, have represented tree programs as state-of-the-art artificial intelligence
procedures capable of discovering hidden relationships and structures in databases.
Despite the marketing hyperbole, most of the now-popular tree-fitting algorithms
have been around for decades. The modern commercial packages are mainly
microcomputer ports (with attractive interfaces) of the mainframe programs that
originally implemented these algorithms. Warnings of abuse of these techniques are
not new either (for example, Einhorn, 1972; Bishop, Fienberg, and Holland, 1975).
Originally proposed as automatic procedures for detecting interactions among
variables, tree-fitting methods are actually closely related to classical cluster analysis
(Hartigan, 1975).
This introduction will attempt to sort out some of the differences between
algorithms and illustrate their use on real data. In addition, tree analyses will be
compared to discriminant analysis and regression.
The top node contains the entire sample. Each remaining node contains a subset of
the sample in the node directly above it. Furthermore, each node contains the sum of
the samples in the nodes connected to and directly below it. The tree thus splits
samples.
Each node can be thought of as a cluster of objects, or cases, that is to be split by
further branches in the tree. The numbers in parentheses below the terminal nodes
show how many cases are incorrectly classified by the tree. A similar tree data structure
is used for representing the results of single and complete linkage and other forms of
hierarchical cluster analysis (Hartigan, 1975). Tree prediction models add two
ingredients: the predictor and predicted variables labeling the nodes and branches.
[Figure: a binary classification tree for medical school admissions, n = 727. The top node splits on GRADE POINT AVERAGE at 3.47 (342 versus 385 applicants); each of the two branches splits on MCAT VERBAL at 555, and one of those branches splits again on MCAT QUANTITATIVE at 655. The terminal nodes are labeled REJECT or INTERVIEW, with the number of misclassified cases in parentheses (9, 19, 49, 45, and 46).]
The tree is binary because each node is split into only two subsamples. Classification
or regression trees do not have to be binary, but most are. Despite the marketing claims
of some vendors, nonbinary, or multibranch, trees are not superior to binary trees. Each
is a permutation of the other, as shown in the figure below.
The tree on the left (ternary) is not more parsimonious than that on the right (binary).
Both trees have the same number of parameters, or split points, and any statistics
associated with the tree on the left can be converted trivially to fit the one on the right.
A computer program for scoring either tree (IF ... THEN ... ELSE) would look identical.
For display purposes, it is often convenient to collapse binary trees into multibranch
trees, but this is not necessary.
[Figure: a three-way (ternary) split of categories 1, 2, and 3 on the left and an equivalent pair of binary splits on the right.]
Some programs that do multibranch splits do not allow further splitting on a predictor
once it has been used. This has an appealing simplicity. However, it can lead to
unparsimonious trees. It is unnecessary to make this restriction before fitting a tree.
The figure below shows an example of this problem. The upper right tree classifies
objects on an attribute by splitting once on shape, once on fill, and again on shape. This
allows the algorithm to separate the objects into only four terminal nodes having
common values. The upper left tree splits on shape and then only on fill. By not
allowing any other splits on shape, the tree requires five terminal nodes to classify
correctly. This problem cannot be solved by splitting first on fill, as the lower left tree
shows. In general, restricting splits to only one branch for each predictor results in
more terminal nodes.
[Figure: trees classifying objects by shape and fill. When a predictor may be used only once (left), five terminal nodes are needed; when shape may be split again after fill (upper right), four terminal nodes suffice.]
Regression Trees
Morgan and Sonquist (1963) proposed a simple method for fitting trees to predict a
quantitative variable. They called the method Automatic Interaction Detection
(AID). The algorithm performs stepwise splitting. It begins with a single cluster of
cases and searches a candidate set of predictor variables for a way to split the cluster
into two clusters. Each predictor is tested for splitting as follows: sort all the n cases on
the predictor and examine all n 1 ways to split the cluster in two. For each possible
split, compute the within-cluster sum of squares about the mean of the cluster on the
dependent variable. Choose the best of the n 1 splits to represent the predictors
contribution. Now do this for every other predictor. For the actual split, choose the
predictor and its cut point that yields the smallest overall within-cluster sum of squares.
Categorical predictors require a different approach. Since categories are unordered,
all possible splits between categories must be considered. For deciding on one split of
k
k categories into two groups, this means that 2^(k-1) - 1 possible splits must be considered.
squares as for a quantitative predictor.
Morgan and Sonquist called their algorithm AID because it naturally incorporates
interaction among predictors. Interaction is not correlation. It has to do, instead, with
conditional discrepancies. In the analysis of variance, interaction means that a trend
within one level of a variable is not parallel to a trend within another level of the same
variable. In the ANOVA model, interaction is represented by cross-products between
predictors. In the tree model, it is represented by branches from the same node that
have different splitting predictors further down the tree.
The figure below shows a tree without interactions on the left and with interactions on
the right. Because interaction trees are a natural by-product of the AID splitting
algorithm, Morgan and Sonquist called the procedure automatic. In fact, AID trees
without interactions are quite rare for real data, so the procedure is indeed automatic. To
search for interactions using stepwise regression or ANOVA linear modeling, we would
have to generate 2^p interactions among p predictors and compute partial correlations for
every one of them in order to decide which ones to include in our formal model.
[Figure: a tree without interactions (left) and a tree with interactions (right).]
Classification Trees
Regression trees parallel regression/ANOVA modeling, in which the dependent
variable is quantitative. Classification trees parallel discriminant analysis and
algebraic classification methods. Kass (1980) proposed a modification to AID called
CHAID for categorized dependent and independent variables. His algorithm
incorporated a sequential merge-and-split procedure based on a chi-square test
statistic. Kass was concerned about computation time (although this has since proved
an unnecessary worry), so he decided to settle for a suboptimal split on each predictor
instead of searching for all possible combinations of the categories. Kass's algorithm
is like sequential crosstabulation (a rough sketch of the merging loop follows the list). For each predictor:
n Crosstabulate the m categories of the predictor with the k categories of the
dependent variable.
n Find the pair of predictor categories whose 2 × k subtable is least
significantly different on a chi-square test and merge these two categories.
n If the chi-square test statistic is not significant according to a preset critical value,
repeat this merging process for the selected predictor until no nonsignificant chi-square is found for a subtable.
n Choose the predictor variable whose chi-square is the largest and split the sample
into l subsets (l ≤ m), where l is the number of categories resulting from the merging
process on that predictor.
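The sketch below illustrates the merging loop for a single categorical predictor. It is a rough paraphrase of the procedure in plain Python (with scipy assumed), not Kass's published algorithm: the pair of predictor categories whose 2 × k subtable is least significant is merged repeatedly until every remaining pair differs.

import numpy as np
from scipy.stats import chi2_contingency

def merge_categories(table, labels, alpha=0.05):
    # Greedily merge the pair of predictor categories (rows) whose 2 x k
    # subtable is least significantly different, until all remaining pairs differ.
    table = np.asarray(table, dtype=float)
    labels = list(labels)
    while len(labels) > 2:
        best = None                                   # (p-value, i, j)
        for i in range(len(labels)):
            for j in range(i + 1, len(labels)):
                p = chi2_contingency(table[[i, j]])[1]
                if best is None or p > best[0]:
                    best = (p, i, j)
        p, i, j = best
        if p < alpha:                                 # every pair differs: stop merging
            break
        table[i] += table[j]                          # pool the two categories
        labels[i] = labels[i] + "+" + labels[j]
        table = np.delete(table, j, axis=0)
        del labels[j]
    return labels, table

# A 4-category predictor crosstabulated against a 3-category dependent variable.
counts = [[20, 15, 5], [18, 16, 6], [5, 10, 25], [6, 9, 24]]
print(merge_categories(counts, ["a", "b", "c", "d"]))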
The CHAID algorithm saves computer time, but it is not guaranteed to find the splits
that predict best at a given step. Only by searching all possible category subsets can we
do that. CHAID is also limited to categorical predictors, so it cannot be used for
quantitative or mixed categorical-quantitative models, as in the figure on p. 33.
Nevertheless, it is an effective way to search heuristically through rather large tables
quickly.
Note: Within the computer science community, there is a categorical splitting literature
that often does not cite the statistical work and is, in turn, not frequently cited by
statisticians (although this has changed in recent years). Quinlan (1986, 1992), the best
known of these researchers, developed a set of algorithms based on information theory.
These methods, called ID3, iteratively build decision trees based on training samples
of attributes.
When a tree grown on one sample is used to classify or predict new cases, its error rate is usually
considerably higher than for the training sample on which it was constructed.
Whenever possible, data should be reserved for cross-validation.
Loss Functions
Different loss functions are appropriate for different forms of data. TREES offers a
variety of functions that are scaled as proportional reduction in error (PRE) statistics.
This allows you to try different loss functions on a problem and compare their
predictive validity.
For regression trees, the most appropriate loss functions are least squares, trimmed
mean, and least absolute deviations. Least-squares loss yields the classic AID tree. At
each split, cases are classified so that the within-group sum of squares about the mean
of the group is as small as possible. The trimmed mean loss works the same way but
first trims 20% of outlying cases (10% at each extreme) in a splittable subset before
computing the mean and sum of squares. It can be useful when you expect outliers in
subgroups and don't want them to influence the split decisions. LAD loss computes
least absolute deviations about the mean rather than squares. It, too, gives less weight
to extreme cases in each potential group.
For classification trees, use the phi coefficient (the default), Gini index, or twoing.
The phi coefficient is χ²/n for a 2 × k table formed by the split on k categories of
the dependent variable. The Gini index is a variance estimate based on all comparisons
of possible pairs of values in a subgroup. Finally, twoing is a word coined by Breiman
et al. to describe splitting k categories as if it were a two-category splitting problem.
For more information about the effects of Gini and twoing on computations, see
Breiman et al. (1984).
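For readers who want the arithmetic, here is a small Python sketch (scipy assumed) of the two classification measures as described above: the phi coefficient computed as χ²/n on the 2 × k table formed by a split, and the usual Gini impurity of a node's class counts. These follow the textbook definitions rather than the exact scaling TREES reports.

import numpy as np
from scipy.stats import chi2_contingency

def phi_coefficient(split_table):
    # chi-square / n for the 2 x k table formed by one binary split.
    table = np.asarray(split_table, dtype=float)
    chi2 = chi2_contingency(table, correction=False)[0]
    return chi2 / table.sum()

def gini_impurity(class_counts):
    # 1 - sum(p_i^2): the chance two cases drawn from the node disagree.
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - (p ** 2).sum()

# A split of 100 cases on a three-category dependent variable.
table = [[30, 10, 5],    # left child
         [5, 15, 35]]    # right child
print(round(phi_coefficient(table), 3))
print([round(gini_impurity(row), 3) for row in table])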
Geometry
Most discussions of trees versus other classifiers compare tree graphs and algebraic
equations. There is another graphic view of what a tree classifier does, however. If we
look at the cases embedded in the space of the predictor variables, we can ask how a
linear discriminant analysis partitions the cases and how a tree classifier partitions them.
The figure below shows how cases are split by a linear discriminant analysis. There
are three subgroups of cases in this example. The cutting planes are positioned
approximately halfway between each pair of group centroids. Their orientation is
determined by the discriminant analysis. With three predictors and four groups, there
are six cutting planes, although only four planes show in the figure. The fourth group
is assumed to be under the bottom plane in the figure. In general, if there are g groups,
the linear discriminant model cuts them with g(g − 1)/2 planes.
The figure below shows how a tree-fitting algorithm cuts the same data. Only the
nearest subgroup (dark spots) shows; the other three groups are hidden behind the rear
and bottom cutting planes. Notice that the cutting planes are parallel to the axes. While
this would seem to restrict the discrimination compared to the more flexible angles
allowed the discriminant planes, the tree model allows interactions between variables,
which do not appear in the ordinary linear discriminant model. Notice, for example,
that one plane splits on the X variable, but the second plane that splits on the Y variable
cuts only the values to the left of the X partition. The tree model can continue to cut
any of these subregions separately, unlike the discriminant model, which can cut only
globally and with g(g − 1)/2 planes. This is a mixed blessing, however, since tree
methods, as we have seen, can over-fit the data. It is critical to test them on new
samples.
Tree models are not usually related by authors to dimensional plots in this way, but
it is helpful to see that they have a geometric interpretation. Alternatively, we can
construct algebraic expressions for trees. They would require dummy variables for any
categorical predictors and interaction (or product) terms for every split whose
descendants (or lower nodes) did not involve the same variables on both sides.
Model selection and estimation are available in the main Trees dialog box:
Dependent. The variable you want to examine. The dependent variable should be a
continuous or categorical numeric variable (for example, INCOME).
Independent(s). Select one or more continuous or categorical variables (grouping
variables).
Expand Model. Adds all possible sums and differences of the predictors to the model.
Loss. Select a loss function from the drop-down list.
n Least squares. The least-squares loss (AID) minimizes the sum of the squared
deviations.
n Trimmed mean. The trimmed mean loss (TRIM) trims the extreme observations
(10% at each end) before computing the mean and sum of squares.
n Least absolute deviations. The LAD loss computes absolute rather than squared
deviations and gives less weight to extreme cases.
n Phi coefficient. The phi coefficient loss, the default for classification trees, applies
to categorical (including dichotomous) dependent variables.
n Gini index. The Gini index loss measures inequality or dispersion.
n Twoing. The twoing loss function.
Display nodes as. Select the type of density display. The following types are available:
n Box plot. Plot that uses boxes to show a distribution shape, central tendency, and
variability.
n Dit plot. Dot histogram. Produces a density display that looks similar to a histogram.
n Jitter. Jitters points randomly on a short vertical axis to keep points from colliding.
n Stripe. Places vertical lines at the location of data values along a horizontal data axis.
Stopping Criteria
The Stopping Criteria dialog box contains the parameters for controlling stopping.
Using Commands
After selecting a file with USE filename, continue with:
TREES
MODEL yvar = xvarlist / EXPAND
ESTIMATE / PMIN=d, SMIN=d, NMIN=n, NSPLIT=n,
        LOSS=LSQ | TRIM | LAD | PHI | GINI | TWOING,
        DENSITY=STRIPE | JITTER | DOT | DIT | BOX
Usage Considerations
Types of data. TREES uses rectangular data only.
Print options. The default output includes the splitting history and summary statistics.
PRINT=LONG adds a BASIC program for classifying new observations. You can cut
and paste this BASIC program into a text window and run it in the BASIC module to
classify new data on the same variables for cross-validation and prediction.
Quick Graphs. TREES produces a Quick Graph for the fitted tree. The nodes may
contain text describing split parameters or they may contain density graphs of the data
being split. A dashed line indicates that the split is not significant.
Saving files. TREES does not save files. Use the BASIC program under PRINT=LONG
to classify your data, compute residuals, etc., on old or new data.
BY groups. TREES analyzes data by groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. FREQ = <variable> increases the number of cases by the FREQ
variable.
Case weights. WEIGHT is not available in TREES.
Examples
The following examples illustrate the features of the TREES module. The first example
shows a classification tree for the Fisher-Anderson iris data set. The second example
is a regression tree on an example taken from Breiman et al. (1984), and the third is a
regression tree predicting the danger of a mammal being eaten by predators.
Example 1
Classification Tree
This example shows a classification tree analysis of the Fisher-Anderson iris data set
featured in Discriminant Analysis. We use the Gini loss function and display a
graphical tree, or mobile, with dot histograms, or dit plots. The input is:
USE IRIS
LAB SPECIES/1=SETOSA,2=VERSICOLOR,3=VIRGINICA
TREES
MODEL SPECIES=SEPALLEN,SEPALWID,PETALLEN,PETALWID
ESTIMATE/LOSS=GINI,DENSITY=DIT
The PRE for the whole tree is 0.89 (similar to R² for a regression model), which is not
bad. Before exulting, however, we should keep in mind that while Fisher chose the iris
data set to demonstrate his discriminant model on real data, it is barely worthy of the
effort. We can classify the data almost perfectly by looking at a scatterplot of petal
length against petal width.
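PRE plays the same role here that R² plays in regression: the proportional reduction in the loss when the terminal nodes replace a single root node. A minimal sketch of the computation for a least-squares loss (illustrative values, not the iris output):

import numpy as np

def pre(y, node):
    # 1 - (within-node SS) / (total SS about the grand mean).
    y, node = np.asarray(y, dtype=float), np.asarray(node)
    total_ss = ((y - y.mean()) ** 2).sum()
    within_ss = sum(((y[node == g] - y[node == g].mean()) ** 2).sum()
                    for g in np.unique(node))
    return 1.0 - within_ss / total_ss

y      = [1.0, 1.2, 0.9, 4.8, 5.1, 5.0]
leaves = ["A", "A", "A", "B", "B", "B"]
print(round(pre(y, leaves), 3))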
The unique SYSTAT display of the tree is called a mobile (Wilkinson, 1995). The
dit plots are ideal for illustrating how it works. Imagine each case is a marble in a box
at each node. The mobile simply balances all of the boxes. The reason for doing this is
that we can easily see splits that cut only a few cases out of a group. These nodes will
hang out conspicuously. It is fairly evident in the first split, for example, which cuts the
population into half as many cases on the right (petal length less than 3) as on the left.
This display has a second important characteristic that is different from other tree
displays. The mobile coordinates the polarity of the terminal nodes (red on color
displays) rather than the direction of the splits. This design has three consequences: we
can evaluate the distributions of the subgroups on a common scale, we can see the
direction of the splits on each splitting variable, and we can look at the distributions on
the terminal nodes from left to right to see how the whole sample is split on the
dependent variable.
The first consequence means that every box containing data is a miniature density
display of the subgroups values on a common scale (same limits and same direction).
We don't need to drill down on the data in a subgroup to see its distribution. It is
immediately apparent in the tree. If you prefer box plots or other density displays,
simply use
DENSITY = BOX
or another density as an ESTIMATE option. Dit plots are most suitable for classification
trees, however; because they spike at the category values, they look like bar charts for
categorical data. For continuous data, dit plots look like histograms. Although they are
my favorite density display for this purpose, they can be time consuming to draw on
large samples, so box plots are the default graphical display. If you omit DENSITY
altogether, you will get a text summary inside each box.
The second consequence of ordering the splits according to the polarity of the
dependent (rather than the independent) variable is that the direction of the split can be
recognized immediately by looking at which side (left or right) the split is displayed
on. Notice that PETALLEN < 3.000 occurs on the left side of the first split. This means
that the relation between petal length and species (coded 1..3) is positive. The same is
true for petal width within the second split group because the split banner occurs on the
left. Banners on the right side of a split indicate a negative relationship between the
dependent variable and the splitting variable within the group being split, as in the
regression tree examples.
The third consequence of ordering the splits is that we can look at the terminal nodes
from left to right and see the consequences of the split in order. In the present example,
notice that the three species are ordered from left to right in the same order that they
are coded. You can change this ordering for a categorical variable with the CATEGORY
and ORDER commands. Adding labels, as we did here, makes the output more
interpretable.
Example 2
Regression Tree with Box Plots
This example shows a simple AID model. The data set is Boston housing prices, cited
in Belsley, Kuh, and Welsch (1980) and used in Breiman et al. (1984). We are
predicting median home values (MEDV) from a set of demographic variables. The
input is:
USE BOSTON
TREES
MODEL MEDV=CRIM..LSTAT
ESTIMATE/PMIN=.005,DENSITY=BOX
Fit: 0.453, 0.422, 0.505, 0.382, 0.405, 0.380, 0.337, 0.227
The Quick Graph of the tree more clearly reveals the sample-size feature of the mobile
display. Notice that a number of the splits, because they separate out a few cases only,
are extremely unbalanced. This can be interpreted in two ways, depending on context.
On the one hand, it can mean that outliers are being separated so that subsequent splits
can be more powerful. On the other hand, it can mean that a split is wasted by focusing
on the outliers when further splits dont help to improve the prediction. The former
case appears to apply in our example. The first split separates out a few expensive
housing tracts (the median values have a positively skewed distribution for all tracts),
which makes subsequent splits more effective. The box plots in the terminal nodes are
narrow.
Example 3
Regression Tree with Dit Plots
This example involves predicting the danger of a mammal being eaten by predators
(Allison and Cicchetti, 1976). The predictors are hours of dreaming and nondreaming
sleep, gestational age, body weight, and brain weight. Although the danger index has
only five values, we are treating it as a quantitative variable with meaningful numerical
values. The input is:
USE SLEEP
TREES
MODEL DANGER=BODY_WT,BRAIN_WT,
SLO_SLEEP,DREAM_SLEEP,GESTATE
ESTIMATE / DENSITY=DIT
The prediction is fairly good (PRE = 0.547). The Quick Graph of this tree illustrates
another feature of mobiles. The dots in each terminal node are assigned a separate
color. This way, we can follow their path up the tree each time they are merged. If the
prediction is perfect, the top density plot will have colored dots perfectly separated.
The extent to which the colors are mixed in the top plot is a visual indication of the
badness-of-fit of the model. The fairly good separation of colors for the sleep data is
quite clear on the computer screen or with color printing but less evident in a black-and-white figure.
Computation
Computations are in double precision.
Algorithms
TREES uses algorithms from Breiman et al. (1984) for its splitting computations.
Missing Data
Missing data are eliminated from the calculation of the loss function for each split
separately.
References
Allison, T. and Cicchetti, D. (1976). Sleep in mammals: Ecological and constitutional
correlates. Science, 194, 732-734.
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression diagnostics: Identifying
influential data and sources of collinearity. New York: John Wiley & Sons, Inc.
Bishop, Y. M., Fienberg, S. E., and Holland, P. W. (1975). Discrete multivariate analysis.
Cambridge, Mass.: MIT Press.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and
regression trees. Belmont, Calif.: Wadsworth.
Einhorn, H. (1972). Alchemy in the behavioral sciences. Public Opinion Quarterly, 36,
367-378.
Hartigan, J. A. (1975). Clustering algorithms. New York: John Wiley & Sons, Inc.
Chapter
4
Cluster Analysis
Leland Wilkinson, Laszlo Engelman, James Corter, and Mark Coward
Statistical Background
Cluster analysis is a multivariate procedure for detecting groupings in data. The objects
in these groups may be:
n Cases (observations or rows of a rectangular data file). For example, if health
indicators (numbers of doctors, nurses, hospital beds, life expectancy, etc.) are
recorded for countries (cases), then developed nations may form a subgroup or
cluster separate from underdeveloped countries.
n Variables (characteristics or columns of the data). For example, if causes of death
(cancer, cardiovascular, lung disease, diabetes, accidents, etc.) are recorded for
each U.S. state (case), the results show that accidents are relatively independent of
the illnesses.
n Cases and variables (individual entries in the data matrix). For example, certain
wines are associated with good years of production. Other wines have other years
that are better.
Types of Clustering
Clusters may be of two sorts: overlapping or exclusive. Overlapping clusters allow the
same object to appear in more than one cluster. Exclusive clusters do not. All of the
methods implemented in SYSTAT are exclusive.
There are three approaches to producing exclusive clusters: hierarchical,
partitioned, and additive trees. Hierarchical clusters consist of clusters that completely
contain other clusters that completely contain other clusters, and so on. Partitioned
clusters contain no other clusters. Additive trees use a graphical representation in
which distances along branches reflect similarities among the objects.
The cluster literature is diverse and contains many descriptive synonyms:
hierarchical clustering (McQuitty, 1960; Johnson, 1967); single linkage clustering
(Sokal and Sneath, 1963), and joining (Hartigan, 1975). Output from hierarchical
methods can be represented as a tree (Hartigan, 1975) or a dendrogram (Sokal and
Sneath, 1963). (The linkage of each object or group of objects is shown as a joining of
branches in a tree. The root of the tree is the linkage of all clusters into one set, and
the ends of the branches lead to each separate object.)
Standardizing Data
Before you compute a dissimilarity measure, you may need to standardize your data
across the measured attributes. Standardizing puts measurements on a common scale.
In general, standardizing makes overall level and variation comparable across
measurements. Consider the following data:
OBJECT    X1    X2    X3     X4
A         10     2    11    900
B         11     3    15    895
C         13     4    12    760
D         14     1    13    874
If we are clustering the four cases (A through D), variable X4 will determine almost
entirely the dissimilarity between cases, whether we use correlations or distances. If
we are clustering the four variables, whichever correlation measure we use will adjust
for the larger mean and standard deviation on X4. Thus, we should probably
standardize within columns if we are clustering rows and use a correlation measure if
we are clustering columns.
In the example below, case A will have a disproportionate influence if we are
clustering columns.
OBJECT     X1     X2     X3     X4
A         410    311    613    514
B           1      3      2      4
C          10     11     12     10
D          12     13     13     11
We should probably standardize within rows before clustering columns. This requires
transposing the data before standardization. If we are clustering rows, on the other
hand, we should use a correlation measure to adjust for the larger mean and standard
deviation of case A.
These are not immutable laws. The suggestions are only to make you realize that
scales can influence distance and correlation measures.
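In code, the column standardization suggested above is a one-liner. The sketch below (Python, using the first small table) shows how it removes X4's dominance before row distances are computed.

import numpy as np

# Objects A-D on X1-X4 from the first table above.
data = np.array([[10.,  2., 11., 900.],
                 [11.,  3., 15., 895.],
                 [13.,  4., 12., 760.],
                 [14.,  1., 13., 874.]])

# Standardize within columns before clustering rows, so that X4 cannot
# dominate the distances simply because of its larger scale.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
print(z.round(2))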
Hierarchical Clustering
To understand hierarchical clustering, it's best to look at an example. The following
data reflect various attributes of selected performance cars.
ACCEL   BRAKE   SLALOM   MPG    SPEED   NAME$
 5.0     245     61.3    17.0    153    Porsche 911T
 5.3     242     61.9    12.0    181    Testarossa
 5.8     243     62.6    19.0    154    Corvette
 7.0     267     57.8    14.5    145    Mercedes 560
 7.6     271     59.8    21.0    124    Saab 9000
 7.9     259     61.7    19.0    130    Toyota Supra
 8.5     263     59.9    17.5    131    BMW 635
 8.7     287     64.2    35.0    115    Civic CRX
 9.3     258     64.1    24.5    129    Acura Legend
10.8     287     60.8    25.0    100    VW Fox GL
13.0     253     62.3    27.0     95    Chevy Nova
Cluster Displays
SYSTAT displays the output of hierarchical clustering in several ways. For joining
rows or columns, SYSTAT prints a tree. For matrix joining, it prints a shaded matrix.
Trees. A tree is printed with a unique ordering in which every branch is lined up such
that the most similar objects are closest to each other. If a perfect seriation (one-dimensional ordering) exists in the data, the tree reproduces it. The algorithm for
ordering the tree is given in Gruvaeus and Wainer (1972). This ordering may differ
from that of trees printed by other clustering programs if they do not use a seriation
algorithm to determine how to order branches. The advantage of using seriation is most
apparent for single linkage clusterings.
If you join rows, the end branches of the tree are labeled with case numbers or
labels. If you join columns, the end branches of the tree are labeled with variable
names.
Direct display of a matrix. As an alternative to trees, SYSTAT can produce a shaded
display of the original data matrix in which rows and columns are permuted according
to an algorithm in Gruvaeus and Wainer (1972). Different characters represent the
magnitude of each number in the matrix (Ling, 1973). A legend showing the range of
data values that these characters represent appears with the display.
Cutpoints between these values and their associated characters are selected to
heighten contrast in the display. The method for increasing contrast is derived from
techniques used in computer pattern recognition, in which gray-scale histograms for
visual displays are modified to heighten contrast and enhance pattern detection. To
find these cutpoints, we sort the data and look for the largest gaps between adjacent
values. Tukey's gapping method (Wainer and Schacht, 1978) is used to determine how
many gaps (and associated characters) should be chosen to heighten contrast for a
given set of data. This procedure, time consuming for large matrices, is described in
detail in Wilkinson (1978).
If you have a course to grade and are looking for a way to find rational cutpoints in
the grade distribution, you might want to use this display to choose the cutpoints.
Cluster the n × 1 matrix of numeric grades (n students by 1 grade) and let SYSTAT
choose the cutpoints. Only cutpoints asymptotically significant at the 0.05 level are
chosen. If no cutpoints are chosen in the display, give everyone an A, flunk them all,
or hand out numeric grades (unless you teach at Brown University or Hampshire
College).
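A rough sketch of the idea behind these cutpoints, omitting the significance test (this is not Tukey's gapping procedure itself, just the sort-and-gap step):

import numpy as np

def largest_gaps(values, k=2):
    # Return the k largest gaps between adjacent sorted values as
    # candidate cutpoints (the midpoints of those gaps).
    v = np.sort(np.asarray(values, dtype=float))
    gaps = np.diff(v)
    top = np.argsort(gaps)[::-1][:k]
    return sorted((v[i] + v[i + 1]) / 2 for i in top)

grades = [52, 55, 58, 71, 73, 74, 75, 88, 90, 93]
print(largest_gaps(grades, k=2))   # cutpoints near 64.5 and 81.5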
Clustering Rows
First, let's look at possible clusters of the cars in the example. Since the variables are
on such different scales, we will standardize them before doing the clustering. This will
give acceleration comparable influence to braking, for example. Then we select
Pearson correlations as the basis for dissimilarity between cars. The result is:
[Cluster tree of the cars, ordered Corvette, Porsche 911T, Testarossa, Toyota Supra, Acura Legend, Chevy Nova, Civic CRX, VW Fox GL, Saab 9000, Mercedes 560, BMW 635, with joining distances from 0.0 to about 0.6.]
If you look at the correlation matrix for the cars, you will see how these clusters hang
together. Cars within the same cluster (for example, Corvette, Testarossa, Porsche)
generally correlate highly.
[Pearson correlation matrix among the 11 cars (lower triangle).]
Clustering Columns
We can cluster the performance attributes of the cars more easily. Here, we do not need
to standardize within cars (by rows) because all of the values are comparable between
cars. Again, to give each variable comparable influence, we will use Pearson
correlations as the basis for the dissimilarities. The result based on the data
standardized by variable (column) is:
[Cluster tree of the variables BRAKE, MPG, ACCEL, SLALOM, and SPEED; distances range from 0.0 to about 1.2.]
[Shaded matrix display: cars (rows, ordered Testarossa, Porsche 911T, Corvette, Acura Legend, Toyota Supra, BMW 635, Saab 9000, Mercedes 560, VW Fox GL, Chevy Nova, Civic CRX) by the variables BRAKE, MPG, ACCEL, SLALOM, SPEED, with a shading scale from −2 to 3.]
This figure displays the standardized data matrix itself with rows and columns
permuted to reveal clustering and each data value replaced by one of three symbols.
Notice that the rows are ordered according to overall performance, with the fastest cars
at the top.
Matrix clustering is especially useful for displaying large correlation matrices. You
may want to cluster the correlation matrix this way and then use the ordering to
produce a scatterplot matrix that is organized by the multivariate structure.
K-Means Clustering
K-means clustering does not search through every possible partitioning of the data,
so it is possible that some other solution might have smaller within-groups sums of
squares. Nevertheless, it has performed relatively well on globular data separated in
several dimensions in Monte Carlo studies of cluster algorithms.
Because it focuses on reducing within-groups sums of squares, k-means clustering
is like a multivariate analysis of variance in which the groups are not known in
advance. The output includes analysis of variance statistics, although you should be
cautious in interpreting them. Remember, the program is looking for large F ratios in
the first place, so you should not be too impressed by large values.
Following is a three-group analysis of the car data. The clusters are similar to those
we found by joining. K-means clustering uses Euclidean distances instead of Pearson
correlations, so there are minor differences because of scaling. To keep the influences
of all variables comparable, we standardized the data before running the analysis.
Summary Statistics for 3 Clusters

Variable    Between SS   DF   Within SS   DF   F-Ratio    Prob
ACCEL            7.825    2       2.175    8    14.389   0.002
BRAKE            5.657    2       4.343    8     5.211   0.036
SLALOM           5.427    2       4.573    8     4.747   0.044
MPG              7.148    2       2.852    8    10.027   0.007
SPEED            7.677    2       2.323    8    13.220   0.003
-------------------------------------------------------------------------------
Cluster Number: 1
Members                      Statistics
Case           Distance   |  Variable   Minimum     Mean   Maximum   St.Dev.
Mercedes 560       0.60   |  ACCEL        -0.45    -0.14      0.17      0.23
Saab 9000          0.31   |  BRAKE        -0.15     0.23      0.61      0.28
Toyota Supra       0.49   |  SLALOM       -1.95    -0.89      0.11      0.73
BMW 635            0.16   |  MPG          -1.01    -0.47     -0.01      0.37
                          |  SPEED        -0.34     0.00      0.50      0.31
-------------------------------------------------------------------------------
Cluster Number: 2
Members                      Statistics
Case           Distance   |  Variable   Minimum     Mean   Maximum   St.Dev.
Civic CRX          0.81   |  ACCEL         0.26     0.99      2.05      0.69
Acura Legend       0.67   |  BRAKE        -0.53     0.62      1.62      1.00
VW Fox GL          0.71   |  SLALOM       -0.37     0.72      1.43      0.74
Chevy Nova         0.76   |  MPG           0.53     1.05      2.15      0.65
                          |  SPEED        -1.50    -0.91     -0.14      0.53
-------------------------------------------------------------------------------
Cluster Number: 3
Members                      Statistics
Case           Distance   |  Variable   Minimum     Mean   Maximum   St.Dev.
Porsche 911T       0.25   |  ACCEL        -1.29    -1.13     -0.95      0.14
Testarossa         0.43   |  BRAKE        -1.22    -1.14     -1.03      0.08
Corvette           0.31   |  SLALOM       -0.10     0.23      0.59      0.28
                          |  MPG          -1.40    -0.78     -0.32      0.45
                          |  SPEED         0.82     1.21      1.94      0.52
Additive Trees
Sattath and Tversky (1977) developed additive trees for modeling
similarity/dissimilarity data. Hierarchical clustering methods require objects in the
same cluster to have identical distances to each other. Moreover, these distances must
be smaller than the distances between clusters. These restrictions prove problematic for
similarity data, and as a result hierarchical clustering cannot fit this data well.
In contrast, additive trees use tree branch length to represent distances between
objects. Allowing the within-cluster distances to vary yields a tree diagram with
varying branch lengths. Objects within a cluster can be compared by focusing on the
horizontal distance along the branches connecting them. The additive tree for the car
data follows:
[Additive tree of the cars, ordered Chevy, VW, Civic, Acura, Toyota, Corv, Testa, Porsche, Saab, BMW, Merc.]
Node   Length    Child
  1    23.3958   (see note below)
  1     0.10     Porsche
  2     0.49     Testa
  3     0.14     Corv
  4     0.52     Merc
  5     0.19     Saab
  6     0.13     Toyota
  7     0.11     BMW
  8     0.71     Civic
  9     0.30     Acura
 10     0.42     VW
 11     0.62     Chevy
 12     0.06     1,2
 13     0.08     8,10
 14     0.49     12,3
 15     0.18     13,11
 16     0.35     9,15
 17     0.04     14,6
 18     0.13     17,16
 19     0.0      5,18
 20     0.04     4,7
 21     0.0      20,19
Each object is a node in the graph. In this example, the first 11 nodes represent the cars.
Other graph nodes correspond to groupings of the objects. Here, the 12th node
represents Porsche and Testa.
The distance between any two nodes is the sum of the (horizontal) lengths between
them. The distance between Chevy and VW is 0.62 + 0.08 + 0.42 = 1.12 . The
distance between Chevy and Civic is 0.62 + 0.08 + 0.71 = 1.41 . Consequently,
Chevy is more similar to VW than to Civic.
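The same arithmetic can be written in code. The Python fragment below includes only the branches involved in the two quoted distances, with made-up names for internal nodes 13 and 15 of the table above; the distance between two objects is simply the sum of branch lengths up to their common ancestor.

# Branch lengths taken from the node table: Chevy 0.62, VW 0.42, Civic 0.71,
# node 13 (Civic, VW) 0.08, node 15 0.18; "root" simply terminates the walk.
parent = {"Chevy": ("node15", 0.62), "VW": ("node13", 0.42),
          "Civic": ("node13", 0.71), "node13": ("node15", 0.08),
          "node15": ("root", 0.18)}

def path_to_root(leaf):
    # (node, branch length) pairs from a leaf up to the root.
    out = []
    while leaf in parent:
        up, length = parent[leaf]
        out.append((leaf, length))
        leaf = up
    return out

def tree_distance(a, b):
    # Sum of branch lengths on the path between two objects.
    pa, pb = dict(path_to_root(a)), dict(path_to_root(b))
    shared = pa.keys() & pb.keys()        # branches at or above the common ancestor
    return sum(l for n, l in list(pa.items()) + list(pb.items()) if n not in shared)

print(round(tree_distance("Chevy", "VW"), 2))     # 0.62 + 0.08 + 0.42 = 1.12
print(round(tree_distance("Chevy", "Civic"), 2))  # 0.62 + 0.08 + 0.71 = 1.41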
You must select the elements of the data file to cluster (Join):
n Rows. Rows (cases) of the data matrix are clustered.
n Columns. Columns (variables) of the data matrix are clustered.
n Matrix. Rows and columns of the data matrix are clustered; they are permuted to reveal clustering in a shaded display of the data matrix.
n Single. Single linkage defines the distance between two objects or clusters as the
distance between the two closest members of those clusters. This method tends to
produce long, stringy clusters. If you use a SYSTAT file that contains a similarity
or dissimilarity matrix, you get clustering via Johnson's min method.
n Complete. Complete linkage uses the most distant pair of objects in two clusters to
compute the distance between those clusters.
n Centroid. Centroid linkage uses the average value of all objects in a cluster (the
cluster centroid) as the reference point for distances to other objects or clusters.
n Average. Average linkage averages all distances between pairs of objects in
clusters, with adjustments for covariances, to decide how far apart the clusters are.
For some data, the last four methods cannot produce a hierarchical tree with strictly
increasing amalgamation distances. In these cases, you may see stray branches that do
not connect to others. If this happens, you should consider Single or Complete linkage.
For more information on these problems, see Fisher and Van Ness (1971). These
reviewers concluded that these and other problems made Centroid, Average, Median,
and Ward (as well as k-means) inadmissible clustering procedures. In practice and
in Monte Carlo simulations, however, they sometimes perform better than Single and
Complete linkage, which Fisher and Van Ness considered admissible. Milligan
(1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation
of clustering algorithms. Consult his paper for further details.
In addition, the following options can be specified:
Distance. Specifies the distance metric used to compare clusters.
Polar. Produces a polar (circular) cluster tree.
Save cluster identifier variable. Saves cluster identifiers to a SYSTAT file. You can
specify the number of clusters to identify for the saved file. If not specified, two
clusters are identified.
Clustering Distances
Both hierarchical clustering and k-means clustering allow you to select the type of
distance metric to use between objects. From the Distance drop-down list, you can
select one of the following (a small sketch of the Pearson choice appears after the list):
n Gamma. Distances are computed using 1 minus the Goodman-Kruskal gamma
correlation coefficient. Use this metric with rank order or ordinal scales. Missing
values are excluded from computations.
n Pearson. Distances are computed using 1 minus the Pearson product-moment
correlation coefficient for each pair of objects. Use this metric for quantitative
variables. Missing values are excluded from computations.
n RSquared. Distances are computed using 1 minus the square of the Pearson
product-moment correlation coefficient for each pair of objects. Use this metric
with quantitative variables. Missing values are excluded from computations.
n Euclidean. Clustering is computed using normalized Euclidean distance (root mean
squared distances). Use this metric with quantitative variables. Missing values are
excluded from computations.
n Minkowski. Clustering is computed using the pth root of the mean pth powered
distances of coordinates. Use this metric for quantitative variables. Missing values
are excluded from computations. Use the Power text box to specify the value of p.
n Chisquare. Distances are computed as the chi-square measure of independence of
rows and columns on 2-by-n frequency tables, formed by pairs of cases (or
variables). Use this metric when the data are counts of objects or events.
n Phisquare. Distances are computed as the phi-square (chi-square/total) measure on
2-by-n frequency tables, formed by pairs of cases (or variables). Use this metric
when the data are counts of objects or events.
n Percent (available for hierarchical clustering only). Clustering uses a distance
based on the increment in the within sum of squares of deviations if the case (or
variable) were to belong to a cluster. The case (or variable) is moved into the
cluster that minimizes the within sum of squares of deviations. Use this metric
with quantitative variables.
Missing values are excluded from computations.
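As promised above, here is a small Python sketch (scipy assumed; it mirrors the JOIN examples rather than SYSTAT's internal code) of the most commonly used choice, 1 minus the Pearson correlation between cases, fed to a standard single-linkage joining routine.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
cases = rng.normal(size=(6, 4))          # 6 cases measured on 4 variables

# 1 - Pearson correlation between every pair of cases (rows).
dist = 1.0 - np.corrcoef(cases)
np.fill_diagonal(dist, 0.0)

# Single-linkage joining on the condensed distance matrix.
print(linkage(squareform(dist, checks=False), method="single"))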
Using Commands
For the hierarchical tree method:
CLUSTER
USE filename
IDVAR var$
PRINT
SAVE filename / NUMBER=n DATA
JOIN varlist / POLAR DISTANCE=metric
LINKAGE=method
POWER=p
Usage Considerations
Types of data. Hierarchical Clustering works on either rectangular SYSTAT files or
files containing a symmetric matrix, such as those produced with Correlations. KMeans works only on rectangular SYSTAT files. Additive Trees works only on
symmetric (similarity or dissimilarity) matrices.
Print options. Using PRINT=LONG for Hierarchical Clustering yields an ASCII
representation of the tree diagram (instead of the Quick Graph). This option is useful
if you are joining more than 100 objects.
Quick Graphs. Cluster analysis includes Quick Graphs for each procedure. Hierarchical
Clustering and Additive Trees have tree diagrams. For each cluster, K-Means displays
a profile plot of the data and a display of the variable means and standard deviations.
To omit Quick Graphs, specify GRAPH NONE.
Saving files. CLUSTER saves cluster indices as a new variable.
BY groups. CLUSTER analyzes data by groups.
Bootstrapping. Bootstrapping is available in this procedure.
Labeling output. For Hierarchical Clustering and K-Means, be sure to consider using ID
Variable (on the Data menu) for labeling the output.
Examples
Example 1
K-Means Clustering
The data in the file SUBWORLD are a subset of cases and variables from the
OURWORLD file:
URBAN
BIRTH_RT
DEATH_RT
B_TO_D
BABYMORT
GDP_CAP
LIFEEXPM
LIFEEXPF
EDUC
HEALTH
MIL
LITERACY
The distributions of the economic variables (GDP_CAP, EDUC, HEALTH, and MIL)
are skewed with long right tails, so these variables are analyzed in log units.
This example clusters countries (cases). The input is:
CLUSTER
USE subworld
IDVAR = country$
LET (gdp_cap, educ, mil, health) = L10(@)
STANDARDIZE / SD
KMEANS urban birth_rt death_rt babymort lifeexpm,
lifeexpf gdp_cap b_to_d literacy educ,
mil health / NUMBER=4
F-ratio: 16.5065  81.2260  38.4221  75.8869  114.4464  60.5932  67.4881  64.2893  50.4730  73.1215  51.9470  28.7997
[Quick Graphs for the four clusters: for each cluster, a profile display of the standardized variables (ordered GDP_CAP, BIRTH_RT, BABYMORT, LIFEEXPF, HEALTH, MIL, EDUC, LITERACY, LIFEEXPM, DEATH_RT, B_TO_D, URBAN) plotted against Index of Case.]
For each variable, cluster analysis compares the between-cluster mean square (Between
SS/df) to the within-cluster mean square (Within SS/df) and reports the F-ratio. However,
do not use these F ratios to test significance because the clusters are formed to characterize
differences. Instead, use these statistics to characterize relative discrimination. For
example, the log of gross domestic product (GDP_CAP) and BIRTH_RT are better
discriminators between countries than URBAN or DEATH_RT. For a good graphical view
of the separation of the clusters, you might rotate the data using the three variables with
the highest F ratios.
Following the summary statistics, for each cluster, cluster analysis prints the
distance from each case (country) in the cluster to the center of the cluster. Descriptive
statistics for these countries appear on the right. For the first cluster, the standard scores
for LITERACY range from 0.54 to 0.75 with an average of 0.72. B_TO_D ranges from
−1.09 to −0.46. Thus, for these predominantly European countries, literacy is well
above the average for the sample and the birth-to-death ratio is below average. In
cluster 2, LITERACY ranges from −2.27 to −0.76 for these five countries, and B_TO_D
ranges from 0.38 to 0.25. Thus, the countries in cluster 2 have a lower literacy rate
and a greater potential for population growth than those in cluster 1. The fourth cluster
(Iraq and Libya) has an average birth-to-death ratio of 2.01, the highest among the four
clusters.
Cluster Profiles
The variables in cluster profile plots are ordered by the F ratios. The vertical line under
each cluster number indicates the grand mean across all data. A variable mean within
each cluster is marked by a dot. The horizontal lines indicate one standard deviation
above or below the mean. The countries in cluster 1 have above average means of gross
domestic product, life expectancy, literacy, and urbanization, and spend considerable
money on health care and the military, while the means of their birth rates, infant
mortality rates, and birth-to-death ratios are low. The opposite is true for cluster 2.
Example 2
Hierarchical Clustering: Clustering Cases
This example uses the SUBWORLD data (see the k-means example for a description)
to cluster cases. The input is:
CLUSTER
USE subworld
IDVAR = country$
LET (gdp_cap, educ, mil, health) = L10(@)
STANDARDIZE / SD
JOIN urban birth_rt death_rt babymort lifeexpm,
lifeexpf gdp_cap b_to_d literacy educ mil health
Cluster        Were joined    No. of members
containing     at distance    in new cluster
-----------    -----------    --------------
Belgium             0.0869                 2
Denmark             0.1109                 3
UK                  0.1127                 4
WGermany            0.1275                 5
Sweden              0.1606                 6
France              0.1936                 7
Italy               0.1943                 8
Canada              0.2112                 9
Argentina           0.2154                 2
Austria             0.2364                10
Poland              0.2411                 2
Czechoslov          0.2595                12
ElSalvador          0.3152                 2
Ecuador             0.3155                 3
Chile               0.3704                 3
Uruguay             0.3739                 4
Somalia             0.3974                 2
Cuba                0.4030                16
Brazil              0.4172                 4
Guatemala           0.4210                 5
Peru                0.4433                 6
Haiti               0.4743                 3
Colombia            0.5160                 7
Panama              0.5560                23
Iraq                0.5704                 2
Guinea              0.5832                 2
Afghanistan         0.5969                 5
Libya               0.8602                25
Ethiopia            0.9080                30
The numerical results consist of the joining history. The countries at the top of the
panel are joined first at a distance of 0.087. The last entry represents the joining of the
largest two clusters to form one cluster of all 30 countries. Switzerland is in one of the
clusters and Ethiopia is in the other.
The clusters are best illustrated using a tree diagram. Because the example joins
rows (cases) and uses COUNTRY as an ID variable, the branches of the tree are labeled
with countries. If you join columns (variables), then variable names are used. The scale
for the joining distances is printed at the bottom. Notice that Iraq and Libya, which
form their own cluster as they did in the k-means example, are the second-to-last
cluster to link with others. They join with all the countries listed above them at a
distance of 0.583. Finally, at a distance of 0.908, the five countries at the bottom of the
display are added to form one large cluster.
Polar Dendrogram
Adding the POLAR option to JOIN yields a polar dendrogram.
Example 3
Hierarchical Clustering: Clustering Variables
This example joins columns (variables) instead of rows (cases) to see which variables
cluster together. The input is:
CLUSTER
USE subworld
IDVAR = country$
LET (gdp_cap, educ, mil, health) = L10(@)
STANDARDIZE / SD
JOIN urban birth_rt death_rt babymort lifeexpm,
     lifeexpf gdp_cap b_to_d literacy educ,
     mil health / COLUMNS
Cluster        Were joined    No. of members
containing     at distance    in new cluster
-----------    -----------    --------------
LIFEEXPM            0.1444                 2
GDP_CAP             0.2390                 2
HEALTH              0.2858                 3
LITERACY            0.3789                 3
BIRTH_RT            0.3859                 2
LIFEEXPF            0.4438                 6
EDUC                0.4744                 7
URBAN               0.5414                 8
BABYMORT            0.8320                 3
DEATH_RT            0.8396                 4
B_TO_D              1.5377                12
The scale at the bottom of the tree for the distance (1 − r) ranges from 0.0 to 1.5. The
smallest distance is 0.011; thus, the correlation of LIFEEXPM with LIFEEXPF is
0.989.
Example 4
Hierarchical Clustering: Clustering Variables and Cases
To produce a shaded display of the original data matrix in which rows and columns are
permuted according to an algorithm in Gruvaeus and Wainer (1972), use the MATRIX
option. Different shadings or colors represent the magnitude of each number in the
matrix (Ling, 1973).
If you use the MATRIX option with Euclidean distance, be sure that the variables are
on comparable scales because both rows and columns of the matrix are clustered.
Joining a matrix containing inches of annual rainfall and annual growth of trees in feet,
for example, would split columns more by scales than by covariation. In cases like this,
you should standardize your data before joining.
The input is:
CLUSTER
USE subworld
IDVAR = country$
LET (gdp_cap, educ, mil, health) = L10(@)
STANDARDIZE / SD
JOIN urban birth_rt death_rt babymort lifeexpm,
lifeexpf gdp_cap b_to_d literacy educ,
mil health / MATRIX
[Shaded matrix display of the standardized data, with a shading scale from −3 to 4.]
This clustering reveals three groups of countries and two groups of variables. The
countries with more urban dwellers and literate citizens, longest life-expectancies,
highest gross domestic product, and most expenditures on health care, education, and
the military are on the top left of the data matrix; countries with the highest rates of
death, infant mortality, birth, and population growth (see B_TO_D) are on the lower
right. You can also see that, consistent with the k-means and join examples, Iraq and
Libya spend much more on military, education, and health than their immediate
neighbors.
Example 5
Hierarchical Clustering: Distance Matrix Input
This example clusters a matrix of distances. The data, stored as a dissimilarity matrix
in the CITIES data file, are airline distances in hundreds of miles between 10 global
cities. The data are adapted from Hartigan (1975).
The input is:
CLUSTER
USE cities
JOIN berlin bombay capetown chicago london,
montreal newyork paris sanfran seattle
Cluster        Were joined    No. of members
containing     at distance    in new cluster
-----------    -----------    --------------
LONDON              2.0000                 2
MONTREAL            3.0000                 2
PARIS               5.0000                 3
NEWYORK             7.0000                 3
SANFRAN             7.0000                 2
CHICAGO            17.0000                 5
SEATTLE            33.0000                 8
BERLIN             39.0000                 9
CAPETOWN           51.0000                10
The tree is printed in seriation order. Imagine a trip around the globe to these cities.
SYSTAT has identified the shortest path between cities. The itinerary begins at San
Francisco, leads to Seattle, Chicago, New York, and so on, and ends in Capetown.
Note that the CITIES data file contains the distances between the cities; SYSTAT
did not have to compute those distances. When you save the file, be sure to save it as
a dissimilarity matrix.
This example is used both to illustrate direct distance input and to give you an idea
of the kind of information contained in the order of the SYSTAT cluster tree. For
distance data, the seriation reveals shortest paths; for typical sample data, the seriation
is more likely to replicate in new samples so that you can recognize cluster structure.
Example 6
Additive Trees
This example uses the ROTHKOPF data file. The input is:
CLUSTER
USE rothkopf
ADD a .. z
[Output: four fit statistics (0.0609, 0.3985, 0.8412, 0.7880).]
Node   Length    Child
  1    23.3958   A
  2    15.3958   B
  3    14.8125   C
  4    13.3125   D
  5    24.1250   E
  6    34.8370   F
  7    15.9167   G
  8    27.8750   H
  9    25.6042   I
 10    19.8333   J
 11    13.6875   K
 12    28.6196   L
 13    21.8125   M
 14    22.1875   N
 15    19.0833   O
 16    14.1667   P
 17    18.9583   Q
 18    21.4375   R
 19    28.0000   S
 20    23.8750   T
 21    23.0000   U
 22    27.1250   V
 23    21.5625   W
 24    14.6042   X
 25    17.1875   Y
 26    18.0417   Z
 27    16.9432   1, 9
 28    15.3804   2, 24
 29    15.7159   3, 25
 30    19.5833   4, 11
 31    26.0625   5, 20
 32    23.8426   7, 15
 33     6.1136   8, 22
 34    17.1750   10, 16
 35    18.8068   13, 14
 36    13.7841   17, 26
 37    15.6630   18, 23
 38     8.8864   19, 21
 39     4.5625   27, 35
 40     1.7000   29, 36
 41     8.7995   33, 38
 42     4.1797   39, 31
 43     1.1232   12, 28
 44     5.0491   34, 40
 45     2.4670   42, 41
 46     4.5849   30, 43
 47     2.6155   32, 44
 48     2.7303   6, 37
 49     0.0      45, 48
 50     3.8645   46, 47
 51     0.0      50, 49
(SYSTAT also displays the raw data, as well as the model distances.)
[Additive tree for the 26 signals, ordered W, R, F, U, S, V, H, T, E, N, M, I, A, Z, Q, Y, C, P, J, O, G, X, B, L, K, D.]
Computation
Algorithms
JOIN follows the standard hierarchical amalgamation method described in Hartigan
(1975). The algorithm in Gruvaeus and Wainer (1972) is used to order the tree.
KMEANS follows the algorithm described in Hartigan (1975). Modifications from
Hartigan and Wong (1979) improve speed. There is an important difference between
SYSTAT's KMEANS algorithm and that of Hartigan (or implementations of Hartigan's
in BMDP, SAS, and SPSS). In SYSTAT, seeds for new clusters are chosen by finding
the case farthest from the centroid of its cluster. In Hartigan's algorithm, seeds forming
new clusters are chosen by splitting on the variable with largest variance.
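The seed rule described for SYSTAT is easy to state in code. The sketch below (plain numpy, not the KMEANS source) picks, as the seed for a new cluster, the case farthest from the centroid of its own cluster; Hartigan's rule would instead split on the variable with the largest variance.

import numpy as np

def next_seed(data, labels):
    # Index of the case farthest from the centroid of its own cluster.
    data, labels = np.asarray(data, dtype=float), np.asarray(labels)
    dist = np.empty(len(data))
    for g in np.unique(labels):
        members = labels == g
        centroid = data[members].mean(axis=0)
        dist[members] = np.linalg.norm(data[members] - centroid, axis=1)
    return int(np.argmax(dist))

data = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8], [9.0, 0.5]]
labels = [0, 0, 1, 1, 1]
print(next_seed(data, labels))   # case 4 lies farthest from its cluster centroid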
Missing Data
In cluster analysis, all distances are computed with pairwise deletion of missing values.
Since missing data are excluded from distance calculations by pairwise deletion, they
do not directly influence clustering when you use the MATRIX option for JOIN. To use
the MATRIX display to analyze patterns of missing data, create a new file in which
missing values are recoded to 1, and all other values, to 0. Then use JOIN with MATRIX
to see whether missing values cluster together in a systematic pattern.
References
Campbell, D. T. and Fiske, D. W. (1959). Convergent and discriminant validation by the
multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Fisher, L. and Van Ness, J. W. (1971). Admissible clustering procedures. Biometrika, 58,
91-104.
Gower, J. C. (1967). A comparison of some methods of cluster analysis. Biometrics, 23,
623-637.
Gruvaeus, G. and Wainer, H. (1972). Two additions to hierarchical cluster analysis. The
British Journal of Mathematical and Statistical Psychology, 25, 200-206.
Guttman, L. (1944). A basis for scaling qualitative data. American Sociological Review,
139-150.
Hartigan, J. A. (1975). Clustering algorithms. New York: John Wiley & Sons, Inc.
Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241-254.
Ling, R. F. (1973). A computer generated aid for cluster analysis. Communications of the
ACM, 16, 355-361.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate
observations. 5th Berkeley symposium on mathematics, statistics, and probability, Vol.
1, 281-298.
McQuitty, L. L. (1960). Hierarchical syndrome analysis. Educational and Psychological
Measurement, 20, 293-303.
Milligan, G. W. (1980). An examination of the effects of six types of error perturbation on
fifteen clustering algorithms. Psychometrika, 45, 325-342.
Sattath, S. and Tversky, A. (1977). Additive similarity trees. Psychometrika, 42, 319-345.
Sokal, R. R. and Michener, C. D. (1958). A statistical method for evaluating systematic
relationships. University of Kansas Science Bulletin, 38, 1409-1438.
Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. San Francisco:
W. H. Freeman and Company.
Wainer, H. and Schacht, S. (1978). Gapping. Psychometrika, 43, 203-212.
Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association, 58, 236-244.
Wilkinson, L. (1978). Permuting a matrix to a simple structure. Proceedings of the
American Statistical Association.
Chapter
5
Conjoint Analysis
Leland Wilkinson
Conjoint analysis fits metric and nonmetric conjoint measurement models to observed
data. It is designed to be a general additive model program using a simple
optimization procedure. As such, conjoint analysis can handle measurement models
not normally amenable to other specialized conjoint programs.
Statistical Background
Conjoint measurement (Luce and Tukey, 1964; Krantz, 1964; Luce, 1966; Tversky,
1967; Krantz and Tversky, 1971) is an axiomatic theory of measurement that defines
the conditions under which there exist measurement scales for two or more variables
that jointly define a common scale under an additive composition rule. This theory
became the basis for a group of related numerical techniques for fitting additive
models, called conjoint analysis (Green and Rao, 1971; Green, Carmone, and Wind,
1972; Green and DeSarbo, 1978; Green and Srinivasan, 1978, 1990; Louviere, 1988,
1994). For an interesting historical comment on Sir Ronald Fisher's appropriate
scores method for fitting additive models, see Heiser and Meulman (1995).
To see how conjoint analysis is based on additive models, we'll first graph an
additive table and then examine a multiplicative table to encounter one example of a
non-additive table. Then we'll consider the problem of computing margins of a
general table based on an additive model.
Additive Tables
The following is an additive table. Notice that any cell (in roman) is the sum of the
corresponding row and column marginal values (in italic).
[Additive table with row and column margins.]
A common way to represent a two-way table like this is with a graph. I made a file
(PCONJ.SYD) containing all possible ordered pairs of the row and column indices. Then
I formed Y values by adding the indices:
USE PCONJ
LET Y=A+B
LINE Y*A/GROUP=B,OVERLAY
The following graph of the additive table shows a plot of Y (the values in the cells)
against A (rows) stratified by B (columns) in the legend. Notice that the lines are
parallel.
The following contour plot of the additive table shows the result. Notice that the lines
in the contour plot are parallel for additive tables. Furthermore, although I used a
quadratic smoother, the contours are linear because I used a simple linear combination
of A and B to make Y.
[Contour plot of Y over A and B; the contour lines are parallel.]
Multiplicative Tables
Following is a multiplicative table. Notice that any cell is the product of the
corresponding marginal values. We commonly encounter these tables in cookbooks
(for sizing recipes) or in, well, multiplication tables. These tables are one instance of
two-way tables that are not additive.
[Multiplicative table with row and column margins.]
And the following figure shows the contour plot for the multiplicative model. Notice,
again, that the contours are not parallel.
Multiplicative tables and graphs may be pleasing to look at, but they're not simple. We
all learned to add before multiplying. Scientists often simplify multiplicative functions
by logging them, since logs of products are sums of logs. This is also one of the reasons
we are told to be suspicious of fan-fold interactions (as in the line graph of the
multiplicative table) in the analysis of variance. If we can log the variables and remove
them (usually improving the residuals in the process), we should do so because it
leaves us with a simple linear model.
        b1     b2     b3
a4    1.38   2.07   2.48
a3    1.10   1.79   2.20
a2     .69   1.38   1.79
a1     .00    .69   1.10
The following figure shows a solution. The values for a are a1 = 0.00, a2 = 0.69,
a3 = 1.10, and a4 = 1.38. The values for b are b1 = 0.00, b2 = 0.69, and
b3 = 1.10.
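A quick numeric check of the additivity (Python; the a and b values are just the natural logs of row and column margins 1 through 4 and 1 through 3):

import numpy as np

a = np.log([1, 2, 3, 4])      # 0.00, 0.69, 1.10, 1.39 (a1..a4)
b = np.log([1, 2, 3])         # 0.00, 0.69, 1.10 (b1..b3)

# Every cell of the logged multiplicative table is just a row value plus
# a column value, so the table above is reproduced (up to rounding).
print((a[:, None] + b[None, :]).round(2))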
Finally, a commercial industry supplying the practical tools for conjoint studies has
produced a variety of software packages. Oppewal (1995) reviews some of these. In
many cases, more efforts are devoted to card decks and other stimulus materials
management than to the actual analysis of the models. CONJOINT in SYSTAT
represents the opposite end of the spectrum from these approaches. CONJOINT
presents methods for fitting these models that are inspired more by Luce and Tukey's
and Green and Rao's original theoretical formulations than by the practical
requirements of data collection. The primary goal of SYSTAT CONJOINT is to provide
tools for scaling small- to moderate-sized data sets in which additive models can
simplify the presentation of data. Metric and nonmetric loss functions are available for
exploring the effects of nonlinearity on scaling. The examples highlight this
distinction.
Save file. Saves parameter estimates into filename.SYD.
Using Commands
To request a conjoint analysis:
CONJOINT
MODEL depvarlist = indvarlist
ESTIMATE / ITERATIONS=n CONVERGENCE=d ,
        LOSS = STRESS | TAU ,
        REGRESSION = MONOTONIC | LINEAR | LOG | POWER ,
        POLARITY = POSITIVE | NEGATIVE
Usage Considerations
Types of data. CONJOINT uses rectangular data only.
Print options. The output is standard for all print options.
Quick Graphs. Quick Graphs produced by CONJOINT are utility functions for each
predictor variable in the model.
Saving files. CONJOINT saves parameter estimates as one case into a file if you precede
ESTIMATE with SAVE.
BY groups. CONJOINT analyzes data by groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. FREQ=<variable> increases the number of cases by the FREQ
variable.
Case weights. WEIGHT is not available in CONJOINT.
Examples
Example 1
Choice Data
The classical application of conjoint analysis is to product choice. The following
example from Green and Rao (1971) shows how to fit a nonmetric conjoint model to
some typical choice data. The input is:
CONJOINT
USE BRANDS
MODEL RESPONSE=DESIGN$..GUARANT$
ESTIMATE / POLARITY=NEGATIVE
Iteration        Loss
     1        0.5389079
     2        0.4476390
     3        0.3170808
     4        0.1746641
     5        0.1285278
     6        0.1050734
     7        0.0877708
     8        0.0591691
     9        0.0407008
    10        0.0166571
    11        0.0101404
    12        0.0058237
    13        0.0013594
    14        0.0006314
    15        0.0001157
    16        0.0000065
    17        0.0000000
    18        0.0000000
    19        0.0000000
    20        0.0000000

[Parameter estimates (fragment): Bissell −0.122, NO −0.131, Glory −0.226, YES −0.102, K2R −0.195, NO −0.039.]
[Shepard diagram (Joint Score against Data) and utility-function Quick Graphs (Measure) for PRICE (1.19, 1.39, 1.59), BRAND$ (Bissell, Glory, K2R), DESIGN$, SEAL$ (NO, YES), and GUARANT$ (NO, YES).]
The fitting method chosen for this example is the default nonmetric loss using
Kruskal's STRESS statistic. This is the same method used in the MONANOVA
program (Kruskal and Carmone, 1969). Although the minimization algorithm differs
from that program, the result should be comparable.
The iterations converged to a perfect fit (LOSS = 0). That is, there exists a set of
parameter estimates such that their sums fit the observed data perfectly when Kendall's
tau-b is used to measure fit. This rarely occurs with real data.
The parameter estimates are scaled to have zero sum and unit sum of squares. There
is a single goodness-of-fit value for this example because there is one response.
The root-mean-square deleted goodness-of-fit values are the goodness of fit when
each respective parameter is set to zero. This serves as an informal test of sensitivity.
The lowest value for this example is for the B parameter, indicating that the estimate
for B cannot be changed without substantially affecting the overall goodness of fit.
The Shepard diagram displays the goodness of fit in a scatterplot. The Data axis
represents the observed data values. The Joint Score axis represents the values of the
combined parameter estimates. For example, if we have parameters a1, a2, a3 and b1,
b2, then every case measured on, say, a2 and b1 will be represented by a point in the
plot whose ordinate (y value) is a2 + b1. This example involves only one condition per
card or case, so that the Shepard diagram has no duplicate values on the y axis.
Conjoint analysis can easily handle duplicate measurements either with multiple
dependent variables (multiple subjects exposed to common stimuli) or with duplicate
values for the same subject (replications).
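The joint scores themselves are simple to compute by hand. The following Python sketch is only
an illustration of the idea; the parameter values, the cards, and the observed ranks are hypothetical,
and scipy's kendalltau stands in for CONJOINT's internal fit computation.

import numpy as np
from scipy.stats import kendalltau

# hypothetical additive parameter estimates for two factors
a = {"a1": 0.00, "a2": 0.69, "a3": 1.10}
b = {"b1": 0.00, "b2": 0.69}

# each card is one combination of levels; its joint score is the sum a + b
cards = [("a1", "b1"), ("a2", "b1"), ("a2", "b2"), ("a3", "b2")]
joint = np.array([a[i] + b[j] for i, j in cards])

observed = np.array([1, 2, 3, 4])        # hypothetical observed preference scores

tau, _ = kendalltau(observed, joint)     # Kendall's tau-b between data and joint scores
print(tau)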
The fitted jagged line is the best fitting monotonic regression of these fitted values
on the observed data. For a similar diagram, see the Multidimensional Scaling chapter
in SYSTAT 10 Statistics II. And note carefully the warnings about degenerate
solutions and other problems.
You may want to try this example with REGRESSION = LINEAR to see how the
results compare. The linear fit yields an almost perfect Pearson correlation. This also
means that GLM (MGLH) can produce nearly the same estimates:
GLM
MODEL RESPONSE = CONSTANT + DESIGN$..GUARANT$
CATEGORY DESIGN$..GUARANT$
PRINT LONG
ESTIMATE
The PRINT LONG statement causes GLM to print the least-squares estimates of the
marginal means that, for an additive model, are the parameters we seek. The GLM
parameter estimates will differ from the ones printed here only by a constant and
scaling parameter. Conjoint analysis always scales parameter estimates to have zero
sum and unit sum of squares. This way, they can be thought of as utilities over the
experimental domain, some negative and some positive.
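The scaling itself is a two-step normalization. A minimal sketch (the raw estimates are made up
for illustration):

import numpy as np

raw = np.array([11.2, 13.0, 14.8, 9.5])   # hypothetical marginal-mean estimates

centered = raw - raw.mean()                             # zero sum
utilities = centered / np.sqrt((centered ** 2).sum())   # unit sum of squares

print(round(utilities.sum(), 6), round((utilities ** 2).sum(), 6))   # 0 and 1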
Example 2
Word Frequency
The data set WORDS contains the most frequently used words in American English
(Carroll et al., 1971). Three measures have been added to the data. The first is the (most
likely) part of speech (PART$). The second is the number of letters (LETTERS) in the
word. The third is a measure of the meaning (MEANING$). This admittedly informal
measure represents the amount of harm done to comprehension (1 = a little, 4 = a lot)
by omitting the word from a sentence. While linguists may argue over these
classifications, they do reveal basic differences. Instead of using a measure of
frequency, we will work with the rank order itself to see if there is enough information
to fit a model. This time, we will maximize Kendall's tau-b directly.
Following is the input:
USE WORDS
CONJOINT
LET RANK=CASE
MODEL RANK = LETTERS PART$ MEANING
ESTIMATE / LOSS=TAU,POLARITY=NEGATIVE
Maximum Iterations: 50

Iteration        Loss        Max parameter change
     1        0.2042177         0.0955367
     2        0.1988071         0.0911670
     3        0.1897893         0.0708985
     4        0.1861822         0.0308284
     5        0.1843787         0.0259976
     6        0.1825751         0.0131758
     7        0.1825751         0.0000175
     8        0.1825751         0.0000000

    LETTERS(4)    -0.270
    verb          -0.162
    adjective     -0.119
    MEANING(1)     0.749
    adverb        -0.273
    MEANING(2)    -0.121
(Quick Graphs: a Shepard diagram of Joint Score against Data, and utility plots of Measure against LETTERS, MEANING, and PART$; the PART$ axis shows Adjective, Adverb, Conjunction, Noun, Preposition, Pronoun, and Verb.)
The Shepard diagram reveals a slightly curvilinear relationship between the data and
the fitted values. We can parameterize that relationship by refitting the model as
follows:
ESTIMATE / REGRESSION=POWER,POLARITY=NEGATIVE
SYSTAT will then print Computed Exponent: 1.392. We will further examine this type
of power function in the Box-Cox example.
The output tells us that, in general, shorter words are higher on the list, adverbs are
lower, and prepositions are higher. Also, the most frequently occurring words are
generally the most disposable. These statements must be made in the context of the
model, however. To the extent that the separate statements are inaccurate when the
data are examined separately for each, the additive model is violated. This is another
way of saying that the additive model is appropriate when there are no interactions or
configural effects. Incidentally, when these data are analyzed with GLM using the
(inverse transformed) word frequencies themselves rather than rank order in the list,
the conclusions are substantially the same.
Example 3
Box-Cox Model
Box and Cox (1964) devised a maximum likelihood estimator for the exponent in the
following model:
    E{ y^(λ) } = Xβ

where X is a matrix of known values, β is a vector of unknown parameters associated
with the transformed observations, and the residuals of the model are assumed to be
normally distributed and independent. The transformation itself is assumed to take the
following form:

    y^(λ) = (y^λ − 1) / λ     (λ ≠ 0)
    y^(λ) = log(y)            (λ = 0)
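A small Python sketch of the transformation itself may be helpful; it illustrates only the formula
above, not the Box-Cox likelihood machinery, although scipy.stats.boxcox will also return a
maximum likelihood estimate of lambda if you do not supply one. The data values are hypothetical.

import numpy as np
from scipy import stats

def boxcox_transform(y, lam):
    # Box-Cox transformation of a positive response y for a fixed lambda
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

y = np.array([0.31, 0.45, 0.82, 1.10, 0.88])   # hypothetical survival times
print(boxcox_transform(y, -0.75))              # transform at a chosen lambda

transformed, lam_hat = stats.boxcox(y)         # scipy's own ML estimate of lambda
print(lam_hat)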
This program produces an estimate of -0.750 for lambda, with a 95% Wald confidence
interval of (-1.181, -0.319). This is in agreement with the results in the original paper.
Box and Cox recommend rounding the exponent to -1 because of its natural
interpretation (rate of dying from poison). In general, it is wise to round such
transformations to interpretable values such as ..., -1, -0.5, 0, 0.5, 2, ... to facilitate the
interpretation of results.
The Box-Cox procedure is based on a specific model that assumes normality in the
transformed data and that focuses on the dependent variable. We might ask whether it
is worthwhile to examine transformations of this sort without assuming normality and
resorting to maximum likelihood for our answer. This is especially appropriate if our
general method is to find an optimal estimate of the exponent and then round it to
the nearest interpretable value based on a confidence interval. Indeed, two discussants
of the Box and Cox paper, John Hartigan and John Tukey, asked just that.
The conjoint model offers one approach to this question. Specifically, we can use a
power function relating the y data values to the predictor variables in our model and
see how it converges.
Following is the input:
USE BOXCOX
CONJOINT
MODEL Y=POISON TREATMEN
ESTIMATE / REGRESS=POWER
Iteration        Loss
     1        0.1977795
     2        0.1661894
     3        0.1594770
     4        0.1571216
     5        0.1562271
     6        0.1559910
     7        0.1559285
     8        0.1559166
     9        0.1559135
    10        0.1559131
    11        0.1559129
    12        0.1559134
    13        0.1559129
    14        0.1559130
    15        0.1559127
Computed Exponent:   -1.015

    TREATMEN(1)     0.423
    TREATMEN(2)    -0.414
    TREATMEN(3)     0.133
(Quick Graphs: a Shepard diagram of Joint Score against Data, and utility plots of Measure against the levels of POISON and TREATMENT.)
On each iteration, CONJOINT transforms the observed (y) values by the current
estimate of the exponent, regresses them on the currently weighted X variables (using
the conjoint parameter estimates), and computes the loss from the residuals of that
regression. Over iterations, this loss is minimized and we get to view the final fit in the
plotted Shepard diagram.
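A rough sketch of one such loss evaluation is shown below. It is a simplification of what
CONJOINT does internally: the coding of the factors, the current weights, and the exponent are
all hypothetical, and the additive predictor is formed simply by summing the weighted columns.

import numpy as np

def power_loss(y, X, weights, exponent):
    # transform y by the current exponent, regress on the weighted predictors,
    # and return the residual sum of squares relative to the total sum of squares
    yt = y ** exponent if exponent != 0 else np.log(y)
    additive = (X * weights).sum(axis=1)
    D = np.column_stack([np.ones(len(y)), additive])
    beta, *_ = np.linalg.lstsq(D, yt, rcond=None)
    resid = yt - D @ beta
    return (resid @ resid) / ((yt - yt.mean()) @ (yt - yt.mean()))

y = np.array([0.31, 0.45, 0.82, 1.10])                          # hypothetical responses
X = np.array([[1.0, 0.0], [1.0, 1.0], [2.0, 0.0], [2.0, 1.0]])  # coded factor levels
print(power_loss(y, X, weights=np.array([0.6, -0.4]), exponent=-1.0))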
The CONJOINT program produced an estimate of -1.015 for the exponent. Draper and
Hunter (1969) reanalyzed the poison data using several criteria suggested in the
discussion to Box and Cox's paper and elsewhere (minimizing the interaction F ratio,
maximizing main-effects F ratios, and minimizing Levene's test for heterogeneity of
within-group variances). They found the best exponent to be in the neighborhood of -1.
Example 4
Employment Discrimination
The following table shows the mean salaries (SALNOW) of employees at a Chicago
bank. These data are from the BANK.SYD data set used in many SYSTAT manuals.
The bank was involved in a discrimination lawsuit, and the focus of our interest is
whether we can represent the salaries by a simple additive model. At the time these
data were collected, there were no black females with a graduate school education
working at the bank. The education variable records the highest level reached.
                   High School     College     Grad School
White Males           11735         16215         28251
Black Males           11513         13341         20472
White Females          9600         13612         11640
Black Females          8874         10278
Let's regress beginning salary (SALBEG) and current salary (SALNOW) on the gender
and education data. To represent our model, we will code the categories with integers:
for gender/race, 1=black females, 2=white females, 3=black males, 4=white males; for
education, 1=high school, 2=college, 3=grad school. These codings order the salaries
for both racial/gender status and educational levels.
Following is the input:
USE BANK
IF SEX=1 AND MINORITY=1 THEN LET GROUP=1
IF SEX=1 AND MINORITY=0 THEN LET GROUP=2
IF SEX=0 AND MINORITY=1 THEN LET GROUP=3
IF SEX=0 AND MINORITY=0 THEN LET GROUP=4
LET EDUC=1
IF EDLEVEL>12 THEN LET EDUC=2
IF EDLEVEL>16 THEN LET EDUC=3
LABEL GROUP / 1=Black_Females,2=White_Females,
3=Black_Males,4=White_Males
LABEL EDUC / 1=High_School,2=College,3=Grad_School
CONJOINT
MODEL SALBEG,SALNOW=GROUP EDUC
ESTIMATE / REGRESS=POWER
Iteration        Loss
     1        0.3932757
     2        0.3734472
     3        0.3631769
     4        0.3606965
     5        0.3589525
     6        0.3585654
     7        0.3584647
     8        0.3584328
     9        0.3584239
    10        0.3584233
    11        0.3584215
    12        0.3584225
    13        0.3584253
    14        0.3584253
    15        0.3584231
    16        0.3584189
Computed Exponent:

    GROUP(4)     0.144
    EDUC(1)     -0.356
    EDUC(2)     -0.010
(Quick Graphs: Shepard diagrams of Joint Score against Data for beginning and current salary, and utility plots of Measure against GROUP (Black_Females, White_Females, Black_Males, White_Males) and EDUC (High_School, College, Grad_School).)
LSALN
9.441
Regression coefficients   B = (X'X)^(-1) X'Y
LSALB
LSALN
CONSTANT
8.829
9.531
0.576
0.653
0.723
0.722
E
G
0.558
0.351
Multiple correlations
LSALB
0.817
LSALN
0.789
LSALN
0.622
(Leverage = 0.128)

Univariate F Tests

Effect         SS        df        MS          F          P
LSALB        0.275        1      0.275       6.596      0.011
  Error     19.628      470      0.042
LSALN        0.109        1      0.109       1.818      0.178
  Error     28.219      470      0.060

Wilks' Lambda =              0.986
F-Statistic =                3.447     df = 2, 469     Prob = 0.033

Pillai Trace =               0.014
F-Statistic =                3.447     df = 2, 469     Prob = 0.033

Hotelling-Lawley Trace =     0.015
F-Statistic =                3.447     df = 2, 469     Prob = 0.033
Ordered Scatterplots
Finally, let's use SYSTAT to produce scatterplots of beginning and current salary
ordered by the conjoint coefficients. The SYSTAT code to do this can be found in the
file CONJO4.SYC. The spacing of the scatterplots should tell the story.
The story is mainly in this graph: regardless of educational level, minorities and
women received lower salaries. There are a few exceptions to the general pattern, but
overall the bank had reason to settle the lawsuit.
Computation
All computations are in double precision.
Algorithms
CONJOINT uses a direct search optimization method to minimize the loss function.
This enables minimization of Kendall's tau. There is no guarantee that the program will
find the global minimum of tau, so it is wise to try several regression types and the
STRESS loss to be sure that they all reach approximately the same neighborhood.
Missing Data
Missing values are processed by omitting them from the loss function.
References
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. Journal of the Royal
Statistical Society, Series B, 26, 211–252.
Brogden, H. E. (1977). The Rasch model, the law of comparative judgment and additive
conjoint measurement. Psychometrika, 42, 631–634.
Carmone, F. J., Green, P. E., and Jain, A. K. (1978). Robustness of conjoint analysis: Some
Monte Carlo results. Journal of Marketing Research, 15, 300–303.
Carroll, J. B., Davies, P., and Richmond, B. (1971). The word frequency book. Boston,
Mass.: Houghton, Mifflin.
Carroll, J. D. and Green, P. E. (1995). Psychometric methods in marketing research: Part I,
conjoint analysis. Journal of Marketing Research, 32, 385–391.
Crowe, G. (1980). Conjoint measurement's design considerations. PMRS Journal, 1, 8–13.
DeLeew, J., Young, F. W., and Takane, Y. (1976). Additive structure in qualitative data:
An alternating least squares method with optimal scaling features. Psychometrika, 41,
471–503.
Draper, N. R. and Hunter, W. G. (1969). Transformations: Some examples revisited.
Technometrics, 11, 23–40.
Emery, D. R. and Barron, F. H. (1979). Axiomatic and numerical conjoint measurement:
An evaluation of diagnostic efficacy. Psychometrika, 44, 195–210.
Green, P. E., Carmone, F. J., and Wind, Y. (1972). Subjective evaluation models and
conjoint measurement. Behavioral Science, 17, 288–299.
Green, P. E. and DeSarbo, W. S. (1978). Additive decomposition of perceptions data via
conjoint analysis. Journal of Consumer Research, 5, 58–65.
Green, P. E. and Rao, V. R. (1971). Conjoint measurement for quantifying judgmental data.
Journal of Marketing Research, 8, 355–363.
Green, P. E. and Srinivasan, V. (1978). Conjoint analysis in consumer research: Issues and
outlook. Journal of Consumer Research, 5, 103–123.
Green, P. E. and Srinivasan, V. (1990). Conjoint analysis in marketing: New developments
with implications for research and practice. Journal of Marketing, 54, 3–19.
Heiser, W. J. and Meulman, J. J. (1995). Nonlinear methods for the analysis of
homogeneity and heterogeneity. In W. J. Krzanowski (ed.), Recent advances in
descriptive multivariate analysis, 51–89. Oxford: Clarendon Press.
Chapter 6
Correlations, Similarities, and Distance Measures
Leland Wilkinson, Laszlo Engelman, and Rick Marcantonio
Statistical Background
SYSTAT computes many different measures of the strength of association between
variables. The most popular measure is the Pearson correlation, which is appropriate
for describing linear relationships between continuous variables. However, CORR
offers a variety of alternative measures of similarity and distance appropriate if the data
are not continuous.
Lets look at an example. The following data, from the CARS file, are taken from
various issues of Car and Driver and Road & Track magazines. They are the car
enthusiast's equivalent of Consumer Reports performance ratings. The cars rated
include some of the most expensive and exotic cars in the world (for example, Ferrari
Testarossa) as well as some of the least expensive but sporty cars (for example, Honda
Civic CRX). The attributes measured are 0-to-60 m.p.h. acceleration, braking distance in
feet from 60 to 0 m.p.h., slalom times (speed over a twisty course), miles per gallon, and
top speed in miles per hour.
ACCEL   BRAKE   SLALOM    MPG    SPEED   NAME$
  5.0     245    61.3     17.0     153   Porsche 911T
  5.3     242    61.9     12.0     181   Testarossa
  5.8     243    62.6     19.0     154   Corvette
  7.0     267    57.8     14.5     145   Mercedes 560
  7.6     271    59.8     21.0     124   Saab 9000
  7.9     259    61.7     19.0     130   Toyota Supra
  8.5     263    59.9     17.5     131   BMW 635
  8.7     287    64.2     35.0     115   Civic CRX
  9.3     258    64.1     24.5     129   Acura Legend
 10.8     287    60.8     25.0     100   VW Fox GL
 13.0     253    62.3     27.0      95   Chevy Nova
A convenient summary that shows the relationships between the performance variables
is to arrange them in a matrix. A matrix is a rectangular array. We can put any sort of
numbers in the cells of the matrix, but we will focus on measures of association. Before
doing that, however, let's examine a graphical matrix, the scatterplot matrix
(SPLOM).
(Scatterplot matrix of ACCEL, BRAKE, SLALOM, MPG, and SPEED, with histograms on the diagonal.)
This matrix shows the histograms of each variable on the diagonal and the scatterplots
(x-y plots) of each variable against the others. For example, the scatterplot of
acceleration versus braking is at the top of the matrix. Since the matrix is symmetric,
only the bottom half is shown. In other words, the plot of acceleration versus braking
is the same as the transposed scatterplot of braking versus acceleration.
Pearson correlation matrix

            ACCEL    BRAKE   SLALOM      MPG    SPEED
ACCEL       1.000
BRAKE       0.466    1.000
SLALOM      0.176   -0.097    1.000
MPG         0.651    0.622    0.597    1.000
SPEED      -0.908   -0.665   -0.115   -0.768    1.000

Number of Observations: 11
Try superimposing in your mind the correlation matrix on the SPLOM. The Pearson
correlation for acceleration versus braking is 0.466. This correlation is positive and
moderate in size. On the other hand, the correlation between acceleration and speed is
negative and quite large (-0.908). You can see in the lower left corner of the SPLOM
that the points cluster around a downward sloping line. In fact, all of the correlations
of speed with the other variables are negative, which makes sense since greater speed
implies greater performance. The same is true for slalom performance, but this is
clouded by the fact that some small but slower cars like the Honda Civic CRX are
extremely agile.
Keep in mind that the Pearson correlation measures linear predictability. Do not
assume that a Pearson correlation near 0 implies no relationship between variables.
Many nonlinear associations (U- and S-shaped curves, for example) can have Pearson
correlations of 0.
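A two-line demonstration in Python (the data are artificial):

import numpy as np

x = np.linspace(-3, 3, 101)
y = x ** 2                                  # a perfect U-shaped relationship

print(round(np.corrcoef(x, y)[0, 1], 3))    # essentially 0, despite perfect dependence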
Spearman correlation matrix

            ACCEL    BRAKE   SLALOM      MPG    SPEED
ACCEL       1.000
BRAKE       0.501    1.000
SLALOM      0.245   -0.305    1.000
MPG         0.815    0.502    0.487    1.000
SPEED      -0.891   -0.651   -0.109   -0.884    1.000

Number of observations: 11
It is often useful to compute both a Spearman and Pearson matrix on the same data.
The absolute difference between the two can reveal unusual features. For example, the
greatest difference for our data is on the slalom-braking correlation. This is because the
Honda Civic CRX is so fast through the slalom, despite its inferior brakes, that it
attenuates the Pearson correlation between slalom and braking. The Spearman
correlation reduces its influence.
Euclidean and city-block distance measures have been widely available in software
packages for many years; Bray-Curtis and QSK are less common. For each pair of
variables,

    Bray-Curtis = Σk |xik − xjk| / Σk (xik + xjk)

    QSK = 1 − (1/2) [ Σk min(xik, xjk) / Σk xik  +  Σk min(xik, xjk) / Σk xjk ]

where i and j are variables and k indexes cases. After an extensive computer simulation
study, Faith, Minchin, and Belbin (1987) concluded that BC and QSK were effective as
robust measures in terms of both rank and linear correlation. The use of these
measures is similar to that for Correlations (Pearson, Covariance, and SSCP), except
the EM, Prob, Bonferroni, Dunn-Sidak, and Hadi options are not available.
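For a single pair of variables, both measures are easy to compute directly from the formulas
above. The following Python sketch (with made-up nonnegative measurements) is an
illustration only, not SYSTAT code.

import numpy as np

def bray_curtis(xi, xj):
    return np.abs(xi - xj).sum() / (xi + xj).sum()

def qsk(xi, xj):
    w = np.minimum(xi, xj).sum()
    return 1.0 - 0.5 * (w / xi.sum() + w / xj.sum())

xi = np.array([3.0, 0.0, 5.0, 2.0])
xj = np.array([1.0, 4.0, 4.0, 2.0])
print(bray_curtis(xi, xj), qsk(xi, xj))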
                     x_j
                   1        0
    x_i    1       a        b        a+b
           0       c        d        c+d
                  a+c      b+d

    S2 = a / (a + b + c + d)
    S3 = a / (a + b + c)
    S4 = (a + d) / (a + b + c + d)
    S5 = a / (a + 2(b + c))
    S6 = (a + d) / (a + 2(b + c) + d)
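All five coefficients follow directly from the four counts. A short Python sketch (the counts
here simply echo the worked values used in Example 8 later in this chapter):

def s_coefficients(a, b, c, d):
    n = a + b + c + d
    return {
        "S2": a / n,
        "S3": a / (a + b + c),                  # Jaccard
        "S4": (a + d) / n,                      # simple matching
        "S5": a / (a + 2 * (b + c)),            # Anderberg
        "S6": (a + d) / (a + 2 * (b + c) + d),
    }

print(s_coefficients(a=80, b=20, c=28, d=128))
# S2 = 80/256 = 0.3125 and S3 = 80/128 = 0.625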
(Scatterplot of Y against X, divided into four quadrants, with the count of observations falling in each quadrant.)
A large proportion of the observations fall in the upper right and lower left quadrants
because the relationship is positive (the Pearson correlation is approximately 0.70).
Correspondingly, if there were a strong negative relationship, the points would
concentrate in the upper left and lower right quadrants. If the original observations are
no longer available but you do have the frequency counts for the four quadrants, try a
tetrachoric correlation.
The computations for the tetrachoric correlation begin by finding estimates of the
inverse cumulative marginal distributions.
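The formulas are not reproduced here, but the following Python sketch conveys the idea under
simplifying assumptions: the quadrant counts a, b, c, d are hypothetical, the thresholds come
from the inverse normal of the marginal proportions, and the correlation is the value that makes
a bivariate normal reproduce the observed low-low quadrant proportion.

import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def tetrachoric(a, b, c, d):
    # a = both high, b = x high only, c = y high only, d = both low
    n = a + b + c + d
    h = norm.ppf((c + d) / n)      # threshold on x from its marginal split
    k = norm.ppf((b + d) / n)      # threshold on y from its marginal split
    target = d / n                 # observed proportion in the low-low quadrant

    def gap(r):
        bvn = multivariate_normal(mean=[0, 0], cov=[[1, r], [r, 1]])
        return bvn.cdf([h, k]) - target

    return brentq(gap, -0.99, 0.99)

print(round(tetrachoric(a=19, b=7, c=2, d=17), 3))   # hypothetical counts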
Transposed Data
You can use CORR to compute measures of association on the rows or columns of your
data. Simply transpose the data and then use CORR. This makes sense when you want
to assess similarity between rows. We might be interested in identifying similar cars
from our performance measures, for example. Recall that you cannot transpose a file
that contains character data.
When you compute association measures across rows, however, be sure that the
variables are on comparable scales. Otherwise, a single variable will influence most of
the association. With the cars data, braking and speed are so large that they would
almost uniquely determine the similarity between cars. Consequently, we standardized
the data before transposing them. That way, the correlations measure the similarities
comparably across attributes.
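A sketch of the same sequence outside SYSTAT, using three of the cars from the table above
(standardize the columns, then correlate the rows):

import numpy as np

# rows are cars, columns are ACCEL, BRAKE, SLALOM, MPG, SPEED
data = np.array([[ 5.0, 245.0, 61.3, 17.0, 153.0],    # Porsche 911T
                 [ 8.7, 287.0, 64.2, 35.0, 115.0],    # Civic CRX
                 [13.0, 253.0, 62.3, 27.0,  95.0]])   # Chevy Nova

# standardize each attribute so no single scale dominates the similarity
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# np.corrcoef treats each row as a variable, so this correlates the cars
print(np.round(np.corrcoef(z), 3))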
PORSCHE
FERRARI
CORVETTE
MERCEDES
SAAB
TOYOTA
BMW
HONDA
ACURA
VW
CHEVY
TOYOTA
BMW
HONDA
ACURA
VW
CHEVY
PORSCHE
FERRARI
1.000
0.940
0.939
0.093
-0.506
0.238
-0.319
-0.504
-0.046
-0.962
-0.731
CORVETTE
MERCEDES
SAAB
1.000
0.868
0.212
-0.523
0.429
-0.095
-0.730
-0.102
-0.928
-0.698
1.000
-0.240
-0.760
0.402
-0.557
-0.393
0.298
-0.980
-0.491
1.000
0.664
-0.379
0.854
-0.519
-0.978
0.079
-0.532
1.000
-0.680
0.634
0.265
-0.770
0.704
-0.131
TOYOTA
BMW
HONDA
ACURA
VW
1.000
-0.247
-0.298
0.533
-0.353
-0.034
1.000
-0.500
-0.788
0.391
-0.064
1.000
0.349
0.552
0.320
1.000
-0.156
0.536
1.000
0.525
CHEVY
CHEVY
1.000
Number of observations:
- After ranking, select the same number of cases with small ranks as before but add
  the case with the next largest rank and repeat the process, each time updating the
  covariance matrix, computing and sorting new distances, and increasing the
  subsample size by one.
- Continue adding cases until the entering one exceeds an internal limit based on a
  chi-square statistic (see Hadi, 1994). The cases remaining (not entered) are
  identified as outliers.
- Use the cases that are not identified as outliers to compute the measure requested
  (a rough sketch of this forward search follows).
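The sketch below is a much simplified forward search in Python; it is not Hadi's exact
algorithm (the starting subset, the distance scaling, and the chi-square cutoff are all
assumptions), but it shows the add-one-case-at-a-time logic.

import numpy as np
from scipy.stats import chi2

def forward_search_outliers(X, alpha=0.05):
    n, p = X.shape
    cutoff = chi2.ppf(1 - alpha / n, df=p)               # assumed internal limit
    start = np.abs(X - np.median(X, axis=0)).sum(axis=1)
    subset = np.argsort(start)[: (n + p + 1) // 2]       # initial clean subsample

    while True:
        mu = X[subset].mean(axis=0)
        cov_inv = np.linalg.pinv(np.cov(X[subset], rowvar=False))
        diff = X - mu
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared distances
        order = np.argsort(d2)
        if len(subset) == n or d2[order[len(subset)]] > cutoff:
            break
        subset = order[: len(subset) + 1]                # admit the next closest case
    return np.setdiff1d(np.arange(n), order[: len(subset)])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 3)), rng.normal(8.0, 1.0, size=(3, 3))])
print(forward_search_outliers(X))   # the three shifted cases should be among those flagged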
Correlations in SYSTAT
Correlations Main Dialog Box
To open the Correlations dialog box, from the menus choose:
Statistics
Correlations
Simple
Variables. Available only if One is selected for Sets. All selected variables are
correlated with all other variables in the list, producing a triangular correlation matrix.
Rows. Available only if Two is selected for Sets. Selected variables are correlated with
all column variables, producing a rectangular matrix.
Columns. Available only if Two is selected for Sets. Selected variables are correlated
with all row variables, producing a rectangular matrix.
Sets. One set creates a single, triangular correlation matrix of all variables in the
Variable(s) list. Two sets creates a rectangular matrix of variables in the Row(s) list
correlated with variables in the Column(s) list.
Listwise. Listwise deletion of missing data. Any case with missing data for any variable
in the list is excluded.
Pairwise. Pairwise deletion of missing data. Only cases with missing data for one of
the variables in the pair being correlated are excluded.
Save file. Saves the correlation matrix to a file.
Types. Type of data or measure. You can select from a variety of distance measures, as
well as measures for continuous data, rank-order data, and binary data.
Pearson correlations vary between -1 and +1. A value of 0 indicates that neither of
two variables can be predicted from the other by using a linear equation. A Pearson
correlation of +1 or -1 indicates that one variable can be predicted perfectly by a
linear function of the other.
n Covariance. Produces a covariance matrix.
n SSCP. Produces a sum of cross-products matrix. If the Pairwise option is chosen,
Kulczynski measure.
n Euclidean. Produces a matrix of Euclidean distances normalized by the sample size.
n City. Produces a matrix of city-block, or first-power, distances (sum of absolute
coefficients.
n Jaccard (S3). Produces a matrix of Jaccards dichotomy coefficients.
n Simple matching (S4). Produces a matrix of simple matching dichotomy
coefficients.
n Anderberg (S5). Produces a matrix of Anderbergs dichotomy coefficients.
Correlations Options
To specify options for correlations, click Options in the Correlations dialog box.
normal sample. For the contaminated normal, SYSTAT assumes that the
distribution is a mixture of two normal distributions (same mean, different
variances) with a specified probability of contamination. The Probability value is
the probability of contamination (for example, 0.10), and Variance is the variance
Using Commands
First, specify your data with USE filename. Then, type CORR and choose your measure
and type:
Full matrix
Portion of matrix
QSK
MU2
S4
SSCP
EUCLIDEAN
TAU
S5
CITY
TETRA
S6
SPEARMAN
S2
PEARSON
For PEARSON, COVARIANCE, and SSCP, the following options are available:
EM
T=df
NORMAL=n1,n2
ITER=n
CONV=n
HADI
TOL=n
Usage Considerations
Types of data. CORR uses rectangular data only.
Print options. With PRINT=LONG, SYSTAT prints the mean of each variable. In
addition, for EM estimation, SYSTAT prints an iteration history, missing value
patterns, Littles MCAR test, and mean estimates.
Quick Graphs. CORR includes a SPLOM (matrix of scatterplots) where the data in each
plot correspond to a value in the matrix.
Saving files. CORR saves the correlation matrix or other measure computed. SYSTAT
automatically defines the type of file as CORR, DISS, COVA, SSCP, SIMI, or RECT.
BY groups. CORR analyzes data by groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. FREQ=<variable> increases the number of cases by the FREQ
variable.
Case weights. WEIGHT is available in CORR.
Examples
Example 1
Pearson Correlations
This example uses data from the OURWORLD file that contains records (cases) for 57
countries. We are interested in correlations among variables recording the percentage
of the population living in cities, birth rate, gross domestic product per capita, dollars
expended per person for the military, ratio of birth rates to death rates, life expectancy
(in years) for males and females, percentage of the population who can read, and gross
national product per capita in 1986. The input is:
CORR
USE ourworld
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86
URBAN BIRTH_RT
1.000
-0.800
1.000
0.625
-0.762
0.597
-0.672
-0.307
0.511
0.776
-0.922
0.801
-0.949
0.800
-0.930
0.592
-0.689
LITERACY
1.000
0.611
GDP_CAP
MIL
1.000
0.899
-0.659
0.664
0.704
0.637
0.964
1.000
-0.607
0.582
0.619
0.562
0.873
1.000
-0.211
-0.265
-0.274
-0.560
1.000
0.989
0.911
0.633
1.000
0.935
0.665
GNP_86
1.000
GNP_86
LITERACY
LIFEEXPF
LIFEEXPM
B_TO_D
MIL
GDP_CAP
BIRTH_RT
URBAN
Number of observations: 49
URBAN
BIRTH_RT
GDP_CAP
MIL
B_TO_D
LIFEEXPM
LIFEEXPF
LITERACY
GNP_86
The correlations for all pairs of the nine variables are shown here. The bottom of the
output panel shows that the sample size is 49, but the data file has 57 countries. If a
country has one or more missing values, SYSTAT, by default, omits all of the data for
the case. This is called listwise deletion.
The Quick Graph is a matrix of scatterplots with one plot for each entry in the
correlation matrix and histograms of the variables on the diagonal. For example, the
plot of BIRTH_RT against URBAN is at the top left under the histogram for URBAN.
If linearity does not hold for your variables, your results may be meaningless. A
good way to assess linearity, the presence of outliers, and other anomalies is to
examine the plot for each pair of variables in the scatterplot matrix. The relationships
between GDP_CAP and BIRTH_RT, B_TO_D, LIFEEXPM, and LIFEEXPF do not
appear to be linear. Also, the points in the MIL versus GDP_CAP and GNP_86 versus
MIL displays clump in the lower left corner. It is not wise to use correlations for
describing these relations.
(Using the command language, press F9 to retrieve the previous PEARSON statement
instead of retyping it.)
The output is:
Pearson correlation matrix
URBAN
BIRTH_RT
GDP_CAP
MIL
B_TO_D
LIFEEXPM
LIFEEXPF
LITERACY
GNP_86
1.00
-0.61
0.58
0.62
0.56
0.87
1.00
-0.21
-0.26
-0.27
-0.56
1.00
0.99
0.91
0.63
1.00
0.93
0.67
1.00
0.61
1.00
Number of observations: 49
Notice that while the top row of variable names is truncated to fit within the field
specification, the row names remain complete.
URBAN
0.776
0.801
0.800
0.592
BIRTH_RT
-0.922
-0.949
-0.930
-0.689
GDP_CAP
0.664
0.704
0.637
0.964
MIL
0.582
0.619
0.562
0.873
B_TO_D
-0.211
-0.265
-0.274
-0.560
Number of observations: 49
These correlations correspond to the lower left corner of the first matrix.
Example 2
Transformations
If relationships between variables appear nonlinear, using a measure of linear
association is not advised. Fortunately, transformations of the variables may yield
linear relationships. You can then use the linear relation measures, but all conclusions
regarding the relationships are relative to the transformed variables instead of the
original variables.
In the Pearson correlations example, we observed nonlinear relationships involving
GDP_CAP, MIL, and GNP_86. Here we log transform these variables and compare the
resulting correlations to those for the untransformed variables. The input is:
CORR
USE ourworld
LET (gdp_cap,mil,gnp_86) = L10(@)
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86
BIRTH_RT
25.9592
LIFEEXPF
70.5714
GDP_CAP
3.3696
LITERACY
74.7265
MIL
1.6954
GNP_86
3.2791
B_TO_D
2.8855
BIRTH_RT
GDP_CAP
MIL
B_TO_D
1.0000
-0.9189
-0.8013
0.5106
-0.9218
-0.9488
-0.9302
-0.8786
LIFEEXPF
1.0000
0.8947
-0.5293
0.8599
0.8954
0.8337
0.9736
LITERACY
1.0000
-0.5374
0.7267
0.7634
0.7141
0.8773
GNP_86
1.0000
-0.2113
-0.2648
-0.2737
-0.4411
1.0000
0.9350
0.8861
1.0000
0.8404
1.0000
URBAN
BIRTH_RT
GDP_CAP
MIL
B_TO_D
LIFEEXPM
LIFEEXPF
LITERACY
GNP_86
LIFEEXPM
LIFEEXPF
LITERACY
GNP_86
GNP_86
LITERACY
LIFEEXPF
LIFEEXPM
B_TO_D
MIL
GDP_CAP
BIRTH_RT
URBAN
Number of observations: 49
URBAN
BIRTH_RT
GDP_CAP
MIL
B_TO_D
LIFEEXPM
LIFEEXPF
LITERACY
GNP_86
In the scatterplot matrix, linearity has improved in the plots involving GDP_CAP, MIL,
and GNP_86. Look at the difference between the correlations before and after
transformation.
                     Transformation
                      no        yes
gdp_cap vs.
  urban             0.625      0.764
  birth_rt          0.762      0.919
  lifeexpm          0.664      0.860
  lifeexpf          0.704      0.895
  literacy          0.637      0.834

                     Transformation
                      no        yes
mil vs.
  urban             0.597      0.680
  birth_rt          0.672      0.801
  lifeexpm          0.582      0.727
  lifeexpf          0.619      0.763
  literacy          0.562      0.714

                     Transformation
                      no        yes
gnp_86 vs.
  urban             0.592      0.775
  birth_rt          0.689      0.879
  lifeexpm          0.633      0.861
  lifeexpf          0.665      0.886
  literacy          0.611      0.840
After log transforming the variables, linearity has improved in the plots, and many of
the correlations are stronger.
Example 3
Missing Data: Pairwise Deletion
To specify pairwise deletion, the input is:
USE ourworld
CORR
LET (gdp_cap,mil,gnp_86) = L10(@)
GRAPH NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86 / PAIR
BIRTH_RT
26.351
LIFEEXPF
70.123
GDP_CAP
3.372
LITERACY
73.563
MIL
1.775
GNP_86
3.293
B_TO_D
2.873
BIRTH_RT
GDP_CAP
MIL
B_TO_D
1.000
-0.895
-0.687
0.535
-0.892
-0.924
-0.930
-0.881
1.000
0.857
-0.472
0.854
0.891
0.832
0.974
1.000
-0.377
0.696
0.721
0.646
0.881
1.000
-0.172
-0.230
-0.291
-0.455
URBAN
1.000
-0.781
0.778
0.683
-0.248
0.796
0.816
0.807
0.775
LIFEEXPM
LIFEEXPF
LITERACY
GNP_86
LIFEEXPM
1.000
0.989
0.911
0.863
LIFEEXPF
LITERACY
GNP_86
1.000
0.937
0.888
1.000
0.842
1.000
BIRTH_RT
GDP_CAP
MIL
B_TO_D
57
57
56
57
57
57
57
50
LIFEEXPF
57
56
57
57
57
57
50
LITERACY
56
56
56
56
56
50
GNP_86
57
57
57
57
50
57
57
50
57
50
50
URBAN
56
56
56
55
56
56
56
56
49
LIFEEXPM
57
57
57
50
The sample size for each variable is reported as the diagonal of the pairwise frequency
table; sample sizes for complete pairs of cases are reported off the diagonal. There are
57 countries in this sample; 56 reported the percentage living in cities (URBAN), and
50 reported the gross national product per capita in 1986 (GNP_86). There are 49
countries that have values for both URBAN and GNP_86.
The means are printed because we specified PRINT=LONG. Since pairwise deletion
is requested, all available values are used to compute each mean; that is, these means
are the same as those computed by the Statistics procedure.
Example 4
Missing Data: EM Estimation
This example uses the same variables used in the transformations example. To specify
EM estimation, the input is:
CORR
USE ourworld
LET (gdp_cap,mil,gnp_86) = L10(@)
IDVAR = country$
GRAPH = NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm,
lifeexpf literacy gnp_86 / EM
No. of
Cases     Missing value patterns
  49      XXXXXXXXX
   1      .XXXXXXXX
   6      XXXXXXXX.
   1      XXX.XXXX.

Iteration    Maximum Error    -2*log(likelihood)
---------    -------------    ------------------
     1         1.092328         24135.483249
     2         1.023878          7625.491302
     3         0.643113          6932.605472
     4         0.666125          6691.458724
     5         0.857590          6573.199525
     6         2.718236          6538.852550
     7         0.728468          6531.689766
     8         0.196577          6530.369252
     9         0.077590          6530.167056
    10         0.034510          6530.159651
    11         0.016278          6530.176410
    12         0.007986          6530.190050
    13         0.004050          6530.198695
    14         0.002120          6530.203895
    15         0.001145          6530.207008
    16         0.000637          6530.208887

Little MCAR test:   35.757     df = 23     prob = 0.044

EM estimate of means

URBAN      53.152       LIFEEXPM   65.088       BIRTH_RT   26.351
LIFEEXPF   70.123       GDP_CAP     3.372       LITERACY   73.563
MIL         1.754       GNP_86      3.284       B_TO_D      2.873
BIRTH_RT
GDP_CAP
MIL
B_TO_D
1.000
-0.895
-0.697
0.535
-0.892
-0.924
-0.930
-0.831
LIFEEXPF
1.000
0.863
-0.472
0.854
0.891
0.832
0.968
LITERACY
1.000
-0.357
0.713
0.738
0.668
0.874
GNP_86
1.000
-0.172
-0.230
-0.291
-0.342
1.000
0.937
0.885
1.000
0.828
1.000
URBAN
1.000
-0.782
0.779
0.700
-0.259
0.796
0.816
0.808
0.796
LIFEEXPM
1.000
0.989
0.911
0.863
SYSTAT prints missing-value patterns for the data. Forty-nine cases in the sample are
complete (an X is printed for each of the nine variables). Periods are inserted where
data are missing. The value of the first variable, URBAN, is missing for one case, while
the value of the last variable, GNP_86, is missing for six cases. The last row of the
pattern indicates that the values of the fourth variable, MIL, and the last variable,
GNP_86, are both missing for one case.
Little's MCAR (missing completely at random) test has a probability less than 0.05,
indicating that we reject the hypothesis that the nine missing values are randomly
missing. This test has limited power when the sample of incomplete cases is small and
it also offers no direct evidence on the validity of the MAR assumption.
Example 5
Probabilities Associated with Correlations
To request the usual (uncorrected) probabilities for a correlation matrix using pairwise
deletion:
USE ourworld
CORR
LET (gdp_cap,mil,gnp_86) = L10(@)
GRAPH NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86 / PAIR PROB
Matrix of Probabilities
URBAN
BIRTH_RT
GDP_CAP
MIL
B_TO_D
LIFEEXPM
LIFEEXPF
LITERACY
GNP_86
LIFEEXPM
LIFEEXPF
LITERACY
GNP_86
URBAN
0.0
0.000
0.000
0.000
0.065
0.000
0.000
0.000
0.000
LIFEEXPM
0.0
0.000
0.000
0.000
BIRTH_RT
GDP_CAP
MIL
B_TO_D
0.0
0.000
0.000
0.000
0.000
0.000
0.000
0.000
LIFEEXPF
0.0
0.000
0.000
0.000
0.000
0.000
0.000
LITERACY
0.0
0.004
0.000
0.000
0.000
0.000
GNP_86
0.0
0.202
0.085
0.028
0.001
0.0
0.000
0.000
0.0
0.000
0.0
The p values that are appropriate for making statements regarding one specific
correlation are shown here. By themselves, these values are not very informative.
These p values are pseudo-probabilities because they do not reflect the number of
correlations being tested. If pairwise deletion is used, the problem is even worse,
although many statistics packages print probabilities as if they meant something in this
case, too.
SYSTAT computes the Bartlett chi-square test whenever you request probabilities
for more than one correlation. This tests a global hypothesis concerning the
significance of all of the correlations in the matrix
    χ² = −[ N − 1 − (2p + 5)/6 ] ln |R|
where N is the total sample size (or the smallest sample size for any pair in the matrix
if pairwise deletion is used), p is the number of variables, and |R| is the determinant of
the correlation matrix. This test is sensitive to non-normality, and the test statistic is
only asymptotically distributed (for large samples) as chi-square. Nevertheless, it can
serve as a guideline.
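A Python sketch of the statistic as reconstructed above (random data are used only for
illustration, and the result should be read as a guideline, as the text warns):

import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(data):
    n, p = data.shape
    R = np.corrcoef(data, rowvar=False)
    stat = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    return stat, df, chi2.sf(stat, df)

rng = np.random.default_rng(1)
x = rng.normal(size=(49, 4))
x[:, 1] += x[:, 0]                 # build one real correlation into the data
print(bartlett_sphericity(x))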
If the Bartlett test is not significant, don't even look at the significance of individual
correlations. In this example, the test is significant, which indicates that there may be
some real correlations among the variables. The Bartlett test is sensitive to
non-normality and can be used only as a guide. Even if the Bartlett test is significant, you
cannot accept the nominal p values as the true family probabilities associated with each
correlation.
URBAN
0.0
0.000
0.000
0.000
1.000
0.000
0.000
0.000
0.000
LIFEEXPM
0.0
0.000
0.000
0.000
BIRTH_RT
GDP_CAP
MIL
B_TO_D
0.0
0.000
0.000
0.001
0.000
0.000
0.000
0.000
LIFEEXPF
0.0
0.000
0.008
0.000
0.000
0.000
0.000
LITERACY
0.0
0.150
0.000
0.000
0.000
0.000
GNP_86
0.0
1.000
1.000
1.000
0.032
0.0
0.000
0.000
0.0
0.000
0.0
Compare these results with those for the 36 tests using uncorrected probabilities.
Notice that some correlations, such as those for B_TO_D with MIL, LITERACY, and
GNP_86, are no longer significant.
URBAN
0.0
0.000
0.000
0.000
1.000
0.000
0.000
0.000
0.000
BIRTH_RT
GDP_CAP
MIL
B_TO_D
0.0
0.000
0.000
0.001
0.000
0.000
0.000
0.000
0.0
0.000
0.008
0.000
0.000
0.000
0.000
0.0
0.248
0.000
0.000
0.000
0.000
0.0
1.000
1.000
1.000
0.537
LIFEEXPM
LIFEEXPF
LITERACY
GNP_86
LIFEEXPM
0.000
0.000
0.000
0.000
LIFEEXPF
LITERACY
GNP_86
0.0
0.000
0.000
0.0
0.000
0.000
Example 6
Hadi Robust Outlier Detection
If only one or two variables have outliers among many well-behaved variables, the
outliers may be masked. Let's look for outliers among four variables. The input is:
USE ourworld
CORR
LET (gdp_cap, mil) = L10(@)
GRAPH = NONE
PRINT = LONG
IDVAR = country$
PEARSON gdp_cap mil b_to_d literacy / HADI
PLOT GDP_CAP*B_TO_D*LITERACY / SPIKE XGRID YGRID AXES=BOOK,
SCALE=L SYMBOL=GROUP$ SIZE= 1.250 ,1.250 ,1.250
B_TO_D
2.533
LITERACY
88.183
MIL
B_TO_D
LITERACY
1.000
-0.753
0.642
1.000
-0.698
1.000
GDP_CAP
1.000
0.860
-0.839
0.729
Number of observations: 56
(3-D plot of GDP_CAP against B_TO_D and LITERACY, with points labeled E for European, I for Islamic, and N for New World countries.)
Fifteen countries are identified as outliers. We suspect that the sample may not be
homogeneous so we request a plot labeled by GROUP$. The panel is set to
PRINT=LONG; the country names appear because we specified COUNTRY$ as an ID
variable. The correlations at the end of the output are computed using the 30 or so cases
that are not identified as outliers.
In the plot, we see that Islamic countries tend to fall between New World and
European countries with respect to birth-to-death ratio and have the lowest literacy.
European countries have the highest literacy and GDP_CAP values.
For clarity, we edited the following output by moving the panels of means to the end:
The following results are for:
GROUP$
= Europe
These 1 outliers are identified:
Case
Distance
------------ -----------Portugal
5.72050
HADI estimated correlation matrix
GDP_CAP
MIL
B_TO_D
LITERACY
GDP_CAP
1.000
0.474
-0.092
0.259
MIL
B_TO_D
LITERACY
1.000
-0.173
0.263
1.000
0.136
1.000
Number of observations: 20
The following results are for:
GROUP$
= Islamic
HADI estimated correlation matrix
MIL
B_TO_D
LITERACY
1.000
0.882
0.605
1.000
0.649
1.000
MIL
B_TO_D
LITERACY
1.000
-0.287
0.561
1.000
-0.045
1.000
LITERACY
98.316
LITERACY
36.733
LITERACY
79.957
GDP_CAP
MIL
B_TO_D
LITERACY
GDP_CAP
1.000
0.877
0.781
0.600
Number of observations: 15
The following results are for:
GROUP$
= NewWorld
HADI estimated correlation matrix
GDP_CAP
MIL
B_TO_D
LITERACY
GDP_CAP
1.000
0.674
-0.246
0.689
Number of observations: 21
When computations are done separately for each group, Portugal is the only outlier,
and the within-groups correlations differ markedly from group to group and from those
for the complete sample. By scanning the means, we also see that the centroids for the
three groups are quite different.
Example 7
Spearman Correlations
As an example, we request Spearman correlations for the same data used in the Pearson
correlation and Transformations examples. It is often useful to compute both a
Spearman and a Pearson matrix using the same data. The absolute difference between
the two can reveal unusual features such as outliers and highly skewed distributions.
The input is:
USE ourworld
CORR
GRAPH = NONE
SPEARMAN urban birth_rt gdp_cap mil b_to_d,
lifeexpm lifeexpf literacy gnp_86 / PAIR
URBAN
1.000
-0.749
0.777
0.678
-0.381
0.731
0.771
0.760
0.767
LIFEEXPM
1.000
0.965
0.813
0.834
BIRTH_RT
GDP_CAP
MIL
B_TO_D
1.000
-0.874
-0.670
0.689
-0.856
-0.902
-0.868
-0.847
LIFEEXPF
1.000
0.848
-0.597
0.834
0.910
0.882
0.973
LITERACY
1.000
-0.498
0.633
0.709
0.696
0.867
GNP_86
1.000
-0.410
-0.501
-0.576
-0.543
1.000
0.866
0.901
1.000
0.909
1.000
Note that many of these correlations are closer to the Pearson correlations for the log-transformed data than they are to the correlations for the raw data.
Example 8
S2 and S3 Coefficients
The choice among the binary S measures depends on what you want to state about your
variables. In this example, we request S2 and S3 to study responses made by 256
subjects to a depression inventory (Afifi and Clark, 1984). These data are stored in the
SURVEY2 data file that has one record for each respondent with answers to 20
questions about depression. Each subject was asked, for example, Last week, did you
cry less than 1 day (code 0), 1 to 2 days (code 1), 3 to 4 days (code 2), or 5 to 7 days
(code 3)? The distributions of the answers appear to be Poisson, so they are not
The result is true (1) when the behavior or feeling is present or false (0) when it is
absent. We use SYSTAT's shortcut notation to do this for 7 of the 20 questions. For
each pair of feelings or behaviors, S2 indicates the proportion of subjects with both,
and S3 indicates the proportion of times both occurred given that one occurs. To
perform this example:
USE survey2
CORR
LET (blue,depress,cry,sad,no_eat,getgoing,talkless) = @ <> 0
GRAPH = NONE
S2 blue depress cry sad no_eat getgoing talkless
S3 blue depress cry sad no_eat getgoing talkless
BLUE
0.254
0.207
0.090
0.188
0.117
0.180
0.117
GETGOING
0.520
0.172
DEPRESS
CRY
SAD
NO_EAT
0.422
0.113
0.313
0.129
0.309
0.156
TALKLESS
0.133
0.117
0.051
0.086
0.059
0.391
0.137
0.258
0.145
0.246
0.152
0.098
DEPRESS
CRY
SAD
NO_EAT
1.000
0.257
0.625
0.239
0.488
0.305
TALKLESS
1.000
0.288
0.155
0.152
0.183
1.000
0.273
0.395
0.294
1.000
0.248
0.248
0.246
BLUE
1.000
0.442
0.303
0.410
0.306
0.303
0.306
GETGOING
1.000
0.289
1.000
The 2 x 2 table of counts for one such pair (in the a, b, c, d layout shown earlier) is:

     80     20
     28    128

For S2, the result is 80/256 = 0.313; for S3, 80/128 = 0.625.
Example 9
Tetrachoric Correlation
As an example, we use the bivariate normal data in the SYSTAT data file named
TETRA. The input is:
USE tetra
FREQ = count
CORR
TETRA x y
          X        Y
X      1.000
Y      0.810    1.000

Number of observations: 45
Computation
All computations are implemented in double precision.
Algorithms
The computational algorithms use provisional means, sums of squares, and cross-products (Spicer, 1972). Starting values for the EM algorithm use all available values
(see Little and Rubin, 1987, p. 42).
For the rank-order coefficients (Gamma, Mu2, Spearman, and Tau), keep in mind
that these are time consuming. Spearman requires sorting and ranking the data before
doing the same work done by Pearson. The Gamma and Mu2 items require
computations between all possible pairs of observations. Thus, their computing time is
combinatoric.
Missing Data
If you have missing data, CORR can handle them in three ways: listwise deletion,
pairwise deletion, and EM estimation. Listwise deletion is the default. If there are
missing data and pairwise deletion is used, SYSTAT displays a table of frequencies
between all possible pairs of variables after the correlation matrix.
Pairwise deletion takes considerably more computer time because the sums of
cross-products for each pair must be saved in a temporary disk file. If you use the
pairwise deletion to compute an SSCP matrix, the sums of squares and cross-products
are weighted by N/n, where N is the number of cases in the whole file and n is the
number of cases with nonmissing values in a given pair.
See Chapter II-1 for a complete discussion of handling missing values.
References
Afifi, A. A. and Clark, V. (1984). Computer-aided multivariate analysis. Belmont, Calif.:
Lifetime Learning Publications.
Faith, D. P., Minchin, P., and Belbin, L. (1987). Compositional dissimilarity as a robust
measure of ecological distance. Vegetatio, 69, 57–68.
Goodman, L. A. and Kruskal, W. H. (1954). Measures of association for cross-classification.
Journal of the American Statistical Association, 49, 732–764.
Gower, J. C. (1985). Measures of similarity, dissimilarity, and distance. In Kotz, S. and
Johnson, N. L. (eds.), Encyclopedia of Statistical Sciences, vol. 5. New York: John Wiley &
Sons, Inc.
Hadi, A. S. (1994). A modification of a method for the detection of outliers in multivariate
samples. Journal of the Royal Statistical Society, Series B, 56, no. 2.
Little, R. J. A. and Rubin, D. B. (1987). Statistical analysis with missing data. New York:
John Wiley & Sons, Inc.
Shye, S., ed. (1978). Theory construction and data analysis in the behavioral sciences. San
Francisco: Jossey-Bass, Inc.
Chapter 7
Correspondence Analysis
Leland Wilkinson
Statistical Background
Correspondence analysis is a method for decomposing a table of data into row and
column coordinates that can be displayed graphically. With this technique, a two-way
table can be represented in a two-dimensional graph with points for rows and
columns. These coordinates are computed with a Singular Value Decomposition
(SVD), which factors a matrix into the product of three matrices: a collection of left
singular vectors, a matrix of singular values, and a collection of right singular
vectors. Greenacre (1984) is the most comprehensive reference. Hill (1974) and
Jobson (1992) cover the major topics more briefly.
    z_ij = (1/√N) (o_ij − e_ij) / √(e_ij)

where N is the sum of the table counts for all n_ij, o_ij is the observed count for cell ij,
and e_ij is the expected count for cell ij based on an independence model. The second
term in this equation is a cell's contribution to the χ² test-for-independence statistic.
Thus, the sum of the squared z_ij over all cells in the table is the same as χ²/N.
Finally, the row mass for row i is n_i./N and the column mass for column j is n_.j/N.
The next step is to compute the matrix of cross-products from this matrix of
deviates:

    S = Z'Z

This S matrix has t = min(r − 1, c − 1) nonzero eigenvalues, where r and c are the
row and column dimensions of the original table, respectively. The sum of these
eigenvalues is χ²/N (which is termed the total inertia). It is this matrix that is
decomposed as follows:

    S = UDV'

where U is a matrix of row vectors, V is a matrix of column vectors, and D is a diagonal
matrix of the eigenvalues. The coordinates actually plotted are standardized from U
(for rows), so that
    χ²/N = Σ_i (n_i./N) Σ_{j=1..t} f_ij²

where f_ij is the plotted coordinate of row i on dimension j. For the multiple
correspondence model, the cross-products matrix S = Z'Z is again rescaled and
decomposed with a singular value decomposition, as before. See Jobson (1992) for
further information.
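A bare-bones numeric sketch of these steps for a small two-way table is shown below. It follows
the formulas above in spirit; the table is hypothetical, and the final rescaling of the coordinates
is one common convention rather than necessarily the one CORAN uses.

import numpy as np

table = np.array([[20.0,  5.0,  5.0],      # hypothetical counts
                  [ 6.0, 15.0,  9.0],
                  [ 4.0,  8.0, 18.0]])

N = table.sum()
row_mass = table.sum(axis=1) / N
col_mass = table.sum(axis=0) / N

expected = N * np.outer(row_mass, col_mass)
Z = (table - expected) / np.sqrt(N * expected)     # standardized deviates z_ij

U, d, Vt = np.linalg.svd(Z, full_matrices=False)
print(round((d ** 2).sum(), 4))                    # total inertia, chi-square / N

# principal coordinates: singular vectors rescaled by mass and singular value
row_coords = (U / np.sqrt(row_mass)[:, None]) * d
col_coords = (Vt.T / np.sqrt(col_mass)[:, None]) * d
print(np.round(row_coords[:, :2], 3))
print(np.round(col_coords[:, :2], 3))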
You can specify one of two methods for handling missing data:
n Pairwise deletion. Pairwise deletion examines each pair of variables and uses all
Using Commands
First, specify your data with USE filename. For a simple correspondence analysis,
continue with:
CORAN
MODEL depvar=indvar
ESTIMATE
If data are aggregated and there is a variable in the file representing frequency of
profiles, use FREQ to identify that variable.
Usage Considerations
Types of data. CORAN uses rectangular data only.
Print options. There are no print options.
Quick Graphs. Quick Graphs produced by CORAN are correspondence plots for the
simple or multiple models.
Saving files. For simple correspondence analysis, CORAN saves the row variable
coordinates in DIM(1)...DIM(N) and the column variable coordinates in
FACTOR(1)...FACTOR(N), where the subscript indicates the dimension number. For
multiple correspondence analysis, DIM(1)...DIM(N) contain the variable coordinates
and FACTOR(1)...FACTOR(N) contain the case coordinates. Label information is
saved to LABEL$.
BY groups. CORAN analyzes data by groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. FREQ=variable increases the number of cases by the FREQ variable.
Case weights. WEIGHT is not available in CORAN.
Examples
The examples begin with a simple correspondence analysis of a two-way table from
Greenacre (1984). This is followed by a multiple correspondence analysis example.
Example 1
Correspondence Analysis (Simple)
Here we illustrate a simple correspondence analysis model. The data comprise a
hypothetical smoking survey in a company (Greenacre, 1984). Notice that we use
value labels to describe the categories in the output and plot. The FREQ command
codes the cell frequencies. The input is:
USE SMOKE
LABEL STAFF / 1=Sr.Managers,2=Jr.Managers,3=Sr.Employees,
4=Jr.Employees,5=Secretaries
LABEL SMOKE / 1=None,2=Light,3=Moderate,4=Heavy
FREQ=FREQ
CORAN
MODEL STAFF=SMOKE
ESTIMATE
Sum
0.085
(Total Inertia)
Mass
0.057
0.093
0.264
0.456
0.130
Quality
0.893
0.991
1.000
1.000
0.999
Inertia
0.003
0.012
0.038
0.026
0.006
Factor 1
0.066
-0.259
0.381
-0.233
0.201
Factor 2
0.194
0.243
0.011
-0.058
-0.079
Factor 1
0.003
0.084
0.512
0.331
0.070
Factor 2
0.214
0.551
0.003
0.152
0.081
Factor 1
0.092
0.526
0.999
0.942
0.865
Factor 2
0.800
0.465
0.001
0.058
0.133
Mass
0.316
0.233
0.321
0.130
Quality
1.000
0.984
0.983
0.995
Inertia
0.049
0.007
0.013
0.016
Factor 1
0.654
0.031
0.166
0.150
Factor 2
0.029
0.463
0.002
0.506
Factor 1
0.994
0.327
0.982
0.684
Factor 2
0.006
0.657
0.001
0.310
Factor 1
0.393
-0.099
-0.196
-0.294
Factor 2
0.030
-0.141
-0.007
0.198
For the simple correspondence model, CORAN prints the basic statistics and
eigenvalues of the decomposition. Next are the row and column coordinates, with
mass, quality, and inertia values. Mass equals the marginal total divided by the grand
total. Quality is a measure (between 0 and 1) of how well a row or column point is
represented by the first two factors. It is a proportion-of-variance statistic. See
Greenacre (1984) for further information. Inertia is a row's (or column's) contribution
to the total inertia. Contributions to the factors and squared correlations with the factors
are the last reported statistics.
Example 2
Multiple Correspondence Analysis
This example uses automobile accident data in Alberta, Canada, reprinted in Jobson
(1992). The categories are ordered with the ORDER command so that the output will
show them in increasing order of severity. The data are in tabular form, so we use the
FREQ command. The input is:
USE ACCIDENT
FREQ=FREQ
ORDER INJURY$ / SORT=None,Minimal,Minor,Major
ORDER DRIVER$ / SORT=Normal,Drunk
ORDER SEATBELT$ / SORT=Yes,No
CORAN
MODEL INJURY$,DRIVER$,SEATBELT$
ESTIMATE
FREQ
(Total Inertia)
----------------------------------
Variable Coordinates
Name
None
Minimal
Minor
Major
Normal
Drunk
Yes
No
Mass
0.303
0.018
0.012
0.001
0.313
0.020
0.053
0.280
Quality
0.351
0.251
0.552
0.544
0.496
0.496
0.279
0.279
Factor 1
0.029
0.111
0.141
0.056
0.027
0.414
0.187
0.036
Factor 2
0.000
0.113
0.375
0.478
0.000
0.003
0.026
0.005
Factor 1
0.350
0.131
0.163
0.063
0.493
0.493
0.249
0.249
Factor 2
0.001
0.120
0.389
0.481
0.003
0.003
0.031
0.031
Factor 1
0.825
-0.779
-0.110
-1.713
-0.443
-2.047
-1.441
-3.045
0.825
-0.779
-0.110
-1.713
-0.443
-2.047
-1.441
-3.045
0.825
-0.779
-0.110
-1.713
-0.443
-2.047
-1.441
0.825
-0.779
-0.110
-1.713
-0.443
-1.441
-3.045
0.082
-1.521
Factor 2
-0.219
-0.349
-1.063
-1.193
1.676
1.547
-6.558
-6.687
-0.219
-0.349
-1.063
-1.193
1.676
1.547
-6.558
-6.687
-0.219
-0.349
-1.063
-1.193
1.676
1.547
-6.558
-0.219
-0.349
-1.063
-1.193
1.676
-6.558
-6.687
0.057
-0.073
Case coordinates
Name
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
Inertia
0.031
0.315
0.322
0.332
0.020
0.313
0.280
0.053
Factor 1
0.189
-1.523
-2.134
-3.962
0.179
-2.758
1.143
-0.217
Factor 2
0.008
-1.454
3.294
-10.976
0.014
-0.211
-0.402
0.076
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
-0.853
-2.456
-1.186
-2.790
-2.184
-3.788
0.082
-1.521
-0.853
-2.456
-1.186
-2.790
-2.184
-3.788
0.082
-1.521
-0.853
-2.456
-1.186
-2.790
-2.184
-3.788
0.082
-1.521
-0.853
-2.456
-1.186
-2.790
-2.184
-3.788
-0.787
-0.916
1.953
1.823
-6.281
-6.411
0.057
-0.073
-0.787
-0.916
1.953
1.823
-6.281
-6.411
0.057
-0.073
-0.787
-0.916
1.953
1.823
-6.281
-6.411
0.057
-0.073
-0.787
-0.916
1.953
1.823
-6.281
-6.411
This time, we get case coordinates instead of column coordinates. These are not
included in the following Quick Graph because the focus of the graph is on the tabular
variables and we don't want to clutter the display. If you want to plot case coordinates,
cut and paste them into the editor and plot them directly.
Following is the Quick Graph:
The graph reveals a principal axis of major versus minor injuries. This axis is related
to drunk driving and seat belt use.
Computation
All computations are in double precision.
Algorithms
CORAN uses a singular value decomposition of the cross-products matrix computed
from the data.
Missing Data
Cases with missing data are deleted from all analyses.
References
Greenacre, M. J. (1984). Theory and applications of correspondence analysis. New York:
Academic Press.
Hill, M. O. (1974). Correspondence analysis: A neglected multivariate method. Applied
Statistics, 23, 340–354.
Jobson, J. D. (1992). Applied multivariate data analysis, Vol. II: Categorical and
multivariate methods. New York: Springer-Verlag.
Chapter 8
Crosstabulation
Statistical Background
Tables report results as counts or the number of cases falling in specific categories or
cross-classifications. Categories may be unordered (democrat, republican, and
independent), ordered (low, medium, and high), or formed by defining intervals on a
continuous variable like AGE (child, teen, adult, and elderly).
Making Tables
There are many formats for displaying tabular data. Lets examine basic layouts for
counts and percentages.
One-Way Tables
Here is an example of a table showing the number of people of each gender surveyed
about depression at UCLA in 1980.
            Female    Male
           +-----------------+
           |   152      104  |
           +-----------------+
   Total          256
The categorical variable producing this table is SEX$. Sometimes, you may define
categories as intervals of a continuous variable. Here is an example showing the 256
people broken down by age.
            18 to 30   30 to 45   46 to 60   Over 60
           +------------------------------------------+
           |    79         80         64        33    |
           +------------------------------------------+
   Total          256
Two-Way Tables
A crosstabulation is a table that displays one cell for every combination of values on
two or more categorical variables. Here is a two-way table that crosses the gender and
age distributions of the tables above.
                Female    Male        Total
               +-----------------+
   18 to 30    |   49       30   |      79
   30 to 45    |   48       32   |      80
   46 to 60    |   38       26   |      64
   Over 60     |   17       16   |      33
               +-----------------+
   Total          152      104         256
This crosstabulation shows relationships between age and gender, which were
invisible in the separate tables. Notice, for example, that the sample contains a large
number of females below the age of 46. To see the pattern more clearly, we can
standardize the counts within each row (age group) and display row percentages:
              Female      Male       Total        N
            +--------------------+
 18 to 30   |  62.025    37.975  |   100.000      79
 30 to 45   |  60.000    40.000  |   100.000      80
 46 to 60   |  59.375    40.625  |   100.000      64
 Over 60    |  51.515    48.485  |   100.000      33
            +--------------------+
 Total         59.375    40.625     100.000     256
 N                152       104
Here we see that as age increases, the sample becomes more evenly dispersed across
the two genders.
On the other hand, if we are interested in the overall distribution of age for each
gender, we might want to standardize within columns:
              Female      Male       Total        N
            +--------------------+
 18 to 30   |  32.237    28.846  |    30.859      79
 30 to 45   |  31.579    30.769  |    31.250      80
 46 to 60   |  25.000    25.000  |    25.000      64
 Over 60    |  11.184    15.385  |    12.891      33
            +--------------------+
 Total        100.000   100.000     100.000     256
 N                152       104
One-Way Tables
A model for these data might be that the proportion of the males and females is equal
in the population. The null hypothesis corresponding to the model is:
H0: p(males) = p(females)
The sampling model for testing this hypothesis requires that a population contains
equal numbers of males and females and that each member of the population has an
equal chance of being chosen. After choosing each person, we identify the person as
male or female. There is no other category possible and one person cannot fit under
both categories (exhaustive and mutually exclusive).
There is an exact way to reject our null hypothesis (called a permutation test). We
can tally every possible sample of size 256 (including one with no females and one
with no males). Then we can sort our samples into two piles: samples in which there
are between 40.625% and 59.375% females and samples in which there are
not. If the latter pile is extremely small relative to the former, we can reject the null
hypothesis.
Needless to say, this would be a tedious undertaking, particularly on a
microcomputer. Fortunately, there is an approximation using a continuous probability
distribution that works quite well. First, we need to calculate the expected count of
males and females, respectively, in a sample of size 256 if p is 0.5. This is 128, or half
the sample N. Next, we subtract the observed counts from these expected counts,
square them, and divide by the expected:
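In symbols, the statistic is the sum of (observed − expected)² / expected over the cells. For this sample,

    chi-square = (152 − 128)²/128 + (104 − 128)²/128 = 4.5 + 4.5 = 9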
If our assumptions about the population and the structure of the table are correct, then
this statistic will be distributed as a mathematical chi-square variable. We can look up
the area under the tail of the chi-square statistic beyond the sample value we calculate
and if this area is small (say, less than 0.05), we can reject the null hypothesis.
To look up the value, we need a degrees of freedom (df) value. This is the number
of independent values being added together to produce the chi-square. In our case, it is
1, since the observed proportion of men is simply 1 minus the observed proportion of
women. If there were three categories (men, women, other?), then the degrees of
freedom would be 2. Anyway, if you look up the value 9 with one degree of freedom
in your chi-square table, you will find that the probability of exceeding this value is
exceedingly small. Thus, we reject our null hypothesis that the proportion of males
equals the proportion of females in the population.
This chi-square approximation is good only for large samples. A popular rule of
thumb is that the expected counts should be greater than 5, although they should be
even greater if you want to be comfortable with your test. With our sample, the
difference between the approximation and the exact result is negligible. For both, the
probability is small.
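As an aside (this is not part of the SYSTAT input), both the exact binomial probability and the chi-square approximation can be checked with a few lines of Python, assuming the scipy library is available:

from scipy.stats import binom, chi2

n, females = 256, 152                  # observed split of the sample
# exact two-sided probability of a split at least this lopsided when p = 0.5
p_exact = 2 * binom.sf(females - 1, n, 0.5)
# chi-square approximation described in the text: a value of 9 on 1 df
p_approx = chi2.sf(9, 1)
print(p_exact, p_approx)               # both probabilities are far below 0.05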
Our hypothesis test has an associated confidence interval. You can use SYSTAT to
compute this interval on the population data. Here is the result:
95 percent approximate confidence intervals scaled as cell percents
Values for SEX$
Female
Male
+-----------------+
| 66.150 47.687 |
| 52.064 33.613 |
+-----------------+
The lower limit for each gender is on the bottom; the upper limits are on the top. Notice
that these two intervals do not overlap.
Two-Way Tables
The most familiar test available for two-way tables is the Pearson chi-square test for
independence of table rows and columns. When the table has only two rows or two
columns, the chi-square test is also a test for equality of proportions. The concept of
interaction in a two-way frequency table is similar to the one in analysis of variance. It
is easiest to see in an example. Schachter (1959) randomly assigned 30 subjects to one
of two groups: High Anxiety (17 subjects), who were told that they would be
experiencing painful shocks, and Low Anxiety (13 subjects), who were told that they
would experience painless shocks. After the assignment, each subject was given the
choice of waiting alone or with the other subjects. The following tables illustrate two
possible outcomes of this study.
         No Interaction                       Interaction
                  WAIT                                  WAIT
 ANXIETY     Alone   Together          ANXIETY     Alone   Together
 High          8        9              High          5       12
 Low           6        7              Low           9        4
Notice in the table on the left that the number choosing to wait together relative to those
choosing to wait alone is similar for both High and Low Anxiety groups. In the table on
the right, however, more of the High Anxiety group chose to wait together.
We are interpreting these numbers relatively, so we should compute row
percentages to understand the differences better. Here are the same tables standardized
by rows:
         No Interaction                       Interaction
                  WAIT                                  WAIT
 ANXIETY     Alone   Together          ANXIETY     Alone   Together
 High         47.1     52.9            High         29.4     70.6
 Low          46.2     53.8            Low          69.2     30.8
Now we can see that the percentages are similar in the two rows in the table on the left
(No Interaction) and quite different in the table on the right (Interaction). A simple
graph reveals these differences even more strongly. In the following figure, the No
Interaction row percentages are plotted on the left.
Notice that the lines cross in the Interaction plot, showing that the rows differ. There
is almost complete overlap in the No Interaction plot.
Now, in the one-way table example above, we tested the hypothesis that the cell
proportions were equal in the population. We can test an analogous hypothesis in this
context: that each of the four cells contains 25 percent of the population. The problem
with this assumption is that we already know that Schachter randomly assigned more
people to the High Anxiety group. In other words, we should take the row marginal
percentages (or totals) as fixed when we determine what proportions to expect in the
cells from a random model.
Our No Interaction model is based on these fixed marginals. In fact, we can fix
either the row or column margins to compute a No Interaction model because the total
number of subjects is fixed at 30. You can verify that the row and column sums in the
above tables are the same.
Now we are ready to compute our chi-square test of interaction (often called a test
of independence) in the two-way table by using the No Interaction counts as expected
counts in our chi-square formula above. This time, our degrees of freedom are still 1
because the marginal counts are fixed. If you know the marginal counts, then one cell
count determines the remaining three. In general, the degrees of freedom for this test
are (rows − 1) times (columns − 1).
Here is the result of our chi-square test. The chi-square is 4.693, with a p of 0.03.
On this basis, we reject our No Interaction hypothesis.
ANXIETY (rows) by WAIT$ (columns)

             Alone   Together      Total
         +--------------------+
 High    |   5.000     12.000 |    17.000
 Low     |   9.000      4.000 |    13.000
         +--------------------+
 Total      14.000     16.000      30.000

 Test statistic                        Value        df      Prob
 Pearson Chi-square                    4.693     1.000     0.030
 Likelihood ratio Chi-square           4.810     1.000     0.028
 McNemar Symmetry Chi-square           0.429     1.000     0.513
 Yates corrected Chi-square            3.229     1.000     0.072
 Fisher exact test (two-tail)                              0.063
Actually, we cheated. The program computed the expected counts from the observed
data. These are not exactly the ones we showed you in the No Interaction table. They
differ by rounding error in the first decimal place. You can compute them exactly. The
popular method is to multiply the total row count times the total column count
corresponding to a cell and dividing by the total sample size. For the upper left cell,
this would be 17*14/30 = 7.93.
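The same arithmetic is easy to verify outside SYSTAT. The following short Python sketch (not part of the manual's example; it assumes the scipy library is available) reproduces the expected counts and the Pearson chi-square of 4.693 reported above:

import numpy as np
from scipy.stats import chi2_contingency

# Schachter data: rows = High/Low anxiety, columns = wait Alone/Together
observed = np.array([[5, 12],
                     [9,  4]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(expected)                  # upper-left cell is 17*14/30 = 7.93
print(chi2_stat, dof, p_value)   # about 4.693 on 1 df, p about 0.03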
There is one other interesting problem with these data. The chi-square is only an
approximation and it does not work well for small samples. Although these data meet
the minimum expected count of 5, they are nevertheless problematic. Look at the
Fisher's exact test result in the output above. Like our permutation test above, which
was so cumbersome for large data files, Fisher's test counts all possible outcomes
exactly, including the ones that produce interaction greater than what we observed. The
Fisher exact test p value is not significant (0.063). On this basis, we could not reject
the null hypothesis of no interaction, or independence.
Yates' corrected chi-square test in the output is an attempt to adjust the Pearson chi-square
statistic for small samples. While it has come into disfavor for being unnecessarily
conservative in many instances, nevertheless, the Yates p value is consistent with
Fisher's in this case (0.072). Likelihood-ratio chi-square is an alternative to the
Pearson chi-square and is used as a test statistic for log linear models.
Selecting a Test or Measure
Other tests and measures are appropriate for specific table structures and also depend
on whether or not the categories of the factor are ordered. We use 2 × 2 to denote a
table with two rows and two columns, and r × c for a table with r rows and c columns.
The Pearson and likelihood-ratio chi-square statistics apply to r × c tables; the
categories need not be ordered.
McNemar's test of symmetry is used for r × r square tables (the number of rows
equals the number of columns). This structure arises when the same subjects are
measured twice as in a paired comparisons t test (say before and after an event) or when
subjects are paired or matched (cases and controls). So the row and column categories
are the same, but they are measured at different times or circumstances (like the paired
t) or for different groups of subjects (cases and controls). This test ignores the counts
along the diagonal of the table and tests whether the counts in cells above the diagonal
differ from those below the diagonal. A significant result indicates a greater change in
one direction than another. (The counts along the diagonal are for subjects who did not
change.)
The table structure for Cohen's kappa looks like that of McNemar's in that the row
and column categories are the same. But here the focus shifts to the diagonal: Are the
counts along the diagonal significantly greater than those expected by chance alone?
Because each subject is classified or rated twice, kappa is a measure of interrater
agreement.
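For reference, kappa has the standard form

    kappa = (Po − Pe) / (1 − Pe)

where Po is the observed proportion of cases on the diagonal and Pe is the proportion expected there by chance if the two ratings were independent.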
Another difference between McNemar and kappa is that the former is a test with
a chi-square statistic, degrees of freedom, and an associated p value, while the latter is
a measure of agreement.
Crosstabulations in SYSTAT
One-Way Frequency Tables Main Dialog Box
To open the One-Way Frequency Tables dialog box, from the menus choose:
Statistics
Tables
Crosstabs
One-way
One-way frequency tables provides frequency counts, percentages, tests, etc., for
single table factors or categorical variables.
n Tables. Tables can include frequency counts, percentages, and confidence intervals.
n Options. You can include counts and percentages for cases with missing data; SYSTAT treats
this category in the same fashion as the other categories. In addition, you can
display output in a listing format instead of a tabular display. The listing includes
counts, cumulative counts, percentages, and cumulative percentages.
n Save last table as data file. Saves the table for the last variable in the Variable(s) list
as a SYSTAT data file.
Two-Way Frequency Tables Main Dialog Box
Two-way frequency tables crosstabulate one or more categorical row variables with a
categorical column variable.
n Row variable(s). The variables displayed in the rows of the crosstabulation. Each
row variable is crossed with the column variable.
n Options. You can include counts and percentages for cases with missing data. In
addition, you can display output in a listing format instead of a tabular display. The
listing includes counts, cumulative counts, percentages, and cumulative
percentages for each combination of row and column variable categories.
n Save last table as data file. Saves the crosstabulation of the column variable with
the last variable in the row variable(s) list as a SYSTAT data file. For each cell of
the table, SYSTAT saves a record with the cell frequency and the row and column
category values.
Pearson chi-square. For tables with any number of rows and columns, tests for
independence of the row and column variables.
2 x 2 tables. For tables with two rows and two columns, available tests are:
n Yates corrected chi-square. Adjusts the Pearson chi-square statistic for small
samples.
n Fisher's exact test. Counts all possible outcomes exactly. When the expected cell
sizes are small (less than 5), use this test as an alternative to the Pearson chi-square.
n McNemar's test of symmetry. For square tables, tests whether
the counts above the table diagonal differ from those below the diagonal. Small
probability values indicate a greater change in one direction.
n Cohen's kappa. Commonly used to measure agreement between two judges rating
the same objects. Tests whether the diagonal counts are larger than expected.
Values of kappa greater than 0.75 indicate strong agreement beyond chance, values
between 0.40 and 0.75 indicate fair to good, and values below 0.40 indicate poor
agreement.
r x c tables, unordered levels. For tables with any number of rows or columns with no
assumed category order, available tests are:
n Phi. A chi-square based measure of association. Values may exceed 1.
n Cramér's V. A measure of association based on the chi-square. The value ranges
between 0 and 1, with 0 indicating independence between the row and column
variables and values close to 1 indicating dependence between the variables.
n Contingency coefficient. A measure of association based on the chi-square, similar
to Cramér's V.
n Goodman-Kruskal lambda. Values
indicate the proportional reduction in error when values of one variable are used to
predict values of the other variable. Values near 0 indicate that the row variable is
no help in predicting the column variable.
n Likelihood-ratio chi-square. An alternative to the Pearson chi-square, primarily
used as a test statistic for loglinear models.
n Spearman's rho. Similar to the Pearson correlation coefficient, but uses the ranks of
the data. Like other measures of
association between two ordinal variables, it ranges between −1 and +1. Values close to 0 indicate little or no
relationship.
n Somers' d. An asymmetric measure of association between two ordinal variables.
n Strata variable(s). If strata are separate, a crosstabulation is produced for
each value of each strata variable. If strata are crossed, a separate crosstabulation
is produced for each unique combination of strata variable values. For example, if
you have two strata variables, each with five categories, Separate will produce 10
tables and Crossed will produce 25 tables.
n Options. You can include counts and percentages for cases with missing data and
save the last table produced as a SYSTAT data file. In addition, you can display
output in a listing format, including percentages and cumulative percentages,
instead of a tabular display.
n Display. You can display frequencies, total percentages, row percentages, and
column percentages. Furthermore, you can use the Mantel-Haenszel test for 2 × 2
subtables to test for an association between two binary variables while controlling
for another variable.
Using Commands
For one-way tables in XTAB, specify:
USE filename
XTAB
PRINT / FREQ CHISQ LIST PERCENT ROWPCT COLPCT
TABULATE varlist / CONFI=n MISS
Usage Considerations
Types of data. There are two ways to organize data for tables:
n The usual cases-by-variables rectangular data file
n Cell counts with cell identifiers
For example, you may want to analyze the following table reflecting application results
by gender for business schools:
              Admitted     Denied
 Male              420         90
 Female            150         25
 GENDER$     STATUS$
 female      admit
 male        deny
 male        admit
 female      deny
 male        admit
Instead of entering one case for each of the 685 applicants, you could use the second
method to enter four cases:
 GENDER$     STATUS$     COUNT
 male        admit         420
 male        deny           90
 female      admit         150
 female      deny            25
For this method, the cell counts in the third column are identified by designating
COUNT as a FREQUENCY variable.
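Outside SYSTAT, the same count-form layout is convenient to work with. The sketch below (purely illustrative; it assumes the pandas library and uses simplified column names) expands the four count records back into the two-way admissions table:

import pandas as pd

# cell counts with cell identifiers (the second input method)
counts = pd.DataFrame({
    "GENDER": ["male", "male", "female", "female"],
    "STATUS": ["admit", "deny", "admit", "deny"],
    "COUNT":  [420, 90, 150, 25],
})

# rebuild the admissions-by-gender table from the counts
table = counts.pivot_table(index="GENDER", columns="STATUS",
                           values="COUNT", aggfunc="sum")
print(table)
print(table.to_numpy().sum())    # 685 applicants in total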
Print options. Three levels of output are available. Statistics produced depend on the
dimensionality of the table. PRINT SHORT yields frequency tables for all tables and
Pearson chi-square for one-way and two-way tables. The MEDIUM length yields all
statistics appropriate for the dimensionality of a two-way or multiway table. LONG
adds expected cell values, deviates, and standardized deviates to the SHORT and
MEDIUM output.
Quick Graphs. Frequency tables produce no Quick Graphs.
Saving files. You can save the frequency counts to a file. For two-way tables, cell
values, deviates, and standardized deviates are also saved.
BY groups. Use of a BY variable yields separate frequency tables (and corresponding
statistics) for each level of the BY variable.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. XTAB uses the FREQUENCY variable to duplicate cases. This is the
preferred method of input when the data are aggregated.
Case weights. WEIGHT is available for frequency tables.
Examples
Example 1
One-Way Tables
This example uses questionnaire data from a community survey (Afifi and Clark,
1984). The SURVEY2 data file includes a record (case) for each of the 256 subjects in
the sample. We request frequencies for gender, marital status, and religion. The values
of these variables are numbers, so we add character identifiers for the categories. The
input is:
USE survey2
XTAB
LABEL sex      / 1=Male, 2=Female
LABEL marital  / 1=Never, 2=Married, 3=Divorced,
                 4=Separated
LABEL religion / 1=Protestant, 2=Catholic, 3=Jewish,
                 4=None, 6=Other
PRINT NONE / FREQ
TABULATE sex marital religion
If the words male and female were stored in the variable SEX$, you would omit LABEL
and tabulate SEX$ directly. If you omit LABEL and specify SEX, the numbers would
label the output.
n When using the Label dialog box, you can omit quotation marks around category
names. With commands, you can omit them if the name has no embedded blanks
or symbols (the name, however, is displayed in uppercase letters).
The output follows:
Frequencies
Values for SEX
Male Female
+---------------+
|
104
152 |
+---------------+
Total
256
Frequencies
Values for MARITAL
Never
Married Divorced Separated
+-----------------------------------------+
|
73
127
43
13 |
+-----------------------------------------+
Total
256
Frequencies
Values for RELIGION
Protestant
Catholic
Jewish
None
Other
+--------------------------------------------------------+
|
133
46
23
52
2 |
+--------------------------------------------------------+
Total
256
In this sample of 256 subjects, 152 are females, 127 are married, and 133 are
Protestants.
List Layout
List layout produces an alternative layout for the same information. Percentages and
cumulative percentages are part of the display. The input is:
USE survey2
XTAB
LABEL sex      / 1=Male, 2=Female
LABEL marital  / 1=Never, 2=Married, 3=Divorced,
                 4=Separated
LABEL religion / 1=Protestant, 2=Catholic, 3=Jewish,
                 4=None, 6=Other
PRINT NONE / LIST
TABULATE sex marital religion
PRINT
You can also use TABULATE varlist / LIST as an alternative to PRINT NONE / LIST. The
output follows:
                Cum            Cum
 Count         Count    Pct    Pct   SEX
   104.         104.   40.6   40.6   Male
   152.         256.   59.4  100.0   Female

                Cum            Cum
 Count         Count    Pct    Pct   MARITAL
    73.          73.   28.5   28.5   Never
   127.         200.   49.6   78.1   Married
    43.         243.   16.8   94.9   Divorced
    13.         256.    5.1  100.0   Separated

                Cum            Cum
 Count         Count    Pct    Pct   RELIGION
   133.         133.   52.0   52.0   Protestant
    46.         179.   18.0   69.9   Catholic
    23.         202.    9.0   78.9   Jewish
    52.         254.   20.3   99.2   None
     2.         256.     .8  100.0   Other
Almost 60% (59.4) of the subjects are female, approximately 50% (49.6) are married,
and more than half (52%) are Protestants.
Example 2
Two-Way Tables
This example uses the SURVEY2 data to crosstabulate marital status against religion.
The input is:
USE survey2
XTAB
LABEL marital  / 1=Never, 2=Married, 3=Divorced, 4=Separated
LABEL religion / 1=Protestant, 2=Catholic, 3=Jewish, 4=None, 6=Other
TABULATE marital * religion
              Protestant   Catholic   Jewish   None   Other      Total
            +---------------------------------------------------+
 Never      |        29         16        8     20       0      |    73
 Married    |        75         21       11     19       1      |   127
 Divorced   |        21          6        3     13       0      |    43
 Separated  |         8          3        1      0       1      |    13
            +---------------------------------------------------+
 Total             133         46       23     52       2          256
In the sample of 256 people, 73 never married. Of the people that have never married,
29 are Protestants (the cell in the upper left corner), and none are in the Other category
(their religion is not among the first four categories). The Totals (or marginals) along
the bottom row and down the far right column are the same as the values displayed for
one-way tables.
Note that LABEL and SELECT remain in effect until you turn them off. If you request
several different tables, use SELECT to ensure that the same cases are used in all tables.
The subset of cases selected via LABEL applies only to those tables that use the
variables specified with LABEL. To turn off the LABEL specification for RELIGION, for
example, specify:
LABEL religion
We continue from the last table, eliminating the last category codes for MARITAL and
RELIGION:
SELECT marital <> 4 AND religion <> 6
TABULATE marital * religion
SELECT
              Protestant   Catholic   Jewish   None      Total
 Never               29         16        8     20          73
 Married             75         21       11     19         126
 Divorced            21          6        3     13          43
 Total              125         43       22     52         242
List Layout
Following is the panel for marital status crossed with religious preference:
USE survey2
XTAB
LABEL marital / 1=Never, 2=Married, 3=Divorced
LABEL religion / 1=Protestant, 2=Catholic, 3=Jewish,
4=None
PRINT NONE / LIST
TABULATE marital * religion
PRINT
                Cum            Cum
 Count         Count    Pct    Pct   MARITAL     RELIGION
    29.          29.   12.0   12.0   Never       Protestant
    16.          45.    6.6   18.6   Never       Catholic
     8.          53.    3.3   21.9   Never       Jewish
    20.          73.    8.3   30.2   Never       None
    75.         148.   31.0   61.2   Married     Protestant
    21.         169.    8.7   69.8   Married     Catholic
    11.         180.    4.5   74.4   Married     Jewish
    19.         199.    7.9   82.2   Married     None
    21.         220.    8.7   90.9   Divorced    Protestant
     6.         226.    2.5   93.4   Divorced    Catholic
     3.         229.    1.2   94.6   Divorced    Jewish
    13.         242.    5.4  100.0   Divorced    None
Example 3
Frequency Input
Crosstabs, like other SYSTAT procedures, reads cases-by-variables data from a
SYSTAT file. However, if you want to analyze a table from a report or a journal article,
you can enter the cell counts directly. This example uses counts from a four-way table
of a breast cancer study of 764 women. The data are from Morrison et al. (1973), cited
in Bishop, Fienberg, and Holland (1975). There is one record for each of the 72 cells
in the table, with the count (NUMBER) of women in the cell and codes or category
names to identify their age group (under 50, 50 to 69, and 70 or over), treatment center
(Tokyo, Boston, or Glamorgan), survival status (dead or alive), and tumor diagnosis
(minimal inflammation and benign, maximum inflammation and benign, minimal
inflammation and malignant, and maximum inflammation and malignant). This
example illustrates how to form a two-way table of AGE by CENTER$.
 Test statistic                        Value        df      Prob
 Pearson Chi-square                   74.039     4.000     0.000

 (Marginal totals for the table: 253, 221, and 290; grand total 764.)
Of the 764 women studied, 290 were treated in Tokyo. Of these women, 151 were in
the youngest age group, and 19 were in the 70 or over age group.
Example 4
Missing Category Codes
You can choose whether or not to include a separate category for missing codes. For
example, if some subjects did not check male or female on a form, there would be
three categories for SEX$: male, female, and blank (missing). By default, when values
of a table factor are missing, SYSTAT does not include a category for missing values.
In the OURWORLD data file, some countries did not report the GNP to the United
Nations. In this example, we include a category for missing values, and then we follow
this request with a table that omits the category for missing. The input follows:
USE ourworld
XTAB
TABULATE group$ * gnp$ / MISS
LABEL gnp$ / D=Developed, U=Emerging
TABULATE group$ * gnp$
(With the missing category included, the row totals are 20, 16, and 21, for a total of 57 countries.)

Frequencies
GROUP$ (rows) by GNP$ (columns)

              Developed   Emerging      Total
            +----------------------+
 Europe     |       17          0  |        17
 Islamic    |        4         10  |        14
 NewWorld   |       15          5  |        20
            +----------------------+
 Total             36         15            51
List Layout
To create a listing of the counts in each cell of the table:
PRINT / LIST
TAB group$ * gnp$
PRINT
                Cum            Cum
 Count         Count    Pct    Pct   GROUP$      GNP$
    17.          17.   33.3   33.3   Europe      Developed
     4.          21.    7.8   41.2   Islamic     Developed
    10.          31.   19.6   60.8   Islamic     Emerging
    15.          46.   29.4   90.2   NewWorld    Developed
     5.          51.    9.8  100.0   NewWorld    Emerging
Example 5
Percentages
Percentages are helpful for describing categorical variables and interpreting relations
between table factors. Crosstabs prints tables of percentages in the same layout as
described for frequency counts. That is, each frequency count is replaced by the
percentage. Percentages are computed by dividing each cell frequency by:
n The total count for the table (percents of the total count)
n The total count for the row (row percents)
n The total count for the column (column percents)
In this example, we request all three percentages using the following input:
USE ourworld
XTAB
LABEL gnp$ / 'D'='Developed', 'U'='Emerging'
PRINT NONE / ROWP COLP PERCENT
TABULATE group$ * gnp$
Percents of the total count
GROUP$ (rows) by GNP$ (columns)

              Developed   Emerging      Total        N
            +----------------------+
 Europe     |    33.333      0.000 |    33.333      17
 Islamic    |     7.843     19.608 |    27.451      14
 NewWorld   |    29.412      9.804 |    39.216      20
            +----------------------+
 Total          70.588     29.412    100.000       51
 N                  36         15

Row percents
GROUP$ (rows) by GNP$ (columns)

              Developed   Emerging      Total        N
            +----------------------+
 Europe     |   100.000      0.000 |   100.000      17
 Islamic    |    28.571     71.429 |   100.000      14
 NewWorld   |    75.000     25.000 |   100.000      20
            +----------------------+
 Total          70.588     29.412    100.000       51
 N                  36         15

Column percents
GROUP$ (rows) by GNP$ (columns)

              Developed   Emerging      Total        N
            +----------------------+
 Europe     |    47.222      0.000 |    33.333      17
 Islamic    |    11.111     66.667 |    27.451      14
 NewWorld   |    41.667     33.333 |    39.216      20
            +----------------------+
 Total         100.000    100.000    100.000       51
 N                  36         15
Missing Categories
Notice how the row percentages change when we include a category for the missing
GNP:
PRINT NONE / ROWP
LABEL gnp$ / =Missing, D=Developed, U=Emerging
TABULATE group$ * gnp$
PRINT
(Row percents with the missing GNP category included: each row again totals 100.000,
with N = 20, 16, and 21 for Europe, Islamic, and NewWorld; overall N = 57.)
Here we see that 62.5% of the Islamic nations are classified as emerging. However,
from the earlier table of row percentages, it might be better to say that among the
Islamic nations reporting the GNP, 71.43% are emerging.
Example 6
Multiway Tables
When you have three or more table factors, Crosstabs forms a series of two-way tables
stratified by all combinations of values of the third, fourth, and so on, table factors. The
order in which you choose the table factors determines the layout. Your input can be
the usual cases-by-variables data file or the cell counts with category values.
The input is:
USE cancer
XTAB
FREQ = number
LABEL age
/
The last two factors selected (CENTER$ and AGE) define two-way tables. The levels
of the first two factors define the strata. After the table is run, we edited the output and
moved the four tables for SURVIVE$ = Dead next to those for Alive.
Frequencies
CENTER$ (rows) by AGE (columns)
SURVIVE$   = Alive
TUMOR$     = MinBengn

              Under 50   50 to 69   70 & Over      Total
            +-------------------------------+
 Tokyo      |      68         46          6 |        120
 Boston     |      24         58         26 |        108
 Glamorgn   |      20         39         11 |         70
            +-------------------------------+
 Total            112        143         43          298
(Similar two-way tables of CENTER$ by AGE follow for each of the remaining
SURVIVE$ and TUMOR$ combinations.)

List Layout
To create a listing of the counts in each cell of the table:
PRINT / LIST
TABULATE survive$ * center$ * age * tumor$
(The listing shows, for each non-empty cell of the four-way table, its count, cumulative
count, percentage, and cumulative percentage, together with its SURVIVE$, CENTER$,
AGE, and TUMOR$ category values. The cells for women who survived are listed first.)
The 35 cells for the women who survived are listed first (the cell for Boston women
under 50 years old with MaxBengn tumors is empty). In the Cum Pct column, we see
that these women make up 72.5% of the sample. Thus, 27.5% did not survive.
Percentages
While list layout provides percentages of the total table count, you might want others.
Here we specify COLPCT in Crosstabs to print the percentage surviving within each
age-by-center stratum. The input is:
PRINT NONE / COLPCT
TABULATE age * center$ * survive$ * tumor$
PRINT
For each age-by-center stratum, the Total column of the output gives the overall
percentage surviving:

 AGE          CENTER$      Alive      Dead        N
 Under 50     Tokyo       84.768    15.232      151
 Under 50     Boston      67.241    32.759       58
 Under 50     Glamorgn    63.380    36.620       71
 50 to 69     Tokyo       74.167    25.833      120
 50 to 69     Boston      72.951    27.049      122
 50 to 69     Glamorgn    73.394    26.606      109
 70 & Over    Tokyo       68.421    31.579       19
 70 & Over    Boston      58.904    41.096       73
 70 & Over    Glamorgn    68.293    31.707       41
Example 7
Two-Way Table Statistics
For the SURVEY2 data, you study the relationship between marital status and age. This
is a general table: while the categories for AGE are ordered, those for MARITAL are
not. The usual Pearson chi-square statistic is used to test the association between the
two factors. This statistic is the default for Crosstabs.
The data file is the usual cases-by-variables rectangular file with one record for each
person. We split the continuous variable AGE into four categories and add names such
as 30 to 45 for the output. There are too few separated people to tally, so here we
eliminate them and reorder the categories of MARITAL that remain. To supplement the
results, we request row percentages. The input is:
USE survey2
XTAB
LABEL age /
Frequencies
AGE (rows) by MARITAL (columns)

              Married   Divorced    Never      Total
            +------------------------------+
 18 to 29   |     17         5        53    |      75
 30 to 45   |     48        21         9    |      78
 46 to 60   |     39        12         8    |      59
 Over 60    |     23         5         3    |      31
            +------------------------------+
 Total           127        43        73          243
Row percents
AGE (rows) by MARITAL (columns)

              Married   Divorced    Never      Total        N
            +------------------------------+
 18 to 29   |  22.667     6.667    70.667   |   100.000      75
 30 to 45   |  61.538    26.923    11.538   |   100.000      78
 46 to 60   |  66.102    20.339    13.559   |   100.000      59
 Over 60    |  74.194    16.129     9.677   |   100.000      31
            +------------------------------+
 Total         52.263    17.695    30.041      100.000     243
 N                127        43        73

 Test statistic                        Value        df      Prob
 Pearson Chi-square                   87.761     6.000     0.000
Even though the chi-square statistic is highly significant (87.761; p value < 0.0005), in
the Row percentages table, you see that 70.67% of the youngest age group fall into the
never-married category. Many of these people may be too young to consider marriage.
Eliminating a Stratum
If you eliminate the subjects in the youngest group, is there an association between
marital status and age? To address this question, the input is:
SELECT age > 29
PRINT / CHISQ PHI CRAMER CONT ROWPCT
TABULATE age * marital
SELECT
Row percents
AGE (rows) by MARITAL (columns)

              Married   Divorced    Never      Total        N
            +------------------------------+
 30 to 45   |  61.538    26.923    11.538   |   100.000      78
 46 to 60   |  66.102    20.339    13.559   |   100.000      59
 Over 60    |  74.194    16.129     9.677   |   100.000      31
            +------------------------------+
 Total         65.476    22.619    11.905      100.000     168
 N                110        38        20
 Test statistic                        Value        df      Prob
 Pearson Chi-square                    2.173     4.000     0.704

 Coefficient                           Value
 Phi                                   0.114
 Cramer V                              0.080
 Contingency                           0.113
The proportion of married people is larger within the Over 60 group than for the 30 to
45 group74.19% of the former are married while 61.54% of the latter are married.
The youngest stratum has the most divorced people. However, you cannot say these
proportions differ significantly (chi-square = 2.173, p value = 0.704).
Example 8
Two-Way Table Statistics (Long Results)
This example illustrates LONG results and table input. It uses the AGE by CENTER$
table from the cancer study described in the frequency input example. The input is:
USE cancer
XTAB
FREQ = number
PRINT LONG
LABEL age / 50=Under 50, 60=50 to 69, 70=70 & Over
TABULATE center$ * age
Total
253
221
290
764
 Test statistic                            Value        df      Prob
 Pearson Chi-square                       74.039     4.000     0.000
 Likelihood ratio Chi-square              76.963     4.000     0.000
 McNemar Symmetry Chi-square              79.401     3.000     0.000

 Coefficient                               Value    Asymptotic Std Error
 Phi                                       0.311
 Cramer V                                  0.220
 Contingency                               0.297
 Goodman-Kruskal Gamma                    -0.417         0.043
 Kendall Tau-B                            -0.275         0.030
 Stuart Tau-C                             -0.265         0.029
 Cohen Kappa                              -0.113         0.022
 Spearman Rho                             -0.305         0.033
 Somers D (column dependent)              -0.267         0.030
 Lambda (column dependent)                 0.075         0.038
 Uncertainty (column dependent)            0.049         0.011
The null hypothesis for the Pearson chi-square test is that the table factors are
independent. You reject the hypothesis (chi-square = 74.039, p value < 0.0005). We
are concerned about the analysis of the full table with four factors in the cancer study
because we see an imbalance between AGE and study CENTER. The researchers in
Tokyo entered a much larger proportion of younger women than did the researchers in
the other cities.
Notice that with LONG, SYSTAT reports all statistics for an r × c table, including
those that are appropriate when both factors have ordered categories (gamma, tau-b,
tau-c, and Spearman's rho).
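The chi-square-based coefficients in this output follow directly from the Pearson statistic. With chi-square = 74.039, N = 764, and min(r − 1, c − 1) = 2,

    phi = sqrt(74.039 / 764) = 0.311
    Cramér's V = sqrt(74.039 / (764 × 2)) = 0.220
    contingency coefficient = sqrt(74.039 / (74.039 + 764)) = 0.297

which match the values listed above.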
Example 9
Odds Ratios
For a table with cell counts a, b, c, and d:

                        Exposure
                      yes      no
 Disease      yes      a        b
              no       c        d
where, if you designate the Disease yes people sick and the Disease no people well, the
odds ratio (or cross-product ratio) equals the odds that a sick person is exposed divided
by the odds that a well person is exposed, or:
(a / b) / (c / d) = (ad) / (bc)
If the odds for the sick and disease-free people are the same, the value of the odds ratio
is 1.0.
As an example, use the SURVEY2 file and study the association between gender and
depressive illness. Be careful to order your table factors so that your odds ratio is
constructed correctly (we use LABEL to do this). The input is:
USE survey2
XTAB
LABEL casecont / 1=Depressed, 0=Normal
PRINT / FREQ ODDS
TABULATE sex$ * casecont
Frequencies
SEX$ (rows) by CASECONT (columns)

              Depressed     Normal      Total
 Female             36         116        152
 Male                8          96        104
 Total              44         212        256

 Test statistic                        Value        df      Prob
 Pearson Chi-square                   11.095     1.000     0.001

 Coefficient                           Value
 Odds Ratio                            3.724
 Ln(Odds)                              1.315
The odds that a female is depressed are 36 to 116, the odds for a male are 8 to 96, and
the odds ratio is 3.724. Thus, in this sample, females are almost four times more likely
to be depressed than males. But, does our sample estimate differ significantly from
1.0? Because the distribution of the odds ratio is very skewed, significance is
I-191
Crosstabulation
determined by examining Ln(Odds), the natural logarithm of the ratio, and the standard
error of the transformed ratio. Note the symmetry when ratios are transformed:
    Odds ratio:     3       2       1      1/2      1/3
    Ln(Odds):     ln 3    ln 2      0    −ln 2    −ln 3
The value of Ln(Odds) here is 1.315 with a standard error of 0.415. Constructing an
approximate 95% confidence interval using the statistic plus or minus two times its
standard error:
    1.315 ± 2(0.415), or (0.485, 2.145)
Because this interval does not include 0, you conclude that the odds ratio differs
significantly from 1.0.
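SYSTAT's exact computation is not reproduced in the manual, but the familiar large-sample standard error of Ln(Odds), sqrt(1/a + 1/b + 1/c + 1/d), gives the same numbers. A short Python check (illustrative only):

import math

# depression by gender: a, b = depressed/normal females; c, d = depressed/normal males
a, b, c, d = 36, 116, 8, 96

odds_ratio = (a * d) / (b * c)                    # about 3.724
ln_odds = math.log(odds_ratio)                    # about 1.315
se_ln_odds = math.sqrt(1/a + 1/b + 1/c + 1/d)     # about 0.415

low, high = ln_odds - 2 * se_ln_odds, ln_odds + 2 * se_ln_odds
print(odds_ratio, ln_odds, se_ln_odds, (low, high))   # the interval excludes 0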
Example 10
Fisher's Exact Test
Let's say that you are interested in how salaries of female executives compare with
those of male executives at a particular firm. The accountant there will not give you
salaries in dollar figures but does tell you whether the executives' salaries are low or
high:
             Low    High
 Male          2       7
 Female        5       1
The sample size is very small. When a table has only two rows and two columns and
PRINT=MEDIUM is set as the length, SYSTAT reports results of five additional tests
and measures: Fisher's exact test, the odds ratio (and Ln(Odds)), Yates' corrected chi-square,
and Yule's Q and Y. By setting PRINT=SHORT, you request three of these:
Fisher's exact test, the chi-square test, and Yates' corrected chi-square. The input is:
USE salary
XTAB
FREQ = count
LABEL sex
/ 1=male, 2=female
LABEL earnings / 1=low, 2=high
PRINT / FISHER CHISQ YATES
TABULATE sex * earnings
Frequencies
SEX (rows) by EARNINGS (columns)

              low    high    Total
           +----------------+
 male      |   2       7    |     9
 female    |   5       1    |     6
           +----------------+
 Total         7       8         15
WARNING: More than one-fifth of fitted cells are sparse (frequency < 5).
Significance tests computed on this table are suspect.
 Test statistic                        Value        df      Prob
 Pearson Chi-square                    5.402     1.000     0.020
 Yates corrected Chi-square            3.225     1.000     0.073
 Fisher exact test (two-tail)                              0.041
Notice that SYSTAT warns you that the results are suspect because the counts in the
table are too low (sparse). Technically, the message states that more than one-fifth of
the cells have expected values (fitted values) of less than 5.
The p value for the Pearson chi-square (0.020) leads you to believe that SEX and
EARNINGS are not independent. But there is a warning about suspect results. This
warning applies to the Pearson chi-square test but not to Fisher's exact test. Fisher's
test counts all possible outcomes exactly, including the ones that produce an
interaction greater than what you observe. The Fisher exact test p value is also
significant. On this basis, you reject the null hypothesis of independence (no
interaction between SEX and EARNINGS).
Sensitivity
Results for small samples, however, can be fairly sensitive. One case can matter. What
if the accountant forgets one well-paid male executive?
Frequencies
SEX (rows) by EARNINGS (columns)

              low    high    Total
           +----------------+
 male      |   2       6    |     8
 female    |   5       1    |     6
           +----------------+
 Total         7       7         14
WARNING: More than one-fifth of fitted cells are sparse (frequency < 5).
Significance tests computed on this table are suspect.
 Test statistic                        Value        df      Prob
 Pearson Chi-square                    4.667     1.000     0.031
 Yates corrected Chi-square            2.625     1.000     0.105
 Fisher exact test (two-tail)                              0.103
The results of the Fisher exact test indicate that you cannot reject the null hypothesis
of independence. It is too bad that you do not have the actual salaries. Much
information is lost when a quantitative variable like salary is dichotomized into LOW
and HIGH.
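If you want to replicate these small-sample results outside SYSTAT, scipy's Fisher test gives essentially the same two-tailed p values (this check is not part of the manual's example):

from scipy.stats import fisher_exact

original  = [[2, 7], [5, 1]]    # male/female (rows) by low/high salary (columns)
forgotten = [[2, 6], [5, 1]]    # the same table with one well-paid male dropped

for table in (original, forgotten):
    odds, p = fisher_exact(table, alternative="two-sided")
    print(table, round(p, 3))   # about 0.041 for the first table, 0.103 for the second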
Example 11
Cochran's Test of Linear Trend
When one table factor is dichotomous and the other has three or more ordered
categories (for example, low, median, and high), Cochrans test of linear trend is used
to test the null hypothesis that the slope of a regression line across the proportions is 0.
For example, in studying the relation of depression to education, you form this table
for the SURVEY2 data and plot the proportion depressed:
Column percents
CASECONT (rows) by EDUCATN (columns)

              Dropout   HS grad   College   Degree +      Total        N
            +------------------------------------------+
 Depressed  |  28.000    18.367    12.791      4.545    |    17.187      44
 Normal     |  72.000    81.633    87.209     95.455    |    82.813     212
            +------------------------------------------+
 Total        100.000   100.000   100.000    100.000       100.000     256
 N                 50        98        86         22

 Test statistic                        Value        df      Prob
 Pearson Chi-square                    7.841     3.000     0.049
 Cochran's Linear Trend                7.681     1.000     0.006
Frequencies
CASECONT (rows) by HEALTHY (columns)

              Excellent      Good   Fair/Poor      Total
            +--------------------------------+
 Depressed  |      16         15        13    |        44
 Normal     |     105         78        29    |       212
            +--------------------------------+
 Total            121         93        42           256

Column percents
CASECONT (rows) by HEALTHY (columns)

              Excellent      Good   Fair/Poor      Total        N
            +--------------------------------+
 Depressed  |  13.223     16.129     30.952   |    17.187       44
 Normal     |  86.777     83.871     69.048   |    82.813      212
            +--------------------------------+
 Total        100.000    100.000    100.000      100.000      256
 N                121         93         42

 Test statistic                        Value        df      Prob
 Pearson Chi-square                    7.000     2.000     0.030
 Cochran's Linear Trend                5.671     1.000     0.017
Example 12
Tables with Ordered Categories
In this example, we focus on statistics for studies in which both table factors have a
few ordered categories. For example, a teacher evaluating the activity level of
schoolchildren may feel that she can't score them from 1 to 20 but that she could
categorize the activity of each child as sedentary, normal, or hyperactive. Here you
study the relation of health status to age. If the category codes are character-valued,
you must indicate the correct ordering (as opposed to the default alphabetical
ordering).
For Spearman's rho, instead of using actual data values, the indices of the categories
are used to compute the usual correlation. Gamma measures the probability of getting
like (as opposed to unlike) orders of values. Its numerator is identical to that of
Kendall's tau-b and Stuart's tau-c. The input is:
USE survey2
XTAB
LABEL healthy /
LABEL age
/
Row percents
HEALTHY (rows) by AGE (columns)

              18 to 29   30 to 45   46 to 60   Over 60      Total        N
            +------------------------------------------+
 Excellent  |   35.537     39.669     20.661     4.132  |   100.000     121
 Good       |   32.258     24.731     25.806    17.204  |   100.000      93
 Fair/Poor  |   14.286     21.429     35.714    28.571  |   100.000      42
            +------------------------------------------+
 Total          30.859     31.250     25.000    12.891     100.000     256
 N                  79         80         64        33
 Test statistic                        Value        df      Prob
 Pearson Chi-square                   29.380     6.000     0.000

 Coefficient                           Value
 Goodman-Kruskal Gamma                 0.346
 Spearman Rho                          0.274
Not surprisingly, as age increases, health status tends to deteriorate. In the table of row
percentages, notice that among those with EXCELLENT health, 4.13% are in the oldest
age group; in the GOOD category, 17.2% are in the oldest group; and in the
FAIR/POOR category, 28.57% are in the oldest group.
The value of gamma is 0.346; rho is 0.274. Here are confidence intervals (Value ±
2 * Asymptotic Std Error) for each statistic:
Example 13
McNemar's Test of Symmetry
In November of 1993, the U.S. Congress approved the North American Free Trade
Agreement (NAFTA). Let's say that two months before the approval and before the
televised debate between Vice President Al Gore and businessman Ross Perot,
political pollsters queried a sample of 350 people, asking "Are you for, unsure, or
against NAFTA?" Immediately after the debate, the pollsters contacted the same
people and asked the question a second time. Here are the responses:
                           After
 Before            For    Unsure    Against
 For                51        22         28
 Unsure             46        18         27
 Against            52        49         57
The pollsters wonder, "Is there a shift in opinion about NAFTA?" The study design for
the answer is similar to a paired t test: each subject has two responses. The row and
column categories of our table are the same variable measured at different points in time.
The file NAFTA contains these data. To test for an opinion shift, the input is:
USE nafta
XTAB
FREQ = count
ORDER before$ after$ / SORT=for,unsure,against
PRINT / FREQ MCNEMAR CHI PERCENT
TABULATE before$ * after$
We use ORDER to ensure that the row and column categories are ordered the same. The
output follows:
Frequencies
BEFORE$ (rows) by AFTER$ (columns)

               for    unsure   against      Total
           +--------------------------+
 for       |    51       22        28 |       101
 unsure    |    46       18        27 |        91
 against   |    52       49        57 |       158
           +--------------------------+
 Total         149       89       112         350

 Test statistic                        Value        df      Prob
 Pearson Chi-square                   11.473     4.000     0.022
 McNemar Symmetry Chi-square          22.039     3.000     0.000
The McNemar test of symmetry focuses on the counts in the off-diagonal cells (those
along the diagonal are not used in the computations). We are investigating the direction
of change in opinion. First, how many respondents became more negative about
NAFTA?
n Among those who initially responded For, 22 (6.29%) are now Unsure and 28
(8.00%) are now Against.
The three cells in the upper right contain counts for those who became more
unfavorable and comprise 22% (6.29 + 8.00 + 7.71) of the sample. The three cells in
the lower left contain counts for people who became more positive about NAFTA (46,
52, and 49) or 42% of the sample.
The null hypothesis for the McNemar test is that the changes in opinion are equal.
The chi-square statistic for this test is 22.039 with 3 df and p < 0.0005. You reject the
null hypothesis. The pro-NAFTA shift in opinion is significantly greater than the antiNAFTA shift.
You also clearly reject the null hypothesis that the row (BEFORE$) and column
(AFTER$) factors are independent (chi-square = 11.473; p = 0.022). However, a test
of independence does not answer your original question about change of opinion and
its direction.
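The symmetry statistic itself is simple to compute by hand: it sums (nij − nji)² / (nij + nji) over the cell pairs above and below the diagonal. A small numpy sketch (illustrative only, not SYSTAT input) reproduces the 22.039 reported above:

import numpy as np

# BEFORE$ (rows) by AFTER$ (columns), categories ordered for / unsure / against
table = np.array([[51, 22, 28],
                  [46, 18, 27],
                  [52, 49, 57]])

stat = 0.0
k = table.shape[0]
for i in range(k):
    for j in range(i + 1, k):
        stat += (table[i, j] - table[j, i]) ** 2 / (table[i, j] + table[j, i])

df = k * (k - 1) // 2
print(stat, df)    # about 22.04 on 3 degrees of freedom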
Example 14
Confidence Intervals for One-Way Table Percentages
If your data are binomially or multinomially distributed, you may want confidence
intervals on the cell proportions. SYSTAT's confidence intervals are based on an
approximation by Bailey (1980). Crosstabs uses that reference's approximation
number 6 with a continuity correction, which closely fits the real intervals for the
binomial on even small samples and performs well when population proportions are
near 0 or 1. The confidence intervals are scaled on a percentage scale for compatibility
with the other Crosstabs output.
Here is an example using data from Davis (1977) on the number of buses failing
after driving a given distance (1 of 10 distances). Print the percentages of the 191 buses
failing in each distance category to see the cover of the intervals. The input follows:
USE buses
XTAB
FREQ = count
PRINT NONE / FREQ PERCENT
TABULATE distance / CONFI=.95
(The output lists, for each of the 10 distance categories, the number of buses failing,
the percentage of the 191 buses, and the 95% confidence interval for that percentage.)
There are 6 buses in the first distance category; this is 3.14% of the 191 buses. The
confidence interval for this percentage ranges from 0.55 to 8.23%.
Example 15
Mantel-Haenszel Test
For any (k × 2 × 2) table, if the output mode is MEDIUM or if you select the Mantel-Haenszel test, SYSTAT produces the Mantel-Haenszel statistic without continuity
correction. This tests the association between two binary variables controlling for a
stratification variable. The Mantel-Haenszel test is often used to test the effectiveness
of a treatment on an outcome, to test the degree of association between the presence or
absence of a risk factor and the occurrence of a disease, or to compare two survival
distributions.
A study by Ansfield, et al. (1977) examined the responses of two different groups of
patients (colon or rectum cancer and breast cancer) to two different treatments:
 CANCER$         TREAT$   RESPONSE$   NUMBER
 Colon-Rectum      a      Positive    16.000
 Colon-Rectum      b      Positive     7.000
 Colon-Rectum      a      Negative    32.000
 Colon-Rectum      b      Negative    45.000
 Breast            a      Positive    14.000
 Breast            b      Positive     9.000
 Breast            a      Negative    28.000
 Breast            b      Negative    29.000
                       Breast                    Colon-Rectum
                Positive   Negative         Positive   Negative
 Treatment A        14         28               16         32
 Treatment B         9         29                7         45
The odds ratio (cross-product ratio) for the first table is:
    (14 × 29) / (28 × 9) = 1.6
Similarly, for the second table, the odds ratio is:
    (16 × 45) / (32 × 7) = 3.2
If the odds for treatments A and B are identical, the ratios would both be 1.0. For these
data, the breast cancer patients on treatment A are 1.6 times more likely to have a
positive biopsy than patients on treatment B; while, for the colon-rectum, those on
treatment A are 3.2 times more likely to have a positive biopsy than those on treatment
B. But can you say these estimates differ significantly from 1.0? After adjusting for the
total frequency in each table, the Mantel-Haenszel statistic combines odds ratios across
tables. The input is:
USE ansfield
XTAB
FREQ = number
ORDER response$ / SORT=Positive,Negative
PRINT / MANTEL
TABULATE cancer$ * treat$ * response$
The stratification variable (CANCER$) must be the first variable listed on TABULATE.
The output is:
Frequencies
TREAT$ (rows) by RESPONSE$ (columns)
CANCER$     = Breast

             Positive   Negative      Total
          +----------------------+
  a       |      14         28   |        42
  b       |       9         29   |        38
          +----------------------+
 Total           23         57            80

CANCER$     = Colon-Rectum

             Positive   Negative      Total
          +----------------------+
  a       |      16         32   |        48
  b       |       7         45   |        52
          +----------------------+
 Total           23         77           100

 Test statistic                              Value      Prob
 Mantel-Haenszel statistic =                 2.277
 Mantel-Haenszel Chi-square =                4.739     0.029
SYSTAT prints a chi-square test for testing whether this combined estimate equals 1.0
(that odds for A and B are the same). The probability associated with this chi-square is
0.029, so you reject the hypothesis that the odds ratio is 1.0 and conclude that treatment
A is less effectivemore patients on treatment A have positive biopsies after treatment
than patients on treatment B.
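The combined estimate of 2.277 appears to be the Mantel-Haenszel common odds ratio, the sum of a·d/n over the strata divided by the sum of b·c/n. A brief numpy check (not part of the manual's input) reproduces it:

import numpy as np

# one 2 x 2 table per CANCER$ stratum: rows = treatment a/b, columns = Positive/Negative
strata = [np.array([[14, 28], [9, 29]]),    # Breast
          np.array([[16, 32], [7, 45]])]    # Colon-Rectum

numerator = sum(t[0, 0] * t[1, 1] / t.sum() for t in strata)
denominator = sum(t[0, 1] * t[1, 0] / t.sum() for t in strata)
print(numerator / denominator)              # about 2.277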
One assumption required for the Mantel-Haenszel chi-square test is that the odds
ratios are homogeneous across tables. For your example, the second odds ratio is twice
as large as the first. You can use loglinear models to test if a cancer-by-treatment
interaction is needed to fit the cells of the three-way table defined by cancer, treatment,
and response. The difference between this model and one without the interaction was
not significant (a chi-square of 0.36 with 1 df).
Computation
All computations are in double precision.
References
Afifi, A. A. and Clark, V. (1984). Computer-aided multivariate analysis. Belmont, Calif.:
Lifetime Learning.
Ansfield, F., et al. (1977). A phase III study comparing the clinical utility of four regimens
of 5-fluorouracil. Cancer, 39, 34–40.
Bailey, B. J. R. (1980). Large sample simultaneous confidence intervals for the
multinomial probabilities based on transformations of the cell frequencies.
Technometrics, 22, 583–589.
Davis, D. J. (1977). An analysis of some failure data. Journal of the American Statistical
Association, 72, 113–150.
Fleiss, J. L. (1981). Statistical methods for rates and proportions, 2nd ed. New York: John
Wiley & Sons, Inc.
Morrison, A. S., Black, M. M., Lowe, C. R., MacMahon, B., and Yuasa, S. Y. (1973). Some
international differences in histology and survival in breast cancer. International
Journal of Cancer, 11, 261–267.
Chapter
9
Descriptive Statistics
Leland Wilkinson and Laszlo Engelman
There are many ways to describe data, although not all descriptors are appropriate for
a given sample. Means and standard deviations are useful for data that follow a
normal distribution, but are poor descriptors when the distribution is highly skewed
or has outliers, subgroups, or other anomalies. Some statistics, such as the mean and
median, describe the center of a distribution. These estimates are called measures of
location. Others, such as the standard deviation, describe the spread of the
distribution.
Before deciding what you want to describe (location, spread, and so on), you
should consider what type of variables are present. Are the values of a variable
unordered categories, ordered categories, counts, or measurements?
For many statistical purposes, counts are treated as measured variables. Such
variables are called quantitative if it makes sense to do arithmetic on their values.
Means and standard deviations are appropriate for quantitative variables that follow a
normal distribution. Often, however, real data do not meet this assumption of
normality. A descriptive statistic is called robust if the calculations are insensitive to
violations of the assumption of normality. Robust measures include the median,
quartiles, frequency counts, and percentages.
Before requesting descriptive statistics, first scan graphical displays to see if the
shape of the distribution is symmetric, if there are outliers, and if the sample has
subpopulations. If the latter is true, then the sample is not homogeneous, and the
statistics should be calculated for each subgroup separately.
Descriptive Statistics offers the usual mean, standard deviation, and standard error
appropriate for data that follow a normal distribution. It also provides the median,
minimum, maximum, and range. A confidence interval for the mean and standard
errors for skewness and kurtosis can be requested. A stem-and-leaf plot is available
Statistical Background
Descriptive statistics are numerical summaries of batches of numbers. Inevitably, these
summaries are misleading, because they mask details of the data. Without them,
however, we would be lost in particulars.
There are many ways to describe a batch of data. Not all are appropriate for every
batch, however. Lets look at the Whos Who data from Chapter 1 to see what this
means. First of all, here is a stem-and-leaf diagram of the ages of 50 randomly sampled
people from Whos Who. A stem-and-leaf diagram is a tally; it shows us the
distribution of the AGE values.
Stem and leaf plot of variable:    AGE, N = 50

Minimum:        34.000
Lower hinge:    49.000
Median:         56.000
Upper hinge:    66.000
Maximum:        81.000

  3     4
  3     689
  4     14
  4 H   556778999
  5     0011112
  5 M   556688889
  6     0023
  6 H   55677789
  7     04
  7     5668
  8     1
Notice that these data look fairly symmetric and lumpy in the middle. A natural way
to describe this type of distribution would be to report its center and the amount of
spread.
Location
How do we describe the center, or central location of the distribution, on a scale? One
way is to pick the value above which half of the data values fall and, by implication,
below which half of the data values fall. This measure is called the median. For our
AGE data, the median age is 56 years. Another measure of location is the center of
gravity of the numbers. Think of turning the stem-and-leaf diagram on its side and
balancing it. The balance point would be the mean. For a batch of numbers, the mean
is computed by averaging the values. In our sample, the mean age is 56.7 years. It is
quite close to the median.
Spread
One way to measure spread is to take the difference between the largest and smallest
value in the data. This is called the range. For the age data, the range is 47 years.
Another measure, called the interquartile range or midrange, is the difference between
the values at the limits of the middle 50% of the data. For AGE, this is 17 years. (Using
the statistics at the top of the stem-and-leaf display, subtract the lower hinge from the
upper hinge.) Still another way to measure would be to compute the average variability
in the values. The standard deviation is the square root of the average squared
deviation of values from the mean. For the AGE variable, the standard deviation is
11.62. Following is some output from STATS:
                        AGE
 N of cases              50
 Mean                56.700
 Standard Dev        11.620
(Histogram of AGE with a normal curve superimposed.)
The fit of the curve to the data looks excellent. Lets examine the fit in more detail. For
a normal distribution, we would expect 68% of the observations to fall between one
standard deviation below the mean and one standard deviation above the mean (45.1
to 68.3 years). By counting values in the stem-and-leaf diagram, we find 34 cases, right on
target. This is not to say that every number follows a normal distribution exactly,
however. If we looked further, we would find that the tails of this distribution are
slightly shorter than those from a normal distribution, but not enough to worry.
Non-Normal Shape
Before you compute means and standard deviations on everything in sight, however,
let's take a look at some more data: the USDATA data. Following are histograms for
the first two variables, ACCIDENT and CARDIO:
[Histograms of ACCIDENT (values roughly 20 to 90) and CARDIO (values roughly 100 to 600), each with a fitted normal curve overlaid]
Notice that the normal curves fit the distributions poorly. ACCIDENT is positively
skewed. That is, it has a long right tail. CARDIO, on the other hand, is negatively
skewed. It has a long left tail. The means (44.3 and 398.5) clearly do not fall in the
centers of the distributions. Furthermore, if you calculate the medians using the Stem
display, you will see that the mean for ACCIDENT is pulled away from the median
(41.9) toward the upper tail and the mean for CARDIO is pulled to the left of the
median (416.2).
In short, means and standard deviations are not good descriptors for non-normal
data. In these cases, you have two alternatives: either transform your data to look
normal, or find other descriptive statistics that characterize the data. If you log the
values of ACCIDENT, for example, the histogram looks quite normal. If you square the
values of CARDIO, the normal fit similarly improves.
If a transformation doesn't work, then you may be looking at data that come from a
different mathematical distribution or are mixtures of subpopulations (see below). The
probability plots in SYSTAT can help you identify certain mathematical distributions.
There is not room here to discuss parameters for more complex probability
distributions. Otherwise, you should turn to distribution-free summary statistics to
characterize your data: the median, range, minimum, maximum, midrange, quartiles,
and percentiles.
Subpopulations
Sometimes, distributions can look non-normal because they are mixtures of different
normal distributions. Let's look at the Fisher/Anderson IRIS flower measurements.
Following is a histogram of PETALLEN (petal length) smoothed by a normal curve:
[Histogram of PETALLEN (counts for petal lengths roughly 0 to 7) with a fitted normal curve overlaid]
We forgot to notice that the petal length measurements involve three different flower
species. You can see one of them at the left. The other two are blended at the right.
Computing a mean and standard deviation on the mixed data is misleading.
The following box plot, split by species, shows how different the subpopulations are:
[Box plot of PETALLEN by SPECIES (groups 1, 2, and 3)]
When there are such differences, you should compute basic statistics by group. If you
want to go on to test whether the differences in subpopulation means are significant,
use analysis of variance.
But first notice that the Setosa flowers (Group 1) have the shortest petals and the
smallest spread; while the Virginica flowers (Group 3) have the longest petals and
widest spread. That is, the size of the cell mean is related to the size of the cell standard
deviation. This violates the assumption of equal variances necessary for a valid
analysis of variance.
[Box plot of PETALLEN by SPECIES on a log scale]
The spreads of the three distributions are now more similar. For the analysis, we should
log transform the data.
n SEM. The standard error of the mean: the standard deviation divided by the square
root of the sample size. It is the estimation error, or the average deviation of sample
means from the expected value of a variable.
n CI of Mean. Endpoints for the confidence interval of the mean. You can specify
the level of confidence (see Confidence below).
n Median. When the values are ordered in increasing order, the median is the value above which half of the values fall.
n SD. Standard deviation, a measure of spread, is the square root of the sum of the
squared deviations of the values from the mean, divided by (n - 1).
n Range. The difference between the minimum and the maximum values.
n Variance. The mean of the squared deviations of values from the mean. (Variance
is the square of the standard deviation.)
n SES. The standard error of skewness ( SQR ( 6 / n ) ).
n Kurtosis. A value of kurtosis significantly greater than 0 indicates that the variable
has longer tails than those for a normal distribution; less than 0 indicates that the
distribution is flatter than a normal distribution. A kurtosis coefficient is considered
significant if the absolute value of KURTOSIS / SEK is greater than 2 (see the sketch following this list).
n SEK. The standard error of kurtosis ( SQR ( 24 / n ) ).
n Confidence. Confidence level for the confidence interval of the mean. Enter a value
between 0 and 1 (for example, 0.95).
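To make the SES and SEK entries concrete, here is a small illustrative Python sketch (not SYSTAT code). It computes ordinary moment-based skewness and kurtosis together with the approximate standard errors SQR(6/n) and SQR(24/n) and applies the |coefficient/SE| > 2 rule; SYSTAT's exact small-sample formulas may differ slightly.

import math

def skew_kurtosis_tests(values):
    """Moment-based skewness and excess kurtosis with their
    approximate standard errors SQR(6/n) and SQR(24/n)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5            # third standardized moment
    kurtosis = m4 / m2 ** 2 - 3.0        # excess kurtosis (normal = 0)
    ses = math.sqrt(6.0 / n)             # standard error of skewness
    sek = math.sqrt(24.0 / n)            # standard error of kurtosis
    return {
        "skewness": skewness, "SES": ses,
        "kurtosis": kurtosis, "SEK": sek,
        # a coefficient is flagged when |coefficient / SE| exceeds 2
        "skewness significant": abs(skewness / ses) > 2,
        "kurtosis significant": abs(kurtosis / sek) > 2,
    }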
Each selected statistic is a case in the new data file (both the statistic and the
group(s) are identified). The file contains the variable STATISTIC$ identifying the
statistics.
n Aggregate. Saves aggregate statistics to a data file. For each By Groups category, a
record (case) in the new data file contains all requested statistics. Three characters
are appended to the first eight letters of the variable name to identify the statistics.
The first two characters identify the statistic. The third character represents the
order in which the variables are selected. The statistics correspond to the following
two-letter combinations:
N of cases      NU          Std. Error        SE
Minimum         MI          Std. Deviation    SD
Maximum         MA          Variance          VA
Range           RA          C.V.              CV
Sum             SU          Skewness          SK
Median          MD          SE Skewness       ES
Mean            ME          Kurtosis          KU
CI Upper        CU          SE Kurtosis       EK
CI Lower        CL
Stem creates a stem-and-leaf plot for one or more variables. The plot shows the
distribution of a variable graphically. In a stem-and-leaf plot, the digits of each number
are separated into a stem and a leaf. The stems are listed as a column on the left, and
the leaves for each stem are in a row on the right. Stem-and-leaf plots also list the
minimum, lower-hinge, median, upper-hinge, and maximum values of the sample.
Unlike histograms, stem-and-leaf plots show actual numeric values to the precision of
the leaves.
The stem-and-leaf plot is useful for assessing distributional shape and identifying
outliers. Values that are markedly different from the others in the sample are labeled
as outside values; that is, the value is more than 1.5 hspreads outside its hinge (the
hspread is the distance between the lower and upper hinges, or quartiles). Under
normality, this translates into roughly 2.7 standard deviations from the mean.
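As a rough illustration of that rule, the following Python sketch (not SYSTAT code) flags values more than 1.5 H-spreads beyond the hinges. The quartile positions used here are a simplification; SYSTAT's hinges may be computed slightly differently.

def outside_values(values):
    """Flag values more than 1.5 H-spreads outside the hinges,
    the rule used for 'Outside Values' in the stem-and-leaf plot."""
    data = sorted(values)
    n = len(data)
    lower = data[n // 4]               # rough lower hinge (quartile)
    upper = data[(3 * n) // 4]         # rough upper hinge (quartile)
    hspread = upper - lower
    low_fence = lower - 1.5 * hspread
    high_fence = upper + 1.5 * hspread
    return [x for x in data if x < low_fence or x > high_fence]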
The following must be specified to obtain a stem-and-leaf plot:
n Variable(s). A separate stem-and-leaf plot is created for each selected variable.
In addition, you can indicate how many lines (stems) to include in the plot.
Cronbach computes Cronbach's alpha. This statistic is a lower bound for test reliability
and ranges in value from 0 to 1 (negative values can occur when items are negatively
correlated). Alpha can be viewed as the correlation between the items (variables)
selected and all other possible tests or scales (with the same number of items)
constructed to measure the characteristic of interest. The formula used to calculate
alpha is:
alpha = [ k * avg(cov) / avg(var) ] / [ 1 + (k - 1) * avg(cov) / avg(var) ]
where k is the number of items, avg(cov) is the average covariance among the items,
and avg(var) is the average variance. Note that alpha depends on both the number of
items and the correlations among them. Even when the average correlation is small, the
reliability coefficient can be large if the number of items is large.
The following must be specified to obtain a Cronbach's alpha:
n Variable(s). To obtain Cronbach's alpha, at least two variables must be selected.
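The Python sketch below (not SYSTAT syntax) evaluates the alpha formula above directly from the average inter-item covariance and the average item variance; the function and variable names are our own.

def cronbach_alpha(items):
    """Cronbach's alpha from average inter-item covariance and average
    item variance. `items` is a list of equal-length score lists, one per item."""
    k = len(items)
    n = len(items[0])

    def mean(xs):
        return sum(xs) / len(xs)

    def cov(xs, ys):
        mx, my = mean(xs), mean(ys)
        return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (n - 1)

    avg_var = sum(cov(item, item) for item in items) / k
    avg_cov = sum(cov(items[i], items[j])
                  for i in range(k) for j in range(k) if i != j) / (k * (k - 1))

    ratio = k * avg_cov / avg_var
    return ratio / (1 + (k - 1) * avg_cov / avg_var)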
Using Commands
To generate descriptive statistics, choose your data by typing USE filename, and
continue with:
STATISTICS
STEM varlist / LINES=n
CRONBACH varlist
SAVE / AG
STATISTICS varlist / ALL N MIN MAX SUM MEAN SEM CIM,
CONFI=n MEDIAN SD CV RANGE VARIANCE,
SKEWNESS SES KURTOSIS SEK
Usage Considerations
Types of data. STATS uses only numeric data.
Print options. The output is standard for all PRINT options.
Quick Graphs. STATS does not create Quick Graphs.
Saving files. STATS saves basic statistics as either records (cases) or as variables.
BY groups. STATS analyzes data by groups.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. STATS uses the FREQ variable, if present, to duplicate cases.
Case weights. STATS uses the WEIGHT variable, if present, to weight cases. However,
STEM is not affected by the WEIGHT variable.
Examples
Example 1
Basic Statistics
This example uses the OURWORLD data file, containing one record for each of 57
countries, and requests the default set of statistics for BABYMORT (infant mortality),
GNP_86 (gnp per capita in 1986), LITERACY (percentage of the population who can
read), and POP_1990 (population, in millions, in 1990).
The Statistics procedure knows only that these are numeric variablesit does not
know if the mean and standard deviation are appropriate descriptors for their
distributions. In other examples, we learned that the distribution of infant mortality is
right-skewed and has distinct subpopulations, the gnp is missing for 12.3% of the
countries, the distribution of LITERACY is left-skewed and has distinct subgroups, and
a log transformation markedly improves the symmetry of the population values. This
example ignores those findings.
The input is:
STATISTICS
USE ourworld
STATISTICS babymort gnp_86 literacy pop_1990
                   BABYMORT       GNP_86     LITERACY     POP_1990
N of cases               57           50           57           57
Minimum              5.0000     120.0000      11.6000       0.2627
Maximum            154.0000   17680.0000     100.0000     152.5051
Mean                48.1404    4310.8000      73.5632      22.8003
Standard Dev        47.2355    4905.8773      29.7646      30.3655
For each variable, SYSTAT prints the number of cases (N of cases) with data present.
Notice that the sample size for GNP_86 is 50, or 7 less than the total observations. For
each variable, Minimum is the smallest value and Maximum, the largest. Thus, the
lowest infant mortality rate is 5 deaths (per 1,000 live births), and the highest is 154
deaths. In a symmetric distribution, the mean and median are approximately the same.
The median for POP_1990 is 10.354 million people (see the stem-and-leaf plot
example). Here, the mean is 22.8 million, more than double the median. This estimate
of the mean is quite sensitive to the extreme values in the right tail.
Standard Dev, or standard deviation, measures the spread of the values in each
distribution. When the data follow a normal distribution, we expect roughly 95% of the
values to fall within two standard deviations of the mean.
Example 2
Saving Basic Statistics: One Statistic and One Grouping Variable
For European, Islamic, and New World countries, we save the median infant mortality
rate, gross national product, literacy rate, and 1990 population using the OURWORLD
data file. The input is:
STATISTICS
USE ourworld
BY group$
SAVE mystats
STATISTICS babymort gnp_86 literacy pop_1990 / MEDIAN
BY
The text results that appear on the screen are shown below (they can also be sent to a
text file).
The following results are for:
 GROUP$        = Europe

                 BABYMORT      GNP_86     LITERACY     POP_1990
N of cases             20           18           20           20
Median              6.000     9610.000       99.000       10.462

The following results are for:
 GROUP$        = Islamic

                 BABYMORT      GNP_86     LITERACY     POP_1990
N of cases             16           12           16           16
Median            113.000      335.000       28.550       16.686

The following results are for:
 GROUP$        = NewWorld

                 BABYMORT      GNP_86     LITERACY     POP_1990
N of cases             21           20           21           21
Median             32.000     1275.000       85.600        7.241
The MYSTATS data file (created in the SAVE step) is shown below:
Case   GROUP$     STATISTIC$     BABYMORT     GNP_86   LITERACY   POP_1990
   1   Europe     N of cases           20         18         20         20
   2   Europe     Median                6       9610         99     10.462
   3   Islamic    N of cases           16         12         16         16
   4   Islamic    Median              113        335     28.550     16.686
   5   NewWorld   N of cases           21         20         21         21
   6   NewWorld   Median               32       1275       85.6      7.241
Example 3
Saving Basic Statistics: Multiple Statistics and Grouping Variables
If you want to save two or more statistics for each unique cross-classification of the
values of the grouping variables, SYSTAT can write the results in two ways:
n A separate record for each statistic. The values of a new variable named STATISTIC$ identify the statistic contained in each record.
n One record for each cell containing all of the requested statistics (the aggregate layout).
As examples, we save the median, mean, and standard error of the mean for the cross-classification of type of country with government for the OURWORLD data. The nine
cells for which we compute statistics are shown below (the number of countries is
displayed in each cell):
                Democracy     Military     One Party
Europe                 16                          4
Islamic                 4            7             5
New World              12            6             3
Note the empty cell in the first row. We illustrate both file layouts: a separate record
for each statistic and one record for all results.
One record per statistic. The following commands are used to compute and save
statistics for the combinations of GROUP$ and GOV$ shown in the table above:
STATISTICS
USE ourworld
BY group$ gov$
SAVE mystats2
STATISTICS babymort gnp_86 literacy pop_1990 / MEDIAN MEAN SEM
BY
The MYSTATS2 file with 32 cases and seven variables is shown below:
Case  GROUP$    GOV$        STATISTC$     BABYMORT     GNP_86   LITERACY   POP_1990
  1   Europe    Democracy   N of Cases      16.000     16.000     16.000     16.000
  2   Europe    Democracy   Mean             6.875   9770.000     97.250     22.427
  3   Europe    Democracy   Std. Error       0.547   1057.226      1.055      5.751
  4   Europe    Democracy   Median           6.000  10005.000     99.000      9.969
  5   Europe    OneParty    N of Cases       4.000      2.000      4.000      4.000
  6   Europe    OneParty    Mean            11.500   2045.000     98.750     20.084
  7   Europe    OneParty    Std. Error       1.708     25.000      0.250      6.036
  8   Europe    OneParty    Median          12.000   2045.000     99.000     15.995
  9   Islamic   Democracy   N of Cases       4.000      4.000      4.000      4.000
 10   Islamic   Democracy   Mean            91.000    700.000     37.300     12.761
 11   Islamic   Democracy   Std. Error      23.083    378.660      9.312      5.315
 12   Islamic   Democracy   Median          97.000    370.000     29.550     12.612
 13   Islamic   OneParty    N of Cases       5.000      3.000      5.000      5.000
 14   Islamic   OneParty    Mean           109.800   1016.667     29.720     15.355
 15   Islamic   OneParty    Std. Error      15.124    787.196      9.786      3.289
 16   Islamic   OneParty    Median         116.000    280.000     18.000     15.862
 17   Islamic   Military    N of Cases       7.000      5.000      7.000      7.000
 18   Islamic   Military    Mean           110.857    458.000     37.886     51.444
 19   Islamic   Military    Std. Error      11.801    180.039      7.779     18.678
 20   Islamic   Military    Median         116.000    350.000     29.000     51.667
 21   NewWorld  Democracy   N of Cases      12.000     12.000     12.000     12.000
 22   NewWorld  Democracy   Mean            44.667   2894.167     85.800     26.490
 23   NewWorld  Democracy   Std. Error       9.764   1085.810      3.143     11.926
 24   NewWorld  Democracy   Median          35.000   1645.000     86.800     15.102
 25   NewWorld  OneParty    N of Cases       3.000      2.000      3.000      3.000
 26   NewWorld  OneParty    Mean            14.667   2995.000     90.500      4.441
 27   NewWorld  OneParty    Std. Error       1.333   2155.000      8.251      3.153
 28   NewWorld  OneParty    Median          16.000   2995.000     98.500      2.441
 29   NewWorld  Military    N of Cases       6.000      6.000      6.000      6.000
 30   NewWorld  Military    Mean            53.167   1045.000     63.000      6.886
 31   NewWorld  Military    Std. Error      13.245    287.573     10.820      1.515
 32   NewWorld  Military    Median          55.000    780.000     60.500      5.726
The average infant mortality rate for European democratic nations is 6.875 (case 2),
while the median is 6.0 (case 4).
One record for all statistics. Instead of four records (cases) for each combination of
GROUP$ and GOV$, we specify AG (aggregate) to prompt SYSTAT to write one
record for each cell:
STATISTICS
USE ourworld
BY group$ gov$
SAVE mystats3 / AG
STATISTICS babymort gnp_86 literacy pop_1990 / MEDIAN MEAN SEM
BY
The MYSTATS3 file, with 8 cases and 18 variables, is shown below. (We separated
them into three panels and shortened the variable names):
Case  GROUP$    GOV$        NU1BABYM   ME1BABYM   SE1BABYM   MD1BABYM
  1   Europe    Democracy         16      6.875      0.547        6.0
  2   Europe    OneParty           4     11.500      1.708       12.0
  3   Islamic   Democracy          4     91.000     23.083       97.0
  4   Islamic   OneParty           5    109.800     15.124      116.0
  5   Islamic   Military           7    110.857     11.801      116.0
  6   NewWorld  Democracy         12     44.667      9.764       35.0
  7   NewWorld  OneParty           3     14.667      1.333       16.0
  8   NewWorld  Military           6     53.167     13.245       55.0

Case  NU2GNP_8   ME2GNP_8   SE2GNP_8   MD2GNP_8   NU3LITER   ME3LITER
  1         16   9770.000   1057.226      10005         16     97.250
  2          2   2045.000     25.000       2045          4     98.750
  3          4    700.000    378.660        370          4     37.300
  4          3   1016.667    787.196        280          5     29.720
  5          5    458.000    180.039        350          7     37.886
  6         12   2894.167   1085.810       1645         12     85.800
  7          2   2995.000   2155.000       2995          3     90.500
  8          6   1045.000    287.573        780          6     63.000

Case  SE3LITER   MD3LITER   NU4POP_1   ME4POP_1   SE4POP_1   MD4POP_1
  1      1.055       99.0         16     22.427      5.751      9.969
  2      0.250       99.0          4     20.084      6.036     15.995
  3      9.312       29.5          4     12.761      5.315     12.612
  4      9.786       18.0          5     15.355      3.289     15.862
  5      7.779       29.0          7     51.444     18.678     51.667
  6      3.143       86.8         12     26.490     11.926     15.102
  7      8.251       98.5          3      4.441      3.153      2.441
  8     10.820       60.5          6      6.886      1.515      5.726
Note that there are no European countries with Military governments, so no record is
written.
Example 4
Stem-and-Leaf Plot
We request robust statistics for BABYMORT (infant mortality), POP_1990 (1990
population in millions), and LITERACY (percentage of the population who can read)
from the OURWORLD data file. The input is:
STATISTICS
USE ourworld
STEM babymort pop_1990 literacy
Stem and Leaf Plot of variable: BABYMORT, N = 57

Minimum:         5.000
Lower hinge:     7.000
Median:         22.000
Upper hinge:    74.000
Maximum:       154.000

  0 H 5666666666677777
  1   00123456668
  2 M 227
  3   028
  4   9
  5
  6   11224779
  7 H 4
  8   77
  9
 10   77
 11   066
 12   559
 13   6
 14   07
 15   4

Stem and Leaf Plot of variable: POP_1990, N = 57

Minimum:         0.2627
Lower hinge:     6.1421
Median:         10.3545
Upper hinge:    25.5665
Maximum:       152.5051

  0   00122333444
  0 H 5556667777788899
  1 M 0000034
  1   556789
  2   14
  2 H 56
  3   23
  3   79
  4
  4
  5   1
      * * * Outside Values * * *
  5   6677
  6   2
 11   48
 15   2
LITERACY, N = 57

  1   1258
  2   035689
  3   1
  4
  5 H 002556
  6   355
  7   0446
  8 M 03558
  9 H 03344457888889999999999999
 10   00
In a stem-and-leaf plot, the digits of each number are separated into a stem and a leaf.
The stems are listed as a column on the left, and the leaves for each stem are in a row
on the right. For infant mortality (BABYMORT), the Maximum number of babies who
die in their first year of life is 154 (out of 1,000 live births). Look for this value at the
bottom of the BABYMORT display. The stem for 154 is 15, and the leaf is 4. The
Minimum value for this variable is 5; its leaf is 5 with a stem of 0.
The median value of 22 is printed here as the Median in the top panel and marked
by an M in the plot. The hinges, marked by Hs in the plot, are 7 and 74 deaths, meaning
that 25% of the countries in our sample have a death rate of 7 or less, and another 25%
have a rate of 74 or higher. Furthermore, the gaps between 49 and 61 deaths and
between 87 and 107 indicate that the sample does not appear homogeneous.
Focusing on the second plot, the median population size is 10.354, or more than 10
million people. One-quarter of the countries have a population of 6.142 million or less.
The largest country (Brazil) has more than 152 million people. The largest stem for
POP_1990 is 15, like that for BABYMORT. This 15 comes from 152.505, so the 2 is
the leaf and the 0.505 is lost.
The plot for POP_1990 is very right-skewed. Notice that a real number line extends
from the minimum stem of 0 (0.263) to the stem of 5 for 51 million. The values below
Outside Values (stems of 5, 6, 11, and 15 with 8 leaves) do not fall along a number line,
so the right tail of this distribution extends further than one would think at first glance.
The median in the final plot indicates that half of the countries in our sample have
a literacy rate of 88% or better. The upper hinge is 99%, so more than one-quarter of
the countries have a rate of 99% or better. In the country with the lowest rate (Somalia),
only 11.6% of the people can read. The stem for 11.6 is 1 (the 10s digit), and the leaf
is 1 (the units digit). The 0.6 is not part of the display. For stem 10, there are two leaves
that are 0, so two countries have 100% literacy rates (Finland and Norway). Notice
the 11 countries (at the top of the plot) with very low rates. Is there a separate subgroup
here?
Transformations
Because the distribution of POP_1990 is very skewed, it may not be suited for analyses
based on normality. To find out, we transform the population values to log base 10 units
using the L10 function. The input is:
STATISTICS
USE ourworld
LET logpop90=L10(pop_1990)
STEM logpop90
LOGPOP90, N = 57

 -0   5
      * * * Outside Values * * *
  0   01
  0   33
  0   445
  0 H 6667777
  0   888888899999
  1 M 00000111
  1   2222233
  1 H 445555
  1   777777
  1
  2   001
  2
For the untransformed values of the population, the stem-and-leaf plot identifies eight
outliers. Here, there is only one outlier. More important, however, is the fact that the
shape of the distribution for these transformed values is much more symmetric.
Subpopulations
Here, we stratify the values of LITERACY for countries grouped as European, Islamic,
and New World. The input is:
STATISTICS
USE ourworld
BY group$
STEM babymort pop_1990 literacy
BY
The following results are for:
 GROUP$        = Europe

LITERACY, N = 20

  83   0
  93   0
  95   0
       * * * Outside Values * * *
  97   0
  98 H 000
  99 M 00000000000
 100   00
The following results are for:
 GROUP$        = Islamic

Stem and Leaf Plot of variable: LITERACY, N = 16

Minimum:        11.6000
Lower hinge:    19.0000
Median:         28.5500
Upper hinge:    53.5000
Maximum:        70.0000

  1 H 1258
  2 M 05689
  3   1
  4
  5 H 0255
  6   5
  7   0
The following results are for:
 GROUP$        = NewWorld

Stem and Leaf Plot of variable: LITERACY, N = 21

Minimum:        23.0000
Lower hinge:    74.0000
Median:         85.6000
Upper hinge:    94.0000
Maximum:        99.0000

  2   3
      * * * Outside Values * * *
  5   0
  5   6
  6   3
  6   5
  7 H 44
  7   6
  8   0
  8 M 558
  9 H 03444
  9   8899
The literacy rates for Europe and the Islamic nations do not even overlap. The rates
range from 83% to 100% for the Europeans and 11.6% to 70% for the Islamics. Earlier,
11 countries were identified that have rates of 31% or less. From these stratified results,
we learn that 10 of the countries are Islamic and 1 (Haiti) is from the New World. The
Haitian rate (23%) is identified as an outlier with respect to the values of the other New
World countries.
Computation
All computations are in double precision.
Algorithms
SYSTAT uses a one-pass provisional algorithm (Spicer, 1972). Wilkinson and Dallal
(1977) summarize the performance of this algorithm versus those used in several
statistical packages.
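The idea behind a one-pass provisional algorithm is to update the mean and the sum of squared deviations as each case is read, rather than storing the data and making a second pass. The Python sketch below shows one common provisional-means update (Welford's form); it illustrates the approach, not the exact formulas SYSTAT implements.

def provisional_stats(values):
    """One-pass (provisional-means) computation of the mean and variance,
    in the spirit of Spicer (1972)."""
    n = 0
    mean = 0.0
    ss = 0.0                       # running sum of squared deviations
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n          # provisional mean updated per case
        ss += delta * (x - mean)
    variance = ss / (n - 1) if n > 1 else float("nan")
    return mean, variance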
References
Spicer, C. C. (1972). Calculation of power sums of deviations about the mean. Applied
Statistics, 21, 226–227.
Wilkinson, L. and Dallal, G. E. (1977). Accuracy of sample moments calculations among
widely used statistical programs. The American Statistician, 31, 128–131.
Chapter 10
Design of Experiments
Herb Stenson
Design of Experiments (DOE) generates design matrices for a variety of ANOVA and
mixture models. You can use Design of Experiments as both an online library and a
search engine for experimental designs, saving any design to a SYSTAT file. You can
run the associated experiment, add the values of a dependent variable to the same file,
and analyze the experimental data by using General Linear Model (or another
SYSTAT statistical procedure).
SYSTAT offers three methods for generating experimental designs: Classic DOE,
the DOE Wizard, and the DESIGN command.
n Classic DOE provides a standard dialog interface for generating the most popular designs.
n The DOE Wizard leads you through a series of questions defining the structure of the design. The wizard offers more designs than Classic DOE, including response surface and optimal designs. Optimization methods include the Fedorov, k-exchange, and coordinate exchange algorithms.
Statistical Background
The Research Problem
As an investigator interested in solving problems, you are faced with the task of
identifying good solutions. You do this by using what you already know about the
problem area to make a judgment about the solution(s). If you possess in-depth process
knowledge, then there is little work to be done; you simply apply that knowledge to the
problem at hand and derive a solution.
More common is the situation in which you have limited knowledge about the
factors involved and their interrelationships, so that any conjecture would be quite
uncertain and far from optimal. In these situations, the first step would be to enhance
your knowledge. This is usually done by empirical investigation; that is, by
systematically observing the factors and how they affect the outcome of interest. The
results of these observations become the data in your study.
Process problems usually have factors, or variables, that may affect the outcome,
and responses that measure the outcome of interest. The basic problem-solving
approach is to develop a model that helps you understand the specific relationships
between factors and responses. Such a model allows you to predict which factor values
will lead to a desired response, or outcome. These empirical data provide the statistical
basis used to generate models of your process.
Types of Investigation
You can think of any empirical investigation as falling into one of two broad classes:
experiment or observational study. The two classes have different properties and are
used to approach different types of problems.
Experiments
Experiments are studies in which the factors are under the direct control of the
experimenter. That is, the experimenter assigns certain values of the factors to each
run, or observation. The response(s) are recorded for each chosen combination of
factor levels.
Because the factors are being manipulated by the experimenter, the experimenter
can make inferences about causality. If assigning a certain temperature leads to a
decrease in the output of a chemical process, you can be fairly certain that temperature
really did cause the decrease because you assigned the temperature value while holding
other factors constant.
Unfortunately, experiments do have a drawback in that there are some situations in
which it is either impossible or impractical, or even unethical, to exercise control over
the factors of interest. In those situations, an observational study must be used.
Observational Studies
Observational studies use only minimal, if any, intervention by the observer on the
process. The observer merely observes and records changes in the response as the
factors undergo their natural variation. No attempt is made to control the factors.
Because the factors are not under the control of the experimenter, observational
studies are very limited in their ability to explain causal relationships. For example, if
you observe that shoe size and scholastic achievement show a strong relationship
among school children, can you infer that larger feet cause achievement? Of course
not. The truth of the matter is that both variables are most likely caused by a third
(unmeasured) variable: age. Older students have larger feet, and they have been in
school longer. If you could have some control over shoe size, you could make sure that
shoe sizes were evenly distributed across students of different ages, and you would be
in a much better position to make inferences about the causal relationship between
shoe size and achievement.
But of course it's silly to speak of controlling shoe size, since you can't change the
size of people's feet. This illustrates the strength of observational studies: they can be
employed where true experimental studies are impossible, for either ethical or practical
reasons.
Because the focus of this chapter is the design and analysis of experimental studies,
further references to observational studies will be minimal.
Completeness
By using a well-designed experiment, you will be able to discover the most important
relationships in your process. Lack of planning can lead to incomplete designs that
leave certain questions unanswered, confounding that causes confusion of two or more
effects so that they become statistically indistinguishable, and poor precision of
estimates.
Efficiency
Carefully planned experiments allow you to get the information you need at a fraction
of the cost of a poorly planned design. Content knowledge can be applied to select
specific effects of interest, and your experimental runs can be targeted to answer just
those effects. Runs are not wasted on testing effects you already understand well.
Insight
A well-designed experiment allows you to see patterns in the data that would be
difficult to spot in a simple table of hastily collected values. The mathematical model
you build based on your observations will be more reliable, more accurate, and more
informative if you use well-chosen run points from an appropriate experimental
design.
[Flowchart: Research Question -> Experimental Design -> Data Collection -> Analysis -> Interpretation -> New Knowledge, illustrating the steps of the investigative process]
n Response surface designs. These designs are useful when you want to find the
combination of factor values that gives the highest (or lowest) response.
n Mixture designs. These designs are useful when you want to find the ideal
proportions of ingredients for a mixture process. Mixture designs take into account
the fact that all the component proportions must sum to 1.0.
n Optimal designs. These designs are useful when you have enough information
available to give a very detailed specification of the model you want to test.
Because optimal designs are very flexible, you can use them in situations where no
standard design is available. Optimal designs are also useful when you want to
have explicit control over the type of efficiency maximized by the design.
Factorial Designs
In investigating the factors that affect a certain process, the basic building blocks of
your investigation are observations of the system under different conditions. You vary
the factors under your control and measure what happens to the outcome of the process.
The naive inquirer might use a haphazard, trial-and-error approach to testing the
factors. Of course, this approach can take a long time and many observations, or runs,
to give reasonable results (if it does at all), and, in fact, it may fail to reveal important
effects because of the lack of an investigative strategy.
Someone more familiar with scientific methodology might make systematic
comparisons of various levels of each factor, holding the others constant. However,
while this approach is more reliable than the trial-and-error approach, it can still cause
you to overlook important effects. Consider the following hypothetical response plot.
The contours indicate points of equal response.
[Contour plot of the hypothetical response over factors X1 and X2; the contours indicate points of equal response]
If you tried the one-at-a-time approach, your ability to accurately measure the effects
of the variables would depend on the initial settings you chose. For example, if you
chose the point indicated by the horizontal line as your fixed starting value for x2 as
you varied x1, you would conclude that the maximum response occurs when x1 = 47.
Then, you would fix x1 at 47 and vary x2, concluding that the maximum response
occurs when x2 = 98. The two following figures illustrate this problem. However, it
is clear from the previous contours that the maximum effect occurs where x1 = 100
and x2 = 220, or perhaps even somewhere outside the range that you've measured.
[Response plotted against X1 (X2 held constant at 98.0) and against X2 (X1 held constant at 47.0)]
This illustrates the importance of considering the factors simultaneously. The only way
to find the true effects of the factors on the response variable is to take measurements
at carefully planned combinations of the factor levels, as shown below. Such designs
are called factorial designs. A factorial design that could be used to explore the
hypothetical process would take measurements at high, medium, and low levels of
each factor, with all combinations of levels used in the design.
[The same contour plot with the factorial design points marked at all combinations of high, medium, and low levels of X1 and X2]
Factorial designs can be classified into two broad types: full (or complete) factorials
and fractional factorials, shown below. Full factorials (a) use observations at all
combinations of all factor levels. Full factorials give a lot of insight into the effects of the
the factors, particularly interactions, or joint effects of variables. Unfortunately, they
often require a large number of runs, which means that they can be expensive.
Fractional factorials (b) use only some combinations of factor levels. This means that
they are efficient, requiring fewer runs than their full factorial counterpart. However,
to gain this efficiency, they sacrifice some (or all) of their ability to measure interaction
effects. This makes them ill-suited to exploring the details of complex processes.
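As a simple illustration of the difference, the Python sketch below enumerates a full 2^3 factorial and then keeps the half fraction defined by I = ABC. This is a generic construction for illustration, not SYSTAT's generator.

from itertools import product

# Full 2^3 factorial in coded units (-1 / +1): every combination of levels.
full = list(product((-1, 1), repeat=3))

# A half fraction chosen with the defining relation I = ABC:
# keep only the runs in which the product A*B*C equals +1.
half = [(a, b, c) for (a, b, c) in full if a * b * c == 1]

print(len(full), "runs in the full factorial")   # 8
print(len(half), "runs in the half fraction")    # 4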
n Homogeneous fractional. These are fractional designs in which all factors have the
same number of levels.
n Box-Hunter. This is a set of fractional designs for two-level factors that can be
specified based on the number of factors and the number of runs (as a power of 2).
n Plackett-Burman. These designs are saturated (or nearly saturated) fractional
factorial designs based on orthogonal arrays. They are very efficient for estimating
main effects but rely on the absence of two-factor interactions.
n Taguchi. These designs are orthogonal arrays allowing for a maximum number of
main effects to be estimated from a minimum number of runs.
n Latin square. These designs are useful where you need to isolate the effects of one or more blocking (or "nuisance")
factors. In Latin square designs, all factors must have the same number of levels.
Graeco-Latin squares and hyper-Graeco-Latin squares can also be generated when
you need to isolate the effects of more than one "nuisance" variable.
The simplest model, a first-order (main effects) model, can be written as

y = μ + αi + βj + ... + ε

where y is the response variable and αi, βj, ... represent the treatment effects of the
factors. This model assumes that all interactions are negligible. These models are
useful for describing very simple processes and for analyzing fractional designs of low
resolution. They are also useful for analyzing screening designs, where the goal is not
necessarily to model all effects realistically but merely to identify influential factors for
further study.
The next level of model complexity, the second-order model, involves adding
two-factor interaction terms to the equation. Following is an example for a two-factor model:
yijk = μ + αi + βj + (αβ)ij + εijk

This model allows you to explore joint effects of factors taken in pairs. For example,
the term (αβ)ij allows you to see whether the effect of the factor α on y depends on
the level of β. If this term is significant, you can conclude that the effect of α does
indeed depend on the level of β.
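In the model matrix, the interaction column is simply the product of the two factor columns, as the short Python sketch below shows for a two-factor design in -1/+1 coding (an illustration, not SYSTAT output).

# Model matrix for a two-factor design with an interaction term: the
# (alpha*beta) column is the element-wise product of the coded A and B columns.
runs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
model_matrix = [(1, a, b, a * b) for (a, b) in runs]   # intercept, A, B, AB
for row in model_matrix:
    print(row)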
In many cases, the response surface must be considered in parts because when you
consider all of the possible values for the factors involved, the surface can be quite
complex. Because of this complexity, it is often not possible to build a mathematical
model that truly reflects the shape of the surface. Fortunately, when you look at
restricted portions of the response surface, they can usually be modeled successfully
with relatively simple equations.
With a well-chosen design and careful measurements, you can usually do a reasonably good job of predicting
response throughout the response surface neighborhood of interest.
When you make such predictions, however, you must accept the fact that the model
is not perfectthere are often imperfections in your measurements, and the
mathematical model almost never fits the true response function exactly. Thus, if you
were to conduct the experiment repeatedly, you would get slightly different answers
each time. The degree to which your predictions are expected to differ across multiple
experiments is known as the variance of prediction, or V( y ). The value of V( y )
depends on the design used and on where in the factor space you are calculating a
prediction. V( y ) increases as you get further from the observed data points. Of course,
you would like the portion of the design that produces the most precise predictions to
be near the optimum that you are trying to locate. Unfortunately, you usually don't
really know where the optimal value is (or in what direction it lies) when you start.
To deal with the fact that you don't know exactly where the optimum is, you can
use designs in which the variance of prediction depends only on the distance from the
center of the design, not on the direction from the center. Such designs are called
rotatable designs. First-order (linear) orthogonal models are always rotatable. Some
central composite designs are rotatable. (In SYSTAT, the distance from the center is
automatically chosen to ensure rotatability for unblocked designs. However, for
blocked designs, the distance is chosen to ensure orthogonality of blocks, which may
lead to nonrotatable designs.) In addition, some Box-Behnken designs are rotatable,
and most are nearly rotatable (meaning that directional differences in prediction
variance exist, but they are small). In general, three-level factorial response surface
designs are not rotatable. This means that care should be used before employing such
a design; the precision of your predictions may depend on the direction in which the
optimum lies.
design. For these designs, factors need to be measured at only three levels. Box-Behnken designs are also quite efficient in requiring relatively few runs.
The first-order (linear) model is expressed as

y = β0 + β1x1 + ... + βkxk

where k is the number of factors in the design. Similarly, the quadratic model is
expressed as

y = β0 + β1x1 + ... + βkxk + β11x1² + β12x1x2 + ... + β(k-1)k x(k-1)xk + βkk xk²
In either case, the estimated equation defines the response surface. This surface is often
plotted, either as a 3-D surface plot or a 2-D contour plot, to help the investigator
visualize the shape of the response surface.
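For example, a fitted two-factor quadratic model can be evaluated over a grid of factor settings to produce the values behind such a surface or contour plot. The coefficient values in this Python sketch are invented purely for illustration.

# Evaluate an assumed two-factor quadratic response surface on a grid.
b0, b1, b2, b11, b12, b22 = 10.0, 2.0, 1.5, -0.8, 0.5, -0.6

def y_hat(x1, x2):
    return (b0 + b1 * x1 + b2 * x2
            + b11 * x1 ** 2 + b12 * x1 * x2 + b22 * x2 ** 2)

grid = [(x1 / 10.0, x2 / 10.0) for x1 in range(-10, 11) for x2 in range(-10, 11)]
surface = {(x1, x2): y_hat(x1, x2) for (x1, x2) in grid}   # values to contour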
Mixture Designs
Suppose that you are trying to determine the best blend of ingredients or components
for your product. Initially, this appears to be a problem that can be addressed with a
straightforward response surface design. However, upon closer examination, you
discover that there is an additional consideration in this problem: the amounts of the
ingredients are inextricably linked with each other. For example, suppose that you are
trying to determine the best combination of pineapple and orange juices for a fruit
punch. Increasing the amount of orange juice means that the amount of pineapple juice
must be decreased, relative to the whole. (Of course, you could add more pineapple
juice as you add more orange juice, but this would simply increase the total amount of
punch. It would not alter the fundamental quality of the punch.) The problem is shown
in the following plot.
[Plot of the amounts of orange juice (OJ) and pineapple juice (PJ), each from 0.0 to 1.0; one-gallon blends lie along a straight line]
By specifying that the components are ingredients in a mixture, you limit the values
that the amounts of the components can take. All of the points corresponding to one-gallon blends lie on the line shown in the plot. You can describe the constraint with the
equation OJ + PJ = 1 gallon.
Now, suppose that you decide to add a third type of juice, watermelon juice, to the
blend. Of course, you still want the total amount of juice to be one gallon, but with
three factors you have a bit more flexibility in the mixtures. For example, if you want
to increase the amount of orange juice, you can decrease the amount of pineapple juice,
the amount of watermelon juice, or both. The constraint now becomes OJ + PJ + WJ
= 1 gallon. The combinations of juice amounts that satisfy this constraint lie in a
triangular plane within the unconstrained factor space.
[Three-dimensional plot of Orange Juice (OJ), Pineapple Juice (PJ), and Watermelon Juice (WJ); blends satisfying OJ + PJ + WJ = 1 lie on a triangular plane within the factor space]
The feasible values for a mixture comprise a (k - 1)-dimensional region within the k-dimensional factor space (indicated by the shaded triangle). This region is called a
simplex. The pure mixtures (made of only one component) are at the corners of the
simplex, and binary mixtures (mixtures of only two components) are along the edges.
The concept of the mixture simplex extends to higher-dimensional problems as well;
the simplex for a four-component problem is a three-dimensional regular tetrahedron,
and so on.
To generalize, you measure component amounts as proportions of the whole rather
than as absolute amounts. When you take this approach, it is clear that increasing the
proportion of one ingredient necessarily decreases the proportion(s) of one or more of
the others. There is a constraint that the sum of the ingredient proportions must equal
the whole. In the case of proportions, the whole would be denoted by 1.0, and the
constraint is expressed as
x1 + x2 + ... + xk = 1.0
where x1, ..., xk are the proportions of each of the k components in the mixture.
Because of this constraint, such problems require a special approach. This approach
includes using a special class of experimental designs, called mixture designs. These
designs take into account the fact that the component amounts must sum to 1.0.
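One standard family of mixture designs, the simplex-lattice, simply enumerates all component settings on an evenly spaced grid that satisfy the sum-to-one constraint. The Python sketch below generates a three-component lattice with five levels per component; it illustrates the idea and is not SYSTAT's generator.

from itertools import product
from fractions import Fraction

def simplex_lattice(k, levels):
    """All mixtures of k components whose proportions are multiples of
    1/(levels - 1) and sum to 1 -- a simple simplex-lattice sketch."""
    steps = [Fraction(i, levels - 1) for i in range(levels)]
    return [combo for combo in product(steps, repeat=k) if sum(combo) == 1]

for run in simplex_lattice(3, 5):
    print([float(p) for p in run])    # 15 runs: (1,0,0), (.75,.25,0), ...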
Thus, the number of distinct points is 1 less than 2 raised to the q power, where q is the
number of components. Centroid designs are useful for investigating mixtures where
incomplete mixtures (with at least one component absent) are of primary importance.
Axial. In an axial design with m components, each run consists of at least (m - 1) equal
proportions of the components. These designs include: mixtures composed of one
component; mixtures composed of (m - 1) components in equal proportions; and
mixtures with equal proportions of all components. Thus, if we asked for an axial
design with four factors (components), the mixtures in the model would consist of all
permutations of the set (1,0,0,0), all permutations of the set (5/8,1/8,1/8,1/8), all
permutations of the set (0,1/3,1/3,1/3), and the set (1/4,1/4,1/4,1/4).
Screen. Screening designs are reduced axial designs, omitting the mixtures that contain
all but one component. Thus, if we asked for a screening design with four factors
(components), the mixtures in the model would consist of all permutations of the set
(1,0,0,0), all permutations of the set (5/8,1/8,1/8,1/8), and the set (1/4,1/4,1/4,1/4).
Screening designs enable you to single out unimportant components from an array of
many potential components.
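The point sets described above are easy to enumerate directly. The Python sketch below builds the four-component axial design from the stated permutation sets and then drops the (m - 1)-component mixtures to obtain the screening design; again, this is an illustration rather than SYSTAT's implementation.

from fractions import Fraction
from itertools import permutations

def distinct_permutations(point):
    return sorted(set(permutations(point)))

one = Fraction(1)
# Point sets named in the text for a four-component axial design.
axial = (distinct_permutations((one, 0, 0, 0))
         + distinct_permutations((Fraction(5, 8),) + (Fraction(1, 8),) * 3)
         + distinct_permutations((0,) + (Fraction(1, 3),) * 3)
         + [(Fraction(1, 4),) * 4])            # 13 runs in total

# The screening design omits the (m - 1)-component mixtures, i.e. the
# permutations of (0, 1/3, 1/3, 1/3), leaving 9 runs.
screen = [p for p in axial if p.count(0) != 1]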
[Ternary plot of the three-component (OJ, PJ, WJ) mixture simplex, with each component ranging from 0.0 to 1.0]
The linear form of the mixture model is

y = β1x1 + β2x2 + ... + βkxk

and the quadratic form is

y = β1x1 + β2x2 + ... + βkxk + β12x1x2 + ... + β(k-1)k x(k-1)xk

for mixtures with k components. Notice that the quadratic form does not include
squared terms. Such terms would be redundant, since the square of a component can
be reexpressed as a function of the linear and cross-product terms. For example,

x1² = x1 ( 1 - x2 - ... - xk ) = x1 - x1x2 - ... - x1xk
The model is estimated using standard general linear modeling techniques. The
parameters can be tested (with a sufficient number of observations), and they can be
used to define the response function. The plot of this function can give visual insights
into the process under investigation and allow you to select the optimal combination
of components for your mixture.
Optimal Designs
In going through the process of designing experiments, you might ask yourself, "What
is the advantage of a designed experiment over a more haphazard approach to
collecting data?" The answer is that a carefully designed experiment will allow you to
estimate the specific model you have in mind for the process, and it will allow you to
do so efficiently. Efficiently in this context means that the model can be estimated with
high (or at least adequate) precision, with a manageable number of runs.
Through the years, statisticians have identified useful classes of research problems
and developed efficient experimental designs for each. Such classes of problems
include identifying important effects within a set of two-level factors (Box-Hunter
designs), optimizing a process using a quadratic surface (central composite or BoxBehnken designs), or optimizing a mixture process (mixture designs).
One of the standard designs may be appropriate for your research needs.
Sometimes, however, your research problem doesn't quite fit into the mold of these
standard designs. Perhaps you have specific ideas about which terms you want to
include in your model, or perhaps you cant afford the number of runs called for in the
standard design. The standard designs' efficiency is based on assumptions about the
model to be estimated, the number of runs to be collected, and so on. When you try to
run experiments that violate these assumptions, you lose some of the efficiency of the
design.
You may now be asking yourself, "Well, then, how do I find a design for my
idiosyncratic experiment? Isn't there a way that I can specify exactly what I want and
get an efficient design to test it?" The answer is yes; this is where the techniques of
optimal experimental design (often abbreviated to optimal design) come in. Optimal
design methods allow you to specify your model exactly (including number of runs)
and to choose a criterion for measuring efficiency. The design problem is then solved
by mathematical programming to find a design that maximizes the efficiency of the
design, given your specifications. The use of the word optimal to describe designs
generated in this manner means that we are optimizing the design for maximum
efficiency relative to the desired efficiency criterion.
Optimization Methods
First, you need to choose an optimization method. Different mathematical methods
(algorithms) are available for finding the design that optimizes the efficiency criterion.
Some of these methods require a candidate set of design points from which to choose
the points for the optimal design. Other methods do not require such a candidate set.
Three optimization methods are available:
Fedorov method. This method requires a predefined candidate set. It starts with an initial
design, and at each step it identifies a pair of pointsone from the design and one from
the candidate setto be exchanged. That is, the candidate point replaces the selected
design point to form a new design. The pair exchanged is the pair that shows the
greatest reduction in the optimality criterion when exchanged. This process repeats
until the algorithm converges.
K-exchange method. This method starts with a set of candidate points and an initial
design and exchanges the worst k points at each iteration in order to minimize the
objective function. Candidate points must come from a previously generated design.
Coordinate exchange method. This method does not require a candidate set. It starts
with an initial design based on a random starting point. At each iteration, k design
points are identified for exchange, and the coordinates of these points are adjusted one
by one to minimize the objective function. The fact that this method does not require a
candidate set makes it useful for problems with large factor spaces. Another advantage
of this method is that one can use either continuous or categorical variables, or a
mixture of both, in the model.
For the designs that require a candidate set, that set must be defined before you
generate your optimal design. The set of points must be in a file that was generated and
saved by the Design Wizard. You may eliminate undesirable rows before using the file
in the Fedorov or k-exchange method to generate an optimal design based on the
candidate design. The same requirements hold for any so-called starting design in a file
that is submitted by the user.
It is important to remember that these methods are iterative, based on starting
designs with a random component to them. Therefore, they will not always converge
on a design that is absolutely optimal; they may fall into a local minimum or saddle
point, or they may simply fail to converge within the allowed number of iterations.
That is why each method allows you to generate a design multiple times based on
different starting designs.
In most circumstances, these methods will give similar results. Using G-optimality can
take more time to compute, since each iteration involves both maximization and
minimization. In many situations, D-optimality will be a good choice because it is fast
and invariant to linear transformations. A-optimality is especially sensitive to the scale
of continuous factors, such that a design with factors having very different scales may
lead to problems generating a design.
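To make the D-optimality criterion concrete: for a set of runs expanded through the model, the criterion is the determinant of X'X, and an exchange algorithm repeatedly swaps runs in and out of the design to increase it. The Python sketch below just evaluates that determinant for an assumed six-parameter model (intercept, A, B, C, AB, AC); it is a bare-bones illustration, not SYSTAT's optimizer.

def xtx(design):
    """X'X for a design given as a list of model rows."""
    p = len(design[0])
    return [[sum(row[i] * row[j] for row in design) for j in range(p)]
            for i in range(p)]

def det(m):
    """Determinant by Gaussian elimination (adequate for a small X'X)."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d

def d_criterion(runs, model):
    """D-criterion |X'X| for runs expanded through a model function."""
    return det(xtx([model(run) for run in runs]))

# Six-parameter model: intercept, A, B, C, AB, AC (0/1 coded factors).
model = lambda r: (1, r[0], r[1], r[2], r[0] * r[1], r[0] * r[2])
full = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
print(d_criterion(full, model))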
There is one important difference, however. For an optimal experiment, you specify
the model for the experiment before you generate the design. This is necessary to
ensure that the design is optimized for your particular model, rather than an assumed
model (such as a complete factorial or a full quadratic model). This means that for
optimal designs, the form of the equation to be estimated is an integral part of the
experimental design.
Let's consider a simple example: suppose that you have three two-level factors (call
them A, B, and C), and you want to perform tests of the following effects: A, B, C, AB,
and AC. You could use the usual 2^3 factorial design, which would give you the
following runs:
A   B   C
0   0   0
0   0   1
0   1   0
0   1   1
1   0   0
1   0   1
1   1   0
1   1   1
Now, suppose that you want to estimate the model in only six runs. There is no standard
design for this, so you must use an optimal design. Using the coordinate exchange
method with the D-optimality criterion yields the following design:
A   B   C
1   1   0
0   0   0
1   1   1
1   0   0
0   0   1
0   1   0
However, if we change the form of the model slightly, so that we are asking for the A,
B, C, AB, and BC effects, we get a slightly different design:
A   B   C
1   0   1
0   1   1
0   1   0
0   0   0
0   0   1
1   1   0
In general, a design generated based on one model will not be a good design for a
different model. The implication of this is that the model used to generate the design
places limits on the model that you estimate in your data analysis. In most cases, the
two models will be the same, although you may sometimes want to omit terms from
the analysis that were in the original model used to generate the design.
Choosing a Design
Deciding which design to use is an important part of the experimental design process.
The answer will depend on various aspects of the research problem at hand, such as:
n What type of knowledge do you want to gain from the research?
n How much background knowledge can you bring to bear on the question?
n How many factors are involved in the process of interest?
n How many different ways do you want to measure the outcome?
n What are the constraints, if any, on your factors?
n What is your criterion for the best design?
n Will you have to use the results of the experiment to convince others of your conclusions?
Setting Priorities
Consider what is really important in your study. Do you need the highest precision
possible, regardless of what it takes? Or are you more concerned about controlling
costs, even if it means settling for an approximate model? Would the cost of
overlooking an effect be greater than the cost of including the effect in your model?
Giving careful thought to questions like these will help you choose a design that
satisfies your criteria and helps you accomplish your research goals.
The Design of Experiments Wizard offers nine different design types: General
Factorial, Box-Hunter, Latin Square, Taguchi, Plackett-Burman, Box-Behnken,
Central Composite, Optimal, and Mixture Model. After selecting a design type, a series
of dialogs prompts for design specifications before generating a final design matrix.
These specifications typically include the number of factors involved, as well as the
number of levels for each factor.
Replications. For any design created by the Design Wizard, replications can be saved
to a file. By default, SYSTAT saves the design without replications. If you request n
copies of a design, the complete design will be repeated n times in the saved file (global
replication). If local replications are desired, simply sort the saved file on the variable
named RUN to group replications by run number. Replications do not appear on the
output screen.
Note: It is not necessary to have a data file open to use Design of Experiments.
Classic DOE offers a subset of the designs available using the Design Wizard,
including factorial, Box-Hunter, Latin Square, Taguchi, Plackett, Box-Behnken, and
mixture designs. In contrast to the wizard, classic DOE uses a single dialog to define
all design settings. The following options are available:
Levels. For factorial, Latin, and mixture designs, this is the number of levels for the
factors. Factorial designs are limited to either two or three levels per factor.
Factors. For factorial, Box-Hunter, Box-Behnken, and lattice mixture designs, this is the
number of factors, or independent variables.
Runs. For Plackett and Box-Hunter designs, this is the number of runs.
Replications. For all designs except Box-Behnken and mixture, this is the number of
replications.
Mixture type. For mixture designs, you can specify a mixture type from the drop-down
list. Select either Centroid, Lattice, Axial, or Screen.
Taguchi type. For Taguchi designs, you can select a Taguchi type from the drop-down
list.
Using Commands
With commands:
DESIGN
SAVE filename
FACTORIAL / FACTORS=n REPS=n LETTERS RAND,
LEVELS = 2 or 3
BOXHUNTER / FACTORS=n RUNS=n REPS=n LETTERS RAND
LATIN / LEVELS=n SQUARE REPS=n LETTERS RAND
TAGUCHI / TYPE=design REPS=n LETTERS RAND
PLACKETT / RUNS=n REPS=n LETTERS RAND
BOXBEHNKEN / FACTORS=n BLOCK LETTERS RAND
MIXTURE / TYPE=LATTICE or CENTROID or AXIAL
or SCREEN,
FACTORS=n LEVELS=n RAND LETTERS
Note: Some designs generated by the Design Wizard cannot be created using
commands.
Usage Considerations
Types of data. No data file is needed to use Design of Experiments.
Print options. For Box-Hunter designs, using PRINT=LONG in Classic DOE yields a
listing of the generators (confounded effects) for the design. For Taguchi designs, a
table defining the interaction is available.
Quick Graphs. No Quick Graphs are produced.
Saving files. The design can be saved to a file.
Examples
Example 1
Full Factorial Designs
The DOE Wizard input for a (2 x 2 x 2) design is:
Wizard Prompt                                             Response
Design Type                                               General Factorial
Choose a type of design:                                  Full Factorial Design
Divide the design into incomplete blocks?                 No
Enter the number of factors desired:                      3
Is the number of levels to be the same for all factors?   Yes
Enter number of levels:                                   2
Display the factors for this design?                      Yes
Save the design to a file?                                No
3 Factors, 8 Runs

[Listing of the eight runs of the 2 x 2 x 2 full factorial design]

Using commands:
DESIGN
FACTORIAL / FACTORS=3 LEVELS=2
Example 2
Fractional Factorial Design
The DOE Wizard input for a (2 x 2 x 2 x 2) fractional factorial design in which the two-way interactions A*B and A*C must be estimable is:
Wizard Prompt                                             Response
Design Type                                               General Factorial
Choose a type of design:                                  Fractional Factorial Design
Divide the design into incomplete blocks?                 No
Enter the number of factors desired:                      4
Is the number of levels to be the same for all factors?   Yes
Enter number of levels:                                   2
Please choose:                                            Require that specific effects be estimable
Choose a Search Criterion                                 Automatically find the smallest design consistent with my criteria
4 Factors, 8 Runs

[Listing of the eight-run fractional factorial design matrix]
SYSTAT assumes that the main effects of any design should always be estimated.
Notice, however, that the defining relation avoids confounding the interaction of A
with any of the other factors, as requested by specifying the effects to be estimated
(A*B, A*C) and effects that should not be confounded even though they are not to be
estimated (A*D).
Example 3
Box-Hunter Fractional Factorial Design
To generate a (2 x 2 x 2) Box-Hunter fractional factorial, the input is:
Wizard Prompt                                             Response
Design Type                                               Box-Hunter
Enter the number of factors desired:                      3
Enter the total number of cells for the entire design:    4
Display the factors for this design?                      Yes
Save the design to a file?                                No
4 Runs, 3 Factors

[Listing of the four-run design matrix of -1/+1 factor settings]
Aliases
For 7 two-level factors, the number of cells (runs) for a complete factorial is 2^7 = 128.
The following example shows the smallest fractional factorial for estimating main
effects. The design codes for the first three factors generate the last four. The input is:
Wizard Prompt                                             Response
Design Type                                               Box-Hunter
Enter the number of factors desired:                      7
Enter the total number of cells for the entire design:    8
Display the factors for this design?                      Yes
Save the design to a file?                                No
Identity = A * B * D = A * C * E = B * C * F = A * B * C * G
Box-Hunter Design: 8 Runs, 7 Factors

[Listing of the eight-run design matrix of -1/+1 settings for factors A through G]
The main effect for factor D is confounded with the interaction between factors A and
B; the main effect for factor E is confounded with the interaction between factors A and
C; and so on.
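For reference, the same 8-run, 7-factor fraction should also be obtainable from the command language. This is only a sketch based on the BOXHUNTER syntax summarized at the start of the chapter; the LETTERS option is assumed here to label the factors with the letters A through G used in the alias listing above:

DESIGN
BOXHUNTER / FACTORS=7 RUNS=8 LETTERS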
Example 4
Latin Squares
To generate a Latin square when each factor has four levels, enter the following DOE
Wizard responses:
Wizard Prompt                                              Response
Design Type                                                Latin Square
The types available are:                                   Ordinary Latin Square
Number of levels:                                          4
Randomize the design?                                      No
Display the square?                                        Yes
Display the factors for this design?                       Yes
Save the design to a file?                                 No
[Ordinary Latin square with 4 levels: the square and the factor settings for the 16 runs are displayed.]
The SQUARE and LETTERS options control this display; omitting SQUARE prevents the Latin square from appearing in the output.
Permutations
To randomly assign the factors to the cells, the input is:
Wizard Prompt                                              Response
Design Type                                                Latin Square
The types available are:                                   Ordinary Latin Square
Number of levels:                                          4
Randomize the design?                                      Yes
Display the square?                                        Yes
Display the factors for this design?                       No
Save the design to a file?                                 No
Using commands:
DESIGN
LATIN / LEVELS=4 SQUARE LETTERS RAND
Example 5
Taguchi Design
To obtain a Taguchi L12 design with 11 factors, the DOE Wizard input is:
Wizard Prompt                                              Response
Design Type                                                Taguchi
Taguchi Design Type:                                       L12
Display the factors for this design?                       Yes
Save the design to a file?                                 No
Display confounding matrix?                                No
[Taguchi L12 design: 12 runs, 11 two-level factors.]
To display the matrix of confoundings, request an L16 design:

Wizard Prompt                                              Response
Design Type                                                Taguchi
Taguchi Design Type:                                       L16
Display the factors for this design?                       Yes
Save the design to a file?                                 No
Display confounding matrix?                                Yes
[Taguchi L16 design with its confounding matrix: a triangular table that lists, for each pair of factor columns, the column whose pattern matches their interaction.]
The matrix of confoundings identifies the factor pattern associated with the interaction
between the row and column factors. For example, the factor pattern for the interaction
between factors 6 and 8 is identical to the pattern for factor 14 (N).
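A command-language equivalent is sketched below; it assumes the design type is passed to the TYPE option by its name (L16), which is an assumption to verify against the TAGUCHI syntax summary for your installation:

DESIGN
TAGUCHI / TYPE=L16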
Example 6
Plackett-Burman Design
To generate a Plackett-Burman design consisting of 11 two-level factors, the DOE
Wizard input is:
Wizard Prompt                                              Response
Design Type                                                Plackett-Burman
Number of levels in design:                                2
Runs per replication                                       12
Display the factors for this design?                       Yes
Save the design to a file?                                 No
[Plackett-Burman design: 12 runs, 11 two-level factors.]
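Using the PLACKETT syntax from the command summary, a sketch of the equivalent command input is shown below; RUNS sets the 12 runs per replication, and the 11 two-level factors presumably follow from the run count:

DESIGN
PLACKETT / RUNS=12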
Example 7
Box-Behnken Design
Each factor in this example has three levels. The DOE wizard input is:
Wizard Prompt                                              Response
Design Type                                                Box-Behnken
Number of Factors                                          3
Display the factors for this design?                       Yes
Save the design to a file?                                 No
[Box-Behnken design: 3 factors, 15 runs; each factor takes the levels -1, 0, and 1, and the design includes center runs.]
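The equivalent command input, sketched from the BOXBEHNKEN syntax listed earlier in this chapter, is simply:

DESIGN
BOXBEHNKEN / FACTORS=3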
Example 8
Mixture Design
We illustrate a lattice mixture design in which each of the three factors has five levels;
that is, each component of the mixture is 0%, 25%, 50%, 75%, or 100% of the mixture
for a given run, subject to the restriction that the sum of the percentages is 100. To
generate the design for this situation using the DOE Wizard, enter the following
responses at the corresponding prompt:
Wizard Prompt                                              Response
Design Type                                                Mixture Model
Are there to be constraints for any component(s)?          No
The possible kinds of unconstrained design are:            Lattice
Enter the number of mixture components:                    3
Enter the number of levels for each component:             5
Display the factors for this design?                       Yes
Save the design to a file?                                 No
3 Factors, 15 Runs, 5 Levels

           Component
RUN       1       2       3
  1   1.000    .000    .000
  2    .000   1.000    .000
  3    .000    .000   1.000
  4    .750    .250    .000
  5    .750    .000    .250
  6    .000    .750    .250
  7    .500    .500    .000
  8    .500    .000    .500
  9    .000    .500    .500
 10    .250    .750    .000
 11    .250    .000    .750
 12    .000    .250    .750
 13    .500    .250    .250
 14    .250    .500    .250
 15    .250    .250    .500
Using commands:
DESIGN
MIXTURE / TYPE=LATTICE, FACTORS=3, LEVELS=5
After collecting your data, you may want to display it in a triangular scatterplot.
Example 9
Mixture Design with Constraints
This example is adapted from an experiment reported in Cornell (1990, p. 265). The
problem concerns the mixture of three plasticizers in the production of vinyl for car
seats. We know that the combination of plasticizers must make up 79.5% of the
mixture. There are further constraints on each of the plasticizers:
32.5% <= P1 <= 67.5%
0% <= P2 <= 20.0%
12.0% <= P3 <= 21.8%
Because we are interested in only the plasticizers, we can model them separately from
the other components in the overall process. Taking this approach, we can
reparameterize the components by dividing by 79.5%, giving
0.409 <= A <= 0.849
0 <= B <= 0.252
0.151 <= C <= 0.274
We want to be sure that the design points span the feasible region adequately. To
generate the design using the DOE Wizard, the responses to the prompts follow:
Wizard Prompt                                                        Response
Design Type                                                          Mixture Model
Are there to be constraints for any component(s)?                    Yes
The possible kinds of constrained design are:                        Extreme vertices plus centroids
Enter the number of mixture components:                              3
Enter the maximum dimension to be used to compute centroids:         1
How many such constraints do you wish to have?                       5
Constraint 1: Enter the coefficient for factor 1:                    1
Constraint 1: Enter the coefficient for factor 2:                    0
Constraint 1: Enter the coefficient for factor 3:                    0
Constraint 1: Enter an additive constant:                            -.409
Constraint 2: Enter the coefficient for factor 1:                    -1
Constraint 2: Enter the coefficient for factor 2:                    0
Constraint 2: Enter the coefficient for factor 3:                    0
Constraint 2: Enter an additive constant:                            .849
Constraint 3: Enter the coefficient for factor 1:                    0
Constraint 3: Enter the coefficient for factor 2:                    -1
Constraint 3: Enter the coefficient for factor 3:                    0
Constraint 3: Enter an additive constant:                            .252
Constraint 4: Enter the coefficient for factor 1:                    0
Constraint 4: Enter the coefficient for factor 2:                    0
Constraint 4: Enter the coefficient for factor 3:                    1
Constraint 4: Enter an additive constant:                            -.151
Constraint 5: Enter the coefficient for factor 1:                    0
Constraint 5: Enter the coefficient for factor 2:                    0
Constraint 5: Enter the coefficient for factor 3:                    -1
Constraint 5: Enter an additive constant:                            .274
Specify the tolerance for checking constraints and
  duplication of points:                                             .00001
Display the factors for this design?                                 Yes
Save the design to a file?                                           No
3 Factors, 9 Runs, 4 Vertices

           Component
RUN       1       2       3
  1    .849    .000    .151
  2    .597    .252    .151
  3    .726    .000    .274
  4    .474    .252    .274
  5    .787    .000    .213
  6    .535    .252    .213
  7    .723    .126    .151
  8    .600    .126    .274
  9    .661    .126    .213
The design contains nine runs: four points at the extreme vertices of the feasible region,
four points at the edge centroids, and one point at the overall centroid. The following
plot displays the constrained region for the mixture as a blue parallelogram with the
actual design points represented as red filled circles.
[Triangular plot of the mixture components (axes A and B shown): the constrained region appears as a parallelogram, with the nine design points marked.]
Example 10
Central Composite Response Surface Design
In an industrial experiment reported by Aia et al. (1961), the authors investigated the
response surface of a chemical process for producing dihydrated calcium hydrogen
orthophosphate (CaHPO4·2H2O). The factors of interest are the ratio of NH3 to CaCl2
in the calcium chloride solution, the addition time of the NH3-CaCl2 mixture, and the
beginning pH of the NH4H2PO4 solution used. We will now see how this experiment
would be designed using the DOE Wizard.
For efficiency and rotatability, we use a central composite design with three factors.
The central composite design consists of a 2^k factorial (or fraction thereof), a set of 2k
axial (or star) points on the axes of the design space, and some number of center
points. The distance between the axial points and the center of the design determines
important properties of the design. In SYSTAT, the distance used ensures rotatability
for unblocked designs. For blocked designs, the distance ensures orthogonality of
blocks.
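As a point of reference (this is the standard rotatability rule, not a SYSTAT-specific formula): with k factors and a full factorial cube, the axial distance that makes the design rotatable is the fourth root of the number of cube points. For the three-factor design generated below,

axial distance = (2^3)^(1/4) = 8^0.25, which is approximately 1.682,

and this is the value that appears for the star points in the output.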
The choice of number of center points hinges on desired properties of the design.
Orthogonal designs (designs in which the factors are uncorrelated) minimize the
average variance of prediction of the response surface equation. However, in some
cases, you may decide that it is more important to have the variance of predictions be
nearly constant throughout most of the experimental region, even if the overall
variance of predictions is increased somewhat. In such situations, we can use designs
in which the variance of predictions is the same at the center of the design as it is at any
point one unit distant from the center. This property of equal variance between the
center of the design and points one unit from the center is called uniform precision. In
this example, we sacrifice orthogonality in favor of uniform precision. Therefore, we
use six center points instead of the nine points required to make the design nearly
orthogonal. (A table of orthogonal and uniform precision designs with appropriate
numbers of center points can be found in Montgomery, 1991, p. 546.)
The input to generate the central composite design follows:
Wizard Prompt                                                        Response
Design Type                                                          Central Composite
Enter the number of factors desired:                                 3
Are the cube and star portions of the design to be separate blocks?  No
Enter number of center points desired:                               6
Display the factors for this design?                                 Yes
Save the design to a file?                                           No
           Factor
RUN         1        2        3
  1    -1.000   -1.000   -1.000
  2    -1.000   -1.000    1.000
  3    -1.000    1.000   -1.000
  4    -1.000    1.000    1.000
  5     1.000   -1.000   -1.000
  6     1.000   -1.000    1.000
  7     1.000    1.000   -1.000
  8     1.000    1.000    1.000
  9    -1.682     .000     .000
 10     1.682     .000     .000
 11      .000   -1.682     .000
 12      .000    1.682     .000
 13      .000     .000   -1.682
 14      .000     .000    1.682
 15      .000     .000     .000
 16      .000     .000     .000
 17      .000     .000     .000
 18      .000     .000     .000
 19      .000     .000     .000
 20      .000     .000     .000
In the central composite design, each factor is measured at five different levels. The
runs with no zeros for the factors are the factorial (cube) points, the runs with only
one nonzero factor are the axial (star) points, and the runs with all zeros are the center
points.
After collecting data according to this design, fit a response surface to analyze the
results.
Example 11
Optimal Designs: Coordinate Exchange
Consider a situation in which you want to compute a response surface but your
resources are very limited. Assume that you have three continuous factors but can
afford only 12 runs. This number of runs is not enough for any of the standard response
surface models. However, you can generate a design with 12 runs that will allow you
to estimate the effects of interest using an optimal design.
To generate the design using the DOE Wizard, the responses to the prompts follow:
Wizard Prompt                                                        Response
Design Type                                                          Optimal
Choose the method to use:                                            Coordinate Exchange
Choose the type of optimality desired:                               D-optimality
Specify the number of points to replace in a single iteration:       1
Specify the maximum number of iterations within a trial:             100
Specify the relative convergence tolerance:                          .00001
Specify the number of trials to be run:                              3
Random number seed:                                                  131
The starting design is to be:                                        Generated by the program.
Enter the number of factors desired:                                 3
How many points (runs) are desired?                                  12
The variables in the design are:                                     All continuous

The remaining prompts set the lower and upper limits for each of the three factors (lower limit = -1, upper limit = 1 for all three), include the additional model terms A*A, B*B, C*C, A*B, A*C, B*C, and A*B*C, and answer the remaining display and save questions (Yes, No, Yes, No, Yes, No).
12 Runs, 3 Factors

           Factor
RUN         1        2        3
  1    -1.000    1.000   -0.046
  2    -0.038   -0.000   -1.000
  3    -1.000    1.000   -1.000
  4     1.000    1.000   -1.000
  5     1.000    1.000    1.000
  6    -1.000   -1.000   -1.000
  7     1.000   -1.000   -1.000
  8    -1.000    1.000    1.000
  9     1.000   -0.046    0.001
 10    -1.000   -1.000    1.000
 11     0.081   -1.000   -0.036
 12     1.000   -1.000    1.000
The points shown here were generated from a particular run of the algorithm. Since the initial design depends on a randomly chosen starting point, your design may vary slightly from the design shown here. Your design should share several characteristics with this one, however. First, notice that most values appear to be very close to one of three values: -1, 0, or +1. For the purposes of conceptual discussion, we can act as if the values were rounded to the nearest integer. We can see that the design includes the eight corners of the design space (the runs where all values are either -1 or +1). The design also includes three points that are face centers (runs where two values are near 0), and one edge point (where only one value is near 0).
This design will allow you to estimate all first- and second-order effects in your model. Of course, you will not have as much precision as you would if you had used a Box-Behnken or central composite design, because you don't have as much information to work with. You also lose some of the other advantages of the standard designs, such as rotatability. However, because the design is optimized with respect to generalized variance of parameter estimates, you will be getting as much information as you can out of your 12 runs.
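For background (this is the standard definition, not a statement about SYSTAT's internal algorithm): a D-optimal design chooses the runs to maximize the determinant of the information matrix,

maximize |X'X|,

where X is the model matrix with one column per term to be estimated (here the constant, A, B, C, A*A, B*B, C*C, A*B, A*C, B*C, and A*B*C). Maximizing |X'X| is equivalent to minimizing the generalized variance of the parameter estimates, which is why the text describes the design as optimized with respect to generalized variance.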
References
Aia, M. A., Goldsmith, R. L., and Mooney, R. W. (1961). Precipitating Stoichiometric CaHPO4·2H2O. Industrial and Engineering Chemistry, 53, 55–57.
Box, G. E. P., and Behnken, D. W. (1960). Some new three level designs for the study of quantitative variables. Technometrics, 2, 455–476.
Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978). Statistics for Experimenters. New York: Wiley.
Cochran, W. G. and Cox, G. M. (1957). Experimental designs, 2nd ed. New York: John Wiley & Sons, Inc.
Cornell, J. A. (1990). Experiments with Mixtures. New York: Wiley.
Fedorov, V. V. (1972). Theory of Optimal Experiments. New York: Academic Press.
Galil, Z., and Kiefer, J. (1980). Time- and space-saving computer methods, related to Mitchell's DETMAX, for finding D-optimum designs. Technometrics, 21, 301–313.
John, P. W. M. (1971). Statistical Design and Analysis of Experiments. New York: Macmillan.
_____. (1990). Statistical Methods in Engineering and Quality Assurance. New York: Wiley.
Johnson, M. E., and Nachtsheim, C. J. (1983). Some guidelines for constructing exact D-optimal designs on convex design spaces. Technometrics, 25, 271–277.
Meyer, R. K., and Nachtsheim, C. J. (1995). The coordinate-exchange algorithm for constructing exact optimal designs. Technometrics, 37, 60–69.
Montgomery, D. C. (1991). Design and Analysis of Experiments. New York: Wiley.
Plackett, R. L., and Burman, J. P. (1946). The design of optimum multifactorial experiments. Biometrika, 33, 305–325.
Schneider, A. M., and Stockett, A. L. (1963). An experiment to select optimum operating conditions on the basis of arbitrary preference ratings. Chemical Engineering Progress Symposium Series, No. 42, Vol. 59.
Taguchi, G. (1986). Introduction to Quality Engineering. Tokyo: Asian Productivity Organization.
Taguchi, G. (1987). System of experimental design (2 volumes). New York: UNIPUB/Kraus International Publications.
Chapter 11
Discriminant Analysis
Laszlo Engelman
Statistical Background
When we have categorical variables in a model, it is often because we are trying to
classify cases; that is, what group does someone or something belong to? For example,
we might want to know whether someone with a grade point average (GPA) of 3.5 and
an Advanced Psychology Test score of 600 is more like the group of graduate students
successfully completing a Ph.D. or more like the group that fails. Or, we might want
to know whether an object with a plastic handle and no concave surfaces is more like
a wrench or a screwdriver.
Once we attempt to classify, our attention turns from parameters (coefficients) in a
model to the consequences of classification. We now want to know what proportion of
subjects will be classified correctly and what proportion incorrectly. Discriminant
analysis is one method for answering these questions.
The shape of the hills was computed from a bivariate normal distribution using the covariance matrix averaged within groups. We've plotted this figure this way to show you that this model is like pie-in-the-sky if you use the information in the data below to compute the shape of these hills. As you can see, there is a lot of smoothing of the data going on, and if one or two data values in the scatterplot unduly influence the shape of the hills above, you will have an unrepresentative model when you try to use it on new samples.
How do we classify a new case into one group or another? Look at the figure again. The new case could belong to one or the other group. It's more likely to belong to the closer group, however. The simple way to find how far this case is from the center of each group would be to take a direct walk from the new case to the center of each group in the data plot.
Instead of walking in sample data space below, however, we must climb the hills of
our theoretical model above when using the normal classification model. In other
words, we will use our theoretical model to calculate distances. The covariance matrix
we used to draw the hills in the figure makes distances depend on the direction we are
heading. The distance to a group is thus proportional to the altitude (not the horizontal
distance) we must climb to get to the top of the corresponding hill.
Because these hills can be oblong in shape, it is possible to be quite far from the top
of the hill as the crow flies, yet have little altitude to cover in a climb. Conversely, it is
possible to be close to the center of the hill and have a steep climb to get to the top.
Discriminant analysis adjusts for the covariance that causes these eccentricities in hill
shape. That is why we need the covariance matrix in the first place.
So much for the geometric representation. What do the numbers look like? Let's
look at how to set up the problem with SYSTAT. The input is:
DISCRIM
USE ADMIT
PRINT LONG
MODEL PHD = GRE,GPA
ESTIMATE
The output begins with the group frequencies and means (group 1: 51 cases, mean GPA 4.423 and mean GRE 590.490; group 2: 29 cases, mean GPA 4.639 and mean GRE 643.448), followed by the pooled within-group and total covariance and correlation matrices and the between-groups F-matrix (F = 9.469 with 2 and 77 degrees of freedom; Wilks' lambda = 0.8026, prob = 0.0002).

Classification functions
------------------------
                      1           2
Constant       -133.910    -150.231
GPA              44.818      46.920
GRE               0.116       0.127

[Separate F statistics for each variable, the Classification matrix, and the Jackknifed classification matrix follow.]

Eigen        Canonical       Cumulative proportion
values       correlations    of total dispersion
---------    ------------    ---------------------
 0.246          0.444              1.000

Wilks' lambda =            0.803    Approx. F =  9.469    DF = 2, 77    p-tail = 0.0002
Pillai's trace =           0.197    Approx. F =  9.469    DF = 2, 77    p-tail = 0.0002
Lawley-Hotelling trace =   0.246    Approx. F =  9.469    DF = 2, 77    p-tail = 0.0002
There's a lot to follow on this output. The counts and means per group are shown first. Next comes the Pooled within covariance matrix, computed by averaging the separate-group covariance matrices, weighting by group size. The Total covariance matrix ignores the groups. It includes variation due to the group separation. These are the same matrices found in the MANOVA output with PRINT=LONG. The Between groups F-matrix shows the F value for testing the difference between each pair of groups on all the variables (GPA and GRE). The Wilks' lambda is for the multivariate test of dispersion among all the groups on all the variables, just as in MANOVA. Each case is classified by our model into the group whose classification function yields the largest score. Each function is like a regression equation. We compute the predicted value of each equation for a case's values on GPA and GRE and classify the case into the group whose function yields the largest value.
Next come the separate F statistics for each variable and the Classification matrix. The goodness of classification is comparable to that for the PROBIT model. We did a little worse with the No Ph.D. group and a little better with the Ph.D. The Jackknifed classification matrix is an attempt to approximate cross-validation. It will tend to be somewhat optimistic, however, because it uses only information from the current sample, leaving out one case at a time and classifying it with functions computed from the remainder. There is no substitute for trying the model on new data.
Finally, the program prints the same information produced in a MANOVA by SYSTAT's MGLH (GLM and ANOVA). The multivariate test statistics show the groups are significantly different on GPA and GRE taken together.
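To make the classification functions concrete, here is the arithmetic for the case with a GPA of 3.5 and a GRE of 600 mentioned at the start of the chapter, using the coefficients printed above:

Group 1:  -133.910 + 44.818*3.5 + 0.116*600 = 92.55
Group 2:  -150.231 + 46.920*3.5 + 0.127*600 = 90.19

Because the group 1 function yields the larger value, this case would be classified into group 1.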
[Scatterplot of GPA (3.5 to 5.5) against GRE (400 to 800), with each case labeled Y (Ph.D.) or N (No Ph.D.); the canonical discriminant axis and the perpendicular dashed classification line are overlaid.]
Either of these variables separates the groups somewhat. The diagonal line underlying
the two diagonal normal distributions represents a linear combination of these two
variables. It is computed using the canonical discriminant functions in the output.
These are the same as the canonical coefficients produced by MGLH. Before applying
these coefficients, the variables must be standardized by the within-group standard
deviations. Finally, the dashed line perpendicular to this diagonal cuts the observations
into two groups: those to the left and those to the right of the dashed line.
You can see that this new canonical variable and its perpendicular dashed line are
an orthogonal (right-angle-preserving) rotation of the original axes. The separation of
the two groups using normal distributions drawn on the rotated canonical variable is
slightly better than that for either variable alone. To classify on the linear discriminant
axis, make the mean on this new variable 0 (halfway between the two diagonal normal
curves). Then add a scale along the diagonal, running from negative to positive. If we
do this, then any observations with negative scores on this diagonal scale will be
classified into the No Ph.D. group (to the left of the dashed perpendicular bisector) and
those with positive scores into the Ph.D. (to the right). All Ys to the left of the dashed
line and Ns to the right are misclassifications. Try rotating these axes any other way
to get a better count of correctly classified cases (watch out for ties). The linear
discriminant function is the best rotation.
Using this linear discriminant function variable, we get the same classifications we
got with the Mahalanobis distance method. Before computers, this was the preferred
method for classifying because the computations are simpler.
The two Z variables are the raw scores minus the overall mean divided by the within-groups standard deviations. If Fz is less than 0, classify No Ph.D.; otherwise, classify
Ph.D.
As we mentioned, the Mahalanobis method and the linear discriminant function
method are equivalent. This is somewhat evident in the figure. The intersection of the
two hills is a straight line running from the northwest to the southeast corner in the
same orientation as the dashed line. Any point to the left of this line will be closer to
the top of the left hill, and any point to the right will be closer to the top of the right hill.
Prior Probabilities
Our sample contained fewer Ph.D.s than No Ph.D.s. If we want to use our discriminant
model to classify new cases and if we believe that this difference in sample sizes
reflects proportions in the population, then we can adjust our formula to favor No
Ph.D.s. In other words, we can make the prior probabilities (assuming we know
nothing about GRE and GPA scores) favor a No Ph.D. classification. We can do this
by adding the option
PRIORS = 0.625, 0.375
to the MODEL command. Do not be tempted to use this method as a way of improving
your classification table. If the probabilities you choose do not reflect real population
differences, then new samples will on average be classified worse. It would make sense
in our case because we happen to know that more people in our department tend to drop
out than stay for the Ph.D.
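A sketch of the input with the priors added, assuming the option is placed on the MODEL statement as described in the text (the command summary later in this chapter also lists PRIORS as a separate statement):

DISCRIM
USE ADMIT
MODEL PHD = GRE,GPA / PRIORS = 0.625, 0.375
ESTIMATE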
You might have guessed that the default setting is for prior probabilities to be equal
(both 0.5). In the last figure, this makes the dashed line run halfway between the means
of the two groups on the discriminant axis. By changing the priors, we move this
dashed line (the normal distributions stay in the same place).
Multiple Groups
The discriminant model generalizes to more than two groups. Imagine, for example,
three hills in the first figure. All the distances and classifications are computed in the
same manner. The posterior probabilities for classifying cases are computed by
comparing three distances rather than two.
The multiple group (canonical) discriminant model yields more than one
discriminant axis. For three groups, we get two sets of canonical discriminant
coefficients. For four groups, we get three. If we have fewer variables than groups, then
we get only as many sets as there are variables. The group classification function
coefficients are handy for classifying new cases with the multiple group model. Simply
multiply each coefficient times each variable and add in the constant. Then assign the
case to the group whose set yields the largest value.
canonical variable scores. Scores/Data and Distances/Data save scores and distances
along with the data.
• Backward. In backward stepping, the variables begin in the model, and F-to-remove and F-to-enter values are reported (if a variable fails the Tolerance limit, however, it is excluded). When Backward is selected along with Automatic, at each step, SYSTAT removes the variable with the lowest F-to-remove value that passes the Remove limit of the F statistic (or reenters the variable with the largest F-to-enter above the Remove limit of the F statistic).
• Forward. In forward stepping, the variables are entered in the model. F-to-enter values are reported for all candidate variables, and F-to-remove values are reported for forced variables. When Forward is selected along with Automatic, at each step, SYSTAT enters the variable with the highest F-to-enter that passes the Enter limit of the F statistic (or removes the variable with the lowest F-to-remove below the Remove limit of the F statistic).
• Automatic. SYSTAT enters or removes variables automatically at each step, according to the F-to-enter and F-to-remove limits.
• Interactive. You enter or remove variables interactively at each step (see the STEP command).
STEP
Variables are added to or eliminated from the model based on one of two possible
criteria.
• Probability. Variables with probability (F-to-enter) smaller than the Enter
probability are entered into the model if Tolerance permits. The default Enter value
is 0.15. For highly correlated predictors, you may want to set Enter = 0.01.
Variables with probability (F-to-remove) larger than the Remove probability are
removed from the model. The default Remove value is 0.15.
• F-statistic. Variables with F-to-enter values larger than the Enter F value are
entered into the model if Tolerance permits. The default Enter value is 4. Variables
with F-to-remove values smaller than the Remove F value are removed from the
model. The default Remove value is 3.9.
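For example, the automatic-stepping limits used in the stepwise examples later in this chapter are set with the FENTER and FREMOVE options of STEP, which correspond to the Enter and Remove F limits described here:

START / FORWARD
STEP / AUTO FENTER=4 FREMOVE=3.9
STOP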
You can also specify variables to include in the model, regardless of whether they meet
the criteria for entry into the model. In the Force text box, enter the number of
variables, in the order in which they appear in the Variables list, to force into the model
(for example, Force = 2 means include the first two variables on the Variables list in
the main dialog box). Force = 0 is the default.
All selected statistics will be displayed in the output. Depending on the specified length
of your output, you may also see additional statistics. By default, the print length is set
to Short (you will see all of the statistics on the Short Statistics list). To change the
length of your output, choose Options from the Edit menu. Select Short, Medium, or
Long from the Length drop-down list. Again, all selected statistics will be displayed in
the output, regardless of the print setting.
Short Statistics. Options for Short Statistics are FMatrix (between-groups F matrix),
FStats (F-to-enter/remove statistics), Eigen (eigenvalues and canonical correlation),
CMeans (canonical scores of group means), and Sum (summary panel).
Medium Statistics. Options for Medium Statistics are those for Short Statistics plus
Means (group frequencies and means), Wilks (Wilks' lambda and approximate F), CFunc (discriminant functions), Traces (Lawley-Hotelling, Pillai's, and Wilks' traces), CDFunc (canonical discriminant functions), and SCDFunc (standardized canonical discriminant functions).
Using Commands
Select your data by typing USE filename and continue as follows:
Basic
DISCRIM
MODEL grpvar = varlist / QUADRATIC
CONTRAST [matrix]
PRINT / length element
SAVE / DATA SCORES DISTANCES
ESTIMATE / TOL=n
Stepwise (Instead of ESTIMATE, specify START)
PRIORS=n1,n2,
START / FORWARD
In addition to indicating a length for the PRINT output, you can select elements not included in the output for the specified length. Each length (SHORT, MEDIUM, LONG) has its own default set of elements; for example, CLASS and JCLASS request the classification and jackknifed classification matrices, and SCDFUNC requests the standardized canonical discriminant functions. Any of these statistics can be added to the PRINT request individually.
Usage Considerations
Types of data. DISCRIM uses rectangular data only.
Print options. Print options allow the user to select panels of output to display, including
group means, variances, covariances, and correlations.
Quick Graphs. For two canonical variables, SYSTAT produces a canonical scores plot,
in which the axes are the canonical variables and the points are the canonical variable
scores. This plot includes confidence ellipses for each group. For analyses involving
more than two canonical variables, SYSTAT displays a SPLOM of the first three
canonical variables.
Saving files. You can save the Mahalanobis distances to each group centroid (with the
posterior probability of the membership in each group) or the canonical variable
scores.
BY groups. DISCRIM analyzes data by groups.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. DISCRIM uses a FREQ variable to increase the number of cases.
Case weights. You can weight each case in a discriminant analysis using a weight
variable. Use a binary weight variable coded 0 and 1 for cross-validation. Cases that
have a zero weight do not influence the estimation of the discriminant functions but are
classified into groups.
Examples
Example 1
Discriminant Analysis Using Complete Estimation
In this example, we examine measurements made on 150 iris flowers: sepal length,
sepal width, petal length, and petal width (in centimeters). The data are from Fisher
(1936) and are grouped by species: Setosa, Versicolor, and Virginica (coded as 1, 2,
and 3, respectively).
The goal of the discriminant analysis is to find a linear combination of the four
measures that best classifies or discriminates among the three species (groups of
flowers). Here is a SPLOM of the four measures with within-group bivariate
confidence ellipses and normal curves. The input is:
DISCRIM
USE iris
SPLOM sepallen..petalwid / HALF GROUP=species ELL,
      DENSITY=NORM OVERLAY
[SPLOM of SEPALLEN, SEPALWID, PETALLEN, and PETALWID, with within-group confidence ellipses and normal curves for SPECIES 1, 2, and 3.]
Let's see what a default analysis tells us about the separation of the groups and the
usefulness of the variables for the classification. The input is:
USE iris
LABEL species / 1=Setosa, 2=Versicolor, 3=Virginica
DISCRIM
MODEL species = sepallen .. petalwid
PRINT / MEANS
ESTIMATE
Note the shortcut notation (..) in the MODEL statement for listing consecutive variables
in the file (otherwise, simply list each variable name separated by a space).
Group frequencies
-----------------
               Setosa   Versicolor   Virginica
Frequencies        50           50          50

Group means
-----------
SEPALLEN       5.0060       5.9360      6.5880
SEPALWID       3.4280       2.7700      2.9740
PETALLEN       1.4620       4.2600      5.5520
PETALWID       0.2460       1.3260      2.0260

Variable      F-to-remove  Tolerance  |  Variable      F-to-enter  Tolerance
--------------------------------------+-------------------------------------
 2 SEPALLEN       4.72      0.347993  |
 3 SEPALWID      21.94      0.608859  |
 4 PETALLEN      35.59      0.365126  |
 5 PETALWID      24.90      0.649314  |

Classification matrix (cases in row categories classified into columns)
---------------------
               Setosa   Versicolor   Virginica   %correct
Setosa             50            0           0        100
Versicolor          0           48           2         96
Virginica           0            1          49         98
Total              50           49          51         98

[The Jackknifed classification matrix contains the same counts.]

Eigen        Canonical       Cumulative proportion
values       correlations    of total dispersion
---------    ------------    ---------------------
32.192          0.985              0.991
 0.285          0.471              1.000
Group Frequencies
The Group frequencies panel shows the count of flowers within each group and the
means for each variable. If the group code or one or more measures are missing, the
case is not used in the analysis.
The degrees of freedom for these F statistics are 3 - 1 and 150 - 3 - 4 + 1, or 2 and 144. Because you may be scanning Fs for several variables, do not use the probabilities from the usual F tables for a test. Here we conclude that SEPALLEN is least helpful for discriminating among the species (F = 4.72).
Classification Tables
In the Classification matrix, each case is classified into the group where the value of
its classification function is largest. For Versicolor (row name), 48 flowers are
classified correctly and 2 are misclassified (classified as Virginica); 96% of the
Versicolor flowers are classified correctly. Overall, 98% of the flowers are classified
correctly (see the last row of the table). The results in the first table can be misleading
because we evaluated the classification rule using the same cases used to compute it.
They may provide an overly optimistic estimate of the rules success. The Jackknifed
classification matrix attempts to remedy the problem by using functions computed
from all of the data except the case being classified. The method of leaving out one case
at a time is called the jackknife and is one form of cross-validation.
For these data, the results are the same. If the percentage for correct classification is
considerably lower in the Jackknifed panel than in the first matrix, you may have too
many predictors in your model.
Example 2
Discriminant Analysis Using Automatic Forward Stepping
Our problem for this example is to derive a rule for classifying countries as European,
Islamic, or New World. We know that strong correlations exist among the candidate
predictor variables, so we are curious about just which subset will be useful. Here are
the candidate predictors:
URBAN
BIRTH_RT
DEATH_RT
B_TO_D
BABYMORT
GDP_CAP
LIFEEXPM
LIFEEXPF
EDUC
HEALTH
MIL
LITERACY
Because the distributions of the economic variables are skewed with long right tails,
we log transform GDP_CAP and take the square root of EDUC, HEALTH, and MIL.
LET gdp_cap = L10(gdp_cap)
LET educ = SQR(educ)
LET health = SQR(health)
LET mil = SQR(mil)
Alternatively, you could also use shortcut notation to request the square root
transformations:
LET (educ, health, mil) = SQR(@)
We use automatic forward stepping in an effort to identify the best subset of predictors.
After stepping stops, you need to type STOP to ask SYSTAT to produce the summary
table, classification matrices, and information about canonical variables. The input is:
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy
PRINT / MEANS
START / FORWARD
STEP / AUTO FENTER=4 FREMOVE=3.9
STOP
Notice that the initial results appear after START / FORWARD is specified. STEP /
AUTO and STOP are selected later, as indicated in the output that follows:
Group frequencies
-----------------
               Europe    Islamic   NewWorld
Frequencies        19         15         21

Group means
-----------
               Europe    Islamic   NewWorld
URBAN         68.7895    30.0667    56.3810
BIRTH_RT      12.5789    42.7333    26.9524
DEATH_RT      10.1053    13.4000     7.4762
BABYMORT       7.8947   102.3333    42.8095
GDP_CAP        4.0431     2.7640     3.2139
EDUC          21.5275     6.4156     8.9619
HEALTH        21.9537     3.1937     6.8898
MIL           15.9751     7.5431     6.0903
B_TO_D         1.2658     3.5472     3.9509
LIFEEXPM      72.3684    54.4000    66.6190
LIFEEXPF      79.5263    57.1333    71.5714
LITERACY      97.5263    36.7333    79.9571

Variable      F-to-remove  Tolerance  |  Variable      F-to-enter  Tolerance
--------------------------------------+-------------------------------------
                                      |   6 URBAN         23.20    1.000000
                                      |   8 BIRTH_RT     103.50    1.000000
                                      |  10 DEATH_RT      14.41    1.000000
                                      |  12 BABYMORT      53.62    1.000000
                                      |  16 GDP_CAP       59.12    1.000000
                                      |  19 EDUC          27.12    1.000000
                                      |  21 HEALTH        49.62    1.000000
                                      |  23 MIL           19.30    1.000000
                                      |  34 B_TO_D        31.54    1.000000
                                      |  30 LIFEEXPM      37.08    1.000000
                                      |  31 LIFEEXPF      50.30    1.000000
                                      |  32 LITERACY      63.64    1.000000
[Step 1, Step 2, and Step 3 output panels: BIRTH_RT, DEATH_RT, and MIL are entered in turn.]
[Classification matrix and Jackknifed classification matrix: the three-variable rule classifies 89% of the countries correctly; the jackknifed percentage is 87%.]

Eigen        Canonical       Cumulative proportion
values       correlations    of total dispersion
---------    ------------    ---------------------
 5.247          0.916              0.821
 1.146          0.731              1.000
[Canonical scores plot: FACTOR(1) versus FACTOR(2), with confidence ellipses for the Europe, Islamic, and NewWorld groups.]
From the panel of Group means, note that, on the average, the percentage of the
population living in cities (URBAN) is 68.8% in Europe, 30.1% in Islamic nations, and
56.4% in the New World. The LITERACY rates for these same groups are 97.5%,
36.7%, and 80.0%, respectively.
After the group means, you will find the F-to-enter statistics for each variable not
in the functions. When no variables are in the model, each F is the same as that for a
one-way analysis of variance. Thus, group differences are the strongest for BIRTH_RT
(F = 103.5) and weakest for DEATH_RT (F = 14.41). At later steps, each F
corresponds to the F for a one-way analysis of covariance where the covariates are the
variables already included.
At step 1, SYSTAT enters BIRTH_RT because its F-to-enter is largest in the last
panel and now displays the same F in the F-to-remove panel. BIRTH_RT is correlated
with several candidate variables, so notice how their F-to-enter values drop when
BIRTH_RT enters (for example, for GDP_CAP, from 59.1 to 4.6). DEATH_RT now
has the highest F-to-enter, so SYSTAT will enter it at step 2. From the between-groups
F-matrix, note that when BIRTH_RT is used alone, Europe and Islamic countries are
the groups that differ most (206.6), and Europe and the New World are the groups that
differ least (55.9).
After DEATH_RT enters, the F-to-enter for MIL (money spent per person on the
military) is largest, so SYSTAT enters it at step 3. The SYSTAT default limit for F-to-enter values is 4.0. No variable has an F-to-enter above the limit, so the stepping stops.
Also, all F-to-remove values are greater than 3.9, so no variables are removed.
The summary table contains one row for each variable moved into the model. The
F-to-enter (F-to-remove) is printed for each, along with Wilks lambda and its
approximate F statistic, numerator and denominator degrees of freedom, and tail
probability.
After the summary table, SYSTAT prints the classification matrices. From the
biased estimate in the first matrix, our three-variable rule classifies 89% of the
countries correctly. For the jackknifed results, this percentage drops to 87%. All of the
European nations are classified correctly (100%), while almost one-fourth of the New
World countries are misclassified (two as Europe and three as Islamic). These
countries can be identified by using MAHAL; the posterior probability for each case
belonging to each group is printed. You will find, for example, that Canada is
misclassified as European and that Haiti and Bolivia are misclassified as Islamic.
If you focus on the canonical results, you notice that the first canonical variable
accounts for 82.1% of the dispersion, and in the Canonical scores of group means panel,
the groups are ordered from left to right: Europe, New World, and then Islamic. The
second canonical variable contrasts Islamic versus New World (1.243 versus 1.258).
In the canonical variable plot, the European nations (on the left) are well separated
from the other groups. The plus sign (+) next to the European confidence ellipse is
Canada. If you are unsure about which ellipse corresponds to what group, look at the
Canonical scores of group means.
Example 3
Discriminant Analysis Using Automatic Backward Stepping
It is possible that classification rules for other subsets of the variables perform better
than that found using forward stepping, especially when there are correlations among
the variables. We try backward stepping. The input is:
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy
PRINT SHORT / CFUNC
IDVAR = country$
START / BACKWARD
STEP / AUTO FENTER=4 FREMOVE=3.9
PRINT / TRACES CDFUNC SCDFUNC
STOP
Notice that we request STEP after an initial report and PRINT and STOP later.
The output follows:
Between groups F-matrix -- df = 12  41
----------------------------------------------
              Europe     Islamic    NewWorld
Europe        0.0
Islamic      25.3059     0.0
NewWorld     18.0596     7.3754     0.0
Classification functions
------------------------
               Europe        Islamic       NewWorld
Constant   -4408.4004     -4396.8904     -4408.5297
URBAN         -2.4175        -2.3572        -2.2871
BIRTH_RT      41.9790        43.1675        43.1322
DEATH_RT      50.0202        48.1539        48.1950
BABYMORT       9.3190         9.3806         9.3461
GDP_CAP      243.6686       234.5165       237.0805
EDUC           2.0078         4.0450         3.4276
HEALTH       -17.9706       -19.8527       -19.3068
MIL           -9.8420       -10.1746       -10.6076
B_TO_D       -59.6547       -62.2446       -61.8195
LIFEEXPM      -9.8216        -9.1537        -9.4952
LIFEEXPF      93.5933        93.0934        93.4108
LITERACY       7.5909         7.5834         7.7178
Variable      F-to-remove  Tolerance  |  Variable      F-to-enter  Tolerance
--------------------------------------+-------------------------------------
 6 URBAN           2.17     0.436470  |
 8 BIRTH_RT        2.01     0.059623  |
10 DEATH_RT        2.26     0.091463  |
12 BABYMORT        0.10     0.083993  |
16 GDP_CAP         0.62     0.143526  |
19 EDUC            6.12     0.065095  |
21 HEALTH          5.36     0.083198  |
23 MIL             7.11     0.323519  |
34 B_TO_D          0.55     0.136148  |
30 LIFEEXPM        0.26     0.036088  |
31 LIFEEXPF        0.07     0.012280  |
32 LITERACY        1.45     0.177756  |
[Step 1: LIFEEXPF, the variable with the smallest F-to-remove, is removed, and classification functions for the remaining 11 variables are printed.]
Variable      F-to-remove  Tolerance  |  Variable      F-to-enter  Tolerance
--------------------------------------+-------------------------------------
 6 URBAN           2.45     0.466202  |  31 LIFEEXPF       0.07    0.012280
 8 BIRTH_RT        3.04     0.077495  |
10 DEATH_RT        2.45     0.100658  |
12 BABYMORT        0.41     0.140589  |
16 GDP_CAP         0.68     0.144854  |
19 EDUC            6.71     0.066537  |
21 HEALTH          6.78     0.092071  |
23 MIL             7.39     0.328943  |
34 B_TO_D          0.70     0.148030  |
30 LIFEEXPM        0.24     0.077817  |
32 LITERACY        1.48     0.185492  |
[The remaining backward steps remove LIFEEXPM, BABYMORT, B_TO_D, GDP_CAP, LITERACY, and URBAN in turn; the classification functions printed at the last step are for the five remaining variables (BIRTH_RT, DEATH_RT, EDUC, HEALTH, MIL).]
Variable      F-to-remove  Tolerance  |  Variable      F-to-enter  Tolerance
--------------------------------------+-------------------------------------
 8 BIRTH_RT       27.89     0.622699  |   6 URBAN          3.65    0.504724
10 DEATH_RT       15.51     0.583392  |  12 BABYMORT       1.12    0.243722
19 EDUC            5.20     0.083925  |  16 GDP_CAP        1.20    0.171233
21 HEALTH          6.67     0.102470  |  34 B_TO_D         1.24    0.180347
23 MIL             7.42     0.501019  |  30 LIFEEXPM       0.02    0.123573
                                      |  31 LIFEEXPF       0.49    0.076049
                                      |  32 LITERACY       3.42    0.250341
Variable      F-to-enter or  Number of
entered or    F-to-remove    variables   Wilks'    Approx.
removed                      in model    lambda    F-value    df1   df2   p-tail
------------  -------------  ---------   -------   --------   ---   ---   -------
LIFEEXPF           0.068        11       0.0405    15.1458     22    84   0.00000
LIFEEXPM           0.237        10       0.0410    16.9374     20    86   0.00000
BABYMORT           0.219         9       0.0414    19.1350     18    88   0.00000
B_TO_D             0.849         8       0.0430    21.4980     16    90   0.00000
GDP_CAP            1.429         7       0.0457    24.1542     14    92   0.00000
LITERACY           2.388         6       0.0505    27.0277     12    94   0.00000
URBAN              3.655         5       0.0583    30.1443     10    96   0.00000
Classification matrix (cases in row categories classified into columns)
---------------------
               Europe    Islamic   NewWorld   %correct
Europe             19          0          0        100
Islamic             0         13          2         87
NewWorld            1          2         18         86
Total              20         15         20         91

[The Jackknifed classification matrix also classifies 91% of the countries correctly.]

Eigen        Canonical       Cumulative proportion
values       correlations    of total dispersion
---------    ------------    ---------------------
 6.984          0.935              0.859
 1.147          0.731              1.000
Wilks' lambda =            0.058    Approx. F = 30.144    df = 10, 96    p-tail = 0.0000
Pillai's trace =           1.409    Approx. F = 23.360    df = 10, 98    p-tail = 0.0000
Lawley-Hotelling trace =   8.131    Approx. F = 38.215    df = 10, 94    p-tail = 0.0000
Before stepping starts, SYSTAT uses all candidate variables to compute classification
functions. The output includes the coefficients for these functions used to classify
cases into groups. A variable is omitted only if it fails the Tolerance limit. For each
case, SYSTAT computes three functions. The first is:
-4408.4 - 2.417*urban + 41.979*birth_rt + ... + 7.591*literacy
The model selected by forward stepping did not include EDUC or HEALTH (their F-to-enter statistics at step 3 are 0.01 and 1.24, respectively). URBAN and LITERACY appear more likely candidates, but their Fs are still less than 4.0.
In both classification matrices, 91% of the countries are classified correctly using
the five-variable discriminant functions. This is a slight improvement over the three-variable model from the forward stepping example, where the percentages were 89%
for the first matrix and 87% for the jackknifed results. The improvement from 87% to
91% is because two New World countries are now classified correctly. We add two
variables and gain two correct classifications.
Wilks' lambda (or U statistic), a multivariate analysis of variance statistic that varies between 0 and 1, tests the equality of group means for the variables in the discriminant functions. Wilks' lambda is transformed to an approximate F statistic for comparison with the F distribution. Here, the associated probability is less than 0.00005, indicating a highly significant difference among the groups. The Lawley-Hotelling trace and its F approximation are documented in Morrison (1976). When there are only two groups, it and Wilks' lambda are equivalent. Pillai's trace and its F approximation are taken
from Pillai (1960).
The canonical discriminant functions list the coefficients of the canonical variables
computed first for the data as input and then for the standardized values. For the
unstandardized data, the first canonical variable is:
-1.984 + 0.160*birth_rt - 0.159*death_rt + 0.236*educ - 0.260*health - 0.074*mil
The coefficients are adjusted so that the overall mean of the corresponding scores is 0
and the pooled within-group variances are 1. After standardizing, the first canonical
variable is:
0.974*birth_rt - 0.519*death_rt + 1.557*educ - 1.557*health - 0.391*mil
Usually, one uses the latter set of coefficients to interpret what variables drive each
canonical variable. Here, EDUC and HEALTH, the variables with low tolerance
values, have the largest coefficients, and they appear to cancel one another. Also, in the
final model, the size of their F-to-remove values indicates they are the least useful
variables in the model. This indicates that we do not have an optimum set of variables.
These two variables contribute little alone, while together they enhance the separation
of the groups. This suggests that the difference between EDUC and HEALTH could be
a useful variable (for example, LET diff = educ - health). We did this, and the following is the first canonical variable for standardized values (we omit the constant):
1.024*birth_rt - 0.539*death_rt - 0.480*mil + 0.553*diff
From the Canonical scores of group means for the first canonical variable, the groups
line up with Europe first, then New World in the middle, and Islamic on the right. In
the second dimension, DEATH_RT and MIL (military expenditures) appear to separate
Islamic and New World countries.
[For each country, the squared Mahalanobis distance to each group mean and the corresponding posterior probability are listed, in the order Europe, Islamic, NewWorld.]

Europe
------------
[The European countries are listed first; each has posterior probability 1.00 (or .99) for the Europe group and is classified correctly.]

Islamic
------------
Gambia            43.2  .00        2.9  1.00       15.3  .00
Iraq              71.3  .00       23.5  1.00       41.7  .00
Pakistan          38.7  .00         .5   .98        8.6  .02
Bangladesh        37.2  .00        2.0   .91        6.8  .09
Ethiopia          40.5  .00        1.1   .99       10.0  .01
Guinea            41.2  .00        8.0  1.00       24.1  .00
Malaysia    -->   36.6  .00        7.7   .17        4.5  .83
Senegal           42.8  .00         .9   .98        9.1  .02
Mali              49.3  .00        5.5  1.00       23.5  .00
Libya             60.3  .00       15.6  1.00       30.1  .00
Somalia           50.0  .00        1.1  1.00       13.1  .00
Afghanistan *        .    .          .     .          .    .
Sudan             43.8  .00         .3   .99       10.1  .01
Turkey      -->   25.0  .00        7.2   .05        1.5  .95
Algeria           43.1  .00        4.1   .79        6.7  .21
Yemen             57.4  .00        3.1  1.00       23.2  .00

NewWorld
------------
Argentina         11.5  .03       19.8  .00         4.4  .97
Barbados          16.4  .00       20.9  .00         4.7 1.00
Bolivia     -->   27.7  .00        3.4  .56         3.8  .44
Brazil            27.4  .00       11.5  .00          .6 1.00
Canada      -->    6.7 1.00       35.9  .00        19.3  .00
Chile             21.1  .00       15.7  .00         1.5 1.00
Colombia          35.2  .00       13.9  .00         1.9 1.00
CostaRica         34.8  .00       21.1  .00         5.5 1.00
Venezuela         41.2  .00       13.4  .01         4.6  .99
DominicanR.       26.0  .00       13.2  .00         1.3 1.00
Uruguay           13.6  .07       22.9  .00         8.6  .93
Ecuador           32.8  .00        8.6  .02         1.0  .98
ElSalvador        35.3  .00        7.5  .07         2.5  .93
Jamaica           25.6  .00       19.1  .00         1.9 1.00
Guatemala         37.6  .00        4.5  .33         3.1  .67
Haiti       -->   37.9  .00        2.0  .99        10.6  .01
Honduras          39.8  .00        6.4  .27         4.5  .73
Trinidad          34.1  .00       11.4  .03         4.1  .97
Peru              20.2  .00       10.5  .02         2.4  .98
Panama            23.8  .00       16.5  .00         2.4 1.00
Cuba              12.0  .03       18.5  .00         5.1  .97

-->  case misclassified
*    case not used in computation
For each case (up to 250 cases), the squared Mahalanobis distance D² is computed to each group mean. The closer a case is to a particular mean, the more likely it belongs to that group. The posterior probability for a case relative to a group is the ratio of EXP(-0.5 * D²) for that group divided by the sum of EXP(-0.5 * D²) over all groups (prior probabilities, if specified, affect these computations).
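Written out with the prior probabilities included (this is the standard form; the text notes that priors enter the computation when specified), the posterior probability of group g for a case is

P(g) = prior(g) * EXP(-0.5 * D²(g)) / SUM over all groups h of [ prior(h) * EXP(-0.5 * D²(h)) ]

With equal priors (the default), the prior terms cancel and the expression reduces to the ratio described above.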
An arrow (-->) marks incorrectly classified cases, and an asterisk (*) flags cases with
missing values. New World countries Bolivia and Haiti are classified as Islamic, and
Canada is classified as Europe. Note that even though an asterisk marks Belgium,
results are printed; the value of the unused candidate variable URBAN is missing. No
results are printed for Afghanistan because MIL, a variable in the final model, is
missing.
You can identify cases with all large distances as outliers. A case can have a 1.0
probability of belonging to a particular group but still have a large distance. Look at
Iraq. It is correctly classified as Islamic, but its distance is 23.5. The distances in this
panel are distributed approximately as a chi-square with degrees of freedom equal to
the number of variables in the function.
Example 4
Discriminant Analysis Using Interactive Stepping
Automatic forward and backward stepping can produce different sets of predictor
variables, and still other subsets of the variables may perform equally as well or
possibly better. Here we use interactive stepping to explore alternative sets of
variables.
Using the OURWORLD data, let's say you decide not to include birth and death
rates in the model because the rates are changing rapidly for several nations (that is, we
omit these variables from the model). We also add the difference between EDUC and
HEALTH as a candidate variable.
SYSTAT provides several ways to specify which variables to move into (or out of)
the model. The input is:
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LET diffrnce = educ - health
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy diffrnce
PRINT SHORT / SCDFUNC
GRAPH=NONE
START / BACK
After interpreting these commands and printing the output below, SYSTAT waits for
us to enter STEP instructions.
Between groups F-matrix -- df = 12  41
----------------------------------------------
              Europe     Islamic    NewWorld
Europe        0.0
Islamic      25.3059     0.0
NewWorld     18.0596     7.3754     0.0
Variable      F-to-remove  Tolerance  |  Variable        F-to-enter  Tolerance
--------------------------------------+---------------------------------------
 6 URBAN           2.17     0.436470  |  40 DIFFRNCE    0000000.00   0.000000
 8 BIRTH_RT        2.01     0.059623  |
10 DEATH_RT        2.26     0.091463  |
12 BABYMORT        0.10     0.083993  |
16 GDP_CAP         0.62     0.143526  |
19 EDUC            6.12     0.065095  |
21 HEALTH          5.36     0.083198  |
23 MIL             7.11     0.323519  |
34 B_TO_D          0.55     0.136148  |
30 LIFEEXPM        0.26     0.036088  |
31 LIFEEXPF        0.07     0.012280  |
32 LITERACY        1.45     0.177756  |
A summary of the STEP arguments (variable numbers are visible in the output) follows:

a. STEP birth_rt death_rt
b. STEP lifeexpf
c. STEP
d. STEP
e. STEP
f. STEP
g. STEP
h. STEP
   STOP

Notice that the seventh STEP specification (g) removes EDUC and HEALTH and enters DIFFRNCE. Remember, after the last step, type STOP for the canonical variable results and other summaries.
Steps 1 and 2
Input:
STEP birth_rt death_rt
Output:
[Step 1 and Step 2 output panels]
Step 3
Input:
STEP lifeexpf
Output:
[Step 3 output panel]
Steps 4 through 7
Input:
STEP
Output:
[Step 4 through Step 7 output panels]
Steps 8, 9, and 10
Input:
STEP
Output:
[Step 8, Step 9, and Step 10 output panels]
Step 11
Input:
STEP
Output:
[Step 11 output panel]
Final Model
Input:
STOP
Output:
Variable      F-to-enter or  Number of
entered or    F-to-remove    variables   Wilks'    Approx.
removed                      in model    lambda    F-value    df1   df2   p-tail
------------  -------------  ---------   -------   --------   ---   ---   -------
BIRTH_RT           2.011        11       0.0444    14.3085     22    84   0.00000
DEATH_RT           2.002        10       0.0486    15.2053     20    86   0.00000
LIFEEXPF           0.275         9       0.0492    17.1471     18    88   0.00000
LIFEEXPM           0.277         8       0.0498    19.5708     16    90   0.00000
BABYMORT           0.524         7       0.0510    22.5267     14    92   0.00000
URBAN              2.615         6       0.0568    25.0342     12    94   0.00000
GDP_CAP            3.583         5       0.0655    27.9210     10    96   0.00000
EDUC               5.143         4       0.0795    31.1990      8    98   0.00000
HEALTH             2.438         3       0.0874    39.7089      6   100   0.00000
DIFFRNCE           6.983         4       0.0680    34.7213      8    98   0.00000
GDP_CAP            4.299         5       0.0577    30.3710     10    96   0.00000
Classification matrix (cases in row categories classified into columns)
---------------------
               Europe    Islamic   NewWorld   %correct
Europe             19          0          0        100
Islamic             0         14          1         93
NewWorld            1          1         19         90
Total              20         15         20         95
[Jackknifed classification matrix: 100% of the European, 80% of the Islamic, and 81% of the New World countries are classified correctly; the overall jackknifed percentage is 87%.]

Eigen        Canonical       Cumulative proportion
values       correlations    of total dispersion
---------    ------------    ---------------------
 6.319          0.929              0.822
 1.369          0.760              1.000
A summary of results for the models estimated by forward, backward, and interactive stepping follows:

Model                                         % Correct    % Correct
                                              (Class)      (Jackknife)
Forward (automatic)
1. BIRTH_RT DEATH_RT MIL                          89            87
Backward (automatic)
2. BIRTH_RT DEATH_RT MIL EDUC HEALTH              91            91

The models explored in the interactive session classify between 84% and 95% of the countries correctly; the final interactive model reaches 95% in the classification matrix and 87% in the jackknifed matrix.
Notice that the largest difference between the two classification methods (95% versus
87%) occurs for the last model, which includes the most variables. A difference like
this one (8%) can indicate overfitting of correlated candidate variables. Since the
jackknifed results can still be overly optimistic, cross-validation should be considered.
Example 5
Contrasts
Contrasts are available with commands only. When you have specific hypotheses
about differences among particular groups, you can specify one or more contrasts to
direct the entry (or removal) of variables in the model.
According to the jackknifed classification results in the stepwise examples, the
European countries are always classified correctly (100% correct). All of the
misclassifications are New World countries classified as Islamic or vice versa. In order
to maximize the difference between the second (Islamic) and third groups (New
World), we specify contrast coefficients with commands:
CONTRAST [0 -1 1]
If we want to specify linear and quadratic contrasts across four groups, we could
specify:
CONTRAST [-3 -1 1 3; -1 1 1 -1]
or
CONTRAST [-3 -1 1 3;
          -1 1 1 -1]
Here, we use the first contrast and request interactive forward stepping. The input is:
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy
CONTRAST [0 -1 1]
PRINT / SHORT
START / FORWARD
STEP literacy
STEP mil
STEP urban
STOP
After viewing the results, remember to cancel the contrast if you plan to do other
discriminant analyses:
CONTRAST / CLEAR
[Step output panels for the three steps entering LITERACY, MIL, and URBAN, followed by the Classification and Jackknifed classification matrices: 87% of the countries are classified correctly overall; in the jackknifed matrix, 95% of the European, 93% of the Islamic, and 76% of the New World countries are classified correctly.]

Eigen        Canonical       Cumulative proportion
values       correlations    of total dispersion
---------    ------------    ---------------------
 1.845          0.805              1.000
Compare the F-to-enter values with those in the forward stepping example. The
statistics here indicate that for the economic variables (GDP_CAP, EDUC, HEALTH,
and MIL), the differences between the second and third groups are much smaller than
they are when the European countries are included.
The Jackknifed classification matrix indicates that when LITERACY, MIL, and
URBAN are used, 87% of the countries are classified correctly. This is the same
percentage correct as in the forward stepping example for the model with BIRTH_RT,
DEATH_RT, and MIL. Here, however, one fewer Islamic country is misclassified, and
one European country is now classified incorrectly.
When you look at the canonical results, you see that because a single contrast has
one degree of freedom, only one dimension is defined; that is, there is only one
eigenvalue and one canonical variable.
Example 6
Quadratic Model
One of the assumptions necessary for linear discriminant analysis is equality of
covariance matrices. Within-group scatterplot matrices (SPLOMs) provide a picture
of how measures co-vary. Here we add 85% ellipses of concentration to enhance our
view of the bivariate relations. Since our sample sizes do not differ markedly (15 to 21
countries per group), the ellipses for each pair of variables should have approximately
the same shape and tilt across groups if the equality of covariance assumption holds.
The input is:
DISCRIM
USE ourworld
LET (educ, health, mil) = SQR(@)
STAND
SPLOM birth_rt death_rt educ health mil / HALF ROW=1,
GROUP=group$ ELL=.85 DENSITY=NORMAL
[Within-group scatterplot matrices (SPLOMs) of BIRTH_RT, DEATH_RT, EDUC, HEALTH, and MIL, with 85% ellipses of concentration, appear here for the Europe, Islamic, and NewWorld groups.]
Because the length, width, and tilt of the ellipses for most pairs of variables vary
markedly across groups, the assumption of equal covariance matrices has not been met.
Fortunately, the quadratic model does not require equality of covariances. However,
it has a different problem: it requires a larger minimum sample size than that needed
for the linear model. For five variables, for example, the linear and quadratic models,
respectively, for each group are:
f = a + b*x1 + c*x2 + d*x3 + e*x4 + f*x5

f = a + b*x1 + c*x2 + d*x3 + e*x4 + f*x5 + g*x1*x2 + ... + p*x4*x5 + q*x1^2 + ... + u*x5^2

So the linear model has six parameters to estimate for each group (a constant and five
coefficients), and the quadratic has 21 (a constant, five linear terms, ten cross-products,
and five squared terms). These parameters aren't all independent, so we don't require as
many as 3 × 21 cases for a quadratic fit.
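As a quick check of those counts (Python, purely illustrative):

from math import comb

p = 5
linear = 1 + p                        # constant + linear terms
quadratic = 1 + p + comb(p, 2) + p    # + cross-products + squared terms
print(linear, quadratic)              # 6 21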
In this example, we fit a quadratic model using the subset of variables identified in
the backward stepping example. Following this, we examine results for the subset
identified in the interactive stepping example before EDUC and HEALTH are
removed. The input is:
DISCRIM
USE ourworld
LET (educ, health, mil) = SQR(@)
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = birth_rt death_rt educ health mil / QUAD
PRINT SHORT / GCOV WCOV GCOR CFUNC MAHAL
IDVAR = country$
ESTIMATE
MODEL group = educ health mil b_to_d literacy / QUAD
ESTIMATE
[Output panels for the first quadratic model appear here: the group covariance and correlation matrices, the within-group covariance and correlation matrices, the classification function coefficients, the chi-square test of equality of covariance matrices (prob = 0.0000), and the prior probabilities.]
(We omit the distances and probabilities for the Europe and Islamic groups.)
NewWorld
-----------
                     Europe          Islamic         NewWorld
Argentina           48.1  .00       45.2  .00        3.8  1.00
Barbados            31.8  .00       65.0  .00        6.5  1.00
Bolivia        -->  369.3 .00        4.1  .65        4.2   .35
Brazil             133.3  .00        9.4  .03        1.1   .97
Canada         -->   14.5 .88      533.6  .00       15.7   .12
Chile               66.6  .00       16.6  .00        1.8  1.00
Colombia           161.1  .00        9.2  .04        1.8   .96
CostaRica          181.6  .00       93.2  .00        7.8  1.00
Venezuela          180.9  .00       16.6  .01        6.0   .99
DominicanR.        175.3  .00       21.5  .00        2.3  1.00
Uruguay             23.1  .00       38.4  .00        5.8  1.00
Ecuador            212.2  .00        5.8  .13         .8   .87
ElSalvador         312.9  .00       10.0  .03        2.0   .97
Jamaica             73.8  .00       20.2  .00        2.5  1.00
Guatemala          404.9  .00        6.0  .17        1.7   .83
Haiti          -->  792.1 .00        3.9  .99       11.2   .01
Honduras           395.9  .00       16.1  .00        4.1  1.00
Trinidad           164.1  .00       38.0  .00        5.6  1.00
Peru               167.6  .00       18.9  .00        4.9  1.00
Panama             133.9  .00       97.7  .00        3.4  1.00
Cuba                33.6  .00       39.7  .00        6.8  1.00

-->  case misclassified
*    case not used in computation
[The classification matrix (93% of the countries correct), the jackknifed classification matrix (91% correct), and the estimated quadratic functions appear here; one of the printed functions takes the form f = 51.178 + 4.135*birth_rt + 0.049*birth_rt*death_rt + 0.159*birth_rt^2 + ... + 0.025*mil^2.]
The output also includes the chi-square test for equality of covariance matrices. The
results are highly significant ( p < 0.00005 ). Thus, we reject the hypothesis of equal
covariance matrices.
The Mahalanobis distances reveal that only four cases are misclassified: Turkey as
a New World country, Canada as European, and Haiti and Bolivia as Islamic.
The classification matrix indicates that 93% of the countries are correctly classified;
using the jackknifed results, the percentage drops to 91%. The latter percentage agrees
with that for the linear model using the same variables.
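The rule behind these distances and posterior probabilities uses a separate covariance matrix for each group. A compact sketch of the computation in Python with NumPy (the means, covariances, priors, and case values are made up, and SYSTAT's printed output is organized differently):

import numpy as np

def quadratic_scores(x, means, covs, priors):
    # quadratic discriminant score of case x for each group
    scores = []
    for m, S, p in zip(means, covs, priors):
        d = x - m
        md2 = d @ np.linalg.solve(S, d)                 # squared Mahalanobis distance
        scores.append(-0.5 * (md2 + np.log(np.linalg.det(S))) + np.log(p))
    return np.array(scores)

means  = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]    # two hypothetical groups
covs   = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]

s = quadratic_scores(np.array([2.0, 0.5]), means, covs, priors)
post = np.exp(s - s.max()); post /= post.sum()           # posterior probabilities
print(np.round(s, 3), np.round(post, 3))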
The output for the second model follows:
Between groups F-matrix -- df = 5, 49
-------------------------------------------
              Europe     Islamic    NewWorld
Europe         0.0
Islamic       51.5154     0.0
NewWorld      33.6025    17.9915     0.0
[The prior probabilities appear here.]

NewWorld
-----------
                     Europe          Islamic         NewWorld
Argentina           30.9  .00       48.3  .00        4.3  1.00
Barbados            35.5  .00       68.7  .00        7.4  1.00
Bolivia            186.2  .00       10.1  .08        2.2   .92
Brazil             230.8  .00        8.1  .13        1.2   .87
Canada         -->   19.4 .74      524.3  .00       16.3   .26
Chile              144.3  .00       17.2  .00        1.6  1.00
Colombia           475.1  .00       29.8  .00        1.9  1.00
CostaRica          834.5  .00      190.5  .00       10.3  1.00
Venezuela          932.5  .00       83.6  .00        8.8  1.00
DominicanR.        267.4  .00       18.6  .00        2.0  1.00
Uruguay             15.2  .04       60.5  .00        3.9   .96
Ecuador            276.0  .00       11.5  .02        1.0   .98
ElSalvador         498.0  .00       17.6  .00        1.7  1.00
Jamaica            312.0  .00       15.5  .00         .7  1.00
Guatemala          501.3  .00        7.9  .24        2.5   .76
Haiti          -->  648.4 .00        4.6  .99       10.2   .01
Honduras           688.1  .00       31.8  .00        4.0  1.00
Trinidad           315.4  .00       43.1  .00        4.6  1.00
Peru               179.9  .00       16.3  .02        5.1   .98
Panama             411.0  .00      109.7  .00        3.6  1.00
Cuba                54.7  .00       54.5  .00        6.8  1.00

-->  case misclassified
*    case not used in computation
[The classification matrix (96% of the countries correct), the jackknifed classification matrix (93% correct), and the canonical results appear here: eigenvalues 5.585 and 1.391; canonical correlations 0.921 and 0.763; cumulative proportion of total dispersion 0.801 and 1.000.]
This model does slightly better than the first one: the classification matrices here
show that 96% and 93%, respectively, of the countries are classified correctly. This is
because Turkey and Bolivia are classified correctly here but were misclassified with
the first model.
Example 7
Cross-Validation
At the end of the interactive stepping example, we reported the percentage of correct
classification for six models. The same sample was used to compute the estimates and
evaluate the success of the rules. We also reported results for the jackknifed
classification procedure that removes and replaces one case at a time. This approach,
however, may still give an overly optimistic picture. Ideally, we should try the rules on
a new sample and compare results with those for the original data. Since this usually
isn't practical, researchers often use a cross-validation procedure: that is, they
randomly split the data into two samples, use the first sample to estimate the
classification functions, and then use the resulting functions to classify the second
sample. The first sample is often called the learning sample and the second, the test
sample. The proportion of correct classification for the test sample is an empirical
measure for the success of the discrimination.
Cross-validation is easy to implement in discriminant analysis. Cases assigned a
weight of 0 are not used to estimate the discriminant functions but are classified into
groups. In this example, we generate a uniform random number (values range from 0
to 1.0) for each case, and when it is less than 0.65, the value 1.0 is stored in a new
weight variable named CASE_USE. If the random number is equal to or greater than
0.65, a 0 is placed in the weight variable. So, approximately 65% of the cases have a
weight of 1.0, and 35%, a weight of 0.
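Outside SYSTAT, the same split-sample idea can be sketched in a few lines of Python with NumPy and scikit-learn (the data are simulated stand-ins for the OURWORLD variables):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 1.0, size=(20, 4)) for m in (0.0, 1.0, 2.5)])
y = np.repeat([1, 2, 3], 20)

use = rng.uniform(size=len(y)) < 0.65        # ~65% learning sample, like URN < .65
lda = LinearDiscriminantAnalysis().fit(X[use], y[use])
print("learning sample:", round(lda.score(X[use], y[use]), 2))
print("test sample:    ", round(lda.score(X[~use], y[~use]), 2))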
We now request a cross-validation for each of six models using the OURWORLD data;
the models are the variable subsets produced by the successive STEP specifications
shown below.
Use interactive forward stepping to toggle variables in and out of the model subsets.
The input is:
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LET diffrnce = educ - health
LET case_use = URN < .65
WEIGHT = case_use
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy diffrnce
PRINT NONE / FSTATS CLASS JCLASS
GRAPH NONE
START / FORWARD
STEP birth_rt death_rt mil
STEP educ health
STEP birth_rt death_rt educ health b_to_d literacy
STEP educ health
STEP educ health diffrnce
STEP gdp_cap
STOP
Here are the results from the first STEP after MIL enters:
   Variable       F-to-remove  Tolerance  |    Variable       F-to-enter  Tolerance
------------------------------------------+------------------------------------------
 8 BIRTH_RT          57.86      0.640126   |  6 URBAN             7.41      0.415097
10 DEATH_RT          24.56      0.513344   | 12 BABYMORT          0.20      0.234804
23 MIL               13.43      0.760697   | 16 GDP_CAP           3.22      0.394128
                                           | 19 EDUC              2.00      0.673136
                                           | 21 HEALTH            4.68      0.828565
                                           | 34 B_TO_D            0.16      0.209796
                                           | 30 LIFEEXPM          0.42      0.136526
                                           | 31 LIFEEXPF          0.83      0.104360
                                           | 32 LITERACY          1.54      0.244547
                                           | 40 DIFFRNCE          5.23      0.784797
[The three classification matrices appear here: learning sample 95% correct, test sample 76% correct, and jackknifed learning sample 92% correct.]
Three classification matrices result. The first presents results for the learning sample,
the cases with CASE_USE values of 1.0. Overall, 95% of these countries are classified
correctly. The sample size is 13 + 9 + 16 = 38, or 67.9% of the original sample of 56
countries. The second classification table reflects those cases not used to compute
estimates, the test sample. The percentage of correct classification drops to 76% for
these 17 countries. The final classification table presents the jackknifed results for the
learning sample. Notice that the percentages of correct classification are closer to those
for the learning sample than for the test sample.
Now we add the variables EDUC and HEALTH, with the following results:
   Variable       F-to-remove  Tolerance  |    Variable       F-to-enter  Tolerance
------------------------------------------+------------------------------------------
 8 BIRTH_RT          21.13      0.588377   |  6 URBAN             6.50      0.414511
10 DEATH_RT          16.52      0.508827   | 12 BABYMORT          0.07      0.221475
19 EDUC               2.24      0.103930   | 16 GDP_CAP           3.06      0.242491
21 HEALTH             4.88      0.127927   | 34 B_TO_D            0.32      0.198963
23 MIL                5.68      0.567128   | 30 LIFEEXPM          0.05      0.117494
                                           | 31 LIFEEXPF          0.04      0.080161
                                           | 32 LITERACY          1.75      0.238831
                                           | 40 DIFFRNCE          0.00      0.000000
Classification matrix (cases in row categories classified into columns)
              Europe   Islamic   NewWorld   %correct
Europe           13        0         0         100
Islamic           0        8         1          89
NewWorld          0        1        15          94
Total            13        9        16          95
[The classification matrix for the test sample (88% correct) and the jackknifed classification matrix for the learning sample appear here.]
After we add EDUC and HEALTH, the results here for the learning sample do not
differ from those for the previous model. However, for the test sample, the addition of
EDUC and HEALTH increases the percentage correct from 76% to 88%.
We continue by issuing the STEP specifications listed above, each time noting the
total percentage correct as well as the percentages for the Islamic and New World
groups. After scanning the classification results from both the test sample and the
learning sample jackknifed panel, we conclude that model 2 (BIRTH_RT, DEATH_RT,
MIL, EDUC, and HEALTH) is best and that model 1 performs the worst.
Finally, we set the group codes for the New World countries to missing, so that these
cases are classified but do not influence the estimates, and request automatic forward
stepping for the model containing BIRTH_RT, DEATH_RT, MIL, EDUC, and HEALTH:
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LET diffrnce = educ - health
IF group = 3 THEN LET group = .
LABEL group / 1=Europe, 2=Islamic
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy diffrnce
IDVAR = country$
PRINT / MAHAL
START / FORWARD
STEP / AUTO
STOP
The following are the Mahalanobis distances and posterior probabilities for the
countries with missing group codes and also the classification matrix. The weight
variable is not used here.
Mahalanobis distance-square from group means and posterior probabilities for group membership (Priors = .500, .500)

[The distance and probability listing for the 21 ungrouped New World countries, followed by the classification matrix for the Europe and Islamic groups, appears here.]
Argentina, Barbados, Canada, Uruguay, and Cuba are classified as European; the other
15 countries are classified as Islamic.
References
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7, 179–188.
Hill, M. A. and Engelman, L. (1992). Graphical aids for nonlinear regression and
discriminant analysis. In Y. Dodge and J. Whittaker, eds., Computational Statistics,
Vol. 2: Proceedings of the 10th Symposium on Computational Statistics. Physica-Verlag,
111–126.
Morrison, D. F. (1976). Multivariate statistical methods. New York: McGraw-Hill.
Pillai, K. C. S. (1960). Statistical tables for tests of multivariate hypotheses. Manila: The
Statistical Center, University of the Philippines.
Chapter
12
Factor Analysis
Herb Stenson and Leland Wilkinson
Factor analysis provides principal components analysis and common factor analysis
(maximum likelihood and iterated principal axis). SYSTAT has options to rotate, sort,
plot, and save factor loadings. With the principal components method, you can also
save the scores and coefficients. Orthogonal methods of rotation include varimax,
equamax, quartimax, and orthomax. A direct oblimin method is also available for
oblique rotation. Users can explore other rotations by interactively rotating a 3-D
Quick Graph plot of the factor loadings. Various inferential statistics (for example,
confidence intervals, standard errors, and chi-square tests) are provided, depending on
the nature of the analysis that is run.
Statistical Background
Principal components (PCA) and common factor (MLA for maximum likelihood and
IPA for iterated principal axis) analyses are methods of decomposing a correlation or
covariance matrix. Although principal components and common factor analyses are
based on different mathematical models, they can be used on the same data and both
usually produce similar results. Factor analysis is often used in exploratory data
analysis to:
- Study the correlations of a large number of variables by grouping the variables in
  factors so that variables within each factor are more highly correlated with
  variables in that factor than with variables in other factors.
- Interpret each factor according to the meaning of the variables.
- Summarize many variables by a few factors. The scores from the factors can be
  used as input data for t tests, regression, ANOVA, discriminant analysis, and so on.
Often the users of factor analysis are overwhelmed by the gap between theory and
practice. In this chapter, we try to offer practical hints. It is important to realize that
you may need to make several passes through the procedure, changing options each
time, until the results give you the necessary information for your problem.
If you understand the component model, you are on the way toward understanding
the factor model, so let's begin with the former.
A Principal Component
What is a principal component? The simplest way to see one is through real data. The
following data consist of Graduate Record Examination verbal and quantitative scores.
These scores are from 25 applicants to a graduate psychology department.
VERBAL   QUANTITATIVE
  590        620
  640        650
  620        610
  560        610
  600        740
  560        680
  600        520
  660        750
  630        570
  600        570
  600        690
  770        610
  600        530
  620        620
  550        610
  660        570
  730        650
  790        580
  710        540
  530        650
  710        640
  660        650
  570        550
  540        670
  660        640
Now, we could decide to try linear regression to predict verbal scores from
quantitative. Or, we could decide to predict quantitative from verbal by the same
method. The data don't suggest which is a dependent variable; either will do. What if
we aren't interested in predicting either one separately but instead want to know how
both variables hang together jointly? This is what a principal component does. Karl
Pearson, who developed principal component analysis in 1901, described a component
as a line of closest fit to systems of points in space. In short, the regression line
indicates best prediction, and the component line indicates best association.
The following figure shows the regression and component lines for our GRE data.
The regression of y on x is the line with the smallest slope (flatter than diagonal). The
regression of x on y is the line with the largest slope (steeper than diagonal). The
component line is between the other two. Interestingly, when most people are asked to
draw a line relating two variables in a scatterplot, they tend to approximate the
component line. It takes a lot of explaining to get them to realize that this is not the best
line for predicting the vertical axis variable (y) or the horizontal axis variable (x).
[Figure: scatterplot of the GRE scores (Verbal GRE Score on the horizontal axis) showing the two regression lines and the component line.]
Notice that the slope of the component line is approximately 1, which means that the
two variables are weighted almost equally (assuming the axis scales are the same). We
could make a new variable called GRE that is the sum of the two tests:
GRE = VERBAL + QUANTITATIVE
This new variable could summarize, albeit crudely, the information in the other two. If
the points clustered almost perfectly around the component line, then the new
component variable could summarize almost perfectly both variables.
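The geometry can be checked numerically: the component line is the direction of the first eigenvector of the covariance matrix, and its slope lies between the two regression slopes. A sketch in Python with NumPy (simulated scores rather than the table above):

import numpy as np

rng = np.random.default_rng(2)
verbal = rng.normal(630, 60, size=25)
quant = 0.7 * verbal + rng.normal(200, 45, size=25)

S = np.cov(verbal, quant)                 # 2 x 2 covariance matrix
evals, evecs = np.linalg.eigh(S)          # eigenvalues in ascending order
pc1 = evecs[:, -1]                        # direction of the component line

slope_y_on_x = S[0, 1] / S[0, 0]          # regression of quant on verbal
slope_x_on_y = S[1, 1] / S[0, 1]          # x-on-y regression line, expressed as a slope in the same plot
slope_pc = pc1[1] / pc1[0]                # slope of the component line
print(slope_y_on_x, slope_pc, slope_x_on_y)   # the component slope falls between the other two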
Component Coefficients
In the above equation for computing the first principal component on our test data, we
made both coefficients equal. In fact, when you run the sample covariance matrix using
factor analysis in SYSTAT, the coefficients are as follows:
GRE = 0.008 * VERBAL + 0.01 * QUANTITATIVE
They are indeed nearly equal. Their magnitude is considerably less than 1 because
principal components are usually scaled to conserve variance. That is, once you
compute the components with these coefficients, the total variance on the components
is the same as the total variance on the original variables.
Component Loadings
Most researchers want to know the relation between the original variables and the
components. Some components may be nearly identical to an original variable; in other
words, their coefficients may be nearly 0 for all variables except one. Other
components may be a more even amalgam of several original variables.
Component loadings are the covariances of the original variables with the
components. In our example, these loadings are 51.085 for VERBAL and 62.880 for
QUANTITATIVE. You may have noticed that these are proportional to the coefficients;
they are simply scaled differently. If you square each of these loadings and add them
up separately for each component, you will have the variance accounted for by each
component.
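In matrix terms, loadings of this kind are the eigenvectors of the covariance matrix scaled by the square roots of their eigenvalues, so the squared loadings for a component sum to that component's eigenvalue. A brief Python/NumPy sketch with an illustrative covariance matrix (not the GRE one):

import numpy as np

S = np.array([[3400.0, 2300.0],
              [2300.0, 5100.0]])              # hypothetical covariance matrix

evals, evecs = np.linalg.eigh(S)
order = np.argsort(evals)[::-1]               # largest component first
evals, evecs = evals[order], evecs[:, order]

loadings = evecs * np.sqrt(evals)             # covariances of variables with unit-variance components
print(np.round(loadings, 3))
print(np.round((loadings ** 2).sum(axis=0), 3))   # equals the eigenvalues: variance per component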
Correlations or Covariances
Most researchers prefer to analyze the correlation rather than covariance structure
among their variables. Sample correlations are simply covariances of sample
standardized variables. Thus, if your variables are measured on very different scales or
if you feel the standard deviations of your variables are not theoretically significant,
you will want to work with correlations instead of covariances. In our test example,
working with correlations yields loadings of 0.879 for each variable instead of 51.085
and 62.880. When you factor the correlation instead of the covariance matrix, then the
loadings are the correlations of each component with each original variable.
For our test data, loadings of 0.879 mean that if you created a GRE component by
standardizing VERBAL and QUANTITATIVE and adding them together weighted by
the coefficients, you would find the correlation between these component scores and
the original VERBAL scores to be 0.879. The same would be true for QUANTITATIVE.
Factor Analysis
We have seen how principal components analysis is a method for computing new
variables that summarize variation in a space parsimoniously. For our test variables,
the equation for computing the first component was:
GRE = 0.008 * VERBAL + 0.01 * QUANTITATIVE
In the factor model, by contrast, each observed variable is written as a function of one
or more unobserved factors; with a single common factor, for example,

VERBAL = a*F + u1
QUANTITATIVE = b*F + u2

where F is the common factor and u1 and u2 are unique errors. This model was
presented by Spearman near the turn of the century in the context of a
single intelligence factor and extended to multiple mental measurement factors by
Thurstone several decades later. Notice that the factor model makes observed variables
a function of unobserved factors. Even though this looks like a linear regression model,
none of the graphical and analytical techniques used for regression can be applied to
the factor model because there is no unique, observable set of factor scores or residuals
to examine.
Factor analysts are less interested in prediction than in decomposing a covariance
matrix. This is why the fundamental equation of factor analysis is not the above linear
model, but rather its quadratic form:
Observed covariances = Factor covariances + Error covariances
The covariances in this equation are usually expressed in matrix form, so that the
model decomposes an observed covariance matrix into a hypothetical factor
covariance matrix plus a hypothetical error covariance matrix. The diagonals of these
two hypothetical matrices are known, respectively, as communalities and
specificities.
In ordinary language, then, the factor model expresses variation within and relations
among observed variables as partly common variation among factors and partly
specific variation among random errors.
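In matrix notation the decomposition is R = LL' + U, where L holds the factor loadings and U is the diagonal matrix of specificities; the communalities are the diagonal of LL'. A small numerical illustration in Python/NumPy with made-up loadings:

import numpy as np

L = np.array([[0.9], [0.8], [0.7], [0.6]])      # loadings of 4 variables on 1 common factor

communalities = (L @ L.T).diagonal()            # variance each variable shares with the factor
U = np.diag(1.0 - communalities)                # specificities (unique variances)

R_implied = L @ L.T + U                         # model-implied correlation matrix
print(communalities)
print(np.round(R_implied, 2))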
Estimating Factors
Factor analysis involves several steps:
- First, the correlation or covariance matrix is computed from the usual cases-by-variables data file.

Rotation methods make the loadings for each factor either large or small, not in between. These methods are described in the next section.
Factors must be estimated iteratively in a computer. There are several methods
available. The most popular approach, available in SYSTAT, is to modify the diagonal
of the observed covariance matrix and calculate factors the same way components are
computed. This procedure is repeated until the communalities reproduced by the factor
covariances are indistinguishable from the diagonal of the modified matrix.
Rotation
Usually the initial factor extraction does not give interpretable factors. One of the
purposes of rotation is to obtain factors that can be named and interpreted. That is, if
you can make the large loadings larger than before and the smaller loadings smaller,
then each variable is associated with a minimal number of factors. Hopefully, the
variables that load strongly together on a particular factor will have a clear meaning
with respect to the subject area at hand.
It helps to study plots of loadings for one factor against those for another. Ideally,
you want to see clusters of loadings at extreme values for each factor: like what A and
C are for factor 1, and B and D are for factor 2 in the left plot, and not like E and F in
the middle plot.
1
1
d
-1
-1
0
e
-1
-1
-1
-1
1
1
In the middle plot, the loadings in groups E and F are sizeable for both factors 1 and 2.
However, if you lift the plot axes away from E and F, rotating them 45 degrees, and
then set them down as on the right, you achieve the desired effect. Sounds easy for two
factors. For three factors, imagine that the loadings are balls floating in a room and that
you rotate the floor and walls so that each loading is as close to the floor or a wall as it
can be. This concept generalizes to more dimensions.
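Numerically, an orthogonal rotation just multiplies the loading matrix by a rotation matrix; the communalities (row sums of squared loadings) are unchanged, and only the split of variance between factors changes. A 45-degree example like the E and F picture, in Python/NumPy with made-up loadings:

import numpy as np

L = np.array([[ 0.6,  0.6],
              [ 0.7,  0.5],
              [ 0.6, -0.6],
              [ 0.5, -0.7]])                 # sizeable loadings on both factors

theta = np.radians(45)
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

L_rot = L @ T
print(np.round(L_rot, 2))                    # each variable now loads mainly on one factor
print((L ** 2).sum(axis=1), (L_rot ** 2).sum(axis=1))   # communalities unchanged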
Researchers let the computer do the rotation automatically. There are many criteria
for achieving a simple structure among component loadings, although Thurstone's are
most widely cited. For p variables and m components:
- Each component should have at least m near-zero loadings.
- Few components should have nonzero loadings on the same variable.
Myth. Factor loadings are real; principal component loadings are approximations.
Fact. This statement is too ambiguous to have any meaning. It is easy to define things
so that factors are approximations of components.
Myth. Factor analysis is more likely to uncover lawful structure in your data; principal
components are more contaminated by error.
Fact. Again, this statement is ambiguous. With further definition, it can be shown to be
true for some data, false for others. It is true that, in general, factor solutions will have
lower dimensionality than corresponding component solutions. This can be an
advantage when searching for simple structure among noisy variables, as long as you
compare the result to a principal components solution to avoid being fooled by the sort
of degeneracies illustrated above.
- Iterated principal axis (IPA) estimates the common factors by starting with the
  principal components solution and iteratively solving for communalities.
- Maximum likelihood analysis (MLA) iteratively finds communalities and common
  factors.
Display. You can sort factor loadings by size or display extended results. Selecting
Extended results displays all possible Factor output.
Sample size for matrix input. If your data are in the form of a correlation or covariance
matrix, you must specify the sample size on which the input matrix is based so that
inferential statistics (available with extended results) can be computed.
Matrix for extraction. You can factor a correlation matrix or a covariance matrix. Most
frequently, the correlation matrix is used. You can also delete missing cases pairwise
instead of listwise. Listwise deletes any case with missing data for any variable in the
list. Pairwise examines each pair of variables and uses all cases with both values
present.
Extraction parameters. You can limit the results by specifying extraction parameters.
- Minimum eigenvalue. Specify the smallest eigenvalue to retain. The default is 1.0
  for PCA and IPA (not available with maximum likelihood). Incidentally, if you
  specify 0, factor analysis ignores components with negative eigenvalues (which
  can occur with pairwise deletion).
- Number of factors. Specify the number of factors to compute. If you specify both
  the number of factors and the minimum eigenvalue, factor analysis uses whichever
  criterion results in the smaller number of components.
- Iterations. Specify the number of iterations SYSTAT should perform (not available
  with PCA).
Rotation Parameters
This dialog box specifies the factor rotation method.
- Varimax. An orthogonal rotation method that minimizes the number of variables
  that have high loadings on each factor. It simplifies the interpretation of the factors.
- Equamax. A rotation method that is a combination of the varimax method, which
  simplifies the factors, and the quartimax method, which simplifies the variables.
  The number of variables that load highly on a factor and the number of factors
  needed to explain a variable are minimized.
- Quartimax. A rotation method that minimizes the number of factors needed to
  explain each variable.
- Orthomax. Specifies a family of orthogonal rotations. Gamma specifies the member
  of the family to use. Varying Gamma changes maximization of the variances of the
  loadings from columns (Varimax) to rows (Quartimax).
- Oblimin. Specifies families of oblique (non-orthogonal) rotations. Gamma
  specifies the member of the family to use. For Gamma, specify 0 for moderate
  correlations, positive values to allow higher correlations, and negative values to
  restrict correlations.
Save
You can save factor analysis results for further analyses.
For the maximum likelihood and iterated principal axis methods, you can save only
loadings. For the principal components method, select from these options:
- Do not save results. Results are not saved.
- Factor scores. Standardized factor scores.
- Residuals. Residuals for each case. For a correlation matrix, the residual is the
  actual z score minus the predicted z score (the factor scores times the loadings give
  the predicted scores). For a covariance matrix, the residuals are from
  unstandardized predictions. With an orthogonal rotation, Q and PROB are also
  saved. Q is the sum of the squared residuals, and PROB is its probability.
- Principal components. Unstandardized principal components scores with mean 0
  and variance equal to the eigenvalue for the factor (only for PCA without rotation).
- Factor coefficients. Coefficients that produce standardized scores for a correlation
  matrix and unstandardized scores for a covariance matrix.
- Factor loadings. Factor loadings.
- Save data with scores. Saves the selected item and all the variables in the working
  data file as a new data file. Use with options for scores (not loadings, coefficients,
  or other similar options).
If you save scores, the variables in the file are labeled FACTOR(1), FACTOR(2), and
so on. Any observations with missing values on any of the input variables will have
missing values for all scores. The scores are normalized to have zero mean and, if the
correlation matrix is used, unit variance. If you use the covariance matrix and perform
no rotations, SYSTAT does not standardize the component scores. The sum of their
variances is the same as for the original data.
If you want to use the score coefficients to get component scores for new data,
multiply the coefficients by the standardized data. SYSTAT does this when it saves
scores. Another way to do cross-validation is to assign a zero weight to those cases not
used in the factoring and to assign a unit weight to those cases used. The zero-weight
cases are not used in the factoring, but scores are computed for them.
When Factor scores or Principal components is requested, T2 and PROB are also
saved. The former is the Hotelling T2 statistic, the squared standardized distance
from each case to the centroid of the factor space (that is, the sum of the squared,
standardized factor scores). PROB is the upper-tail probability of T2. Use this statistic
to identify outliers within the factor space. T2 is not computed with an oblique rotation.
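The score computation and the T2 statistic can be sketched directly (Python/NumPy with SciPy; Z stands for the standardized data and C for saved factor-score coefficients, both made up here, and the chi-square tail area is only an approximation to the probability SYSTAT reports):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
Z = rng.standard_normal((30, 5))                 # standardized data: 30 cases, 5 variables
C = rng.standard_normal((5, 2)) * 0.3            # factor score coefficients for 2 factors

scores = Z @ C                                   # factor scores for each case
T2 = (scores ** 2).sum(axis=1)                   # squared distance from the centroid of the factor space
prob = stats.chi2.sf(T2, df=scores.shape[1])     # approximate upper-tail probability
print(np.round(T2[:5], 2), np.round(prob[:5], 3))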
Using Commands
After selecting a data file with USE filename, continue with:
FACTOR
MODEL varlist
SAVE filename / SCORES DATA LOAD COEF VECTORS PC
ESTIMATE / METHOD = PCA or IPA or MLA ,
LISTWISE or PAIRWISE N=n CORR or COVA ,
NUMBER=n EIGEN=n ITER=n CONV=n SORT ,
ROTATE = VARIMAX or EQUAMAX or QUARTIMAX ,
or ORTHOMAX or OBLIMIN
GAMMA=n
RESID
Usage Considerations
Types of data. Data for factor analysis can be a cases-by-variables data file, a correlation
matrix, or a covariance matrix.
Print options. Factor analysis offers three categories of output: short (the default),
medium, and long. Each has specific output panels associated with it.
For Short, the default, panels are: Latent roots or eigenvalues (not MLA), initial and
final communality estimates (not PCA), component loadings (PCA) or factor pattern
Examples
Example 1
Principal Components
Principal components (PCA, the default method) is a good way to begin a factor
analysis (and possibly the only method you may need). If one variable is a linear
combination of the others, the program will not stop (MLA and IPA both require a
nonsingular correlation or covariance matrix). The PCA output can also provide
indications that:
- One or more variables have little relation to the others and, therefore, are not suited
  for factor analysis; in your next run, you might consider omitting them.
- The final number of factors may be three or four and not double or triple this
  number.
To illustrate this method of factor extraction, we borrow data from Harman (1976),
who borrowed them from a 1937 unpublished thesis by Mullen. This classic data set is
widely used in the literature. For example, Jackson (1991) reports loadings for the
PCA, MLA, and IPA methods. The data are measurements recorded for 305 youth
aged seven to seventeen: height, arm span, length of forearm, length of lower leg,
weight, bitrochanteric diameter (the upper thigh), girth, and width. Because the units
of these measurements differ, we analyze a correlation matrix:
            Height  Arm_Span  Forearm  Lowerleg  Weight   Bitro   Girth   Width
Height       1.000
Arm_Span     0.846    1.000
Forearm      0.805    0.881    1.000
Lowerleg     0.859    0.826    0.801    1.000
Weight       0.473    0.376    0.380    0.436    1.000
Bitro        0.398    0.326    0.319    0.329    0.762    1.000
Girth        0.301    0.277    0.237    0.327    0.730    0.583   1.000
Width        0.382    0.415    0.345    0.365    0.629    0.577   0.539   1.000
The correlation matrix is stored in the YOUTH file. SYSTAT knows that the file contains
a correlation matrix, so no special instructions are needed to read the matrix. The input is:
FACTOR
USE youth
MODEL height .. width
ESTIMATE / METHOD=PCA N=305 SORT ROTATE=VARIMAX
Notice the shortcut notation (..) for listing consecutive variables in a file.
The output follows:
Latent Roots (Eigenvalues)

     1        2        3        4        5        6        7        8
  4.6729   1.7710   0.4810   0.4214   0.2332   0.1867   0.1373   0.0965

Component loadings

                 1          2
HEIGHT        0.8594     0.3723
ARM_SPAN      0.8416     0.4410
LOWERLEG      0.8396     0.3953
FOREARM       0.8131     0.4586
WEIGHT        0.7580    -0.5247
BITRO         0.6742    -0.5333
WIDTH         0.6706    -0.4185
GIRTH         0.6172    -0.5801

Variance explained by components: 4.6729 and 1.7710
Percent of total variance explained: 58.411 and 22.137

Rotated Loading Matrix (VARIMAX, Gamma = 1.0000): after sorting, the four length
variables (HEIGHT, ARM_SPAN, LOWERLEG, FOREARM) have loadings of 0.9298, 0.9191,
0.8998, and 0.8992 on factor 1 (and loadings between 0.16 and 0.26 on factor 2),
while WEIGHT, BITRO, WIDTH, and GIRTH have loadings of 0.8871, 0.8404, 0.8403, and
0.7496 on factor 2 (and loadings between 0.11 and 0.25 on factor 1).

Variance explained by rotated components: 3.4973 and 2.9465
Percent of total variance explained: 43.7165 and 36.8318

[A scree plot of the eigenvalues and a plot of the rotated loadings, with HEIGHT,
LOWERLEG, FOREARM, and ARM_SPAN at the right and WEIGHT, GIRTH, BITRO, and WIDTH at
the top, appear here.]
Notice that we did not specify how many factors we wanted. For PCA, the default
is to compute as many factors as there are eigenvalues greater than 1.0; so, in this run,
you study results for two factors. After examining the output, you may want to specify
a minimum eigenvalue or, very rarely, a lower limit.
Unrotated loadings (and orthogonally rotated loadings) are correlations of the
variables with the principal components (factors). They are also the eigenvectors of the
correlation matrix multiplied by the square roots of the corresponding eigenvalues.
Usually these loadings are not useful for interpreting the factors. For some industrial
applications, researchers prefer to examine the eigenvectors alone.
The Variance explained for each component is the eigenvalue for the factor. The
first factor accounts for 58.4% of the variance; the second, 22.1%. The Total Variance
is the sum of the diagonal elements of the correlation (or covariance) matrix. By
summing the Percent of Total Variance Explained for the two factors
( 58.411 + 22.137 = 80.548 ), you can say that more than 80% of the variance of all
eight variables is explained by the first two factors.
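That arithmetic is easy to verify; for a correlation matrix the total variance equals the number of variables (a quick Python check):

eigenvalues = [4.6729, 1.7710, 0.4810, 0.4214, 0.2332, 0.1867, 0.1373, 0.0965]
total = len(eigenvalues)                  # trace of a correlation matrix = number of variables
print([100 * e / total for e in eigenvalues[:2]])   # about 58.4 and 22.1 percent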
In the Rotated Loading Matrix, the rows of the display have been sorted, placing the
loadings > 0.5 for factor 1 first, and so on. These are the coefficients of the factors after
rotation, so notice that large values for the unrotated loadings are larger here and the
small values are smaller. The sums of squares of these coefficients (for each factor or
column) are printed below under the heading Variance Explained by Rotated
Components. Together, the two rotated factors explain more than 80% of the variance.
Factor analysis offers five types of rotation. Here, by default, the orthogonal varimax
method is used.
To interpret each factor, look for variables with high loadings. The four variables
that load highly on factor 1 can be said to measure lankiness; while the four that load
highly on factor 2, stockiness. Other data sets may include variables that do not load
highly on any specific factor.
In the factor scree plot, the eigenvalues are plotted against their order (or associated
component). Use this display to identify large values that separate well from smaller
eigenvalues. This can help to identify a useful number of factors to retain. Scree is the
rubble at the bottom of a cliff; the large retained roots are the cliff, and the deleted ones
are the rubble.
The points in the factor loadings plot are variables, and the coordinates are the
rotated loadings. Look for clusters of loadings at the extremes of the factors. The four
variables at the right of the plot load highly on factor 1 and all reflect length. The
variables at the top of the plot load highly on factor 2 and reflect width.
Example 2
Maximum Likelihood
This example uses maximum likelihood for initial factor extraction and 2 as the
number of factors. Other options remain as in the principal components example. The
input is:
FACTOR
USE youth
MODEL height .. width
ESTIMATE / METHOD=MLA N=305 NUMBER=2 SORT ROTATE=VARIMAX
[The initial communality estimates and the iteration history (the maximum change in SQRT(uniqueness) and the negative log of the likelihood at each of five iterations, with convergence criterion 0.001000) appear here.]

Canonical Correlations

     1         2
  0.9823    0.9489

Factor pattern

                 1          2
HEIGHT        0.8797     0.2375
ARM_SPAN      0.8735     0.3604
LOWERLEG      0.8551     0.2633
FOREARM       0.8458     0.3442
WEIGHT        0.7048    -0.6436
BITRO         0.5887    -0.5383
WIDTH         0.5743    -0.3653
GIRTH         0.5265    -0.5536

[The variance explained by the rotated factors (3.3146 and 2.6370), the percentages of total and common variance explained, the rotated factor loadings, and the plot of the rotated loadings appear here.]
The first panel of output contains the communality estimates. The communality of a
variable is its theoretical squared multiple correlation with the factors extracted. For
MLA (and IPA), the initial estimate of each communality is the variable's observed
squared multiple correlation with all the other variables.
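Those squared multiple correlations come directly from the inverse of the correlation matrix: SMC_i = 1 - 1/(R^-1)_ii. A sketch in Python/NumPy with a small illustrative matrix (not the full YOUTH matrix):

import numpy as np

R = np.array([[1.00, 0.85, 0.47],
              [0.85, 1.00, 0.38],
              [0.47, 0.38, 1.00]])

smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))   # initial communality estimate for each variable
print(np.round(smc, 3))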
The canonical correlations are the largest multiple correlations for successive
orthogonal linear combinations of factors with successive orthogonal linear
combinations of variables. These values are comfortably high. If, for other data, some
of the factors have values that are much lower, you might want to request fewer factors.
The loadings and amount of variance explained are similar to those found in the
principal components example. In addition, maximum likelihood reports the
percentage of common variance explained. Common variance is the sum of the
communalities. If A is the unrotated MLA factor pattern matrix, common variance is
the trace of AA'.
Number of Factors
In this example, we specified two factors to extract. If you were to omit this
specification and rerun the example, SYSTAT adds this report to the output:
The Maximum Number of Factors for Your Data is 4
SYSTAT will also report this message if you request more than four factors for these
data. This result is due to a theorem by Ledermann and indicates that the degrees of
freedom allow estimates of loadings and communalities for only four factors.
If we set the print length to long, SYSTAT reports:
Chi-square Test that the Number of Factors is 4
CSQ = 4.3187 P = 0.1154 DF = 2.00
The results of this chi-square test indicate that you do not reject the hypothesis that
there are four factors (p value > 0.05). Technically, the hypothesis is that no more than
four factors are required. This, of course, does not rule out 2 as the right number. For
the YOUTH data, here are rotated loadings for four factors:
Rotated Pattern Matrix (VARIMAX, Gamma = 1.0000)

                 1          2          3          4
ARM_SPAN      0.9372     0.1984    -0.2831     0.0465
LOWERLEG      0.8860     0.2142     0.1878     0.1356
HEIGHT        0.8776     0.2819     0.1134    -0.0077
FOREARM       0.8732     0.1957    -0.0851    -0.0065
WEIGHT        0.2414     0.8830     0.1077     0.1080
BITRO         0.1823     0.8233     0.0163    -0.0784
GIRTH         0.1133     0.7315    -0.0048     0.5219
WIDTH         0.2597     0.6459    -0.1400     0.0819
The loadings for the last two factors do not make sense. Possibly, the fourth factor has
one variable, GIRTH, but it still has a healthier loading on factor 2. This test is based
on an assumption of multivariate normality (as is MLA itself). If that assumption does
not hold, the test is invalid.
Example 3
Iterated Principal Axis
This example continues with the YOUTH data described in the principal components
example, this time using the IPA (iterated principal axis) method to extract factors. The
input is:
FACTOR
USE youth
MODEL height .. width
ESTIMATE / METHOD=IPA SORT ROTATE=VARIMAX
[The initial communality estimates, the iteration history (maximum change in SQRT(communality) at each iteration, with convergence criterion 0.001000), the final communality estimates, and the latent roots of the reduced correlation matrix appear here.]

Factor pattern

                 1          2
HEIGHT        0.8561     0.3244
ARM_SPAN      0.8482     0.4114
LOWERLEG      0.8309     0.3424
FOREARM       0.8082     0.4090
WEIGHT        0.7500    -0.5706
BITRO         0.6307    -0.4924
WIDTH         0.6074    -0.3509
GIRTH         0.5688    -0.5098

Variance explained by factors: 4.4489 and 1.5100

[The rotated factor loadings, the variance explained by the rotated factors (3.3150 and 2.6439), and the plot of the rotated loadings appear here.]
Before the first iteration, the communality of a variable is its squared multiple
correlation with the remaining variables. At each iteration, communalities are estimated
from the loadings matrix A: each communality is the corresponding diagonal element of
AA' (the sum of the variable's squared loadings), where the number of columns in A is
the number of factors. Iterations continue until the largest change in any communality
is less than that specified with Convergence. Replacing the diagonal of the correlation
(or covariance) matrix with these final communality estimates and computing the
eigenvalues yields the latent roots in the next panel.
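A bare-bones version of that iteration, in Python/NumPy with a toy correlation matrix and a fixed number of factors (none of SYSTAT's refinements are included):

import numpy as np

def iterated_principal_axis(R, n_factors, tol=1e-3, max_iter=100):
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))       # start from squared multiple correlations
    A = None
    for _ in range(max_iter):
        Rh = R.copy()
        np.fill_diagonal(Rh, h2)                      # reduced correlation matrix
        evals, evecs = np.linalg.eigh(Rh)
        idx = np.argsort(evals)[::-1][:n_factors]
        A = evecs[:, idx] * np.sqrt(np.clip(evals[idx], 0, None))   # loadings
        new_h2 = (A ** 2).sum(axis=1)                 # communalities = diagonal of AA'
        if np.max(np.abs(np.sqrt(new_h2) - np.sqrt(h2))) < tol:
            return A, new_h2
        h2 = new_h2
    return A, h2

R = np.array([[1.0, 0.8, 0.3, 0.3],
              [0.8, 1.0, 0.3, 0.3],
              [0.3, 0.3, 1.0, 0.7],
              [0.3, 0.3, 0.7, 1.0]])
A, h2 = iterated_principal_axis(R, n_factors=2)
print(np.round(A, 3), np.round(h2, 3))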
Example 4
Rotation
Let's compare the unrotated and orthogonally rotated loadings from the principal
components example with those from an oblique rotation. The input is:
FACTOR
USE youth
PRINT = LONG
MODEL height .. width
ESTIMATE / METHOD=PCA N=305 SORT
ESTIMATE / METHOD=PCA N=305 SORT ROTATE=VARIMAX
ESTIMATE / METHOD=PCA N=305 SORT ROTATE=OBLIMIN
[Output for the three runs appears here: the unrotated component loadings and the varimax-rotated loadings (the same values as in the principal components example), the oblimin-rotated pattern and structure matrices, the direct and indirect contributions of factors to variance, and loading plots for the unrotated, varimax, and oblimin solutions.]
The values in Direct and Indirect Contributions of Factors to Variance are useful for
determining if part of a factor's contribution to Variance Explained is due to its
correlation with another factor. Notice that

3.509 + 0.019 = 3.528 (or 3.527)

is the Variance Explained for factor 2 (the difference in the last digit is due to
rounding).
Think of the values in the Rotated Structure Matrix as correlations of the variable
with the factors. Here we see that the first four variables are highly correlated with the
first factor. The remaining variables are highly correlated with the second factor.
The factor loading plots illustrate the effects of the rotation methods. While the
unrotated factor loadings form two distinct clusters, they both have strong positive
loadings for factor 1. The lanky variables have moderate positive loadings on factor
2 while the stocky variables have negative loadings on factor 2. With the varimax
rotation, the lanky variables load highly on factor 1 with small loadings on factor 2;
the stocky variables load highly on factor 2. The oblimin rotation does a much better
job of centering each cluster at 0 on its minor factor.
Example 5
Factor Analysis Using a Covariance Matrix
Jackson (1991) describes a project in which the maximum thrust of ballistic missiles
was measured. For a specific measure called total impulse, it is necessary to calculate
the area under a curve. Originally, a planimeter was used to obtain the area, and later
an electronic device performed the integration directly but unreliably in its early usage.
As data, two strain gauges were attached to each of 40 Nike rockets, and both types of
measurements were recorded in parallel (making four measurements per rocket). The
covariance matrix of the measures is stored in the MISSLES file.
In this example, we illustrate features associated with covariance matrix input
(asymptotic 95% confidence limits for the eigenvalues, estimates of the population
eigenvalues with standard errors, and latent vectors (eigenvectors or characteristic
vectors) with standard errors).
[The input for this example and the resulting output appear here. The latent roots are 335.3355, 48.0344, 29.3305, and 16.4096, accounting for 78.1467%, 11.1940%, 6.8352%, and 3.8241% of the total variance; they are accompanied by asymptotic 95% confidence limits, estimates of the population eigenvalues with standard errors, the latent vectors with standard errors, the component loadings, a scree plot, and a SPLOM of the loadings for the four factors.]
SYSTAT performs a test to determine if all eigenvalues are equal. The null hypothesis
is that all eigenvalues are equal against an alternative hypothesis that at least one root
is different. The results here indicate that you reject the null hypothesis (p < 0.00005).
At least one of the eigenvalues differs from the others.
The size and sign of the loadings reflect how the factors and variables are related.
The first factor has fairly similar loadings for all four variables. You can interpret this
factor as an overall average of the area under the curve across the four measures. The
second factor represents gauge differences because the signs are different for each. The
third factor is primarily a comparison between the first planimeter and the first
integration device. The last factor has no simple interpretation.
When there are four or more factors, the Quick Graph of the loadings is a SPLOM.
The first component represents 78% of the variability of the product, so plots of
loadings for factors 2 through 4 convey little information (notice that values in the
stripe displays along the diagonal concentrate around 0, while those for factor 1 fall to
the right).
Example 6
Factor Analysis Using a Rectangular File
Begin this analysis from the OURWORLD cases-by-variables data file. Each case
contains information for one of 57 countries. We will study the interrelations among a
subset of 13 variables including economic measures (gross domestic product per capita
and U.S. dollars spent per person on education, health, and the military), birth and
death rates, population estimates for 1983, 1986, and 1990 plus predictions for 2020,
and the percentages of the population who can read and who live in cities.
We request principal components extraction with an oblique rotation. As a first step,
SYSTAT computes the correlation matrix. Correlations measure linear relations.
However, plots of the economic measures and population values as recorded indicate
a lack of linearity, so you use base 10 logarithms to transform six variables, and you
use square roots to transform two others. The input is:
FACTOR
USE ourworld
LET (gdp_cap, gnp_86, pop_1983, pop_1986, pop_1990, pop_2020),
= L10(@)
LET (mil,educ) = SQR(@)
MODEL urban birth_rt death_rt gdp_cap gnp_86 mil,
educ b_to_d literacy pop_1983 pop_1986,
pop_1990 pop_2020
PRINT=MEDIUM
SAVE pcascore / SCORES
ESTIMATE / METHOD=PCA SORT ROTATE=OBLIMIN
[The Pearson correlation matrix for the 13 transformed variables, the latent roots, and the sorted component loadings appear here; three of the latent roots exceed 1.0.]
Rotated Pattern Matrix (OBLIMIN, Gamma = 0.0)

               1         2         3
GDP_CAP     0.9779   -0.0399    0.0523
GNP_86      0.9714   -0.0816   -0.0146
BIRTH_RT   -0.9506    0.0040    0.0843
EDUC        0.8961   -0.1049    0.2194
LITERACY    0.8956   -0.0700   -0.3112
MIL         0.8777    0.1242    0.2924
URBAN       0.8349    0.1658   -0.2285
B_TO_D     -0.5224   -0.0501   -0.7787
POP_1990    0.0236    0.9977    0.0095
POP_1986    0.0491    0.9958    0.0234
POP_1983    0.0801    0.9932    0.0235
POP_2020   -0.1945    0.9805   -0.0510
DEATH_RT   -0.4459   -0.0011    0.8730

Correlations among the oblique factors

            1         2         3
1        1.0000
2        0.0127    1.0000
3       -0.0020    0.0452    1.0000

[Panels giving the variance explained by the rotated components and the percent of total variance explained also appear here.]
By default, SYSTAT extracts three factors because three eigenvalues are greater than
1.0. On factor 1, seven or eight variables have high loadings. The eighth, B_TO_D (the
ratio of the birth rate to the death rate), has a higher loading on factor 3. With the
exception of BIRTH_RT, the other variables are economic measures, so let's identify
this as the economic factor. Clearly, the second factor can be named population,
and the third, less clearly, death rates.
The economic and population factors account for 80% (49.19 + 30.81) of the total
variance, so a plot of the scores for these factors should be useful for characterizing
differences among the countries. The third factor accounts for 13% of the total
variance, a much smaller amount than the other two factors. Notice, too, that only 7%
of the total variance is not accounted for by these three factors.
Factor Scores
Look at the scores just stored in PCASCORE. First, merge the name of each country
and the grouping variable GROUP$ with the scores. The values of GROUP$ identify
each country as Europe, Islamic, or New World. Next, plot factor 2 against factor 1
(labeling points with country names) and factor 3 against factor 1 (labeling points with
the first letter of their group membership). Finally, use SPLOMs to display the scores,
adding 75% confidence ellipses for each subgroup in the plots and normal curves for
the univariate distributions. Repeat the latter using kernel density estimators.
[The Quick Graphs appear here: a scatterplot of the population factor scores against the economic factor scores with countries labeled by name; a plot of the death-rate factor scores against the economic scores with points labeled E, I, or N for group membership; and SPLOMs of the three factor scores grouped by GROUP$, first with 75% confidence ellipses and normal curves and then with kernel density estimates.]
High scores on the economic factor identify countries that are strong economically
(Germany, Canada, Netherlands, Sweden, Switzerland, Denmark, and Norway)
relative to those with low scores (Bangladesh, Ethiopia, Mali, and Gambia). Not
surprisingly, the population factor identifies Barbados as the smallest and Bangladesh,
Pakistan, and Brazil as the largest.
separate the New World countries from the others.
In each SPLOM, the dashed lines marking curves, ellipses, and kernel contours
identify New World countries. The kernel contours in the plot of factor 3 against factor
1 identify a pocket of Islamic countries within the New World group.
Computation
Algorithms
Provisional methods are used for computing covariance or correlation matrices (see
Correlations for references). Components are computed by using a Householder
tridiagonalization and implicit QL iterations. Rotations are computed with a variant of
Kaiser's iterative algorithm, described in Mulaik (1972).
Missing Data
Ordinarily, Factor Analysis and other multivariate procedures delete all cases having
missing values on any variable selected for analysis. This is listwise deletion. For data
with many missing values, you may end up with too few complete cases for analysis.
Select Pairwise deletion if you want covariances or correlations computed separately
for each pair of variables selected for analysis. Pairwise deletion takes more time than
the standard listwise deletion because all possible pairs of variances and covariances
are computed. The same option is offered for Correlations, should you decide to create
a symmetric matrix for use in factor analysis that way. Also notice that Correlation
provides an EM algorithm for estimating correlation or covariance matrices when data
are missing.
Be careful. When you use pairwise deletion, you can end up with negative
eigenvalues for principal components or be unable to compute common factors at all.
With either method, it is desirable that the pattern of missing data be random.
Otherwise, the factor structure you compute will be influenced systematically by the
pattern of how values are missing.
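For example, with pandas the two deletion schemes can be compared directly; DataFrame.corr uses the pairwise-complete cases for each correlation, while dropping incomplete rows first gives the listwise version (Python, made-up data):

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, np.nan, 5.0],
                   "y": [2.0, 1.0, np.nan, 4.0, 5.0],
                   "z": [1.0, 2.0, 3.0, 4.0, np.nan]})

pairwise = df.corr()            # each pair uses all cases with both values present
listwise = df.dropna().corr()   # only cases complete on every variable
print(pairwise, listwise, sep="\n\n")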
References
Afifi, A. A. and Clark, V. (1984). Computer-aided multivariate analysis. Belmont, Calif.:
Lifetime Learning Publications.
Clarkson, D. B. and Jennrich, R. I. (1988). Quartic rotation criteria and algorithms,
Psychometrika, 53, 251259.
Dixon, W. J. et al. (1985). BMDP statistical software manual. Berkeley: University of
California Press.
Gnanadesikan, R. (1977). Methods for statistical data analysis of multivariate
observations. New York: John Wiley & Sons, Inc.
Harman, H. H. (1976). Modern factor analysis, 3rd ed. Chicago: University of Chicago
Press.
Jackson, J. E. (1991). A user's guide to principal components. New York: John Wiley &
Sons, Inc.
Jennrich, R. I. and Robinson, S. M. (1969). A Newton-Raphson algorithm for maximum
likelihood factor analysis. Psychometrika, 34, 111–123.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate analysis. London:
Academic Press.
Morrison, D. F. (1976). Multivariate statistical methods, 2nd ed. New York: McGraw-Hill.
Mulaik, S. A. (1972). The foundations of factor analysis. New York: McGraw-Hill.
Rozeboom, W. W. (1982). The determinacy of common factors in large item domains.
Psychometrika, 47, 281–295.
Steiger, J. H. (1979). Factor indeterminacy in the 1930s and 1970s: some interesting
parallels. Psychometrika, 44, 157–167.
Chapter
13
Linear Models
Each chapter in this manual normally has its own statistical background section. In
this part, however, Regression, ANOVA, and General Linear Models are grouped
together. There are two reasons for doing this. First, while some introductory
textbooks treat regression and analysis of variance as distinct, statisticians know that
they are based on the same underlying mathematical model. When you study what
these procedures do, therefore, it is helpful to understand that model and learn the
common terminology underlying each method. Second, although SYSTAT has three
commands (REGRESS, ANOVA, and GLM) and menu settings, it is a not-so-well-guarded secret that these all lead to the same program, originally called MGLH (for
Multivariate General Linear Hypothesis). Having them organized this way means that
SYSTAT can use tools designed for one approach (for example, dummy variables in
ANOVA) in another (such as computing within-group correlations in multivariate
regression). This synergy is not usually available in packages that treat these models
independently.
y = a + bx
This is the equation for a straight line that you learned in school. The quantities in this
equation are:
y    the dependent variable
x    the independent variable
Variables are quantities that can vary (have different numerical values) in the same
equation. The remaining quantities are called parameters. A parameter is a quantity
that is constant in a particular equation, but that can be varied to produce other
equations in the same general family. The parameters are:
a    The value of y when x is 0. This is sometimes called a y-intercept (where a line intersects
     the y axis in a graph when x is 0).
b    The slope of the line, or the number of units y changes when x changes by one unit.
Let's look at an example. Here are some data showing the yearly earnings a partner
should theoretically get in a certain large law firm, based on annual personal billings
over quota (both in thousands of dollars):
EARNINGS   BILLINGS
60         20
70         40
80         60
90         80
100        100
120        140
140        180
150        200
175        250
190        280
We can plot these data with EARNINGS on the vertical axis (dependent variable) and
BILLINGS on the horizontal (independent variable). Notice in the following figure that
all the points lie on a straight line.
What is the equation for this line? Look at the vertical axis value on the sloped line
where the independent variable has a value of 0. Its value is 50. A lawyer is paid
$50,000 even when billing nothing. Thus, a is 50 in our equation. What is b? Notice
that the line rises by $10,000 when billings change by $20,000. The line rises half as
fast as it runs. You can also look at the data and see that the earnings change by $1 as
billing changes by $2. Thus, b is 0.5, or a half, in our equation.
Why bother with all these calculations? We could use the table to determine a
lawyer's compensation, but the formula and the line graph allow us to determine wages
not found in the table. For example, we now know that $30,000 in billings would yield
earnings of $65,000: y = 50 + 0.5 × 30 = 65 (in thousands of dollars).
Regression
Data are seldom this clean unless we design them to be that way. Law firms typically
fine-tune their partners' earnings according to many factors. Here are the real billings
and earnings for our law firm (these lawyers predate Reagan, Bush, Clinton, and
Gates):
EARNINGS   BILLINGS
86         20
67         40
95         60
105        80
86         100
82         140
140        180
145        200
144        250
184        280
Our techniques for computing a linear equation won't work with these data. Look at
the following graph. There is no way to draw a straight line through all the data.
Given the irregularities in our data, the line drawn in the figure is a compromise. How
do we find a best-fitting line? If we are interested in predicting earnings from billings
reasonably well, a sensible method would be to place a line through the points
so that the vertical deviations between the points and the line (the errors in prediction)
are as small as possible.
Least Squares
There are several ways to draw the line so that, on average, the deviations are small.
We could minimize the mean, the median, or some other measure of the typical
behavior of the absolute values of the residuals. Or we can minimize the sum (or mean)
of the squared residuals, which yields almost the same line in most cases. Using
squared instead of absolute residuals gives more influence to points whose y value is
farther from the average of all y values. This is not always desirable, but it makes the
mathematics simpler. This method is called ordinary least squares.
By specifying EARNINGS as the dependent variable and BILLINGS as the
independent variable in a MODEL statement, we can compute the ordinary least-squares regression y-intercept as about $62,800 and the slope as 0.375. These values do not
predict any single lawyers earnings exactly. They describe the whole firm well, in the
sense that, on the average, the line predicts a given earnings value fairly closely from
a given billings value.
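Outside SYSTAT, the least-squares arithmetic is easy to verify. The following is a minimal Python sketch (not SYSTAT code; the array names are ours) applied to the earnings and billings data listed above:

import numpy as np

billings = np.array([20, 40, 60, 80, 100, 140, 180, 200, 250, 280], float)
earnings = np.array([86, 67, 95, 105, 86, 82, 140, 145, 144, 184], float)

# slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x)
sxx = np.sum((billings - billings.mean()) ** 2)
sxy = np.sum((billings - billings.mean()) * (earnings - earnings.mean()))
slope = sxy / sxx
intercept = earnings.mean() - slope * billings.mean()
print(intercept, slope)    # about 62.84 and 0.375, matching the values quoted above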
Let us now restate this model in statistical notation to avoid confusion later. We are going to use Greek letters to denote parameters and italic Roman letters for variables. The error term is usually called ε:

y = α + βx + ε
Notice that ε is a random variable. It varies like any other variable (for example, x),
but it varies randomly, like the tossing of a coin. Since ε is random, our model forces
y to be random as well because adding fixed values (α and βx) to a random variable
produces another random variable. In ordinary language, we are saying with our model
that earnings are only partly predictable from billings. They vary slightly according to
many other factors, which we assume are random.
We do not know all of the factors governing the firm's compensation decisions, but
we assume:
n All the salaries are derived from the same linear model.
n The error in predicting a particular salary from billings using the model is
independent of (not in any way predictable from) the error in predicting other
salaries.
n The errors in predicting all the salaries come from the same random distribution.
Our model for predicting in our population contains parameters, but unlike our perfect
straight line example, we cannot compute these parameters directly from the data. The
data we have are only a small sample from a much larger population, so we can only
estimate the parameter values using some statistical method on our sample data. Those
of you who have heard this story before may not be surprised that ordinary least
squares is one reasonable method for estimating parameters when our three
assumptions are appropriate. Without going into all the details, we can be reasonably
assured that if our population assumptions are true and if we randomly sample some
cases (that is, each case has an equal chance of being picked) from the population, the
least-squares estimates of α and β will, on average, be close to their values in the
population.
So far, we have done what seems like a sleight of hand. We delved into some
abstruse language and came up with the same least-squares values for the slope and
intercept as before. There is something new, however. We have now added conditions
that define our least-squares values as sample estimates of population values. We now
regard our sample data as one instance of many possible samples. Our compensation
model is like Plato's cave metaphor; we think it typifies how this law firm makes
compensation decisions about any lawyer, not just the ones we sampled. Before, we
were computing descriptive statistics about a sample. Now, we are computing
inferential statistics about a population.
Standard Errors
There are several statistics relevant to the estimation of α and β. Perhaps most
important is a measure of how variable we could expect our estimates to be if we
continued to sample data from our population and used least squares to get our
estimates. A statistic calculated by SYSTAT shows what we could expect this variation
to be. It is called, appropriately, the standard error of estimate, or Std Error in the
output. The standard error of the y-intercept, or regression constant, is in the first row
of the coefficients: 10.440. The standard error of the billing coefficient or slope is
0.065. Look for these numbers in the following output:
Dep Var: EARNINGS   N: 10   Multiple R: 0.897

Variable    Coefficient   Std Error   Std Coef   Tolerance       t   P(2 Tail)
CONSTANT         62.838      10.440      0.0             .   6.019       0.000
BILLINGS          0.375       0.065      0.897       1.000   5.728       0.000

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square   F-ratio       P
Regression        10191.109    1     10191.109    32.805   0.000
Residual           2485.291    8       310.661
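If you want to see where the standard errors in this output come from, here is a hedged Python sketch (not SYSTAT code) that applies the usual simple-regression formulas to the same data:

import numpy as np

billings = np.array([20, 40, 60, 80, 100, 140, 180, 200, 250, 280], float)
earnings = np.array([86, 67, 95, 105, 86, 82, 140, 145, 144, 184], float)
n = len(billings)

sxx = np.sum((billings - billings.mean()) ** 2)
slope = np.sum((billings - billings.mean()) * (earnings - earnings.mean())) / sxx
intercept = earnings.mean() - slope * billings.mean()

resid = earnings - (intercept + slope * billings)
mse = np.sum(resid ** 2) / (n - 2)              # residual mean square, about 310.7

se_slope = np.sqrt(mse / sxx)                                        # about 0.065
se_intercept = np.sqrt(mse * (1 / n + billings.mean() ** 2 / sxx))   # about 10.44
print(se_intercept, se_slope)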
Hypothesis Testing
From these standard errors, we can construct hypothesis tests on these coefficients.
Suppose a skeptic approached us and said, "Your estimates look as if something is
going on here, but in this firm, salaries have nothing to do with billings. You just
happened to pick a sample that gives the impression that billings matter. It was the luck
of the draw that provided you with such a misleading picture. In reality, β is 0 in the
population because billings play no role in determining earnings."
We can reply, "If salaries had nothing to do with billings but are really just a mean
value plus random error for any billing level, then would it be likely for us to find a
coefficient estimate for β at least this different from 0 in a sample of 10 lawyers?"
To represent these alternatives as a bet between us and the skeptic, we must agree
on some critical level for deciding who will win the bet. If the likelihood of a sample
result at least this extreme occurring by chance is less than or equal to this critical level
(say, five times out of a hundred), we win; otherwise, the skeptic wins.
This logic might seem odd at first because, in almost every case, our skeptic's null
hypothesis would appear ridiculous, and our alternative hypothesis (that the skeptic is
wrong) seems plausible. Two scenarios are relevant here, however. The first is the
lawyer's. We are trying to make a case here. The only way we will prevail is if we
convince our skeptical jury beyond a reasonable doubt. In statistical practice, that
reasonable doubt level is relatively liberal: fewer than five times in a hundred. The
second scenario is the scientist's. We are going to stake our reputation on our model.
If someone sampled new data and failed to find nonzero coefficients, much less
coefficients similar to ours, few would pay attention to us in the future.
To compute probabilities, we must count all possibilities or refer to a mathematical
probability distribution that approximates these possibilities well. The most widely
used approximation is the normal curve, which we reviewed briefly in Chapter 1. For
large samples, the regression coefficients will tend to be normally distributed under the
assumptions we made above. To allow for smaller samples, however, we will add the
following condition to our list of assumptions:
n The errors in predicting the salaries come from a normal distribution.
If we estimate the standard errors of the regression coefficients from the data instead
of knowing them in advance, then we should use the t distribution instead of the
normal. The two-tail value for the probability represents the area under the theoretical
t probability curve corresponding to coefficient estimates whose absolute values are
more extreme than the ones we obtained. For both parameters in the model of lawyers'
earnings, these values (given as P(2 tail)) are less than 0.001, leading us to reject our
null hypothesis at well below the 0.05 level.
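The t statistics and their two-tailed probabilities can be checked from the reported coefficients and standard errors. A short Python sketch (not SYSTAT code; it assumes scipy is available) with N - 2 = 8 degrees of freedom:

from scipy import stats

# coefficient and standard error pairs from the output above
for coef, se in [(62.838, 10.440), (0.375, 0.065)]:
    t = coef / se
    p = 2 * stats.t.sf(abs(t), df=8)    # two-tailed probability
    print(round(t, 2), round(p, 5))     # about 6.02 and 5.77; both p < 0.001

(The output's 5.728 uses the unrounded standard error.)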
At the bottom of our output, we get an analysis of variance table that tests the
goodness of fit of our entire model. The null hypothesis corresponding to the F ratio
(32.805) and its associated p value is that the billing variable coefficient is equal to 0.
This test overwhelmingly rejects the null hypothesis that β is 0.
Multiple Correlation
In the same output is a statistic called the squared multiple correlation. This is the
proportion of the total variation in the dependent variable (EARNINGS) accounted for by
the linear prediction using BILLINGS. The value here (0.804) tells us that approximately
80% of the variation in earnings can be accounted for by a linear prediction from billings.
The rest of the variation, as far as this model is concerned, is random error. The square
root of this statistic is called, not surprisingly, the multiple correlation. The adjusted
squared multiple correlation (0.779) is what we would expect the squared multiple
correlation to be if we used the model we just estimated on a new sample of 10 lawyers
in the firm. It is smaller than the squared multiple correlation because the coefficients
were optimized for this sample rather than for the new one.
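Both statistics can be recovered from the ANOVA table shown earlier. A brief Python sketch (not SYSTAT code) of the arithmetic:

ss_regression = 10191.109
ss_residual = 2485.291
n, p = 10, 1                  # cases and predictors, not counting the constant

r2 = ss_regression / (ss_regression + ss_residual)     # about 0.804
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)           # about 0.779
print(round(r2, 3), round(adj_r2, 3))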
Regression Diagnostics
We do not need to understand the mathematics of how a line is fitted in order to use
regression. You can fit a line to any x-y data by the method of least squares. The
computer doesn't care where the numbers come from. To have a model and estimates
that mean something, however, you should be sure the assumptions are reasonable and
that the sample data appear to be sampled from a population that meets the
assumptions.
The sample analogues of the errors in the population model are the residuals: the
differences between the observed and predicted values of the dependent variable.
There are many diagnostics you can perform on the residuals. Here are the most
important ones:
The errors are normally distributed. Draw a normal probability plot (PPLOT) of the
residuals.
[Figure: normal probability plot of RESIDUAL]
The residuals should fall approximately on a diagonal straight line in this plot. When
the sample size is small, as in our law example, the line may be quite jagged. It is
difficult to tell by any method whether a small sample is from a normal population. You
can also plot a histogram or stem-and-leaf diagram of the residuals to see if they are
lumpy in the middle with thin, symmetric tails.
The errors have constant variance. Plot the residuals against the estimated values. The
following plot shows studentized residuals (STUDENT) against estimated values
(ESTIMATE). Studentized residuals are the true external kind discussed in Velleman
and Welsch (1981). Use these statistics to identify outliers in the dependent variable
space. Under normal regression assumptions, they have a t distribution with
(N - p - 1) degrees of freedom, where N is the total sample size and p is the number
of predictors (including the constant). Large values (greater than 2 or 3 in absolute
magnitude) indicate possible problems.
[Figure: plot of STUDENT against ESTIMATE]
Our residuals should be arranged in a horizontal band within two or three units around
0 in this plot. Again, since there are so few observations, it is difficult to tell whether
they violate this assumption in this case. There is only one particularly large residual,
and it is toward the middle of the values. This lawyer billed $140,000 and is earning
only $80,000. He or she might have a gripe about supporting a higher share of the
firm's overhead.
The errors are independent. Several plots can be done. Look at the plot of residuals
against estimated values above. Make sure that the residuals are randomly scattered
above and below the 0 horizontal and that they do not track in a snaky way across the
plot. If they look as if they were shot at the plot by a horizontally moving machine gun,
then they are probably not independent of each other. You may also want to plot
residuals against other variables, such as time, orientation, or other ways that might
influence the variability of your dependent measure. ACF PLOT in SERIES measures
whether the residuals are serially correlated. Here is an autocorrelation plot:
All the bars should be within the confidence bands if each residual is not predictable
from the one preceding it, and the one preceding that, and the one preceding that, and
so on.
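If you want to compute the first few autocorrelations yourself rather than read them off the ACF plot, here is a small Python sketch (not SYSTAT code) applied to the residuals from the simple regression above:

import numpy as np

billings = np.array([20, 40, 60, 80, 100, 140, 180, 200, 250, 280], float)
earnings = np.array([86, 67, 95, 105, 86, 82, 140, 145, 144, 184], float)
slope, intercept = np.polyfit(billings, earnings, 1)
resid = earnings - (intercept + slope * billings)

def acf(x, nlags):
    x = x - x.mean()
    denom = np.sum(x ** 2)
    return [float(np.sum(x[k:] * x[:-k]) / denom) for k in range(1, nlags + 1)]

print(acf(resid, 3))    # values near 0 suggest the residuals are not serially correlated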
All the members of the population are described by the same linear model. Plot Cook's
distance (COOK) against the estimated values.
[Figure: plot of COOK against ESTIMATE]
Cook's distance measures the influence of each sample observation on the coefficient
estimates. Observations that are far from the average of all the independent variable
values or that have large residuals tend to have a large Cook's distance value (say,
greater than 2). Cook's D closely follows an F distribution, so aberrant values
depend on the sample size. As a rule of thumb, under the normal regression
assumptions, COOK can be compared to an F distribution with p and N - p degrees of
freedom. We don't want to find a large Cook's D value for an observation because it
would mean that the coefficient estimates would change substantially if we deleted that
observation. While none of the COOK values are extremely large in our example, could
it be that the largest one in the upper right corner is the founding partner in the firm?
Despite large billings, this partner is earning more than the model predicts.
Another diagnostic statistic useful for assessing the model fit is leverage, discussed
in Belsley, Kuh, and Welsch (1980) and Velleman and Welsch (1981). Leverage helps
to identify outliers in the independent variable space. Leverage has an average value
of p/N, where p is the number of estimated parameters (including the constant) and
N is the number of cases. What is a high value of leverage? In practice, it is useful to
examine the values in a stem-and-leaf plot and identify those that stand apart from the
rest of the sample. However, various rules of thumb have been suggested. For example,
values of leverage less than 0.2 appear to be safe; between 0.2 and 0.5, risky; and above
0.5, to be avoided. Another says that if p > 6 and (N - p) > 12, use 3p/N as a
cutoff. SYSTAT uses an F approximation to determine this value for warnings
(Belsley, Kuh, and Welsch, 1980).
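For readers who want the formulas behind these diagnostics, the following Python sketch (not SYSTAT code; it uses the standard textbook formulas rather than SYSTAT's internal routines) computes leverage, externally studentized residuals, and Cook's distance for the simple regression on the law-firm data:

import numpy as np

billings = np.array([20, 40, 60, 80, 100, 140, 180, 200, 250, 280], float)
earnings = np.array([86, 67, 95, 105, 86, 82, 140, 145, 144, 184], float)

X = np.column_stack([np.ones_like(billings), billings])    # constant plus predictor
n, p = X.shape                                              # p counts the constant

beta, *_ = np.linalg.lstsq(X, earnings, rcond=None)
resid = earnings - X @ beta
hat = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)   # leverage h(i)

mse = np.sum(resid ** 2) / (n - p)
# externally studentized residuals: case i is removed from the error estimate
s2_del = (np.sum(resid ** 2) - resid ** 2 / (1 - hat)) / (n - p - 1)
student = resid / np.sqrt(s2_del * (1 - hat))
# Cook's distance
cook = (resid ** 2 / (p * mse)) * hat / (1 - hat) ** 2

print(np.round(hat, 3))        # average leverage is p/N = 0.2 for these data
print(np.round(student, 2))    # values beyond about 2 or 3 in magnitude deserve a look
print(np.round(cook, 2))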
In conclusion, keep in mind that all our diagnostic tests are themselves a form of
inference. We can assess theoretical errors only through the dark mirror of our
observed residuals. Despite this caveat, testing assumptions graphically is critically
important. You should never publish regression results until you have examined these
plots.
Multiple Regression
A multiple linear model has more than one independent variable; that is:
y = a + bx + cz
This is the equation for a plane in three-dimensional space. The parameter a is still an
intercept term. It is the value of y when x and z are 0. The parameters b and c are still
slopes. One gives the slope of the plane along the x dimension; the other, along the
z dimension.
The statistical model has the same form:
y = α + βx + γz + ε
Before we run out of letters for independent variables, let's switch to a more frequently
used notation:
y = β₀ + β₁x₁ + β₂x₂ + ε
Notice that we are still using Greek letters for unobservables and Roman letters for
observables.
Now, let's look at our law firm data again. We have learned that there is another
variable that appears to determine earnings: the number of hours billed per year by
each lawyer. Here is an expanded listing of the data:
EARNINGS   BILLINGS   HOURS
86         20         1771
67         40         1556
95         60         1749
105        80         1754
86         100        1594
82         140        1400
140        180        1780
145        200        1737
144        250        1645
184        280        1863
For our model, β₁ is the coefficient for BILLINGS, and β₂ is the coefficient for
HOURS. Let's look first at its graphical representation. The following figure shows the
plane fit by least squares to the points representing each lawyer. Notice how the plane
slopes upward on both variables. BILLINGS and HOURS both contribute positively to
EARNINGS in our sample.
Fitting this model involves no more work than fitting the simple regression model. We
specify one dependent and two independent variables and estimate the model as
before. Here is the result:
Dep Var: EARNINGS   N: 10   Multiple R: 0.998

Variable    Coefficient   Std Error   Std Coef   Tolerance          T   P(2 Tail)
CONSTANT       -139.925      11.116      0.000           .    -12.588       0.000
BILLINGS          0.333       0.010      0.797   0.9510698     32.690       0.000
HOURS             0.124       0.007      0.449   0.9510698     18.429       0.000

Analysis of Variance
Source       Sum-of-Squares   DF   Mean-Square   F-ratio       P
Regression        12626.210    2      6313.105   880.493   0.000
Residual             50.190    7         7.170
This time, we have one more row in our regression table for HOURS. Notice that its
coefficient (0.124) is smaller than that for BILLINGS (0.333). This is due partly to the
different scales of the variables. HOURS are measured in larger numbers than
BILLINGS. If we wish to compare the influence of each independent variable regardless
of scale, we should look at the standardized coefficients. Here, we still see that BILLINGS (0.797)
plays a greater role in predicting EARNINGS than does HOURS (0.449). Notice also that
both coefficients are highly significant and that our overall model is highly significant,
as shown in the analysis of variance table.
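The fit and the standardized coefficients can be reproduced outside SYSTAT. A Python sketch (not SYSTAT code) on the expanded law-firm data:

import numpy as np

earnings = np.array([86, 67, 95, 105, 86, 82, 140, 145, 144, 184], float)
billings = np.array([20, 40, 60, 80, 100, 140, 180, 200, 250, 280], float)
hours = np.array([1771, 1556, 1749, 1754, 1594, 1400, 1780, 1737, 1645, 1863], float)

X = np.column_stack([np.ones(10), billings, hours])
coef, *_ = np.linalg.lstsq(X, earnings, rcond=None)
print(np.round(coef, 3))    # constant, BILLINGS, and HOURS coefficients as in the output

# standardized coefficient = raw coefficient times sd(x) / sd(y)
for b, x in zip(coef[1:], (billings, hours)):
    print(round(b * x.std(ddof=1) / earnings.std(ddof=1), 3))   # about 0.80 and 0.45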
Variable Selection
In applications, you may not know which subset of predictor variables in a larger set
constitute a good model. Strategies for identifying a good subset are many and
varied: forward selection, backward elimination, stepwise (either a forward or
backward type), and all subsets. Forward selection begins with the best predictor,
adds the next best and continues entering variables to improve the fit. Backward
selection begins with all candidate predictors in an equation and removes the least
useful one at a time as long as the fit is not substantially worsened. Stepwise begins
as either forward or backward, but allows poor predictors to be removed from the
candidate model or good predictors to re-enter the model at any step. Finally, all
subsets methods compute all possible subsets of predictors for each model of a given
size (number of predictors) and choose the best one.
Bias and variance tradeoff. Submodel selection is a tradeoff between bias and variance.
By decreasing the number of parameters in the model, its predictive capability is
enhanced. This is because the variance of the parameter estimates decreases. On the
other side, bias may increase because the true model may have a higher dimension.
So we'd like to balance smaller variance against increased bias. There are two aspects
to variable selection: selecting the dimensionality of the submodel (how many
variables to include) and evaluating the model selected. After you determine the
dimension, there may be several alternative subsets that perform equally well. Then,
knowledge of the subject matter, how accurately individual variables are measured,
and what a variable communicates may guide selection of the model to report.
A strategy. If you are in an exploratory phase of research, you might try this version of
backwards stepping. First, fit a model using all candidate predictors. Then identify the
least useful variable, remove it from the model list, and fit a smaller model. Evaluate
your results and select another variable to remove. Continue removing variables. For a
given size model, you may want to remove alternative variables (that is, first remove
variable A, evaluate results, replace A and remove B, etc.).
Entry and removal criteria. Decisions about which variable to enter or remove should be
based on statistics and diagnostics in the output, especially graphical displays of these
values, and your knowledge of the problem at hand.
You can specify your own alpha-to-enter and alpha-to-remove values (do not make
alpha-to-remove less than alpha-to-enter, or variables may cycle in and out of the
equation; stepping automatically stops if this happens). The default values for these
options are Enter = 0.15 and Remove = 0.15. These values are appropriate for predictor
variables that are relatively independent. If your predictor variables are highly
correlated, you should consider lowering the Enter and Remove values well below
0.05.
When there are high correlations among the independent variables, the estimates of
the regression coefficients can become unstable. Tolerance is a measure of this
condition. It is (1 - R²); that is, one minus the squared multiple correlation between a
predictor and the other predictors included in the model. (Note that the dependent
variable is not used.) By setting a minimum tolerance value, variables highly correlated
with others already in the model are not allowed to enter.
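Tolerance is easy to compute directly from its definition. The following Python sketch (not SYSTAT code; the function name is ours) regresses each predictor on the remaining predictors and reports one minus the squared multiple correlation:

import numpy as np

def tolerance(X):
    # X is a cases-by-predictors array, without the constant column
    tol = []
    for j in range(X.shape[1]):
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1 - resid.var() / X[:, j].var()
        tol.append(1 - r2)
    return tol

billings = np.array([20, 40, 60, 80, 100, 140, 180, 200, 250, 280], float)
hours = np.array([1771, 1556, 1749, 1754, 1594, 1400, 1780, 1737, 1645, 1863], float)
print(np.round(tolerance(np.column_stack([billings, hours])), 4))   # about 0.95 each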
As a rough guideline, consider models that include only variables that have absolute
"t" values well above 2.0 and tolerance values greater than 0.1. (We use quotation
marks here because t and other statistics do not have their usual distributions when you
are selecting subset models.)
Evaluation criteria. There is no one test to identify the dimensionality of the best
submodel. Recent research by Leo Breiman emphasizes the usefulness of cross-validation
techniques involving 80% random subsamples. Sample 80% of your file, fit
a model, use the resulting coefficients on the remaining 20% to obtain predicted values,
and then compute R² for this smaller sample. In over-fitting situations, the discrepancy
between the R² for the 80% sample and the 20% sample can be dramatic.
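The following Python sketch (not SYSTAT code; the function and argument names are ours) outlines one way to carry out this 80%/20% check for a response y and a predictor matrix X:

import numpy as np

def crossval_r2(y, X, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    train = rng.permutation(n) < int(0.8 * n)          # a random 80% subsample
    Xc = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(Xc[train], y[train], rcond=None)
    pred = Xc[~train] @ coef                           # predict the held-out 20%
    ss_res = np.sum((y[~train] - pred) ** 2)
    ss_tot = np.sum((y[~train] - y[~train].mean()) ** 2)
    return 1 - ss_res / ss_tot

In over-fitted models, this value falls far below the squared multiple correlation from the 80% fit.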
A warning. If you do not have extensive knowledge of your variables and expect this
strategy to help you to find a true model, you can get into a lot of trouble. Automatic
stepwise regression programs cannot do your work for you. You must be able to
examine graphics and make intelligent choices based on theory and prior knowledge;
otherwise, you will be arriving at nonsense.
Moreover, if you are thinking of testing hypotheses after automatically fitting a
subset model, don't bother. Stepwise regression programs are the most notorious
source of pseudo p values in the field of automated data analysis. Statisticians seem
to be the only ones who know these are not real p values. The automatic stepwise
option is provided to select a subset model for prediction purposes. It should never be
used without cross-validation.
If you still want some sort of confidence estimate on your subset model, you might
look at tables in Wilkinson (1979), Rencher and Pun (1980), and Wilkinson and Dallal
(1982). These tables provide null-hypothesis R² values for selected subsets given the
number of candidate predictors and final subset size. If you don't know this literature
already, you will be surprised at how large multiple correlations from stepwise
regressions on random data can be. For a general summary of these and other
problems, see Hocking (1983). For more specific discussions of variable selection
problems, see the previous references and Flack and Chang (1987), Freedman (1983),
and Lovell (1983). Stepwise regression is probably the most abused computerized
statistical technique ever devised. If you think you need automated stepwise regression
to solve a particular problem, it is almost certain that you do not. Professional
statisticians rarely use automated stepwise regression because it does not necessarily
find the best fitting model, the real model, or alternative plausible models.
Furthermore, the order in which variables enter or leave a stepwise program is usually
of no theoretical significance. You are always better off thinking about why a model
could generate your data and then testing that model.
The triangular matrix input facility is useful for meta-analysis of published data and
missing-value computations. There are a few warnings, however. First, if you input
correlation matrices from textbooks or articles, you may not get the same regression
coefficients as those printed in the source. Because of round-off error, printed and raw
data can lead to different results. Second, if you use pairwise deletion with CORR, the
degrees of freedom for hypotheses will not be appropriate. You may not even be able
to estimate the regression coefficients because of singularities.
In general, when an incomplete data procedure is used to estimate the correlation
matrix, the estimate of regression coefficients and hypothesis tests produced from it are
optimistic. You can correct for this by specifying a sample size smaller than the
number of actual observations (preferably, set it equal to the smallest number of cases
used for any pair of variables), but this is a crude guess that you could refine only by
doing Monte Carlo simulations. There is no simple solution. Beware, especially, of
multivariate regressions (or MANOVA, etc.) with missing data on the dependent
variables. You can usually compute coefficients, but results from hypothesis tests are
particularly suspect.
Analysis of Variance
Often, you will want to examine the influence of categorical variables (such as gender,
species, country, and experimental group) on continuous variables. The model
equations for this case, called analysis of variance, are equivalent to those used in
linear regression. However, in the latter, you have to figure out a numerical coding for
categories so that you can use the codes in an equation as the independent variable(s).
Effects Coding
The following data file, EARNBILL, shows the breakdown of lawyers sampled by sex.
Because SEX is a categorical variable (numerical values assigned to MALE or
FEMALE are arbitrary), a code variable with the values 1 or -1 is used. It doesn't
matter which group is assigned 1, as long as the other is assigned -1.
EARNINGS   SEX      CODE
86         female   -1
67         female   -1
95         female   -1
105        female   -1
86         female   -1
82         male     1
140        male     1
145        male     1
144        male     1
184        male     1
There is nothing wrong with plotting earnings against the code variable, as long as you
realize that the slope of the line is arbitrary because it depends on how you assign your
codes. By changing the values of the code variable, you can change the slope. Here is
a plot with the least-squares regression line superimposed.
Let's do a regression on the data using these codes. Here are the coefficients as
computed by ANOVA:
Variable    Coefficient
CONSTANT    113.400
CODE        25.600
Notice that Constant (113.4) is the mean of all the data. It is also the regression
intercept because the codes are symmetrical about 0. The coefficient for Code (25.6)
is the slope of the line. It is also one half the difference between the means of the
groups. This is because the codes are exactly two units apart. This slope is often called
an effect in the analysis of variance because it represents the amount that the
categorical variable SEX affects EARNINGS. In other words, the effect of SEX can be
represented by the amount that the mean for males differs from the overall mean.
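A short Python sketch (not SYSTAT code) of both codings on these data shows the arithmetic: the effects-coding fit returns the grand mean and the effect, and the means-model fit returns the group means.

import numpy as np

earnings = np.array([86, 67, 95, 105, 86, 82, 140, 145, 144, 184], float)
code = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1], float)   # female = -1, male = 1;
                                                              # the assignment is arbitrary

X = np.column_stack([np.ones(10), code])
coef, *_ = np.linalg.lstsq(X, earnings, rcond=None)
print(np.round(coef, 1))     # about 113.4 and 25.6

# means model: drop the constant and use one indicator column per group
G = np.column_stack([(code < 0).astype(float), (code > 0).astype(float)])
means, *_ = np.linalg.lstsq(G, earnings, rcond=None)
print(np.round(means, 1))    # about 87.8 and 139.0, the group means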
Means Coding
The effects coding model is useful because the parameters (constant and slope) can be
interpreted as an overall level and as the effect(s) of treatment, respectively. Another
model, however, that yields the means of the groups directly is called the means model.
Here are the codes for this model:
EARNINGS   SEX      CODE1   CODE2
86         female   1       0
67         female   1       0
95         female   1       0
105        female   1       0
86         female   1       0
82         male     0       1
140        male     0       1
145        male     0       1
144        male     0       1
184        male     0       1
Notice that CODE1 is nonzero for all females, and CODE2 is nonzero for all males. To
estimate a regression model with these codes, you must leave out the constant. With
only two groups, only two distinct pieces of information are needed to distinguish
them. Here are the coefficients for these codes in a model without a constant:
Variable    Coefficient
CODE1       87.800
CODE2       139.000
Notice that the coefficients are now the means of the groups.
Models
Let's look at the algebraic models for each of these codings. Recall that the regression
model looks like this:
y = β₀ + β₁x₁ + ε
For the effects model, it is convenient to modify this notation as follows:
yⱼ = μ + αⱼ + ε

When x (the code variable) is 1, αⱼ is equivalent to α₁; when x is -1, αⱼ is equivalent to
α₂. This shorthand will help you later when dealing with models with many categories.
For this model, the parameter μ stands for the grand (overall) mean, and the parameter α
stands for the effect. In this model, our best prediction of the score of a group member
is derived from the grand mean plus or minus the deviation of that group from this
grand mean.
The means model looks like this:
yⱼ = μⱼ + ε
In this model, our best prediction of the score of a group member is the mean of that
group.
Hypotheses
As with regression, we are usually interested in testing hypotheses concerning the
parameters of the model. Here are the hypotheses for the two models:
H₀: α₁ = α₂ = 0   (effects model)
H₀: μ₁ = μ₂       (means model)
The tests of this hypothesis compare variation between the means to variation within
each group, which is mathematically equivalent to testing the significance of
coefficients in the regression model. In our example, the F ratio in the analysis of
variance table tells you that the coefficient for SEX is significant at p = 0.019, which is
less than the conventional 0.05 value. Thus, on the basis of this sample and the validity
of our usual regression assumptions, you can conclude that women earn significantly
less than men in this firm.
Dep Var: EARNINGS   N: 10   Multiple R: 0.719   Squared Multiple R: 0.517

Analysis of Variance
Source   Sum-of-Squares   df   Mean-Square   F-ratio       P
SEX            6553.600    1      6553.600     8.563   0.019
Error          6122.800    8       765.350
The nice thing about realizing that ANOVA is specially-coded regression is that the
usual assumptions and diagnostics are appropriate in this context. You can plot
residuals against estimated values, for example, to check for homogeneity of variance.
Multigroup ANOVA
When there are more groups, the coding of categories becomes more complex. For the
effects model, there is one fewer coding variable than the number of categories. For two
categories, you need only one coding variable; for three categories, you need two
coding variables:
Category   CODE1   CODE2
1          1       0
2          0       1
3          -1      -1
For the means model, you need one coding variable for each category:

Category   CODE1   CODE2   CODE3
1          1       0       0
2          0       1       0
3          0       0       1
For multigroup ANOVA, the models have the same form as for the two-group ANOVA
above. The corresponding hypotheses for testing whether there are differences between
means are:
H₀: α₁ = α₂ = α₃ = 0   (effects model)
H₀: μ₁ = μ₂ = μ₃       (means model)
You do not need to know how to produce coding variables to do ANOVA. SYSTAT
does this for you automatically. All you need is a single variable that contains different
values for each group. SYSTAT translates these values into different codes. It is
important to remember, however, that regression and analysis of variance are not
fundamentally different models. They are both instances of the general linear model.
Factorial ANOVA
It is possible to have more than one categorical variable in ANOVA. When this
happens, you code each categorical variable exactly the same way as you do with
multi-group ANOVA. The coded design variables are then added as a full set of
predictors in the model.
ANOVA factors can interact. For example, a treatment may enhance bar pressing
by male rats, yet suppress bar pressing by female rats. To test for this possibility, you
can add (to your model) variables that are the product of the main effect variables
already coded. This is similar to what you do when you construct polynomial models.
For example, this is a model without an interaction:
y = CONSTANT + treat + sex

and this is the same model with an interaction term added:

y = CONSTANT + treat + sex + treat*sex
If the hypothesis test of the coefficients for the TREAT*SEX term is significant, then
you must qualify your conclusions by referring to the interaction. You might say, It
works one way for males and another for females.
Graphical displays are useful for checking assumptions. For analysis of variance, try
dit plots, box-and-whisker displays, or bar charts with standard error bars.
Levene Test
Analysis of variance assumes that the data within cells are independent and normally
distributed with equal variances. This is the ANOVA equivalent of the regression
assumptions for residuals. When the homogeneous variance part of the assumptions is
false, it is sometimes possible to adjust the degrees of freedom to produce
approximately F-distributed statistics.
Levene (1960) proposed a test for unequal variances. You can use this test to
determine whether you need an unequal variance F test. Simply fit your model in
ANOVA and save residuals. Then transform the residuals into their absolute values.
Merge these with your original grouping variable(s). Then redo your ANOVA on the
absolute residuals. If it is significant, then you should consider using the separate
variances test.
Before doing all this work, you should do a box plot by groups to see whether the
distributions differ. If you see few differences in the spread of the boxes, Levene's test
is unlikely to be significant.
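Here is a hedged Python sketch (not SYSTAT code; it assumes scipy is available) of the same procedure done by hand for the two-group earnings data: compute residuals from the group means, take absolute values, and run a one-way ANOVA on them.

import numpy as np
from scipy.stats import f_oneway

female = np.array([86, 67, 95, 105, 86], float)
male = np.array([82, 140, 145, 144, 184], float)

abs_resid = [np.abs(group - group.mean()) for group in (female, male)]
print(f_oneway(*abs_resid))    # a significant F suggests unequal variances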
For 10 pairs, this probability increases to 0.40. The result of following such a strategy
is to declare differences as significant when they are not.
As an alternative to the situation described above, SYSTAT provides four
techniques to perform pairwise mean comparisons: Bonferroni, Scheffé, Tukey, and
Fisher's LSD. The first three methods provide protection for multiple tests.
determine significant differences, simply look for pairs with probabilities below
your critical value (for example, 0.05 or 0.01).
There is an abundance of literature covering multiple comparisons (see Miller, 1985);
however, a few points are worth noting here:
n If you have a small number of groups, the Bonferroni pairwise procedure will often
be more powerful (sensitive). For more groups, consider the Tukey method. Try all
the methods in ANOVA (except Fisher's LSD) and pick the best one.
n All possible pairwise comparisons are a waste of power. Think about a meaningful
subset of comparisons and test this subset with Bonferroni levels. To do this, divide
your critical level, say 0.05, by the number of comparisons you are making. You
will almost always have more power than with any other pairwise multiple
comparison procedures.
Duncan's test, for example, does not maintain its claimed protection level. Other
stepwise multiple range tests, such as Newman-Keuls, have not been conclusively
demonstrated to maintain overall protection levels for all possible distributions of
means.
The test statistic for a contrast is similar to that for a two-sample t test; the result of the
contrast (a relation among means, such as mean A minus mean B) is in the numerator
of the test statistic, and an estimate of within-group variability (the pooled variance
estimate or the error term from the ANOVA) is part of the denominator.
You can select contrast coefficients to test:
n Pairwise comparisons (test for a difference between two particular means)
n A linear combination of means that is meaningful to the study at hand (for example,
you might test for a linear increase in sales by comparing people with no
training, those with moderate training, and those with extensive training)
Many experimental design texts place coefficients for linear and quadratic contrasts for
three groups, four groups, and so on, in a table. SYSTAT allows you to type your
contrasts or select a polynomial option. A polynomial contrast of order 1 is linear; of
order 2, quadratic; of order 3, cubic; and so on.
Unbalanced Designs
An unbalanced factorial design occurs when the numbers of cases in cells are unequal
and not proportional across rows or columns. The following is an example of a
2 × 2 design:
        A1              A2
B1      1, 2, 5, 3, 4   6, 7, 9, 8
B2      4, 2, 1         5, 3
Unbalanced designs require a least-squares procedure like the General Linear Model
because the usual method of adding up sums of squared
deviations from cell means and the grand mean does not yield maximum likelihood
estimates of effects. The General Linear Model adjusts for unbalanced designs when
you get an ANOVA table to test hypotheses.
However, the estimates of effects in the unbalanced design are no longer orthogonal
(and thus statistically independent) across factors and their interactions. This means
that the sum of squares associated with one factor depends on the sum of squares for
another or its interaction.
Analysts accustomed to using multiple regression have no problem with this
situation because they assume that their independent variables in a model are
correlated. Experimentalists, however, often have difficulty speaking of a main effect
conditioned on another. Consequently, there is extensive literature on hypothesis
testing methodology for unbalanced designs (for example, Speed and Hocking, 1976,
and Speed, Hocking, and Hackney, 1978), and there is no consensus on how to test
hypotheses with non-orthogonal designs.
Some statisticians advise you to do a series of hierarchical tests beginning with
interactions. If the highest-order interactions are insignificant, drop them from the
model and recompute the analysis. Then, examine the lower-order interactions. If they
are insignificant, recompute the model with main effects only. Some computer
programs automate this process and print sums of squares and F tests according to the
hierarchy (ordering of effects) you specify in the model. SAS and SPSS GLM, for
example, call these Type I sums of squares.
For a three-way factorial with factors a, b, and c, the hierarchy of models looks like this:

Y = CONSTANT + a + b + c + ab + ac + bc + abc
Y = CONSTANT + a + b + c + ab + ac + bc
Y = CONSTANT + a + b + c + ab + ac
Y = CONSTANT + a + b + c + ab
Y = CONSTANT + a + b + c
Y = CONSTANT + a + b
Y = CONSTANT + a
The problem with this approach, however, is that plausible subsets of effects are
ignored if you examine only one hierarchy. The following model, which may be the
best fit to the data, is never considered:
Y = CONSTANT + a + b + ab
Furthermore, if you decide to examine all the other plausible subsets, you are really
doing all possible subsets regression, and you should use Bonferroni confidence levels
before rejecting a null hypothesis. The example above has 127 possible subset models
(excluding ones without a CONSTANT). Interactive stepwise regression allows you to
explore subset models under your control.
If you have done an experiment and have decided that higher-order effects
(interactions) are of enough theoretical importance to include in your model, you
should condition every test on all other effects in the model you selected. This is the
classical approach of Fisher and Yates. It amounts to using the default F values on the
ANOVA output, which are the same as the SAS and SPSS Type III sums of squares.
Probably the most important reason to stay with one model is that if you eliminate
a series of effects that are not quite significant (for example, p = 0.06), you could end
up with an incorrect subset model because of the dependencies among the sums of
squares. In summary, if you want other sums of squares, compute them. You can
supply the mean square error to customize sums of squares by using a hypothesis test
in GLM, selecting MSE, and specifying the mean square error and degrees of freedom.
Repeated Measures
In factorial ANOVA designs, each subject is measured once. For example, the
assumption of independence would be violated if a subject is measured first as a
control group member and later as a treatment group member. However, in a repeated
measures design, the same variable is measured several times for each subject (case).
A paired-comparison t test is the simplest form of a repeated measures design (for
example, each subject has a before and after measure).
Usually, it is not necessary for you to understand how SYSTAT carries out
calculations; however, repeated measures is an exception. It is helpful to understand
the quantities SYSTAT derives from your data. First, remember how to calculate a
paired-comparison t test by hand:
n For each subject, compute the difference between the two measures.
n Calculate the average of the differences.
n Calculate the standard deviation of the differences.
n Calculate the test statistic using this mean and standard deviation.
SYSTAT derives similar values from your repeated measures and uses them in
analysis-of-variance computations to test changes across the repeated measures
(within subjects) as well as differences between groups of subjects (between subjects).
Tests of the within-subjects values are called polynomial tests of order 1, 2,..., up to k,
where k is one less than the number of repeated measures. The first polynomial is used
to test linear changes (for example, do the repeated responses increase (or decrease)
around a line with a significant slope?). The second polynomial tests if the responses
fall along a quadratic curve, and so on.
For each case, SYSTAT uses orthogonal contrast coefficients to derive one
number for each polynomial. For the coefficients of the linear polynomial, SYSTAT
uses (-1, 0, 1) when there are three measures; (-3, -1, 1, 3) when there are four
measures; and so on. When there are three repeated measures, SYSTAT multiplies the
first by -1, the second by 0, and the third by 1, and sums these products (this sum is
then multiplied by a constant to make the sum of squares of the coefficients equal to
1). Notice that when the responses are the same, the result of the polynomial contrast
is 0; when the responses fall closely along a line with a steep slope, the polynomial
differs markedly from 0.
For the coefficients of the quadratic polynomial, SYSTAT uses (1, -2, 1) when
there are three measures; (1, -1, -1, 1) when there are four measures; and so on. The
cubic and higher-order polynomials are computed in a similar way.
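A small Python sketch (not SYSTAT code; the weights shown are hypothetical, not taken from any SYSTAT example) of the derived scores for one subject with three repeated measures:

import numpy as np

linear = np.array([-1.0, 0.0, 1.0])
quadratic = np.array([1.0, -2.0, 1.0])
linear /= np.sqrt(np.sum(linear ** 2))         # (-1, 0, 1) / sqrt(2)
quadratic /= np.sqrt(np.sum(quadratic ** 2))   # (1, -2, 1) / sqrt(6)

# hypothetical weights (grams) for one rat measured in three consecutive months
weights = np.array([310.0, 340.0, 365.0])
print(weights @ linear)       # linear component: large when the weights climb steadily
print(weights @ quadratic)    # quadratic component: near 0 when the trend is straight
print(weights.sum())          # total response, used for the between-groups test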
Let's continue the discussion for a design with three repeated measures. Assume
that you record body weight once a month for three months for rats grouped by diet.
(Diet A includes a heavy concentration of alcohol and Diet B consists of normal lab
chow.) For each rat, SYSTAT computes a linear component and a quadratic
component. SYSTAT also sums the weights to derive a total response. These derived
values are used to compute two analysis of variance tables:
n The total response is used to test between-group differences; that is, the total is
used as the dependent variable in the usual factorial ANOVA computations. In the
example, this test compares total weight for Diet A against that for Diet B. This is
analogous to a two-sample t test using total weight as the dependent variable.
n The linear and quadratic components are used to test changes across the repeated
measures (within subjects) and also to test the interaction of the within factor with
the grouping factor. If the test for the linear component is significant, you can
report a significant linear increase in weight over the three months. If the test for
the quadratic component is also significant (but much less so than the linear
component), you might report that growth is predominantly linear, but there is a
significant curve in the upward trend.
n A significant interaction between Diet (the between-group factor) and the linear
component across time might indicate that the slopes for Diet A and Diet B differ.
This test may be the most important one for the experiment.
If the compound symmetry assumption is satisfied, you can ignore the multivariate tests
at the bottom of the output and stay with the classical univariate ANOVA table because the
multivariate tests will generally be less powerful.
There is a middle approach. The Greenhouse-Geisser and Huynh-Feldt statistics are
used to adjust the probability for the classical univariate tests when compound
symmetry fails. (Huynh-Feldt is a more recent adjustment to the conservative
Greenhouse-Geisser statistic.) If the Huynh-Feldt p values are substantially different
from those under the column directly to the right of the F statistic, then you should be
aware that compound symmetry has failed. In this case, compare the adjusted p values
under Huynh-Feldt to those for the multivariate tests.
If all else fails, single degree-of-freedom polynomial tests can always be trusted. If
there are several to examine, however, remember that you may want to use Bonferroni
adjustments to the probabilities; that is, divide the normal value (for example, 0.05) by
the number of polynomial tests you want to examine. You need to make a Bonferroni
adjustment only if you are unable to use the summary univariate or multivariate tests
to protect the overall level; otherwise, you can examine the polynomials without
penalty if the overall test is significant.
groups and their interactions with trials. This means that the traditional analysis
method has highly restrictive assumptions. You must assume that the variances
within cells are homogeneous and that the covariances across all pairs of cells are
equivalent (compound symmetry). There are some mathematical exceptions to this
requirement, but they rarely occur in practice. Furthermore, the compound
symmetry assumption rarely holds for real data.
n Compound symmetry is not required for the validity of the single degree-of-freedom
polynomial tests.
n The effects are printed in the same order as they appear in Winer (1971) and other
texts, but they include the single degree-of-freedom and multivariate tests to
protect you from false conclusions. If you are satisfied that both are in agreement,
you can delete the additional lines in the output file.
n You can test any hypothesis after you have estimated a repeated measures design
and examined the output. For example, you can use polynomial contrasts to test
single degree-of-freedom components in an unevenly spaced design. You can also
use difference contrasts to do post hoc tests on adjacent trials.
Type I. Suppose the model is y = CONSTANT + a + b + a*b. Then the sum of squares
for AB is produced from the difference between SSE (sum of squared error) in the two
following models:
MODEL y = CONSTANT + a + b
MODEL y = CONSTANT + a + b + a*b
Similarly, the Type I sums of squares for B in this model are computed from the
difference in SSE between the following models:
MODEL y = CONSTANT + a
MODEL y = CONSTANT + a + b
Finally, the Type I sums of squares for A is computed from the difference in residual
sums of squares for the following:
MODEL y = CONSTANT
MODEL y = CONSTANT + a
In summary, to compute sums of squares, move from right to left and construct models
which differ by the right-most term only.
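The same differencing idea can be sketched outside SYSTAT. The following Python example (not SYSTAT code; the data are hypothetical) computes the sum of squares for the A*B interaction as the drop in SSE when the interaction column is added:

import numpy as np

def sse(y, X):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ coef) ** 2))

# hypothetical data: factor codes a and b in {-1, +1} and a response y
a = np.array([-1, -1, -1, -1, 1, 1, 1, 1], float)
b = np.array([-1, -1, 1, 1, -1, -1, 1, 1], float)
y = np.array([3, 4, 6, 5, 7, 8, 12, 11], float)

const = np.ones_like(y)
reduced = np.column_stack([const, a, b])          # MODEL y = CONSTANT + a + b
full = np.column_stack([const, a, b, a * b])      # MODEL y = CONSTANT + a + b + a*b
ss_ab = sse(y, reduced) - sse(y, full)            # sum of squares for A*B
print(ss_ab)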
Type II. Type II sums of squares are computed similarly to Type I except that main
effects and interactions determine the ordering of differences instead of the MODEL
statement order. For the above model, Type II sums of squares for the interaction are
computed from the difference in residual sums of squares for the following models:
MODEL y = CONSTANT + a + b
MODEL y = CONSTANT + a + b + a*b
For the A effect, difference the following (this is not the same as for
Type I):
MODEL y = CONSTANT + a + b
MODEL y = CONSTANT + b
In summary, include interactions of the same order as well as all lower order
interactions and main effects when differencing to get an interaction. When getting
sums of squares for a main effect, difference against all other main effects only.
Type III. Type III sums of squares are the default for ANOVA and are much simpler to
understand. Simply difference from the full model, leaving out only the term in
question. For example, the Type III sum of squares for A is taken from the following
two models:
MODEL y = CONSTANT + b + a*b
MODEL y = CONSTANT + a + b + a*b
Type IV. Type IV sums of squares are designed for missing cells designs and are not
easily presented in the above terminology. They are produced by balancing over the
means of nonmissing cells not included in the current hypothesis.
full model. Later, effects are conditioned on earlier effects, but earlier effects are not
conditioned on later effects. A Type II test is produced most easily with interactive
stepping (STEP). Type III is printed in the regression and ANOVA table. Finally, Type
IV is produced by the careful use of SPECIFY in testing means models. The advantage
of this approach is that the user is always aware that sums of squares depend on explicit
mathematical models rather than additions and subtractions of dimensionless
quantities.
Chapter
14
Linear Models I:
Linear Regression
Leland Wilkinson and Mark Coward
The model for simple linear regression is:

y = β₀ + β₁x + ε

where y is the dependent variable, x is the independent variable, and the βs are the
regression parameters (the intercept and the slope of the line of best fit). The model
for multiple linear regression is:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε
Both Regression and General Linear Model can estimate and test simple and multiple
linear regression models. Regression is easier to use than General Linear Model when
you are doing simple regression, multiple regression, or stepwise regression because it
has fewer options. To include interaction terms in your model or for mixture models, use
General Linear Model. With Regression, all independent variables must be continuous;
in General Linear Model, you can identify categorical independent variables and
SYSTAT will generate a set of design variables for each. Both General Linear Model
and Regression allow you to save residuals. In addition, you can test a variety of
hypotheses concerning the regression coefficients using General Linear Model.
The ability to do stepwise regression is available in three ways: use the default
values, specify your own selection criteria, or at each step, interactively select a
variable to add or remove from the model.
For each model you fit in REGRESS, SYSTAT reports R², adjusted R², the
standard error of the estimate, and an ANOVA table for assessing the fit of the model.
For each variable in the model, the output includes the estimate of the regression
coefficient, the standard error of the coefficient, the standardized coefficient, tolerance,
and a t statistic for measuring the usefulness of the variable in the model.
n Residuals. Saves the estimated value and residual for each observation, Cook's
distance measure, and the standard error of predicted values.
n Residuals/Data. Saves the residual statistics given by Residuals plus all the
variables in the working data file, including any transformed data values.
n Partial. Saves partial residuals. For MODEL Y = CONSTANT + X1 + X2 + X3, the
saved values are the residuals from:

Residual of Y = CONSTANT + X2 + X3
Residual of X1 = CONSTANT + X2 + X3
Residual of Y = CONSTANT + X1 + X3
Residual of X2 = CONSTANT + X1 + X3
Residual of Y = CONSTANT + X1 + X2
Residual of X3 = CONSTANT + X1 + X2
n Partial/Data. Saves partial residuals plus all the variables in the working data file,
including any transformed data values.
Regression Options
To open the Options dialog box, click Options in the Regression dialog box.
You can specify a tolerance level, select complete or stepwise entry, and specify entry
and removal criteria.
Tolerance. Prevents the entry of a variable that is highly correlated with the
independent variables already included in the model. Enter a value between 0 and 1.
Typical values are 0.01 or 0.001. The higher the value (closer to 1), the lower the
correlation required to exclude a variable.
Estimation. Controls the method used to enter and remove variables from the equation.
n Complete. All independent variables are entered in a single step.
n Mixture model. Constrains the independent variables to sum to a constant.
n Stepwise. Variables are entered or removed from the model one at a time.
The following alternatives are available for stepwise entry and removal:
n Backward. Begins with all candidate variables in the model. At each step, SYSTAT
removes the least useful variable from your model. For Forward, SYSTAT automatically
adds a variable to the model at each step.
n Interactive. At each step in the model building, you select the variable to enter or
remove.
Using Commands
First, specify your data with USE filename. Continue with:
REGRESS
MODEL var=CONSTANT + var1 + var2 + / N=n
SAVE filename / COEF MODEL RESID DATA PARTIAL
ESTIMATE / TOL=n
(use START instead of ESTIMATE for stepwise model building)
START / FORWARD BACKWARD TOL=n ENTER=p REMOVE=p ,
FENTER=n FREMOVE=n FORCE=n
STEP / AUTO ENTER=p REMOVE=p FENTER=n FREMOVE=n
STOP
Usage Considerations
Types of data. Input can be the usual cases-by-variables data file or a covariance,
correlation, or sum of squares and cross-products matrix. Using matrix input requires
specification of the sample size which generated the matrix.
Print options. Using PRINT = MEDIUM, the output includes eigenvalues of X'X,
condition indices, and variance proportions. PRINT = LONG adds the correlation matrix
of the regression coefficients to this output.
Quick Graphs. SYSTAT plots the residuals against the predicted values.
Saving files. You can save the results of the analysis (predicted values, residuals, and
diagnostics that identify unusual cases) for further use in examining assumptions.
BY groups. Separate regressions result for each level of any BY variables.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. REGRESS uses the FREQ variable to duplicate cases. This inflates
the degrees of freedom to be the sum of the number of frequencies.
Case weights. REGRESS weights cases using the WEIGHT variable for rectangular
data. You can perform cross-validation if the weight variable is binary and coded 0 or
1. SYSTAT computes predicted values for cases with zero weight even though they are
not used to estimate the regression parameters.
Examples
Example 1
Simple Linear Regression
In this example, we explore the relation between gross domestic product per capita
(GDP_CAP) and spending on the military (MIL) for 57 countries that report this
information to the United Nations; we want to determine whether a measure of the
financial well-being of a country is useful for predicting its military expenditures. Our
model is:

mil = β0 + β1 * gdp_cap + ε
To obtain the scatterplot, we created a new variable, NAME$, that had missing values for
all countries except Libya and Iraq. We then used the new variable to label plot points.
Iraq and Libya stand apart from the other countries; they spend considerably more
for the military than countries with similar GDP_CAP values. The smoother indicates
that the relationship between the two variables is fairly linear. Distressing, however, is
the fact that many points clump in the lower left corner. Many data analysts would
want to study the data after log-transforming both variables. We do this in another
example, but now we estimate the coefficients for the data as recorded.
To fit a simple linear regression model to the data, the input is:
REGRESS
USE ourworld
MODEL mil = CONSTANT + gdp_cap
ESTIMATE
Dep Var: MIL   N: 56   Multiple R: 0.646   Squared multiple R: 0.417

Effect       Coefficient   Std Error    Tol.       t      P(2 Tail)
CONSTANT          41.857      24.838      .       1.685      0.098
GDP_CAP            0.019       0.003    1.000     6.220      0.000

Effect       Coefficient   Lower < 95%>    Upper
CONSTANT          41.857         -7.940   91.654
GDP_CAP            0.019          0.013    0.025

Analysis of Variance
Source        Sum-of-Squares   df   Mean-Square    F-ratio      P
Regression        717100.891    1    717100.891     38.683   0.000
Residual         1001045.288   54     18537.876
-------------------------------------------------------------------------------
*** WARNING ***
Case ... is an outlier          (Studentized Residual =   6.956)
Case ... is an outlier          (Studentized Residual =   4.348)

Durbin-Watson D Statistic            2.046
First Order Autocorrelation         -0.032
SYSTAT reports that data are missing for one case. In the next line, it reports that 56
cases are used (N = 56). In the regression calculations, SYSTAT uses only the cases
that have complete data for the variables in the model. However, when only the
dependent variable is missing, SYSTAT computes a predicted value, its standard error,
and a leverage diagnostic for the case. In this sample, Afghanistan did not report
military spending.
When there is only one independent variable, Multiple R (0.646) is the simple
correlation between MIL and GDP_CAP. Squared multiple R (0.417) is the square of
this value, and it is the proportion of the total variation in the military expenditures
accounted for by GDP_CAP (GDP_CAP explains 41.7% of the variability of MIL).
Use Sum-of-Squares in the analysis of variance table to compute it:
717100.891 / (717100.891 + 1001045.288)
Adjusted squared multiple R is of interest for models with more than one independent
variable. Standard error of estimate (136.154) is the square root of the residual mean
square (18537.876) in the ANOVA table.
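As a quick arithmetic check of both quantities: 717100.891 / (717100.891 + 1001045.288) =
717100.891 / 1718146.179 ≈ 0.417, and the square root of 18537.876 is approximately 136.15.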
The estimates of the regression coefficients are 41.857 and 0.019, so the equation is:
mil = 41.857 + 0.019 * gdp_cap
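To illustrate how the equation is used, for a hypothetical country with a GDP_CAP of
10,000 dollars, the fitted line predicts military spending of about
41.857 + 0.019 * 10,000 ≈ 232, in the units of MIL.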
The standard errors (Std Error) of the estimated coefficients are in the next column and
the standardized coefficients (Std Coef) follow. The latter are called beta weights by
some social scientists. Tolerance is not relevant when there is only one predictor.
Next are t statistics (t); the first (1.685) tests the significance of the difference of
the constant from 0 and the second (6.220) tests the significance of the slope, which is
equivalent to testing the significance of the correlation between military spending and
GDP_CAP.
F-ratio in the analysis of variance table is used to test the hypothesis that the slope
is 0 (or, for multiple regression, that all slopes are 0). The F is large when the
independent variable(s) helps to explain the variation in the dependent variable. Here,
there is a significant linear relation between military spending and GDP_CAP. Thus,
we reject the hypothesis that the slope of the regression line is zero (F-ratio = 38.683,
p value (P) < 0.0005).
It appears from the results above that GDP_CAP is useful for predicting spending
on the military; that is, countries that are financially sound tend to spend more on the
military than poorer nations. These numbers, however, do not provide the complete
picture. Notice that SYSTAT warns us that two countries (Iraq and Libya) with
unusual values could be distorting the results. We recommend that you consider
transforming the data and that you save the residuals and other diagnostic statistics.
Example 2
Transformations
The data in the scatterplot in the simple linear regression example are not well suited
for linear regression, as the heavy concentration of points in the lower left corner of the
graph shows. Here are the same data plotted in log units:
REGRESS
USE ourworld
PLOT MIL*GDP_CAP / SMOOTH=LOWESS TENSION=0.500,
     XLABEL='GDP per capita',
     XLOG=10 YLABEL='Military Spending' YLOG=10,
     SYMBOL=4,2,3,
     SIZE=1.250 LABEL=COUNTRY$ CSIZE=1.450
Except possibly for Iraq and Libya, the configuration of these points is better for linear
modeling than that for the untransformed data.
We now transform both the y and x variables and refit the model. The input is:
REGRESS
USE ourworld
LET log_mil = L10(mil)
LET log_gdp = L10(gdp_cap)
MODEL log_mil = CONSTANT + log_gdp
ESTIMATE
Dep Var: LOG_MIL   N: 56   Multiple R: 0.857   Squared multiple R: 0.734

Effect       Coefficient   Std Error    Tol.       t      P(2 Tail)
CONSTANT          -1.308       0.257      .      -5.091      0.000
LOG_GDP            0.909       0.075    1.000    12.201      0.000

Effect       Coefficient   Lower < 95%>    Upper
CONSTANT          -1.308         -1.822   -0.793
LOG_GDP            0.909          0.760    1.058
Analysis of Variance
Source        Sum-of-Squares   df   Mean-Square    F-ratio      P
Regression            17.868    1        17.868    148.876   0.000
Residual               6.481   54         0.120
-------------------------------------------------------------------------------
*** WARNING ***
Case 22 is an outlier           (Studentized Residual =   4.004)

Durbin-Watson D Statistic            1.810
First Order Autocorrelation          0.070
The Squared multiple R for the variables in log units is 0.734 (versus 0.417 for the
untransformed values). That is, we have gone from explaining 41.7% of the variability
of military spending to 73.4% by using the log transformations. The F-ratio is now
148.876; it was 38.683. Notice that we now have only one outlier (Iraq).
But what is the estimated model now?

log_mil = -1.308 + 0.909 * log_gdp

However, many people don't think in log units. Let's transform this equation by
exponentiating each side:
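Since both variables were transformed with base-10 logs, raising 10 to each side gives
(to rounding, using 10^-1.308 ≈ 0.049):

mil ≈ 0.049 * gdp_cap^0.909

so predicted military spending grows slightly less than proportionally with GDP per capita.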
Example 3
Residuals and Diagnostics for Simple Linear Regression
In this example, we continue with the transformations example and save the residuals
and diagnostics along with the data. Using the saved statistics, we create stem-and-leaf
plots of the residuals and Studentized residuals. In addition, let's plot the Studentized
residuals (to identify outliers in the y space) against leverage (to identify outliers in the
x space) and use Cook's distance measure to scale the size of each plot symbol. In a
second plot, we display the corresponding country names. The input is:
REGRESS
USE ourworld
LET log_mil = L10(mil)
LET log_gdp = L10(gdp_cap)
MODEL log_mil = CONSTANT + log_gdp
SAVE myresult / DATA RESID
ESTIMATE
USE myresult
STATS
STEM residual student
PLOT STUDENT*LEVERAGE / SYMBOL=4,2,3 SIZE=cook
PLOT student*leverage / LABEL=country$ SYMBOL=4,2,3
RESIDUAL

  -6   42
  -5   6
  -4   42
  -3   554000
  -2 H 65531
  -1   9876433
  -0 M 98433200
   0   222379
   1   1558
   2 H 009
   3   0113369
   4   27
   5   1
   6   7
   7
        * * * Outside Values * * *
  12   1
1 cases with missing values excluded from plot.

STUDENT

  -1   986
  -1   32000
  -0 H 88877766555
  -0 M 443322111000
   0 M 000022344
   0 H 555889999
   1   0223
   1   5
   2
   3
        * * * Outside Values * * *
   4   0
1 cases with missing values excluded from plot.
In the stem-and-leaf plots, Iraqs residual is 1.216 and is identified as an Outside Value.
The value of its Studentized residual is 4.004, which is very extreme for the t
distribution.
The case with the most influence on the estimates of the regression coefficients
stands out at the top left (that is, it has the largest plot symbol). From the second plot,
we identify this country as Iraq. Its value of Cook's distance measure is large because
its Studentized residual is extreme. On the other hand, Ethiopia (furthest to the right),
the case with the next most influence, has a large value of Cook's distance because its
value of leverage is large. Gambia has the third largest Cook value, and Libya, the
fourth.
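For readers who want to see exactly what these saved diagnostics measure, the standard
formulas can be sketched as follows (illustrative Python for a design matrix X1 that
includes the constant column; a simplified version of the usual definitions, not
SYSTAT's internal routine):

import numpy as np

def regression_diagnostics(X1, y):
    n, p = X1.shape
    H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T        # hat matrix
    h = np.diag(H)                                   # leverage of each case
    e = y - H @ y                                    # ordinary residuals
    s2 = (e @ e) / (n - p)                           # residual mean square
    student = e / np.sqrt(s2 * (1.0 - h))            # (internally) studentized residuals
    cook = (e ** 2) * h / (p * s2 * (1.0 - h) ** 2)  # Cook's distance
    return e, h, student, cook

Large leverage flags a case that is extreme on the predictors, a large studentized
residual flags a case far from the fitted line, and Cook's distance combines the two.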
Deleting an Outlier
Residual plots identify Iraq as the case with the greatest influence on the estimated
coefficients. Let's remove this case from the analysis and check SYSTAT's warnings.
The input is:
REGRESS
USE ourworld
LET log_mil = L10(mil)
LET log_gdp = L10(gdp_cap)
SELECT mil < 700
MODEL log_mil = CONSTANT + log_gdp
ESTIMATE
SELECT
Dep Var: LOG_MIL   N: 55   Multiple R: 0.886

Effect       Coefficient   Std Error    Tol.       t      P(2 Tail)
CONSTANT          -1.353       0.227      .      -5.949      0.000
LOG_GDP            0.916       0.066    1.000    13.896      0.000

Analysis of Variance
Source        Sum-of-Squares   df   Mean-Square    F-ratio      P
Regression            18.129    1        18.129    193.107   0.000
Residual               4.976   53         0.094
-------------------------------------------------------------------------------
Durbin-Watson D Statistic            1.763
First Order Autocorrelation          0.086
                  COOK   LEVERAGE   STUDENT        MIL      GDP_CAP
                 0.013      0.032    -0.891     95.833     8970.885
                 0.023      0.043    -1.011    127.237    13500.299
                 0.000      0.044    -0.001    283.939    13724.502
                 0.000      0.045    -0.119    269.608    14363.064
   (etc.)
Libya            0.056      0.022     2.348    640.513     4738.055
Somalia          0.009      0.072     0.473      8.846      201.798
Afghanistan          .      0.075         .          .      189.128
   (etc.)
The value of MIL for Afghanistan is missing, so Cook's distance measure and
Studentized residuals are not available (periods are inserted for these values in the
listing).
Example 4
Multiple Linear Regression
In this example, we build a multiple regression model to predict total employment
using values of six independent variables. The data were originally used by Longley
(1967) to test the robustness of least-squares packages to multicollinearity and other
sources of ill-conditioning. SYSTAT can print the estimates of the regression
coefficients with more correct digits than the solution provided by Longley himself
if you adjust the number of decimal places. By default, the first three digits after the
decimal are displayed. After the output is displayed, you can use General Linear Model
to test hypotheses involving linear combinations of regression coefficients.
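If you want to cross-check the fit outside SYSTAT, the Longley data are widely
reproduced; a minimal sketch follows (it assumes a plain whitespace-delimited file with
the column order shown in the comment, which is an assumption about your copy of the
data, not part of SYSTAT):

import numpy as np

# Assumed column order: DEFLATOR GNP UNEMPLOY ARMFORCE POPULATN TIME TOTAL
data = np.loadtxt("longley.txt")
X, y = data[:, :6], data[:, 6]
X1 = np.column_stack([np.ones(len(y)), X])

# lstsq is numerically safer than forming inverse(X'X) for data this ill-conditioned.
beta, rss, rank, sv = np.linalg.lstsq(X1, y, rcond=None)
print(beta)            # intercept followed by the six slopes
print(sv[0] / sv[-1])  # ratio of extreme singular values: a quick conditioning check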
Eigenvalues of unit scaled X'X

        1        2        3        4        5        6        7
      ...    0.082    0.046    0.011    0.000    0.000    0.000

Condition indices

        1        2        3        4        5        6        7
    1.000    9.142   12.256   25.337  230.424 1048.080 43275.046

Variance proportions

                 1        2        3        4        5        6        7
CONSTANT     0.000    0.000    0.000    0.000    0.000    0.000    1.000
DEFLATOR     0.000    0.000    0.000    0.000    0.457    0.505    0.038
GNP          0.000    0.000    0.000    0.001    0.016    0.328    0.655
UNEMPLOY     0.000    0.014    0.001    0.065    0.006    0.225    0.689
ARMFORCE     0.000    0.092    0.064    0.427    0.115    0.000    0.302
POPULATN     0.000    0.000    0.000    0.000    0.010    0.831    0.160
TIME         0.000    0.000    0.000    0.000    0.000    0.000    1.000

Dep Var: TOTAL   N: 16   Multiple R: 0.998

Effect        Coefficient      Std Error    Tol.        t      P(2 Tail)
CONSTANT     -3482258.635     890420.384      .      -3.911      0.004
DEFLATOR           15.062         84.915    0.007     0.177      0.863
GNP                -0.036          0.033    0.001    -1.070      0.313
UNEMPLOY           -2.020          0.488    0.030    -4.136      0.003
ARMFORCE           -1.033          0.214    0.279    -4.822      0.001
POPULATN           -0.051          0.226    0.003    -0.226      0.826
TIME             1829.151        455.478    0.001     4.016      0.003

(The full output also lists the lower and upper 95% confidence bounds for each
coefficient.)

Correlation matrix of regression coefficients

            DEFLATOR      GNP  UNEMPLOY  ARMFORCE  POPULATN      TIME
DEFLATOR       1.000
GNP           -0.649    1.000
UNEMPLOY      -0.555    0.946     1.000
ARMFORCE      -0.349    0.469     0.619     1.000
POPULATN       0.659   -0.833    -0.758    -0.189     1.000
TIME           0.186   -0.802    -0.824    -0.549     0.388     1.000

Analysis of Variance
Source        Sum-of-Squares   df    Mean-Square    F-ratio      P
Regression       1.84172E+08    6    3.06954E+07    330.285   0.000
Residual          836424.056    9      92936.006
-------------------------------------------------------------------------------
Durbin-Watson D Statistic            2.559
First Order Autocorrelation         -0.348
SYSTAT computes the eigenvalues by scaling the columns of the X matrix so that the
diagonal elements of X'X are 1s and then factoring the X'X matrix. In this example,
most of the eigenvalues of X'X are nearly 0, showing that the predictor variables
comprise a relatively redundant set.
Condition indices are the square roots of the ratios of the largest eigenvalue to each
successive eigenvalue. A condition index greater than 15 indicates a possible problem,
and an index greater than 30 suggests a serious problem with collinearity (Belsley,
Kuh, and Welsch, 1980). The condition indices in the Longley example show a
tremendous collinearity problem.
Variance proportions are the proportions of the variance of the estimates accounted
for by each principal component associated with each of the above eigenvalues. You
should begin to worry about collinearity when a component associated with a high
condition index contributes substantially to the variance of two or more variables. This
is certainly the case with the last component of the Longley data. TIME, GNP, and
UNEMPLOY load highly on this component. See Belsley, Kuh, and Welsch (1980) for
more information about these diagnostics.
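A compact sketch of these diagnostics (condition indices and variance proportions in
the Belsley-Kuh-Welsch sense) for any design matrix, in illustrative Python rather than
SYSTAT code:

import numpy as np

def collinearity_diagnostics(X1):
    # X1 includes the constant column; columns are scaled to unit length first.
    Z = X1 / np.sqrt((X1 ** 2).sum(axis=0))
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    cond_idx = s[0] / s                                 # condition indices
    phi = (Vt.T ** 2) / s ** 2
    varprop = phi / phi.sum(axis=1, keepdims=True)      # rows sum to 1
    return cond_idx, varprop   # varprop[j, k] = share of var(b_j) tied to component k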
Adjusted squared multiple R is 0.992. The formula for this statistic is:

     adj. R² = R² - [(p - 1) / (n - p)] * (1 - R²)

where n is the number of cases and p is the number of predictors, including the
constant.
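Plugging in the Longley fit as a check: with R² = 0.995, n = 16, and p = 7, the
adjustment is 0.995 - (6 / 9) * 0.005 ≈ 0.992, matching the reported value.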
Notice the extremely small tolerances in the output. Tolerance is 1 minus the squared
multiple correlation between a predictor and the remaining predictors in the model.
These tolerances signal that the predictor variables are highly intercorrelated, a
worrisome situation. This multicollinearity can inflate the standard errors of the
coefficients, thereby attenuating the associated F statistics, and can threaten
computational accuracy.
Finally, SYSTAT produces the Correlation matrix of regression coefficients. In the
Longley data, these estimates are highly correlated, further indicating that there are too
many correlated predictors in the equation to provide stable estimates.
Scatterplot Matrix
Examining a scatterplot matrix of the variables in the model is often a beneficial first
step in any multiple regression analysis. Nonlinear relationships and correlated
predictors, both of which cause problems for multiple linear regression, can be
uncovered before fitting the model. The input is:
USE longley
SPLOM DEFLATOR GNP UNEMPLOY ARMFORCE POPULATN TIME TOTAL / HALF DENSITY=HIST
[Scatterplot matrix of DEFLATOR, GNP, UNEMPLOY, ARMFORCE, POPULATN, TIME, and TOTAL.]
Notice the severely nonlinear distributions of ARMFORCE with the other variables, as
well as the near perfect correlations among several of the predictors. There is also a
sharp discontinuity between post-war and 1950s behavior on ARMFORCE.
Example 5
Automatic Stepwise Regression
Following is an example of forward automatic stepping using the LONGLEY data. The
input is:
REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
START / FORWARD
STEP / AUTO
STOP
Step # 0  R = 0.000  R-Square = 0.000

Effect         Coefficient   Std Error   Std Coef      Tol.    df         F        P
In
___
 1 Constant

Out            Part. Corr.
___
 2 DEFLATOR         0.971           .          .    1.00000     1   230.089    0.000
 3 GNP              0.984           .          .    1.00000     1   415.103    0.000
 4 UNEMPLOY         0.502           .          .    1.00000     1     4.729    0.047
 5 ARMFORCE         0.457           .          .    1.00000     1     3.702    0.075
 6 POPULATN         0.960           .          .    1.00000     1   166.296    0.000
 7 TIME             0.971           .          .    1.00000     1   233.704    0.000
-------------------------------------------------------------------------------
Step # 1  R-Square = 0.967
Term entered: GNP

Effect         Coefficient   Std Error   Std Coef      Tol.    df         F        P
In
___
 1 Constant
 3 GNP              0.035       0.002      0.984    1.00000     1   415.103    0.000

Out            Part. Corr.
___
 2 DEFLATOR        -0.187           .          .    0.01675     1     0.473    0.504
 4 UNEMPLOY        -0.638           .          .    0.63487     1     8.925    0.010
 5 ARMFORCE         0.113           .          .    0.80069     1     0.167    0.689
 6 POPULATN        -0.598           .          .    0.01774     1     7.254    0.018
 7 TIME            -0.432           .          .    0.00943     1     2.979    0.108
-------------------------------------------------------------------------------
Step # 2  R = 0.990  R-Square = 0.981
Term entered: UNEMPLOY

Effect         Coefficient   Std Error   Std Coef      Tol.    df         F        P
In
___
 1 Constant
 3 GNP              0.038       0.002      1.071    0.63487     1   489.314    0.000
 4 UNEMPLOY        -0.544       0.182     -0.145    0.63487     1     8.925    0.010

Out            Part. Corr.
___
 2 DEFLATOR        -0.073           .          .    0.01603     1     0.064    0.805
 5 ARMFORCE        -0.479           .          .    0.48571     1     3.580    0.083
 6 POPULATN        -0.164           .          .    0.00563     1     0.334    0.574
 7 TIME             0.308           .          .    0.00239     1     1.259    0.284
-------------------------------------------------------------------------------
Step # 3  R-Square = 0.985
Term entered: ARMFORCE

Effect         Coefficient   Std Error   Std Coef      Tol.    df         F        P
In
___
 1 Constant
 3 GNP              0.041       0.002      1.154    0.31838     1   341.684    0.000
 4 UNEMPLOY        -0.797       0.213     -0.212    0.38512     1    13.942    0.003
 5 ARMFORCE        -0.483       0.255     -0.096    0.48571     1     3.580    0.083

Out            Part. Corr.
___
 2 DEFLATOR         0.163           .          .    0.01318     1     0.299    0.596
 6 POPULATN        -0.376           .          .    0.00509     1     1.813    0.205
 7 TIME             0.830           .          .    0.00157     1    24.314    0.000
-------------------------------------------------------------------------------
Step # 4  R-Square = 0.995
Term entered: TIME

Effect         Coefficient   Std Error   Std Coef      Tol.    df         F        P
In
___
 1 Constant
 3 GNP             -0.040       0.016     -1.137    0.00194     1     5.953    0.033
 4 UNEMPLOY        -2.088       0.290     -0.556    0.07088     1    51.870    0.000
 5 ARMFORCE        -1.015       0.184     -0.201    0.31831     1    30.496    0.000
 7 TIME          1887.410     382.766      2.559    0.00157     1    24.314    0.000

Out            Part. Corr.
___
 2 DEFLATOR         0.143           .          .    0.01305     1     0.208    0.658
 6 POPULATN        -0.150           .          .    0.00443     1     0.230    0.642
-------------------------------------------------------------------------------
Dep Var: TOTAL   N: 16   Multiple R: 0.998

Effect        Coefficient      Std Error    Tol.        t      P(2 Tail)
CONSTANT     -3598729.374     740632.644      .      -4.859      0.001
GNP                -0.040          0.016    0.002    -2.440      0.033
UNEMPLOY           -2.088          0.290    0.071    -7.202      0.000
ARMFORCE           -1.015          0.184    0.318    -5.522      0.000
TIME             1887.410        382.766    0.002     4.931      0.000

Analysis of Variance
Source        Sum-of-Squares   df    Mean-Square    F-ratio      P
Regression       1.84150E+08    4    4.60375E+07    589.757   0.000
Residual          858680.406   11      78061.855
-------------------------------------------------------------------------------
n At step 0, GNP has the largest F statistic, so SYSTAT enters it at step 1. Note at this
step that the partial correlation, Part. Corr., is the simple correlation of each
predictor with TOTAL.
n With GNP in the equation, UNEMPLOY is now the best candidate.
n The F for ARMFORCE is 3.58 when GNP and UNEMPLOY are included in the
model.
n SYSTAT finishes by entering TIME.
In four steps, SYSTAT entered four predictors. None was removed, resulting in a final
equation with a constant and four predictors. For this final model, SYSTAT uses all
cases with complete data for GNP, UNEMPLOY, ARMFORCE, and TIME. Thus, when
some values in the sample are missing, the sample size may be larger here than for the
last step in the stepwise process (there, cases are omitted if any value is missing among
the six candidate variables). If you don't want to stop here, you could move more
variables in (or out) using interactive stepping.
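The F-to-enter logic behind forward stepping can be sketched compactly. The function
below picks the candidate whose entry gives the largest F; it is an illustration of the
idea in Python with generic arrays, not a description of REGRESS internals (SYSTAT also
applies enter/remove thresholds and tolerance checks):

import numpy as np

def rss(X1, y):
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    e = y - X1 @ beta
    return e @ e

def best_forward_candidate(X, y, in_idx):
    # in_idx: indices of predictors already in the model (constant always included)
    n = len(y)
    base = np.column_stack([np.ones(n)] + [X[:, j] for j in in_idx])
    rss0 = rss(base, y)
    best = None
    for j in range(X.shape[1]):
        if j in in_idx:
            continue
        trial = np.column_stack([base, X[:, j]])
        rss1 = rss(trial, y)
        f = (rss0 - rss1) / (rss1 / (n - trial.shape[1]))   # F-to-enter with 1 df
        if best is None or f > best[1]:
            best = (j, f)
    return best    # (column index, F value) or None if nothing is left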
Example 6
Interactive Stepwise Regression
Interactive stepping helps you to explore model building in more detail. With data that
are as highly intercorrelated as the LONGLEY data, interactive stepping reveals the
dangers of thinking that the automated result is the only acceptable subset model. In
this example, we use interactive stepping to explore the LONGLEY data further. That
is, after specifying a model that includes all of the candidate variables available, we
request backward stepping by selecting Stepwise, Backward, and Interactive in the
Regression Options dialog box. After reviewing the results at each step, we use Step
to move a variable in (or out) of the model. When finished, we select Stop for the final
model. To begin interactive stepping, the input is:
REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
START / BACK
Step # 0  R = 0.998  R-Square = 0.995

Effect         Coefficient   Std Error   Std Coef      Tol.    df         F        P
In
___
 1 Constant
 2 DEFLATOR        15.062      84.915      0.046    0.00738     1     0.031    0.863
 3 GNP             -0.036       0.033     -1.014    0.00056     1     1.144    0.313
 4 UNEMPLOY        -2.020       0.488     -0.538    0.02975     1    17.110    0.003
 5 ARMFORCE        -1.033       0.214     -0.205    0.27863     1    23.252    0.001
 6 POPULATN        -0.051       0.226     -0.101    0.00251     1     0.051    0.826
 7 TIME          1829.151     455.478      2.480    0.00132     1    16.127    0.003

Out            Part. Corr.
___
none
-------------------------------------------------------------------------------
We begin with all variables in the model. We remove DEFLATOR because it has an
unusually low tolerance and F value.
Type:
STEP deflator
Step # 1  R-Square = 0.995
Term removed: DEFLATOR

Effect         Coefficient   Std Error   Std Coef      Tol.    df         F        P
In
___
 1 Constant
 3 GNP             -0.032       0.024     -0.905    0.00097     1     1.744    0.216
 4 UNEMPLOY        -1.972       0.386     -0.525    0.04299     1    26.090    0.000
 5 ARMFORCE        -1.020       0.191     -0.202    0.31723     1    28.564    0.000
 6 POPULATN        -0.078       0.162     -0.154    0.00443     1     0.230    0.642
 7 TIME          1814.101     425.283      2.459    0.00136     1    18.196    0.002

Out            Part. Corr.
___
 2 DEFLATOR         0.059           .          .    0.00738     1     0.031    0.863
-------------------------------------------------------------------------------
POPULATN now has the smallest F statistic and a very low tolerance, so we remove it as
well.
Type:
STEP populatn

Step # 2  R-Square = 0.995
Term removed: POPULATN

Effect         Coefficient   Std Error   Std Coef      Tol.    df         F        P
In
___
 1 Constant
 3 GNP             -0.040       0.016     -1.137    0.00194     1     5.953    0.033
 4 UNEMPLOY        -2.088       0.290     -0.556    0.07088     1    51.870    0.000
 5 ARMFORCE        -1.015       0.184     -0.201    0.31831     1    30.496    0.000
 7 TIME          1887.410     382.766      2.559    0.00157     1    24.314    0.000

Out            Part. Corr.
___
 2 DEFLATOR         0.143           .          .    0.01305     1     0.208    0.658
 6 POPULATN        -0.150           .          .    0.00443     1     0.230    0.642
-------------------------------------------------------------------------------
GNP and TIME both have low tolerance values. They could be highly correlated with
one another, so we will take each out and examine the behavior of the other when we do.
Type:
STEP time
STEP time
STEP gnp
Step # 3  R-Square = 0.985
Term removed: TIME

Effect         Coefficient   Std Error   Std Coef      Tol.    df         F        P
In
___
 1 Constant
 3 GNP              0.041       0.002      1.154    0.31838     1   341.684    0.000
 4 UNEMPLOY        -0.797       0.213     -0.212    0.38512     1    13.942    0.003
 5 ARMFORCE        -0.483       0.255     -0.096    0.48571     1     3.580    0.083

Out            Part. Corr.
___
 2 DEFLATOR         0.163           .          .    0.01318     1     0.299    0.596
 6 POPULATN        -0.376           .          .    0.00509     1     1.813    0.205
 7 TIME             0.830           .          .    0.00157     1    24.314    0.000
-------------------------------------------------------------------------------
Step # 4  R-Square = 0.995
Term entered: TIME

Effect         Coefficient   Std Error   Std Coef      Tol.    df         F        P
In
___
 1 Constant
 3 GNP             -0.040       0.016     -1.137    0.00194     1     5.953    0.033
 4 UNEMPLOY        -2.088       0.290     -0.556    0.07088     1    51.870    0.000
 5 ARMFORCE        -1.015       0.184     -0.201    0.31831     1    30.496    0.000
 7 TIME          1887.410     382.766      2.559    0.00157     1    24.314    0.000

Out            Part. Corr.
___
 2 DEFLATOR         0.143           .          .    0.01305     1     0.208    0.658
 6 POPULATN        -0.150           .          .    0.00443     1     0.230    0.642
-------------------------------------------------------------------------------
Step # 5  R = 0.996  R-Square = 0.993
Term removed: GNP

Effect         Coefficient   Std Error   Std Coef      Tol.    df         F        P
In
___
 1 Constant
 4 UNEMPLOY        -1.470       0.167     -0.391    0.30139     1    77.320    0.000
 5 ARMFORCE        -0.772       0.184     -0.153    0.44978     1    17.671    0.001
 7 TIME           956.380      35.525      1.297    0.25701     1   724.765    0.000

Out            Part. Corr.
___
 2 DEFLATOR        -0.031           .          .    0.01385     1     0.011    0.920
 3 GNP             -0.593           .          .    0.00194     1     5.953    0.033
 6 POPULATN        -0.505           .          .    0.00889     1     3.768    0.078
-------------------------------------------------------------------------------
We are comfortable with the tolerance values in both models with three variables. With
TIME in the model, the smallest F is 17.671, and with GNP in the model, the smallest
F is 3.580. Furthermore, with TIME, the squared multiple correlation is 0.993, and with
GNP, it is 0.985. Lets stop the stepping and view more information about the last
model.
Type:
STOP
Dep Var: TOTAL   N: 16   Multiple R: 0.996

Effect        Coefficient      Std Error    Tol.        t      P(2 Tail)
CONSTANT     -1797221.112      68641.553      .     -26.183      0.000
UNEMPLOY           -1.470          0.167    0.301    -8.793      0.000
ARMFORCE           -0.772          0.184    0.450    -4.204      0.001
TIME              956.380         35.525    0.257    26.921      0.000

(The full output also lists the lower and upper 95% confidence bounds for each
coefficient.)

Analysis of Variance
Source        Sum-of-Squares   df    Mean-Square    F-ratio      P
Regression       1.83685E+08    3    6.12285E+07    555.209   0.000
Residual         1323360.743   12     110280.062
-------------------------------------------------------------------------------
Our final model includes only UNEMPLOY, ARMFORCE, and TIME. Notice that its
multiple correlation (0.996) is not significantly smaller than that for the automated
stepping (0.998). Following are the commands we used:
REGRESS
USE longley
MODEL total=constant + deflator + gnp + unemploy +,
armforce + populatn + time
START / BACK
STEP deflator
STEP populatn
STEP time
STEP time
STEP gnp
STOP
Example 7
Testing whether a Single Coefficient Equals Zero
Most regression programs print tests of significance for each coefficient in an equation.
SYSTAT has a powerful additional feature: post hoc tests of regression coefficients.
To demonstrate these tests, we use the LONGLEY data and examine whether the
DEFLATOR coefficient differs significantly from 0. The input is:
REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
ESTIMATE / TOL=.00001
HYPOTHESIS
EFFECT = deflator
TEST
Dep Var: TOTAL   N: 16   Multiple R: 0.998

Effect        Coefficient      Std Error    Tol.        t      P(2 Tail)
CONSTANT     -3482258.635     890420.384      .      -3.911      0.004
DEFLATOR           15.062         84.915    0.007     0.177      0.863
GNP                -0.036          0.033    0.001    -1.070      0.313
UNEMPLOY           -2.020          0.488    0.030    -4.136      0.003
ARMFORCE           -1.033          0.214    0.279    -4.822      0.001
POPULATN           -0.051          0.226    0.003    -0.226      0.826
TIME             1829.151        455.478    0.001     4.016      0.003

Analysis of Variance
Source        Sum-of-Squares   df    Mean-Square    F-ratio      P
Regression       1.84172E+08    6    3.06954E+07    330.285   0.000
Residual          836424.056    9      92936.006
-------------------------------------------------------------------------------
Test for effect called:     DEFLATOR

Test of Hypothesis
Source             SS         df          MS          F        P
Hypothesis       2923.976      1      2923.976      0.031    0.863
Error          836424.056      9     92936.006
-------------------------------------------------------------------------------
Notice that the error sum of squares (836424.056) is the same as the output residual
sum of squares at the bottom of the ANOVA table. The probability level (0.863) is the
same also. This probability level (> 0.05) indicates that the regression coefficient for
DEFLATOR does not differ significantly from 0.
You can test all of the coefficients in the equation this way, individually; or, to
generate a separate hypothesis test for each predictor, choose All or type:
HYPOTHESIS
ALL
TEST
Example 8
Testing whether Multiple Coefficients Equal Zero
You may wonder why you need to bother with testing when the regression output gives
you hypothesis test results. Try the following hypothesis test:
REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
ESTIMATE / TOL=.00001
HYPOTHESIS
EFFECT = deflator & gnp
TEST
Test for effect called:     DEFLATOR and GNP

A Matrix
             1        2        3        4        5
 1         0.0    1.000      0.0      0.0      0.0
 2         0.0      0.0    1.000      0.0      0.0

             6        7
 1         0.0      0.0
 2         0.0      0.0

Test of Hypothesis
Source             SS         df          MS          F        P
Hypothesis     149295.592      2     74647.796      0.803    0.478
Error          836424.056      9     92936.006
-------------------------------------------------------------------------------
Here, the error sum of squares is the same as that for the model, but the hypothesis sum
of squares is different. We just tested the hypothesis that the DEFLATOR and GNP
coefficients simultaneously are 0.
The A matrix printed above the test specifies the hypothesis that we tested. It has
two degrees of freedom (see the F statistic) because the A matrix has two rowsone
for each coefficient. If you know some matrix algebra, you can see that the matrix
product AB using this A matrix and B as a column matrix of regression coefficients
picks up only two coefficients: DEFLATOR and GNP. Notice that our hypothesis had
the following matrix equation: AB = 0, where 0 is a null matrix.
If you don't know matrix algebra, don't worry; the ampersand method is equivalent.
You can ignore the A matrix in the output.
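For those who do want the matrix algebra, the general linear hypothesis A·B = D is
tested with the familiar F statistic; here is a hedged sketch of the standard textbook
formulas in Python (not SYSTAT's internal routine):

import numpy as np

def linear_hypothesis_test(X1, y, A, d=None):
    # F test of A b = d for the least-squares fit of y on X1 (X1 includes the constant).
    n, p = X1.shape
    q = A.shape[0]
    d = np.zeros(q) if d is None else d
    XtX_inv = np.linalg.inv(X1.T @ X1)
    b = XtX_inv @ X1.T @ y
    resid = y - X1 @ b
    mse = (resid @ resid) / (n - p)                    # error mean square
    diff = A @ b - d
    ss_hyp = diff @ np.linalg.solve(A @ XtX_inv @ A.T, diff)
    F = (ss_hyp / q) / mse
    return ss_hyp, F

With A picking out the DEFLATOR and GNP rows and d = 0, this reproduces a test of the
kind shown above; the same function with a nonzero d covers the D-matrix example in the
next section.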
You may not want to test this kind of hypothesis on the LONGLEY data, but there are
important applications in the analysis of variance where you might.
Example 9
Testing Nonzero Null Hypotheses
You can test nonzero null hypotheses with a D matrix, often in combination with
CONTRAST or AMATRIX. Here, we test whether the DEFLATOR coefficient
significantly differs from 30:
REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
ESTIMATE / TOL=.00001
HYPOTHESIS
AMATRIX [0 1 0 0 0 0 0]
DMATRIX [30]
TEST
A Matrix
        1        2        3        4        5
      0.0    1.000      0.0      0.0      0.0

        6        7
      0.0      0.0

Null hypothesis value for D
       30.000

Test of Hypothesis
Source             SS         df          MS          F        P
Hypothesis       2876.128      1      2876.128      0.031    0.864
Error          836424.056      9     92936.006
-------------------------------------------------------------------------------
The commands that test whether DEFLATOR differs from 30 can be performed more
efficiently using SPECIFY:
HYPOTHESIS
SPECIFY deflator=30
TEST
Example 10
Regression with Ecological or Grouped Data
If you have aggregated data, weight the regression by a count variable. This variable
should represent the counts of observations (n) contributing to the ith case. If n is not
an integer, SYSTAT truncates it to an integer before using it as a weight. The regression
results are identical to those produced if you had typed in each case.
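Frequency weighting is just shorthand for replicating rows; the equivalence can be
illustrated in a few lines (hypothetical arrays, illustrative Python only):

import numpy as np

count = np.array([3, 2, 4])          # counts of observations per aggregated case
x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 9.0, 7.0])

# Expand each case 'count' times (counts truncated to integers, as REGRESS does).
x_rep = np.repeat(x, count.astype(int))
y_rep = np.repeat(y, count.astype(int))
# A regression fit to (x_rep, y_rep) matches the FREQ-weighted fit to the grouped data.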
We use, for this example, an ecological or grouped data file, PLANTS. The input is:
REGRESS
USE plants
FREQ=count
MODEL co2 = CONSTANT + species
ESTIMATE
Dep Var: CO2   N: 76   Multiple R: 0.757

Effect       Coefficient   Std Error    Tol.        t      P(2 Tail)
CONSTANT          13.738       0.204      .      67.273      0.000
SPECIES           -0.466       0.047    1.000    -9.961      0.000

Effect       Coefficient   Lower < 95%>    Upper
CONSTANT          13.738         13.331   14.144
SPECIES           -0.466         -0.559   -0.372

Analysis of Variance
Source        Sum-of-Squares   df   Mean-Square    F-ratio      P
Regression            52.660    1        52.660     99.223   0.000
Residual              39.274   74         0.531
-------------------------------------------------------------------------------
Example 11
Regression without the Constant
To regress without the constant (intercept) term, or through the origin, remove the
constant from the list of independent variables. REGRESS adjusts accordingly. The
input is:
REGRESS
MODEL dependent = var1 + var2
ESTIMATE
Some users are puzzled when they see a model without a constant having a higher
multiple correlation than a model that includes a constant. How can a regression with
fewer parameters predict better than another? It doesn't. The total sum of squares
must be redefined for a regression model with zero intercept. It is no longer centered
about the mean of the dependent variable. Other definitions of sums of squares can lead
to strange results, such as negative multiple correlations. If your constant is actually
near 0, then including or excluding the constant makes little difference in the output.
Kvålseth (1985) discusses the issues involved in summary statistics for zero-intercept
regression models. The definition of R² used in SYSTAT is Kvålseth's formula 7. This
was chosen because it retains its PRE (percentage reduction of error) interpretation and
is guaranteed to be in the (0,1) interval.
How, then, do you test the significance of a constant in a regression model? Include
a constant in the model as usual and look at its test of significance.
If you have a zero-intercept model where it is appropriate to compute a coefficient
of determination and other summary statistics about the centered data, use General
Linear Model and select Mixture model. This option provides Kvålseth's formula 1 for
R² and uses centered total sum of squares for other summary statistics.
References
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression diagnostics: Identifying
influential data and sources of collinearity. New York: John Wiley & Sons, Inc.
Flack, V. F. and Chang, P. C. (1987). Frequency of selecting noise variables in subset
regression analysis: A simulation study. The American Statistician, 41, 84-86.
Freedman, D. A. (1983). A note on screening regression equations. The American
Statistician, 37, 152-155.
Hocking, R. R. (1983). Developments in linear regression methodology: 1959-82.
Technometrics, 25, 219-230.
Lovell, M. C. (1983). Data Mining. The Review of Economics and Statistics, 65, 1-12.
Rencher, A. C. and Pun, F. C. (1980). Inflation of R-squared in best subset regression.
Technometrics, 22, 49-54.
Velleman, P. F. and Welsch, R. E. (1981). Efficient computing of regression diagnostics.
The American Statistician, 35, 234-242.
Weisberg, S. (1985). Applied linear regression. New York: John Wiley & Sons, Inc.
Wilkinson, L. (1979). Tests of significance in stepwise regression. Psychological Bulletin,
86, 168-174.
Wilkinson, L. and Dallal, G. E. (1982). Tests of significance in forward selection
regression with an F-to-enter stopping rule. Technometrics, 24, 25-28.
Chapter 15
Linear Models II: Analysis of Variance

Leland Wilkinson and Mark Coward
Dependent. The variable(s) you want to examine. The dependent variable(s) should be
continuous and numeric (for example, INCOME). For MANOVA (multivariate
analysis of variance), select two or more dependent variables.
Factor. One or more categorical variables (grouping variables) that split your cases into
two or more groups.
Missing values. Includes a separate category for cases with a missing value for the
variable(s) identified with Factor.
Covariates. A covariate is a quantitative independent variable that adds unwanted
variability to the dependent variable. An analysis of covariance (ANCOVA) adjusts or
removes the variability in the dependent variable due to the covariate (for example,
variability in cholesterol level might be removed by using AGE as a covariate).
Post hoc Tests. Post hoc tests determine which pairs of means differ significantly. The
following alternatives are available:
n Bonferroni. Multiple comparison test based on Student's t statistic. Adjusts the
observed significance level for the fact that multiple comparisons are made.
n Tukey. Uses the Studentized range statistic to make all pairwise comparisons
between groups and sets the experimentwise error rate to the error rate for the
collection for all pairwise comparisons. When testing a large number of pairs of
means, Tukey is more powerful than Bonferroni. For a small number of pairs,
Bonferroni is more powerful.
n LSD. Least significant difference pairwise multiple comparison test. Equivalent to
multiple t tests between all pairs of groups. The disadvantage of this test is that no
attempt is made to adjust the observed significance level for multiple comparisons.
n Scheffé. The significance level of Scheffé's test is designed to allow all possible
linear combinations of group means to be tested, not just pairwise comparisons; it is
therefore more conservative than the other tests.
n Residuals. Saves predicted values, residuals, Studentized residuals, leverages,
Cook's D, and the standard error of predicted values. Only the predicted values and
residuals are appropriate for ANOVA.
n Residuals/Data. Saves the statistics given by Residuals plus all of the variables in
the working data file.
An ANOVA model must be estimated before any hypothesis tests can be performed.
Contrasts can be defined across the categories of a grouping factor or across the levels
of a repeated measure.
n Effects. Specify the factor (that is, grouping variable) to which the contrast applies.
Selecting All yields a separate test of the effect of each factor in the ANOVA model,
as well as tests of all interactions between those factors.
n Within. Use when specifying a contrast across the levels of a repeated measures
factor.
Specify
To specify hypothesis test coefficients, click Specify in the ANOVA Hypothesis Test
dialog box.
To specify coefficients for a hypothesis test, use cell identifiers. Common hypothesis
tests include contrasts across marginal means or tests of simple effects. For a two-way
factorial ANOVA design with DISEASE (four categories) and DRUG (three
categories), you could contrast the marginal mean for the first level of drug against the
third level by specifying:
DRUG[1] = DRUG[3]
Note that square brackets enclose the value of the category (for example, for
GENDER$, specify GENDER$[male]). For the simple contrast of the first and third
levels of DRUG for the second disease only:
DRUG[1] DISEASE[2] = DRUG[3] DISEASE[2]
Contrast
To specify contrasts, click Contrast in the ANOVA Hypothesis Test dialog box. The
following alternatives are available:
n Custom. For a factor with k categories (or levels), you can specify your own
coefficients, such as -3 -1 1 3, by typing these values in the Custom text box.
n Difference. Compare each level with its adjacent level.
n Polynomial. Generate orthogonal polynomial contrasts (to test linear, quadratic,
cubic, or higher-order trends). Optionally, specify a metric when the levels are
unequally spaced; for example, when repeated measures are collected at weeks 2, 4,
and 8, enter 2,4,8 as the metric.
n Sum. In a repeated measures ANOVA, total the values for each subject.
Repeated Measures
In a repeated measures design, the same variable is measured several times for each
subject (case). A paired-comparison t test is the simplest form of a repeated
measures design (for example, each subject has a before and an after measure).
SYSTAT derives values from your repeated measures and uses them in analysis of
variance computations to test changes across the repeated measures (within subjects)
as well as differences between groups of subjects (between subjects). Tests of the
within-subjects values are called polynomial tests of order 1, 2, ..., up to k, where k is
one less than the number of repeated measures. The first polynomial is used to test
linear changes: do the repeated responses increase (or decrease) around a line with a
significant slope? The second polynomial tests whether the responses fall along a
quadratic curve, and so on.
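The polynomial contrasts SYSTAT derives are the usual orthogonal polynomials over the
(possibly unequal) spacing of the repeated measures. A compact way to generate them for
any metric (illustrative Python, not SYSTAT's internal code):

import numpy as np

def orthogonal_polynomial_contrasts(metric):
    # Columns are the linear, quadratic, ... contrasts for the given spacing, e.g. [2, 4, 8].
    t = np.asarray(metric, dtype=float)
    k = len(t)
    V = np.vander(t, k, increasing=True)   # columns 1, t, t^2, ...
    Q, _ = np.linalg.qr(V)                 # orthonormalize degree by degree
    return Q[:, 1:]                        # drop the constant column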
To obtain a repeated measures analysis of variance, from the menus choose:
Statistics
Analysis of Variance (ANOVA)
Estimate Model
Optionally, you can assign a name for each set of repeated measures, specify the
number of levels, and specify the metric for unevenly spaced repeated measures.
n Name. Name that identifies each set of repeated measures.
n Levels. Number of repeated measures in the set. For example, if you have three
measurements for each subject, the number of levels is 3.
n Metric. Metric that indicates the spacing between unevenly spaced measurements.
For example, if measurements were taken at the third, fifth, and ninth weeks, the
metric would be 3, 5, 9.
Using Commands
ANOVA
USE filename
CATEGORY / MISS
DEPEND / REPEAT NAMES
BONF or TUKEY or LSD or SCHEFFE
SAVE filename / ADJUST, MODEL, RESID, DATA
ESTIMATE
Usage Considerations
Types of data. ANOVA requires a rectangular data file.
Print options. If PRINT=SHORT, output includes an ANOVA table. The MEDIUM length
adds least-squares means to the output. LONG adds estimates of the coefficients.
Quick Graphs. ANOVA plots the group means against the groups.
Saving files. ANOVA can save predicted values, residuals, Studentized residuals,
leverages, Cook's D, standard error of predicted values, adjusted cell means, and
estimates of the coefficients.
BY groups. ANOVA performs separate analyses for each level of any BY variables.
Examples
Example 1
One-Way ANOVA
How does equipment influence typing performance? This example uses a one-way
design to compare average typing speed for three groups of typists. Fourteen beginning
typists were randomly assigned to three types of machines and given speed tests.
Following are their typing speeds in words per minute:
Electric     Plain old     Word processor
   52           52               67
   47           43               73
   51           47               70
   49           44               75
   53                            64
The data are stored in the SYSTAT data file named TYPING. The average speeds for
the typists in the three groups are 50.4, 46.5, and 69.8 words per minute, respectively.
To test the hypothesis that the three samples have the same population average speed,
the input is:
USE typing
ANOVA
CATEGORY equipmnt$
DEPEND speed
ESTIMATE
Dep Var: SPEED   N: 14   Multiple R: 0.95

Analysis of Variance
Source        Sum-of-Squares   df   Mean-Square   F-ratio      P
EQUIPMNT$            1469.36    2        734.68     53.52   0.00
Error                 151.00   11         13.73
For the dependent variable SPEED, SYSTAT reads 14 cases. The multiple correlation
(Multiple R) for SPEED with the two design variables for EQUIPMNT$ is 0.952. The
square of this correlation (Squared multiple R) is 0.907. The grouping structure
explains 90.7% of the variability of SPEED.
The layout of the ANOVA table is standard in elementary texts; you will find
formulas and definitions there. F-ratio is the Mean-Square for EQUIPMNT$ divided
by the Mean-Square for Error. The distribution of the F ratio is sensitive to the
assumption of equal population group variances. The p value is the probability of
exceeding the F ratio when the group means are equal. The p value printed here is
0.000, so it is less than 0.0005. If the population means are equal, it would be very
unusual to find sample means that differ as much as theseyou could expect such a
large F ratio fewer than five times out of 10,000.
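As a quick check of the table's internal arithmetic: 1469.36 / 2 = 734.68 and
151.00 / 11 ≈ 13.73, so F = 734.68 / 13.73 ≈ 53.5, the value reported.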
The Quick Graph illustrates this finding. Although the typists using electric and
plain old typewriters have similar average speeds (50.4 and 46.5, respectively), the
word processor group has a much higher average speed.
In this example, we use the Bonferroni method for the typing speed data used in the
one-way ANOVA example. As an aid in interpretation, we order the equipment
categories from least to most advanced. The input is:
USE typing
ORDER equipmnt$ / SORT='plain old', 'electric', 'word process'
ANOVA
CATEGORY equipmnt$
DEPEND speed / BONF
ESTIMATE
SYSTAT assigns a number to each of the three groups and uses those numbers in the
output panels that follow:
COL/
ROW EQUIPMNT$
1 plain old
2 electric
3 word process
Using least squares means.
Post Hoc test of SPEED
------------------------------------------------------------------------
Using model MSE of 13.727 with 11 df.

Matrix of pairwise mean differences:
              1          2          3
 1          0.0
 2         3.90        0.0
 3        23.30      19.40        0.0

Bonferroni Adjustment.
Matrix of pairwise comparison probabilities:
              1          2          3
 1         1.00
 2         0.43       1.00
 3         0.00       0.00       1.00
In the first column, you can read differences in average typing speed for the group
using plain old typewriters. In the second row, you see that they average 3.9 words per
minute fewer than those using electric typewriters; but in the third row, you see that
they average 23.3 words per minute fewer than the group using word processors. To see whether
these differences are significant, look at the probabilities in the corresponding locations
at the bottom of the table.
The probability associated with 3.9 is 0.43, so you are unable to detect a difference
in performance between the electric and plain old groups. The probabilities in the third
row are both 0.00, indicating that the word processor group averages significantly
more words per minute than the electric and plain old groups.
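For reference, the Bonferroni adjustment reported here simply multiplies each raw
pairwise probability by the number of comparisons (three pairs in this design), capping
the result at 1.0; an adjusted value of 0.43 therefore corresponds to a raw two-tailed
probability of roughly 0.14.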
Example 2
ANOVA Assumptions and Contrasts
An important assumption in analysis of variance is that the population variances are
equalthat is, that the groups have approximately the same spread. When variances
differ markedly, a transformation may remedy the problem. For example, sometimes it
helps to take the square root of each value of the outcome variable (or log transform
each value) and use the transformed value in the analysis.
In this example, we use a subset of the cases from the SURVEY2 data file to address
the question, For males, does average income vary by education? We focus on those
who:
n Did not graduate from high school (HS dropout)
n Graduated from high school (HS grad)
n Attended some college (Some college)
n Graduated from college (College grad)
n Have an M.A. or Ph.D. (Degree +)
For each male subject (case) in the SURVEY2 data file, use the variables INCOME and
EDUC$. The means, standard deviations, and sample sizes for the five groups are
shown below:
           HS dropout    HS grad   Some college   College grad   Degree +
mean          $13,389    $21,231        $29,294        $30,937    $38,214
sd             10,639     13,176         16,465         16,894     18,230
n                  18         39             17             16         14
Visually, as you move across the groups, you see that average income increases. But
considering the variability within each group, you might wonder if the differences are
significant. Also, there is a relationship between the means and standard deviationsas
the means increase, so do the standard deviations. They should be independent. If you
take the square root of each income value, there is less variability among the standard
deviations, and the relation between the means and standard deviations is weaker:
                  mean      sd
HS dropout       3.371   1.465
HS grad          4.423   1.310
Some college     5.190   1.583
College grad     5.305   1.725
Degree +         6.007   1.516
A bar chart for the data will show the effect of the transformation. The input is:
USE survey2
SELECT sex$ = Male
LABEL educatn / 1,2='HS dropout', 3='HS grad',
      4='Some college', 5='College grad',
      6,7='Degree +'
CATEGORY educatn
BEGIN
BAR income * educatn / SERROR FILL=.5 LOC=-3IN,0IN
BAR income * educatn / SERROR FILL=.35 YPOW=.5,
LOC=3IN,0IN
END
[Bar charts of mean INCOME by EDUCATN with standard error bars: raw scale (left) and
square-root scale (right).]
In the chart on the left, you can see a relation between the height of the bars (means)
and the length of the error bars (standard errors). The smaller means have shorter error
bars than the larger means. After transformation, there is less difference in length
among the error bars. The transformation aids in eliminating the dependency between
the group and the standard deviation.
To test for differences among the means:
ANOVA
LET sqrt_inc = SQR(income)
DEPEND sqrt_inc
ESTIMATE
Dep Var: SQRT_INC   N: 104   Multiple R: 0.49

Analysis of Variance
Source        Sum-of-Squares   df   Mean-Square   F-ratio      P
EDUCATN                68.62    4         17.16      7.85   0.00
Error                 216.26   99          2.18
The ANOVA table using the transformed income as the dependent variable suggests a
significant difference among the five means (p < 0.0005).
Matrix of pairwise comparison probabilities:

              1          2          3          4          5
 1        1.000
 2        0.100      1.000
 3        0.004      0.387      1.000
 4        0.002      0.268      0.999      1.000
 5        0.000      0.007      0.545      0.694      1.000
The layout of the output panels for the Tukey method is the same as that for the
Bonferroni method. Look first at the probabilities at the bottom of the table. Four of the
probabilities indicate significant differences (they are less than 0.05). In the first
column, row 3, the average income for high school dropouts differs from those with
some college (p = 0.004), from college graduates (p = 0.002), and also from those with
advanced degrees (p < 0.0005). The fifth row shows that the differences between those
with advanced degrees and the high school graduates are significant (p = 0.008).
Contrasts
In this example, the five groups are ordered by their level of education, so you use these
coefficients to test linear and quadratic contrasts:

Linear:      -2   -1    0    1    2
Quadratic:    2   -1   -2   -1    2

Then you ask, "Is there a linear increase in average income across the five ordered
levels of education? A quadratic change?" The input follows:
HYPOTHESIS
NOTE 'Test of linear contrast',
     'across ordered group means'
EFFECT = educatn
CONTRAST [-2 -1 0 1 2]
TEST
HYPOTHESIS
NOTE 'Test of quadratic contrast',
     'across ordered group means'
EFFECT = educatn
CONTRAST [2 -1 -2 -1 2]
TEST
SELECT
Test of linear contrast across ordered group means

Test for effect called:     EDUCATN

A Matrix
        1        2        3        4        5
      0.0    -4.00    -3.00    -2.00    -1.00

Test of Hypothesis
Source             SS         df          MS          F        P
Hypothesis         63.54       1         63.54      29.09     0.00
Error             216.26      99          2.18
-------------------------------------------------------------------------------

Test of quadratic contrast across ordered group means

Test for effect called:     EDUCATN

A Matrix
        1        2        3        4        5
      0.0      0.0    -3.00    -4.00    -3.00

Test of Hypothesis
Source             SS         df          MS          F        P
Hypothesis          2.20       1          2.20       1.01     0.32
Error             216.26      99          2.18
-------------------------------------------------------------------------------
The F statistic for testing the linear contrast is 29.089 (p value < 0.0005); for testing
the quadratic contrast, it is 1.008 (p value = 0.32). Thus, you can report that there is a
highly significant linear increase in average income across the five levels of education
and that you have not found a quadratic component in this increase.
Example 3
Two-Way ANOVA
Consider the following two-way analysis of variance design from Afifi and Azen
(1972), cited in Kutner (1974), and reprinted in BMDP manuals. The dependent
variable, SYSINCR, is the change in systolic blood pressure after administering one of
four different drugs to patients with one of three different diseases. Patients were
assigned randomly to one of the possible drugs. The data are stored in the SYSTAT file
AFIFI.
To obtain a least-squares two-way analysis of variance:
USE afifi
ANOVA
CATEGORY drug disease
DEPEND sysincr
SAVE myresids / RESID DATA
ESTIMATE
Dep Var: SYSINCR   N: 58   Multiple R: 0.675

Analysis of Variance
Source          Sum-of-Squares   df   Mean-Square   F-ratio       P
DRUG                  2997.472    3       999.157     9.046   0.000
DISEASE                415.873    2       207.937     1.883   0.164
DRUG*DISEASE           707.266    6       117.878     1.067   0.396
Error                 5080.817   46       110.453
[Quick Graphs of least-squares means: SYSINCR plotted against DRUG$.]
In the DRUG plot, the mean changes for the first two drugs are relatively high, while
those for drugs 3 and 4 are much lower. ANOVA tests for significance the differences illustrated
in this plot.
For the DISEASE plot, we see a gradual decrease in blood pressure change across
the three diseases. However, this effect is not significant; there is not enough variation
among these means to overcome the variation due to individual differences.
In addition to the plot for each factor, SYSTAT also produces plots of the average
blood pressure change at each level of DRUG for each level of disease. Use these plots
to illustrate interaction effects. Although the interaction effect is not significant in this
example, we can still examine these plots.
In general, we see a decline in blood pressure change across drugs. (Keep in mind
that the drugs are only artificially ordered. We could reorder the drugs, and although
the ANOVA results wouldn't change, the plots would differ.) The similarity of the
plots illustrates the nonsignificant interaction.
A close correspondence exists between the factor plots and the interaction plots. The
means plotted in the factor plot for DISEASE correspond to the weighted average of
the four points in each of the interaction plots. Similarly, each mean plotted in the
DRUG factor plot corresponds to the weighted average of the three corresponding
points across interaction plots. Consequently, the significant DRUG effect can be seen
in the differing means in each interaction plot. Can you see the nonsignificant
DISEASE effect in the interaction plots?
Least-Squares ANOVA
If you have an orthogonal design (equal number of cases in every cell), you will find
that the ANOVA table is the same one you get with any standard program. SYSTAT
can handle non-orthogonal designs, however (as in the present example). To
understand the sources for sums of squares, you must know something about least-squares ANOVA.
As with one-way ANOVA, your specifying factor levels causes SYSTAT to create
dummy variables out of the classifying input variable. SYSTAT creates one fewer
dummy variables than categories specified.
Coding of the dummy variables is the classic analysis of variance parameterization,
in which the sum of effects estimated for a classifying variable is 0 (Scheffé, 1959). In
our example, DRUG has four categories; therefore, SYSTAT creates three dummy
variables with the following scores for subjects at each level:
              a1   a2   a3
DRUG 1         1    0    0
DRUG 2         0    1    0
DRUG 3         0    0    1
DRUG 4        -1   -1   -1
Because DISEASE has three categories, SYSTAT creates two dummy variables to be
coded as follows:
              b1   b2
DISEASE 1      1    0
DISEASE 2      0    1
DISEASE 3     -1   -1
Now, because there are no continuous predictors in the model (unlike the analysis of
covariance), you have a complete design matrix of dummy variables as follows (DRUG
is labeled with an a, DISEASE with a b, and the grand mean with an m):
Treatment       Mean        DRUG           DISEASE       Interaction
 A   B           m       a1   a2   a3     b1   b2    a1b1 a1b2 a2b1 a2b2 a3b1 a3b2
 1   1           1        1    0    0      1    0      1    0    0    0    0    0
 1   2           1        1    0    0      0    1      0    1    0    0    0    0
 1   3           1        1    0    0     -1   -1     -1   -1    0    0    0    0
 2   1           1        0    1    0      1    0      0    0    1    0    0    0
 2   2           1        0    1    0      0    1      0    0    0    1    0    0
 2   3           1        0    1    0     -1   -1      0    0   -1   -1    0    0
 3   1           1        0    0    1      1    0      0    0    0    0    1    0
 3   2           1        0    0    1      0    1      0    0    0    0    0    1
 3   3           1        0    0    1     -1   -1      0    0    0    0   -1   -1
 4   1           1       -1   -1   -1      1    0     -1    0   -1    0   -1    0
 4   2           1       -1   -1   -1      0    1      0   -1    0   -1    0   -1
 4   3           1       -1   -1   -1     -1   -1      1    1    1    1    1    1
This example is used to explain how SYSTAT gets an error term for the ANOVA table.
Because it is a least-squares program, the error term is taken from the residual sum of
squares in the regression onto the above dummy variables. For non-orthogonal designs,
this choice is identical to that produced by BMDP2V and SPSS GLM with Type III sums
of squares. These, in general, will be the hypotheses you want to test on unbalanced
experimental data. You can construct other types of sums of squares by using an A matrix
or by running your ANOVA model using the Stepwise options in GLM. Consult the
references if you do not already know what these sums of squares mean.
For example, to contrast the first and third drugs, the first and fourth drugs, and the
first two drugs against the last two, the input is:

HYPOTHESIS
EFFECT = drug
CONTRAST [1 0 -1 0]
TEST
HYPOTHESIS
EFFECT = drug
CONTRAST [1 0 0 -1]
TEST
HYPOTHESIS
EFFECT = drug
CONTRAST [1 1 -1 -1]
TEST
You need four numbers in each contrast because DRUG has four levels. You cannot use
CONTRAST to specify coefficients for interaction terms. It creates an A matrix only for
main effects. Following are the results of the above hypothesis tests:
Test for effect called:     DRUG

A Matrix
        1        2        3        4        5        6
      0.0    1.000      0.0   -1.000      0.0      0.0
        7        8        9       10       11       12
      0.0      0.0      0.0      0.0      0.0      0.0

Test of Hypothesis
Source             SS         df          MS          F        P
Hypothesis       1697.545      1      1697.545     15.369    0.000
Error            5080.817     46       110.453
-------------------------------------------------------------------------------
Test for effect called:     DRUG

A Matrix
        1        2        3        4        5        6
      0.0    2.000    1.000    1.000      0.0      0.0
        7        8        9       10       11       12
      0.0      0.0      0.0      0.0      0.0      0.0

Test of Hypothesis
Source             SS         df          MS          F        P
Hypothesis       1178.892      1      1178.892     10.673    0.002
Error            5080.817     46       110.453
-------------------------------------------------------------------------------
Test for effect called:     DRUG

A Matrix
        1        2        3        4        5        6
      0.0    2.000    2.000      0.0      0.0      0.0
        7        8        9       10       11       12
      0.0      0.0      0.0      0.0      0.0      0.0

Test of Hypothesis
Source             SS         df          MS          F        P
Hypothesis       2982.934      1      2982.934     27.006    0.000
Error            5080.817     46       110.453
-------------------------------------------------------------------------------
Notice the A matrices in the output. SYSTAT automatically takes into account the
degree of freedom lost in the design coding. Also, notice that you do not need to
normalize contrasts or rows of the A matrix to unit vector length, as in some ANOVA
programs. If you use (2 0 -2 0) or (0.707 0 -0.707 0) instead of (1 0 -1 0), you get the
same sum of squares.
For the comparison of the first and third drugs, the F statistic is 15.369 (p value
< 0.0005), indicating that these two drugs differ. Looking at the Quick Graphs
produced earlier, we see that the change in blood pressure was much smaller for the
third drug.
Notice that in the A matrix created by the contrast of the first and fourth drugs, you
get (2 1 1) in place of the three design variables corresponding to the appropriate
columns of the A matrix. Because you selected the reduced form for coding of design
variables, in which sums of effects are 0, you have the following restriction for the
DRUG effects:

α1 + α2 + α3 + α4 = 0

where each α is the effect for that level of DRUG. This means that

α4 = -(α1 + α2 + α3)

and the contrast DRUG(1) - DRUG(4) is equivalent to

α1 - [-(α1 + α2 + α3)] = 0

which is

2α1 + α2 + α3

For the final contrast, SYSTAT transforms the (1 1 -1 -1) specification into contrast
coefficients of (2 2 0) for the dummy coded variables. The p value (< 0.0005) indicates
that the first two drugs differ from the last two drugs.
Simple Effects
You can do simple contrasts between drugs within levels of disease (although the lack
of a significant DRUG * DISEASE interaction does not justify it). To show how it is
done, consider a contrast between the first and third levels of DRUG for the first
DISEASE only. You must specify the contrast in terms of the cell means. Use the
terminology:
MEAN (DRUG index, DISEASE index) = M{i,j}
You want to contrast cell means M{1,1} and M{3,1}. These are composed of:
M{1,1} = μ + α1 + β1 + αβ11
M{3,1} = μ + α3 + β1 + αβ31
Now, if you consider the coding of the variables, you can construct an A matrix that
picks up the appropriate columns of the design matrix. Here are the column labels of
the design matrix (a means DRUG and b means DISEASE) to serve as a column ruler
over the A matrix specified in the hypothesis.
 m    a1   a2   a3   b1   b2   a1b1 a1b2 a2b1 a2b2 a3b1 a3b2

HYPOTHESIS
AMATRIX [0 1 0 -1 0 0,
         1 0 0 0 -1 0]
TEST
A Matrix

         1        2        3        4        5        6
       0.0    1.000      0.0   -1.000      0.0      0.0

         7        8        9       10       11       12
     1.000      0.0      0.0      0.0   -1.000      0.0

Test of Hypothesis

Source              SS    df          MS        F       P
Hypothesis     338.000     1     338.000    3.060   0.087
Error         5080.817    46     110.453
After you understand how SYSTAT codes design variables and how the model
sentence orders them, you can take any standard ANOVA text like Winer (1971) or
Scheffé (1959) and construct an A matrix for any linear contrast.
Finally, here is the simple contrast of the first and third levels of DRUG for the first
DISEASE only:
HYPOTHESIS
SPECIFY drug[1] disease[1] = drug[3] disease[1]
TEST
Screening Results
Let's examine the AFIFI data in more detail. To examine the ANOVA assumptions, first plot
the residuals against the estimated values (cell means) to
check for homogeneity of variance. Use the Studentized residuals so that you can reference them
against a t distribution. In addition, stem-and-leaf plots of the residuals and boxplots of
the dependent variable aid in identifying outliers. The input is:
USE afifi
ANOVA
CATEGORY drug disease
DEPEND sysincr
SAVE myresids / RESID DATA
ESTIMATE
DENSITY sysincr * drug / BOX
USE myresids
PLOT student*estimate / SYM=1 FILL=1
STATISTICS
STEM student
STUDENT, N = 58

   -2     6
   -2
   -1     987666
   -1     410
   -0 H   9877765
   -0     4322220000
    0 M   001222333444
    0 H   55666888
    1     011133444
    1     55
The plots suggest the presence of an outlier. The smallest value in the stem-and-leaf
plot seems to be out of line. A t statistic value of -2.647 corresponds to p < 0.01, and
you would not expect a value this small to show up in a sample of only 58 independent
values. In the scatterplot, the point corresponding to this value appears at the bottom
and badly skews the data in its cell (which happens to be DRUG 1, DISEASE 3). The
outlier in the first group clearly stands out in the boxplot, too.
To see the effect of this outlier, delete the observation with the outlying Studentized
residual. Then, run the analysis again. Following is the ANOVA output for the revised
data:
Dep Var: SYSINCR   N: 57   Multiple R: 0.710   Squared Multiple R: 0.503

Analysis of Variance

Source            Sum-of-Squares    df    Mean-Square    F-Ratio       P
DRUG                    3344.064     3       1114.688     11.410   0.000
DISEASE                  232.826     2        116.413      1.192   0.313
DRUG*DISEASE             676.865     6        112.811      1.155   0.347
Error                   4396.367    45         97.697
The differences are not substantial. Nevertheless, notice that the DISEASE effect is
substantially attenuated when only one case out of 58 is deleted. Daniel (1960) gives
an example in which one outlying case alters the fundamental conclusions of a
designed experiment. The F test is robust to certain violations of assumptions, but
factorial ANOVA is not robust against outliers. You should routinely do these plots for
ANOVA.
Example 4
Single-Degree-of-Freedom Designs
The data in the REACT file involve yields of a chemical reaction under various
combinations of four binary factors (A, B, C, and D). Two reactions were observed
under each combination of experimental factors, so the number of cases per cell is two.
To analyze these data in a four-way ANOVA:
USE react
ANOVA
CATEGORY a, b, c, d
DEPEND yield
ESTIMATE
You can see the advantage of ANOVA over GLM when you have several factors; you
have to select only the main effects. With GLM, you have to specify the interactions
and identify which variables are categorical (that is, A, B, C, and D). The following
example is the full model using GLM:
MODEL yield = CONSTANT + a + b + c + d +,
a*b + a*c + a*d + b*c + b*d + c*d +,
a*b*c + a*b*d + a*c*d + b*c*d +,
a*b*c*d
N: 32   Multiple R: 0.755

Analysis of Variance

Source        Sum-of-Squares    df    Mean-Square    F-ratio       P
A                 369800.000     1     369800.000      4.651   0.047
B                   1458.000     1       1458.000      0.018   0.894
C                   5565.125     1       5565.125      0.070   0.795
D                 172578.125     1     172578.125      2.170   0.160
A*B                87153.125     1      87153.125      1.096   0.311
A*C               137288.000     1     137288.000      1.727   0.207
A*D               328860.500     1     328860.500      4.136   0.059
B*C                61952.000     1      61952.000      0.779   0.390
B*D                 3200.000     1       3200.000      0.040   0.844
C*D                 3160.125     1       3160.125      0.040   0.844
A*B*C              81810.125     1      81810.125      1.029   0.326
A*B*D               4753.125     1       4753.125      0.060   0.810
A*C*D             415872.000     1     415872.000      5.230   0.036
B*C*D                  4.500     1          4.500      0.000   0.994
A*B*C*D            15051.125     1      15051.125      0.189   0.669
Error            1272247.000    16      79515.437
The output shows a significant main effect for the first factor (A) plus one significant
interaction (A*C*D).
Assessing Normality
Let's look at the study more closely. Because this is a single-degree-of-freedom study
(a 2^n factorial), each effect estimate is normally distributed if the usual assumptions for
the experiment are valid. All of the effect estimates, except the constant, have zero
mean and common variance (because dummy variables were used in their
computation). Thus, you can compare them to a normal distribution. SYSTAT
remembers your last selections, so the input is:
SAVE effects / COEF
ESTIMATE
This reestimates the model and saves the regression coefficients (effects). The file has
one case with 16 variables (CONSTANT plus 15 effects). The effects are labeled X(1),
X(2), and so on because they are related to the dummy variables, not the original
variables A, B, C, and D. Let's transpose this file into a new file containing only the 15
effects and create a probability plot of the effects. The input is:
USE effects
DROP constant
TRANSPOSE
SELECT case > 1
PPLOT col(1) / FILL=1 SYMBOL=1,
XLABEL='Estimates of Effects'
These effects are indistinguishable from a random normal variable. They plot almost
on a straight line. What does it mean for the study and for the significant F tests?
It's time to reveal that the data were produced by a random number generator.
• If you are doing a factorial analysis of variance, the p values you see on the output
are not adjusted for the number of factors. If you do a three-way design, you look at
seven tests (excluding the constant); for a four-way design, you examine 15 tests. Out
of 15 F tests on random data, expect to find at least one test approaching
significance. Here you have two significant tests and one almost significant, which is not far
out of line. The probabilities for each separate F test need to be corrected for the
experimentwise error rate. Some authors devote entire chapters to fine distinctions
between multiple comparison procedures and then illustrate them within a
multifactorial design that is not corrected for the experimentwise error rate just
demonstrated. Remember that a factorial design is a multiple comparison. If you
have a single-degree-of-freedom study, use the procedure you used here to draw a
probability plot of the effects. Any effect that is really significant will become
obvious.
• If you have a factorial study with more degrees of freedom on some factors, use the
Bonferroni critical value for deciding which effects are significant. It guarantees
that the Type I error rate for the study will be no greater than the level you choose.
In the above example, this value is 0.05 / 15 (that is, about 0.003); see the sketch after this list.
• Multiple F tests based on a common denominator (mean-square error in this
example) are correlated, which complicates the problem further. In general, the
greater the discrepancy between numerator and denominator degrees of freedom
and the smaller the denominator degrees of freedom, the greater the dependence of
the tests. The Bonferroni tests are best in this situation, although Feingold and
Korsog (1986) offer some useful alternatives.
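As a rough illustration of the Bonferroni arithmetic mentioned above, here is a small Python sketch (not part of SYSTAT; it treats the 15 tests as independent, which the last point above notes is only an approximation):

from scipy.stats import binom

alpha, n_tests = 0.05, 15          # 15 effects in a four-way factorial

per_test = alpha / n_tests         # Bonferroni per-test level, about 0.003
p_any = 1 - binom.pmf(0, n_tests, alpha)   # chance of at least one unadjusted
                                           # "significant" test on pure noise
print(round(per_test, 4), round(p_any, 3)) # 0.0033 0.537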
Example 5
Mixed Models
Mixed models involve combinations of fixed and random factors in an ANOVA. Fixed
factors are assumed to be composed of an exhaustive set of categories (for example,
males and females), while random factors have category levels that are assumed to
have been randomly sampled from a larger population of categories (for example,
classrooms or word stems). Because of the mixing of fixed and random components,
expected mean squares for certain effects are different from those for fully fixed or
fully random designs. SYSTAT can handle mixed models because you can specify
error terms for specific hypotheses.
For example, let's analyze the AFIFI data with a mixed model instead of a fully
fixed factorial. Here, you are interested in the four drugs as wide-spectrum disease
killers. Because each drug is now thought to be effective against diseases in general,
you have sampled three random diseases to assess the drugs. This implies that
DISEASE is a random factor and DRUG remains a fixed factor. In this case, the error
term for DRUG is the DRUG * DISEASE interaction. To begin, run the same analysis
we performed in the two-way example to get the ANOVA table. To test for the DRUG
effect, specify drug * disease as the error term in a hypothesis test. The input is:
USE afifi
ANOVA
CATEGORY drug, disease
DEPEND sysincr
ESTIMATE
HYPOTHESIS
EFFECT = drug
ERROR = drug*disease
TEST
N: 58   Multiple R: 0.675

Analysis of Variance

Source            Sum-of-Squares    df    Mean-Square    F-ratio       P
DRUG                    2997.472     3        999.157      9.046   0.000
DISEASE                  415.873     2        207.937      1.883   0.164
DRUG*DISEASE             707.266     6        117.878      1.067   0.396
Error                   5080.817    46        110.453
Test for effect called:     DRUG

Test of Hypothesis

Source              SS    df         MS        F       P
Hypothesis    2997.472     3    999.157    8.476   0.014
Error          707.266     6    117.878
Notice that the SS, df, and MS for the error term in the hypothesis test correspond to the
values for the interaction in the ANOVA table.
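The F for DRUG in the mixed model is just the DRUG mean square divided by the interaction mean square. A quick check in Python (illustrative only, outside SYSTAT):

from scipy.stats import f

ms_drug, df_drug = 999.157, 3
ms_error, df_error = 117.878, 6        # DRUG*DISEASE is the error term here

F = ms_drug / ms_error                 # 8.476
print(round(F, 3), round(f.sf(F, df_drug, df_error), 3))   # 8.476 0.014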
Example 6
Separate Variance Hypothesis Tests
The data in the MJ20 data file are from Milliken and Johnson (1984). They are the
results of a paired-associate learning task. GROUP describes the type of drug
administered; LEARNING is the amount of material learned during testing. First we
perform Levene's test (Levene, 1960) to determine if the variances are equal across
cells. The input is:
USE mj20
ANOVA
SAVE mjresids / RESID DATA
DEPEND learning
CATEGORY group
ESTIMATE
USE mjresids
LET residual = ABS(residual)
CATEGORY group
DEPEND residual
ESTIMATE
N: 29   Multiple R: 0.675

Analysis of Variance

Source            Sum-of-Squares    df    Mean-Square    F-ratio       P
GROUP                     30.603     3         10.201      6.966   0.001
Error                     36.608    25          1.464
Notice that the F is significant, indicating that the separate variances test is advisable.
Let's do several single-degree-of-freedom tests, following Milliken and Johnson. The
first is for comparing all drugs against the control; the second tests the hypothesis that
groups 2 and 3 together are not significantly different from group 4. The input is:
USE mj20
ANOVA
CATEGORY group
DEPEND learning
ESTIMATE
HYPOTHESIS
SPECIFY 3*group[1] = group[2] +group[3] + group[4] / SEPARATE
TEST
HYPOTHESIS
SPECIFY 2*group[4] = group[2] +group[3] / SEPARATE
TEST
Following is the output. The ANOVA table has been omitted because it is not valid
when variances are unequal.
Using separate variances estimate for error term.

A Matrix

         1        2        3        4
       0.0   -4.000      0.0      0.0

Null hypothesis value for D    0.0

Test of Hypothesis

Source             SS       df         MS         F       P
Hypothesis    242.720    1.000    242.720    18.115   0.004
Error          95.085    7.096     13.399

Using separate variances estimate for error term.

A Matrix

         1        2        3        4
       0.0    2.000    3.000    3.000

Null hypothesis value for D    0.0

Test of Hypothesis

Source             SS        df        MS         F       P
Hypothesis     65.634     1.000    65.634    17.819   0.001
Error          61.852    16.792     3.683
Example 7
Analysis of Covariance
Winer (1971) uses the COVAR data file for an analysis of covariance in which X is the
covariate and TREAT is the treatment. Cases do not need to be ordered by the grouping
variable TREAT.
Before analyzing the data with an analysis of covariance model, be sure there is no
significant interaction between the covariate and the treatment. The assumption of no
interaction is often called the homogeneity of slopes assumption because it is
tantamount to saying that the slope of the regression line of the dependent variable onto
the covariate should be the same in all cells of the design.
Parallelism is easy to test with a preliminary model. Use GLM to estimate this
model with the interaction between treatment (TREAT) and covariate (X) in the model.
The input is:
USE covar
GLM
CATEGORY treat
MODEL y = CONSTANT + treat + x + treat*x
ESTIMATE
N: 21   Multiple R: 0.921

Analysis of Variance

Source            Sum-of-Squares    df    Mean-Square    F-ratio       P
TREAT                      6.693     2          3.346      5.210   0.019
X                         15.672     1         15.672     24.399   0.000
TREAT*X                    0.667     2          0.334      0.519   0.605
Error                      9.635    15          0.642
The probability value for the treatment by covariate interaction is 0.605, so the
assumption of homogeneity of slopes is plausible.
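If you want to reproduce this parallelism check outside SYSTAT, a sketch with statsmodels looks like the following (the file name and column names are assumptions; only the treatment-by-covariate interaction row matters for the test):

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

covar = pd.read_csv("covar.csv")                 # hypothetical file
fit = smf.ols("y ~ C(treat) * x", data=covar).fit()
print(anova_lm(fit, typ=3))                      # C(treat):x tests equal slopes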
Now, fit the usual analysis of covariance model by specifying:
USE covar
ANOVA
PRINT=MEDIUM
CATEGORY treat
DEPEND y
COVARIATE x
ESTIMATE
For incomplete factorials and similar designs, you still must specify a model (using
GLM) to do analysis of covariance.
The output follows:
Dep Var: Y   N: 21   Multiple R: 0.916

Analysis of Variance

Source            Sum-of-Squares    df    Mean-Square    F-ratio       P
TREAT                     16.932     2          8.466     13.970   0.000
X                         16.555     1         16.555     27.319   0.000
Error                     10.302    17          0.606

Adjusted least squares means

              SE     N
TREAT=1    0.307     7
TREAT=2    0.309     7
TREAT=3    0.294     7
The treatment adjusted for the covariate is significant. There is a significant difference
among the three treatment groups. Also, notice that the coefficient for the covariate is
significant (F = 27.319, p < 0.0005). If it were not, the analysis of covariance could be
taking away a degree of freedom without reducing mean-square error enough to help you.
SYSTAT computes the adjusted cell means the same way it computes estimates
when saving residuals. Model terms (main effects and interactions) that do not contain
categorical variables (that is, covariates) are incorporated into the equation by adding the
product of the coefficient and the mean of the term when computing estimates. The grand
mean (CONSTANT) is included in computing the estimates.
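In other words, for a single covariate each adjusted cell mean is the observed cell mean moved to the grand mean of the covariate along the common slope. A minimal sketch of that adjustment (plain Python; the names are illustrative):

import numpy as np

def adjusted_means(y, x, groups, slope):
    # Adjusted (least squares) mean for each group, assuming the common
    # slope estimated by the analysis of covariance.
    x_grand = np.mean(x)
    return {g: np.mean(y[groups == g]) - slope * (np.mean(x[groups == g]) - x_grand)
            for g in np.unique(groups)}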
Example 8
One-Way Repeated Measures
In this example, six rats were weighed at the end of each of five weeks. A plot of each
rat's weight over the duration of the experiment follows:
[Line plot: each rat's weight, WEIGHT(1) through WEIGHT(5), plotted against Trial]
ANOVA is the simplest way to analyze this one-way model. Because we have no
categorical variable(s), SYSTAT generates only the constant (grand mean) in the model. The output follows:
Dependent variable means

 WEIGHT(2)    WEIGHT(3)    WEIGHT(4)    WEIGHT(5)
     5.833        7.167        8.000        8.333

Within Subjects
---------------
Source          SS    df        MS         F       P     G-G     H-F
Time       134.467     4    33.617    16.033   0.000   0.004   0.002
Error       41.933    20     2.097

Greenhouse-Geisser Epsilon:   0.3420
Huynh-Feldt Epsilon:          0.4273
-------------------------------------------------------------------------------
Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)

Source          SS    df         MS         F       P
Time       114.817     1    114.817    38.572   0.002
Error       14.883     5      2.977

Polynomial Test of Order 2 (Quadratic)

Source          SS    df         MS         F       P
Time        18.107     1     18.107     7.061   0.045
Error       12.821     5      2.564

Polynomial Test of Order 3 (Cubic)

Source          SS    df         MS         F       P
Time         1.350     1      1.350     0.678   0.448
Error        9.950     5      1.990

Polynomial Test of Order 4

Source          SS    df         MS         F       P
Time         0.193     1      0.193     0.225   0.655
Error        4.279     5      0.856

Multivariate Test Statistics

Statistic                   Value   Hypoth. df   Error df         F       P
Wilks' Lambda               0.011            4          2    43.007   0.023
Pillai Trace                0.989            4          2    43.007   0.023
Hotelling-Lawley Trace     86.014            4          2    43.007   0.023
The Huynh-Feldt p value (0.002) does not differ from the p value for the F statistic to
any significant degree. Compound symmetry appears to be satisfied and weight
changes significantly over the five trials.
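The Greenhouse-Geisser and Huynh-Feldt probabilities are ordinary F tail areas evaluated with the degrees of freedom multiplied by the corresponding epsilon. A quick check in Python (illustrative, not SYSTAT's code):

from scipy.stats import f

F, df1, df2 = 16.033, 4, 20
for eps in (0.3420, 0.4273):            # G-G and H-F epsilons from the table
    print(round(f.sf(F, eps * df1, eps * df2), 3))
# roughly 0.004 and 0.002, matching the G-G and H-F columns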
The polynomial tests indicate that most of the trials effect can be accounted for by
a linear trend across time. In fact, the sum of squares for TIME is 134.467, and the sum
of squares for the linear trend is almost as large (114.817). Thus, the linear polynomial
accounts for roughly 85% of the change across the repeated measures.
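Orthogonal polynomial contrasts for a given spacing can be generated by orthogonalizing powers of the metric. The sketch below (numpy, purely illustrative of what the POLYNOMIAL and METRIC options do, not SYSTAT's own algorithm) shows why the default linear contrast for five equally spaced trials is proportional to (-2, -1, 0, 1, 2):

import numpy as np

def poly_contrasts(metric):
    # Orthogonal polynomial contrast coefficients for the given spacing.
    t = np.asarray(metric, dtype=float)
    V = np.vander(t, len(t), increasing=True)   # columns 1, t, t^2, ...
    Q, _ = np.linalg.qr(V)                      # orthonormalize the columns
    return Q[:, 1:].T                           # drop the constant column

print(np.round(poly_contrasts([1, 2, 3, 4, 5])[0], 3))   # linear row,
# proportional (up to sign) to (-2, -1, 0, 1, 2)
print(np.round(poly_contrasts([1, 2, 3, 4, 10])[0], 3))  # uneven metric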
Alternatively, you could request a hypothesis test, specifying the metric for the
polynomials:
HYPOTHESIS
WITHIN='Time'
CONTRAST / POLYNOMIAL
METRIC=1,2,3,4,10
TEST
The last point has been spread out further to the right. The output follows:
Univariate and Multivariate Repeated Measures Analysis

Within Subjects
---------------
Source          SS    df        MS         F       P     G-G     H-F
Time       134.467     4    33.617    16.033   0.000   0.004   0.002
Error       41.933    20     2.097

Greenhouse-Geisser Epsilon:   0.3420
Huynh-Feldt Epsilon:          0.4273

-------------------------------------------------------------------------------
Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)

Source          SS    df         MS          F       P
Time        67.213     1     67.213     23.959   0.004
Error       14.027     5      2.805

Polynomial Test of Order 2 (Quadratic)

Source          SS    df         MS          F       P
Time        62.283     1     62.283    107.867   0.000
Error        2.887     5      0.577

Multivariate Test Statistics

Statistic                   Value   Hypoth. df   Error df         F       P
Wilks' Lambda               0.011            4          2    43.007   0.023
Pillai Trace                0.989            4          2    43.007   0.023
Hotelling-Lawley Trace     86.014            4          2    43.007   0.023
The significance tests for the linear and quadratic trends differ from those for the
evenly spaced polynomials. Before, the linear trend was strongest; now, the quadratic
polynomial has the most significant results (F = 107.9, p < 0.0005).
You may have noticed that although the univariate F tests for the polynomials are
different, the multivariate test is unchanged. The latter measures variation across all
components. The ANOVA table for the combined components is not affected by the
metric of the polynomials.
Difference Contrasts
If you do not want to use polynomials, you can specify a C matrix that contrasts
adjacent weeks. After estimating the model, input the following:
HYPOTHESIS
WITHIN=Time
CONTRAST / DIFFERENCE
TEST
C Matrix

              1         2         3         4         5
 1        1.000    -1.000       0.0       0.0       0.0
 2          0.0     1.000    -1.000       0.0       0.0
 3          0.0       0.0     1.000    -1.000       0.0
 4          0.0       0.0       0.0     1.000    -1.000
Univariate F Tests

Effect           SS    df        MS         F       P
    1        66.667     1    66.667    17.241   0.009
    Error    19.333     5     3.867
    2        10.667     1    10.667    40.000   0.001
    Error     1.333     5     0.267
    3         4.167     1     4.167     0.566   0.486
    Error    36.833     5     7.367
    4         0.667     1     0.667     2.500   0.175
    Error     1.333     5     0.267

Multivariate Test Statistics

Wilks' Lambda =             0.011
F-Statistic =              43.007   df = 4, 2   Prob = 0.023

Pillai Trace =              0.989
F-Statistic =              43.007   df = 4, 2   Prob = 0.023

Hotelling-Lawley Trace =   86.014
F-Statistic =              43.007   df = 4, 2   Prob = 0.023
Notice the C matrix that this command generates. In this case, each of the univariate F
tests covers the significance of the difference between the adjacent weeks indexed by
the C matrix. For example, F = 17.241 shows that the first and second weeks differ
significantly. The third and fourth weeks do not differ (F = 0.566). Unlike polynomials,
these contrasts are not orthogonal.
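The non-orthogonality is easy to see: adjacent difference contrasts share a week, so their cross products are not zero. A two-line check in Python (illustrative only):

import numpy as np

C = np.array([[ 1, -1,  0,  0,  0],     # week 1 vs week 2
              [ 0,  1, -1,  0,  0],     # week 2 vs week 3
              [ 0,  0,  1, -1,  0],     # week 3 vs week 4
              [ 0,  0,  0,  1, -1]])    # week 4 vs week 5
print(C @ C.T)    # off-diagonal -1 entries show the rows are not orthogonal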
Summing Effects
To sum across weeks:
HYPOTHESIS
WITHIN=Time
CONTRAST / SUM
TEST
C Matrix

         1        2        3        4        5
     1.000    1.000    1.000    1.000    1.000
Test of Hypothesis

Source              SS    df          MS          F       P
Hypothesis    6080.167     1    6080.167    295.632   0.000
Error          102.833     5      20.567
In this example, you are testing whether the overall weight (across weeks) significantly
differs from 0. Naturally, the F value is significant. Notice the C matrix that is
generated. It is simply a set of 1s that, in the equation BC' = 0, sum all the coefficients
in B. In a group-by-trials design, this C matrix is useful for pooling trials and analyzing
group effects.
Custom Contrasts
To test any arbitrary contrast between the dependent variables, you can use a C
matrix, which has the same form as an A matrix (without a column for the CONSTANT).
The following commands test a linear trend across the five trials:
HYPOTHESIS
CMATRIX [-2 -1 0 1 2]
TEST
C Matrix

         1        2        3        4        5
    -2.000   -1.000      0.0    1.000    2.000
Test of Hypothesis

Source              SS    df          MS         F       P
Hypothesis    1148.167     1    1148.167    38.572   0.002
Error          148.833     5      29.767
Example 9
Repeated Measures ANOVA for One Grouping Factor and
One Within Factor with Ordered Levels
The following example uses estimates of population for 1983, 1986, and 1990 and
projections for 2020 for 57 countries from the OURWORLD data file. The data are log
transformed before analysis. Here you compare trends in population growth for
European and Islamic countries. The variable GROUP$ contains codes for these
groups plus a third code for New World countries (we exclude these countries from this
analysis). To create a bar chart of the data after using YLOG to log transform them:
USE ourworld
SELECT group$ <> 'NewWorld'
BAR pop_1983 .. pop_2020 / REPEAT OVERLAY YLOG,
    GROUP=group$ SERROR,
    FILL=.35, .8
[Bar chart: log-scaled population (POP_1983, POP_1986, POP_1990, POP_2020) with standard error bars, grouped by GROUP$ (Islamic, Europe)]
Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)

Source              SS    df        MS          F       P
TIME             0.675     1     0.675    370.761   0.000
TIME*GROUP$      0.583     1     0.583    320.488   0.000
Error            0.062    34     0.002

Polynomial Test of Order 2 (Quadratic)

Source              SS    df        MS          F       P
TIME             0.132     1     0.132     92.246   0.000
TIME*GROUP$      0.128     1     0.128     89.095   0.000
Error            0.049    34     0.001

Polynomial Test of Order 3 (Cubic)

Source              SS    df        MS          F       P
TIME             0.028     1     0.028     96.008   0.000
TIME*GROUP$      0.027     1     0.027     94.828   0.000
Error            0.010    34     0.000

Multivariate Test Statistics (TIME)

Statistic                   Value   Hypoth. df   Error df          F       P
Wilks' Lambda               0.063            3         32    157.665   0.000
Pillai Trace                0.937            3         32    157.665   0.000
Hotelling-Lawley Trace     14.781            3         32    157.665   0.000

Multivariate Test Statistics (TIME*GROUP$)

Statistic                   Value   Hypoth. df   Error df          F       P
Wilks' Lambda               0.076            3         32    130.336   0.000
Pillai Trace                0.924            3         32    130.336   0.000
Hotelling-Lawley Trace     12.219            3         32    130.336   0.000
The within-subjects results indicate highly significant linear, quadratic, and cubic
changes across time. The pattern of change across time for the two groups also differs
significantly (that is, the TIME * GROUP$ interactions are highly significant for all
three tests).
Notice that there is a larger gap in time between 1990 and 2020 than between the
other values. Let's incorporate real time in the analysis with the following
specification:
DEPEND pop_1983 pop_1986 pop_1990 pop_2020 / REPEAT=4(83,86,90,120),
NAME=TIME
ESTIMATE
Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)

Source              SS    df        MS          F       P
TIME             0.831     1     0.831    317.273   0.000
TIME*GROUP$      0.737     1     0.737    281.304   0.000
Error            0.089    34     0.003

Polynomial Test of Order 2 (Quadratic)

Source              SS    df        MS          F       P
TIME             0.003     1     0.003      4.402   0.043
TIME*GROUP$      0.001     1     0.001      1.562   0.220
Error            0.025    34     0.001

Polynomial Test of Order 3 (Cubic)

Source              SS    df        MS          F       P
TIME             0.000     1     0.000      1.653   0.207
TIME*GROUP$      0.000     1     0.000      1.733   0.197
Error            0.006    34     0.000
When the values for POP_2020 are positioned on a real time line, the tests for
quadratic and cubic polynomials are no longer significant. The test for the linear
TIME * GROUP$ interaction, however, remains highly significant, indicating that the
slope across time for the Islamic group is significantly steeper than that for the
European countries.
Example 10
Repeated Measures ANOVA for Two Grouping Factors and
One Within Factor
Repeated measures enables you to handle grouping factors automatically. The
following example is from Winer (1971). There are two grouping factors (ANXIETY
and TENSION) and one trials factor in the file REPEAT1. Following is a dot display of
the average responses across trials for each of the four combinations of ANXIETY and
TENSION.
[Dot displays: mean response across TRIAL(1) through TRIAL(4) for each combination of ANXIETY and TENSION (panels 1,1; 1,2; 2,1; 2,2)]
The model also includes an interaction between the grouping factors (ANXIETY *
TENSION). The output follows:
Univariate and Multivariate Repeated Measures Analysis

Between Subjects
----------------
Source                   SS    df        MS        F       P
ANXIETY              10.083     1    10.083    0.978   0.352
TENSION               8.333     1     8.333    0.808   0.395
ANXIETY*TENSION      80.083     1    80.083    7.766   0.024
Error                82.500     8    10.313
Within Subjects
---------------
Source                        SS    df         MS          F       P     G-G     H-F
Trial                    991.500     3    330.500    152.051   0.000   0.000   0.000
Trial*ANXIETY              8.417     3      2.806      1.291   0.300   0.300   0.301
Trial*TENSION             12.167     3      4.056      1.866   0.162   0.197   0.169
Trial*ANXIETY*TENSION     12.750     3      4.250      1.955   0.148   0.185   0.155
Error                     52.167    24      2.174

Greenhouse-Geisser Epsilon:   0.5361
Huynh-Feldt Epsilon:          0.9023
-------------------------------------------------------------------------------
Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)

Source                        SS    df         MS          F       P
Trial                    984.150     1    984.150    247.845   0.000
Trial*ANXIETY              1.667     1      1.667      0.420   0.535
Trial*TENSION             10.417     1     10.417      2.623   0.144
Trial*ANXIETY*TENSION      9.600     1      9.600      2.418   0.159
Error                     31.767     8      3.971

Polynomial Test of Order 2 (Quadratic)

Source                        SS    df         MS          F       P
Trial                      6.750     1      6.750      3.411   0.102
Trial*ANXIETY              3.000     1      3.000      1.516   0.253
Trial*TENSION              0.083     1      0.083      0.042   0.843
Trial*ANXIETY*TENSION      0.333     1      0.333      0.168   0.692
Error                     15.833     8      1.979

Polynomial Test of Order 3 (Cubic)

Source                        SS    df         MS          F       P
Trial                      0.600     1      0.600      1.051   0.335
Trial*ANXIETY              3.750     1      3.750      6.569   0.033
Trial*TENSION              1.667     1      1.667      2.920   0.126
Trial*ANXIETY*TENSION      2.817     1      2.817      4.934   0.057
Error                      4.567     8      0.571
-------------------------------------------------------------------------------
Multivariate Test Statistics

Trial
Statistic                   Value   Hypoth. df   Error df          F       P
Wilks' Lambda               0.015            3          6    127.686   0.000
Pillai Trace                0.985            3          6    127.686   0.000
Hotelling-Lawley Trace     63.843            3          6    127.686   0.000

Trial*ANXIETY
Statistic                   Value   Hypoth. df   Error df          F       P
Wilks' Lambda               0.244            3          6      6.183   0.029
Pillai Trace                0.756            3          6      6.183   0.029
Hotelling-Lawley Trace      3.091            3          6      6.183   0.029

Trial*TENSION
Statistic                   Value   Hypoth. df   Error df          F       P
Wilks' Lambda               0.361            3          6      3.546   0.088
Pillai Trace                0.639            3          6      3.546   0.088
Hotelling-Lawley Trace      1.773            3          6      3.546   0.088

Trial*ANXIETY*TENSION
Statistic                   Value   Hypoth. df   Error df          F       P
Wilks' Lambda               0.328            3          6      4.099   0.067
Pillai Trace                0.672            3          6      4.099   0.067
Hotelling-Lawley Trace      2.050            3          6      4.099   0.067
In the within-subjects table, you see that the trial effect is highly significant (F = 152.1, p <
0.0005). Below that table, we see that the linear trend across trials (Polynomial Order 1)
is highly significant (F = 247.8, p < 0.0005). The hypothesis sums of squares for the linear,
quadratic, and cubic polynomials sum to the total hypothesis sum of squares for trials (that
is, 984.15 + 6.75 + 0.60 = 991.5). Notice that the total sum of squares is 991.5, while that
for the linear trend is 984.15. This means that the linear trend accounts for more than 99%
of the variability across the four trials. The assumption of compound symmetry is not
required for the test of linear trend, so you can report that there is a highly significant
linear decrease across the four trials (F = 247.8, p < 0.0005).
Example 11
Repeated Measures ANOVA for Two Trial Factors
Repeated measures enables you to handle several trials factors, so we include an
example with two trial factors. It is an experiment from Winer (1971), which has one
grouping factor (NOISE) and two trials factors (PERIODS and DIALS). The trials
factors must be sorted into a set of dependent variables (one for each pairing of the two
factors groups). It is useful to label the levels with a convenient mnemonic. The file is
set up with variables P1D1 through P3D3. Variable P1D2 indicates a score in the
PERIODS = 1, DIALS = 2 cell. The data are in the file REPEAT2.
Notice that REPEAT specifies that the two trials factors have three levels each. ANOVA
assumes the subscript of the first factor will vary slowest in the ordering of the
dependent variables. If you have two repeated factors (DAY with four levels and AMPM
with two levels), you should select eight dependent variables and type Repeat=4,2. The
repeated measures are selected in the following order:
DAY1_AM
DAY1_PM
DAY2_AM
DAY2_PM
DAY3_AM
DAY3_PM
DAY4_AM
DAY4_PM
From this indexing, it generates the proper main effects and interactions. When more
than one trial factor is present, ANOVA lists each dependent variable and the
associated level on each factor. The output follows:
Dependent variable means

     P1D1      P1D2      P1D3      P2D1      P2D2      P2D3      P3D1      P3D2      P3D3
   48.000    52.000    63.000    37.167    42.167    54.167    27.000    32.500    42.500
Between Subjects
----------------
Source          SS    df         MS        F       P
NOISE      468.167     1    468.167    0.752   0.435
Error     2491.111     4    622.778
Within Subjects
---------------
Source                    SS    df          MS         F       P     G-G     H-F
period              3722.333     2    1861.167    63.389   0.000   0.000   0.000
period*NOISE         333.000     2     166.500     5.671   0.029   0.057   0.029
Error                234.889     8      29.361

Greenhouse-Geisser Epsilon:   0.6476
Huynh-Feldt Epsilon:          1.0000

dial                2370.333     2    1185.167    89.823   0.000   0.000   0.000
dial*NOISE            50.333     2      25.167     1.907   0.210   0.215   0.210
Error                105.556     8      13.194

Greenhouse-Geisser Epsilon:   0.9171
Huynh-Feldt Epsilon:          1.0000

period*dial           10.667     4       2.667     0.336   0.850   0.729   0.850
period*dial*NOISE     11.333     4       2.833     0.357   0.836   0.716   0.836
Error                127.111    16       7.944

Greenhouse-Geisser Epsilon:   0.5134
Huynh-Feldt Epsilon:          1.0000
-------------------------------------------------------------------------------
Single Degree of Freedom Polynomial Contrasts
---------------------------------------------

Polynomial Test of Order 1 (Linear)

Source                SS    df          MS          F       P
period          3721.000     1    3721.000     73.441   0.001
period*NOISE     225.000     1     225.000      4.441   0.103
Error            202.667     4      50.667

Source                SS    df          MS          F       P
dial            2256.250     1    2256.250    241.741   0.000
dial*NOISE         6.250     1       6.250      0.670   0.459
Error             37.333     4       9.333

Source                     SS    df        MS         F       P
period*dial             0.375     1     0.375     0.045   0.842
period*dial*NOISE       1.042     1     1.042     0.125   0.742
Error                  33.333     4     8.333

Polynomial Test of Order 2 (Quadratic)

Source                SS    df          MS          F       P
period             1.333     1       1.333      0.166   0.705
period*NOISE     108.000     1     108.000     13.407   0.022
Error             32.222     4       8.056

Source                SS    df          MS          F       P
dial             114.083     1     114.083      6.689   0.061
dial*NOISE        44.083     1      44.083      2.585   0.183
Error             68.222     4      17.056

Source                     SS    df        MS         F       P
period*dial             3.125     1     3.125     0.815   0.418
period*dial*NOISE       0.125     1     0.125     0.033   0.865
Error                  15.333     4     3.833

Source                     SS    df        MS         F       P
period*dial             6.125     1     6.125     0.750   0.435
period*dial*NOISE       3.125     1     3.125     0.383   0.570
Error                  32.667     4     8.167

Source                     SS    df        MS         F       P
period*dial             1.042     1     1.042     0.091   0.778
period*dial*NOISE       7.042     1     7.042     0.615   0.477
Error                  45.778     4    11.444
-------------------------------------------------------------------------------
Multivariate Test Statistics

period
Statistic                   Value   Hypoth. df   Error df          F       P
Wilks' Lambda               0.051            2          3     28.145   0.011
Pillai Trace                0.949            2          3     28.145   0.011
Hotelling-Lawley Trace     18.764            2          3     28.145   0.011

period*NOISE
Statistic                   Value   Hypoth. df   Error df          F       P
Wilks' Lambda               0.156            2          3      8.111   0.062
Pillai Trace                0.844            2          3      8.111   0.062
Hotelling-Lawley Trace      5.407            2          3      8.111   0.062

dial
Statistic                   Value   Hypoth. df   Error df          F       P
Wilks' Lambda               0.016            2          3     91.456   0.002
Pillai Trace                0.984            2          3     91.456   0.002
Hotelling-Lawley Trace     60.971            2          3     91.456   0.002

dial*NOISE
Statistic                   Value   Hypoth. df   Error df          F       P
Wilks' Lambda               0.565            2          3      1.155   0.425
Pillai Trace                0.435            2          3      1.155   0.425
Hotelling-Lawley Trace      0.770            2          3      1.155   0.425

period*dial
F-Statistic = 331.445   Hypoth. df = 4   Error df = 1   Prob = 0.041

period*dial*NOISE
F-Statistic = 581.875   Hypoth. df = 4   Error df = 1   Prob = 0.031
Example 12
Repeated Measures Analysis of Covariance
To do repeated measures analysis of covariance, where the covariate varies within
subjects, you would have to set up your model like a split plot with a different record
for each measurement.
This example is from Winer (1971). This design has two trials (DAY1 and DAY2),
one covariate (AGE), and one grouping factor (SEX). The data are in the file WINER.
Dependent variable means

    DAY(2)
    11.875

Between Subjects
----------------
Source          SS    df         MS          F       P
SEX         44.492     1     44.492      3.629   0.115
AGE        166.577     1    166.577     13.587   0.014
Error       61.298     5     12.260

Within Subjects
---------------
Source          SS    df         MS          F       P     G-G     H-F
day         22.366     1     22.366     17.899   0.008       .       .
day*SEX      0.494     1      0.494      0.395   0.557       .       .
day*AGE      0.127     1      0.127      0.102   0.763       .       .
Error        6.248     5      1.250

Greenhouse-Geisser Epsilon:   .
Huynh-Feldt Epsilon:          .
The F statistics for the covariate and its interactions, namely AGE (13.587) and
DAY * AGE (0.102), are not ordinarily published; however, they help you
understand the adjustment made by the covariate.
This analysis does not test the homogeneity of slopes assumption. To test it, run the
following model in GLM first:
MODEL day(1 .. 2) = CONSTANT + sex + age + sex*age / REPEAT
To use GLM:
GLM
USE winer
CATEGORY sex
MODEL day(1 .. 2) = CONSTANT + sex + age / REPEAT NAME=day
ESTIMATE
Example 13
Multivariate Analysis of Variance
The data in the file MANOVA comprise a hypothetical experiment on rats assigned
randomly to one of three drugs. Weight loss in grams was observed for the first and
second weeks of the experiment. The data were analyzed in Morrison (1976) with a
two-way multivariate analysis of variance (a two-way MANOVA).
You can use ANOVA to set up the MANOVA model for complete factorials:
USE manova
ANOVA
CATEGORY sex, drug
DEPEND week(1 .. 2)
ESTIMATE
Notice that the only difference between an ANOVA and MANOVA model is that the
latter has more than one dependent variable. The output includes:
Dependent variable means

   WEEK(1)    WEEK(2)
     9.750      8.667

Estimates of effects   B = (X'X)^-1 X'Y

                          WEEK(1)    WEEK(2)
CONSTANT                    9.750      8.667
SEX        1                0.167      0.167
DRUG       1               -2.750     -1.417
DRUG       2               -2.250     -0.167
SEX*DRUG   1  1            -0.667     -1.167
SEX*DRUG   1  2            -0.417     -0.417
Notice that each column of the B matrix is now assigned to a separate dependent
variable. It is as if we had done two runs of an ANOVA. The numbers in the matrix are
the analysis of variance effects estimates.
You can also use GLM to set up the MANOVA model. With this approach, the
design does not have to be a complete factorial. With commands:
GLM
USE manova
CATEGORY sex, drug
MODEL week(1 .. 2) = CONSTANT + sex + drug + sex*drug
ESTIMATE
Testing Hypotheses
With more than one dependent variable, you do not get a single ANOVA table; instead,
each hypothesis is tested separately. Here are three hypotheses. Extended output for the
second hypothesis is used to illustrate the detailed output.
HYPOTHESIS
EFFECT = sex
TEST
PRINT = LONG
HYPOTHESIS
EFFECT = drug
TEST
PRINT = SHORT
HYPOTHESIS
EFFECT = sex*drug
TEST
Test for effect called:     SEX

Univariate F Tests

Effect           SS    df        MS        F       P
WEEK(1)       0.667     1     0.667    0.127   0.726
Error        94.500    18     5.250
WEEK(2)       0.667     1     0.667    0.105   0.749
Error       114.000    18     6.333

Multivariate Test Statistics

Wilks' Lambda =            0.993
F-Statistic =              0.064   df = 2, 17   Prob = 0.938

Pillai Trace =             0.007
F-Statistic =              0.064   df = 2, 17   Prob = 0.938

Hotelling-Lawley Trace =   0.008
F-Statistic =              0.064   df = 2, 17   Prob = 0.938
-------------------------------------------------------------------------------
Test for effect called:     DRUG

Null hypothesis contrast AB

               WEEK(1)    WEEK(2)
    1           -2.750     -1.417
    2           -2.250     -0.167

Inverse contrast A(X'X)^-1 A'

                 1          2
    1        0.083
    2       -0.042      0.083

Hypothesis sum of product matrix  H = B'A'(A(X'X)^-1 A')^-1 AB

             WEEK(1)    WEEK(2)
WEEK(1)      301.000
WEEK(2)       97.500     36.333

Univariate F Tests

Effect           SS    df         MS         F       P
WEEK(1)     301.000     2    150.500    28.667   0.000
Error        94.500    18      5.250
WEEK(2)      36.333     2     18.167     2.868   0.083
Error       114.000    18      6.333

Multivariate Test Statistics

Wilks' Lambda =            0.169
F-Statistic =             12.199   df = 4, 34   Prob = 0.000

Pillai Trace =             0.880
F-Statistic =              7.077   df = 4, 36   Prob = 0.000

Hotelling-Lawley Trace =   4.640
F-Statistic =             18.558   df = 4, 32   Prob = 0.000

THETA = 0.821   S = 2, M = -0.5, N = 7.5   Prob = 0.000

Test of Residual Roots

Roots 1 through 2
Chi-Square Statistic =   36.491   df = 4

Roots 2 through 2
Chi-Square Statistic =    1.262   df = 1
Canonical Correlations

         1        2
     0.906    0.244

Dependent variable canonical coefficients standardized
by conditional (within groups) standard deviations

                 1         2
WEEK(1)      1.437    -0.352
WEEK(2)     -0.821     1.231

Canonical loadings

                 2
WEEK(1)      0.555
WEEK(2)      0.971
Test for effect called:     SEX*DRUG

Univariate F Tests

Effect           SS    df         MS        F       P
WEEK(1)      14.333     2      7.167    1.365   0.281
Error        94.500    18      5.250
WEEK(2)      32.333     2     16.167    2.553   0.106
Error       114.000    18      6.333

Multivariate Test Statistics

Wilks' Lambda =            0.774
F-Statistic =              1.159   df = 4, 34   Prob = 0.346

Pillai Trace =             0.227
F-Statistic =              1.152   df = 4, 36   Prob = 0.348

Hotelling-Lawley Trace =   0.290
F-Statistic =              1.159   df = 4, 32   Prob = 0.347

THETA = 0.221   S = 2, M = -0.5, N = 7.5   Prob = 0.295
Matrix formulas (which are somewhat long) make explicit the hypothesis being tested.
For MANOVA, hypotheses are tested with sums-of-squares and cross-products
matrices. Before printing the multivariate tests, however, SYSTAT prints the univariate
tests. Each of these F statistics is constructed in the same way as in the ANOVA model.
The sums of squares for hypothesis and error are taken from the diagonals of the
respective sum of product matrices. The univariate F test for the WEEK(1) DRUG
effect, for example, is computed from (301.0 / 2) divided by (94.5 / 18), or hypothesis mean
square divided by error mean square.
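A quick arithmetic check of that univariate F (plain Python, illustrative only):

h_ss, h_df = 301.0, 2      # WEEK(1) hypothesis SS and df for the DRUG effect
e_ss, e_df = 94.5, 18      # WEEK(1) error SS and df

print(round((h_ss / h_df) / (e_ss / e_df), 3))   # 28.667, as in the table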
The next statistics printed are for the multivariate hypothesis. Wilks' lambda
(the likelihood-ratio criterion) varies between 0 and 1. Schatzoff (1966) has tables for its
percentage points. The following F statistic is Rao's approximate (sometimes exact) F
statistic corresponding to the likelihood-ratio criterion (see Rao, 1973). Pillai's trace
and its F approximation are taken from Pillai (1960). The Hotelling-Lawley trace and
its F approximation are documented in Morrison (1976). The last statistic is the largest
root criterion for Roy's union-intersection test (see Morrison, 1976). Charts of the
percentage points of this statistic, found in Morrison and other multivariate texts, are
taken from Heck (1960).
The probability value printed for THETA is not an approximation. It is what you find
in the charts. In the first hypothesis, all the multivariate statistics have the same value
for the F approximation because the approximation is exact when there are only two
groups (see Hotelling's T² in Morrison, 1976). In these cases, THETA is not printed
because it has the same probability value as the F statistic.
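All four multivariate criteria are functions of the eigenvalues of E^-1 H, where H and E are the hypothesis and error sum-of-product matrices. The following sketch shows the definitions only (it is not how SYSTAT computes them numerically):

import numpy as np

def manova_statistics(H, E):
    # Eigenvalues of E^-1 H drive all four criteria.
    eig = np.linalg.eigvals(np.linalg.solve(E, H)).real
    wilks = np.prod(1.0 / (1.0 + eig))
    pillai = np.sum(eig / (1.0 + eig))
    hotelling_lawley = np.sum(eig)
    theta = eig.max() / (1.0 + eig.max())        # Roy's largest root criterion
    return wilks, pillai, hotelling_lawley, theta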
Because we requested extended output for the second hypothesis, we get additional
material.
Canonical Coefficients
Dimensions with insignificant chi-square statistics in the prior tests should be ignored
in general. Corresponding to each canonical correlation is a canonical variate, whose
coefficients have been standardized by the within-groups standard deviations (the
default). Standardization by the sample standard deviation is generally used for
canonical correlation analysis or multivariate regression when groups are not present
to introduce covariation among variates. You can standardize these variates by the total
(sample) standard deviations with:
STANDARDIZE = TOTAL
inserted prior to TEST. Continue with the other test specifications described earlier.
Finally, the canonical loadings are printed. These are correlations and, thus, provide
information different from the canonical coefficients. In particular, you can identify
the variables that correlate most strongly with each canonical variate.
Computation
Algorithms
Centered sums of squares and cross-products are accumulated using provisional
algorithms. Linear systems, including those involved in hypothesis testing, are solved
by using forward and reverse sweeping (Dempster, 1969). Eigensystems are solved
with Householder tridiagonalization and implicit QL iterations. For further
information, see Wilkinson and Reinsch (1971) or Chambers (1977).
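For the centered sums of squares, a provisional (one-pass, updating) accumulation looks like the following sketch in Python (illustrative of the idea only, not SYSTAT's implementation):

def provisional_ss(values):
    # One-pass accumulation of n, the mean, and the centered sum of squares.
    n, mean, ss = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        ss += delta * (x - mean)
    return n, mean, ss

print(provisional_ss([12, 18, 14, 12, 13]))   # (5, 13.8, 24.8)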
References
Afifi, A. A. and Azen, S. P. (1972). Statistical analysis: A computer-oriented approach. New York: Academic Press.
Afifi, A. A. and Clark, V. (1984). Computer-aided multivariate analysis. Belmont, Calif.: Lifetime Learning Publications.
Bartlett, M. S. (1947). Multivariate analysis. Journal of the Royal Statistical Society, Series B, 9, 176–197.
Bock, R. D. (1975). Multivariate statistical methods in behavioral research. New York: McGraw-Hill.
Cochran, W. G. and Cox, G. M. (1957). Experimental designs, 2nd ed. New York: John Wiley & Sons, Inc.
Daniel, C. (1960). Locating outliers in factorial experiments. Technometrics, 2, 149–156.
Feingold, M. and Korsog, P. E. (1986). The correlation and dependence between two F statistics with the same denominator. The American Statistician, 40, 218–220.
Heck, D. L. (1960). Charts of some upper percentage points of the distribution of the largest characteristic root. Annals of Mathematical Statistics, 31, 625–642.
Hocking, R. R. (1985). The analysis of linear models. Monterey, Calif.: Brooks/Cole.
John, P. W. M. (1971). Statistical design and analysis of experiments. New York: Macmillan, Inc.
Kutner, M. H. (1974). Hypothesis testing in linear models (Eisenhart Model I). The American Statistician, 28, 98–100.
Levene, H. (1960). Robust tests for equality of variance. In I. Olkin, ed., Contributions to Probability and Statistics. Palo Alto, Calif.: Stanford University Press, 278–292.
Miller, R. (1985). Multiple comparisons. In Kotz, S. and Johnson, N. L., eds., Encyclopedia of Statistical Sciences, vol. 5. New York: John Wiley & Sons, Inc., 679–689.
Milliken, G. A. and Johnson, D. E. (1984). Analysis of messy data, Vol. 1: Designed experiments. New York: Van Nostrand Reinhold Company.
Morrison, D. F. (1976). Multivariate statistical methods. New York: McGraw-Hill.
Neter, J., Wasserman, W., and Kutner, M. (1985). Applied linear statistical models, 2nd ed. Homewood, Ill.: Richard D. Irwin, Inc.
Pillai, K. C. S. (1960). Statistical table for tests of multivariate hypotheses. Manila: The Statistical Center, University of the Philippines.
Rao, C. R. (1973). Linear statistical inference and its applications, 2nd ed. New York: John Wiley & Sons, Inc.
Schatzoff, M. (1966). Exact distributions of Wilks' likelihood ratio criterion. Biometrika, 53, 347–358.
Scheffé, H. (1959). The analysis of variance. New York: John Wiley & Sons, Inc.
Searle, S. R. (1971). Linear models. New York: John Wiley & Sons, Inc.
Searle, S. R. (1987). Linear models for unbalanced data. New York: John Wiley & Sons, Inc.
Speed, F. M. and Hocking, R. R. (1976). The use of the R( )-notation with unbalanced data. The American Statistician, 30, 30–33.
Speed, F. M., Hocking, R. R., and Hackney, O. P. (1978). Methods of analysis of linear models with unbalanced data. Journal of the American Statistical Association, 73, 105–112.
Wilkinson, L. (1975). Response variable hypotheses in the multivariate analysis of variance. Psychological Bulletin, 82, 408–412.
Wilkinson, L. (1977). Confirmatory rotation of MANOVA canonical variates. Multivariate Behavioral Research, 12, 487–494.
Winer, B. J. (1971). Statistical principles in experimental design, 2nd ed. New York: McGraw-Hill.
Chapter 16
Linear Models III: General Linear Models
Leland Wilkinson and Mark Coward
General Linear Model (GLM) can estimate and test any univariate or multivariate
general linear model, including those for multiple regression, analysis of variance or
covariance, and other procedures such as discriminant analysis and principal
components. With the general linear model, you can explore randomized block
designs, incomplete block designs, fractional factorial designs, Latin square designs,
split plot designs, crossover designs, nesting, and more. The model is:
Y = XB + e
General linear hypotheses about the coefficients have the form ABC' = D.
You can specify any multivariate linear model with General Linear Model. You must
select the variables to include in the desired model.
Dependent(s). The variable(s) you want to examine. The dependent variable(s) should
be continuous numeric variables (for example, income).
Independent(s). Select one or more continuous or categorical variables (grouping
variables). Independent variables that are not denoted as categorical are considered
covariates. Unlike ANOVA, GLM does not automatically include and test all
interactions. With GLM, you have to build your model. If you want interactions or
nested variables in your model, you need to build these components.
Model. The following model options allow you to include a constant in your model, do
a means model, specify the sample size, and weight cell means:
• Include constant. The constant is an optional parameter. Deselect Include constant
to obtain a model through the origin. When in doubt, include the constant.
• Means. Specifies a fully factorial design using means coding.
• Cases. When your data file is a symmetric matrix, specify the sample size that
generated the matrix.
In addition, you can save residuals and other data to a new data file. The following
alternatives are available:
• Residuals. Saves predicted values, residuals, Studentized residuals, and related
case diagnostics.
Categorical Variables
You can specify numeric or character-valued categorical (grouping) variables that
define cells. You want to categorize an independent variable when it has several
categories, such as education levels, which could be divided into the following
categories: less than high school, some high school, finished high school, some
college, finished bachelor's degree, finished master's degree, and finished doctorate.
On the other hand, a variable such as age in years would not be categorical unless age
were broken up into categories such as under 21, 21–65, and over 65.
To specify categorical variables, click the Categories button in the General Linear
Model dialog box.
Types of Categories. You can elect to use one of two different coding methods:
• Effect. Produces parameter estimates that are differences from group means.
• Dummy. Produces dummy codes for the design variables instead of effect codes.
Repeated Measures
In a repeated measures design, the same variable is measured several times for each
subject (case). A paired-comparison t test is the most simple form of a repeated
measures design (for example, each subject has a before and after measure).
SYSTAT derives values from your repeated measures and uses them in general
linear model computations to test changes across the repeated measures (within
subjects) as well as differences between groups of subjects (between subjects). Tests
of the within-subjects values are called Polynomial Test Of Order 1, 2,..., up to k,
where k is one less than the number of repeated measures. The first polynomial is used
to test linear changes: Do the repeated responses increase (or decrease) around a line
with a significant slope? The second polynomial tests if the responses fall along a
quadratic curve, etc.
To open the Repeated Measures dialog box, click Repeated in the General Linear
Model dialog box.
If you select Perform repeated measures analysis, SYSTAT treats the dependent
variables as a set of repeated measures. Optionally, you can assign a name for each set
of repeated measures, specify the number of levels, and specify the metric for unevenly
spaced repeated measures.
Name. Name that identifies each set of repeated measures.
Levels. Number of repeated measures in the set. For example, if you have three
dependent variables that represent measurements at different times, the number of
levels is 3.
Metric. Metric that indicates the spacing between unevenly spaced measurements. For
example, if measurements were taken at the third, fifth, and ninth weeks, the metric
would be 3, 5, 9.
Stepwise Options. The following alternatives are available for stepwise entry and
removal:
• Backward. Begins with all candidate variables in the model. At each step, SYSTAT
removes the least significant variable from your model.
• Forward. Begins with no variables in the model. At each step, SYSTAT automatically
adds a variable to the model.
• Interactive. At each step in the model building, you select the variable to enter into
or remove from the model.
Pairwise Comparisons
Once you determine that your groups are different, you may want to compare pairs of
groups to determine which pairs differ.
To open the Pairwise Comparisons dialog box, from the menus choose:
Statistics
General Linear Model (GLM)
Pairwise Comparisons
Groups. You must specify the variable that defines the groups.
Test. General Linear Model provides several post hoc tests to compare levels of this
variable.
• Bonferroni. Multiple comparison test based on Student's t statistic. Adjusts the
observed significance level for the fact that multiple comparisons are made.
• Tukey. Uses the Studentized range statistic to make all pairwise comparisons
between groups and sets the experimentwise error rate to the error rate for the
collection of all pairwise comparisons. When testing a large number of pairs of
means, Tukey is more powerful than Bonferroni. For a small number of pairs,
Bonferroni is more powerful.
• Dunnett. The Dunnett test is available only with one-way designs. Dunnett
compares a set of treatments against a single control mean that you specify. You
can choose a two-sided or one-sided test. To test that the mean at any level (except
the control category) of the experimental groups is not equal to that of the control
category, select 2-sided. To test if the mean at any level of the experimental groups
is smaller (or larger) than that of the control category, select 1-sided.
• Fisher's LSD. Least significant difference pairwise multiple comparison test.
Equivalent to multiple t tests between all pairs of groups. The disadvantage of this
test is that no attempt is made to adjust the observed significance level for multiple
comparisons.
• Scheffé. The significance level of Scheffé's test is designed to allow all possible
linear combinations of group means to be tested.
Error Term. You can either use the mean square error specified by the model or you can
enter the mean square error.
• Model MSE. Uses the mean square error from the general linear model that you ran.
• MSE and df. You can specify your own mean square error term and degrees of
freedom for mixed models with random factors, split-plot designs, and crossover
designs with carry-over effects.
Hypothesis Tests
Contrasts are used to test relationships among cell means. The post hoc tests in GLM
Pairwise Comparison are the simplest form because they compare two means at a
time. However, general contrasts can involve any number of means in the analysis.
To test hypotheses, from the menus choose:
Statistics
General Linear Model (GLM)
Hypothesis Test
Contrasts can be defined across the categories of a grouping factor or across the levels
of a repeated measure.
Effects. Specify the factor (grouping variable) to which the contrast applies. For
principal components, specify the grouping variable for within-groups components (if
any). For canonical correlation, select All to test all of the effects in the model.
Within. Use when specifying a contrast across the levels of a repeated measures factor.
Enter the name assigned to the set of repeated measures in the Repeated Measures
subdialog box.
Error Term. You can specify which error term to use for the hypothesis tests.
• Model MSE. Uses the mean square error from the general linear model that you ran.
• MSE and df. You can specify your own mean square error and degrees of freedom,
or use interaction error terms in all tests. Specify interactions using an asterisk between
variables.
Priors. Prior probabilities for discriminant analysis. Type a value for each group,
separated by spaces. These probabilities should add to 1. For example, if you have
three groups, priors might be 0.5, 0.3, and 0.2.
Standardize. You can standardize canonical coefficients using the total sample or a
within-groups covariance matrix.
• Within groups is usually used in discriminant analysis to make comparisons easier.
Specify
To specify contrasts for between-subjects effects, click Specify in the Hypothesis Test
dialog box.
You can use GLMs cell means language to define contrasts across the levels of a
grouping variable in a multivariate model. For example, for a two-way factorial
ANOVA design with DISEASE (four categories) and DRUG (three categories), you
could contrast the marginal mean for the first level of drug against the third level by
specifying:
DRUG[1] = DRUG[3]
Note that square brackets enclose the value of the category (for example, for
GENDER$, specify GENDER$[male]). For the simple contrast of the first and third
levels of DRUG for the second disease only, specify:
DRUG[1] DISEASE[2] = DRUG[3] DISEASE[2]
In addition, you can specify the error term to use for the contrasts.
Pooled. Uses the error term from the current model.
Separate. Generates a separate variances error term.
Contrast
Contrast generates a contrast for a grouping factor or a repeated measures factor. To
open the Contrast dialog box, click Contrast in the Hypothesis Test dialog box.
A metric indicates the spacing of unevenly spaced repeated measures; for
example, when repeated measures are collected at weeks 2, 4, and 8, enter 2,4,8 as
the metric.
Sum. In a repeated measures ANOVA, totals the values for each subject.
These matrices (A, C, and D) may be specified in several alternative ways; if they are
not specified, they have default values. To specify an A matrix, click A matrix in the
Hypothesis Test dialog box.
A is a matrix of linear weights contrasting the coefficient estimates (the rows of B).
The A matrix has as many columns as there are regression coefficients (including the
constant) in your model. The number of rows in A determines how many degrees of
freedom your hypothesis involves. The A matrix can have several different forms, but
these are all submatrices of an identity matrix and are easily formed using Hypothesis
Test.
To specify a C matrix, click C matrix in the Hypothesis Test dialog box.
The C matrix is used to test hypotheses for repeated measures analysis of variance
designs and models with multiple dependent variables. C has as many columns as there
are dependent variables. For most multivariate models, C is an identity matrix.
To specify a D matrix, click D matrix in the Hypothesis Test dialog box.
D is a null hypothesis matrix (usually a null matrix). The D matrix, if you use it, must
have the same number of rows as A. For univariate multiple regression, D has only one
column. For multivariate models (multiple dependent variables), the D matrix has one
column for each dependent variable.
A matrix and D matrix are often used to test hypotheses in regression. Linear
hypotheses in regression have the form AB = D, where A is the matrix of linear weights
on coefficients across the independent variables (the rows of B), B is the matrix of
regression coefficients, and D is a null hypothesis matrix (usually a null matrix). The
A and D matrices can be specified in several alternative ways, and if they are not
specified, they have default values.
Using Commands
Select the data with USE filename and continue with:
GLM
MODEL varlist1 = CONSTANT + varlist2 + var1*var2 + ,
      var3(var4) / REPEAT=m,n,... REPEAT=m(x1,x2,...),
      n(y1,y2,...) NAMES=name1,name2,... , MEANS,
      WEIGHT N=n
CATEGORY grpvarlist / MISS EFFECT or DUMMY
SAVE filename / COEF MODEL RESID DATA PARTIAL ADJUSTED
ESTIMATE / MIX TOL=n
Usage Considerations
Types of data. Normally, you analyze raw cases-by-variables data with General Linear
Model. You can, however, use a symmetric matrix data file (for example, a covariance
matrix saved in a file from Correlations) as input. If you use a matrix as input, you must
specify a value for Cases when estimating the model (under Group in the General
Linear Model dialog box) to specify the sample size of the data file that generated the
matrix. The number you specify must be an integer greater than 2.
Be sure to include the dependent as well as independent variables in your matrix.
SYSTAT picks out the dependent variable you name in your model.
SYSTAT uses the sample size to calculate degrees of freedom in hypothesis tests.
SYSTAT also determines the type of matrix (SSCP, covariance, and so on) and adjusts
appropriately. With a correlation matrix, the raw and standardized coefficients are the
same; therefore, you cannot include a constant when using SSCP, covariance, or
correlation matrices. Because these matrices are centered, the constant term has
already been removed.
The triangular matrix input facility is useful for meta-analysis of published data
and missing value computations; however, you should heed the following warnings:
First, if you input correlation matrices from textbooks or articles, you may not get the
same regression coefficients as those printed in the source. Because of round-off error,
printed and raw data can lead to different results. Second, if you use pairwise deletion
with Correlations, the degrees of freedom for hypotheses will not be appropriate. You
may not even be able to estimate the regression coefficients because of singularities.
In general, correlation matrices containing missing data produce coefficient
estimates and hypothesis tests that are optimistic. You can correct for this by
specifying a sample size smaller than the number of actual observations (preferably set
it equal to the smallest number of cases used for any pair of variables), but this is a
guess that you can refine only by doing Monte Carlo simulations. There is no simple
solution. Beware, especially, of multivariate regressions (MANOVA and others) with
missing data on the dependent variables. You can usually compute coefficients, but
hypothesis testing produces results that are suspect.
Print options. General Linear Model produces extended output if you set the output length
to LONG or if you select Save scores and results in the Hypothesis Test dialog box.
For model estimation, extended output adds the following: total sum of product
matrix, residual (or pooled within groups) sum of product matrix, residual (or pooled
within groups) covariance matrix, and the residual (or pooled within groups)
correlation matrix.
For hypothesis testing, extended output adds A, C, and D matrices, the matrix of
contrasts, and the inverse of the cross products of contrasts, hypothesis and error sum
of product matrices, tests of residual roots, canonical correlations, coefficients, and
loadings.
Quick Graphs. If no variables are categorical, GLM produces Quick Graphs of residuals
versus predicted values. For categorical predictors, GLM produces graphs of the least
squares means for the levels of the categorical variable(s).
Saving files. Several sets of output can be saved to a file. The actual contents of the
saved file depend on the analysis. Files may include estimated regression coefficients,
model variables, residuals, predicted values, diagnostic statistics, canonical variable
scores, and posterior probabilities (among other statistics).
BY groups. Each level of any BY variables yields a separate analysis.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. GLM uses the FREQUENCY variable, if present, to duplicate cases.
Case weights. GLM uses the values of any WEIGHT variables to weight each case.
Examples
Example 1
One-Way ANOVA
The following data, KENTON, are from Neter, Wasserman, and Kutner (1985). The
data comprise unit sales of a cereal product under different types of package designs.
Ten stores were selected as experimental units. Each store was randomly assigned to
sell one of the package designs (each design was sold at two or three stores).
PACKAGE   SALES
   1        12
   1        18
   2        14
   2        12
   2        13
   3        19
   3        17
   3        21
   4        24
   4        30
I-505
Chapter 16
Numbers are used to code the four types of package designs; alternatively, you could
have used words. Neter, Wasserman, and Kutner report that cartoons are part of
designs 1 and 3 but not designs 2 and 4; designs 1 and 2 have three colors; and designs
3 and 4 have five colors. Thus, string codes for PACKAGE$ might have been Cart 3,
NoCart 3, Cart 5, and NoCart 5. Notice that the data do not need to be ordered
by PACKAGE as shown here. The input for a one-way analysis of variance is:
USE kenton
GLM
CATEGORY package
MODEL sales=CONSTANT + package
GRAPH NONE
ESTIMATE
N: 10   Multiple R: 0.921

Analysis of Variance
Source      Sum-of-Squares   df   Mean-Square   F-ratio        P
PACKAGE        258.000        3      86.000      11.217      0.007
Error           46.000        6       7.667
This is the standard analysis of variance table. The F ratio (11.217) appears significant,
so you could conclude that the package designs differ significantly in their effects on
sales, provided the assumptions are valid.
I-506
Linear Models III: General Linear Models
Comparisons for the pairwise methods are made across all pairs of least-squares
group means for the design term that is specified. For a multiway design, marginal cell
means are computed for the effects specified before the comparisons are made.
To determine significant differences, simply look for pairs with probabilities below
your critical value (for example, 0.05 or 0.01). All multiple comparison methods
handle unbalanced designs correctly.
After you estimate your ANOVA model, it is easy to do post hoc tests. To do a
Tukey HSD test, first estimate the model, then specify these commands:
HYPOTHESIS
POST package / TUKEY
TEST
Matrix of pairwise mean differences:

            1         2         3         4
  1       0.0
  2      -2.000     0.0
  3       4.000     6.000     0.0
  4      12.000    14.000     8.000     0.0

Tukey HSD matrix of pairwise comparison probabilities:

            1         2         3         4
  1       1.000
  2       0.856     1.000
  3       0.452     0.130     1.000
  4       0.019     0.006     0.071     1.000
Results show that sales for the fourth package design (five colors and no cartoons) are
significantly larger than those for packages 1 and 2. None of the other pairs differ
significantly.
I-507
Chapter 16
Contrasts
This example uses two contrasts:
• We compare the first and third packages using coefficients of (1, 0, -1, 0).
• We compare the average performance of the first three packages with the last
  package, using coefficients of (1, 1, 1, -3).
After the model has been estimated, the two contrasts are specified as:
HYPOTHESIS
EFFECT = package
CONTRAST [1 0 -1 0]
TEST
HYPOTHESIS
EFFECT = package
CONTRAST [1 1 1 -3]
TEST
For each hypothesis, we specify one contrast, so the test has one degree of freedom;
therefore, the contrast matrix has one row of numbers. These numbers are the same
ones you see in ANOVA textbooks, although ANOVA offers one advantage: you do
not have to standardize them so that their sum of squares is 1. The output follows:
Test for effect called:     PACKAGE

A Matrix
              1          2          3          4
            0.0        1.000      0.0       -1.000

Test of Hypothesis
Source            SS     df        MS           F          P
Hypothesis      19.200    1      19.200       2.504      0.165
Error           46.000    6       7.667
Test for effect called:     PACKAGE

A Matrix
              1          2          3          4
            0.0        4.000      4.000      4.000

Test of Hypothesis
Source            SS     df        MS           F          P
Hypothesis     204.000    1     204.000      26.609      0.002
Error           46.000    6       7.667
For the first contrast, the F statistic (2.504) is not significant, so you cannot conclude
that the impact of the first and third package designs on sales is significantly different.
I-508
Linear Models III: General Linear Models
Incidentally, the A matrix contains the contrast. The first column (0) corresponds to the
constant in the model, and the remaining three columns (1 0 -1) correspond to the
dummy variables for PACKAGE.
The last package design is significantly different from the other three taken as a
group. Notice that the A matrix looks much different this time. Because the effects sum
to 0, the last effect is minus the sum of the other three; that is, letting αi denote the
effect for level i of PACKAGE,

    α1 + α2 + α3 + α4 = 0

so

    α4 = -(α1 + α2 + α3)

and the contrast α1 + α2 + α3 - 3α4 becomes

    α1 + α2 + α3 - 3(-α1 - α2 - α3)

which simplifies to

    4α1 + 4α2 + 4α3
Orthogonal Polynomials
Constructing orthogonal polynomials for between-groups factors is useful when the
levels of a factor are ordered. To construct orthogonal polynomials for your between-groups factors:
HYPOTHESIS
EFFECT = package
CONTRAST / POLYNOMIAL ORDER=2
TEST
I-509
Chapter 16
Test for effect called:     PACKAGE

A Matrix
              1          2          3          4
            0.0        0.0       -1.000     -1.000

Test of Hypothesis
Source            SS     df        MS           F          P
Hypothesis      60.000    1      60.000       7.826      0.031
Error           46.000    6       7.667
Make sure that the levels of the factor (after they are sorted by the procedure,
numerically or alphabetically) are ordered meaningfully on a latent dimension. If you
need a specific order, use LABEL or ORDER; otherwise, the results will not make sense.
In the example, the significant quadratic effect is the result of the fourth package
having a much larger sales volume than the other three.
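For comparison, the linear component could be requested the same way by changing the polynomial order. This is only a sketch, assuming ORDER=1 requests the linear contrast by analogy with the ORDER=2 command above:

HYPOTHESIS
EFFECT = package
CONTRAST / POLYNOMIAL ORDER=1
TEST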
I-510
Linear Models III: General Linear Models
         SALES    X(1)    X(2)    X(3)
   1       12       1       0       0
   2       18       1       0       0
   3       14       0       1       0
   4       12       0       1       0
   5       13       0       1       0
   6       19       0       0       1
   7       17       0       0       1
   8       21       0       0       1
   9       24      -1      -1      -1
  10       30      -1      -1      -1
The variables X(1), X(2), and X(3) are the effects coding dummy variables generated
by the procedure. All cases in the first cell are associated with dummy values 1 0 0;
those in the second cell with 0 1 0; the third, 0 0 1; and the fourth, -1 -1 -1. Other
least-squares programs use different methods to code dummy variables. The coding used by
SYSTAT is the most widely used and guarantees that the effects sum to 0.
If you had used dummy coding, these dummy variables would be saved:
SALES    X(1)    X(2)    X(3)
  12       1       0       0
  18       1       0       0
  14       0       1       0
  12       0       1       0
  13       0       1       0
  19       0       0       1
  17       0       0       1
  21       0       0       1
  24       0       0       0
  30       0       0       0
This coding yields parameter estimates that are the differences between the mean for
each group and the mean of the last group.
I-511
Chapter 16
Example 2
Randomized Block Designs
A randomized block design is like a factorial design without an interaction term. The
following example is from Neter, Wasserman, and Kutner (1985). Five blocks of
judges were given the task of analyzing three treatments. Judges are stratified within
blocks, so the interaction of blocks and treatments cannot be analyzed. These data are
in the file BLOCK. The input is:
USE block
GLM
CATEGORY block, treat
MODEL judgment = CONSTANT + block + treat
ESTIMATE
You must use GLM instead of ANOVA because you do not want the BLOCK*TREAT
interaction in the model. The output is:
Dep Var: JUDGMENT   N: 15   Multiple R: 0.970

Analysis of Variance
Source      Sum-of-Squares   df   Mean-Square   F-ratio        P
BLOCK          171.333        4      42.833      14.358      0.001
TREAT          202.800        2     101.400      33.989      0.000
Error           23.867        8       2.983
Example 3
Incomplete Block Designs
Randomized blocks can be used in factorial designs. Here is an example from John
(1971). The data (in the file JOHN) involve an experiment with three treatment factors
(A, B, and C) plus a blocking variable with eight levels. Notice that data were collected
on 32 of the possible 64 experimental situations.
I-512
Linear Models III: General Linear Models
BLOCK   A   B   C     Y         BLOCK   A   B   C     Y
  1     1   1   1    101          5     1   1   1     87
  1     2   1   2    373          5     2   1   2    324
  1     1   2   2    398          5     1   2   1    279
  1     2   2   1    291          5     2   2   2    471
  2     1   1   2    312          6     1   1   2    323
  2     2   1   1    106          6     2   1   1    128
  2     1   2   1    265          6     1   2   2    423
  2     2   2   2    450          6     2   2   1    334
  3     1   1   1    106          7     1   1   1    131
  3     2   2   1    306          7     2   1   1    103
  3     1   1   2    324          7     1   2   2    445
  3     2   2   2    449          7     2   2   2    437
  4     1   2   1    272          8     1   1   2    324
  4     2   1   1     89          8     2   1   2    361
  4     1   2   2    407          8     1   2   1    302
  4     2   1   2    338          8     2   2   1    272
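The input commands that produced the following output do not appear in the text. A sketch of the likely input, assuming the dependent variable is named Y (the file name JOHN comes from the text; the exact command layout is an assumption):

USE john
GLM
CATEGORY block a b c
MODEL y = CONSTANT + block + a + b + c +,
          a*b + a*c + b*c + a*b*c
ESTIMATE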
N: 32   Multiple R: 0.994

Analysis of Variance
Source      Sum-of-Squares   df   Mean-Square     F-ratio        P
BLOCK          2638.469        7      376.924       1.182      0.364
A              3465.281        1     3465.281      10.862      0.004
B            161170.031        1   161170.031     505.209      0.000
C            278817.781        1   278817.781     873.992      0.000
A*B              28.167        1       28.167       0.088      0.770
A*C            1802.667        1     1802.667       5.651      0.029
B*C           11528.167        1    11528.167      36.137      0.000
A*B*C            45.375        1       45.375       0.142      0.711
Error          5423.281       17      319.017
I-513
Chapter 16
Example 4
Fractional Factorial Designs
Sometimes a factorial design involves so many combinations of treatments that certain
cells must be left empty to save experimental resources. At other times, a complete
randomized factorial study is designed, but loss of subjects leaves one or more cells
completely missing. These models are similar to incomplete block designs because not
all effects in the full model can be estimated. Usually, certain interactions must be left
out of the model.
The following example uses some experimental data that contain values in only 8
out of 16 possible cells. Each cell contains two cases. The pattern of nonmissing cells
makes it possible to estimate only the main effects plus three two-way interactions. The
data are in the file FRACTION.
A   B   C   D     Y
1   1   1   1     7
1   1   1   1     3
2   2   1   1     1
2   2   1   1     2
2   1   2   1    12
2   1   2   1    13
1   2   2   1    14
1   2   2   1    15
2   1   1   2     8
2   1   1   2     6
1   2   1   2    12
1   2   1   2    10
1   1   2   2     6
1   1   2   2     4
2   2   2   2     6
2   2   2   2     7
I-514
Linear Models III: General Linear Models
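For reference, a sketch of the input (the file name FRACTION and the dependent variable Y come from the text and output; the exact command layout is an assumption):

USE fraction
GLM
CATEGORY a b c d
MODEL y = CONSTANT + a + b + c + d + a*b + a*c + b*c
ESTIMATE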
We must use GLM instead of ANOVA to omit the higher-way interactions that ANOVA
automatically generates. The output is:
Dep Var: Y   N: 16   Multiple R: 0.972

Analysis of Variance
Source      Sum-of-Squares   df   Mean-Square   F-ratio        P
A               16.000        1      16.000       8.000      0.022
B                4.000        1       4.000       2.000      0.195
C               49.000        1      49.000      24.500      0.001
D                4.000        1       4.000       2.000      0.195
A*B            182.250        1     182.250      91.125      0.000
A*C             12.250        1      12.250       6.125      0.038
B*C              2.250        1       2.250       1.125      0.320
Error           16.000        8       2.000
When missing cells turn up by chance rather than by design, you may not know which
interactions to eliminate. When you attempt to fit the full model, SYSTAT informs you
that the design is singular. In that case, you may need to try several models before
finding an estimable one. It is usually best to begin by leaving out the highest-order
interaction (A*B*C*D in this example). Continue with subset models until you get an
ANOVA table.
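A sketch of a first attempt under that strategy, using the factor names from the FRACTION example (this model may itself be singular, in which case further interactions would be dropped):

MODEL y = CONSTANT + a + b + c + d +,
          a*b + a*c + a*d + b*c + b*d + c*d +,
          a*b*c + a*b*d + a*c*d + b*c*d
ESTIMATE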
Looking for an estimable model is not the same as analyzing the data with stepwise
regression because you are not looking at p values. After you find an estimable model,
stop and settle with the statistics printed in the ANOVA table.
Example 5
Nested Designs
Nested designs resemble factorial designs with certain cells missing (incomplete
factorials). This is because one factor is nested under another, so that not all
combinations of the two factors are observed. For example, in an educational study,
classrooms are usually nested under schools because it is impossible to have the same
classroom existing at two different schools (except as antimatter). The following
I-515
Chapter 16
example (in which teachers are nested within schools) is from Neter, Wasserman, and
Kutner (1985). The data (learning scores) look like this:
            TEACHER1     TEACHER2
SCHOOL1      25   29      14   11
SCHOOL2      11    6      22   18
SCHOOL3      17   20       5    2

In the study, there are actually six teachers, not just two; thus, the design really looks
like this:

          TEACHER1  TEACHER2  TEACHER3  TEACHER4  TEACHER5  TEACHER6
SCHOOL1    25  29    14  11
SCHOOL2                        11   6    22  18
SCHOOL3                                            17  20     5   2

In the data file, each case contains the teacher code, the school code, and the learning
score:

TEACHER   SCHOOL   LEARNING
   1         1        25
   1         1        29
   2         1        14
   2         1        11
   3         2        11
   3         2         6
   4         2        22
   4         2        18
   5         3        17
   5         3        20
   6         3         5
   6         3         2
I-516
Linear Models III: General Linear Models
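The commands for this analysis are not shown in the text. A sketch of the likely input, assuming the data are stored in a file with the variables TEACHER, SCHOOL, and LEARNING (the file name SCHOOLS is hypothetical):

USE schools
GLM
CATEGORY school teacher
MODEL learning = CONSTANT + school + teacher(school)
ESTIMATE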
N: 12   Multiple R: 0.972

Analysis of Variance
Source             Sum-of-Squares   df   Mean-Square   F-ratio        P
SCHOOL                156.500        2      78.250      11.179      0.009
TEACHER(SCHOOL)       567.500        3     189.167      27.024      0.001
Error                  42.000        6       7.000
Your data can use any codes for TEACHER, including a separate code for every teacher
in the study, as long as each different teacher within a given school has a different code.
GLM will use the nesting specified in the MODEL statement to determine the pattern of
nesting. You can, for example, allow teachers in different schools to share codes.
This example is a balanced nested design. Unbalanced designs (unequal number of
cases per cell) are handled automatically in SYSTAT because the estimation method
is least squares.
Example 6
Split Plot Designs
The split plot design is closely related to the nested design. In the split plot, however,
plots are often considered a random factor; therefore, you have to construct different
error terms to test different effects. The following example involves two treatments: A
(between plots) and B (within plots). The numbers in the cells are the YIELD of the
crop within plots.
                 A1                  A2
           PLOT1    PLOT2      PLOT3    PLOT4
B1           0        3          4        5
B2           0        1          2        4
B3           5        5          7        6
B4           3        4          8        6
I-517
Chapter 16
Here are the data from the PLOTS data file in the form needed by SYSTAT:
PLOT   A   B   YIELD
  1    1   1     0
  1    1   2     0
  1    1   3     5
  1    1   4     3
  2    1   1     3
  2    1   2     1
  2    1   3     5
  2    1   4     4
  3    2   1     4
  3    2   2     2
  3    2   3     7
  3    2   4     8
  4    2   1     5
  4    2   2     4
  4    2   3     6
  4    2   4     6
To analyze this design, you need two different error terms. For the between-plots
effects (A), you need plots within A. For the within-plots effects (B and A*B), you
need B by plots within A.
First, fit the saturated model with all the effects and then specify different error
terms as needed. The input is:
USE plots
GLM
CATEGORY plot, a, b
MODEL yield = CONSTANT + a + b + a*b + plot(a) + b*plot(a)
ESTIMATE
N: 16   Multiple R: 1.000

Analysis of Variance
Source         Sum-of-Squares   df   Mean-Square   F-ratio      P
A                  27.563        1      27.563        .         .
B                  42.688        3      14.229        .         .
A*B                 2.188        3       0.729        .         .
PLOT(A)             3.125        2       1.562        .         .
B*PLOT(A)           7.375        6       1.229        .         .
Error               0.0          0
I-518
Linear Models III: General Linear Models
You do not get a full ANOVA table because the model is perfectly fit. The coefficient
of determination (Squared multiple R) is 1. Now you have to use some of the effects
as error terms.
Between-Plots Effects
Let's test for between-plots effects, namely A. The input is:
HYPOTHESIS
EFFECT = a
ERROR = plot(a)
TEST
Test of Hypothesis
Source            SS     df        MS           F          P
Hypothesis      27.563    1      27.563      17.640      0.052
Error            3.125    2       1.562
Within-Plots Effects
To do the within-plots effects (B and A*B), the input is:
HYPOTHESIS
EFFECT = b
ERROR = b*plot(a)
TEST
HYPOTHESIS
EFFECT = a*b
ERROR = b*plot(a)
TEST
Test of Hypothesis
Source            SS     df        MS           F          P
Hypothesis      42.687    3      14.229      11.576      0.007
Error            7.375    6       1.229
-------------------------------------------------------------------------------
I-519
Chapter 16
Test for effect called:     A*B

Test of Hypothesis
Source            SS     df        MS           F          P
Hypothesis       2.188    3       0.729       0.593      0.642
Error            7.375    6       1.229
Here, we find a significant effect due to factor B (p = 0.007), but the interaction is not
significant (p = 0.642).
This analysis is the same as that for a repeated measures design with subjects as
PLOT, groups as A, and trials as B. Because this method becomes unwieldy for a large
number of plots (subjects), SYSTAT offers a more compact method for repeated
measures analysis as an alternative.
Example 7
Latin Square Designs
A Latin square design imposes a pattern on treatments in a factorial design to save
experimental effort or reduce within cell error. As in the nested design, not all
combinations of the square and other treatments are measured, so the model lacks
certain interaction terms between squares and treatments. GLM can analyze these
designs easily if an extra variable denoting the square is included in the file. The
following fixed effects example is from Neter, Wasserman, and Kutner (1985). The
SQUARE variable is represented in the cells of the design. For simplicity, the
dependent variable, RESPONSE, has been left out.
           day1   day2   day3   day4   day5
week1       D      C      A      B      E
week2       C      B      E      A      D
week3       A      D      B      E      C
week4       E      A      C      D      B
week5       B      E      D      C      A
I-520
Linear Models III: General Linear Models
You would set up the data as shown below (the LATIN file).
DAY   WEEK   SQUARE   RESPONSE
 1      1      D         18
 1      2      C         17
 1      3      A         14
 1      4      E         21
 1      5      B         17
 2      1      C         13
 2      2      B         34
 2      3      D         21
 2      4      A         16
 2      5      E         15
 3      1      A          7
 3      2      E         29
 3      3      B         32
 3      4      C         27
 3      5      D         13
 4      1      B         17
 4      2      A         13
 4      3      E         24
 4      4      D         31
 4      5      C         25
 5      1      E         21
 5      2      D         26
 5      3      C         26
 5      4      B         31
 5      5      A          7
I-521
Chapter 16
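The commands are not reproduced here. A sketch of the likely input (the LATIN file and the variable names come from the data listing above):

USE latin
GLM
CATEGORY day week square
MODEL response = CONSTANT + day + week + square
ESTIMATE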
N: 25   Multiple R: 0.931

Analysis of Variance
Source      Sum-of-Squares   df   Mean-Square   F-ratio        P
DAY             82.000        4      20.500       1.306      0.323
WEEK           477.200        4     119.300       7.599      0.003
SQUARE         664.400        4     166.100      10.580      0.001
Error          188.400       12      15.700
Example 8
Crossover and Changeover Designs
In crossover designs, an experiment is divided into periods, and the treatment of a
subject changes from one period to the next. Changeover studies often use designs
similar to a Latin square. A problem with these designs is that there may be a residual
or carry-over effect of a treatment into the following period. This can be minimized by
extending the interval between experimental periods; however, this is not always
feasible. Fortunately, there are methods to assess the magnitude of any carry-over
effects that may be present.
Two-period crossover designs can be analyzed as repeated-measures designs. More
complicated crossover designs can also be analyzed by SYSTAT, and carry-over
effects can be assessed. Cochran and Cox (1957) present a study of milk production by
cows under three different feed schedules: A (roughage), B (limited grain), and C (full
grain). The design of the study has the form of two (3 × 3) Latin squares:
                    Latin square 1            Latin square 2
COW                 I      II     III         IV     V      VI
Period    1         A      B      C           A      B      C
          2         B      C      A           C      A      B
          3         C      A      B           B      C      A
I-522
Linear Models III: General Linear Models
COW   SQUARE   PERIOD   FEED   CARRY   RESIDUAL   MILK
 1       1        1       1      1         0        38
 1       1        2       2      1         1        25
 1       1        3       3      2         2        15
 2       1        1       2      1         0       109
 2       1        2       3      2         2        86
 2       1        3       1      2         3        39
 3       1        1       3      1         0       124
 3       1        2       1      2         3        72
 3       1        3       2      1         1        27
 4       2        1       1      1         0        86
 4       2        2       3      1         1        76
 4       2        3       2      2         3        46
 5       2        1       2      1         0        75
 5       2        2       1      2         2        35
 5       2        3       3      1         1        34
 6       2        1       3      1         0       101
 6       2        2       2      2         3        63
 6       2        3       1      2         2         1
PERIOD is nested within each Latin square (the periods for cows in one square are
unrelated to the periods in the other). The variable RESIDUAL indicates the treatment
of the preceding period. For the first period for each cow, there is no preceding period.
The input is:
USE williams
GLM
CATEGORY cow, period, square, residual, carry, feed
MODEL milk = CONSTANT + cow + feed +,
period(square) + residual(carry)
ESTIMATE
I-523
Chapter 16
N: 18   Multiple R: 0.995

Analysis of Variance
Source             Sum-of-Squares   df   Mean-Square   F-ratio        P
COW                   3835.950        5     767.190      15.402      0.010
FEED                  2854.550        2    1427.275      28.653      0.004
PERIOD(SQUARE)        3873.950        4     968.488      19.443      0.007
RESIDUAL(CARRY)        616.194        2     308.097       6.185      0.060
Error                  199.250        4      49.813
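The next two pieces of output test COW: first COW is fit alone, and then its mean square is tested against the error term from the full model above (49.813 with 4 df). The commands themselves are not reproduced in the text; a sketch, patterned on the PERIOD(SQUARE) commands shown below, would be:

USE williams
GLM
CATEGORY cow
MODEL milk = CONSTANT + cow
ESTIMATE
HYPOTHESIS
EFFECT = cow
ERROR = 49.813(4)
TEST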
N: 18   Multiple R: 0.533

Analysis of Variance
Source      Sum-of-Squares   df   Mean-Square   F-ratio        P
COW            5781.111        5    1156.222       0.952      0.484
Error         14581.333       12    1215.111

Test for effect called:     COW

Test of Hypothesis
Source              SS     df        MS           F          P
Hypothesis      5781.111    5    1156.222      23.211      0.005
Error            199.252    4      49.813
I-524
Linear Models III: General Linear Models
The remaining term, PERIOD, requires a different model. PERIOD is nested within
SQUARE.
USE williams
GLM
CATEGORY period square
MODEL milk = CONSTANT + period(square)
ESTIMATE
HYPOTHESIS
EFFECT = period(square)
ERROR = 49.813(4)
TEST
N: 18   Multiple R: 0.751

Analysis of Variance
Source             Sum-of-Squares   df   Mean-Square   F-ratio        P
PERIOD(SQUARE)       11489.111        4    2872.278       4.208      0.021
Error                 8873.333       13     682.564

-------------------------------------------------------------------------------
> HYPOTHESIS
> EFFECT = period(square)
> ERROR = 49.813(4)
> TEST

Test for effect called:     PERIOD(SQUARE)

Test of Hypothesis
Source               SS     df        MS           F          P
Hypothesis      11489.111    4    2872.278      57.661      0.001
Error             199.252    4      49.813
Example 9
Missing Cells Designs (the Means Model)
When cells are completely missing in a factorial design, parameterizing a model can
be difficult. The full model cannot be estimated. GLM offers a means model
parameterization so that missing cell parameters can be dropped automatically from
the model, and hypotheses for main effects and interactions can be tested by specifying
cells directly. Examine Searle (1987), Hocking (1985), or Milliken and Johnson (1984)
for more information in this area.
I-525
Chapter 16
Widely favored for this purpose by statisticians (Searle, 1987; Hocking, 1985;
Milliken and Johnson, 1984), the means model allows:
• Tests of hypotheses in missing cells designs (using Type IV sums of squares)
• Tests of simple hypotheses (for example, within levels of other factors)
• The use of population weights to reflect differences in subclass sizes
Effects coding is the default for GLM. A means model, by contrast, codes predictors as
cell means rather than as effects (deviations from a grand mean). The constant is omitted,
and the predictors are 1 for a case belonging to a given cell and 0 for all others. When
cells are missing, GLM automatically excludes null columns and estimates the
submodel.
The categorical variables are specified in the MODEL statement differently for a
means model than for an effects model. Here are some examples:
MODEL y = a*b / MEANS
MODEL y = group*age*school$ / MEANS
These models generate fully factorial designs (A by B, and GROUP by AGE by
SCHOOL$). Notice that they omit the constant and the main-effect parameters because the
means model does not include effects or a grand mean. Nevertheless, the number of
parameters is the same in the two parameterizations. The following are the effects model and the
means model, respectively, for a 2 × 3 design (two levels of A and three levels of B):
MODEL y = CONSTANT + A + B + A*B

A   B    CONSTANT    a1     b1     b2    a1b1   a1b2
1   1        1        1      1      0      1      0
1   2        1        1      0      1      0      1
1   3        1        1     -1     -1     -1     -1
2   1        1       -1      1      0     -1      0
2   2        1       -1      0      1      0     -1
2   3        1       -1     -1     -1      1      1

MODEL y = A*B / MEANS

A   B    a1b1   a1b2   a1b3   a2b1   a2b2   a2b3
1   1      1      0      0      0      0      0
1   2      0      1      0      0      0      0
1   3      0      0      1      0      0      0
2   1      0      0      0      1      0      0
2   2      0      0      0      0      1      0
2   3      0      0      0      0      0      1
Means and effects models can be blended for incomplete factorials and other designs.
All crossed terms (for example, A*B) will be coded with means design variables
(provided the MEANS option is present), and the remaining terms will be coded as
effects. The constant must be omitted, even in these cases, because it is collinear with
the means design variables. All covariates and effects-coded factors must
precede the crossed factors in the MODEL statement.
Here is an example, assuming A has four levels, B has two, and C has three. In this
design, there are 24 possible cells, but only 12 are nonmissing. The treatment
combinations are partially balanced across the levels of B and C.
MODEL y = A + B*C / MEANS
A   B   C     a1    a2    a3    b1c1   b1c2   b1c3   b2c1   b2c2   b2c3
1   1   1      1     0     0      1      0      0      0      0      0
3   1   1      0     0     1      1      0      0      0      0      0
2   1   2      0     1     0      0      1      0      0      0      0
4   1   2     -1    -1    -1      0      1      0      0      0      0
1   1   3      1     0     0      0      0      1      0      0      0
4   1   3     -1    -1    -1      0      0      1      0      0      0
2   2   1      0     1     0      0      0      0      1      0      0
3   2   1      0     0     1      0      0      0      1      0      0
2   2   2      0     1     0      0      0      0      0      1      0
4   2   2     -1    -1    -1      0      0      0      0      1      0
1   2   3      1     0     0      0      0      0      0      0      1
3   2   3      0     0     1      0      0      0      0      0      1
I-527
Chapter 16
This example uses the MJ202 data from Milliken and Johnson (1984): changes in test
scores (DIFF) classified by GROUP (two levels), AGE (four levels), and RACE$ (three
levels: B, H, W). The nonmissing cells are numbered as follows:

                   GROUP 0                   GROUP 1
AGE         B       H       W         B       H       W
 1                          1         8               9
 2          2               3                         10
 3          4       5       6        11      12      13
 4          7                        14              15

Empty cells denote age/race combinations for which no data were collected. Numbers
within cells refer to cell designations in the Fisher LSD pairwise mean comparisons at
the end of this example.
First, fit the model. The input is:
USE mj202
GLM
CATEGORY group age race$
MODEL diff = group*age*race$ / MEANS
ESTIMATE
N: 107   Multiple R: 0.538

***WARNING***
Missing cells encountered. Tests of factors will not appear.
Ho: All means equal.

Unweighted Means Model
Analysis of Variance
Source      Sum-of-Squares   df   Mean-Square   F-ratio        P
Model          1068.546       14      76.325       2.672      0.003
Error          2627.472       92      28.559
I-528
Linear Models III: General Linear Models
We need to test the GROUP main effect. The following notation is equivalent to
Milliken and Johnson's. Because of the missing cells, the GROUP effect must be
computed over means that are balanced across the other factors.
In the drawing at the beginning of this example, notice that this specification
contrasts all the numbered cells in group 0 (except 2) with all the numbered cells in
group 1 (except 8 and 15). The input is:
HYPOTHESIS
NOTE 'GROUP MAIN EFFECT'
SPECIFY ,
group[0] age[1] race$[W] + group[0] age[2] race$[W] +,
group[0] age[3] race$[B] + group[0] age[3] race$[H] +,
group[0] age[3] race$[W] + group[0] age[4] race$[B] =,
group[1] age[1] race$[W] + group[1] age[2] race$[W] +,
group[1] age[3] race$[B] + group[1] age[3] race$[H] +,
group[1] age[3] race$[W] + group[1] age[4] race$[B]
TEST

A Matrix
          1        2        3        4        5        6        7        8
      -1.000     0.0     -1.000   -1.000   -1.000   -1.000   -1.000     0.0

          9       10       11       12       13       14       15
       1.000    1.000    1.000    1.000    1.000    1.000     0.0

Null hypothesis value for D
       0.0

Test of Hypothesis
Source              SS     df        MS           F          P
Hypothesis        75.738    1      75.738       2.652      0.107
Error           2627.472   92      28.559
I-529
Chapter 16
The computations for the AGE main effect are similar to those for the GROUP main
effect:
HYPOTHESIS
NOTE 'AGE MAIN EFFECT'
SPECIFY ,
GROUP[1] AGE[1] RACE$[B] + GROUP[1] AGE[1] RACE$[W] =,
GROUP[1] AGE[4] RACE$[B] + GROUP[1] AGE[4] RACE$[W];,
GROUP[0] AGE[2] RACE$[B] + GROUP[1] AGE[2] RACE$[W] =,
GROUP[0] AGE[4] RACE$[B] + GROUP[1] AGE[4] RACE$[W];,
GROUP[0] AGE[3] RACE$[B] + GROUP[1] AGE[3] RACE$[B] + GROUP[1] AGE[3] RACE$[W] =,
GROUP[0] AGE[4] RACE$[B] + GROUP[1] AGE[4] RACE$[B] + GROUP[1] AGE[4] RACE$[W]
TEST

A Matrix
           1        2        3        4        5        6        7        8
  1      0.0      0.0      0.0      0.0      0.0      0.0      0.0     -1.000
  2      0.0     -1.000    0.0      0.0      0.0      0.0      1.000    0.0
  3      0.0      0.0      0.0     -1.000    0.0      0.0      1.000    0.0

           9       10       11       12       13       14       15
  1     -1.000    0.0      0.0      0.0      0.0      1.000    1.000
  2      0.0     -1.000    0.0      0.0      0.0      0.0      1.000
  3      0.0      0.0     -1.000    0.0     -1.000    1.000    1.000

D Matrix
  1      0.0
  2      0.0
  3      0.0

Test of Hypothesis
Source              SS     df        MS           F          P
Hypothesis        41.526    3      13.842       0.485      0.694
Error           2627.472   92      28.559
The GROUP by AGE interaction requires more complex balancing than the main
effects. It is derived from a subset of the means in the following specified combination.
Again, check Milliken and Johnson to see the correspondence.
I-530
Linear Models III: General Linear Models
HYPOTHESIS
NOTE 'GROUP BY AGE INTERACTION'
SPECIFY ,
group[1] age[1] race$[W] - group[0] age[1] race$[W] +,
group[1] age[3] race$[B] - group[0] age[3] race$[B] +,
group[0] age[3] race$[W] - group[1] age[3] race$[W] +,
group[0] age[4] race$[B] - group[1] age[4] race$[B] = 0.0;,
group[1] age[2] race$[W] - group[0] age[2] race$[W] +,
group[1] age[3] race$[B] - group[0] age[3] race$[B] +,
group[0] age[3] race$[W] - group[1] age[3] race$[W] +,
group[0] age[4] race$[B] - group[1] age[4] race$[B] = 0.0;,
group[1] age[3] race$[B] - group[0] age[3] race$[B] +,
group[0] age[4] race$[B] - group[1] age[4] race$[B] = 0.0
TEST

A Matrix
           1        2        3        4        5        6        7        8
  1     -1.000    0.0      0.0     -1.000    0.0      1.000    1.000    0.0
  2      0.0      0.0     -1.000   -1.000    0.0      1.000    1.000    0.0
  3      0.0      0.0      0.0     -1.000    0.0      0.0      1.000    0.0

           9       10       11       12       13       14       15
  1      1.000    0.0      1.000    0.0     -1.000   -1.000    0.0
  2      0.0      1.000    1.000    0.0     -1.000   -1.000    0.0
  3      0.0      0.0      1.000    0.0      0.0     -1.000    0.0

D Matrix
  1      0.0
  2      0.0
  3      0.0

Test of Hypothesis
Source              SS     df        MS           F          P
Hypothesis        91.576    3      30.525       1.069      0.366
Error           2627.472   92      28.559
I-531
Chapter 16
The following commands are needed to produce the rest of Milliken and Johnson's
results. The remaining output is not listed.
HYPOTHESIS
NOTE RACE$
SPECIFY ,
group[0]
group[1]
group[1]
group[0]
group[1]
group[1]
MAIN EFFECT
age[2]
age[1]
age[4]
age[2]
age[1]
age[4]
group[0] age[3]
group[0] age[3]
TEST
HYPOTHESIS
NOTE GROUP*RACE$
SPECIFY ,
group[0] age[3]
group[1] age[3]
group[0] age[3]
group[1] age[3]
TEST
HYPOTHESIS
NOTE 'AGE*RACE$'
SPECIFY ,
group[1] age[1]
group[1] age[4]
race$[B] + group[0]
race$[B] + group[1]
race$[B] =,
race$[W] + group[0]
race$[W] + group[1]
race$[W];,
age[3] race$[B] +,
age[3] race$[B] +,
age[3] race$[W] +,
age[3] race$[W] +,
I-532
Linear Models III: General Linear Models
The following is the matrix of comparisons printed by GLM. The matrix of mean
differences has been omitted.
COL/
ROW GROUP
AGE
1 0
1
2 0
2
3 0
2
4 0
3
5 0
3
6 0
3
7 0
4
8 1
1
9 1
1
10 1
2
11 1
3
12 1
3
13 1
3
14 1
4
15 1
4
Using unweighted means.
Post Hoc test of DIFF
RACE$
W
B
W
B
H
W
B
B
W
W
B
H
W
B
W
1
1.000
0.662
0.638
0.725
0.324
0.521
0.706
0.197
0.563
0.049
0.018
0.706
0.018
0.914
0.090
6
1.000
0.971
0.292
0.860
0.026
0.010
0.971
0.000
0.610
0.059
11
1.000
0.213
0.466
0.082
0.219
1.000
0.974
0.323
0.455
0.827
0.901
0.274
0.778
0.046
0.016
0.901
0.007
0.690
0.096
7
1.000
0.295
0.461
0.850
0.912
0.277
0.791
0.042
0.015
0.912
0.005
0.676
0.090
8
1.000
0.161
0.167
0.527
0.082
0.342
0.004
0.002
0.527
0.000
0.908
0.008
9
1.000
0.497
0.703
0.780
0.709
0.575
0.283
0.703
0.456
0.403
0.783
10
1.000
0.543
0.939
0.392
0.213
1.000
0.321
0.692
0.516
12
1.000
0.514
0.836
0.451
0.543
0.717
0.288
0.930
13
1.000
0.303
0.134
0.939
0.210
0.594
0.447
14
1.000
0.425
0.392
0.798
0.168
0.619
15
1.000
0.321
0.692
0.516
1.000
0.124
0.344
1.000
0.238
1.000
Within group 0 (cells 1 through 7), there are no significant pairwise differences in average test
score changes. The same is true within group 1 (cells 8 through 15).
I-533
Chapter 16
Example 10
Covariance Alternatives to Repeated Measures
Analysis of covariance offers an alternative to repeated measures in a pre-post design.
You can use the pre-test as a covariate in predicting the post-test. This example shows
how to do a two-group, pre-post design:
GLM
USE filename
CATEGORY group
MODEL post = CONSTANT + group + pre
ESTIMATE
When using this design, be sure to check the homogeneity of slopes assumption. Use
the following commands to check that the interaction term, GROUP*PRE, is not
significant:
GLM
USE filename
CATEGORY group
MODEL post = CONSTANT + group + pre + group*pre
ESTIMATE
Example 11
Weighting Means
Sometimes you want to weight the cell means when you test hypotheses in ANOVA.
Suppose you have an experiment in which a few rats died before its completion. You
do not want the hypotheses tested to depend upon the differences in cell sizes (which
are presumably random). Here is an example from Morrison (1976). The data
(MOTHERS) are hypothetical profiles on three scales of mothers in each of four
socioeconomic classes.
Morrison analyzes these data with the multivariate profile model for repeated
measures. Because the hypothesis of parallel profiles across classes is not rejected, you
can test whether the profiles are level. That is, do the scales differ when we pool the
classes together?
Pooling unequal classes can be done by weighting each according to sample size or
averaging the means of the subclasses. First, let's look at the model and test the
hypothesis of equality of scale parameters without weighting the cell means.
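A sketch of the input for this model (the MOTHERS file and the variable names SCALE(1) through SCALE(3) and CLASS come from the text and output; the exact commands are not reproduced in the original):

USE mothers
GLM
CATEGORY class
MODEL scale(1 .. 3) = CONSTANT + class
PRINT=LONG
ESTIMATE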
I-534
Linear Models III: General Linear Models
   SCALE(2)    SCALE(3)
    15.619      15.857

Estimates of effects  B = (X'X)^-1 X'Y

             SCALE(1)   SCALE(2)   SCALE(3)
CONSTANT       13.700     14.550     14.988
CLASS           4.300      5.450      4.763
CLASS           0.100      0.650     -0.787
CLASS          -0.700     -0.550      0.012

C Matrix
            1          2          3
  1       1.000     -1.000      0.0
  2       0.0        1.000     -1.000
Univariate F Tests

Effect          SS     df        MS           F          P
  1           14.012    1      14.012       4.652      0.046
  Error       51.200   17       3.012
  2            3.712    1       3.712       1.026      0.325
  Error       61.500   17       3.618

Wilks' Lambda =            0.564
F-Statistic =              6.191    df =  2, 16    Prob =  0.010

Pillai Trace =             0.436
F-Statistic =              6.191    df =  2, 16    Prob =  0.010

Hotelling-Lawley Trace =   0.774
F-Statistic =              6.191    df =  2, 16    Prob =  0.010
I-535
Chapter 16
Notice that the dependent variable means differ from the CONSTANT. The
CONSTANT in this case is a mean of the cell means rather than the mean of all the
cases.
Because the sum of effects is 0 for a classification and because you do not have an
independent estimate of the fourth class effect, the weighted sum of the cell means is
equivalent to

    8*(μ + α1) + 5*(μ + α2) + 4*(μ + α3) + 4*(μ - α1 - α2 - α3)
A Matrix
            1          2          3          4
         21.000      4.000      1.000      0.0

C Matrix
            1          2          3
  1       1.000     -1.000      0.0
  2       0.0        1.000     -1.000

Univariate F Tests

Effect          SS     df        MS           F          P
  1           25.190    1      25.190       8.364      0.010
  Error       51.200   17       3.012
  2            1.190    1       1.190       0.329      0.574
  Error       61.500   17       3.618

Wilks' Lambda =            0.501
F-Statistic =              7.959    df =  2, 16    Prob =  0.004

Pillai Trace =             0.499
F-Statistic =              7.959    df =  2, 16    Prob =  0.004

Hotelling-Lawley Trace =   0.995
F-Statistic =              7.959    df =  2, 16    Prob =  0.004
This is the multivariate F statistic that Morrison gets. For these data, we prefer the
weighted means analysis because these differences in cell frequencies probably reflect
population base rates. They are not random.
Example 12
Hotelling's T-Square
You can use General Linear Model to calculate Hotelling's T-square statistic.
One-Sample Test
For example, to get a one-sample test for the variables X and Y, select both X and Y as
dependent variables.
GLM
USE filename
MODEL x, y = CONSTANT
ESTIMATE
The F test for CONSTANT is the statistic you want. It is the same as Hotelling's T2
for the hypothesis that the population means for X and Y are 0.
You can also test against the hypothesis that the means of X and Y have particular
nonzero values (for example, 10 and 15) by using:
HYPOTHESIS
DMATRIX [10 15]
TEST
I-537
Chapter 16
Two-Sample Test
For a two-sample test, you must provide a categorical independent variable that
represents the two groups. The input is:
GLM
CATEGORY group
MODEL x,y = CONSTANT + group
ESTIMATE
Example 13
Discriminant Analysis
This example uses the IRIS data file. Fisher used these data to illustrate his
discriminant function. To define the model:
USE iris
GLM
CATEGORY species
MODEL sepallen sepalwid petallen petalwid = CONSTANT +,
      species
ESTIMATE
HYPOTHESIS
EFFECT = species
SAVE canon
TEST
SYSTAT saves the canonical scores associated with the hypothesis. The scores are
stored in subscripted variables named FACTOR. Because the effects involve a
categorical variable, the Mahalanobis distances (named DISTANCE) and posterior
probabilities (named PROB) are saved in the same file. These distances are computed
in the discriminant space itself. The closer a case is to a particular group's location in
that space, the more likely it is that it belongs to that group. The probability of group
membership is computed from these distances. A variable named PREDICT that
contains the predicted group membership is also added to the file.
The output follows:
Dependent variable means

   SEPALLEN    SEPALWID    PETALLEN    PETALWID
     5.843       3.057       3.758       1.199
I-538
Linear Models III: General Linear Models
Estimates of effects  B = (X'X)^-1 X'Y

               SEPALLEN    SEPALWID    PETALLEN    PETALWID
CONSTANT         5.843       3.057       3.758       1.199
SPECIES   1     -0.837       0.371      -2.296      -0.953
SPECIES   2      0.093      -0.287       0.502       0.127
-1
Inverse contrast A(XX)
A
1
0.013
-0.007
1
2
2
0.013
-1
SEPALLEN
SEPALWID
PETALLEN
PETALWID
H = BA(A(XX)
-1
A)
AB
SEPALWID
PETALLEN
PETALWID
11.345
-57.240
-22.933
437.103
186.774
80.413
SEPALWID
PETALLEN
PETALWID
16.962
8.121
4.808
27.223
6.272
6.157
SEPALLEN
SEPALWID
PETALLEN
PETALWID
Univariate F Tests

Effect           SS     df        MS            F          P
SEPALLEN       63.212    2      31.606       119.265     0.000
  Error        38.956  147       0.265
SEPALWID       11.345    2       5.672        49.160     0.000
  Error        16.962  147       0.115
PETALLEN      437.103    2     218.551      1180.161     0.000
  Error        27.223  147       0.185
PETALWID       80.413    2      40.207       960.007     0.000
  Error         6.157  147       0.042
I-539
Chapter 16
Wilks' Lambda =            0.023
F-Statistic =            199.145    df =  8, 288    Prob =  0.000

Pillai Trace =             1.192
F-Statistic =             53.466    df =  8, 290    Prob =  0.000

Hotelling-Lawley Trace =  32.477
F-Statistic =            580.532    df =  8, 286    Prob =  0.000

THETA = 0.970

Test of Residual Roots

Roots 1 through 2
Chi-Square Statistic =   546.115    df = 8

Roots 2 through 2
Chi-Square Statistic =    36.530    df = 3

Canonical Correlations
        1          2
      0.985      0.471
Dependent variable canonical coefficients standardized
by conditional (within groups) standard deviations

                  1          2
SEPALLEN        0.427      0.012
SEPALWID        0.521      0.735
PETALLEN       -0.947     -0.401
PETALWID       -0.575      0.581
1
-0.223
0.119
-0.706
-0.633
2
0.311
0.864
0.168
0.737
3
12.446
3.685
12.767
21.079
2
-72.853
3
-104.368
The multivariate tests are all significant. The dependent variable canonical coefficients
are used to produce discriminant scores. These coefficients are standardized by the
within-groups standard deviations so you can compare their magnitude across
variables with different scales. Because they are not raw coefficients, there is no need
for a constant. The scores produced by these coefficients have an overall zero mean and
a unit standard deviation within groups.
I-540
Linear Models III: General Linear Models
The group classification coefficients and constants comprise the Fisher discriminant
functions for classifying the raw data. You can apply these coefficients to new data and
assign each case to the group with the largest function value for that case.
                 1       2       3     Total
        1       50       0       0       50
        2        0      48       2       50
        3        0       1      49       50
    Total       50      49      51      150

Test statistic
                          Value        df      Prob
Pearson Chi-square      282.593      4.000     0.000
I-541
Chapter 16
(Quick Graph: scatterplot of the canonical variable scores, FACTOR(2) against FACTOR(1), with cases marked by GROUP 1, 2, and 3.)
Prior Probabilities
In this example, there were equal numbers of flowers in each group. Sometimes the
probability of finding a case in each group is not the same across groups. To adjust the
prior probabilities for this example, specify 0.5, 0.3, and 0.2 as the priors:
PRIORS 0.5 0.3 0.2
General Linear Model uses the probabilities you specify to compute the posterior
probabilities that are saved in the file under the variable PROB. Be sure to specify a
probability for each level of the grouping variable. The probabilities should add up to 1.
Example 14
Principal Components Analysis (Within Groups)
General Linear Model allows you to partial out effects based on grouping variables and
to factor residual correlations. If between-group variation is significant, the within-group structure can differ substantially from the total structure (ignoring the grouping
variable). However, if you are just computing principal components on a single sample
(no grouping variable), you can obtain more detailed output using the Factor Analysis
procedure.
I-542
Linear Models III: General Linear Models
The following data (USSTATES) comprise death rates by cause from nine census
divisions of the country for that year. The divisions are in the column labeled DIV, and
the U.S. Post Office two-letter state abbreviations follow DIV. Other variables include
ACCIDENT, CARDIO, CANCER, PULMONAR, PNEU_FLU, DIABETES, LIVER,
STATE$, FSTROKE, MSTROKE.
The variation in death rates between divisions in these data is substantial. Here is a
grouped box plot of the second variable, CARDIO, by division. The other variables
show similar regional differences.
(Grouped box plot of CARDIO by DIVISION$: cardiovascular death rates, ranging from roughly 100 to 500, plotted separately for the nine census divisions.)
If you analyze these data ignoring DIVISION$, the correlations among death rates
would be due substantially to between-division differences. You might want to
examine the pooled within-region correlations to see if the structure is different when
divisional differences are statistically controlled. Accordingly, you will factor the
residual correlation matrix after regressing medical variables onto an index variable
denoting the census regions. The input is:
USE usstates
GLM
CATEGORY division
MODEL accident cardio cancer pulmonar pneu_flu,
diabetes liver fstroke mstroke = CONSTANT + division
ESTIMATE
HYPOTHESIS
EFFECT = division
FACTOR = ERROR
TYPE
= CORR
ROTATE = 2
TEST
I-543
Chapter 16
The hypothesis commands compute the principal components on the error (residual)
correlation matrix and rotate the first two components to a varimax criterion. For other
rotations, use the Factor Analysis procedure.
The FACTOR options can be used with any hypothesis. Ordinarily, when you test a
hypothesis, the matrix product INV(G)*H is factored and the latent roots of this matrix
are used to construct the multivariate test statistic. However, you can indicate which
matrix (the hypothesis matrix H or the error matrix G) is to be factored. By
computing principal components on the hypothesis or error matrix separately, FACTOR
offers a direct way to compute principal components on residuals of any linear model
you wish to fit. You can use any A, C, and/or D matrices in the hypothesis you are
factoring, or you can use any of the other commands that create these matrices.
The hypothesis output follows:
Factoring Error Matrix
1
2
3
4
5
6
7
8
9
1
1.000
0.280
0.188
0.307
0.113
0.297
-0.005
0.402
0.495
6
7
8
9
6
1.000
-0.025
-0.151
-0.076
2
1.000
0.844
0.676
0.448
0.419
0.251
-0.202
-0.119
7
3
1.000
0.711
0.297
0.526
0.389
-0.379
-0.246
1.000
0.396
0.296
0.252
-0.190
-0.127
1.000
-0.225
-0.203
1.000
0.947
1.000
1
3.341
2
2.245
3
1.204
4
0.999
6
0.364
7
0.222
8
0.119
9
0.033
1.000
-0.123
-0.138
-0.110
-0.071
Latent roots
5
0.475
I-544
Linear Models III: General Linear Models
Loadings
1
2
3
4
5
6
7
8
9
1
0.191
0.870
0.934
0.802
0.417
0.512
0.391
-0.518
-0.418
2
0.798
0.259
0.097
0.247
0.146
0.218
-0.175
0.795
0.860
3
0.128
-0.097
0.112
-0.135
-0.842
0.528
0.400
0.003
0.025
4
-0.018
0.019
0.028
0.120
-0.010
-0.580
0.777
0.155
0.138
1
2
3
4
5
6
7
8
9
6
0.106
0.145
0.039
-0.499
0.216
0.093
0.154
-0.041
0.005
7
-0.100
-0.254
-0.066
0.085
0.220
0.241
0.159
0.056
0.035
8
-0.019
0.177
-0.251
0.044
-0.005
0.063
0.046
0.081
-0.101
9
-0.015
0.028
-0.058
0.015
-0.002
0.010
0.009
-0.119
0.117
5
-0.536
0.219
0.183
-0.071
-0.042
0.068
-0.044
0.226
0.204
Notice the sorted, rotated loadings. When interpreting these values, do not relate the
row numbers (1 through 9) to the variables. Instead, find the corresponding loading in
the Rotated Loadings table. The ordering of the rotated loadings corresponds to the
order of the model variables.
The first component rotates to a dimension defined by CANCER, CARDIO,
PULMONAR, and DIABETES; the second, by a dimension defined by MSTROKE and
FSTROKE (male and female stroke rates). ACCIDENT also loads on the second factor
but is not independent of the first. LIVER does not load highly on either factor.
I-545
Chapter 16
Example 15
Canonical Correlation Analysis
Suppose you have 10 dependent variables, MMPI(1) to MMPI(10), and 3 independent
variables, RATER(1) to RATER(3). Enter the following commands to obtain the
canonical correlations and dependent canonical coefficients:
USE datafile
GLM
MODEL mmpi(1 .. 10) = CONSTANT + rater(1) + rater(2) + rater(3)
ESTIMATE
PRINT=LONG
HYPOTHESIS
STANDARDIZE
EFFECT=rater(1) & rater(2) & rater(3)
TEST
The canonical correlations are displayed; if you want, you can rotate the dependent
canonical coefficients by using the Rotate option.
To obtain the coefficients for the independent variables, run GLM again with the model
reversed:
MODEL rater(1 .. 3) = CONSTANT + mmpi(1) + mmpi(2),
+ mmpi(3) + mmpi(4) + mmpi(5),
+ mmpi(6) + mmpi(7) + mmpi(8),
+ mmpi(9) + mmpi(10)
ESTIMATE
HYPOTHESIS
STANDARDIZE = TOTAL
EFFECT = mmpi(1) & mmpi(2) & mmpi(3) & mmpi(4) &,
mmpi(5) & mmpi(6) & mmpi(7) & mmpi(8) &,
mmpi(9) & mmpi(10)
TEST
Example 16
Mixture Models
Mixture models decompose the effects of mixtures of variables on a dependent
variable. They differ from ordinary regression models because the independent
variables sum to a constant value. The regression model, therefore, does not include a
constant, and the regression and error sums of squares have one less degree of freedom.
Marquardt and Snee (1974) and Diamond (1981) discuss these models and their
estimation.
I-546
Linear Models III: General Linear Models
Here is an example using the PUNCH data file from Cornell (1985). The study
involved effects of various mixtures of watermelon, pineapple, and orange juice on
taste ratings by judges of a fruit punch. The input is:
USE punch
GLM
MODEL taste = watrmeln + pineappl + orange + ,
watrmeln*pineappl + watrmeln*orange + ,
pineappl*orange
ESTIMATE / MIX
N: 18
Multiple R: 0.969
Coefficient
Std Error
4.600
6.333
7.100
0.134
0.134
0.134
0.667
0.667
0.667
34.322
47.255
52.975
P(2 Tail)
0.000
0.000
0.000
2.400
0.657
0.320
0.667
3.655
0.003
1.267
0.657
0.169
0.667
1.929
0.078
-2.200
0.657
-0.293
0.667
-3.351
0.006
Analysis of Variance
Source        Sum-of-Squares   df   Mean-Square   F-ratio        P
Regression        9.929         5      1.986       36.852      0.000
Residual          0.647        12      0.054
Not using a mixture model produces a much larger R (0.999) and an F value of
2083.371, both of which are inappropriate for these data. Notice that the Regression
Sum-of-Squares has five degrees of freedom instead of six as in the usual zero-intercept
regression model. We have lost one degree of freedom because the predictors sum to 1.
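To see the inappropriate fit the text describes, you could rerun the same model without the MIX option. A sketch (the only change from the commands above is the ESTIMATE line):

MODEL taste = watrmeln + pineappl + orange +,
      watrmeln*pineappl + watrmeln*orange +,
      pineappl*orange
ESTIMATE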
Example 17
Partial Correlations
Partial correlations are easy to compute with General Linear Model. The partial
correlation of two variables (a and b) controlling for the effects of a third (c) is the
correlation between the residuals of each (a and b) after each has been regressed on the
third (c). You can therefore use General Linear Model to compute an entire matrix of
partial correlations.
I-547
Chapter 16
For example, to compute the matrix of partial correlations for Y1, Y2, Y3, Y4, and
Y5, controlling for the effects of X, select Y1 through Y5 as dependent variables and X
as the independent variable. The input follows:
GLM
MODEL y(1 .. 5) = CONSTANT + x
PRINT=LONG
ESTIMATE
Look for the Residual Correlation Matrix in the output; it is the matrix of partial
correlations among the y's given x. If you want to compute partial correlations for
several x's, just select them (also) as independent variables.
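For instance, a sketch with three control variables (the names X1 to X3 are hypothetical):

GLM
MODEL y(1 .. 5) = CONSTANT + x1 + x2 + x3
PRINT=LONG
ESTIMATE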
Computation
Algorithms
Centered sums of squares and cross products are accumulated using provisional
algorithms. Linear systems, including those involved in hypothesis testing, are solved
by using forward and reverse sweeping (Dempster, 1969). Eigensystems are solved
with Householder tridiagonalization and implicit QL iterations. For further
information, see Wilkinson and Reinsch (1971) or Chambers (1977).
References
Chambers, J.M. (1977). Computational methods for data analysis. New York: John
Wiley & Sons, Inc.
Cochran, W. G. and Cox, G. M. (1957). Experimental designs, 2nd ed. New York: John
Wiley & Sons, Inc.
Cohen, J. and Cohen, P. (1983). Applied multiple regression/correlation analysis for the
behavioral sciences. 2nd ed. Hillsdale, N.J.: Lawrence Erlbaum.
Cornell, J.A. (1985). Mixture experiments. In Kotz, S., and Johnson, N.L. (Eds.),
Encyclopedia of statistical sciences, vol. 5, 569-579. New York: John Wiley & Sons,
Inc.
Dempster, A.P. (1969). Elements of continuous multivariate analysis. San Francisco:
Addison-Wesley.
Diamond, W.J. (1981). Practical experiment designs for engineers and scientists.
Belmont, CA: Lifetime Learning Publications.
I-548
Linear Models III: General Linear Models
Chapter
17
Logistic Regression
Dan Steinberg and Phillip Colla
econometric discrete choice model, general linear (Wald) hypothesis testing, score
tests, odds ratios and confidence intervals, forward, backward and interactive
stepwise regression, Pregibon regression diagnostics, prediction success and
classification tables, independent variable derivatives and elasticities, model-based
simulation of response curves, deciles of risk tables, options to specify start values
and to separate data into learning and test samples, quasi-maximum likelihood
standard errors, control of significance levels for confidence interval calculations,
zero/one dependent variable coding, choice of reference group in automatic dummy
variable generation, and integrated plotting tools.
Many of the results generated by modeling, testing, or diagnostic procedures can
be saved to SYSTAT data files for subsequent graphing and display with the graphics
routines. LOGIT and PROBIT are aliases to the categorical multivariate general
modeling module called CMGLH, just as ANOVA, GLM, and REGRESSION are aliases
to the multivariate general linear module called MGLH.
Statistical Background
The LOGIT module is SYSTAT's comprehensive program for logistic regression
analysis and provides tools for model building, model evaluation, prediction,
simulation, hypothesis testing, and regression diagnostics. The program is designed
to be easy for the novice and can produce the results most analysts need with just three
simple commands. In addition, many advanced features are also included for
sophisticated research projects. Beginners can skip over any unfamiliar concepts and
gradually increase their mastery of logistic regression by working through the tools
incorporated here.
I-550
I-551
Chapter 17
LOGIT will estimate binary (Cox, 1970), multinomial (Anderson, 1972), conditional
logistic regression models (Breslow and Day, 1980), and the discrete choice model
(Luce, 1959; McFadden, 1973). The LOGIT framework is designed for analyzing the
determinants of a categorical dependent variable. Typically, the dependent variable is
binary and coded as 0 or 1; however, it may be multinomial and coded as an integer
ranging from 1 to k or 0 to k - 1.
Studies you can conduct with LOGIT include bioassay, epidemiology of disease
(cohort or case-control), clinical trials, market research, transportation research (mode
of travel), psychometric studies, and voter-choice analysis. The LOGIT module can also
be used to analyze ranked choice information once the data have been suitably
transformed (Beggs, Cardell, and Hausman, 1981).
This chapter contains a brief introduction to logistic regression and a description of
the commands and features of the module. If you are unfamiliar with logistic
regression, the textbook by Hosmer and Lemeshow (1989) is an excellent place to
begin; Breslow and Day (1980) provide an introduction in the context of case-control
studies; Train (1986) and Ben-Akiva and Lerman (1985) introduce the discrete-choice
model for econometrics; Wrigley (1985) discusses the model for geographers; and
Hoffman and Duncan (1988) review discrete choice in a demographic-sociological
context. Valuable surveys appear in Amemiya (1981), McFadden (1984, 1982, 1976),
and Maddala (1983).
Binary Logit
Although logistic regression may be applied to any categorical dependent variable, it
is most frequently seen in the analysis of binary data, in which the dependent variable
takes on only two values. Examples include survival beyond five years in a clinical
trial, presence or absence of disease, responding to a specified dose of a toxin, voting
for a political candidate, and participating in the labor force. The figure below
compares the ordinary least-squares linear model to the basic binary logit model on the
same data. Notice some features of the linear model in the upper panel of the figure:
• The linear model predicts values of y from minus to plus infinity. If the prediction is
read as a probability, values below 0 or above 1 make no sense for a binary
response. More generally, it does not appear to approach the data values very well.
We shouldn't blame the linear model for this; it is doing its job as a regression
estimator by shrinking back toward the mean of y for all x values (0.5). The linear
model is simply not designed to come near the data.
I-552
Logistic Regression
The lower panel illustrates a logistic model. By contrast, it is designed to fit binary
data, either when y is assumed to represent a probability distribution or when it is
taken simply as a binary measure we are attempting to predict.
Despite the difference in their graphical appearance, the linear and logit models are
only slight variants of one another. Assuming the possibility of more than one predictor
(x) variable, the linear model is:
y = Xb + e
where y is a vector of observations, X is a matrix of predictor scores, and e is a vector
of errors.
The logit model is:
y = exp(Xb + e) / [1 + exp(Xb + e)]

where the exponential function is applied to the vector argument. Rearranging terms,
we have:

y / (1 - y) = exp(Xb + e)

and logging both sides of the equation, we have:

log[y / (1 - y)] = Xb + e
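As a quick illustrative check of this algebra (numbers chosen for illustration, not taken from the original text): if Xb = 0, the logit model gives y = exp(0)/[1 + exp(0)] = 0.5; if Xb = 2, it gives y = exp(2)/[1 + exp(2)], or about 0.88. The corresponding log-odds are log(0.5/0.5) = 0 and log(0.88/0.12), which is approximately 2, recovering Xb.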
I-553
Chapter 17
Multinomial Logit
Multinomial logit is a logistic regression model having a dependent variable with
more than two levels (Agresti, 1990; Santer and Duffy, 1989; Nerlove and Press,
1973). Examples of such dependent variables include political preference (Democrat,
Republican, Independent), health status (healthy, moderately impaired, seriously
impaired), smoking status (current smoker, former smoker, never smoked), and job
classification (executive, manager, technical staff, clerical, other). Outside of the
difference in the number of levels of the dependent variable, the multinomial logit is
very similar to the binary logit, and most of the standard tools of interpretation,
analysis, and model selection can be applied. In fact, the polytomous unordered logit
we discuss here is essentially a combination of several binary logits estimated
simultaneously (Begg and Gray, 1984). We use the term polytomous to differentiate
this model from the conditional logistic regression and discrete choice models
discussed below.
There are important differences between binary and multinomial models. Chiefly,
the multinomial output is more complicated than that of the binary model, and care
must be taken in the interpretation of the results. Fortunately, LOGIT provides some
new tools that make the task of interpretation much easier. There is also a difference in
dependent variable coding. The binary logit dependent variable is normally coded 0 or
1, whereas the multinomial dependent can be coded 1, 2, ..., k , (that is, it starts at 1
rather than 0) or 0, 1, 2, ..., k 1 .
Conditional Logit
The conditional logistic regression model has become a major analytical tool in
epidemiology since the work of Prentice and Breslow (1978), Breslow et al. (1978),
Prentice and Pyke (1979), and the extended treatment of case-control studies in
Breslow and Day (1980). A mathematically similar model with the same name was
introduced independently and from a rather different perspective by McFadden (1973)
in econometrics. The models have since seen widespread use in the considerably
different contexts of biomedical research and social science, with parallel literatures on
sampling, estimation techniques, and statistical results. In epidemiology, conditional
logit is used to estimate relative risks in matched sample case-control studies (Breslow,
1982), whereas in econometrics a similar likelihood function is used to model
consumer choices as a function of the attributes of alternatives. We begin this section
with a treatment of the biomedical use of the conditional logistic model. A separate
I-554
Logistic Regression
section on the discrete choice model covers the econometric version and contains
certain fine points that may be of interest to all readers. A discussion of parallels in the
two literatures appears in Steinberg (1991).
In the traditional conditional logistic regression model, you are trying to measure
the risk of disease corresponding to different levels of exposure to risk factors. The data
have been collected in the form of matched sets of cases and controls, where the cases
have the disease, the controls do not, and the sets are matched on background variables
such as age, sex, marital status, education, residential location, and possibly other
health indicators. The matching variables combine to form strata over which relative
risks are to be estimated; thus, for example, a small group of persons of a given age,
marital status, and health history will form a single stratum. The matching variables
can also be thought of as proxies for a larger set of unobserved background variables
that are assumed to be constant within strata. The logit for the jth individual in the ith
stratum can be written as:
logit(p_ij) = a_i + b'X_ij
where X ij is the vector of exposure variables and a i is a parameter dedicated to the
stratum. Since case-control studies will frequently have a large number of small
matched sets, the a i are nuisance parameters that can cause problems in estimation
(Cox and Hinkley, 1974). In the example discussed below, there are 63 matched sets,
each consisting of four cases and one control, with information on seven exposure
variables for every subject.
The problem with estimating an unconditional model for these data is that we would
need to include 63 - 1 = 62 dummy variables for the strata. This would leave us with
possibly 70 parameters being estimated for a data set with only 315 observations.
Furthermore, increasing the sample size will not help because an additional stratum
parameter would have to be estimated for each additional matched set in the study
sample. By working with the appropriate conditional likelihood, however, the nuisance
parameters can be eliminated, simplifying estimation and protecting against potential
biases that may arise in the unconditional model (Cox, 1975; Chamberlain, 1980). The
conditional model requires estimation only of the relative risk parameters of interest.
LOGIT allows the estimation of models for matched sample case-control studies
with one case and any number of controls per set. Thus, matched pair studies, as well
as studies with varying numbers of controls per case, are easily handled. However, not
all commands discussed so far are available for conditional logistic regression.
I-555
Chapter 17
Ui = b1*Ti + b2*Ci + ei
where i = 1, 2, 3 represents private auto, car pool, and train, respectively. In this
random utility model, the utility U i of the ith alternative is determined by the travel
time T i , the cost C i of that alternative, and a random error term, e i . Utility of an
alternative is assumed not to be influenced by the travel times or costs of other
alternatives available, although choice will be determined by the attributes of all
available alternatives. In addition to the alternative characteristics, utility is sometimes
also determined by an alternative specific constant.
The choice model specifies that an individual will choose the alternative with the
highest utility as determined by the equation above. Because of the random
component, we are reduced to making statements concerning the probability that a
given choice is made. If the error terms are distributed as i.i.d. extreme value, it can be
shown that the probability of the ith alternative being chosen is given by the familiar
logit formula
I-556
Logistic Regression
Prob(Ui > Uj for all j ≠ i) = exp(Xi*b) / Σj exp(Xj*b)
Suppose that for the first few cases our data are as follows:
Subject   Time(auto)   Cost(auto)   Time(pool)   Cost(pool)   Time(train)   Cost(train)   Sex      Age
   1          20           3.50         35           2.00          65           1.10      Male      27
   2          45           6.00         65           3.00          65           1.00      Female    35
   3          15           1.00         30           0.50          60           1.00      Male      22
   4          60           5.50         70           2.00          90           2.00      Male      45
   5          30           4.25         40           1.75          55           1.50      Male      52
The third record has a person who chooses to go to work by private auto (choice = 1);
when he drives, it takes 15 minutes to get to work and costs one dollar. Had he
carpooled instead, it would have taken 30 minutes to get to work and cost 50 cents. The
train would have taken an hour and cost one dollar. For this case, the utility of each
option is given by
U(private auto) = b1*15 + b2*1.00 + error(1,3)
U(car pool)     = b1*30 + b2*0.50 + error(2,3)
U(train)        = b1*60 + b2*1.00 + error(3,3)
The error term has two subscripts, one pertaining to the alternative and the other
pertaining to the individual. The error is individual-specific and is assumed to be
independent of any other error or variable in the data set. The parameters b 1 and b 2
are common utility weights applicable to all individuals in the sample. In this example,
these are the only parameters, and their number does not depend on the number of
alternatives individuals can choose from. If a person also had the option of walking to
work, we would expand the model to include this alternative with
U(walking)      = b1*70 + b2*0.00 + error(4,3)
and we would still be dealing with only the two regression coefficients b 1 and b 2 .
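To make the mechanics concrete, the choice probabilities implied by these utilities can be computed directly from the logit formula. The sketch below (Python, not SYSTAT) uses the travel times and costs of the third subject and two illustrative utility weights, roughly the size of the TIME and COST estimates reported for the discrete choice example later in this chapter.

import math

def choice_probabilities(utilities):
    # multinomial logit: P(i) = exp(V_i) / sum over j of exp(V_j)
    e = [math.exp(v) for v in utilities]
    total = sum(e)
    return [x / total for x in e]

# Systematic utilities V_i = b1*time + b2*cost for the third subject:
# auto (15 min, $1.00), car pool (30 min, $0.50), train (60 min, $1.00).
b1, b2 = -0.02, -0.09          # illustrative weights, not estimates from these data
times = [15, 30, 60]
costs = [1.00, 0.50, 1.00]
V = [b1 * t + b2 * c for t, c in zip(times, costs)]
print(choice_probabilities(V)) # private auto receives the largest probability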
This highlights a major difference between the discrete choice and standard
polytomous logit models. In polytomous logit, the number of parameters grows with
the number of alternatives; if the value of NCAT is increased from 3 to 4, a whole new
vector of parameters is estimated. By contrast, in the discrete choice model without a
constant, increasing the number of alternatives does not increase the number of
discrete choice parameters estimated.
U_i = b_0i + b1*T_i + b2*C_i + e_i
where i = 1, 2, 3 represents private auto, car pool, and train, respectively. The
constant here, b_0i, is alternative-specific, with a separate one estimated for each
alternative: b_01 corresponds to private auto; b_02, to car pooling; and b_03, to train. Like
polytomous logit, the constant pertaining to the reference group is normalized to 0 and
is not estimated.
An alternative specific CONSTANT is entered into a discrete choice model to capture
unmeasured desirability of an alternative. Thus, the first constant could reflect the
convenience and comfort of having your own car (or in some cities the inconvenience
of having to find a parking space), and the second might reflect the inflexibility of
schedule associated with shared vehicles. With NCAT=3, the third constant will be
normalized to 0.
Stepwise Logit
Automatic model selection can be extremely useful for analyzing data with a large
number of covariates for which there is little or no guidance from previous research.
For these situations, LOGIT supports stepwise regression, allowing forward, backward,
mixed, and interactive covariate selection, with full control over forcing, selection
criteria, and candidate variables (including interactions). The procedure is based on
Peduzzi, Holford, and Hardy (1980).
Stepwise regression results in a model that cannot be readily evaluated using
conventional significance criteria in hypothesis tests, but the model may prove useful
for prediction. We strongly suggest that you separate the sample into learning and test
sets for assessment of predictive accuracy before fitting a model to the full data set. See
the cautionary discussion and references in Chapter 14.
n Dependent. Select the variable you want to examine.
n Independent(s). Select the independent variables. To add an
interaction to your model, use the Cross button. For example, to add the term
SEX*EDUCATION, add SEX to the Independent list and then add EDUCATION by
clicking Cross.
n Conditional(s). Select conditional variables. To add interactive conditional
variables to your model, use the Cross button. For example, to add the term
SEX*EDUCATION, add SEX to the Conditional list and then add EDUCATION by
clicking Cross.
n Include constant. Includes the constant in the model. Deselect this option
to obtain a model through the origin. When in doubt, include the constant.
n Prediction table. Produces a prediction-of-success table.
n QML. Produces the quasi-maximum likelihood covariance matrix,
adjusted after the first iteration. If this matrix is calculated, it
will be used during subsequent hypothesis testing and will affect t ratios for
estimated parameters.
n Save file. Saves specified statistics in filename.SYD.
Click the Options button to go to the Categories, Discrete Choice, and Estimation
Options dialog boxes.
Categories
You must specify numeric or string grouping variables that define cells; list every
categorical variable for which the logistic regression analysis should generate design
variables.
Dummy. Produces dummy codes for the design variables instead of effect codes.
(Effect coding, the default, is the classic analysis of variance parameterization, in
which the sum of effects estimated for a classifying variable is 0.) If your categorical
variable has k categories, k - 1 dummy variables are created.
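A minimal sketch of the two coding schemes for a three-level grouping variable (hypothetical helper functions, not SYSTAT code); in both schemes the highest level serves as the reference:

def dummy_codes(level, k):
    # k-1 indicator columns; the highest (reference) level is all zeros
    return [1 if level == j else 0 for j in range(1, k)]

def effect_codes(level, k):
    # k-1 effect columns; the highest (reference) level is -1 in every column
    if level == k:
        return [-1] * (k - 1)
    return [1 if level == j else 0 for j in range(1, k)]

for level in (1, 2, 3):
    print(level, dummy_codes(level, 3), effect_codes(level, 3))
# 1 [1, 0] [1, 0]
# 2 [0, 1] [0, 1]
# 3 [0, 0] [-1, -1]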
Discrete Choice
The discrete choice framework is designed specifically to model an individual's
choices in response to the characteristics of the choices. Characteristics of choices are
attributes such as price, travel time, horsepower, or calories; they are features of the
alternatives that an individual might choose from. You can define set names for groups
of variables, and create, edit, or delete variables.
Set Name. Specifies conditional variables. Enter a set name and then you can add and
cross variables. To create a new set, click New. Repeat this process until you have
defined all of your sets. You can edit existing sets by highlighting the name of the set
in the Set Name drop-down list. To delete a set, select the set in the drop-down list and
click Delete. When you click Continue, SYSTAT will check that each set name has a
definition. If a set name exists but no variables were assigned to it, the set is discarded
and the set name will not be in the drop-down list when you return to this dialog box.
Alternatives for discrete choice. Specify the variable that gives the number of
alternatives available to each subject. This field is needed only when the number of
alternatives in a choice model varies per subject.
Number of categories. Specify the number of categories or alternatives the variable has.
This is needed only for the by-choice data layout where the values of the dependent
variable are not explicitly coded. This is only enabled when the Alternatives for discrete
choice field is not empty.
Options
The Logit Options dialog box allows you to specify convergence and a tolerance level,
select complete or stepwise entry, and specify entry and removal criteria.
Converge. Specifies the largest relative change in any coordinate before iterations
terminate.
Tolerance. Prevents the entry of a variable that is highly correlated with the
independent variables already included in the model. Enter a value between 0 and 1.
Typical values are 0.01 or 0.001. The higher the value (closer to 1), the lower the
correlation required to exclude a variable.
Estimation. Controls the method used to enter and remove variables from the equation.
n Complete. All independent variables are entered in a single step.
n Stepwise. Allows forward, backward, mixed, and interactive covariates selection,
with full control over forcing, selection criteria, and candidates, including
interactions. It results in a model that can be useful for prediction.
Stepwise Options. The following alternatives are available for stepwise entry and
removal:
n Backward. Begins with all candidate variables in the model. At each step, SYSTAT
automatically removes a variable from the model.
n Forward. Begins with no candidate variables in the model. At each step, SYSTAT
automatically adds a variable to the model.
n Interactive. Allows you to use your own judgment in selecting variables for
addition or deletion.
Probability. You can also control the criteria used to enter variables into and remove
variables from the model:
n Enter. Enters a variable into the model if its alpha value is less than the specified
value.
n Remove. Removes a variable from the model if its alpha value is greater than the
specified value.
Deciles of Risk
After you successfully estimate your model using logistic regression, you can calculate
deciles of risk. This will help you make sure that your model fits the data and that the
results are not unduly influenced by a handful of unusual observations. In using the
deciles of risk table, please note that the goodness-of-fit statistics will depend on the
grouping rule specified.
The derivative table provides a global view of covariate effects that is not easily seen
when considering each binary submodel separately. In fact, the overall effect of a
covariate on the probability of an outcome can be of the opposite sign of its coefficient
estimate in the corresponding submodel. This is because the submodel concerns only
two of the outcomes, whereas the derivative table considers all outcomes at once.
n Based on equal counts per bin. Allocates approximately equal numbers of
observations to each cell. Enter the number of cells or bins in the Number of bins
text box.
Quantiles
After estimating your model, you can calculate quantiles for any single-predictor in the
model. Quantiles of unadjusted data can be useful in assessing the suitability of a
functional form when you are interested in the unconditional distribution of the failure
times.
Covariate(s). The Covariate(s) list contains all of the variables specified in the
Independent list in the main Logit dialog box. You can set any of the covariates to a
fixed value by selecting the variable in the Covariates list and entering a value in the
Value text box. This constraint appears as variable name = value in the Fixed Value
Settings list after you click Add. The quantiles for the desired variable correspond to a
model in which the covariates are fixed at these values. Any covariates not fixed to a
value are assigned the value of 0.
Quantile Value Variable. By default, the first variable in the Independent variable list in
the main dialog box is shown in this field. You can change this to any variable from
the list. This variable name is then issued as the argument for the QNTL command.
Simulation
SYSTAT allows you to generate and save predicted probabilities and odds ratios, using
the last model estimated to evaluate a set of logits. The logits are calculated from a
combination of fixed covariate values that you specify in this dialog box.
Covariate(s). The Covariate(s) list contains all of the variables specified in the
Independent list on the main Logit dialog box. Select a covariate, enter a fixed value
for the covariate in the Value text box, and click Add.
Value. Enter the value over which the parameters of the simulation are to vary.
Fixed value settings. This box lists the fixed values on the covariates from which the
logits are calculated.
When you click OK, SYSTAT prompts you to specify a file to which the simulation
results will be saved.
Hypothesis
After you successfully estimate your model using logistic regression, you can perform
post hoc analyses.
Enter the hypotheses that you would like to test. All the hypotheses that you list will
be tested jointly in a single test. To test each restriction individually, you will have to
revisit this dialog box each time. To reference dummies generated from categorical
covariates, use square brackets, as in:
RACE [ 1 ] = 0
You can reproduce the Wald version of the t ratio by testing whether a coefficient is 0:
AGE = 0
If you don't specify a sub-vector, the first is assumed; thus, the constraint above is
equivalent to:
AGE { 1 } = 0
Using Commands
After selecting a file with USE filename, continue with:
LOGIT
CATEGORY grpvarlist / MISS EFFECT DUMMY
NCAT=n
ALT var
SET parameter=condvarlist
MODEL depvar = CONSTANT + indvarexp
depvar = condvarlist;polyvarlist
ESTIMATE / PREDICT TOLERANCE=d CONVERGE=d QML MEANS CLASS
DERIVATIVE=INDIVIDUAL or AVERAGE
or
START / BACKWARD FORWARD ENTER=d REMOVE=d FORCE=n
MAXSTEP=n
STEP var or + or - / AUTO
(sequence of STEPs)
STOP
SAVE
DC / SMART=n P=p1,p2,
QNTL var / covar=d covar=d
SIMULATE var1=d1, var2=d2, / DO var1=d1,d2,d3, var2=d1,d2,d3
HYPOTHESIS
CONSTRAIN argument
TEST
Usage Considerations
Types of data. LOGIT uses rectangular data only. The dependent variable is
automatically taken to be categorical. To change the order of the categories, use the
ORDER statement. For example,
ORDER CLASS / SORT=DESCENDING
LOGIT can also handle categorical predictor variables. Use the CATEGORY statement
to create them, and use the EFFECTS or DUMMY options of CATEGORY to determine
the coding method. Use the ORDER command to change the order of the categories.
Print options. For PRINT=SHORT, the output gives N, the type of association, parameter
estimates, and associated tests. PRINT=LONG gives, in addition to the above results, a
correlation matrix of the parameter estimates.
Quick Graphs. LOGIT produces no Quick Graphs. Use the saved files from ESTIMATE
or DC to produce diagnostic plots and fitted curves. See the examples.
Saving files. LOGIT saves simulation results, quantiles, or residuals and estimated
values.
BY groups. LOGIT analyzes data by groups.
Bootstrapping. Bootstrapping is not available in this procedure.
Case frequencies. LOGIT uses the FREQ variable, if present, to weight cases. This
inflates the total degrees of freedom to be the sum of the number of frequencies. Using
a FREQ variable does not require more memory, however. Cases whose value on the
FREQ variable are less than or equal to 0 are deleted from the analysis. The FREQ
variable may take non-integer values. When the FREQ command is in effect, separate
unweighted and weighted case counts are printed.
Weighting can be used to compensate for sampling schemes that stratify on the
covariates, giving results that more accurately reflect the population. Weighting is also
useful for market share predictions from samples stratified on the outcome variable in
discrete choice models. Such samples are known as choice-based in the econometric
literature (Manski and Lerman, 1977; Manski and McFadden, 1980; Coslett, 1980) and
are common in matched-sample case-control studies where the cases are usually oversampled, and in market research studies where persons who choose rare alternatives
are sampled separately.
Case weights. LOGIT does not allow case weighting.
Examples
The following examples begin with the simple binary logit model and proceed to more
complex multinomial and discrete choice logit models. Along the way, we will
examine diagnostics and other options used for applications in various fields.
Example 1
Binary Logit
To illustrate the use of binary logistic regression, we take this example from Hosmer
and Lemeshow's book Applied Logistic Regression, referred to below as H&L.
Hosmer and Lemeshow consider data on low infant birth weight (LOW) as a function
of several risk factors. These include the mother's age (AGE), mother's weight during
last menstrual period (LWT), race (RACE = 1: white, RACE = 2: black, RACE = 3:
other), smoking status during pregnancy (SMOKE), history of premature labor (PTL),
hypertension (HT), uterine irritability (UI), and number of physician visits during first
trimester (FTV). The dependent variable is coded 1 for birth weights less than 2500
grams and coded 0 otherwise. These variables have previously been identified as
associated with low birth weight in the obstetrical literature.
The first model considered is the simple regression of LOW on a constant and LWD,
a dummy variable coded 1 if LWT is less than 110 pounds and coded 0 otherwise. (See
H&L, Table 3.17.) LWD and LWT are similar variable names. Be sure to note which is
being used in the models that follow.
The input is:
USE HOSLEM
LOGIT
MODEL LOW=CONSTANT+LWD
ESTIMATE
The output begins with a listing of the dependent variable and the sample split between
0 (reference) and 1 (response) for the dependent variable. A brief iteration history
follows, showing the progress of the procedure to convergence. Finally, the parameter
estimates, standard errors, standardized coefficients (popularly called t ratios), p
values, and the log-likelihood are presented.
Dependent variable: LOW

Sample split
 Category choices
    REF           130
    RESP           59
    Total         189

L-L at iteration 1 is -131.005
L-L at iteration 2 is -113.231
L-L at iteration 3 is -113.121
L-L at iteration 4 is -113.121
Log Likelihood: -113.121

Parameter       Estimate     S.E.    t-ratio   p-value
1 CONSTANT        -1.054    0.188     -5.594     0.000
2 LWD              1.054    0.362      2.914     0.004

                             95.0 % bounds
Parameter      Odds Ratio     Upper     Lower
2 LWD               2.868     5.826     1.412

Log Likelihood of constants only model = LL(0) = -117.336
2*[LL(N)-LL(0)] = 8.431 with 1 df   Chi-sq p-value = 0.004
McFadden's Rho-Squared = 0.036
Coefficients
We can evaluate these results much like a linear regression. The coefficient on LWD is
large relative to its standard error (t ratio = 2.91) and so appears to be an important
predictor of low birth weight. The interpretation of the coefficient is quite different
from ordinary regression, however. The logit coefficient tells how much the logit
increases for a unit increase in the independent variable, but the probability of a 0 or 1
outcome is a nonlinear function of the logit.
Odds Ratio
The odds-ratio table provides a more intuitively meaningful quantity for each
coefficient. The odds of the response are given by p ( 1 p ) , where p is the
probability of response, and the odds ratio is the multiplicative factor by which the
odds change when the independent variable increases by one unit. In the first model,
being a low-weight mother increases the odds of a low birth weight baby by a
multiplicative factor of 2.87, with lower and upper confidence bounds of 1.41 and 5.83,
respectively. Since the lower bound is greater than 1, the variable appears to represent
a genuine risk factor. See Kleinbaum, Kupper, and Chambliss (1982) for a discussion.
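The odds-ratio table can be checked by hand from the coefficient and its standard error, assuming the usual exp(estimate +/- 1.96*SE) construction for the 95% bounds (a sketch in Python, not SYSTAT output):

import math

est, se = 1.054, 0.362                 # LWD coefficient and standard error from the table above
odds_ratio = math.exp(est)             # about 2.87
lower = math.exp(est - 1.96 * se)      # about 1.41
upper = math.exp(est + 1.96 * se)      # about 5.83
print(round(odds_ratio, 3), round(lower, 3), round(upper, 3))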
Example 2
Binary Logit with Multiple Predictors
The binary logit example contains only a constant and a single dummy variable. We
consider the addition of the continuous variable AGE to the model.
The input is:
USE HOSLEM
LOGIT
MODEL LOW=CONSTANT+LWD+AGE
ESTIMATE / MEANS
Dependent variable: LOW

Sample split
 Category choices
    REF           130
    RESP           59
    Total         189

Means of independent variables
                      1          0    OVERALL
 CONSTANT         1.000      1.000      1.000
 LWD              0.356      0.162      0.222
 AGE             22.305     23.662     23.238

L-L at iteration 1 is -131.005
L-L at iteration 2 is -112.322
L-L at iteration 3 is -112.144
L-L at iteration 4 is -112.143
L-L at iteration 5 is -112.143
Log Likelihood: -112.143

Parameter       Estimate     S.E.    t-ratio   p-value
1 CONSTANT        -0.027    0.762     -0.035     0.972
2 LWD              1.010    0.364      2.773     0.006
3 AGE             -0.044    0.032     -1.373     0.170

                             95.0 % bounds
Parameter      Odds Ratio     Upper     Lower
2 LWD               2.746     5.607     1.345
3 AGE               0.957     1.019     0.898

Log Likelihood of constants only model = LL(0) = -117.336
2*[LL(N)-LL(0)] = 10.385 with 2 df   Chi-sq p-value = 0.006
McFadden's Rho-Squared = 0.044
We see the means of the independent variables overall and by value of the dependent
variable. In this sample, there is a substantial difference between the mean LWD across
birth weight groups but an apparently small AGE difference.
AGE is clearly not significant by conventional standards if we look at the
coefficient/standard-error ratio. The confidence interval for the odds ratio (0.898,
1.019) includes 1.00, indicating no effect in relative risk, when adjusting for LWD.
Before concluding that AGE does not belong in the model, H&L consider the
interaction of AGE and LWD.
Example 3
Binary Logit with Interactions
In this example, we fit a model consisting of a constant, a dummy variable, a
continuous variable, and an interaction. Note that it is not necessary to create a new
interaction variable; this is done for us automatically by writing the interaction on the
MODEL statement. Let's also add a prediction table for this model.
Following is the input:
USE HOSLEM
LOGIT
MODEL LOW=CONSTANT+LWD+AGE+LWD*AGE
ESTIMATE / PREDICTION
SAVE SIM319/SINGLE,SAVE ODDS RATIOS FOR H&L TABLE 3.19
SIMULATE CONSTANT=0,AGE=0,LWD=1 / DO LWD*AGE =15,45,5
USE SIM319
LIST
Dependent variable: LOW

Sample split
 Category choices
    REF           130
    RESP           59
    Total         189

L-L at iteration 1 is -131.005
L-L at iteration 2 is -110.937
L-L at iteration 3 is -110.573
L-L at iteration 4 is -110.570
L-L at iteration 5 is -110.570
Log Likelihood: -110.570

Parameter       Estimate     S.E.    t-ratio   p-value
1 CONSTANT         0.774    0.910      0.851     0.395
2 LWD             -1.944    1.725     -1.127     0.260
3 AGE             -0.080    0.040     -2.008     0.045
4 AGE*LWD          0.132    0.076      1.746     0.081

                             95.0 % bounds
Parameter      Odds Ratio     Upper     Lower
2 LWD               0.143     4.206     0.005
3 AGE               0.924     0.998     0.854
4 AGE*LWD           1.141     1.324     0.984

Log Likelihood of constants only model = LL(0) = -117.336
2*[LL(N)-LL(0)] = 13.532 with 3 df   Chi-sq p-value = 0.004
McFadden's Rho-Squared = 0.058

Prediction Success Table
                    Predicted
Actual            Response   Reference      Total
Response            21.280      37.720     59.000
Reference           37.720      92.280    130.000

Pred. Total         59.000     130.000    189.000
Correct              0.361       0.710
Success Ind.         0.049       0.022
Tot. Correct         0.601

Sensitivity:       0.361     Specificity:     0.710
False Reference:   0.639     False Response:  0.290

Simulation Vector
Fixed Parameter        Value
 1 CONSTANT              0.0
 2 LWD                 1.000
 3 AGE                   0.0

Loop Parameter       Minimum     Maximum    Increment
 4 AGE*LWD            15.000      45.000        5.000

SYSTAT save file created.
7 records written to SYSTAT save file.

Case number    LOGIT   SELOGIT    PROB     ODDS     ODDSL     ODDSU   PLOWER   PUPPER   LOOP(1)
     1          0.04      0.66    0.51     1.04      0.28      3.79     0.22     0.79     15.00
     2          0.70      0.40    0.67     2.01      0.91      4.44     0.48     0.82     20.00
     3          1.36      0.42    0.80     3.90      1.71      8.88     0.63     0.90     25.00
     4          2.02      0.69    0.88     7.55      1.95     29.19     0.66     0.97     30.00
     5          2.68      1.03    0.94    14.63      1.94    110.26     0.66     0.99     35.00
     6          3.34      1.39    0.97    28.33      1.85    432.77     0.65     1.00     40.00
     7          4.00      1.76    0.98    54.86      1.75   1724.15     0.64     1.00     45.00
Likelihood-Ratio Statistic
At this point, it would be useful to assess the model as a whole. One method of model
evaluation is to consider the likelihood-ratio statistic. This statistic tests the hypothesis
that all coefficients except the constant are 0, much like the F test reported below linear
regressions. The likelihood-ratio statistic (LR for short) of 13.532 is chi-squared with
three degrees of freedom and a p value of 0.004. The degrees of freedom are equal to
the number of covariates in the model, not including the constant. McFadden's rho-squared is a transformation of the LR statistic intended to mimic an R-squared. It is
always between 0 and 1, and a higher rho-squared corresponds to more significant
results. Rho-squared tends to be much lower than R-squared though, and a low number
does not necessarily imply a poor fit. Values between 0.20 and 0.40 are considered
very satisfactory (Hensher and Johnson, 1981).
Models can also be assessed relative to one another. A likelihood-ratio test is
formally conducted by computing twice the difference in log-likelihoods for any pair
of nested models. Commonly called the G statistic, it has degrees of freedom equal to
the difference in the number of parameters estimated in the two models. Comparing the
current model (log-likelihood -110.570) with the model without the interaction
(log-likelihood -112.143), we have G = 2 [ (-110.570) - (-112.143) ] = 3.146 with 1 degree of freedom.
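A sketch of these calculations using the log-likelihoods reported above; the tail probability comes from scipy, and McFadden's rho-squared is computed here as 1 - LL(N)/LL(0), which reproduces the printed 0.058:

from scipy.stats import chi2

ll_full    = -110.570   # model with LWD, AGE, and AGE*LWD
ll_reduced = -112.143   # model without the interaction
ll_null    = -117.336   # constants-only model, LL(0)

G = 2 * (ll_full - ll_reduced)           # nested comparison, 1 extra parameter
print(G, 1 - chi2.cdf(G, df=1))          # about 3.15, p about 0.08

LR = 2 * (ll_full - ll_null)             # overall likelihood-ratio statistic
print(LR, 1 - ll_full / ll_null)         # about 13.53, rho-squared about 0.058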
Simulation
To understand the implications of the interaction, we need to explore how the relative
risk of low birth weight varies over the typical child-bearing years. This changing
relative risk is evaluated by computing the logit difference for base and comparison
groups. The logit for the base group, mothers with LWD = 0, is written as L(0); the logit
for the comparison group, mothers with LWD = 1, is L(1). Thus,
L(0) = CONSTANT + B2*AGE
L(1) = CONSTANT + B1*LWD + B2*AGE + B3*LWD*AGE
     = CONSTANT + B1 + B2*AGE + B3*AGE
so the difference L(1) - L(0) = B1 + B3*AGE, the coefficient on LWD plus AGE multiplied
by the interaction coefficient. Evaluated for a mother of a given age, this difference is a measure of the log relative
risk due to LWD being 1. This can be calculated simply for several ages, and converted
to odds ratios with upper and lower confidence bounds, using the SIMULATE
command.
SIMULATE calculates the predicted logit, predicted probability, odds ratio, upper
and lower bounds, and the standard error of the logit for any specified values of the
covariates. In the above command, the constant and age are set to 0, because these
coefficients do not appear in the logit difference. LWD is set to 1, and the interaction
is allowed to vary from 15 to 45 in increments of five years. The only printed output
produced by this command is a summary report.
SIMULATE does not print results when a DO LOOP is specified because of the
potentially large volume of output it can generate. To view the results, use the
commands:
USE SIM319
LIST
The results give the effect of low maternal weight (LWD) on low birth weight as a
function of age, where LOOP(1) is the value of AGE * LWD (which is just AGE) and
ODDSU and ODDSL are upper and lower bounds of the odds ratio. We see that the
effect of LWD goes up dramatically with age, although the confidence interval
becomes quite large beyond age 30. The results presented here are calculated internally
within LOGIT and thus differ slightly from those reported in H&L, who use printed
output with fewer decimal places of precision to obtain their results.
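The LOGIT and ODDS columns of the listing can be reproduced, approximately, from the coefficients of the interaction model (a sketch; small discrepancies come from coefficient rounding):

import math

b_lwd, b_age_lwd = -1.944, 0.132        # coefficients from the interaction model above

for age in range(15, 50, 5):
    logit_diff = b_lwd + b_age_lwd * age      # L(1) - L(0) at this age
    print(age, round(logit_diff, 2), round(math.exp(logit_diff), 2))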
Example 4
Deciles of Risk and Model Diagnostics
Before turning to more detailed model diagnostics, we fit H&Ls final model. As a
result of experimenting with more variables and a large number of interactions, H&L
arrive at the model used here. The input is:
USE HOSLEM
LOGIT
CATEGORY RACE / DUMMY
MODEL LOW=CONSTANT+AGE+RACE+SMOKE+HT+UI+LWD+PTD+ ,
AGE*LWD+SMOKE*LWD
ESTIMATE
SAVE RESID
DC / P=0.06850,0.09360,0.15320,0.20630,0.27810,0.33140,
0.42300,0.49124,0.61146
USE RESID
PPLOT PEARSON / SIZE=VARIANCE
PLOT DELPSTAT*PROB/SIZE=DELBETA(1)
The categorical variable RACE is specified to have three levels. By default LOGIT uses
the highest category as the reference group, although this can be changed. The model
includes all of the main variables except FTV, with LWT and PTL transformed into
dummy variable variants LWD and PTD, and two interactions. To reproduce the results
of Table 5.1 of H&L, we specify a particular set of cut points for the deciles of risk
table. Some of the results are:
Variables in the SYSTAT Rectangular file are:
 ID  LOW  AGE  LWT  RACE  SMOKE  PTL  HT  UI  FTV  BWT  CASEID  PTD  LWD  RACE1

Categorical values encountered during processing are:
RACE (3 levels)
     1, 2, 3
LOW (2 levels)
     0, 1

Binary LOGIT Analysis.
Dependent variable: LOW
Input records:          189
Records for analysis:   189

Sample split
 Category choices
    REF           130
    RESP           59
    Total         189

L-L at iteration 1 is -131.005
L-L at iteration 2 is  -98.066
L-L at iteration 3 is  -96.096
L-L at iteration 4 is  -96.006
L-L at iteration 5 is  -96.006
L-L at iteration 6 is  -96.006
Log Likelihood: -96.006

Parameter          Estimate     S.E.    t-ratio   p-value
 1 CONSTANT           0.248    1.068      0.232     0.816
 2 AGE               -0.084    0.046     -1.843     0.065
 3 RACE_1            -0.760    0.464     -1.637     0.102
 4 RACE_2             0.323    0.532      0.608     0.543
 5 SMOKE              1.153    0.458      2.515     0.012
 6 HT                 1.359    0.661      2.055     0.040
 7 UI                 0.728    0.479      1.519     0.129
 8 LWD               -1.730    1.868     -0.926     0.354
 9 PTD                1.232    0.471      2.613     0.009
10 AGE*LWD            0.147    0.083      1.779     0.075
11 SMOKE*LWD         -1.407    0.819     -1.719     0.086

                                95.0 % bounds
Parameter         Odds Ratio     Upper     Lower
 2 AGE                 0.919     1.005     0.841
 3 RACE_1              0.468     1.162     0.188
 4 RACE_2              1.382     3.920     0.487
 5 SMOKE               3.168     7.781     1.290
 6 HT                  3.893    14.235     1.065
 7 UI                  2.071     5.301     0.809
 8 LWD                 0.177     6.902     0.005
 9 PTD                 3.427     8.632     1.360
10 AGE*LWD             1.159     1.363     0.985
11 SMOKE*LWD           0.245     1.218     0.049

Log Likelihood of constants only model = LL(0) = -117.336
2*[LL(N)-LL(0)] = 42.660 with 10 df   Chi-sq p-value = 0.000
McFadden's Rho-Squared = 0.182
Deciles of Risk
Records processed: 189
Sum of weights =   189.000

                   Statistic    p-value         df
Hosmer-Lemeshow*       5.231      0.733      8.000
Pearson              183.443      0.374    178.000
Deviance             192.012      0.224    178.000
* Large influence of one or more deciles may affect statistic.

Category      0.069    0.094    0.153    0.206    0.278
Resp Obs        0.0    1.000    4.000    2.000    6.000
     Exp      0.854    1.641    2.252    3.646    5.017
Ref  Obs     18.000   19.000   14.000   18.000   14.000
     Exp     17.146   18.359   15.748   16.354   14.983
Avg Prob      0.047    0.082    0.125    0.182    0.251

Category      0.331    0.423    0.491    0.611    1.000
Resp Obs      6.000    6.000   10.000    9.000   15.000
     Exp      5.566    6.816    8.570   10.517   14.122
Ref  Obs     12.000   12.000    9.000   10.000    4.000
     Exp     12.434   11.184   10.430    8.483    4.878
Avg Prob      0.309    0.379    0.451    0.554    0.743

SYSTAT save file created.
189 records written to SYSTAT save file.
[Plots produced by the PPLOT and PLOT commands appear here: a probability plot of the PEARSON residuals with symbol size proportional to VARIANCE, and a plot of DELPSTAT against PROB with symbol size proportional to DELBETA(1).]
Deciles of Risk
How well does a model fit the data? Are the results unduly influenced by a handful of
unusual observations? These are some of the questions we try to answer with our
model assessment tools. Besides the prediction success table and likelihood-ratio tests
(see the Binary Logit with Interactions example), the model assessment methods in
LOGIT include the Pearson chi-square, deviance and Hosmer-Lemeshow statistics, the
deciles of risk table, and a collection of residual, leverage, and influence quantities.
Most of these are produced by the DC command, which is invoked after estimating a
model.
The table in this example is generated by partitioning the sample into 10 groups
based on the predicted probability of the observations. The row labeled Category gives
the end points of the cells defining a group. Thus, the first group consists of all
observations with predicted probability between 0 and 0.069, the second group covers
the interval 0.069 to 0.094, and the last group contains observations with predicted
probability greater than 0.611.
The cell end points can be specified explicitly as we did or generated automatically
by LOGIT. Cells will be equally spaced if the DC command is given without any
arguments, and LOGIT will allocate approximately equal numbers of observations to
each cell when the SMART option is given, as:
DC / SMART = 10
which requests 10 cells. Within each cell, we are given a breakdown of the observed
and expected 0s (Ref) and 1s (Resp) calculated as in the prediction success table.
Expected 1s are just the sum of the predicted probabilities of 1 in the cell. In the table,
it is apparent that observed totals are close to expected totals everywhere, indicating a
fairly good fit. This conclusion is borne out by the Hosmer-Lemeshow statistic of 5.23,
which is approximately chi-squared with eight degrees of freedom. H&L discuss the
degrees of freedom calculation.
In using the deciles of risk table, it should be noted that the goodness-of-fit statistics
will depend on the grouping rule specified and that not all statistics programs will
apply the same rules. For example, some programs assign all tied probabilities to the
same cell, which can result in very unequal cell counts. LOGIT gives the user a high
degree of control over the grouping, allowing you to choose among several methods.
The table also provides the Pearson chi-square and the sum of squared deviance
residuals, assuming that each observation has a unique covariate pattern.
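For readers who want to see the mechanics, here is a rough sketch of a Hosmer-Lemeshow style calculation in Python. It bins on approximately equal counts and ignores the tie-handling and explicit cut-point options discussed above, so it will not necessarily reproduce SYSTAT's value exactly.

import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, bins=10):
    # y: 0/1 outcomes, p: predicted probabilities (both NumPy arrays)
    order = np.argsort(p)
    groups = np.array_split(order, bins)       # roughly equal counts per bin
    hl = 0.0
    for g in groups:
        obs1, exp1 = y[g].sum(), p[g].sum()
        obs0, exp0 = len(g) - obs1, len(g) - exp1
        hl += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return hl, 1 - chi2.cdf(hl, df=bins - 2)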
Regression Diagnostics
If the DC command is preceded by a SAVE command, a SYSTAT data file containing
regression diagnostics will be created (Pregibon, 1981; Cook and Weisberg, 1984).
The SAVE file contains these variables:
ACTUAL, PREDICT, PROB, LEVERAGE(1), LEVERAGE(2), PEARSON, VARIANCE,
STANDARD, DEVIANCE, DELDSTAT, DELPSTAT, DELBETA(1), DELBETA(2), DELBETA(3)
Example 5
Quantiles
In bioassay, it is common to estimate the dosage required to kill 50% of a target
population. For example, a toxicity experiment might establish the concentration of
nicotine sulphate required to kill 50% of a group of common fruit flies (Hubert, 1984).
More generally, the goal is to identify the level of a stimulus required to induce a 50%
response rate, where the response is any binary outcome variable and the stimulus is a
continuous covariate. In bioassay, stimuli include drugs, toxins, hormones, and
insecticides; the responses include death, weight gain, bacterial growth, and color
change, but the concepts are equally applicable to other sciences.
To obtain the LD50 in LOGIT, simply issue the QNTL command. However, don't
make the mistake of spelling quantile as QU, which means QUIT in SYSTAT. QNTL
will produce not only the LD50 but also a number of other quantiles as well, with upper
and lower bounds when they exist. Consider the following data from Williams (1986):
RESPONSE   LDOSE   COUNT
[data listing of the nine cases in the WILL file]
Here, RESPONSE is the dependent variable, LDOSE is the logarithm of the dose
(stimulus), and COUNT is the number of subjects with that response. The model
estimated is:
USE WILL
FREQ=COUNT
LOGIT
MODEL RESPONSE=CONSTANT+LDOSE
ESTIMATE
QNTL
Sample split
 Category choices     Count   Weighted Count
                          5           15.000
                          4           10.000
    Total                 9           25.000

L-L at iteration 1 is -17.329
L-L at iteration 2 is -13.277
L-L at iteration 3 is -13.114
L-L at iteration 4 is -13.112
L-L at iteration 5 is -13.112
Log Likelihood: -13.112

Parameter       Estimate     S.E.    t-ratio   p-value
1 CONSTANT         0.564    0.496      1.138     0.255
2 LDOSE            0.919    0.394      2.334     0.020

                             95.0 % bounds
Parameter      Odds Ratio     Upper     Lower
2 LDOSE             2.507     5.425     1.159

Log Likelihood of constants only model = LL(0) = -16.825
2*[LL(N)-LL(0)] = 7.427 with 1 df   Chi-sq p-value = 0.006
McFadden's Rho-Squared = 0.221

Evaluation Vector
 1 CONSTANT      1.000
 2 LDOSE         VALUE

Quantile Table
Probability      LOGIT      LDOSE      Upper      Lower
   0.999         6.907      6.900     44.788      3.518
   0.995         5.293      5.145     33.873      2.536
   0.990         4.595      4.385     29.157      2.105
   0.975         3.664      3.372     22.875      1.519
   0.950         2.944      2.590     18.042      1.050
   0.900         2.197      1.777     13.053      0.530
   0.750         1.099      0.582      5.928     -0.445
   0.667         0.695      0.142      3.551     -1.047
   0.500         0.0       -0.613      0.746     -3.364
   0.333        -0.695     -1.369     -0.347     -7.392
   0.250        -1.099     -1.809     -0.731     -9.987
   0.100        -2.197     -3.004     -1.552    -17.266
   0.050        -2.944     -3.817     -2.046    -22.281
   0.025        -3.664     -4.599     -2.503    -27.126
   0.010        -4.595     -5.612     -3.081    -33.416
   0.005        -5.293     -6.372     -3.508    -38.136
   0.001        -6.907     -8.127     -4.486    -49.055
This table includes LD (probability) values between 0.001 and 0.999. The median
lethal LDOSE (log-dose) is -0.613 with upper and lower bounds of 0.746 and -3.364
for the default 95% confidence interval, corresponding to a dose of 0.542 with limits
2.11 and 0.0346.
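The median itself follows directly from the fitted coefficients: the LD50 is the dose at which the logit is zero. A minimal check, assuming LDOSE is the natural logarithm of dose (consistent with the limits quoted above):

import math

b0, b1 = 0.564, 0.919          # CONSTANT and LDOSE coefficients from the fit above
ld50_log = -b0 / b1            # log-dose at which the response probability is 0.5
print(round(ld50_log, 3))              # about -0.613, matching the quantile table
print(round(math.exp(ld50_log), 3))    # about 0.542, the corresponding dose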
SYSTAT BASIC is used to create the new variable LDOSEB on the fly, and the new
model is then estimated without a constant. The only important part of the results from
a restricted model is the final log-likelihood. It should be close to -15.032 if we have
found the boundary of the confidence interval. We won't show the results of these
estimations except to say that the lower bound was found to be -2.634 and is tested
using the second LET statement. Note that the value of the bound is subtracted from the
original independent variable, resulting in the subtraction of a negative number. While
the process of looking for a bound that will yield a log-likelihood of -15.032 for these
data is one of trial and error, it should not take long with the interactive program.
Several other examples are provided in Williams (1986). We were able to reproduce
most of his confidence interval results, but for several models his reported LD50 values
seem to be incorrect.
In a model with several predictors, issuing QNTL for one covariate, with the remaining
covariates fixed at specified values, will produce the quantiles for that variable with the
other variables set as specified. The Fieller bounds are calculated, adjusting for all other
parameters estimated.
Example 6
Multinomial Logit
We will illustrate multinomial modeling with an example, emphasizing what is new in
this context. If you have not already read the example on binary logit, this is a good
time to do so. The data used here have been extracted from the National Longitudinal
Survey of Young Men, 1979. Information on 200 individuals is supplied on school
enrollment status (NOTENR = 1 if not enrolled, 0 otherwise), log10 of wage (LW), age,
highest completed grade (EDUC), mothers education (MED), fathers education
(FED), an index of reading material available in the home (CULTURE = 1 for least, 3
for most), mean income of persons in fathers occupation in 1960 (FOMY), an IQ
I-584
Logistic Regression
measure, a race dummy (BLACK = 0 for white), a region dummy (SOUTH = 0 for nonSouth), and the number of siblings (NSIBS).
We estimate a model to analyze the CULTURE variable, predicting its value with
several demographic characteristics. In this example, we ignore the fact that the
dependent variable is ordinal and treat it as a nominal variable. (See Agresti, 1990, for
a discussion of the distinction.)
USE NLS
FORMAT=4
PRINT=LONG
LOGIT
MODEL CULTURE=CONSTANT+MED+FOMY
ESTIMATE / MEANS,PREDICT,CLASS,DERIVATIVE=INDIVIDUAL
PRINT
These commands look just like our binary logit analyses with the exception of the
DERIVATIVE and CLASS options, which we will discuss below. The resulting output is:
Categorical values encountered during processing are:
CULTURE (3 levels)
     1, 2, 3

Multinomial LOGIT Analysis.
Dependent variable: CULTURE
Input records:          200
Records for analysis:   200

Sample split
 Category choices
    1             12
    2             49
    3            139
    Total        200

PARAMETER               1            2            3      OVERALL
 CONSTANT          1.0000       1.0000       1.0000       1.0000
 MED               8.7500      10.1837      11.4460      10.9750
 FOMY           4551.5000    5368.8571    6116.1367    5839.1750

L-L at iteration 1 is -219.7225
L-L at iteration 2 is -145.2936
L-L at iteration 3 is -138.9952
L-L at iteration 4 is -137.8612
L-L at iteration 5 is -137.7851
L-L at iteration 6 is -137.7846
L-L at iteration 7 is -137.7846
Log Likelihood: -137.7846

Parameter              Estimate       S.E.     t-ratio    p-value
Choice Group: 1
 1 CONSTANT              5.0638     1.6964      2.9850     0.0028
 2 MED                  -0.4228     0.1423     -2.9711     0.0030
 3 FOMY                 -0.0006     0.0002     -2.6034     0.0092
Choice Group: 2
 1 CONSTANT              2.5435     0.9834      2.5864     0.0097
 2 MED                  -0.1917     0.0768     -2.4956     0.0126
 3 FOMY                 -0.0003     0.0001     -2.1884     0.0286

                                     95.0 % bounds
Parameter            Odds Ratio      Upper      Lower
Choice Group: 1
 2 MED                   0.6552     0.8660     0.4958
 3 FOMY                  0.9994     0.9998     0.9989
Choice Group: 2
 2 MED                   0.8255     0.9597     0.7101
 3 FOMY                  0.9997     1.0000     0.9995

Log Likelihood of constants only model = LL(0) = -153.2535
2*[LL(N)-LL(0)] = 30.9379 with 4 df   Chi-sq p-value = 0.0000
McFadden's Rho-Squared = 0.1009

Wald tests on effects across all choices
Effect             Chi-Sq Signif          df
 1 CONSTANT               0.0025      2.0000
 2 MED                    0.0023      2.0000
 3 FOMY                   0.0088      2.0000

Covariance Matrix
            1          2          3          4          5          6
1      2.8777
2     -0.1746     0.0202
3     -0.0002    -0.0000     0.0000
4      0.5097    -0.0282    -0.0000     0.9670
5     -0.0274     0.0027    -0.0000    -0.0541     0.0059
6     -0.0000    -0.0000     0.0000    -0.0001    -0.0000     0.0000

Correlation Matrix
            1          2          3          4          5          6
1      1.0000
2     -0.7234     1.0000
3     -0.6151    -0.0633     1.0000
4      0.3055    -0.2017    -0.1515     1.0000
5     -0.2100     0.2462    -0.0148    -0.7164     1.0000
6     -0.1659    -0.0149     0.2284    -0.5544    -0.1570     1.0000

Derivative table (averaged over observations)
                        1          2          3
 1 CONSTANT        0.2033     0.3441    -0.5474
 2 MED            -0.0174    -0.0251     0.0425
 3 FOMY           -0.0000    -0.0000     0.0001

Prediction Success Table
                         Predicted Choice
Actual                1          2           3        Total
 1               1.8761     4.0901      6.0338      12.0000
 2               3.6373    13.8826     31.4801      49.0000
 3               6.4865    31.0273    101.4862     139.0000

 Pred. Tot.     12.0000    49.0000    139.0000     200.0000
 Correct         0.1563     0.2833      0.7301
 Success Ind.    0.0963     0.0383      0.0351
 Tot. Correct    0.5862

Classification Table
                         Predicted Choice
Actual                1          2           3        Total
 1               1.0000     3.0000      8.0000      12.0000
 2               0.0        4.0000     45.0000      49.0000
 3               1.0000     5.0000    133.0000     139.0000

 Pred. Tot.      2.0000    12.0000    186.0000     200.0000
 Correct         0.0833     0.0816      0.9568
 Success Ind.    0.0233    -0.1634      0.2618
 Tot. Correct    0.6900
The output begins with a report on the number of records read and retained for analysis.
This is followed by a frequency table of the dependent variable; both weighted and
unweighted counts would be provided if the FREQ option had been used. The means
table provides means of the independent variables by value of the dependent variable.
We observe that the highest educational and income values are associated with the
most reading material in the home. Next, an abbreviated history of the optimization
process lists the log-likelihood at each iteration, and finally, the estimation results are
printed.
Note that the regression results consist of two sets of estimates, labeled Choice
Group 1 and Choice Group 2. It is this multiplicity of parameter estimates that
differentiates multinomial from binary logit. If there had been five categories in the
dependent variable, there would have been four sets of estimates, and so on. This
volume of output provides the challenge to understanding the results.
The results are a little more intelligible when you realize that we have really
estimated a series of binary logits simultaneously. The first submodel consists of the
two dependent variable categories 1 and 3, and the second consists of categories 2 and
3. These submodels always include the highest level of the dependent variable as the
reference class and one other level as the response class. If NCAT had been set to 25,
the 24 submodels would be categories 1 and 25, categories 2 and 25, through categories
24 and 25. We then obtain the odds ratios for the two submodels separately, comparing
dependent variable levels 1 against 3 and 2 against 3. This table shows that levels 1 and
2 are less likely as MED and FOMY increase, as the odds ratio is less than 1.
Derivative Tables
In a multinomial context, we will want to know how the probabilities of each of the
outcomes will change in response to a change in the covariate values. This information
is provided in the derivative table, which tells us, for example, that when MED
increases by one unit, the probability of category 3 goes up by 0.042, and categories 1
and 2 go down by 0.017 and 0.025, respectively. To assess properly the effect of
father's income, the variable should be rescaled to hundreds or thousands of dollars (or
the FORMAT increased) because the effect of an increase of one dollar is very small.
The sum of the entries in each row is always 0 because an increase in probability in one
category must come about by a compensating decrease in other categories. There is no
useful interpretation of the CONSTANT row.
In general, the table shows how probability is reallocated across the possible values
of the dependent variable as the independent variable changes. It thus provides a global
view of covariate effects that is not easily seen when considering each binary submodel
separately. In fact, the overall effect of a covariate on the probability of an outcome can
be of the opposite sign of its coefficient estimate in the corresponding submodel. This
is because the submodel concerns only two of the outcomes, whereas the derivative
table considers all outcomes at once.
This table was generated by evaluating the derivatives separately for each individual
observation in the data set and then computing the mean; this is the theoretically
correct way to obtain the results. A quick alternative is to evaluate the derivatives once
at the sample average of the covariates. This method saves time (but at the possible cost
of accuracy) and is requested with the option DERIVATIVE=AVERAGE.
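For reference, a sketch of the derivative calculation for a multinomial logit (not SYSTAT's internal code). Averaging the returned matrix over all observations corresponds to DERIVATIVE=INDIVIDUAL; evaluating it once at the sample means of the covariates corresponds to DERIVATIVE=AVERAGE.

import numpy as np

def mnl_derivatives(x, B):
    # x: covariate vector (length p); B: (k x p) coefficient matrix with the
    # reference category's row fixed at zero.  Returns dP_k/dx_j for all k, j.
    v = B @ x
    p = np.exp(v - v.max())
    p /= p.sum()
    # dP_k/dx_j = P_k * (B[k, j] - sum over m of P_m * B[m, j])
    return p[:, None] * (B - p @ B)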
Prediction Success
The PREDICT option instructs LOGIT to produce the prediction success table, which we
have already seen in the binary logit. (See Hensher and Johnson, 1981; McFadden,
1979.) The table will break down the distribution of predicted outcomes by actual
choice, with diagonals representing correct predictions and off-diagonals representing
incorrect predictions. For the multinomial model, the table will have dimensions NCAT
by NCAT with additional marginal results. For our example model, the core table is 3
by 3.
Each row of the table takes all cases having a specific value of the dependent
variable and shows how the model allocates those cases across the possible outcomes.
Thus in row 1, the 12 cases that actually had CULTURE = 1 were distributed by the
predictive model as 1.88 to CULTURE = 1, 4.09 to CULTURE = 2, and 6.03 to
CULTURE = 3. These numbers are obtained by summing the predicted probability of
being in each category across all of the cases with CULTURE actually equal to 1. A
similar allocation is provided for every value of the dependent variable.
The prediction success table is also bordered by additional information: row totals
are observed sums, and column totals are predicted sums and will be equal for any
model containing a constant. The Correct row gives the ratio of the number correctly
predicted in a column to the column total. Thus, among cases for which CULTURE =
1, the fraction correct is 1.8761 / 12 = 0.1563; for CULTURE = 3, the ratio is
101.4862 / 139 = 0.7301. The total correct gives the fraction correctly predicted
overall and is computed as the sum correct in each column divided by the table total.
This is (1.8761 + 13.8826 + 101.4862) / 200 = 0.5862.
The success index measures the gain that the model exhibits in number correctly
predicted in each column over a purely random model (a model with just a constant).
A purely random model would assign the same probabilities of the three outcomes to
each case, as illustrated below:
Random Probability Model
Predicted sample fraction:  CULTURE = 1: 12/200 = 0.060   CULTURE = 2: 49/200 = 0.245   CULTURE = 3: 139/200 = 0.695

Success Index = Correct - Random Predicted Sample Fraction
Thus, the smaller the success index in each column, the poorer the performance of the
model; in fact, the index can even be negative.
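A quick check of the Success Ind. row using the counts from the prediction success table above (the random model predicts each category with its sample fraction):

counts = {1: 12, 2: 49, 3: 139}
correct = {1: 0.1563, 2: 0.2833, 3: 0.7301}   # Correct row from the table
total = sum(counts.values())

for cat, n in counts.items():
    random_share = n / total                  # random-model prediction
    print(cat, round(correct[cat] - random_share, 4))
# 1: 0.0963   2: 0.0383   3: 0.0351, matching the Success Ind. row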
Normally, one prediction success table is produced for each model estimated.
However, if the data have been separated into learning and test subsamples with BY, a
separate prediction success table will be produced for each portion of the data. This can
provide a clear picture of the strengths and weaknesses of the model when applied to
fresh data.
Classification Tables
Classification tables are similar to prediction success tables except that predicted
choices instead of predicted probabilities are added into the table. Predicted choice is
the choice with the highest probability. Mathematically, the classification table is a
prediction success table with the predicted probabilities changed, setting the highest
probability of each case to 1 and the other probabilities to 0.
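A sketch of the difference between the two tables (hypothetical helper, not SYSTAT code): the classification table adds a full count of 1 for the highest-probability alternative of each case, whereas the prediction success table adds the probabilities themselves.

import numpy as np

def classification_table(probs, actual, k):
    # probs: (n x k) predicted probabilities; actual: choices coded 1..k
    table = np.zeros((k, k))
    predicted = probs.argmax(axis=1)          # index of the predicted choice
    for a, p in zip(actual, predicted):
        table[a - 1, p] += 1                  # one full count per case
    return table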
In the absence of fractional case weighting, each cell of the main table will contain
an integer instead of a real number. All other quantities are computed as they would be
for the prediction success table. In our judgment, the classification table is not as good
a diagnostic tool as the prediction success table. The option is included primarily for
the binary logit to provide comparability with results reported in the literature.
Example 7
Conditional Logistic Regression
Data must be organized in a specific way for the conditional logistic model;
fortunately, this organization is natural for matched sample case-control studies. First,
matched samples must be grouped together; all subjects from a given stratum must be
contiguous. It is thus advisable to provide each set with a unique stratum number to
facilitate the sorting and tracking of records. Second, the dependent variable gives the
relative position of the case within a matched set. Thus, the dependent variable will be
an integer between 1 and NCAT, and if the case is first in each stratum, then the
dependent variable will be equal to 1 for every record in the data set.
To illustrate how to set up conditional logit models, we use data discussed at length
by Breslow and Day (1980) on cases of endometrial cancer in a retirement community
near Los Angeles. The data are reproduced in their Appendix III and are identified in
SYSTAT as MACK.SYD.
The data set includes the dependent variable CANCER, the exposure variables AGE,
GALL (gall bladder disease), HYP (hypertension), OBESE, ESTROGEN, DOSE, DUR
(duration of conjugated estrogen exposure), NON (other drugs), some transformations
of these variables, and a set identification number. The data are organized by sets, with
the case coming first, followed by four controls, and so on, for a total of 315
observations ( 63 * ( 4 + 1 ) ) .
To estimate a model of the relative risks of gall bladder disease, estrogen use, and
their interaction, you may proceed as follows:
USE MACK
PRINT LONG
LOGIT
MODEL DEPVAR=GALL+EST+GALL*EST ;
ALT=SETSIZE
NCAT=5
ESTIMATE
There are three key points to notice about this sequence of commands. First, the NCAT
command is required to let LOGIT know how many subjects there are in a matched set.
Unlike the unconditional binary LOGIT, a unit of information in matched samples will
typically span more than one line of data, and NCAT will establish the minimum size
of each matched set. If each set contains the same number of subjects, the NCAT
command completely describes the data organization. If there were a varying number
of controls per set, the size of each set would be signaled with the ALT command, as in
ALT = SETSIZE
Parameter         Estimate       S.E.    t-ratio    p-value
 1 GALL              2.894     0.8831     3.2777     0.0010
 2 EST               2.700     0.6118     4.4137     0.0000
 3 GALL*EST         -2.053     0.9950    -2.0631     0.0391

                                   95.0 % bounds
Parameter          Odds Ratio      Upper      Lower
 1 GALL               18.0717   102.0127     3.2014
 2 EST                14.8818    49.3621     4.4866
 3 GALL*EST            0.1284     0.9025     0.0183

Log Likelihood of constants only model = LL(0) = 0.0000
McFadden's Rho-Squared = 4.56944E+15

Covariance Matrix
            1          2          3
1      0.7798
2      0.3398     0.3743
3     -0.7836    -0.3667     0.9900

Correlation Matrix
            1          2          3
1      1.0000
2      0.6290     1.0000
3     -0.8918    -0.6024     1.0000
The output begins with a report on the number of SYSTAT records read and the
number of matched sets kept for analysis. The remaining output parallels the results
produced by the unconditional logit model. The parameters estimated are coefficients
of a linear logit, the relative risks are derived by exponentiation, and the interpretation
of the model is unchanged. Model selection will proceed as it would in linear
regression; you might experiment with logarithmic transformations of the data, explore
quadratic and higher-order polynomials in the risk factors, and look for interactions.
Examples of such explorations appear in Breslow and Day (1980).
Example 8
Discrete Choice Models
The CHOICE data set contains hypothetical data motivated by McFadden (1979). The
CHOICE variable represents which of the three transportation alternatives (AUTO,
POOL, TRAIN) each subject prefers. The first subscripted variable in each choice
category represents TIME and the second, COST. Finally, SEX$ represents the gender
of the chooser, and AGE, the age.
A basic discrete choice model is estimated with:
USE CHOICE
LOGIT
SET TIME = AUTO(1),POOL(1),TRAIN(1)
SET COST = AUTO(2),POOL(2),TRAIN(2)
MODEL CHOICE=TIME+COST
ESTIMATE
There are two new features of this program. First, the word TIME is not a SYSTAT
variable name; rather, it is a label we chose to remind us of time spent commuting. The
group of names in the SET statement are valid SYSTAT variables corresponding, in
order, to the three modes of transportation. Although there are three variable names in
the SET variable, only one attribute is being measured.
Following is the output:
Categorical values encountered during processing are:
CHOICE (3 levels)
     1, 2, 3
Categorical variables are effects coded with the highest value as reference.

Conditional LOGIT Analysis.
Dependent variable: CHOICE
Input records:           29
Records for analysis:    29

Sample split
 Category choices
    1            15
    2             6
    3             8
    Total        29

L-L at iteration 1 is -31.860
L-L at iteration 2 is -31.142
L-L at iteration 3 is -31.141
L-L at iteration 4 is -31.141
Log Likelihood: -31.141

Parameter       Estimate     S.E.    t-ratio   p-value
 1 TIME           -0.020    0.017     -1.169     0.243
 2 COST           -0.088    0.145     -0.611     0.541

                             95.0 % bounds
Parameter      Odds Ratio    Upper     Lower
 1 TIME             0.980    1.014     0.947
 2 COST             0.915    1.216     0.689

Log Likelihood of constants only model = LL(0) = -29.645
McFadden's Rho-Squared = -0.050

Covariance Matrix
            1          2
1       0.000
2       0.001      0.021

Correlation Matrix
            1          2
1       1.000
2       0.384      1.000
The output begins with a frequency distribution of the dependent variable and a brief
iteration history and prints standard regression results for the parameters estimated.
A key difference between a conditional variable clause and a standard SYSTAT
polytomous variable is that each clause corresponds to only one estimated parameter
regardless of the value of NCAT, while each free-standing polytomous variable
generates NCAT - 1 parameters. The difference is best seen in a model that mixes both
types of variables (see Hoffman and Duncan, 1988, or Steinberg, 1987, for further
discussion).
Mixed Parameters
The following is an example of mixing polytomous and conditional variables:
USE CHOICE
LOGIT
CATEGORY SEX$
SET TIME = AUTO(1),POOL(1),TRAIN(1)
SET COST = AUTO(2),POOL(2),TRAIN(2)
MODEL CHOICE=TIME+COST+SEX$+AGE
ESTIMATE
The hybrid model generates a single coefficient each for TIME and COST and two sets
of parameters for the polytomous variables.
The resulting output is:
Categorical values encountered during processing are:
SEX$ (2 levels)
Female, Male
CHOICE (3 levels)
1,
2,
3
Conditional LOGIT Analysis.
Dependent variable: CHOICE
Input records:
29
Records for analysis:
Sample split
Category choices
1
2
3
Total
:
L-L
L-L
L-L
L-L
L-L
Log
at iteration
at iteration
at iteration
at iteration
at iteration
Likelihood:
Parameter
1 TIME
2 COST
29
15
6
8
29
1
2
3
4
5
is
is
is
is
is
-31.860
-28.495
-28.477
-28.477
-28.477
-28.477
Estimate
-0.018
-0.351
S.E.
0.020
0.217
t-ratio
-0.887
-1.615
p-value
0.375
0.106
I-595
Chapter 17
Choice Group: 1
3 SEX$_Female
4 AGE
Choice Group: 2
3 SEX$_Female
4 AGE
0.328
0.026
0.509
0.014
0.645
1.850
0.519
0.064
0.024
-0.008
0.598
0.016
0.040
-0.500
95.0 % bounds
Upper
Lower
1.022
0.945
1.078
0.460
0.968
0.617
Parameter
Odds Ratio
1 TIME
0.982
2 COST
0.704
Choice Group: 1
4 AGE
1.026
1.054
0.998
Choice Group: 2
4 AGE
0.992
1.024
0.961
Log Likelihood of constants only model = LL(0) =
-29.645
2*[LL(N)-LL(0)] =
2.335 with 4 df Chi-sq p-value = 0.674
McFaddens Rho-Squared =
0.039
Wald tests on effects across all choices
Effect
3 SEX$_Female
4 AGE
Wald
Statistic
0.551
4.475
Chi-Sq
Signif
0.759
0.107
df
2.000
2.000
Covariance Matrix
1
2
3
4
5
6
1
0.000
0.001
0.002
-0.000
0.002
-0.000
0.047
0.009
-0.001
-0.018
0.001
0.259
0.002
0.165
0.002
0.000
0.002
0.000
0.358
0.003
0.000
2
0.180
1.000
0.084
-0.499
-0.140
0.310
3
0.150
0.084
1.000
0.230
0.543
0.193
4
-0.076
-0.499
0.230
1.000
0.281
0.265
5
0.146
-0.140
0.543
0.281
1.000
0.323
6
-0.266
0.310
0.193
0.265
0.323
1.000
Correlation Matrix
1
2
3
4
5
6
1
1.000
0.180
0.150
-0.076
0.146
-0.266
Varying Alternatives
For some discrete choice problems, the number of alternatives available varies across
choosers. For example, health researchers studying hospital choice pooled data from
several cities in which each city had a different number of hospitals in the choice set
(Luft et al., 1988). Transportation research may pool data from locations having train
service with locations without trains. Carson, Hanemann, and Steinberg (1990) pool
responses from two contingent valuation survey questions having differing numbers of
alternatives. To let LOGIT know about this, there are two ways of proceeding. The most
flexible is to organize the data by choice. With the standard data layout, use the ALT
command, as in
ALT=NCHOICES
Interactions
A common practice in discrete choice models is to enter characteristics of choosers as
interactions with attributes of the alternatives in conditional variable clauses. When
dealing with large sets of alternatives, such as automobile purchase choices or hospital
choices, where the model may contain up to 60 different alternatives, adding
polytomous variables can quickly produce unmanageable estimation problems, even
for mainframes. In the transportation literature, it has become commonplace to
introduce demographic variables as interactions with, or other functions of, the discrete
choice variables. Thus, instead of, or in addition to, the COST group of variables,
AUTO(2), POOL(2), TRAIN(2), you might see the ratio of cost to income. These ratios
would be created with LET transformations and then added in another SET list for use
as a conditional variable in the MODEL statement. Interactions can also be introduced
this way. By confining demographic variables to appear only as interactions with
choice variables, the number of parameters estimated can be kept quite small.
Thus, an investigator might prefer
USE CHOICE
LOGIT
SET TIME = AUTO(1),POOL(1),TRAIN(1)
SET TIMEAGE=AUTO(1)*AGE,POOL(1)*AGE,TRAIN(1)*AGE
SET COST = AUTO(2),POOL(2),TRAIN(2)
MODEL CHOICE=TIME+TIMEAGE+COST
ESTIMATE
Sample split
 Category choices
    1            15
    2             6
    3             8
    Total        29

L-L at iteration 1 is -31.860
L-L at iteration 2 is -28.021
L-L at iteration 3 is -27.866
L-L at iteration 4 is -27.864
L-L at iteration 5 is -27.864
Log Likelihood: -27.864

Parameter         Estimate     S.E.    t-ratio   p-value
 1 TIME             -0.148    0.062     -2.382     0.017
 2 TIMEAGE           0.003    0.001      2.193     0.028
 3 COST              0.007    0.155      0.043     0.966

                               95.0 % bounds
Parameter        Odds Ratio    Upper     Lower
 1 TIME               0.863    0.974     0.764
 2 TIMEAGE            1.003    1.006     1.000
 3 COST               1.007    1.365     0.742

Log Likelihood of constants only model = LL(0) = -29.645
2*[LL(N)-LL(0)] = 3.561 with 1 df   Chi-sq p-value = 0.059
McFadden's Rho-Squared = 0.060

Covariance Matrix
            1          2          3
1       0.004
2      -0.000      0.000
3      -0.001      0.000      0.024

Correlation Matrix
            1          2          3
1       1.000
2      -0.936      1.000
3      -0.110      0.273      1.000
Constants
The models estimated here deliberately did not include a constant because the constant
is treated as a polytomous variable in LOGIT. To obtain an alternative specific constant,
enter the following model statement:
USE CHOICE
LOGIT
SET TIME = AUTO(1),POOL(1),TRAIN(1)
SET COST = AUTO(2),POOL(2),TRAIN(2)
MODEL CHOICE=CONSTANT+TIME+COST
ESTIMATE
Two CONSTANT parameters would be estimated. For the discrete choice model with
the type of data layout of this example, there is no need to specify the NCAT value
because LOGIT determines this automatically by the number of variables between the
brackets. If the model statement is inconsistent in the number of variables within
brackets across conditional variable clauses, an error message will be generated.
Following is the output:
Categorical values encountered during processing are:
CHOICE (3 levels)
          1,        2,        3

Conditional LOGIT Analysis.
Dependent variable: CHOICE
Input records:                29
Records for analysis:         29
Sample split
Category   choices
    1            15
    2             6
    3             8
Total            29

L-L at iteration   1 is     -31.860
L-L at iteration   2 is     -25.808
L-L at iteration   3 is     -25.779
L-L at iteration   4 is     -25.779
L-L at iteration   5 is     -25.779
Log Likelihood:             -25.779

Parameter         Estimate        S.E.     t-ratio     p-value
 1 TIME             -0.012       0.020      -0.575       0.565
 2 COST             -0.567       0.222      -2.550       0.011
 3 CONSTANT          1.510       0.608       2.482       0.013
 3 CONSTANT         -0.865       0.675      -1.282       0.200

                                95.0 % bounds
Parameter       Odds Ratio       Upper        Lower
 1 TIME              0.988       1.029        0.950
 2 COST              0.567       0.877        0.367

Log Likelihood of constants only model = LL(0) =     -29.645
2*[LL(N)-LL(0)] =      7.732 with 2 df Chi-sq p-value = 0.021
McFadden's Rho-Squared =      0.130
                          Chi-Sq
Effect                    Signif            df
 3 CONSTANT                0.013         2.000

Covariance Matrix
              1            2            3            4
 1        0.000
 2        0.001        0.049
 3       -0.001       -0.082        0.370
 4       -0.005        0.056        0.046        0.455

Correlation Matrix
              1            2            3            4
 1        1.000        0.130       -0.053       -0.350
 2        0.130        1.000       -0.606        0.372
 3       -0.053       -0.606        1.000        0.113
 4       -0.350        0.372        0.113        1.000
Example 9
By-Choice Data Format
In the standard data layout, there is one data record per case that contains information
on every alternative open to a chooser. With a large number of alternatives, this can
quickly lead to an excessive number of variables. A convenient alternative is to
organize data by choice; with this data layout, there is one record per alternative and
as many as NCAT records per case. The data set CHOICE2 organizes the CHOICE data
of the Discrete Choice Models example in this way. If you examine the differences
between the two data sets, you will see that they are similar to those between the
split-plot and multivariate layouts for a repeated measures design (see Analysis of
Variance). To set up the same problem in a by-choice layout, input the following:
USE CHOICE2
LOGIT
NCAT=3
ALT=NCHOICES
MODEL CHOICE=TIME+COST ;
ESTIMATE
The by-choice format requires that the dependent variable appear with the same value
on each record pertaining to the case. An ALT variable (here NCHOICES) indicating
the number of records for this case must also appear on each record. The by-choice
organization results in fewer variables on the data set, with the savings increasing with
the number of alternatives. However, there is some redundancy in that certain data
values are repeated on each record. The best reason for using a by-choice format is to
handle varying numbers of alternatives per case. In this situation, there is no need to
shuffle data values or to be concerned with choice order.
With the by-choice data format, the NCAT statement is required; it is the only way
for LOGIT to know the number of alternatives to expect per case. For varying numbers
of alternatives per case, the ALT statement is also required, although we use it here with
the same number of alternatives.
USE CHOICE2
LOGIT
CATEGORY SEX$
NCAT=3
ALT=NCHOICES
MODEL CHOICE=TIME+COST ; AGE+SEX$
ESTIMATE
Because the number of alternatives (ALT) is the same for each case in this example, the
output is the same as the Mixed Parameters example.
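To see what the two layouts look like side by side, here is a small illustrative sketch in plain Python with pandas (outside SYSTAT). The variable names follow the CHOICE example, but the three records shown are invented.

import pandas as pd

# Standard layout: one record per case, one column per
# alternative-by-attribute combination.
standard = pd.DataFrame({
    "CHOICE":  [1, 3, 2],
    "AUTO(1)": [20, 15, 30], "POOL(1)": [35, 40, 25], "TRAIN(1)": [45, 30, 50],
    "AUTO(2)": [3.0, 2.5, 4.0], "POOL(2)": [1.5, 1.0, 2.0], "TRAIN(2)": [1.0, 1.2, 0.8],
})

# By-choice layout: one record per alternative, with NCHOICES and the
# dependent variable repeated on every record of a case.
rows = []
for case_id, rec in standard.iterrows():
    for alt, name in enumerate(["AUTO", "POOL", "TRAIN"], start=1):
        rows.append({"CASE": case_id + 1, "CHOICE": rec["CHOICE"],
                     "NCHOICES": 3, "ALT": alt,
                     "TIME": rec[f"{name}(1)"], "COST": rec[f"{name}(2)"]})
by_choice = pd.DataFrame(rows)
print(by_choice)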
and the sum of the weights would be 100 * 1.8 + 100 * 0.2 = 200 , as required. To
handle such samples, LOGIT permits non-integer weights and does not truncate them
to integers.
Example 10
Stepwise Regression
LOGIT offers forward and backward stepwise logistic regression with single stepping
as an option. The simplest way to initiate stepwise regression is to substitute START
for ESTIMATE following a MODEL statement and then proceed with stepping with the
STEP command, just as in GLM or Regression.
An upward step consists of three components. First, the current model is estimated
to convergence. The procedure is exactly the same as regular estimation. Second, score
statistics for each additional effect are conducted, adjusted for variables already in the
model. The joint significance of all additional effects together is also computed.
Finally, the effect with the smallest significance level for its score statistic is identified.
If this significance level is below the ENTER option (0.05 by default), the effect is
added to the model.
A downward step also consists of three computational segments. First, the model is
estimated to convergence. Then Wald statistics are computed for each effect in the
model. Finally, the effect with the largest p value for its Wald test statistic is identified.
If this significance level is above the REMOVE criterion (by default 0.10), the effect is
removed from the model.
If you require certain effects to remain in the model regardless of the outcome of the
Wald test, force them into the model by listing them first in the MODEL statement and using the
FORCE option of START. It is important to set the ENTER and REMOVE criteria
carefully because it is possible to have a variable cycle in and out of a model
repeatedly. The defaults are:
START / ENTER = .05, REMOVE = .10
Hosmer and Lemeshow use stepwise regression in their search for a model of low
birth weight discussed in the Binary Logit section. We conduct a similar analysis
with:
USE HOSLEM
LOGIT
CATEGORY RACE
MODEL LOW=CONSTANT+PTL+LWT+HT+RACE+SMOKE+UI+AGE+FTV
START / ENTER=.15,REMOVE=.20
STEP / AUTO
(Output header fragments: RACE, BWT, SMOKE, RACE1; case counts 189, 59, 130, 189.)
Step  0   Log Likelihood:   -117.336
Parameter         Estimate        S.E.     t-ratio     p-value
 1 CONSTANT         -0.790       0.157      -5.033       0.000

Score tests on effects not in model
                        Score       Chi-Sq
Effect              Statistic       Signif          df
 2 PTL                  7.267        0.007       1.000
 3 LWT                  5.438        0.020       1.000
 4 HT                   4.388        0.036       1.000
 5 RACE                 5.005        0.082       2.000
 6 SMOKE                4.924        0.026       1.000
 7 UI                   5.401        0.020       1.000
 8 AGE                  2.674        0.102       1.000
 9 FTV                  0.749        0.387       1.000
Joint Score            30.959        0.000       9.000

Step  1   Log Likelihood:   -113.946
Parameter         Estimate        S.E.     t-ratio     p-value
 1 CONSTANT         -0.964       0.175      -5.511       0.000
 2 PTL               0.802       0.317       2.528       0.011

Score tests on effects not in model
                        Score       Chi-Sq
Effect              Statistic       Signif          df
 3 LWT                               0.043       1.000
 4 HT                                0.030       1.000
 5 RACE                              0.069       2.000
 6 SMOKE                             0.075       1.000
 7 UI                                0.075       1.000
 8 AGE                               0.062       1.000
 9 FTV                               0.448       1.000
Joint Score                          0.002       8.000

Step  2   Log Likelihood:
Parameter         Estimate        S.E.     t-ratio     p-value
 1 CONSTANT                      0.184      -5.764       0.000
 2 PTL                           0.318       2.585       0.010
 3 HT                            0.616       2.066       0.039

Score tests on effects not in model
                        Score       Chi-Sq
Effect              Statistic       Signif          df
 4 LWT                  6.900        0.009       1.000
 5 RACE                 4.882        0.087       2.000
 6 SMOKE                3.117        0.078       1.000
 7 UI                   4.225        0.040       1.000
 8 AGE                  3.448        0.063       1.000
 9 FTV                  0.370        0.543       1.000
Joint Score            20.658        0.004       7.000

Step  3   Log Likelihood:   -107.982
Parameter         Estimate        S.E.     t-ratio     p-value
 1 CONSTANT          1.093       0.841       1.299       0.194
 2 PTL               0.726       0.328       2.213       0.027
 3 HT                1.856       0.705       2.633       0.008
 4 LWT              -0.017       0.007      -2.560       0.010

Score tests on effects not in model
                        Score       Chi-Sq
Effect              Statistic       Signif          df
 5 RACE                 5.266        0.072       2.000
 6 SMOKE                2.857        0.091       1.000
 7 UI                   3.081        0.079       1.000
 8 AGE                  1.895        0.169       1.000
 9 FTV                  0.118        0.732       1.000
Joint Score            14.395        0.026       6.000

Step  4   Log Likelihood:   -105.425
Parameter         Estimate        S.E.     t-ratio     p-value
 1 CONSTANT          1.405       0.900       1.560       0.119
 2 PTL               0.746       0.328       2.278       0.023
 3 HT                1.805       0.714       2.530       0.011
 4 LWT              -0.018       0.007      -2.607       0.009
 5 RACE_1           -0.518       0.237      -2.190       0.029
 6 RACE_2            0.569       0.318       1.787       0.074

Score tests on effects not in model
                        Score       Chi-Sq
Effect              Statistic       Signif          df
 6 SMOKE                5.936        0.015       1.000
 7 UI                   3.265        0.071       1.000
 8 AGE                  1.019        0.313       1.000
 9 FTV                  0.025        0.873       1.000
Joint Score             9.505        0.050       4.000

Step  5   Log Likelihood:   -102.449
Parameter         Estimate        S.E.     t-ratio     p-value
 1 CONSTANT          0.851       0.913       0.933       0.351
 2 PTL               0.602       0.335       1.797       0.072
 3 HT                1.745       0.695       2.511       0.012
 4 LWT              -0.017       0.007      -2.418       0.016
 5 RACE_1           -0.734       0.263      -2.790       0.005
 6 RACE_2            0.557       0.324       1.720       0.085
 7 SMOKE             0.946       0.395       2.396       0.017

Score tests on effects not in model
                        Score       Chi-Sq
Effect              Statistic       Signif          df
 7 UI                   3.034        0.082       1.000
 8 AGE                  0.781        0.377       1.000
 9 FTV                  0.014        0.904       1.000
Joint Score             3.711        0.294       3.000

Step  6   Log Likelihood:   -100.993
Parameter         Estimate        S.E.     t-ratio     p-value
 1 CONSTANT          0.654       0.921       0.710       0.477
 2 PTL               0.503       0.341       1.475       0.140
 3 HT                1.855       0.695       2.669       0.008
 4 LWT              -0.016       0.007      -2.320       0.020
 5 RACE_1           -0.741       0.265      -2.797       0.005
 6 RACE_2            0.585       0.323       1.811       0.070
 7 SMOKE             0.939       0.399       2.354       0.019
 8 UI                0.786       0.456       1.721       0.085

Score tests on effects not in model
                        Score       Chi-Sq
Effect              Statistic       Signif          df
 8 AGE                  0.553        0.457       1.000
 9 FTV                  0.056        0.813       1.000
Joint Score             0.696        0.706       2.000

Log Likelihood:   -100.993
Parameter         Estimate        S.E.     t-ratio     p-value
 1 CONSTANT          0.654       0.921       0.710       0.477
 2 PTL               0.503       0.341       1.475       0.140
 3 HT                1.855       0.695       2.669       0.008
 4 LWT              -0.016       0.007      -2.320       0.020
 5 RACE_1           -0.741       0.265      -2.797       0.005
 6 RACE_2            0.585       0.323       1.811       0.070
 7 SMOKE             0.939       0.399       2.354       0.019
 8 UI                0.786       0.456       1.721       0.085

                                95.0 % bounds
Parameter       Odds Ratio       Upper        Lower
 2 PTL               1.654       3.229        0.847
 3 HT                6.392      24.964        1.637
 4 LWT               0.984       0.998        0.971
 5 RACE_1            0.477       0.801        0.284
 6 RACE_2            1.795       3.379        0.953
 7 SMOKE             2.557       5.586        1.170
 8 UI                2.194       5.367        0.897

Log Likelihood of constants only model = LL(0) =    -117.336
2*[LL(N)-LL(0)] =     32.686 with 7 df Chi-sq p-value = 0.000
McFadden's Rho-Squared =      0.139
Not all logistic regression programs compute the variable addition statistics in the same
way, so minor differences in output are possible. Our results listed in the Chi-Square
Significance column of the first step, for example, correspond to H&L's first row in
their Table 4.15; the two sets of results are very similar but not identical. While our
method yields the same final model as H&L, the order in which variables are entered
is not the same because intermediate p values differ slightly. Once a final model is
arrived at, it is re-estimated to give true maximum likelihood estimates.
Example 11
Hypothesis Testing
Two types of hypothesis tests are easily conducted in LOGIT: the likelihood ratio (LR)
test and the Wald test. The tests are discussed in numerous statistics books, sometimes
under varying names. Accounts can be found in Maddala's text (1988), Cox and
Hinkley (1974), Rao (1973), Engel (1984), and Breslow and Day (1980). Here we
provide some elementary examples.
Likelihood-Ratio Test
The likelihood-ratio test is conducted by fitting two nested models (the restricted and
the unrestricted) and comparing the log-likelihoods at convergence. Typically, the
unrestricted model contains a proposed set of variables, and the restricted model omits
a selected subset, although other restrictions are possible. The test statistic is twice the
difference of the log-likelihoods and is chi-squared with degrees of freedom equal to
the number of restrictions imposed. When the restrictions consist of excluding
variables, the degrees of freedom are equal to the number of parameters set to 0.
If a model contains a constant, LOGIT automatically calculates a likelihood-ratio test
of the null hypothesis that all coefficients except the constant are 0. It appears on a line
that looks like:
2*[LL(N)-LL(0)] = 26.586 with 5 df, Chi-sq p-value = 0.00007
This example line states that twice the difference between the log-likelihood of the
estimated model and that of the constants-only model is 26.586, which is a chi-squared
deviate on five degrees of freedom. The p value indicates that the null hypothesis
would be rejected.
To illustrate use of the LR test, consider a model estimated on the low birth weight
data (see the Binary Logit example). Assuming CATEGORY=RACE, compare the
following model
MODEL LOW CONSTANT + LWD + AGE + RACE + PTD
with
MODEL LOW CONSTANT + LWD + AGE
The null hypothesis is that the categorical variable RACE, which contributes two
parameters to the model, and PTD are jointly 0. The model log-likelihoods are -104.043
and -112.143, and twice the difference (16.20) is chi-squared with three degrees of
freedom under the null hypothesis. This value can also be more conveniently
calculated by taking the difference of the LR test statistics reported below the
parameter estimates and the difference in the degrees of freedom. The unrestricted
model above has G = 26.587 with five degrees of freedom, and the restricted model
has G = 10.385 with two degrees of freedom. The difference between the G values
is 16.20, and the difference between degrees of freedom is 3.
Although LOGIT will not automatically calculate LR statistics across separate
models, the p value of the result can be obtained with the command:
CALC 1-XCF(16.2,3)
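The same p value can be checked outside SYSTAT; for example, with scipy (an illustrative sketch, not part of LOGIT):

from scipy.stats import chi2

# Twice the difference in log-likelihoods, tested on 3 degrees of
# freedom (two RACE dummies plus PTD).
lr = 2 * (112.143 - 104.043)            # = 16.20
p_value = chi2.sf(lr, df=3)             # same quantity as 1-XCF(16.2,3)
print(round(lr, 2), round(p_value, 5))  # about 16.20 and 0.001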
Wald Test
The Wald test is the best known inferential procedure in applied statistics. To conduct
a Wald test, we first estimate a model and then pose a linear constraint on the
parameters estimated. The statistic is based on the constraint and the appropriate
elements of the covariance matrix of the parameter vector. A test of whether a single
parameter is 0 is conducted as a Wald test by dividing the squared coefficient by its
variance and referring the result to a chi-squared distribution on one degree of freedom.
Thus, each t ratio is itself the square root of a simple Wald test. Following is an
example:
USE HOSLEM
LOGIT
CATEGORY RACE
MODEL LOW=CONSTANT+LWD+AGE+RACE+PTD
ESTIMATE
HYPOTHESIS
CONSTRAIN PTD=0
CONSTRAIN RACE[1]=0
CONSTRAIN RACE[2]=0
TEST
                     3            4            5            Q
 1                 0.0          0.0          0.0        1.515
 2                 0.0        1.000          0.0       -0.442
 3                 0.0          0.0        1.000        0.464

ChiSq Statistic    =     15.104
ChiSq p-value      =      0.002
Degrees of freedom =      3
Note that this statistic of 15.104 is close to the LR statistic of 16.2 obtained for the same
hypothesis in the previous section. Although there are three separate CONSTRAIN lines
in the HYPOTHESIS paragraph above, they are tested jointly in a single test. To test
each restriction individually, place a TEST after each CONSTRAIN. The restrictions
being tested are each entered with separate CONSTRAIN commands. These can include
any linear algebraic expression without parentheses involving the parameters. If
interactions were present on the MODEL statement, they can also appear on the
CONSTRAIN statement. To reference dummies generated from categorical covariates,
use square brackets, as in the example for RACE. This constraint refers to the
coefficient labeled RACE1 in the output.
More elaborate tests can be posed in this framework. For example,
CONSTRAIN 7*LWD - 4.3*AGE + 1.5*RACE[2] = -5
or
CONSTRAIN AGE + LWD = 1
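In matrix form, a joint Wald test of linear constraints A*b = c combines the estimated coefficient vector b with its covariance matrix V. The sketch below shows the standard calculation in plain Python; the coefficient and covariance values are placeholders invented for illustration, not the HOSLEM estimates.

import numpy as np
from scipy.stats import chi2

# Hypothetical estimates b and covariance matrix V for a model with
# parameters (CONSTANT, LWD, AGE, RACE[1], RACE[2], PTD).
b = np.array([0.5, 0.9, -0.04, 0.45, -0.40, 1.50])
V = np.diag([0.64, 0.17, 0.001, 0.12, 0.20, 0.21])   # placeholder covariance

# Constraints PTD = 0, RACE[1] = 0, RACE[2] = 0 written as A b = c.
A = np.zeros((3, 6))
A[0, 5] = 1.0
A[1, 3] = 1.0
A[2, 4] = 1.0
c = np.zeros(3)

diff = A @ b - c
W = diff @ np.linalg.inv(A @ V @ A.T) @ diff      # Wald chi-square
p = chi2.sf(W, df=A.shape[0])
print(round(W, 3), round(p, 4))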
Example 12
Quasi-Maximum Likelihood
When a model to be estimated by maximum likelihood is misspecified, White (1982)
has shown that the standard methods for obtaining the variance-covariance matrix are
incorrect. In particular, standard errors derived from the inverse matrix of second
derivatives and all hypothesis tests based on this matrix are unreliable. Since
misspecification may be the rule rather than the exception, is there any safe way to
proceed with inference? White offers an alternative variance-covariance matrix that
simplifies (asymptotically) to the inverse Hessian when the model is not misspecified
and is correct when the model is misspecified. Calling the procedure of estimating a
misspecified model quasi-maximum likelihood estimation (QMLE), the proper QML
matrix is defined as

    Q = H^(-1) G H^(-1)

where H is the matrix of second derivatives of the log-likelihood (the Hessian) and G
is the sum over cases of the outer products of the first-derivative (score) vectors.
White shows that for a misspecified model, the LR test is not asymptotically chi-squared, and the Wald and likelihood-ratio tests are not asymptotically equivalent,
even when the QML matrix is used for Wald tests.
The best course of action appears to be to use only the QML version of the Wald test
when misspecification is a serious possibility. If the QML covariance matrix is
requested with the ESTIMATE command, a second set of parameter statistics will be
printed, reflecting the new standard errors, t ratios and p values; the coefficients are
unchanged. The QML covariance matrix will replace the standard covariance matrix
during subsequent hypothesis testing with the HYPOTHESIS command. Following is
an example:
USE NLS
LOGIT
MODEL CULTURE=CONSTANT+IQ
ESTIMATE / QML
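For reference, the sandwich calculation itself is straightforward once the Hessian and the per-case score vectors are available. A minimal sketch for a binary logit follows (plain Python with invented random data, not the NLS file, and a two-category rather than polytomous model for brevity):

import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(100, 15, n)])  # CONSTANT, IQ-like
y = rng.integers(0, 2, n).astype(float)
beta = np.zeros(2)

# A few Newton steps to get maximum likelihood estimates.
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    g = X.T @ (y - p)                        # gradient (score)
    H = -(X * (p * (1 - p))[:, None]).T @ X  # Hessian of the log-likelihood
    beta -= np.linalg.solve(H, g)

# Usual covariance (inverse of the negative Hessian) versus the QML
# sandwich Q = H^-1 G H^-1, with G the summed outer products of scores.
p = 1 / (1 + np.exp(-X @ beta))
H = -(X * (p * (1 - p))[:, None]).T @ X
scores = X * (y - p)[:, None]
G = scores.T @ scores
Hinv = np.linalg.inv(-H)
usual = Hinv
qml = Hinv @ G @ Hinv
print(np.sqrt(np.diag(usual)), np.sqrt(np.diag(qml)))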
Input records:               200
Sample split
Category   choices
    1             12
    2             49
    3            139
Total            200

L-L at iteration   1 is     -219.722
L-L at iteration   2 is     -148.554
L-L at iteration   3 is     -144.158
L-L at iteration   4 is     -143.799
L-L at iteration   5 is     -143.793
L-L at iteration   6 is     -143.793
Log Likelihood:             -143.793

Parameter         Estimate        S.E.     t-ratio     p-value
Choice Group: 1
 1 CONSTANT          4.252       2.107       2.018       0.044
 2 IQ               -0.065       0.021      -3.052       0.002
Choice Group: 2
 1 CONSTANT          3.287       1.275       2.579       0.010
 2 IQ               -0.041       0.012      -3.372       0.001

                                95.0 % bounds
Parameter       Odds Ratio       Upper        Lower
Choice Group: 1
 2 IQ                0.937       0.977        0.898
Choice Group: 2
 2 IQ                0.960       0.983        0.937

Log Likelihood of constants only model = LL(0) =    -153.254
2*[LL(N)-LL(0)] =     18.921 with 2 df Chi-sq p-value = 0.000
McFadden's Rho-Squared =      0.062

Covariance matrix QML adjusted.
Log Likelihood:             -143.793

Parameter         Estimate        S.E.     t-ratio     p-value
Choice Group: 1
 1 CONSTANT          4.252       2.252       1.888       0.059
 2 IQ               -0.065       0.023      -2.860       0.004
Choice Group: 2
 1 CONSTANT          3.287       1.188       2.767       0.006
 2 IQ               -0.041       0.011      -3.682       0.000

                                95.0 % bounds
Parameter       Odds Ratio       Upper        Lower
Choice Group: 1
 2 IQ                0.937       0.980        0.896
Choice Group: 2
 2 IQ                0.960       0.981        0.939

Log Likelihood of constants only model = LL(0) =    -153.254
2*[LL(N)-LL(0)] =     18.921 with 2 df Chi-sq p-value = 0.000
McFadden's Rho-Squared =      0.062
Note the changes in the standard errors, t ratios, p values, odds ratio bounds, Wald test
p values, and covariance matrix.
Computation
All calculations are in double precision.
Algorithms
LOGIT uses Gauss-Newton methods for maximizing the likelihood. By default, two
tolerance criteria must be satisfied: the maximum value for relative coefficient changes
must fall below 0.001, and the Euclidean norm of the relative parameter change vector
must also fall below 0.001. By default, LOGIT uses the second derivative matrix to
update the parameter vector. In discrete choice models, it may be preferable to use a
first derivative approximation to the Hessian instead. This option, popularized by
Berndt, Hall, Hall, and Hausman (1974), will be noted if it is used by the program.
BHHH uses the summed outer products of the gradient vector in place of the Hessian
matrix and generally will converge much more slowly than the default method.
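The two default convergence criteria can be written compactly. The sketch below (plain Python, illustrative only) shows one reading of the stopping rule applied after each parameter update; the exact definition of "relative change" used by LOGIT, and the guard against zero coefficients, are assumptions made here for the sake of a runnable example.

import numpy as np

def converged(beta_old, beta_new, tol=0.001):
    # Relative change in each coefficient (guarding against zeros).
    rel = np.abs(beta_new - beta_old) / np.maximum(np.abs(beta_old), 1e-8)
    # Both the largest relative change and the Euclidean norm of the
    # relative-change vector must fall below the tolerance.
    return rel.max() < tol and np.linalg.norm(rel) < tol

print(converged(np.array([1.00, -0.50]), np.array([1.0004, -0.5001])))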
Missing Data
Cases with missing data on any variables included in a model are deleted.
Basic Formulas
For the binary logistic regression model, the dependent variable for the ith case is $Y_i$,
taking on values of 0 (nonresponse) and 1 (response), and the probability of response
is a function of the covariate vector $x_i$ and the unknown coefficient vector $\beta$. We write
this probability as

$$\mathrm{Prob}(Y_i = 1 \mid x_i) = \frac{e^{x_i \beta}}{1 + e^{x_i \beta}}$$

and abbreviate it as $P_i$. The log-likelihood for the sample is given by

$$LL(\beta) = \sum_{i=1}^{n} \left[ Y_i \log P_i + (1 - Y_i) \log(1 - P_i) \right]$$
For the polytomous multinomial logit, the integer-valued dependent variable ranges
from 1 to $k$, and the probability that the ith case has $Y_i = m$, where $1 \le m \le k$, is

$$\mathrm{Prob}(Y_i = m \mid x_i) = \frac{e^{x_i \beta_m}}{\sum_{j=1}^{k} e^{x_i \beta_j}}$$

In this model, $k$ is fixed for all cases, there is a single covariate vector $x_i$ for each case,
and $k - 1$ parameter vectors are estimated; the model is identified by normalizing $\beta_k$ to 0.
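A compact numerical sketch of these probabilities (plain Python, outside SYSTAT): the two non-zero coefficient vectors are borrowed from the CULTURE-on-IQ example earlier in this chapter, the covariate value 110 is invented, and the last category is normalized to zero as in the text.

import numpy as np

k = 3                                   # number of response categories
x = np.array([1.0, 110.0])              # covariate vector (CONSTANT, IQ)
betas = np.array([[4.252, -0.065],      # category 1 coefficients
                  [3.287, -0.041],      # category 2 coefficients
                  [0.0,    0.0]])       # category k normalized to 0

u = betas @ x                           # linear predictors x * beta_j
p = np.exp(u) / np.exp(u).sum()         # Prob(Y = m | x) for m = 1..k
print(p.round(3), p.sum())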
McFadden's discrete choice model represents a distinct variant of the logit model
based on Luce's (1959) probabilistic choice model. Each subject is observed to make
a choice from a set $C_i$ consisting of $J_i$ elements. Each element is characterized by a
separate covariate vector of attributes $Z_k$. The dependent variable $Y_i$ ranges from 1 to
$J_i$, with $J_i$ possibly varying across subjects, and the probability that $Y_i = k$, where
$1 \le k \le J_i$, is a function of the attribute vectors $Z_1, Z_2, \ldots, Z_{J_i}$ and the parameter vector $\beta$.
The probability that the ith subject chooses element $m$ from his choice set is

$$\mathrm{Prob}(Y_i = m \mid Z) = \frac{e^{Z_m \beta}}{\sum_{j \in C_i} e^{Z_j \beta}}$$
Heuristically, this equation differs from the previous one in the components that vary
with alternative outcomes of the dependent variable. In the polytomous logit, the
coefficients are alternative-specific and the covariate vector is constant; in the discrete
choice model, while the attribute vector is alternative-specific, the coefficients are
constant. The models also differ in that the range of the dependent variable can be case-specific in the discrete choice model, while it is constant for all cases in the polytomous
model.
The polytomous logit can be recast as a discrete choice model in which each
covariate x is entered as an interaction with an alternative-specific dummy, and the
number of alternatives is constant for all cases. This reparameterization is used for the
mixed polytomous discrete choice model.
LOGIT can save a number of case-level diagnostics. The leverage measures are based on

$$b_j = x_j (X'VX)^{-1} x_j'$$

where $x_j$ is the covariate vector for the jth case, $X$ is the data matrix for the sample
(including a constant), and $V$ is a diagonal matrix whose jth diagonal element is

$$v_j = P_j (1 - P_j)$$

the fitted binomial variance for the jth case. $b_j$ is our LEVERAGE(2), and LEVERAGE(1) is

$$h_j = v_j b_j$$

The Pearson residual is

$$r_j = \frac{y_j - p_j}{\sqrt{p_j (1 - p_j)}}$$

The VARIANCE of the residual is

$$v_j (1 - h_j)$$

and the standardized residual STANDARD is

$$rs_j = \frac{r_j}{\sqrt{1 - h_j}}$$

The DEVIANCE residual is defined as

$$d_j = \sqrt{-2 \ln(p_j)}$$

for $y_j = 1$ and

$$d_j = -\sqrt{-2 \ln(1 - p_j)}$$

otherwise. DELDSTAT is the change in deviance,

$$\Delta D_j = d_j^2 / (1 - h_j)$$

and the corresponding change in the Pearson chi-square is $rs_j^2$.

The final three saved quantities, DELBETA(1), DELBETA(2), and DELBETA(3), are measures
of the overall change in the estimated parameter vector $\beta$ when the jth case is deleted;
each is computed from $rs_j$, $h_j$, and $1 - h_j$ (the familiar standardized form is $rs_j^2 h_j / (1 - h_j)$).
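A sketch of these saved quantities for a fitted binary logit follows (plain Python; the data are invented, and the change-in-fit and delta-beta measures use the standard forms above, which may differ in detail from LOGIT's exact definitions).

import numpy as np

# Fitted probabilities p, responses y, and design matrix X from some
# binary logit fit (tiny invented example).
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
p = np.array([0.20, 0.35, 0.65, 0.85])

v = p * (1 - p)                                   # VARIANCE of each response
V = np.diag(v)
B = X @ np.linalg.inv(X.T @ V @ X) @ X.T          # b_j on the diagonal: LEVERAGE(2)
b = np.diag(B)
h = v * b                                         # h_j = v_j * b_j: LEVERAGE(1)

r = (y - p) / np.sqrt(v)                          # Pearson residual
rs = r / np.sqrt(1 - h)                           # STANDARD(ized) residual
d = np.where(y == 1, np.sqrt(-2 * np.log(p)),     # DEVIANCE residual
             -np.sqrt(-2 * np.log(1 - p)))
dd = d**2 / (1 - h)                               # change in deviance (DELDSTAT)
dbeta = rs**2 * h / (1 - h)                       # one common delta-beta form
print(np.round(h, 3), np.round(rs, 3), np.round(dd, 3))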
References
Agresti, A. (1990). Categorical data analysis. New York: John Wiley & Sons, Inc.
Albert, A. and Anderson, J. A. (1984). On the existence of maximum likelihood estimates
in logistic regression models. Biometrika, 71, 1-10.
Amemiya, T. (1981). Qualitative response models: A survey. Journal of Economic
Literature, 1483-1536.
Begg, Colin B. and Gray, R. (1984). Calculation of polychotomous logistic regression
parameters using individualized regressions. Biometrika, 71, 11-18.
Beggs, S., Cardell, N. S., and Hausman, J. A. (1981). Assessing the potential demand for
electric cars. Journal of Econometrics, 16, 1-19.
Ben-Akiva, M. and Lerman, S. (1985). Discrete choice analysis. Cambridge, Mass.: MIT
Press.
Berndt, E. K., Hall, B. K., Hall, R. E., and Hausman, J. A. (1974). Estimation and inference
in non-linear structural models. Annals of Economic and Social Measurement, 3,
653-665.
Breslow, N. (1982). Covariance adjustment of relative-risk estimates in matched studies.
Biometrics, 38, 661-672.
Breslow, N. and Day, N. E. (1980). Statistical methods in cancer research, vol. II: The
design and analysis of cohort studies. Lyon: IARC.
Breslow, N., Day, N. E., Halvorsen, K. T., Prentice, R. L., and Sabai, C. (1978). Estimation
of multiple relative risk functions in matched case-control studies. American Journal of
Epidemiology, 108, 299-307.
Carson, R., Hanemann, M., and Steinberg, S. (1990). A discrete choice contingent
valuation estimate of the value of Kenai king salmon. Journal of Behavioral Economics,
19, 53-68.
Chamberlain, G. (1980). Analysis of covariance with qualitative data. Review of Economic
Studies, 47, 225-238.
Cook, D. R. and Weisberg, S. (1984). Residuals and influence in regression. New York:
Chapman and Hall.
Chapter 18
Loglinear Models
Laszlo Engelman
Loglinear models are useful for analyzing relationships among the factors of a
multiway frequency table. The loglinear procedure computes maximum likelihood
estimates of the parameters of a loglinear model by using the Newton-Raphson
method. For each user-specified model, a test of fit of the model is provided, along
with observed and expected cell frequencies, estimates of the loglinear parameters
(lambdas), standard errors of the estimates, the ratio of each lambda to its standard
error, and multiplicative effects (EXP(lambda)).
For each cell, you can request its contribution to the Pearson chi-square or the
likelihood-ratio chi-square. Deviates, standardized deviates, Freeman-Tukey
deviates, and likelihood-ratio deviates are available to characterize departures of the
observed values from expected values.
When searching for the best model, you can request tests after removing each first-order effect or interaction term one at a time individually or hierarchically (when a
lower-order effect is removed, so are its respective interaction terms). The models do
not need to be hierarchical.
A model can explain the frequencies well in most cells, but poorly in a few.
LOGLIN uses Freeman-Tukey deviates to identify the most divergent cell, fit a model
without it, and continue in a stepwise manner identifying other outlier cells that depart
from your model.
You can specify cells that contain structural zeros (cells that are empty naturally or
by design, not by sampling), and fit a model to the subset of cells that remain. A test
of fit for such a model is often called a test of quasi-independence.
Statistical Background
Researchers fit loglinear models to the cell frequencies of a multiway table in order to
describe relationships among the categorical variables that form the table. A loglinear
model expresses the logarithm of the expected cell frequency as a linear function
of certain parameters in a manner similar to that of analysis of variance.
To introduce loglinear models, recall how to calculate expected values for the
Pearson chi-square statistic. The expected value for a cell in a row i and column j is:
$$E_{ij} = \frac{R_i \times C_j}{N}$$

(Part of each expected value comes from the row it's in and part from the column it's
in.) Now take the log:

$$\ln E_{ij} = \ln R_i + \ln C_j - \ln N$$

let

$$\ln R_i = A_i, \qquad \ln C_j = B_j$$

and, absorbing the constant $-\ln N$ into the other terms, write

$$A_i + B_j$$

This expected value is computed under the null hypothesis of independence (that is,
there is no interaction between the table factors). If this hypothesis is rejected, you
would need more information than $A_i$ and $B_j$. In fact, the usual chi-square test can be
expressed as a test that the interaction term is needed in a model that estimates the log
of the cell frequencies. We write this model as:

$$\ln F_{ij} = \theta + A_i + B_j + AB_{ij}$$

or more commonly as:

$$\ln F_{ij} = \theta + \lambda_i^A + \lambda_j^B + \lambda_{ij}^{AB}$$

where $\theta$ is an overall mean effect and the $\lambda$ parameters sum to zero over the levels of
the row factor and the column factor. For a particular cell in a three-way table (a cell
in the ith row, jth column, and kth level of the third factor) we write:

$$\ln F_{ijk} = \theta + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{jk}^{BC} + \lambda_{ijk}^{ABC}$$

For more on loglinear models, see Agresti (1984), Fienberg (1980), Goodman (1978),
and Haberman (1978).
So, a loglinear model expresses the logarithm of the expected cell frequency as a linear
function of certain parameters in a manner similar to that of analysis of variance. An
important distinction between ANOVA and loglinear modeling is that in the latter, the
focus is on the need for interaction terms; while in ANOVA, testing for main effects is
the primary interest. Look back at the loglinear model for the two-way tablethe usual
chi-square tests the need for the ABij interaction, not for A alone or B alone.
The loglinear model for a three-way table is saturated because it contains all
possible terms or effects. Various smaller models can be formed by including only
selected combinations of effects (or equivalently testing that certain effects are 0). An
important goal in loglinear modeling is parsimony; that is, to see how few effects are
needed to estimate the cell frequencies. You usually don't want to test that the main
effect of a factor is 0 because this is the same as testing that the total frequencies are
equal for all levels of the factor. For example, a test that the main effect for SURVIVE$
(alive, dead) is 0 simply tests whether the total number of survivors equals the number
of nonsurvivors. If no interaction terms are included and the test is not significant (that
is, the model fits), you can report that the table factors are independent. When there are
more than two second-order effects, the test of an interaction is conditional on the other
interactions and may not have a simple interpretation.
one or more terms. If not significant, compare results with models with fewer
terms.
n For the model you select as best, examine fitted values and residuals, looking for
cells (or layers within the table) with large differences between observed and
expected (fitted) cell counts.
How do you determine which effects or terms to include in your loglinear model?
Ideally, by using your knowledge of the subject matter of your study, you have a
specific model in mindthat is, you want to make statements regarding the
independence of certain table factors. Otherwise, you may want to screen for effects.
The likelihood-ratio chi-square is additive under partitioning for nested models.
Two models are nested if all the effects of the first are a subset of the second. The
likelihood ratio chi-square is additive because the statistic for the second model can be
subtracted from that for the first. The difference provides a test of the additional
effects; that is, the difference in the two statistics has an asymptotic chi-square
distribution with degrees of freedom equal to the difference between those for the two
model chi-squares (or the difference between the number of effects in the two models).
This property does not hold for the Pearson chi-square. The additive property for the
likelihood ratio chi-square is useful for screening effects to include in a model.
If you are doing exploratory research and lack firm knowledge about which effects
to include, some statisticians suggest a strategy of starting with a large model and, step
by step, identifying effects to delete. (You compare each smaller model nested within
the larger one as described above.) But we caution you about multiple testing. If you
test many models in a search for your ideal model, remember that the p value
associated with a specific test is valid when you execute one and only one test. That is,
use p values as relative measures when you test several models.
the standard error of lambda, the ratio of lambda to its standard error, the
multiplicative effect (EXP()), and the indices of the table of factors.
fit.
n Ratio. Displays lambda divided by standard error of lambda. For large samples, this
chi-square for testing the fit of the model and the difference in the chi-square
statistics between the smaller model and the full model.
n Hterm. Tests each term by removing it and its higher order interactions from the
model. These tests are similar to those in Term except that only hierarchical models
are testedif a lower-order effect is removed, so are the higher-order effects that
include it.
To examine the parameters, you can request the coefficients of the design variables,
the covariance matrix of the parameters, the correlation matrix of the parameters, and
the additive effect of each level for each term (lambda).
In addition, for each cell you can choose to display the observed frequency, the
expected frequency, the standardized deviate, the standard error of lambda, the
observed minus the expected frequency, the likelihood ratio of the deviate, the
Freeman-Tukey deviate, the contribution to Pearson chi-square, and the contribution
to the model's log-likelihood.
Finally, you can select the number of cells to identify as outlandish. The first cell
has the largest Freeman-Tukey deviate (these deviates are similar to z scores when the
data are from a Poisson distribution). It is treated as a structural zero, the model is fit
to the remaining cells, and the cell with the largest Freeman-Tukey deviate is
identified. This process continues step by step, each time including one more cell as a
structural zero and refitting the model.
Structural Zeros
A cell is declared to be a structural zero when the probability is zero that there are
counts in the cell. Notice that such zero frequencies do not arise because of small
samples but because the cells are empty naturally (a male hysterectomy patient) or by
design (the diagonal of a two-way table comparing fathers' (rows) and sons' (columns)
occupations is not of interest when studying changes or mobility). A model can then
be fit to the subset of cells that remain. A test of fit for such a model is often called a
test of quasi-independence.
To specify structural zeros, click Zero in the Loglinear Model dialog box.
When fitting a model, LOGLIN excludes cells identified as structural zeros, and then,
as in a regression analysis with zero weight cases, it can compute expected values,
deviates, and so on, for all cells including the structural zero cells.
You might consider identifying cells as structural zeros when:
n It is meaningful to the study at hand to exclude some cells; for example, the
are one or two aberrant cells. That is, after you select the best model, fit a second
model with fewer effects and identify the outlier cells (the most outlandish cells)
for the smaller model. Then refit the best model declaring the outlier cells to be
structural zeros. If the additional interactions are no longer necessary, you might
report the smaller model, adding a sentence describing how the unusual cell(s)
depart from the model.
Simply specify the table factors in the same order in which you want to view them from
left to right. In other words, the last variable selected defines the columns of the table
and cross-classifications of all preceding variables define the rows.
Although you can also form multiway tables using Crosstabs, tables for loglinear
models are more compact and easy to read. Crosstabs forms a series of two-way tables
stratified by all combinations of the other table factors. Loglinear models create one
table, with the rows defined by factor combinations. However, loglinear model tables
do not display marginal totals, whereas Crosstabs tables do.
Using Commands
First, specify your data with USE filename. Continue with:
LOGLIN
FREQ var
TABULATE var1*var2*
MODEL variables defining table = terms of model
ZERO CELL n1, n2,
SAVE filename / ESTIMATES or LAMBDAS
PRINT SHORT or MEDIUM or LONG or NONE ,
/ OBSFREQ CHISQ RATIO MLE EXPECT STAND ELAMBDA ,
TERM HTERM COVA CORR LAMBDA SELAMBDA DEVIATES ,
LRDEV FTDEV PEARSON LOGLIKE CELLS=n
ESTIMATE / DELTA=n LCONV=n CONV=n TOL=n ITER=n HALF=n
Usage Considerations
Types of data. LOGLIN uses a cases-by-variables rectangular file or data recorded as
frequencies with cell indices.
Print options. You can control what report panels appear in the output by globally
setting output length to SHORT, MEDIUM, or LONG. You can also use the PRINT
command in LOGLIN to request reports individually. You can specify individual panels
by specifying the particular option.
Short output panels include the observed frequency for each cell, the Pearson and
likelihood-ratio chi-square statistics, lambdas divided by their standard errors, the log
of the model's maximized likelihood value, and a report of the three most outlandish
cells.
Medium results include all of the above, plus the following: the expected frequency
for each cell (current model), standardized deviations, multiplicative effects, a test of
each term by removing it from the model, a test of each term by removing it and its
higher-order interactions from the model, and the five most outlandish cells.
Long results add the following: coefficients of design variables, the covariance
matrix of the parameters, the correlation matrix of the parameters, the additive effect
of each level for each term, the standard errors of the lambdas, the observed minus the
expected frequency for each cell, the contribution to the Pearson chi-square from each
cell, the likelihood-ratio deviate for each cell, the Freeman-Tukey deviate for each cell,
the contribution to the model's log-likelihood from each cell, and the 10 most
outlandish cells.
As a PRINT option, you can also specify CELLS=n, where n is the number of
outlandish cells to identify.
Examples
Example 1
Loglinear Modeling of a Four-Way Table
In this example, you use the Morrison breast cancer data stored in the CANCER data
file (Bishop et al., 1975) and treat the data as a four-way frequency table:
CENTER$
SURVIVE$
AGE
TUMOR$
The CANCER data include one record for each of the 72 cells formed by the four table
factors. Each record includes a variable, NUMBER, that has the number of women in
the cell plus numeric or character value codes to identify the levels of the four factors
that define the cell.
For the first model of the CANCER data, you include three two-way interactions.
The input is:
USE cancer
LOGLIN
FREQ = number
LABEL age / 50=Under 50, 60=50 to 69, 70=70 & Over
ORDER center$ survive$ tumor$ / SORT=NONE
MODEL center$*age*survive$*tumor$ = center$ + age,
+ survive$ + tumor$,
+ age*center$,
+ survive$*center$,
+ tumor$*center$
PRINT SHORT / EXPECT LAMBDAS
ESTIMATE / DELTA=0.5
The MODEL statement has two parts: table factors and terms (effects to fit). Table
factors appear to the left of the equals sign and terms are on the right. The layout of the
table is determined by the order in which the variables are specified; for example,
specify TUMOR$ last so its levels determine the columns.
The LABEL statement assigns category names to the numeric codes for AGE. If the
statement is omitted, the data values label the categories. By default, SYSTAT orders
string variables alphabetically, so we specify SORT = NONE to list the categories for
the other factors as they first appear in the data file.
We specify DELTA = 0.5 to add 0.5 to each cell frequency. This option is common
in multiway table procedures as an aid when some cell sizes are sparse. It is of little
use in practice and is used here only to make the results compare with those reported
elsewhere.
Number of cells (product of levels):      72
Total count:                             764
Observed Frequencies
====================
CENTER$
AGE
SURVIVE$ |
TUMOR$
| MinMalig
MinBengn
MaxMalig
MaxBengn
---------+---------+---------+------------------------------------------------Tokyo
Under 50 Dead
|
9.000
7.000
4.000
3.000
Alive
|
26.000
68.000
25.000
9.000
+
50 to 69 Dead
|
9.000
9.000
11.000
2.000
Alive
|
20.000
46.000
18.000
5.000
+
70 & Over Dead
|
2.000
3.000
1.000
0.0
Alive
|
1.000
6.000
5.000
1.000
---------+---------+---------+------------------------------------------------Boston
Under 50 Dead
|
6.000
7.000
6.000
0.0
Alive
|
11.000
24.000
4.000
0.0
+
50 to 69 Dead
|
8.000
20.000
3.000
2.000
Alive
|
18.000
58.000
10.000
3.000
+
70 & Over Dead
|
9.000
18.000
3.000
0.0
Alive
|
15.000
26.000
1.000
1.000
---------+---------+---------+------------------------------------------------Glamorgn Under 50 Dead
|
16.000
7.000
3.000
0.0
Alive
|
16.000
20.000
8.000
1.000
+
50 to 69 Dead
|
14.000
12.000
3.000
0.0
Alive
|
27.000
39.000
10.000
4.000
+
70 & Over Dead
|
3.000
7.000
3.000
0.0
Alive
|
12.000
11.000
4.000
1.000
-----------------------------+-------------------------------------------------
Pearson ChiSquare     57.5272   df 51   Probability 0.24635
LR ChiSquare          55.8327   df 51   Probability 0.29814
Raftery's BIC       -282.7342
Dissimilarity          9.9530
Expected Values
===============
CENTER$
AGE
SURVIVE$ |
TUMOR$
| MinMalig
MinBengn
MaxMalig
MaxBengn
---------+---------+---------+------------------------------------------------Tokyo
Under 50 Dead
|
7.852
15.928
7.515
2.580
Alive
|
28.076
56.953
26.872
9.225
+
50 to 69 Dead
|
6.281
12.742
6.012
2.064
Alive
|
22.460
45.563
21.498
7.380
+
70 & Over Dead
|
1.165
2.363
1.115
0.383
Alive
|
4.166
8.451
3.988
1.369
---------+---------+---------+------------------------------------------------Boston
Under 50 Dead
|
5.439
12.120
2.331
0.699
Alive
|
10.939
24.378
4.688
1.406
+
50 to 69 Dead
|
11.052
24.631
4.737
1.421
Alive
|
22.231
49.542
9.527
2.858
+
70 & Over Dead
|
6.754
15.052
2.895
0.868
Alive
|
13.585
30.276
5.822
1.747
---------+---------+---------+------------------------------------------------Glamorgn Under 50 Dead
|
9.303
10.121
3.476
0.920
Alive
|
19.989
21.746
7.468
1.977
+
50 to 69 Dead
|
14.017
15.249
5.237
1.386
Alive
|
30.117
32.764
11.252
2.979
+
70 & Over Dead
|
5.582
6.073
2.086
0.552
Alive
|
11.993
13.048
4.481
1.186
-----------------------------+-------------------------------------------------
CENTER$
|
AGE
| Under 50
50 to 69
70 & Over
---------+------------------------------------Tokyo
|
0.565
0.043
-0.609
Boston
|
-0.454
-0.043
0.497
Glamorgn |
-0.111
-0.000
0.112
---------+------------------------------------CENTER$
|
SURVIVE$
| Dead
Alive
---------+------------------------Tokyo
|
-0.181
0.181
Boston
|
0.107
-0.107
Glamorgn |
0.074
-0.074
---------+------------------------CENTER$
|
TUMOR$
| MinMalig
MinBengn
MaxMalig
MaxBengn
---------+------------------------------------------------Tokyo
|
-0.368
-0.191
0.214
0.345
Boston
|
0.044
0.315
-0.178
-0.181
Glamorgn |
0.323
-0.123
-0.036
-0.164
---------+-------------------------------------------------
Lambda / SE(Lambda)
===================
THETA
------------1.826
------------CENTER$
Tokyo
Boston
Glamorgn
------------------------------------0.596
0.014
-0.586
------------------------------------AGE
Under 50
50 to 69
70 & Over
------------------------------------2.627
8.633
-8.649
------------------------------------SURVIVE$
Dead
Alive
-------------------------11.548
11.548
------------------------TUMOR$
MinMalig
MinBengn
MaxMalig
MaxBengn
------------------------------------------------6.775
15.730
-1.718
-10.150
------------------------------------------------CENTER$
|
AGE
| Under 50
50 to 69
70 & Over
---------+------------------------------------Tokyo
|
7.348
0.576
-5.648
Boston
|
-5.755
-0.618
5.757
Glamorgn |
-1.418
-0.003
1.194
---------+-------------------------------------
CENTER$
|
SURVIVE$
| Dead
Alive
---------+------------------------Tokyo
|
-3.207
3.207
Boston
|
1.959
-1.959
Glamorgn |
1.304
-1.304
---------+------------------------CENTER$
|
TUMOR$
| MinMalig
MinBengn
MaxMalig
MaxBengn
---------+------------------------------------------------Tokyo
|
-3.862
-2.292
2.012
2.121
Boston
|
0.425
3.385
-1.400
-0.910
Glamorgn |
3.199
-1.287
-0.289
-0.827
Model ln(MLE): -160.563

The three most outlandish cells:
CENTER$
| AGE
| | SURVIVE$
| | | TUMOR$
- - - -
1 1 1 2
2 3 2 3
3 1 1 1
Initially, SYSTAT produces a frequency table for the data. We entered cases for 72
cells. The total frequency count across these cells is 764; that is, there are 764 women
in the sample. Notice that the order of the factors is the same order we specified in the
MODEL statement. The last variable (TUMOR$) defines the columns; the remaining
variables define the rows.
The test of fit is not significant for either the Pearson chi-square or the likelihood-ratio test, indicating that your model with its three two-way interactions does not
disagree with the observed frequencies. The model statement describes an association
between study center and age, survival, and tumor status. However, at each center, the
other three factors are independent. Because the overall goal is parsimony, we could
explore whether any of the interactions can be dropped.
Rafterys BIC (Bayesian Information Criterion) adjusts the chi-square for both the
complexity of the model (measured by degrees of freedom) and the size of the sample.
It is the likelihood-ratio chi-square minus the degrees of freedom for the current model
times the natural log of the sample size. If BIC is negative, you can conclude that the
model is preferable to the saturated model. When comparing alternative models, select
the model with the lowest BIC value.
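For the model above, the reported BIC can be reproduced directly from the likelihood-ratio chi-square, its degrees of freedom, and the total count (a short sketch in plain Python, outside SYSTAT):

import math

g2 = 55.8327        # likelihood-ratio chi-square of the current model
df = 51             # its degrees of freedom
n = 764             # total frequency count

bic = g2 - df * math.log(n)
print(round(bic, 4))    # about -282.73, matching Raftery's BIC in the output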
The index of dissimilarity can be interpreted as the percentage of cases that need to
be relocated in order to make the observed and expected counts equal. For these data,
you would have to move about 9.95% of the cases to make the expected frequencies fit.
The expected frequencies are obtained by fitting the loglinear model to the observed
frequencies. Compare these values with the observed frequencies. Values for
corresponding cells will be similar if the model fits well.
After the expected values, SYSTAT lists the parameter estimates for the model you
requested. Usually, it is of more interest to examine these estimates divided by their
standard errors. Here, however, we display them in order to relate them to the expected
values. For example, the observed frequency for the cell in the upper left corner (Tokyo,
Under 50, Dead, MinMalig) is 9. To find the expected frequency under your model, you
add the estimates (from each panel, select the term that corresponds to your cell):
            theta   CENTER$       AGE  SURVIVE$    TUMOR$       C*A       C*S       C*T
            1.826     0.049     0.145    -0.456     0.480     0.565    -0.181    -0.368
The estimates sum to 2.060, and exponentiating the sum gives 7.846. In the panel of
expected values, this number is printed as 7.852 (in its calculations, SYSTAT uses more
digits following the decimal point).
Thus, for this cell, the sample includes 9 women (observed frequency) and the model
predicts 7.85 women (expected frequency).
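The arithmetic for that cell can be spelled out in a few lines (plain Python, outside SYSTAT; the estimates are the ones listed above):

import math

# Additive (log-scale) effects for the cell (Tokyo, Under 50, Dead, MinMalig).
theta   = 1.826
center  = 0.049
age     = 0.145
survive = -0.456
tumor   = 0.480
c_by_a  = 0.565
c_by_s  = -0.181
c_by_t  = -0.368

log_expected = (theta + center + age + survive + tumor
                + c_by_a + c_by_s + c_by_t)          # = 2.060
print(round(math.exp(log_expected), 3))              # about 7.85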
The ratio of the parameter estimates to their asymptotic standard errors is part of the
default output. Examine these values to better understand the relationships among the
table factors. Because, for large samples, this ratio can be interpreted as a standard
normal deviate (z score), you can use it to indicate significant parameters; for
example, for an interaction term, significant positive (or negative) associations. In the
CENTER$ by AGE panel, the ratio for young women from Tokyo is very large (7.348),
implying a significant positive association, and that for older Tokyo women is
extremely negative (-5.648). The reverse is true for the women from Boston. If you use
the Column Percent option in XTAB to print column percentages for CENTER$ by
AGE, you will see that among the women under 50, more than 50% are from Tokyo
(52.1), while only 23% are from Boston. In the 70 and over age group, 14% are from
Tokyo and 55% are from Boston.
The Alive estimate for Tokyo shows a strong positive association (3.207) with
survival in Tokyo. The relationship in Boston is negative (-1.959). In this study, the
overall survival rate is 72.5%. In Tokyo, 79.3% of the women survived, while in
Boston, 67.6% survived. There is a negative association for having a malignant tumor
with minimal inflammation in Tokyo (-3.862). The same relationship is strongly
positive in Glamorgan (3.199).
Cells that depart from the current model are identified as outlandish in a stepwise
manner. The first cell has the largest Freeman-Tukey deviate (these deviates are
similar to z scores when the data are from a Poisson distribution). It is treated as a
structural zero, the model is fit to the remaining cells, and the cell with the largest
Freeman-Tukey deviate is identified. This process continues step by step, each time
including one more cell as a structural zero and refitting the model.
For the current model, the count in the cell corresponding to the youngest
nonsurvivors from Tokyo with benign tumors and minimal inflammation (Tokyo,
Under 50, Dead, MinBengn) differs the most from its expected value. There are 7
women in the cell and the expected value is 15.9 women. The next most unusual cell
is 2,3,2,3 (Boston, 70 & Over, Alive, MaxMalig), and so on.
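The Freeman-Tukey deviate that flags that cell can be computed from the observed and expected counts. The sketch below uses the usual definition, sqrt(obs) + sqrt(obs + 1) - sqrt(4*exp + 1); treating that as LOGLIN's exact formula is an assumption, and the code is plain Python rather than SYSTAT.

import math

def freeman_tukey(obs, exp):
    # Usual Freeman-Tukey deviate; roughly a z score for Poisson counts.
    return math.sqrt(obs) + math.sqrt(obs + 1) - math.sqrt(4 * exp + 1)

# Tokyo, Under 50, Dead, MinBengn: 7 observed versus 15.9 expected.
print(round(freeman_tukey(7, 15.928), 3))   # a large negative deviate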
Medium Output
We continue the previous analysis, repeating the same model, but changing the PRINT
(output length) setting to request medium-length results:
USE cancer
LOGLIN
FREQ = number
LABEL age / 50=Under 50, 60=50 to 69, 70=70 & Over
ORDER center$ survive$ tumor$ / SORT=NONE
MODEL center$*age*survive$*tumor$ = age # center$,
+ survive$ # center$,
+ tumor$ # center$
PRINT MEDIUM
ESTIMATE / DELTA=0.5
Notice that we use shortcut notation to specify the model. The output includes:
Standardized Deviates = (Obs-Exp)/sqrt(Exp)
===========================================
CENTER$
AGE
SURVIVE$ |
TUMOR$
| MinMalig
MinBengn
MaxMalig
MaxBengn
---------+---------+---------+------------------------------------------------Tokyo
Under 50 Dead
|
0.410
-2.237
-1.282
0.262
Alive
|
-0.392
1.464
-0.361
-0.074
+
50 to 69 Dead
|
1.085
-1.048
2.034
-0.044
Alive
|
-0.519
0.065
-0.754
-0.876
+
70 & Over Dead
|
0.774
0.414
-0.109
-0.619
Alive
|
-1.551
-0.843
0.507
-0.315
---------+---------+---------+------------------------------------------------Boston
Under 50 Dead
|
0.241
-1.471
2.403
-0.836
Alive
|
0.018
-0.077
-0.318
-1.186
+
50 to 69 Dead
|
-0.918
-0.933
-0.798
0.486
Alive
|
-0.897
1.202
0.153
0.084
+
70 & Over Dead
|
0.864
0.760
0.062
-0.932
Alive
|
0.384
-0.777
-1.999
-0.565
---------+---------+---------+------------------------------------------------Glamorgn Under 50 Dead
|
2.196
-0.981
-0.255
-0.959
Alive
|
-0.892
-0.374
0.195
-0.695
+
50 to 69 Dead
|
-0.004
-0.832
-0.977
-1.177
Alive
|
-0.568
1.089
-0.373
0.592
+
70 & Over Dead
|
-1.093
0.376
0.633
-0.743
Alive
|
0.002
-0.567
-0.227
-0.171
-----------------------------+------------------------------------------------Multiplicative Effects = exp(Lambda)
====================================
THETA
------------6.209
------------AGE
Under 50
50 to 69
70 & Over
------------------------------------1.156
1.559
0.555
------------------------------------CENTER$
Tokyo
Boston
Glamorgn
------------------------------------1.050
1.001
0.951
------------------------------------SURVIVE$
Dead
Alive
------------------------0.634
1.578
------------------------TUMOR$
MinMalig
MinBengn
MaxMalig
MaxBengn
------------------------------------------------1.616
2.748
0.865
0.260
-------------------------------------------------
CENTER$
|
AGE
| Under 50
50 to 69
70 & Over
---------+------------------------------------Tokyo
|
1.760
1.044
0.544
Boston
|
0.635
0.958
1.644
Glamorgn |
0.895
1.000
1.118
---------+------------------------------------CENTER$
|
SURVIVE$
| Dead
Alive
---------+------------------------Tokyo
|
0.835
1.198
Boston
|
1.113
0.899
Glamorgn |
1.077
0.929
---------+------------------------CENTER$
|
TUMOR$
| MinMalig
MinBengn
MaxMalig
MaxBengn
---------+------------------------------------------------Tokyo
|
0.692
0.826
1.238
1.412
Boston
|
1.045
1.370
0.837
0.834
Glamorgn |
1.382
0.884
0.965
0.849
---------+------------------------------------------------Model ln(MLE): -160.563
Term tested
--------------AGE. . . . . .
CENTER$. . . .
SURVIVE$ . . .
TUMOR$ . . . .
CENTER$
* AGE. . . . .
CENTER$
* SURVIVE$ . .
CENTER$
* TUMOR$ . . .
Term tested
hierarchically
--------------AGE. . . . . .
CENTER$. . . .
SURVIVE$ . . .
TUMOR$ . . . .
-196.672
128.05
55
0.0000
72.22
0.0000
-166.007
66.72
53
0.0975
10.89
0.0043
-178.267
91.24
57
0.0027
35.41
0.0000
CENTER$
| AGE
| | SURVIVE$
| | | TUMOR$
- - - 1 1 1 2
2 3 2 3
3 1 1 1
2 1 1 3
1 2 1 3
The goodness-of-fit tests provide an overall indication of how close the expected
values are to the cell counts. Just as you study residuals for each case in multiple
regression, you can use deviates to compare the observed and expected values for each
cell. A standardized deviate is the square root of each cell's contribution to the Pearson
chi-square statistic; that is, (the observed frequency minus the expected frequency)
divided by the square root of the expected frequency. These values are similar to z
scores. For the second cell in the first row, the expected value under your model is
considerably larger than the observed count (its deviate is -2.237, the observed count
is 7, and the expected count is 15.9). Previously, this cell was identified as the most
outlandish cell using Freeman-Tukey deviates.
Note that LOGLIN produces five types of deviates or residuals: standardized, the
observed minus the expected frequency, the likelihood-ratio deviate, the Freeman-Tukey deviate, and the Pearson deviate.
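The standardized deviate quoted for that cell is easy to verify (plain Python, outside SYSTAT):

import math

observed, expected = 7.0, 15.928
std_deviate = (observed - expected) / math.sqrt(expected)
print(round(std_deviate, 3))   # about -2.24, as in the panel above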
Estimates of the multiplicative parameters equal e^(lambda). Look for values that
depart markedly from 1.0.
depart markedly from 1.0. Very large values indicate an increased probability for that
combination of indices and, conversely, a value considerably less than 1.0 indicates an
unlikely combination. A test of the hypothesis that a multiplicative parameter equals
1.0 is the same as that for lambda equal to 0; so use the values of (lambda)/S.E. to test
the values in this panel. For the CENTER$ by AGE interaction, the most likely
combination is women under 50 from Tokyo (1.76); the least likely combination is
women 70 and over from Tokyo (0.544).
After listing the multiplicative effects, SYSTAT tests reduced models by removing
each first-order effect and each interaction from the model one at a time. For each
smaller model, LOGLIN provides:
n A likelihood-ratio chi-square for testing the fit of the model
n The difference in the chi-square statistics between the smaller model and the full
model
The likelihood-ratio chi-square for the full model is 55.833. For a model that omits
AGE, the likelihood-ratio chi-square is 166.95. This smaller model does not fit the
observed frequencies (p value < 0.00005). To determine whether the removal of this
term results in a significant decrease in the fit, look at the difference in the statistics:
166.95 - 55.833 = 111.117, p value < 0.00005. The fit worsens significantly when AGE
is removed from the model.
From the second line in this panel, it appears that a model without the first-order
term for CENTER$ fits (p value = 0.3523). However, removing any of the two-way
interactions involving CENTER$ significantly decreases the model fit.
The hierarchical tests are similar to the preceding tests except that only hierarchical
models are testedif a lower-order effect is removed, so are the higher-order effects
that include it. For example, in the first line, when CENTER$ is removed, the three
interactions with CENTER$ are also removed. The reduction in the fit is significant
(p < 0.00005). Although removing the first-order effect of CENTER$ does not
significantly alter the fit, removing the higher-order effects involving CENTER$
decreases the fit substantially.
Example 2
Screening Effects
In this example, you pretend that no models have been fit to the CANCER data (that is,
you have not seen the other examples). As a place to start, first fit a model with all
second-order interactions, finding that it fits. Then fit models nested within the first by
using results from the HTERM (terms tested hierarchically) panel to guide your
selection of terms to be removed.
Here's a summary of your instructions: you study the output generated from the first
MODEL and ESTIMATE statements and decide to remove AGE by TUMOR$. After
seeing the results for this smaller model, you decide to remove AGE by SURVIVE$, too.
USE cancer
LOGLIN
FREQ = number
PRINT NONE / CHI HTERM
MODEL center$*age*survive$*tumor$ = tumor$..center$^2
ESTIMATE / DELTA=0.5
MODEL center$*age*survive$*tumor$ = tumor$..center$^2,
                                    - age*tumor$
ESTIMATE / DELTA=0.5
MODEL center$*age*survive$*tumor$ = tumor$..center$^2,
                                    - age*tumor$,
                                    - age*survive$
ESTIMATE / DELTA=0.5
MODEL center$*age*survive$*tumor$ = tumor$..center$^2,
                                    - age*tumor$,
                                    - age*survive$,
                                    - tumor$*survive$
ESTIMATE / DELTA=0.5
40.1650
39.9208
-225.6219
7.6426
df
df
40
40
Probability
Probability
50.10
43
0.2125
0.46294
0.47378
0.0171
AGE
* TUMOR$ .
AGE
* SURVIVE$
CENTER$
* TUMOR$ .
CENTER$
* SURVIVE$
CENTER$
* AGE. . .
. .
-153.343
41.39
46
0.6654
1.47
0.9613
. .
-154.693
44.09
42
0.3831
4.17
0.1241
. .
-169.724
74.15
46
0.0053
34.23
0.0000
. .
-156.501
47.71
42
0.2518
7.79
0.0204
. .
-186.011
106.73
44
0.0000
66.81
0.0000
41.8276
41.3934
-263.9807
7.8682
df
df
46
46
Probability
Probability
51.61
49
0.3719
-155.452
45.61
48
-171.415
77.54
52
-157.291
49.29
48
-187.702
110.11
50
0.64757
0.66536
0.0168
0.5713
4.22
0.1214
0.0124
36.14
0.0000
0.4214
7.90
0.0193
0.0000
68.72
0.0000
45.3579
45.6113
-273.0400
8.4720
df
df
48
48
Probability
Probability
0.58174
0.57126
-160.563
55.83
51
0.2981
10.22
0.0168
-173.524
81.75
54
0.0087
36.14
0.0000
-161.264
57.23
50
0.2245
11.62
0.0030
-191.561
117.83
52
0.0000
72.22
0.0000
57.5272
55.8327
-282.7342
9.9530
df
df
51
51
Probability
Probability
0.24635
0.29814
Term tested
hierarchically
--------------TUMOR$ . . . .
SURVIVE$ . . .
AGE. . . . . .
CENTER$. . . .
CENTER$
* TUMOR$ . . .
CENTER$
* SURVIVE$ . .
CENTER$
* AGE. . . . .
91.24
57
0.0027
35.41
0.0000
-166.007
66.72
53
0.0975
10.89
0.0043
-196.672
128.05
55
0.0000
72.22
0.0000
The likelihood-ratio chi-square for the model that includes all two-way interactions is
39.9 (p value = 0.4738). If the AGE by TUMOR$ interaction is removed, the chi-square
for the smaller model is 41.39 (p value = 0.6654). Does the removal of this interaction
cause a significant change? No, chi-square = 1.47 (p value = 0.9613). This chi-square
is computed as 41.39 minus 39.92 with 46 minus 40 degrees of freedom. The removal
of this interaction results in the least change, so you remove it first. Notice also that the
estimate of the maximized likelihood function is largest when this second-order effect
is removed (-153.343).
The model chi-square for the second model is the same as that given for the first
model with AGE * TUMOR$ removed (41.3934). Here, if AGE by SURVIVE$ is
removed, the new model fits (p value = 0.5713) and the change between the model
minus one interaction and that minus two interactions is insignificant (p value =
0.1214).
If SURVIVE$ by TUMOR$ is removed from the current model with four
interactions, the new model fits (p value = 0.2981). The change in fit is not significant
(p = 0.0168). Should we remove any other terms? Looking at the HTERM panel for the
model with three interactions, you see that a model without CENTER$ by SURVIVE$
has a marginal fit (p value = 0.0975) and the chi-square for the difference is significant
(p value = 0.0043). Although the goal is parsimony and technically a model with only
two interactions does fit, you opt for the model that also includes CENTER$ by
SURVIVE$ because it is a significant improvement over the very smallest model.
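The selection logic used in this screening example can be mimicked outside SYSTAT. The sketch below (Python with SciPy; illustrative only, not SYSTAT's own code) takes the chi-squares and df from the HTERM panel of the first model and picks the interaction whose removal changes the fit least, provided the change is not significant at a conventional 0.05 level:

from scipy.stats import chi2

current = (39.92, 40)                    # LR chi-square and df of the current model
without = {                              # values from the first HTERM panel above
    "AGE*TUMOR$":       (41.39, 46),
    "AGE*SURVIVE$":     (44.09, 42),
    "CENTER$*TUMOR$":   (74.15, 46),
    "CENTER$*SURVIVE$": (47.71, 42),
    "CENTER$*AGE":      (106.73, 44),
}

def next_term_to_drop(current, without, alpha=0.05):
    """Pick the term whose removal is least damaging and not significant."""
    best, best_p = None, -1.0
    for term, (chisq, df) in without.items():
        p = chi2.sf(chisq - current[0], df - current[1])
        if p > alpha and p > best_p:
            best, best_p = term, p
    return best, best_p

print(next_term_to_drop(current, without))   # ('AGE*TUMOR$', ~0.96)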
Example 3
Structural Zeros
This example identifies outliers and then declares them to be structural zeros. You
wonder if any of the interactions in the model fit in the example on loglinear modeling
for a four-way table are necessary only because of a few unusual cells. To identify the
unusual cells, first pull back from your ideal model and fit a model with main effects
only, asking for the four most unusual cells. (Why four cells? Because 5% of 72 cells
is 3.6 or roughly 4.)
USE cancer
LOGLIN
FREQ = number
ORDER center$ survive$ tumor$ / SORT=NONE
MODEL center$*age*survive$*tumor$ = tumor$ .. center$
PRINT / CELLS=4
ESTIMATE / DELTA=0.5
Of course this model doesn't fit, but following are selections from the output:
Pearson ChiSquare    181.3892   df  63   Probability  0.00000
LR ChiSquare         174.3458   df  63   Probability  0.00000
Raftery's BIC       -243.8839
Dissimilarity         19.3853

The four most unusual cells:

CENTER$
| AGE
| | SURVIVE$
| | | TUMOR$
- - - -
1 1 2 2
1 3 2 1
1 1 2 3
1 3 2 2
Next, fit your ideal model, identifying these four cells as structural zeros and also
requesting PRINT / HTERM to test the need for each interaction term.
Following are selections from the output. Notice that asterisks mark the structural zero
cells.
Number of cells (product of levels):    72
Number of structural zero cells:         4
Total count:                           664
Observed Frequencies
====================
CENTER$   AGE       SURVIVE$ |                   TUMOR$
                             |  MinMalig   MinBengn   MaxMalig   MaxBengn
---------+---------+---------+--------------------------------------------
Tokyo     Under 50  Dead     |     9.000      7.000      4.000      3.000
                    Alive    |    26.000    *68.000    *25.000      9.000
          50 to 69  Dead     |     9.000      9.000     11.000      2.000
                    Alive    |    20.000     46.000     18.000      5.000
          70 & Over Dead     |     2.000      3.000      1.000      0.0
                    Alive    |    *1.000     *6.000      5.000      1.000
---------+---------+---------+--------------------------------------------
Boston    Under 50  Dead     |     6.000      7.000      6.000      0.0
                    Alive    |    11.000     24.000      4.000      0.0
          50 to 69  Dead     |     8.000     20.000      3.000      2.000
                    Alive    |    18.000     58.000     10.000      3.000
          70 & Over Dead     |     9.000     18.000      3.000      0.0
                    Alive    |    15.000     26.000      1.000      1.000
---------+---------+---------+--------------------------------------------
Glamorgn  Under 50  Dead     |    16.000      7.000      3.000      0.0
                    Alive    |    16.000     20.000      8.000      1.000
          50 to 69  Dead     |    14.000     12.000      3.000      0.0
                    Alive    |    27.000     39.000     10.000      4.000
          70 & Over Dead     |     3.000      7.000      3.000      0.0
                    Alive    |    12.000     11.000      4.000      1.000
-----------------------------+--------------------------------------------
* indicates structural zero cells
Pearson ChiSquare     46.8417   df  47   Probability  0.47906
LR ChiSquare          44.8815   df  47   Probability  0.56072
Raftery's BIC       -260.5378
Dissimilarity         10.1680
Term tested            Model without the term                Removal of the term
hierarchically       log-Likelihood   ChiSq   df   p-value     ChiSq   p-value
---------------
CENTER$ * AGE. . .                     69.75   51    0.0416             0.0001
CENTER$ * SURVIVE$       -149.166      50.28   49    0.4226      5.40   0.0674
CENTER$ * TUMOR$ .       -162.289      76.52   53    0.0189     31.64   0.0000
The model has a nonsignificant test of fit and so does a model without the CENTER$
by SURVIVE$ interaction (p value = 0.4226).
Number of cells (product of levels):    72
Number of structural zero cells:         2
Total count:                           671

Pearson ChiSquare     50.2610   df  49   Probability  0.42326
LR ChiSquare          49.1153   df  49   Probability  0.46850
Raftery's BIC       -269.8144
Dissimilarity         10.6372

Term tested            Model without the term                Removal of the term
hierarchically       log-Likelihood   ChiSq   df   p-value     ChiSq   p-value
---------------
CENTER$ * AGE. . .       -172.356      90.57   53    0.0010     41.45   0.0000
CENTER$ * SURVIVE$       -153.888      53.63   51    0.3737      4.52   0.1045
CENTER$ * TUMOR$ .       -169.047      83.95   55    0.0072     34.84   0.0000
When the two cells for the young women from Tokyo are excluded from the model
estimation, the CENTER$ by SURVIVE$ effect is not needed (p value = 0.3737).
Number of cells (product of levels):    72
Number of structural zero cells:         2
Total count:                           757

Pearson ChiSquare     53.4348   df  49   Probability  0.30782
LR ChiSquare          50.9824   df  49   Probability  0.39558
Raftery's BIC       -273.8564
Dissimilarity          9.4583

Term tested            Model without the term                Removal of the term
hierarchically       log-Likelihood   ChiSq   df   p-value     ChiSq   p-value
---------------
CENTER$ * AGE. . .       -177.799      96.39   53    0.0003     45.41   0.0000
CENTER$ * SURVIVE$       -161.382      63.56   51    0.1114     12.58   0.0019
CENTER$ * TUMOR$ .       -171.123      83.04   55    0.0086     32.06   0.0000
When the two cells for the women from the older age group are treated as structural
zeros, the case for removing the CENTER$ by SURVIVE$ effect is much weaker than
when the cells for the younger women are structural zeros. Here, the inclusion of the
effect results in a significant improvement in the fit of the model (p value = 0.0019).
Conclusion
The structural zero feature allowed you to quickly focus on 2 of the 72 cells in your
multiway table: the survivors under 50 from Tokyo, especially those with benign
tumors with minimal inflammation. The overall survival rate for the 764 women is
72.5%, that for Tokyo is 79.3%, and that for the most unusual cell is 90.67%. Half of
the Tokyo women under age 50 have MinBengn tumors (75 out of 151) and almost
10% of the 764 women (spread across 72 cells) are concentrated here. Possibly the
protocol for study entry (including definition of a tumor) was executed differently at
this center than at the others.
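The percentages quoted above can be checked directly from the observed frequencies. A small Python sketch (the alive and total counts are summed from the frequency table shown earlier in this example):

# Survival rates: overall, Tokyo only, and the most unusual cell.
overall = 554 / 764        # all women in the study
tokyo   = 230 / 290        # Tokyo center only
cell    = 68 / 75          # Tokyo, under 50, MinBengn: 68 alive of 75
print(f"{overall:.1%}  {tokyo:.1%}  {cell:.2%}")   # 72.5%  79.3%  90.67%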
Example 4
Tables without Analyses
If you want only a frequency table and no analysis, use TABULATE. Simply specify the
table factors in the same order in which you want to view them from left to right. In
other words, the last variable defines the columns of the table and cross-classifications
of the preceding variables define the rows.
For this example, we use data in the CANCER file. Here we use LOGLIN to display
counts for a 3 by 3 by 2 by 4 table (72 cells) in two dozen lines. The input is:
USE cancer
LOGLIN
FREQ = number
LABEL age / 50=Under 50, 60=50 to 69, 70=70 & Over
ORDER center$ / SORT=NONE
ORDER tumor$ / SORT=MinBengn, MaxBengn, MinMalig, MaxMalig
TABULATE age * center$ * survive$ * tumor$
Number of cells (product of levels):    72
Total count:                           764

Observed Frequencies
====================
AGE       CENTER$   SURVIVE$ |                   TUMOR$
                             |  MinBengn   MaxBengn   MinMalig   MaxMalig
---------+---------+---------+--------------------------------------------
Under 50  Tokyo     Alive    |    68.000      9.000     26.000     25.000
                    Dead     |     7.000      3.000      9.000      4.000
          Boston    Alive    |    24.000      0.0       11.000      4.000
                    Dead     |     7.000      0.0        6.000      6.000
          Glamorgn  Alive    |    20.000      1.000     16.000      8.000
                    Dead     |     7.000      0.0       16.000      3.000
---------+---------+---------+--------------------------------------------
50 to 69  Tokyo     Alive    |    46.000      5.000     20.000     18.000
                    Dead     |     9.000      2.000      9.000     11.000
          Boston    Alive    |    58.000      3.000     18.000     10.000
                    Dead     |    20.000      2.000      8.000      3.000
          Glamorgn  Alive    |    39.000      4.000     27.000     10.000
                    Dead     |    12.000      0.0       14.000      3.000
---------+---------+---------+--------------------------------------------
70 & Over Tokyo     Alive    |     6.000      1.000      1.000      5.000
                    Dead     |     3.000      0.0        2.000      1.000
          Boston    Alive    |    26.000      1.000     15.000      1.000
                    Dead     |    18.000      0.0        9.000      3.000
          Glamorgn  Alive    |    11.000      1.000     12.000      4.000
                    Dead     |     7.000      0.0        3.000      3.000
-----------------------------+--------------------------------------------
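Outside SYSTAT, a comparable nested frequency layout can be produced with, for example, pandas. This is only an illustrative sketch; cancer.csv and the column names are hypothetical stand-ins for an export of the CANCER file:

import pandas as pd

# Hypothetical export of the CANCER file: one row per cell with the factor
# levels and a NUMBER column holding the cell count.
cancer = pd.read_csv("cancer.csv")

# As with TABULATE, the last factor supplies the columns and the preceding
# factors define the rows.
table = pd.pivot_table(cancer, values="NUMBER",
                       index=["AGE", "CENTER", "SURVIVE"],
                       columns="TUMOR", aggfunc="sum", fill_value=0)
print(table)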
Computation
Algorithms
Loglinear modeling implements the algorithms of Haberman (1973).
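Haberman's routine is built around iterative proportional fitting of the expected cell counts to the fitted margins. The following is a minimal sketch of that idea for the simple two-way independence model (NumPy; illustrative only, not the SYSTAT implementation):

import numpy as np

def ipf_independence(observed, tol=1e-6, max_iter=100):
    """Iterative proportional fitting for the two-way independence model:
    scale rows, then columns, until the expected margins match the observed."""
    expected = np.ones_like(observed, dtype=float)
    row_obs, col_obs = observed.sum(axis=1), observed.sum(axis=0)
    for _ in range(max_iter):
        expected *= (row_obs / expected.sum(axis=1))[:, None]   # fit row margins
        expected *= (col_obs / expected.sum(axis=0))[None, :]   # fit column margins
        if np.allclose(expected.sum(axis=1), row_obs, atol=tol):
            break
    return expected

counts = np.array([[10., 20.], [30., 40.]])
print(ipf_independence(counts))   # equals row total * column total / grand total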
References
Agresti, A. (1984). Analysis of ordinal categorical data. New York: Wiley-Interscience.
Agresti, A. (1990). Categorical data analysis. New York: Wiley-Interscience.
Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. (1975). Discrete multivariate
analysis: Theory and practice. Cambridge, Mass.: McGraw-Hill.
Fienberg, S. E. (1980). The analysis of cross-classified categorical data, 2nd ed.
Cambridge, Mass.: MIT Press.
Goodman, L. A. (1978). Analyzing qualitative/categorical data: Loglinear models and
latent structure analysis. Cambridge, Mass.: Abt Books.
Haberman, S. J. (1973). Loglinear fit for contingency tables, algorithm AS 51. Applied
Statistics, 21, 218-224.
Haberman, S. J. (1978). Analysis of qualitative data, Vol. 1: Introductory topics. New
York: Academic Press.
Knoke, D. and Burke, P. S. (1980). Loglinear models. Newbury Park: Sage.
Morrison, D. F. (1976). Multivariate statistical methods. New York: McGraw-Hill.
Index
A matrix, I-499
accelerated failure time distribution, II-538
ACF plots, II-642
additive trees, I-62, I-68
AID, I-35, I-37
Akaike Information Criterion, II-288
alpha level, II-314, II-315, II-321
alternative hypothesis, I-13, II-312
analysis of covariance, I-431
examples, I-462, I-478
model, I-432
analysis of variance, I-210, I-487
algorithms, I-485
ANOVA command, I-438
assumptions, I-388
between-group differences, I-394
bootstrapping, I-438
compared to loglinear modeling, I-619
compared to regression trees, I-35
contrasts, I-390, I-434, I-435, I-436
data format, I-438
examples, I-439, I-442, I-447, I-457, I-459, I-461,
I-462, I-464, I-470, I-472, I-475, I-478, I-480
factorial, I-387
hypothesis tests, I-386, I-434, I-435, I-436
interactions, I-387
model, I-432
multivariate, I-393, I-396
overview, I-431
post hoc tests, I-389, I-432
power analysis, II-311, II-318, II-319, II-351, II-353, II-374, II-378
Quick Graphs, I-438
repeated measures, I-393, I-436
residuals, I-432
two-way ANOVA, II-319, II-353, II-378
unbalanced designs, I-391
unequal variances, I-388
usage, I-438
within-subject differences, I-394
Anderberg dichotomy coefficients, I-120, I-126
angle tolerance, II-496
anisotropy, II-500, II-508
geometric, II-500
zonal, II-500
A-optimality, I-246
ARIMA models, II-627, II-637, II-650
algorithms, II-678
ARMA models, II-632
autocorrelation plots, I-374, II-630, II-633, II-642
Automatic Interaction Detection, I-35
autoregressive models, II-630
axial designs, I-242
C matrix, I-499
candidate sets
for optimal designs, I-245
canonical correlation analysis, I-487
bootstrapping, II-417
data format, II-416, II-405
examples, II-418, II-409, II-425
interactions, II-417
model, II-415
nominal scales, II-417
overview, II-407
partialled variables, II-415
Quick Graphs, II-417
rotation, II-416
usage, II-417
canonical correlations, I-286
canonical rotation, II-299
categorical data, II-199
categorical predictors, I-35
Cauchy kernel, II-463, II-472, II-473
CCF plots, II-643
central composite designs, I-238, I-269
central limit theorem, II-588
centroid designs, I-241
CHAID, I-36, I-37
chi-square, I-160
Chi-square test for independence, I-148
circle model
in perceptual mapping, II-297
city-block distance, I-126
classical analysis, II-604
classification functions, I-280
classification trees, I-36
algorithms, I-51
basic tree model, I-32
bootstrapping, I-44
commands, I-43
compared to discriminant analysis, I-36, I-39
data format, I-44
displays, I-41
examples, I-45, I-47, I-49
loss functions, I-38, I-41
missing data, I-51
mobiles, I-31
model, I-41
overview, I-31
pruning, I-37
Quick Graphs, I-44
saving files, I-44
stopping criteria, I-37, I-43
usage, I-44
cluster analysis
additive trees, I-68
algorithms, I-84
bootstrapping, I-70
commands, I-69
data types, I-70
distances, I-66
examples, I-71, I-75, I-78, I-79, I-81, I-82
exclusive clusters, I-54
hierarchical clustering, I-64
k-means clustering, I-67
missing values, I-84
overlapping clusters, I-54
overview, I-53
Quick Graphs, I-70
saving files, I-70
usage, I-70
clustered data, II-47
Cochran's test of linear trend, I-168
canonical, II-407
commands, I-128
continuous data, I-125
data format, I-129
dissimilarity measures, I-126
distance measures, I-126
examples, I-129, I-132, I-134, I-135, I-137, I-140, I-143, I-145
missing values, I-124, I-146, II-12
options, I-127
power analysis, II-311, II-317, II-336, II-338
Quick Graphs, I-129
rank-order data, I-126
saving files, I-129
set, II-407
usage, I-129
correlograms, II-509
correspondence analysis, II-294, II-298
algorithms, I-156
bootstrapping, I-150
commands, I-150
data format, I-150
examples, I-151, I-153
missing data, I-149, I-156
model, I-149
multiple correspondence analysis, I-149
overview, I-147
Quick Graphs, I-150
simple correspondence analysis, I-149
usage, I-150
covariance matrix, I-125, II-12
covariance paths
path analysis, II-236
covariograms, II-495
Cramer's V, I-165
critical level, I-13
Cronbach's alpha, II-604, II-605
See descriptive statistics, I-214
cross-correlation plots, II-643
crossover designs, I-487
crosstabulation
bootstrapping, I-172
commands, I-171
data format, I-172
examples, I-173, I-175, I-177, I-178, I-179, I-181, I-186, I-188, I-189, I-192, I-194, I-196, I-197, I-199, I-200
multiway, I-170
one-way, I-158, I-160, I-166
overview, I-157
Quick Graphs, I-172
standardizing tables, I-159
two-way, I-158, I-161, I-167, I-168
usage, I-172
cross-validation, I-37, I-280, I-380
D matrix, I-499
D SUB-A (da), II-433
dates, II-536
degrees-of-freedom, II-586
dendrograms, I-57, I-70
dependence paths
path analysis, II-235
descriptive statistics, I-1
basic statistics, I-211, I-212
bootstrapping, I-215
commands, I-215
Cronbach's alpha, I-214
data format, I-215
overview, I-205
Quick Graphs, I-215
stem-and-leaf plots, I-213
usage, I-215
design of experiments, I-92, I-250, I-251
axial designs, I-242
bootstrapping, I-252
Box-Behnken designs, I-238
central composite designs, I-238
centroid designs, I-241
commands, I-252
examples, I-253, I-254, I-256, I-258, I-260, I-263, I-264, I-265, I-266, I-269, I-270
factorial designs, I-231, I-232
lattice designs, I-241
mixture designs, I-232, I-239
optimal designs, I-232, I-244
overview, I-227
Quick Graphs, I-252
D-optimality, I-246
dot histogram plots, I-15
D-PRIME (d′), II-444
dummy codes, I-490
Duncan's test, I-390
Dunnett test, I-493
Dunn-Sidak test, I-127, II-603
ECVI, II-287
edge effects, II-517
effect size
in power analysis, II-315
effects codes, I-383, I-490
efficiency, I-244
eigenvalues, I-286
ellipse model
in perceptual mapping, II-298
EM algorithm, I-362
EM estimation, II-8
for correlations, I-127, II-12
for covariances, II-12
for SSCP matrix, II-12
endogenous variables
path analysis, II-236
Epanechnikov kernel, II-475, II-484, II-485
equamax rotation, I-333, I-337
Euclidean distances, II-121
exogenous variables
path analysis, II-236
expected cross-validation index, II-287
exponential distribution, II-550
exponential model, II-510, II-520
exponential smoothing, II-650
external unfolding, II-296
F distribution
noncentrality parameter, II-327, II-356
factor analysis, I-331, II-294
algorithms, I-362
bootstrapping, I-339
commands, I-339
compared to principal components analysis, I-334
convergence, I-335
correlations vs covariances, I-331
data format, I-339
eigenvalues, I-335
eigenvectors, I-338
examples, I-341, I-344, I-348, I-350, I-353, I-356
iterated principal axis, I-335
loadings, I-338
maximum likelihood, I-335
missing values, I-362
number of factors, I-335
overview, I-327
principal components, I-335
Quick Graphs, I-339
residuals, I-338
rotation, I-333, I-337
save, I-338
scores, I-338
usage, I-339
factor loadings, II-616
factorial analysis of variance, I-387
factorial designs, I-231, I-232
analysis of, I-235
examples, I-253
fractional factorials, I-234
full factorial designs, I-234
Fedorov method, I-245
Fieller bounds, I-581
filters, II-652
Fisher's LSD, I-389, I-493
Fisher's exact test, I-164, I-168
Fisher's linear discriminant function, II-294
fixed variance
path analysis, II-238
fixed-bandwidth method
compared to KNN method, II-479
for smoothing, II-477, II-479, II-484
Fletcher-Powell minimization, II-632
forward selection, I-379
Fourier analysis, II-651, II-664
fractional factorial designs, I-487
GLM
See general linear models
global criterion
See G-optimality
Goodman-Kruskal gamma, I-126, I-165, I-168
Goodman-Kruskal lambda, I-168
G-optimality, I-246
Graeco-Latin square designs, I-235
Greenhouse-Geisser statistic, I-395
Guttman mu2 monotonicity coefficients, I-119, I-126
Guttman's coefficient of alienation, II-123
Guttman's loss function, II-143
Guttman-Rulon coefficient, II-605
ID3, I-37
incomplete block designs, I-487
independence, I-161
in loglinear models, I-618
INDSCAL model, II-119
inertia, I-148
inferential statistics, I-7, II-312
instrumental variables, II-681
internal-consistency, II-605
interquartile range, I-207
interval censored data, II-534
inverse-distance smoother, II-470
isotropic, II-495
item-response analysis
See test item analysis
item-test correlations, II-604
k nearest-neighbors method
compared to fixed-bandwidth method, II-467
for smoothing, II-465, II-472
Kendall's tau-b coefficients, I-126, I-165, I-168
kernel functions, II-460, II-462
biweight, II-463, II-472, II-473
Cauchy, II-463, II-472, II-473
Epanechnikov, II-463, II-472, II-473
Gaussian, II-463, II-472, II-473
plotting, II-463
relationship with bandwidth, II-466
tricube, II-463, II-472, II-473
triweight, II-463, II-472, II-473
uniform, II-463, II-472
k-exchange method, I-245
k-means clustering, I-60, I-67
Kolmogorov-Smirnov test, II-200
KR20, II-605
kriging, II-506
ordinary, II-501, II-512
simple, II-501, II-512
trend components, II-501
universal, II-502
Kruskal's loss function, II-142
Kruskal's STRESS, II-123
Kruskal-Wallis test, II-198, II-199
Kukoc statistic 7
Kulczynski measure, I-126
kurtosis, I-211
lags
number of lags, II-496
latent trait model, II-604, II-606
Latin square designs, I-235, I-258, I-487
lattice, II-220
lattice designs, I-241
Lawley-Hotelling trace, I-286
least absolute deviations, II-154
Levene test, I-388
leverage, I-376
likelihood ratio chi-square, I-164, I-620
compared to Pearson chi-square, I-620
likelihood-ratio chi-square, I-168, I-622
Lilliefors test, II-217
linear contrasts, I-390
linear discriminant function, I-280
linear discriminant model, I-276
linear models
analysis of variance, I-431
general linear models, I-487
hierarchical, II-47
linear regression, I-399
linear regression, I-11, I-371, II-330
bootstrapping, I-403
commands, I-403
data format, I-403
estimation, I-401
examples, I-404, I-407, I-410, I-413, I-417, I-420, I-424, I-426, I-427, I-428, I-429
model, I-400
overview, I-399
Quick Graphs, I-403
residuals, I-373, I-400
stepwise, I-379, I-401
tolerance, I-401
usage, I-403
using correlation matrix as input, I-381
using covariance matrix as input, I-381
using SSCP matrix as input, I-381
listwise deletion, I-362, II-3
Little MCAR test, II-1, II-11, II-12, II-31
loadings, I-330, I-331
LOESS smoothing, II-470, II-472, II-476, II-477, II-480, II-489
logistic item-response analysis, II-620
one-parameter model, II-606
two-parameter model, II-606
logistic regression
algorithms, I-609
bootstrapping, I-565
categorical predictors, I-558
compared to conjoint analysis, I-92
compared to linear model, I-550
conditional variables, I-557
confidence intervals, I-581
convergence, I-560
data format, I-565
deciles of risk, I-561
discrete choice, I-559
dummy coding, I-558
effect coding, I-558
estimation, I-560
examples, I-566, I-568, I-569, I-574, I-579, I-582, I-591, I-598, I-600, I-604, I-607
missing data, I-609
model, I-557
options, I-560
overview, I-549
post hoc tests, I-563
prediction table, I-557
print options, I-565
quantiles, I-562, I-582
Quick Graphs, I-565
simulation, I-563
stepwise estimation, I-560
tolerance, I-560
usage, I-565
weights, I-565
logit model, I-551
loglinear modeling
bootstrapping, I-626
commands, I-626
compared to analysis of variance, I-619
compared to Crosstabs, I-625
convergence, I-621
data format, I-626
examples, I-627, I-638, I-641, I-645
frequency tables, I-625
model, I-621
overview, I-617
parameters, I-622
Quick Graphs, I-626
saturated models, I-619
statistics, I-622
structural zeros, I-623
usage, I-626
log-logistic distribution, II-538
log-normal distribution, II-538
longitudinal data, II-47
loss functions, I-38, II-151
multidimensional scaling, II-142
LOWESS smoothing, II-627
low-pass filter, II-640
LSD test, I-432, I-493
madograms, II-509
Mahalanobis distances, I-276, I-286
Mann-Whitney test, II-198, II-199
MANOVA
See analysis of variance
Mantel-Haenszel test, I-170
MAR, II-9
Marquardt method, II-159
Marron & Nolan canonical kernel width, II-466, II-472
mass, I-148
matrix displays, I-57
maximum likelihood estimates, II-152
maximum likelihood factor analysis, I-334
maximum Wishart likelihood, II-247
MCAR, II-9
MCAR test, II-1, II-12, II-31
McFadden's conditional logit model, I-554
McNemar's test, I-164, I-168
MDPREF, II-298, II-299
MDS,
See multidimensional scaling
mean, I-3, I-207, I-211
mean smoothing, II-468, II-472, II-474
means coding, I-384
median, I-4, I-206, I-211
median smoothing, II-468
meta-analysis, I-382
MGLH,
See general linear models
midrange, I-207
minimum spanning trees, II-503
Minkowski metric, II-123
MIS function, II-20
missing value analysis
algorithms, II-46
bootstrapping, II-15
casewise pattern table, II-20
data format, II-15
EM algorithm, II-8, II-12, II-25, II-33, II-42
examples, II-15, II-20, II-25, II-33, II-42
listwise deletion, II-3, II-25, II-33
MISSING command, II-14
missing value patterns, II-15
model, II-12
outliers, II-12
overview, II-1
pairwise deletion, II-3, II-25, II-33
pattern variables, II-2, II-42
Quick Graphs, II-15
randomness, II-9
regression imputation, II-6, II-12, II-25, II-42
saving estimates, II-12, II-15
metric, II-123
missing values, II-144
nonmetric, II-123
overview, II-119
power function, II-123
Quick Graphs, II-127
residuals, II-123
Shepard diagrams, II-123, II-127
usage, II-127
multilevel models
See mixed regression
multinomial logit, I-552
compared to binary logit, I-552
multiple correlation, I-372
multiple correspondence analysis, I-148
multiple regression, I-376
multivariate analysis of variance, I-396
mutually exclusive, I-160
POSET, II-219
positive matching dichotomy coefficients, I-120
power, II-314
power analysis
analysis of variance, II-311
bootstrapping, II-359
commands, II-358
correlation coefficients, II-317, II-336, II-338
correlations, II-311
data format, II-359
examples, II-360, II-363, II-366, II-369, II-374, II-378
Sakitt D, II-433
sample size, II-315, II-323
samples, I-8
sampling
See bootstrap
saturated models
loglinear modeling, I-619
scalogram
See partially ordered scalogram analysis with coordinates
scatterplot matrix, I-117
Scheffé model
in mixture designs, I-243
Scheffé test, I-389, I-432, I-493
screening designs, I-242
SD-RATIO, II-433
seasonal decomposition, II-637
second-order stationarity, II-495
semi-variograms, II-496, II-509
set correlations, II-407
assumptions, II-408
measures of association, II-409
missing data, II-429
partialing, II-408
See canonical correlation analysis
Shepard diagrams, II-123, II-127
Shepard's smoother, II-470
sign test, II-201
signal detection analysis
algorithms, II-456
bootstrapping, II-437
chi-square model, II-433
commands, II-436
convergence, II-433
data format, II-437
examples, II-433, II-434, II-447, II-450, II-453, II-454
t distributions, II-584
compared to normal distributions, II-586
t tests
assumptions, II-588
Bonferroni adjustment, II-591
bootstrapping, II-592
commands, II-592
confidence intervals, II-591
data format, II-592
degrees of freedom, II-586
Dunn-Sidak adjustment, II-591
examples, II-593, II-595, II-597, II-599, II-601
one-sample, II-318, II-345, II-586, II-590
overview, II-583
paired, II-318, II-346, II-363, II-587, II-590
power analysis, II-311, II-318, II-345, II-346, II-348
algorithms, II-678
ARIMA models, II-627, II-650
bootstrapping, II-653
clear series, II-645
commands, II-644, II-646, II-649, II-650, II-651, II-653
estimation, II-681
examples, II-686, II-688, II-691
heteroskedasticity-consistent standard errors, II-683
unbalanced designs
in analysis of variance, I-391
uncertainty coefficient, I-168
unfolding models, II-295
uniform kernel, II-463, II-472
variance, I-211
of estimates, I-237
variance component models
See mixed regression
variance of prediction, I-238
variance paths
path analysis, II-236
varimax rotation, I-333, I-337
variograms, II-496, II-506, II-515
model, II-497
vector model
in perceptual mapping, II-297
Voronoi polygons, II-493, II-504, II-506
z tests
one-sample, II-341
power analysis, II-311, II-341, II-342
two-sample, II-342