Statistics I

SYSTAT 10.2

Statistics I

WWW.SYSTAT.COM

For more information about SYSTAT Software Inc. products, please visit our WWW
site at http://www.systat.com or contact
Marketing Department
SYSTAT Software Inc.
501 Canal Boulevard, Suite F
Richmond, CA 94804-2028
Tel: (800) 797-7401, (866) 797-8288
Fax: (800) 797-7406

Windows is a registered trademark of Microsoft Corporation.


General notice: Other product names mentioned herein are used for identification
purposes only and may be trademarks of their respective companies.
The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use,
duplication, or disclosure by the Government is subject to restrictions as set forth in
subdivision (c)(1)(ii) of The Rights in Technical Data and Computer Software clause at
52.227-7013. Contractor/manufacturer is SYSTAT Software Inc., 501 Canal Boulevard,
Suite F, Richmond, CA 94804-2028
SYSTAT™ 10.2 Statistics I


Copyright 2002 by SYSTAT Software Inc.
All rights reserved.
Printed in the United States of America.
No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying,
recording, or otherwise, without the prior written permission of the publisher.
1234567890

05 04 03 02 01 00

ISBN 81-88341-04-5

Contents
1 Introduction to Statistics . . . . . . . . . . I-1

Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-1
Know Your Batch . . . . . . . . . . I-2
Sum, Mean, and Standard Deviation . . . . . . . . . . I-3
Stem-and-Leaf Plots . . . . . . . . . . I-3
The Median . . . . . . . . . . I-4
Sorting . . . . . . . . . . I-5
Standardizing . . . . . . . . . . I-6
Inferential Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-7
What Is a Population? . . . . . . . . . . I-7
Picking a Simple Random Sample . . . . . . . . . . I-8
Specifying a Model . . . . . . . . . . I-10
Estimating a Model . . . . . . . . . . I-10
Confidence Intervals . . . . . . . . . . I-11
Hypothesis Testing . . . . . . . . . . I-12
Checking Assumptions . . . . . . . . . . I-14

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-16

2 Bootstrapping and Sampling . . . . . . . . . . I-17

Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . I-17


Bootstrapping in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . . I-20
Bootstrap Main Dialog Box . . . . . . . . . . . . . . . . . . . . . . I-20
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-20
Usage Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . I-20

Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-21
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-28
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-28
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-28
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-29

3 Classification and Regression Trees . . . . . . . . . . I-31

Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . I-31
The Basic Tree Model . . . . . . . . . . I-32
Categorical or Quantitative Predictors . . . . . . . . . . I-35
Regression Trees . . . . . . . . . . I-35
Classification Trees . . . . . . . . . . I-36
Stopping Rules, Pruning, and Cross-Validation . . . . . . . . . . I-37
Loss Functions . . . . . . . . . . I-38
Geometry . . . . . . . . . . I-38

Classification and Regression Trees in SYSTAT . . . . . . . . . . . . . I-41


Trees Main Dialog Box . . . . . . . . . . . . . . . . . . . . . . . . I-41
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-43
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . I-44
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-44
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-51
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-51
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-51
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-51

4 Cluster Analysis . . . . . . . . . . I-53

Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . I-54
Types of Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . I-54
Correlations and Distances . . . . . . . . . . I-55
Hierarchical Clustering . . . . . . . . . . I-56
Partitioning via K-Means . . . . . . . . . . I-60
Additive Trees . . . . . . . . . . I-62
Cluster Analysis in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . I-64
Hierarchical Clustering Main Dialog Box . . . . . . . . . . I-64
K-Means Main Dialog Box . . . . . . . . . . I-67
Additive Trees Main Dialog Box . . . . . . . . . . I-68
Using Commands . . . . . . . . . . I-69
Usage Considerations . . . . . . . . . . I-70

Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-71
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-84
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-84
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-84
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-84

5 Conjoint Analysis . . . . . . . . . . I-87

Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-87
Additive Tables . . . . . . . . . . I-88
Multiplicative Tables . . . . . . . . . . I-89
Computing Table Margins Based on an Additive Model . . . . . . . . . . I-91
Applied Conjoint Analysis . . . . . . . . . . I-92

Conjoint Analysis in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . I-93


Conjoint Analysis Main Dialog Box . . . . . . . . . . . . . . . . . . I-93
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-95
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . I-95
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-96
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-112
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-112
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-113
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-113

6 Correlations, Similarities, and Distance Measures . . . . . . . . . . I-115

Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . I-116
The Scatterplot Matrix (SPLOM) . . . . . . . . . . I-117
The Pearson Correlation Coefficient . . . . . . . . . . I-117
Other Measures of Association . . . . . . . . . . I-119
Transposed Data . . . . . . . . . . I-122
Hadi Robust Outlier Detection . . . . . . . . . . I-123

Correlations in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . . .I-124


Correlations Main Dialog Box . . . . . . . . . . . . . . . . . . . . .I-124
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-128
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . .I-129
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-129
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-145
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-145
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-146
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-146

7 Correspondence Analysis . . . . . . . . . . I-147

Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . .I-147


The Simple Model . . . . . . . . . . . . . . . . . . . . . . . . . . .I-147
The Multiple Model . . . . . . . . . . . . . . . . . . . . . . . . . . .I-148
Correspondence Analysis in SYSTAT . . . . . . . . . . . . . . . . . . .I-149
Correspondence Analysis Main Dialog Box. . . . . . . . . . . . .I-149
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-150
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . .I-150
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-151
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-156
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-156
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-156
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-156

8 Crosstabulation . . . . . . . . . . I-157

Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . I-158


Making Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-158
Significance Tests and Measures of Association . . . . . . . . . I-160
Crosstabulations in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . I-166
One-Way Frequency Tables Main Dialog Box . . . . . . . . . . I-166
Two-Way Frequency Tables Main Dialog Box . . . . . . . . . . I-167
Multiway Frequency Tables Main Dialog Box . . . . . . . . . . I-170
Using Commands . . . . . . . . . . I-171
Usage Considerations . . . . . . . . . . I-172

Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-173
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-203
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-203

9 Descriptive Statistics . . . . . . . . . . I-205

Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . I-206
Location . . . . . . . . . . I-206
Spread . . . . . . . . . . I-207
The Normal Distribution . . . . . . . . . . I-207
Non-Normal Shape . . . . . . . . . . I-208
Subpopulations . . . . . . . . . . I-209
Descriptive Statistics in SYSTAT . . . . . . . . . . . . . . . . . . . . . I-211
Basic Statistics Main Dialog Box . . . . . . . . . . I-211
Stem Main Dialog Box . . . . . . . . . . I-213
Cronbach Main Dialog Box . . . . . . . . . . I-214
Using Commands . . . . . . . . . . I-215
Usage Considerations . . . . . . . . . . I-215

Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-216
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-225
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-225
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-225

10 Design of Experiments . . . . . . . . . . I-227

Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . I-228
The Research Problem . . . . . . . . . . I-228
Types of Investigation . . . . . . . . . . I-229
The Importance of Having a Strategy . . . . . . . . . . I-230
The Role of Experimental Design in Research . . . . . . . . . . I-231
Types of Experimental Designs . . . . . . . . . . I-231
Factorial Designs . . . . . . . . . . I-232
Response Surface Designs . . . . . . . . . . I-236
Mixture Designs . . . . . . . . . . I-239
Optimal Designs . . . . . . . . . . I-244
Choosing a Design . . . . . . . . . . I-248
Design of Experiments in SYSTAT . . . . . . . . . . . . . . . . . . . . . I-250
Design of Experiments Wizard . . . . . . . . . . I-250
Classic Design of Experiments . . . . . . . . . . I-251
Using Commands . . . . . . . . . . I-252
Usage Considerations . . . . . . . . . . I-252

Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-253
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-273

11 Discriminant Analysis . . . . . . . . . . I-275

Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . .I-276


Linear Discriminant Model . . . . . . . . . . . . . . . . . . . . . .I-276
Discriminant Analysis in SYSTAT . . . . . . . . . . . . . . . . . . . . .I-283
Discriminant Analysis Main Dialog Box . . . . . . . . . . . . . . .I-283
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-287
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . I-288
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-288
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-326

12 Factor Analysis . . . . . . . . . . I-327

Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . I-327
A Principal Component . . . . . . . . . . I-328
Factor Analysis . . . . . . . . . . I-331
Principal Components versus Factor Analysis . . . . . . . . . . I-334
Applications and Caveats . . . . . . . . . . I-334

Factor Analysis in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . . . I-335


Factor Analysis Main Dialog Box . . . . . . . . . . . . . . . . . . I-335
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . I-339
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . . I-339
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-341
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-362
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-362
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-362
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-363

13 Linear Models . . . . . . . . . . I-365

Simple Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-365
Equation for a Line . . . . . . . . . . I-366
Least Squares . . . . . . . . . . I-369
Estimation and Inference . . . . . . . . . . I-369
Standard Errors . . . . . . . . . . I-371
Hypothesis Testing . . . . . . . . . . I-371
Multiple Correlation . . . . . . . . . . I-372
Regression Diagnostics . . . . . . . . . . I-373

Multiple Regression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-376


Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . I-379
Using an SSCP, a Covariance, or a Correlation Matrix as Input . . . . . . . . . . I-381

Analysis of Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-382


Effects Coding . . . . . . . . . . I-383
Means Coding . . . . . . . . . . I-384
Models . . . . . . . . . . I-385
Hypotheses . . . . . . . . . . I-386
Multigroup ANOVA . . . . . . . . . . I-386
Factorial ANOVA . . . . . . . . . . I-387
Data Screening and Assumptions . . . . . . . . . . I-388
Levene Test . . . . . . . . . . I-388
Pairwise Mean Comparisons . . . . . . . . . . I-389
Linear and Quadratic Contrasts . . . . . . . . . . I-390

Repeated Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-393


Assumptions in Repeated Measures. . . . . . . . . . . . . . . . .I-394
Issues in Repeated Measures Analysis . . . . . . . . . . . . . . .I-395
Types of Sums of Squares . . . . . . . . . . . . . . . . . . . . . . . . .I-396
SYSTAT's Sums of Squares . . . . . . . . . . . . . . . . . . . . . . I-397

14 Linear Models I: Linear Regression . . . . . . . . . . I-399

Linear Regression in SYSTAT. . . . . . . . . . . . . . . . . . . . . . . .I-400


Regression Main Dialog Box . . . . . . . . . . . . . . . . . . . . .I-400
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-403
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . .I-403
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-404
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-430

15 Linear Models II: Analysis of Variance . . . . . . . . . . I-431

Analysis of Variance in SYSTAT . . . . . . . . . . . . . . . . . . . . . . I-432
ANOVA: Estimate Model . . . . . . . . . . I-432
ANOVA: Hypothesis Test . . . . . . . . . . I-434
Repeated Measures . . . . . . . . . . I-436
Using Commands . . . . . . . . . . I-438
Usage Considerations . . . . . . . . . . I-438

Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-439
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-485
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-485
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-485

16 Linear Models III: General Linear Models . . . . . . . . . . I-487

General Linear Models in SYSTAT . . . . . . . . . . . . . . . . . . . . I-488


Model Estimation (in GLM) . . . . . . . . . . . . . . . . . . . . . . I-488
Pairwise Comparisons . . . . . . . . . . . . . . . . . . . . . . . . I-493
Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . I-495
Post hoc Tests for Repeated Measures . . . . . . . . . . . . . . I-495
Using Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . I-501
Usage Considerations . . . . . . . . . . . . . . . . . . . . . . . . I-501
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-503
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-546
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-546
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-546

17 Logistic Regression . . . . . . . . . . I-549

Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . I-549
Binary Logit . . . . . . . . . . I-550
Multinomial Logit . . . . . . . . . . I-552
Conditional Logit . . . . . . . . . . I-552
Discrete Choice Logit . . . . . . . . . . I-554
Stepwise Logit . . . . . . . . . . I-556
Logistic Regression in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . I-557
Estimate Model Main Dialog Box . . . . . . . . . . I-557
Deciles of Risk . . . . . . . . . . I-561
Quantiles . . . . . . . . . . I-562
Simulation . . . . . . . . . . I-563
Hypothesis . . . . . . . . . . I-563
Using Commands . . . . . . . . . . I-564
Usage Considerations . . . . . . . . . . I-565

Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-566
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-609
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-609
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-609
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-613

18 Loglinear Models . . . . . . . . . . I-617

Statistical Background . . . . . . . . . . . . . . . . . . . . . . . . . . .I-618


Fitting a Loglinear Model . . . . . . . . . . . . . . . . . . . . . . .I-620
Loglinear Models in SYSTAT . . . . . . . . . . . . . . . . . . . . . . . .I-621
Loglinear Model Main Dialog Box . . . . . . . . . . I-621
Frequency Tables (Tabulate) . . . . . . . . . . I-625
Using Commands . . . . . . . . . . I-626
Usage Considerations . . . . . . . . . . I-626

Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-627
Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-646
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-646
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-646

Index . . . . . . . . . . 649

List of Examples
Additive Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-82
Analysis of Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . I-462
ANOVA Assumptions and Contrasts . . . . . . . . . . . . . . . . . . I-442
Automatic Stepwise Regression . . . . . . . . . . . . . . . . . . . . I-417
Basic Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-216
Binary Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-566
Binary Logit with Interactions . . . . . . . . . . . . . . . . . . . . . . I-569
Binary Logit with Multiple Predictors . . . . . . . . . . . . . . . . . . . I-568
Box-Behnken Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-264
Box-Cox Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-103
Box-Hunter Fractional Factorial Design . . . . . . . . . . . . . . . . I-256
By-Choice Data Format . . . . . . . . . . . . . . . . . . . . . . . . . . I-598
Canonical Correlation Analysis. . . . . . . . . . . . . . . . . . . . . . I-544
Canonical Correlations: Using Text Output . . . . . . . . . . . . . . . I-26
Central Composite Response Surface Design . . . . . . . . . . . . . I-269
Choice Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-96
Classification Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-45
Cochran's Test of Linear Trend . . . . . . . . . . . . . . . . . . . . . . I-194
Conditional Logistic Regression . . . . . . . . . . . . . . . . . . . . . I-588
Confidence Interval on a Median . . . . . . . . . . . . . . . . . . . . . I-25
Confidence Intervals for One-Way Table Percentages . . . . . . . . . I-199
Contrasts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-313
Correspondence Analysis (Simple) . . . . . . . . . . . . . . . . . . . . I-151
Covariance Alternatives to Repeated Measures. . . . . . . . . . . . . I-532
Crossover and Changeover Designs. . . . . . . . . . . . . . . . . . . . I-520
Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-321
Deciles of Risk and Model Diagnostics . . . . . . . . . . . . . . . . . . . I-574
Discrete Choice Models . . . . . . . . . . . . . . . . . . . . . . . . . . I-591
Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . I-536
Discriminant Analysis Using Automatic Backward Stepping. . . . . I-298
Discriminant Analysis Using Automatic Forward Stepping . . . . . .I-293
Discriminant Analysis Using Complete Estimation. . . . . . . . . . . I-288
Discriminant Analysis Using Interactive Stepping . . . . . . . . . . I-306
Employment Discrimination. . . . . . . . . . . . . . . . . . . . . . . . I-107
Factor Analysis Using a Covariance Matrix. . . . . . . . . . . . . . I-353
Factor Analysis Using a Rectangular File . . . . . . . . . . . . . . . .I-356
Fisher's Exact Test . . . . . . . . . . . . . . . . . . . . . . . . . . . I-192
Fractional Factorial Design . . . . . . . . . . . . . . . . . . . . . . . I-254
Fractional Factorial Designs . . . . . . . . . . . . . . . . . . . . . . . . I-512
Frequency Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-177
Full Factorial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-253
Hadi Robust Outlier Detection . . . . . . . . . . . . . . . . . . . . . . I-140
Hierarchical Clustering: Clustering Cases . . . . . . . . . . . . . . . . . I-75
Hierarchical Clustering: Clustering Variables and Cases . . . . . . . I-79
Hierarchical Clustering: Clustering Variables . . . . . . . . . . . . . . I-78
Hierarchical Clustering: Distance Matrix Input . . . . . . . . . . . . . . I-81
Hotelling's T-Square . . . . . . . . . . . . . . . . . . . . . . . . . . . I-535
Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-604
Incomplete Block Designs . . . . . . . . . . . . . . . . . . . . . . . . I-510
Interactive Stepwise Regression . . . . . . . . . . . . . . . . . . . . I-420
Iterated Principal Axis . . . . . . . . . . . . . . . . . . . . . . . . . . I-348
K-Means Clustering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-71
Latin Square Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . I-518
Latin Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-258
Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-21
Loglinear Modeling of a Four-Way Table . . . . . . . . . . . . . . . . I-627
Mantel-Haenszel Test . . . . . . . . . . . . . . . . . . . . . . . . . . . I-200
Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-344
McNemar's Test of Symmetry . . . . . . . . . . . . . . . . . . . . . . I-197
Missing Category Codes . . . . . . . . . . . . . . . . . . . . . . . . . I-178
Missing Cells Designs (the Means Model) . . . . . . . . . . . . . . . I-523
Missing Data: EM Estimation . . . . . . . . . . . . . . . . . . . . . . . I-135
Missing Data: Pairwise Deletion . . . . . . . . . . . . . . . . . . . . . I-134
Mixed Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-459
Mixture Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-265
Mixture Design with Constraints . . . . . . . . . . . . . . . . . . . . . . . I-266
Mixture Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-544
Multinomial Logit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-582
Multiple Correspondence Analysis . . . . . . . . . . . . . . . . . . . . I-153
Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . I-413
Multivariate Analysis of Variance. . . . . . . . . . . . . . . . . . . . . I-480
Multiway Tables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-181
Nested Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .I-513
Odds Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-189
One-Way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-439
One-Way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-503
One-Way Repeated Measures . . . . . . . . . . . . . . . . . . . . . . I-464
One-Way Tables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-173
Optimal Designs: Coordinate Exchange . . . . . . . . . . . . . . . . . I-270
Partial Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-545
Pearson Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-129
Percentages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-179
Plackett-Burman Design . . . . . . . . . . . . . . . . . . . . . . . . . . I-263
Principal Components Analysis (Within Groups). . . . . . . . . . . . . I-540
Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-341
Probabilities Associated with Correlations . . . . . . . . . . . . . . . . I-137
Quadratic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-315
Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-579
Quasi-Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . I-607
Randomized Block Designs . . . . . . . . . . . . . . . . . . . . . . . . I-510
Regression Tree with Box Plots . . . . . . . . . . . . . . . . . . . . . . I-47
Regression Tree with Dit Plots . . . . . . . . . . . . . . . . . . . . . . . I-49
Regression with Ecological or Grouped Data. . . . . . . . . . . . . . I-428
Regression without the Constant. . . . . . . . . . . . . . . . . . . . I-429
Repeated Measures Analysis of Covariance . . . . . . . . . . . . . . I-478
One Within Factor with Ordered Levels . . . . . . . . . . . . . . . . I-470
One Within Factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-472
Repeated Measures ANOVA for Two Trial Factors . . . . . . . . . . I-475
Residuals and Diagnostics for Simple Linear Regression. . . . . . I-410
Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-350
S2 and S3 Coefficients. . . . . . . . . . . . . . . . . . . . . . . . . . I-143
Grouping Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-218
One Grouping Variable . . . . . . . . . . . . . . . . . . . . . . . . . . I-217
Screening Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-638
Separate Variance Hypothesis Tests . . . . . . . . . . . . . . . . . I-461
Single-Degree-of-Freedom Designs . . . . . . . . . . . . . . . . . . I-457
Spearman Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . I-143
Spearman Rank Correlation . . . . . . . . . . . . . . . . . . . . . . . . I-24
Split Plot Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-515
Stem-and-Leaf Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-221
Stepwise Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-600
Structural Zeros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-641
Tables with Ordered Categories. . . . . . . . . . . . . . . . . . . . . I-196
Tables without Analyses . . . . . . . . . . . . . . . . . . . . . . . . I-645
Taguchi Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-260
Testing Nonzero Null Hypotheses. . . . . . . . . . . . . . . . . . . . I-427
Testing whether a Single Coefficient Equals Zero . . . . . . . . . . I-424
Testing whether Multiple Coefficients Equal Zero . . . . . . . . . . I-426

Tetrachoric Correlation. . . . . . . . . . . . . . . . . . . . . . . . . I-145


Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-132
Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-407


Two-Way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-447


Two-Way Table Statistics (Long Results) . . . . . . . . . . . . . . . I-188
Two-Way Table Statistics. . . . . . . . . . . . . . . . . . . . . . . . . . I-186
Two-Way Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-175
Weighting Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-532
Word Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I-100


Chapter

1
Introduction to Statistics
Leland Wilkinson

Statistics and state have the same root. Statistics are the numbers of the state. More
generally, they are any numbers or symbols that formally summarize our observations
of the world. As we all know, summaries can mislead or elucidate. Statistics also
refers to the introductory course we all seem to hate in college. When taught well,
however, it is this course that teaches us how to use numbers to elucidate rather than
to mislead.
Statisticians specialize in many areas: probability, exploratory data analysis,
modeling, social policy, decision making, and others. While they may philosophically
disagree, statisticians nevertheless recognize at least two fundamental tasks:
description and inference. Description involves characterizing a batch of data in
simple but informative ways. Inference involves generalizing from a sample of data
to a larger population of possible data. Descriptive statistics help us to observe more
acutely, and inferential statistics help us to formulate and test hypotheses.
Any distinctions, such as this one between descriptive and inferential statistics, are
potentially misleading. Let's look at some examples, however, to see some
differences between these approaches.

Descriptive Statistics
Descriptive statistics may be single numerical summaries of a batch, such as an
average. Or, they may be more complex tables and graphs. What distinguishes
descriptive statistics is their reference to a given batch of data rather than to a more
general population or class. While there are exceptions, we usually examine
descriptive statistics to understand the structure of a batch. A closely related field is
called exploratory data analysis. Both exploratory and descriptive methods may lead
us to formulate laws or test hypotheses, but their focus is on the data at hand.
Consider, for example, the following batch. These are numbers of arrests by sex in
1985 for selected crimes in the United States. The source is the FBI Uniform Crime
Reports. What can we say about differences between the patterns of arrests of men and
women in the United States in 1985?
CRIME              MALES    FEMALES

murder             12904       1815
rape               28865        303
robbery           105401       8639
assault           211228      32926
burglary          326959      26753
larceny           744423     334053
auto               97835      10093
arson              13129       2003
battery           416735      75937
forgery            46286      23181
fraud             151773     111825
embezzle            5624       3184
vandal            181600      20192
weapons           134210      10970
vice               29584      67592
sex                74602       6108
drugs             562754      90038
gambling           21995       3879
family             35553       5086
dui              1208416     157131
drunk             726214      70573
disorderly        435198      99252
vagrancy           24592       3001
runaway            53808      72473

Know Your Batch


First, we must be careful in characterizing the batch. These statistics do not cover the
gamut of U.S. crimes. We left out curfew and loitering violations, for example. Not all
reported crimes are included in these statistics. Some false arrests may be included.
State laws vary on the definitions of some of these crimes. Agencies may modify arrest
statistics for political purposes. Know where your batch came from before you use it.

Sum, Mean, and Standard Deviation


Were there more male than female arrests for these crimes in 1985? The following
output shows us the answer. Males were arrested for 5,649,688 crimes (not 5,649,688
males; some may have been arrested more than once). Females were arrested
1,237,007 times.
                      MALES        FEMALES
N of cases               24             24
Minimum            5624.000        303.000
Maximum         1208416.000     334053.000
Sum             5649688.000    1237007.000
Mean             235403.667      51541.958
Standard Dev     305947.056      74220.864

How about the average (mean) number of arrests for a crime? For males, this was
235,404 and for females, 51,542. Does the mean make any sense to you as a summary
statistic? Another statistic in the table, the standard deviation, measures how much
these numbers vary around the average. The standard deviation is the square root of the
average squared deviation of the observations from their mean. It, too, has problems in
this instance. First of all, both the mean and standard deviation should represent what
you could observe in your batch, on average: the mean number of fish in a pond, the
mean number of children in a classroom, the mean number of red blood cells per cubic
millimeter. Here, we would have to say, "the mean murder-rape-robbery-...-runaway
type of crime." Second, even if the mean made sense descriptively, we might question
its use as a typical crime-arrest statistic. To see why, we need to examine the shape of
these numbers.
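The summary statistics in the output above can be reproduced with a short sketch. This is an illustration in Python rather than SYSTAT; the variable and function names are hypothetical, and the standard deviation is assumed to use the n - 1 divisor.

```python
import statistics

# Arrest counts from the table above, in the table's crime order
# (murder, rape, robbery, ..., runaway).
males = [12904, 28865, 105401, 211228, 326959, 744423, 97835, 13129,
         416735, 46286, 151773, 5624, 181600, 134210, 29584, 74602,
         562754, 21995, 35553, 1208416, 726214, 435198, 24592, 53808]
females = [1815, 303, 8639, 32926, 26753, 334053, 10093, 2003,
           75937, 23181, 111825, 3184, 20192, 10970, 67592, 6108,
           90038, 3879, 5086, 157131, 70573, 99252, 3001, 72473]


def summarize(counts):
    """Return (n, minimum, maximum, sum, mean, standard deviation)."""
    # statistics.stdev divides by n - 1 (an assumption about the output above).
    return (len(counts), min(counts), max(counts), sum(counts),
            statistics.mean(counts), statistics.stdev(counts))


for label, counts in (("MALES", males), ("FEMALES", females)):
    n, low, high, total, mean, sd = summarize(counts)
    print(f"{label:8s} n={n} min={low} max={high} sum={total} "
          f"mean={mean:.3f} sd={sd:.3f}")
```

In both columns the standard deviation exceeds the mean, which already hints at the skewness discussed next.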

Stem-and-Leaf Plots
Let's look at a display that compresses these data a little less drastically. The
stem-and-leaf plot is like a tally. We pick a most significant digit or digits and tally the next digit
to the right. By using trailing digits instead of tally marks, we preserve extra digits in
the data. Notice the shape of the tally. There are mostly smaller numbers of arrests and
a few crimes (such as larceny and driving under the influence of alcohol) with larger
numbers of arrests. Another way of saying this is that the data are positively skewed
toward larger numbers for both males and females.
Stem and Leaf Plot of variable: MALES, N = 24
Minimum:           5624.000
Lower hinge:      29224.500
Median:          101618.000
Upper hinge:     371847.000
Maximum:        1208416.000

        0 H 011222234579
        1 M 0358
        2   1
        3 H 2
        4   13
        5   6
        6
        7   24
     * * * Outside Values * * *
       12   0

Stem and Leaf Plot of variable: FEMALES, N = 24
Minimum:            303.000
Lower hinge:       4482.500
Median:           21686.500
Upper hinge:      74205.000
Maximum:         334053.000

        0 H 00000000011
        0 M 2223
        0
        0 H 6777
        0   99
        1   1
        1
        1   5
     * * * Outside Values * * *
        3   3

The Median
When data are skewed like this, the mean gets pulled from the center of the majority
of numbers toward the extreme with the few. A statistic that is not as sensitive to
extreme values is the median. The median is the value above which half the data fall.
More precisely, if you sort the data, the median is the middle value or the average of
the two middle values. Notice that for males the median is 101,618, and for females,
21,686. Both are considerably smaller than the means and more typical of the majority
of the numbers. This is why the median is often used for representing skewed data,
such as incomes, populations, or reaction times.
We still have the same representativeness problem that we had with the mean,
however. Even if the medians corresponded to real data values in this batch (which
they don't because there is an even number of observations), it would be hard to
characterize what they would represent.
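As a check on the medians and hinges reported in the stem-and-leaf output, here is a minimal Python sketch. The function name is hypothetical, and the hinge rule (medians of the lower and upper halves of the sorted batch, sharing the middle value when the batch size is odd) is an assumption about how the output was produced.

```python
import statistics

males = [12904, 28865, 105401, 211228, 326959, 744423, 97835, 13129,
         416735, 46286, 151773, 5624, 181600, 134210, 29584, 74602,
         562754, 21995, 35553, 1208416, 726214, 435198, 24592, 53808]
females = [1815, 303, 8639, 32926, 26753, 334053, 10093, 2003,
           75937, 23181, 111825, 3184, 20192, 10970, 67592, 6108,
           90038, 3879, 5086, 157131, 70573, 99252, 3001, 72473]


def hinges(values):
    """Return (lower hinge, median, upper hinge) of a batch."""
    data = sorted(values)
    half = (len(data) + 1) // 2  # size of each half, middle value shared if odd
    return (statistics.median(data[:half]),
            statistics.median(data),
            statistics.median(data[-half:]))


for label, counts in (("MALES", males), ("FEMALES", females)):
    lower, median, upper = hinges(counts)
    mean = statistics.mean(counts)
    print(f"{label:8s} lower hinge={lower} median={median} "
          f"upper hinge={upper} mean={mean:.1f}")
```

For both sexes the mean is more than twice the median, which is what positive skew does to the mean.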


Sorting
Most people think of means, standard deviations, and medians as the primary
descriptive statistics. They are useful summary quantities when the observations
represent values of a single variable. We purposely chose an example where they are
less appropriate, however, even when they are easily computable. There are better
ways to reveal the patterns in these data. Let's look at sorting as a way of uncovering
structure.
I was talking once with an FBI agent who had helped to uncover the Chicago
machine's voting fraud scandal some years ago. He was a statistician, so I was curious
what statistical methods he used to prove the fraud. He replied, "We sorted the voter
registration tape alphabetically by last name. Then we looked for duplicate names and
addresses." Sorting is one of the most basic and powerful data analysis techniques. The
stem-and-leaf plot, for example, is a sorted display.
We can sort on any numerical or character variable. It depends on our goal. We
began this chapter with a question: Are there differences between the patterns of arrests
of men and women in the United States in 1985? How about sorting the male and
female arrests separately? If we do this, we will get a list of crimes in order of
decreasing frequency within sex.
MALES         FEMALES

dui           larceny
larceny       dui
drunk         fraud
drugs         disorderly
disorderly    drugs
battery       battery
burglary      runaway
assault       drunk
vandal        vice
fraud         assault
weapons       burglary
robbery       forgery
auto          vandal
sex           weapons
runaway       auto
forgery       robbery
family        sex
vice          family
rape          gambling
vagrancy      embezzle
gambling      vagrancy
arson         arson
murder        murder
embezzle      rape
You might want to connect similar crimes with lines. The number of crossings would
indicate differences in ranks.

Standardizing
This ranking is influenced by prevalence. The most frequent crimes occur at the top of
the list in both groups. Comparisons within crimes are obscured by this influence. Men
were arrested for almost 100 times as many rapes as women, for example, yet rape is near the
bottom of both lists. If we are interested in contrasting the sexes on patterns of crime
while holding prevalence constant, we must standardize the data. There are several
ways to do this. You may have heard of standardized test scores for aptitude tests.
These are usually produced by subtracting means and then dividing by standard
deviations. Another method is simply to divide by row or column totals. For the crime
data, we will divide by totals within rows (each crime). Doing so gives us the
proportion of arrests for each crime that involved men or women. The total of these two
proportions will thus be 1.
Now, a contrast between men and women on this standardized value should reveal
variations in arrest patterns within crime type. By subtracting the female proportion
from the male, we will highlight primarily male crimes with positive values and female
crimes with negative. Next, sort these differences and plot them in a simple graph. The
following shows the result:


Now we can see clear contrasts between males and females in arrest patterns. The
predominantly aggressive crimes appear at the top of the list. Rape now appears where
it belongs: as an aggressive, rather than sexual, crime. A few crimes dominated by
females are at the bottom.
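The sorting and standardizing steps described in the last two sections can be sketched as follows. This is a Python illustration with hypothetical names, not the procedure used to draw the original graph; it only prints the sorted contrasts.

```python
crimes = ["murder", "rape", "robbery", "assault", "burglary", "larceny",
          "auto", "arson", "battery", "forgery", "fraud", "embezzle",
          "vandal", "weapons", "vice", "sex", "drugs", "gambling",
          "family", "dui", "drunk", "disorderly", "vagrancy", "runaway"]
males = [12904, 28865, 105401, 211228, 326959, 744423, 97835, 13129,
         416735, 46286, 151773, 5624, 181600, 134210, 29584, 74602,
         562754, 21995, 35553, 1208416, 726214, 435198, 24592, 53808]
females = [1815, 303, 8639, 32926, 26753, 334053, 10093, 2003,
           75937, 23181, 111825, 3184, 20192, 10970, 67592, 6108,
           90038, 3879, 5086, 157131, 70573, 99252, 3001, 72473]


def rank_by_frequency(counts):
    """Crime names in order of decreasing arrest count."""
    return [crime for _, crime in sorted(zip(counts, crimes), reverse=True)]


def sex_contrast(m, f):
    """Male proportion minus female proportion within one crime (row)."""
    total = m + f
    return m / total - f / total


# Sorting within sex: dominated by prevalence.
print(rank_by_frequency(males)[:3], rank_by_frequency(females)[:3])

# Standardizing within crime, then sorting the contrasts.
contrasts = sorted(
    ((sex_contrast(m, f), crime) for crime, m, f in zip(crimes, males, females)),
    reverse=True)
for diff, crime in contrasts:
    print(f"{crime:12s}{diff:+.3f}")
```

Dividing by the row total removes prevalence, so a crime with few arrests (such as rape) can still sit at the extreme of the standardized list.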

Inferential Statistics
We often want to do more than describe a particular sample. In order to generalize,
formulate a policy, or test a hypothesis, we need to make an inference. Making an
inference implies that we think a model describes a more general population from
which our data have been randomly sampled. Sometimes it is difficult to imagine a
population from which you have gathered data. A population can be "all possible
voters," "all possible replications of this experiment," or "all possible moviegoers."
When you make inferences, you should have a population in mind.

What Is a Population?
We are going to use inferential methods to estimate the mean age of the unusual
population contained in the 1980 edition of Who's Who in America. We could enter all
73,500 ages into a SYSTAT file and compute the mean age exactly. If it were practical,
this would be the preferred method. Sometimes, however, a sampling estimate can be
more accurate than an entire census. For example, biases are introduced into large
censuses from refusals to comply, keypunch or coding errors, and other sources. In
these cases, a carefully constructed random sample can yield less-biased information
about the population.
This is an unusual population because it is contained in a book and is therefore finite.
We are not about to estimate the mean age of the rich and famous. After all, Spy
magazine used to have a regular feature listing all of the famous people who are not in
Who's Who. And bogus listings may escape the careful fact checking of the Who's Who
research staff. When we get our estimate, we might be tempted to generalize beyond
the book, but we would be wrong to do so. For example, if a psychologist measures
opinions in a random sample from a class of college sophomores, his or her
conclusions should begin with the statement, "College sophomores at my university
think ..." If the word "people" is substituted for "college sophomores," it is the
experimenter's responsibility to make clear that the sample is representative of the
larger group on all attributes that might affect the results.

Picking a Simple Random Sample


That our population is finite should cause us no problems as long as our sample is much
smaller than the population. Otherwise, we would have to use special techniques to
adjust for the bias it would cause. How do we choose a simple random sample from
a population? We use a method that ensures that every possible sample of a given size
has an equal chance of being chosen. The following methods are not random:
• Pick the first name on every tenth page (some names have no chance of being
  chosen).
• Close your eyes, flip the pages of the book, and point to a name (Tversky and others
  have done research that shows that humans cannot behave randomly).
• Randomly pick the first letter of the last name and randomly choose from the
  names beginning with that letter (there are more names beginning with C, for
  example, than with I).
The way to pick randomly from a book, file, or any finite population is to assign a
number to each name or case and then pick a sample of numbers randomly. You can
use SYSTAT to generate a random number between 1 and 73,500, for example, with
the expression:
1 + INT(73500*URN)
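For readers without SYSTAT, the same idea can be sketched in Python. The sketch assumes that URN is a uniform random number on the interval from 0 up to (but not including) 1; the function names and the seed are hypothetical.

```python
import random

POPULATION_SIZE = 73500  # number of entries, as stated in the text
SAMPLE_SIZE = 50         # size of the sample used later in this chapter


def random_case_number(rng):
    """Mirror 1 + INT(73500*URN): a uniform integer from 1 to 73,500."""
    return 1 + int(POPULATION_SIZE * rng.random())


def simple_random_sample(rng, size=SAMPLE_SIZE):
    """Draw distinct case numbers; every subset of `size` is equally likely."""
    return sorted(rng.sample(range(1, POPULATION_SIZE + 1), size))


rng = random.Random(1985)  # arbitrary seed, only to make the sketch repeatable
print(random_case_number(rng))
print(simple_random_sample(rng))
```

Drawing the whole sample at once without replacement avoids picking the same case twice, which repeated calls to the single-number expression could do.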


There are too many pages in Who's Who to use this method, however. As a short cut,
I randomly generated a page number and picked a name from the page using the
random number generator. This method should work well provided that each page has
approximately the same number of names (between 19 and 21 in this case). The sample
is shown below:
AGE  SEX        AGE  SEX

 60  male        38  female
 74  male        44  male
 39  female      49  male
 78  male        62  male
 66  male        76  female
 63  male        51  male
 45  male        51  male
 56  male        75  male
 65  male        65  female
 51  male        41  male
 52  male        67  male
 59  male        50  male
 67  male        55  male
 48  male        45  male
 36  female      49  male
 34  female      58  male
 68  male        47  male
 50  male        55  male
 51  male        67  male
 47  male        58  male
 81  male        76  male
 56  male        70  male
 49  male        69  male
 58  male        46  male
 58  male        60  male


Specifying a Model
To make an inference about age, we need to construct a model for our population:

$$a = \mu + \epsilon$$

This model says that the age ($a$) of someone we pick from the book can be described
by an overall mean age ($\mu$) plus an amount of error ($\epsilon$) specific to that person and due
to random factors that are too numerous and insignificant to describe systematically.
Notice that we use Greek letters to denote things that we cannot observe directly and
Roman letters for those that we do observe. Of the unobservables in the model, $\mu$ is
called a parameter, and $\epsilon$, a random variable. A parameter is a constant that helps to
describe a population. Parameters indicate how a model is an instance of a family of
models for similar populations. A random variable varies like the tossing of a coin.
There are two more parameters associated with the random variable $\epsilon$ but not
appearing in the model equation. One is its mean ($\mu_\epsilon$), which we have rigged to be 0,
and the other is its standard deviation ($\sigma_\epsilon$ or simply $\sigma$). Because $a$ is simply the sum
of $\mu$ (a constant) and $\epsilon$ (a random variable), its standard deviation is also $\sigma$.
In specifying this model, we assume the following:
• The model is true for every member of the population.
• The error, plus or minus, that helps determine one population member's age is
  independent of (not predictable from) the error for other members.
• The errors in predicting all of the ages come from the same random distribution
  with a mean of 0 and a standard deviation of $\sigma$.

Estimating a Model
Because we have not sampled the entire population, we cannot compute the parameter
values directly from the data. We have only a small sample from a much larger
population, so we can estimate the parameter values only by using some statistical
method on our sample data. When our three assumptions are appropriate, the sample
mean will be a good estimate of the population mean. Without going into all of the
details, the sample estimate will be, on average, close to the values of the mean in the
population.


We can use various methods in SYSTAT to estimate the mean. One way is to
specify our model using Linear Regression. Select AGE and add it to the Dependent
list. With commands:
REGRESSION
MODEL AGE=CONSTANT

This model says that AGE is a function of a constant value ($\mu$). The rest is error ($\epsilon$).
Another method is to compute the mean from the Basic Statistics routines. The result
is shown below:
                     AGE
N OF CASES            50
MEAN              56.700
STANDARD DEV      11.620
STD. ERROR         1.643

Our best estimate of the mean age of people in Who's Who is 56.7 years.

Confidence Intervals
Our estimate seems reasonable, but it is not exactly correct. If we took more samples
of size 50 and computed estimates, how much would we expect them to vary? First, it
should be plain without any mathematics to see that the larger our sample, the closer
will be our sample estimate to the true value of $\mu$ in the population. After all, if we
could sample the entire population, the estimates would be the true values. Even so, the
variation in sample estimates is a function only of the sample size and the variation of
the ages in the population. It does not depend on the size of the population (number of
people in the book). Specifically, the standard deviation of the sample mean is the
standard deviation of the population divided by the square root of the sample size. This
standard error of the mean is listed on the output above as 1.643. On average, we
would expect our sample estimates of the mean age to vary by plus or minus a little
more than one and a half years, assuming samples of size 50.
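The mean, standard deviation, and standard error reported in the output can be recomputed from the 50 sampled ages. The following is a minimal Python sketch (not SYSTAT syntax); it assumes the standard deviation uses the n - 1 divisor and estimates the population standard deviation by the sample value.

```python
import math
import statistics

# The 50 sampled ages listed earlier (left column pair, then right).
ages = [60, 74, 39, 78, 66, 63, 45, 56, 65, 51, 52, 59, 67, 48, 36, 34, 68,
        50, 51, 47, 81, 56, 49, 58, 58, 38, 44, 49, 62, 76, 51, 51, 75, 65,
        41, 67, 50, 55, 45, 49, 58, 47, 55, 67, 58, 76, 70, 69, 46, 60]

n = len(ages)
mean = statistics.mean(ages)
sd = statistics.stdev(ages)           # sample standard deviation (n - 1 divisor)
standard_error = sd / math.sqrt(n)    # standard deviation of the sample mean

print(f"N of cases   {n}")
print(f"Mean         {mean:.3f}")
print(f"Standard dev {sd:.3f}")
print(f"Std. error   {standard_error:.3f}")
```

Note that the population size (73,500) appears nowhere in the calculation; only the sample size and the spread of the ages matter.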
If we knew the shape of the sampling distribution of mean age, we would be able to
complete our description of the accuracy of our estimate. There is an approximation
that works quite well, however. If the sample size is reasonably large (say, greater than
25), then the mean of a simple random sample is approximately normally distributed.
This is true even if the population distribution is not normal, provided the sample size
is large.


We now have enough information from our sample to construct a normal
approximation of the distribution of our sample mean. The following figure shows this
approximation to be centered at the sample estimate of 56.7 years. Its standard
deviation is taken from the standard error of the mean, 1.643 years.

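[Figure: normal approximation to the distribution of the sample mean. Horizontal axis: Mean Age (50 to 65); vertical axis: Density (0.0 to 0.3).]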
We have drawn the graph so that the central area comprises 95% of all the area under
the curve (from about 53.5 to 59.9). From this normal approximation, we have built a
95% symmetric confidence interval that gives us a specific idea of the variability of
our estimate. If we did this entire procedure again (sample 50 names, compute the
mean and its standard error, and construct a 95% confidence interval using the normal
approximation), then we would expect that 95 intervals out of a hundred so
constructed would cover the real population mean age. Remember, population mean
age is not necessarily at the center of the interval that we just constructed, but we do
expect the interval to be close to it.
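The 95% interval from the normal approximation can be sketched in a few lines of Python. The names are hypothetical; the interval uses the normal quantile, so it is slightly narrower than an interval based on the t distribution.

```python
import math
import statistics
from statistics import NormalDist

ages = [60, 74, 39, 78, 66, 63, 45, 56, 65, 51, 52, 59, 67, 48, 36, 34, 68,
        50, 51, 47, 81, 56, 49, 58, 58, 38, 44, 49, 62, 76, 51, 51, 75, 65,
        41, 67, 50, 55, 45, 49, 58, 47, 55, 67, 58, 76, 70, 69, 46, 60]

mean = statistics.mean(ages)
standard_error = statistics.stdev(ages) / math.sqrt(len(ages))

# Normal approximation to the sampling distribution of the mean.
sampling_dist = NormalDist(mu=mean, sigma=standard_error)

# The central 95% of the area lies between the 2.5% and 97.5% quantiles.
lower = sampling_dist.inv_cdf(0.025)
upper = sampling_dist.inv_cdf(0.975)
print(f"95% confidence interval: {lower:.1f} to {upper:.1f}")
```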

Hypothesis Testing
From the sample mean and its standard error, we can also construct hypothesis tests on
the mean. Suppose that someone believed that the average age of those listed in Who's
Who is 61 years. After all, we might have picked an unusual sample just through the
luck of the draw. Let's say, for argument, that the population mean age is 61 and the
standard deviation is 11.62. How likely would it be to find a sample mean age of 56.7?
If it is very unlikely, then we would reject this null hypothesis that the population mean
is 61. Otherwise, we would fail to reject it.


There are several ways to represent an alternative hypothesis against this null
hypothesis. We could make a simple alternative value of 56.7 years. Usually, however,
we make the alternative composite; that is, it represents a range of possibilities that do
not include the value 61. Here is how it would look:
$H_0$: $\mu = 61$ (null hypothesis)
$H_A$: $\mu \neq 61$ (alternative hypothesis)

We would reject the null hypothesis if our sample value for the mean were outside of
a set of values that a population value of 61 could plausibly generate. In this context,
"plausible" means more probable than a conventionally agreed upon critical level for
our test. This value is usually 0.05. A result that would be expected to occur fewer than
five times in a hundred samples if the null hypothesis were true is considered significant
and would be a basis for rejecting our null hypothesis.
Constructing this hypothesis test is mathematically equivalent to sliding the normal
distribution in the above figure to center over 61. We then look at the sample value 56.7
to see if it is outside of the middle 95% of the area under the curve. If so, we reject the
null hypothesis.

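[Figure: the normal approximation centered at the hypothesized mean of 61, with the sample mean of 56.7 marked. Horizontal axis: Mean Age (50 to 65); vertical axis: Density (0.0 to 0.3).]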
The following t test output shows a p value (probability) of 0.012 for this test. Because
this value is lower than 0.05, we would reject the null hypothesis that the mean age is
61. This is equivalent to saying that the value of 61 does not appear in the 95%
confidence interval.


One-sample t test of AGE with 50 cases;   Ho: Mean =      61.000

     Mean =      56.700    95.00% CI =      53.398 to      60.002
       SD =      11.620            t =      -2.617
                                  df =     49    Prob =       0.012
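The t statistic in this output is simply the distance between the sample mean and the hypothesized mean, measured in standard errors. The sketch below recomputes it in Python. Because the standard library has no t distribution, the p value shown is a normal-approximation stand-in (an assumption of this sketch), so it comes out somewhat smaller than the 0.012 reported above.

```python
import math
import statistics
from statistics import NormalDist

ages = [60, 74, 39, 78, 66, 63, 45, 56, 65, 51, 52, 59, 67, 48, 36, 34, 68,
        50, 51, 47, 81, 56, 49, 58, 58, 38, 44, 49, 62, 76, 51, 51, 75, 65,
        41, 67, 50, 55, 45, 49, 58, 47, 55, 67, 58, 76, 70, 69, 46, 60]
NULL_MEAN = 61.0  # hypothesized population mean from the text

n = len(ages)
mean = statistics.mean(ages)
standard_error = statistics.stdev(ages) / math.sqrt(n)

t = (mean - NULL_MEAN) / standard_error  # distance from the null in SE units
df = n - 1

# Two-sided tail area under a standard normal curve (large-sample stand-in
# for the t distribution with df degrees of freedom).
p_normal = 2 * NormalDist().cdf(-abs(t))

print(f"t = {t:.3f} on {df} df; normal-approximation p = {p_normal:.4f}")
```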

The mathematical duality between confidence intervals and hypothesis testing may
lead you to wonder which is more useful. The answer is that it depends on the context.
Scientific journals usually follow a hypothesis testing model because their null
hypothesis value for an experiment is usually 0 and the scientist is attempting to reject
the hypothesis that nothing happened in the experiment. Any rejection is usually taken
to be interesting, even when the sample size is so large that even tiny differences from
0 will be detected.
Those involved in making decisions (epidemiologists, business people, and
engineers) are often more interested in confidence intervals. They focus on the size
and credibility of an effect and care less whether it can be distinguished from 0. Some
statisticians, called Bayesians, go a step further and consider statistical decisions as a
form of betting. They use sample information to modify prior hypotheses. See Box and
Tiao (1973) or Berger (1985) for further information on Bayesian statistics.

Checking Assumptions
Now that we have finished our analyses, we should check some of the assumptions we
made in doing them. First, we should examine whether the data look normally
distributed. Although sample means will tend to be normally distributed even when the
population isn't, it helps to have a normally distributed population, especially when we
do not know the population standard deviation. The stem-and-leaf plot gives us a quick
idea:
Stem and leaf plot of variable: AGE, N = 50
Minimum:         34.000
Lower hinge:     49.000
Median:          56.000
Upper hinge:     66.000
Maximum:         81.000

        3   4
        3   689
        4   14
        4 H 556778999
        5   0011112
        5 M 556688889
        6   0023
        6 H 55677789
        7   04
        7   5668
        8   1


There is another plot, called a dot histogram (dit) plot, which looks like a stem-and-leaf
plot. We can use different symbols to denote males and females in this plot, however,
to see if there are differences in these subgroups. Although there are not enough
females in the sample to be sure of a difference, it is nevertheless a good idea to
examine it. The dot histogram reveals four of the six females to be younger than
everyone else.

A better test of normality is to plot the sorted age values against the corresponding
values of a mathematical normal distribution. This is called a normal probability plot.
If the data are normally distributed, then the plotted values should fall approximately on
a straight line. Our data plot fairly straight. Again, different symbols are used for the
males and females. The four young females appear in the bottom left corner of the plot.
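The coordinates of a normal probability plot can be computed directly. In this Python sketch the plotting positions (i - 0.5)/n are one common convention and an assumption here; the function names are hypothetical. The correlation between the sorted ages and the normal scores gives a rough numerical measure of how straight the plot is.

```python
import math
from statistics import NormalDist, mean

ages = [60, 74, 39, 78, 66, 63, 45, 56, 65, 51, 52, 59, 67, 48, 36, 34, 68,
        50, 51, 47, 81, 56, 49, 58, 58, 38, 44, 49, 62, 76, 51, 51, 75, 65,
        41, 67, 50, 55, 45, 49, 58, 47, 55, 67, 58, 76, 70, 69, 46, 60]


def normal_scores(n):
    """Expected standard normal values for n sorted observations."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


sorted_ages = sorted(ages)
scores = normal_scores(len(ages))
straightness = pearson(scores, sorted_ages)

# Each (score, age) pair is one point on the normal probability plot.
for score, age in list(zip(scores, sorted_ages))[:5]:
    print(f"{score:+.3f}  {age}")
print(f"correlation with a straight line: {straightness:.3f}")
```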

Does this possible difference in ages by gender invalidate our results? No, but it
suggests that we might want to examine the gender differences further to see whether
or not they are significant.


References
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. 2nd ed. New York:
Springer-Verlag.
Box, G. E. P. and Tiao, G. C. (1973). Bayesian inference in statistical analysis. Reading,
Mass.: Addison-Wesley.

Chapter

2
Bootstrapping and Sampling
Leland Wilkinson and Laszlo Engelman

Bootstrapping is not a module in SYSTAT. It is a procedure available in most
modules where appropriate. Bootstrapping is so important as a general statistical
methodology, however, that it deserves a separate chapter. SYSTAT handles
bootstrapping as a single option to the ESTIMATE command or its equivalent in each
module. The computations are handled without producing a scratch file of the
bootstrapped samples. This saves disk space and computer time. Bootstrap, jackknife,
and other samples are simply computed on-the-fly.

Statistical Background
Bootstrap (Efron and Tibshirani, 1993) is the most recent and most powerful of a
variety of strategies for producing estimates of parameters in samples taken from
unknown probability distributions. Efron and LePage (1992) summarize the problem
most succinctly. We have a set of real-valued observations $x_1, \ldots, x_n$ independently
sampled from an unknown probability distribution $F$. We are interested in estimating
some parameter $\theta$ by using the information in the sample data with an estimator
$\hat{\theta} = t(x)$. Some measure of the estimate's accuracy is as important as the estimate
itself; we want a standard error of $\hat{\theta}$ and, even better, a confidence interval on the true
value $\theta$.


Classical statistical methods provide a powerful way of handling this problem when
$F$ is known and $\theta$ is simple (when $\theta$, for example, is the mean of the normal
distribution). Focusing on the standard error of the mean, we have:

$$\mathrm{se}\{\bar{x}; F\} = \sqrt{\frac{\sigma^2(F)}{n}}$$

Substituting the unbiased estimate for $\sigma^2(F)$,

$$\hat{\sigma}^2(F) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

we have:

$$\mathrm{se}(\bar{x}) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n(n - 1)}}$$

Parametric methods often work fairly well even when the distribution is contaminated
or only approximately known because the central limit theorem shows that sums of
independent random variables with finite variances tend to be normal in large samples
even when the variables themselves are not normal. But problems arise for estimates
more complicated than a mean (medians, sample correlation coefficients, or
eigenvalues), especially in small or medium-sized samples and even, in some cases, in
large samples.
Strategies for approaching this problem nonparametrically have involved using
the empirical distribution $\hat{F}$ to obtain information needed for the standard error
estimate. One approach is Tukey's jackknife (Tukey, 1958), which is offered in
SAMPLE=JACKKNIFE. Tukey proposed computing $n$ subsets of $(x_1, \ldots, x_n)$, each
consisting of all of the cases except the $i$th deleted case (for $i = 1, \ldots, n$). He
produced standard errors as a function of the $n$ estimates from these subsets.
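A minimal Python sketch of the jackknife follows. The text does not give the formula, so the sketch uses the standard jackknife standard error, the square root of (n - 1)/n times the sum of squared deviations of the n leave-one-out estimates from their average; the function name is hypothetical, and the data are the 50 ages sampled in Chapter 1.

```python
import math
import statistics

ages = [60, 74, 39, 78, 66, 63, 45, 56, 65, 51, 52, 59, 67, 48, 36, 34, 68,
        50, 51, 47, 81, 56, 49, 58, 58, 38, 44, 49, 62, 76, 51, 51, 75, 65,
        41, 67, 50, 55, 45, 49, 58, 47, 55, 67, 58, 76, 70, 69, 46, 60]


def jackknife_se(values, estimator):
    """Jackknife standard error of estimator(values)."""
    n = len(values)
    # One estimate per subset, each subset omitting the ith case.
    leave_one_out = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    return math.sqrt((n - 1) / n * sum((t - center) ** 2 for t in leave_one_out))


se_mean = jackknife_se(ages, statistics.mean)
se_median = jackknife_se(ages, statistics.median)
print(f"jackknife SE of the mean:   {se_mean:.3f}")
print(f"jackknife SE of the median: {se_median:.3f}")
```

For the mean, the jackknife reproduces the classical standard error exactly; the same function accepts any other estimator, such as the median, although the jackknife is known to behave poorly for non-smooth statistics like the median.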
Another approach has involved subsampling, usually via simple random samples.
This option is offered in SAMPLE=SIMPLE. A variety of researchers in the 1950s and
1960s explored these methods empirically (for example, Block, 1960; see Noreen,
1989, for others). This method amounts to a Monte Carlo study in which the sample is
treated as the population. It is also closely related to methodology for permutation tests
(Fisher, 1935; Dwass, 1957; Edginton, 1980).
The bootstrap (Efron, 1979) has been the focus of most recent theoretical research.
$\hat{F}$ is defined as:

$$\hat{F}: \text{probability } 1/n \text{ on } x_i \text{ for } i = 1, 2, \ldots, n$$

Then, since

$$\sigma^2(\hat{F}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

we have:

$$\mathrm{se}\{\bar{x}; \hat{F}\} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n^2}}$$

The computer algorithm for getting the samples for generating $\hat{F}$ is to sample from
$(x_1, \ldots, x_n)$ with replacement. Efron and other researchers have shown that the
general procedure of generating samples and computing estimates $\hat{\theta}$ yields $\hat{\theta}$ data
on which we can make useful inferences. For example, instead of computing only $\hat{\theta}$
and its standard error, we can do histograms, densities, order statistics (for symmetric
and asymmetric confidence intervals), and other computations on our estimates. In
other words, there is much to learn from the bootstrap sample distributions of the
estimates themselves.
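The naive bootstrap described above can be sketched in Python as follows. This illustrates the resampling idea only, not SYSTAT's implementation; the function name, the seed, and the choice of 2,000 replications are assumptions. It uses the 50 ages from Chapter 1 and compares the bootstrap standard error of the mean with the plug-in formula above.

```python
import math
import random
import statistics

ages = [60, 74, 39, 78, 66, 63, 45, 56, 65, 51, 52, 59, 67, 48, 36, 34, 68,
        50, 51, 47, 81, 56, 49, 58, 58, 38, 44, 49, 62, 76, 51, 51, 75, 65,
        41, 67, 50, 55, 45, 49, 58, 47, 55, 67, 58, 76, 70, 69, 46, 60]


def bootstrap_estimates(values, estimator, replications, rng):
    """Apply estimator to `replications` samples drawn with replacement."""
    n = len(values)
    return [estimator(rng.choices(values, k=n)) for _ in range(replications)]


rng = random.Random(2002)  # arbitrary seed, only to make the sketch repeatable
boot_means = bootstrap_estimates(ages, statistics.mean, 2000, rng)

# Bootstrap standard error: spread of the estimates across the resamples.
boot_se = statistics.stdev(boot_means)

# Plug-in standard error of the mean under the empirical distribution.
n = len(ages)
xbar = statistics.mean(ages)
plug_in_se = math.sqrt(sum((x - xbar) ** 2 for x in ages) / n ** 2)

# Percentile interval read off the order statistics of the estimates.
ordered = sorted(boot_means)
lower = ordered[int(0.025 * len(ordered))]
upper = ordered[int(0.975 * len(ordered)) - 1]

print(f"bootstrap SE = {boot_se:.3f}, plug-in SE = {plug_in_se:.3f}")
print(f"95% percentile interval: {lower:.1f} to {upper:.1f}")
```

Replacing `statistics.mean` with another estimator (a median, for instance) requires no other change, which is the practical appeal of the method.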
There are some concerns, however. The naive bootstrap computed this way (with
SAMPLE=BOOT and STATS for computing means and standard deviations) is not
especially good for long-tailed distributions. It is also not suited for time-series or
stochastic data. See LePage and Billard (1992) for recent research on and solutions to
some of these problems. There are also several simple improvements to the naive
bootstrap. One is the pivot, or bootstrap-t method, discussed in Efron and Tibshirani
(1993). This is especially useful for confidence intervals on the mean of an unknown
distribution. Efron (1982) discusses other applications. There are also refinements
based on correction for bias in the bootstrap sample itself (DiCiccio and Efron, 1996).
In general, however, the naive bootstrap can help you get better estimates of
standard errors and confidence intervals than many large-sample approximations, such
as Fisher's z transformation for Pearson correlations or Wald tests for coefficients in
nonlinear models. And in cases in which no good approximations are available (see
some of the examples below), the bootstrap is the only way to go.

Bootstrapping in SYSTAT
Bootstrap Main Dialog Box
No dialog box exists for performing bootstrapping; therefore, you must use SYSTAT's
command language. To do a bootstrap analysis, simply add the sample type to the
command that initiates model estimation (usually ESTIMATE).

Using Commands
The syntax is:
ESTIMATE / SAMPLE=BOOT(m,n)
SIMPLE(m,n)
JACK

The arguments m and n stand for the number of samples and the sample size of each
sample. The parameter n is optional and defaults to the number of cases in the file.
The BOOT option generates samples with replacement, SIMPLE generates samples
without replacement, and JACK generates a jackknife set.
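The difference between the three sampling schemes can be illustrated outside SYSTAT. The Python functions below are hypothetical stand-ins that mimic the described behavior of BOOT(m,n), SIMPLE(m,n), and JACK on a list of case numbers; they are not SYSTAT's implementation.

```python
import random


def boot_samples(cases, m, n=None, rng=random):
    """m samples of size n drawn WITH replacement (like BOOT(m,n))."""
    n = len(cases) if n is None else n  # n defaults to the number of cases
    return [rng.choices(cases, k=n) for _ in range(m)]


def simple_samples(cases, m, n=None, rng=random):
    """m samples of size n drawn WITHOUT replacement (like SIMPLE(m,n))."""
    n = len(cases) if n is None else n
    return [rng.sample(cases, n) for _ in range(m)]


def jack_samples(cases):
    """The jackknife set: one sample per case, each omitting that case."""
    return [cases[:i] + cases[i + 1:] for i in range(len(cases))]


cases = list(range(1, 17))   # 16 case numbers, as in a 16-row data file
rng = random.Random(16)      # arbitrary seed for repeatability

print(boot_samples(cases, 2, rng=rng))       # repeats are possible
print(simple_samples(cases, 2, 8, rng=rng))  # no case appears twice
print(len(jack_samples(cases)))              # one subset per case
```

With the default n, a sample without replacement is just a reordering of the whole file, so SIMPLE is mainly useful with n smaller than the number of cases.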

Usage Considerations
Types of data. Bootstrapping works on procedures with rectangular data only.
Print options. It is best to set PRINT=NONE; otherwise, you will get 16 miles of output.
If you want to watch, however, set PRINT=LONG and have some fun.
Quick Graphs. Bootstrapping produces no Quick Graphs. You use the file of bootstrap
estimates and produce the graphs you want. See the examples.
Saving files. If you are doing this for more than entertainment (watching output fly by),
save your data into a file before you use the ESTIMATE / SAMPLE command. See the
examples.
BY groups. By all means. Are you a masochist?

Case frequencies. Yes, FREQ=<variable> works. This feature does not use extra
memory.
Case weights. Use case weighting if it is available in a specific module.

Examples
A few examples will serve to illustrate bootstrapping. They cover only a few of the
statistical modules, however. We will focus on the tools you can use to manipulate
output and get the summary statistics you need for bootstrap estimates.

Example 1
Linear Models
This example involves the famous Longley (1967) regression data. These real data
were collected by James Longley at the Bureau of Labor Statistics to test the limits of
regression software. The predictor variables in the data set are highly collinear, and
several coefficients of variation are extremely large. The input is:
USE LONGLEY
GLM
MODEL TOTAL=CONSTANT+DEFLATOR..TIME
SAVE BOOT / COEF
ESTIMATE / SAMPLE=BOOT(2500,16)
OUTPUT TEXT1
USE LONGLEY
MODEL TOTAL=CONSTANT+DEFLATOR..TIME
ESTIMATE
USE BOOT
STATS
STATS X(1..6)
OUTPUT *
BEGIN
DEN X(1..6) / NORM
DEN X(1..6)
END

Notice that we save the coefficients into the file BOOT. We request 2500 bootstrap
samples of size 16 (the number of cases in the file). Then we fit the Longley data with
a single regression to compare the result to our bootstrap. Finally, we use the bootstrap
file and compute basic statistics on the bootstrap estimated regression coefficients. The
OUTPUT command is used to save this part of the output to a file. We should not use it
earlier in the program unless we want to save the output for the 2500 regressions. To
view the bootstrap distributions, we create histograms of the coefficients.
The resulting output is:
Variables in the SYSTAT Rectangular file are:
 DEFLATOR   GNP   UNEMPLOY   ARMFORCE   POPULATN   TIME
 TOTAL

Dep Var: TOTAL   N: 16   Multiple R: 0.998   Squared multiple R: 0.995

Adjusted squared multiple R: 0.992   Standard error of estimate: 304.854

Effect       Coefficient    Std Error   Std Coef  Tolerance        t   P(2 Tail)

CONSTANT    -3482258.635   890420.384        0.0          .   -3.911       0.004
DEFLATOR          15.062       84.915      0.046      0.007    0.177       0.863
GNP               -0.036        0.033     -1.014      0.001   -1.070       0.313
UNEMPLOY          -2.020        0.488     -0.538      0.030   -4.136       0.003
ARMFORCE          -1.033        0.214     -0.205      0.279   -4.822       0.001
POPULATN          -0.051        0.226     -0.101      0.003   -0.226       0.826
TIME            1829.151      455.478      2.480      0.001    4.016       0.003

Analysis of Variance

Source       Sum-of-Squares   df   Mean-Square   F-ratio       P

Regression      1.84172E+08    6   3.06954E+07   330.285   0.000
Residual         836424.056    9     92936.006
--------------------------------------------------------------------------------
Durbin-Watson D Statistic       2.559
First Order Autocorrelation    -0.348

Variables in the SYSTAT Rectangular file are:
 CONSTANT   X(1..6)

                     X(1)       X(2)       X(3)       X(4)       X(5)        X(6)
N of cases           2500       2500       2500       2500       2500        2499
Minimum          -816.248     -0.846    -12.994     -8.864     -2.591   -5050.438
Maximum          1312.052      0.496      7.330      2.617   3142.235   12645.703
Mean               20.648     -0.049     -2.214     -1.118      1.295    1980.382
Standard Dev      128.301      0.064      0.903      0.480     62.845     980.870

Following is the plot of the results:


[Figure: histograms of the bootstrap coefficient estimates X(1) through X(6), each with a superimposed normal curve; the vertical axes show Count and Proportion per Bar.]

The bootstrapped standard errors are all larger than the normal-theory standard errors.
The most dramatic difference is for the POPULATN coefficient (62.845 versus 0.226).
It is well known that multicollinearity leads to large standard errors for
regression coefficients, but the bootstrap makes this even clearer.
Normal curves have been superimposed on the histograms, showing that the
coefficients are not normally distributed. We have run a relatively large number of
samples (2500) to reveal these long-tailed distributions. Were these data to be analyzed
formally, it would take a huge number of samples to get useful standard errors.
Beaton, Rubin, and Barone (1976) used a randomization technique to highlight this
problem. They added a uniform random extra digit to Longley's data so that their data
sets rounded to Longley's values and found in a simulation that the variance of the
simulated coefficient estimates was larger in many cases than the miscalculated
solutions from the more poorly designed regression programs.
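
The comparison made in this example, a bootstrap standard error set beside a normal-theory standard error, can be sketched in Python for a one-predictor regression. This is an illustration only: the data are simulated rather than Longley's, only 500 resamples are drawn, and all function names are hypothetical.

```python
import random
import statistics

def ols(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def normal_theory_se(xs, ys):
    """Usual standard error of the slope: sqrt(residual MS / Sxx)."""
    a, b = ols(xs, ys)
    mx = statistics.fmean(xs)
    sxx = sum((x - mx) ** 2 for x in xs)
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return (rss / (len(xs) - 2) / sxx) ** 0.5

def bootstrap_slopes(xs, ys, m, rng):
    """Refit the line to m case-resampled data sets; return the m slopes."""
    n = len(xs)
    slopes = []
    while len(slopes) < m:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        if max(bx) == min(bx):      # degenerate resample: slope undefined
            continue
        by = [ys[i] for i in idx]
        slopes.append(ols(bx, by)[1])
    return slopes

rng = random.Random(54321)
xs = [float(i) for i in range(1, 17)]               # 16 simulated cases
ys = [2.0 * x + rng.gauss(0.0, 3.0) for x in xs]
slopes = bootstrap_slopes(xs, ys, 500, rng)
boot_se = statistics.stdev(slopes)                  # bootstrap standard error
theory_se = normal_theory_se(xs, ys)                # normal-theory standard error
print(f"bootstrap SE {boot_se:.3f}   normal-theory SE {theory_se:.3f}")
```

With well-behaved simulated data like these, the two standard errors tend to be close; the Longley data are of interest precisely because they are not.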

Example 2
Spearman Rank Correlation
This example involves law school data from Efron and Tibshirani (1993). They use
these data to illustrate the usefulness of the bootstrap for calculating standard errors on
the Pearson correlation. Here we make similar calculations for a 95% confidence interval
on the Spearman correlation.
The bootstrap estimates are saved into a temporary file. The file format is
CORRELATION, meaning that 1000 correlation matrices will be saved, stacked on top
of each other in the file. Consequently, we need BASIC to sift through and delete every
odd line (the diagonal of the matrix). We also have to remember to change the file type
to RECTANGULAR so that we can sort and do other things later. Another approach
would have been to use the rectangular form of the correlation output:
SPEARMAN LSAT*GPA

Next, we reuse the new file and sort the correlations. Finally, we print the nearest
values to the percentiles. Following is the input:
CORR
GRAPH NONE
USE LAW
RSEED=54321
SAVE TEMP
SPEARMAN LSAT GPA / SAMPLE=BOOT(1000,15)
BASIC
USE TEMP
TYPE=RECTANGULAR
IF CASE<>2*INT(CASE/2) THEN DELETE
SAVE BLAW
RUN
USE BLAW
SORT LSAT
IF CASE=975 THEN PRINT "95% CI Upper:",LSAT
IF CASE=25 THEN PRINT "95% CI Lower:",LSAT
OUTPUT TEXT2
RUN
OUTPUT *
DENSITY LSAT

Following is the output, our asymmetric confidence interval:

95% CI Lower:    0.476
95% CI Upper:    0.953

The histogram of the entire file shows the overall shape of the distribution. Notice its
asymmetry.
[Figure: histogram of the bootstrap Spearman correlations (variable LSAT); the horizontal axis runs from 0.0 to 1.2, and the vertical axes show Count and Proportion per Bar.]
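
The same percentile calculation can be sketched in Python. This is not SYSTAT code: the 15 pairs below are simulated stand-ins rather than the law school data, and the helper names are hypothetical. As in the SYSTAT program, the limits are read from cases 25 and 975 of the 1000 sorted bootstrap values.

```python
import random

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def bootstrap_ci(x, y, m=1000, rng=None):
    """Naive 95% percentile interval from m resamples of the (x, y) pairs."""
    rng = rng or random.Random()
    n = len(x)
    stats = []
    while len(stats) < m:
        idx = [rng.randrange(n) for _ in range(n)]
        bx, by = [x[i] for i in idx], [y[i] for i in idx]
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue                # correlation undefined for a constant sample
        stats.append(spearman(bx, by))
    stats.sort()
    # Cases 25 and 975 (1-based) when m = 1000
    return stats[m * 25 // 1000 - 1], stats[m * 975 // 1000 - 1]

rng = random.Random(54321)
x = [rng.gauss(0.0, 1.0) for _ in range(15)]        # simulated, not the law data
y = [a + rng.gauss(0.0, 1.0) for a in x]
lower, upper = bootstrap_ci(x, y, 1000, rng)
print(f"95% CI Lower: {lower:.3f}   95% CI Upper: {upper:.3f}")
```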

Example 3
Confidence Interval on a Median
We will use the STATS module to compute a 95% confidence interval on the median
(Efron, 1979). The input is:
STATS
GRAPH NONE
USE OURWORLD
SAVE TEMP
STATS LIFE_EXP / MEDIAN,SAMPLE=BOOT(1000,57)
BASIC
USE TEMP
SAVE TEMP2
IF STATISTC$<>"Median" THEN DELETE
RUN
USE TEMP2
SORT LIFE_EXP
IF CASE=975 THEN PRINT "95% CI Upper:",LIFE_EXP
IF CASE=25 THEN PRINT "95% CI Lower:",LIFE_EXP
OUTPUT TEXT3
RUN
OUTPUT *
DENSITY LIFE_EXP

Following is the output:

95% CI Lower:    63.000
95% CI Upper:    71.000

Following is the histogram of the bootstrap sample medians:

[Figure: histogram of the bootstrap medians of LIFE_EXP; the horizontal axis runs from 50 to 80, and the vertical axes show Count and Proportion per Bar.]

Keep in mind that we are using the naive bootstrap method here, trusting the
unmodified distribution of the bootstrap sample to set percentiles. Looking at the
bootstrap histogram, we can see that the distribution is skewed and irregular. There are
improvements that can be made in these estimates. Also, we have to be careful about
how we interpret a confidence interval on a median.
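
One reason the bootstrap distribution of a median looks irregular is that a resampled median can take only a limited set of values. The Python sketch below shows this with 57 simulated integers standing in for LIFE_EXP (not the OURWORLD data); the function name is hypothetical.

```python
import random
import statistics
from collections import Counter

def bootstrap_medians(data, m, rng):
    """Return the sorted medians of m resamples drawn with replacement."""
    n = len(data)
    return sorted(statistics.median(rng.choices(data, k=n)) for _ in range(m))

rng = random.Random(54321)
data = [rng.randint(40, 80) for _ in range(57)]     # simulated values
medians = bootstrap_medians(data, 1000, rng)

lower = medians[24]      # case 25 of the sorted file
upper = medians[974]     # case 975 of the sorted file
distinct = Counter(medians)                         # how lumpy is the distribution?
print(f"95% CI Lower: {lower}   95% CI Upper: {upper}")
print(f"{len(distinct)} distinct median values among 1000 resamples")
```

Because the sample size is odd, every bootstrap median is one of the observed data values, so the histogram is a set of spikes rather than a smooth curve.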

Example 4
Canonical Correlations: Using Text Output
Most statistics can be bootstrapped by saving into SYSTAT files, as shown in the
examples. Sometimes you may want to search through bootstrap output for a single
number and compute standard errors or graphs for that statistic. The following example
uses SETCOR to compute the distribution of the two canonical correlations relating the
species to measurements in the Fisher Iris data. The same correlations are computed in
the DISCRIM procedure. Following is the input:
SETCOR
USE IRIS
MODEL SPECIES=SEPALLEN..PETALWID
CATEGORY SPECIES
OUTPUT TEMP
ESTIMATE / SAMPLE=BOOT(500,150)
OUTPUT *
BASIC
GET TEMP
INPUT A$,B$
LET R1=.
LET R2=.
LET FOUND=.
IF A$="Canonical" AND B$="correlations" ,
THEN LET FOUND=CASE
IF LAG(FOUND,2)<>. THEN FOR
LET R1=VAL(A$)
LET R2=VAL(B$)
NEXT
IF R1=. AND R2=. THEN DELETE
SAVE CC
RUN
EXIT
USE CC
DENSITY R1 R2 / DIT

Notice how the BASIC program searches through the output file TEMP.DAT for the
words "Canonical correlations" at the beginning of a line. Two lines later, the actual
numbers are in the output, so we use the LAG function to check when we are at that
point after having located the string. Then we convert the printed values back to
numbers with the VAL() function. If you are concerned with precision, use a larger
format for the output. Finally, we delete unwanted rows and save the results into the
file CC. From that file, we plot the two canonical correlations. For fun, we do a dot
histogram (dit) plot.
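
The search-and-convert logic of the BASIC program can be expressed in Python as follows. This is a sketch under assumptions: the sample text only imitates the layout described above (a heading line, then the two numbers two lines later), and its values are invented.

```python
def canonical_correlations(lines):
    """Collect the two numbers found two lines below each heading line."""
    pairs = []
    for i, line in enumerate(lines):
        words = line.split()
        if words[:2] == ["Canonical", "correlations"] and i + 2 < len(lines):
            fields = lines[i + 2].split()
            # Convert the printed values back to numbers, like VAL() in BASIC
            pairs.append((float(fields[0]), float(fields[1])))
    return pairs

# Hypothetical excerpt of saved text output (values are made up)
sample_output = """\
Canonical correlations

     0.985     0.471
Some other output line
Canonical correlations

     0.979     0.502
""".splitlines()

pairs = canonical_correlations(sample_output)
r1 = [p[0] for p in pairs]      # first canonical correlation per sample
r2 = [p[1] for p in pairs]      # second canonical correlation per sample
```

Values recovered this way carry only the precision that was printed, which is why a wider output format matters.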

Following is the graph:

Notice the stripes in the plot on the left. These reveal the three-digit rounding we
incurred by using the standard FORMAT=3.

Computation
Computations are done by the respective statistical modules. Sampling is done on the
data.

Algorithms
Bootstrapping and other sampling is implemented via a one-pass algorithm that does
not use extra storage for the data. Samples are generated using the SYSTAT uniform
random number generator. It is always a good idea to reset the seed when running a
problem so that you can be certain where the random number generator started if it
becomes necessary to replicate your results.
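
The value of resetting the seed can be shown with any seeded generator. The Python sketch below uses Python's own generator, not SYSTAT's; the function name and seed values are arbitrary.

```python
import random

def boot_indices(n, seed):
    """Case indices for one bootstrap sample, drawn from a seeded generator."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]

first = boot_indices(16, 54321)
again = boot_indices(16, 54321)     # same seed: the identical sample
other = boot_indices(16, 12345)     # a different seed starts a different stream
```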

Missing Data
Cases with missing data are handled by the specific module.

References
Beaton, A. E., Rubin, D. B., and Barone, J. L. (1976). The acceptability of regression
solutions: Another look at computational accuracy. Journal of the American Statistical
Association, 71, 158-168.
Block, J. (1960). On the number of significant findings to be expected by chance.
Psychometrika, 25, 369-380.
DiCiccio, T. J. and Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11,
189-228.
Dwass, M. (1957). Modified randomization tests for nonparametric hypotheses. Annals of
Mathematical Statistics, 28, 181-187.
Edgington, E. S. (1980). Randomization tests. New York: Marcel Dekker.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of
Statistics, 7, 1-26.
Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Vol. 38 of
CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia, Penn.:
SIAM.
Efron, B. and LePage, R. (1992). Introduction to bootstrap. In R. LePage and L. Billard
(eds.), Exploring the Limits of Bootstrap. New York: John Wiley & Sons, Inc.
Efron, B. and Tibshirani, R. J. (1993). An introduction to the bootstrap. New York:
Chapman & Hall.
Fisher, R. A. (1935). The design of experiments. London: Oliver & Boyd.
Longley, J. W. (1967). An appraisal of least squares for the electronic computer from the
point of view of the user. Journal of the American Statistical Association, 62, 819-841.
Noreen, E. W. (1989). Computer intensive methods for testing hypotheses: An introduction.
New York: John Wiley & Sons, Inc.
Tukey, J. W. (1958). Bias and confidence in not quite large samples. Annals of
Mathematical Statistics, 29, 614.

Chapter 3
Classification and Regression Trees
Leland Wilkinson

The TREES module computes classification and regression trees. Classification trees
include those models in which the dependent variable (the predicted variable) is
categorical. Regression trees include those in which it is continuous. Within these
types of trees, the TREES module can use categorical or continuous predictors,
depending on whether a CATEGORY statement includes some or all of the predictors.
For any of the models, a variety of loss functions is available. Each loss function
is expressed in terms of a goodness-of-fit statistic: the proportion of reduction in
error (PRE). For regression trees, this statistic is equivalent to the multiple $R^2$. Other
loss functions include the Gini index, twoing (Breiman et al., 1984), and the phi
coefficient.
TREES produces graphical trees called mobiles (Wilkinson, 1995). At the end of
each branch is a density display (box plot, dot plot, histogram, etc.) showing the
distribution of observations at that point. The branches balance (like a Calder mobile)
at each node so that the branch is level, given the number of observations at each end.
The physical analogy is most obvious for dot plots, in which the stacks of dots (one
for each observation) balance like marbles in bins.
TREES can also produce a SYSTAT BASIC program to code new observations
and predict the dependent variable. This program can be saved to a file and run from
the command window or submitted as a program file.

Statistical Background
Trees are directed graphs beginning with one node and branching to many. They are
fundamental to computer science (data structures), biology (classification),
psychology (decision theory), and many other fields. Classification and regression
trees are used for prediction. In the last two decades, they have become popular as
alternatives to regression, discriminant analysis, and other procedures based on
algebraic models. Tree-fitting methods have become so popular that several
commercial programs now compete for the attention of market researchers and others
looking for software.
Different commercial programs produce different results with the same data,
however. Worse, some programs provide no documentation or supporting materials to
explain their algorithms. The result is a marketplace of competing claims, jargon, and
misrepresentation. Reviews of these packages (for example, Levine, 1991; Simon,
1991) use words like "sorcerer," "magic formula," and "wizardry" to describe the
algorithms and express frustration at vendors' scant documentation. Some vendors, in
turn, have represented tree programs as state-of-the-art artificial intelligence
procedures capable of discovering hidden relationships and structures in databases.
Despite the marketing hyperbole, most of the now-popular tree-fitting algorithms
have been around for decades. The modern commercial packages are mainly
microcomputer ports (with attractive interfaces) of the mainframe programs that
originally implemented these algorithms. Warnings of abuse of these techniques are
not new either (for example, Einhorn, 1972; Bishop, Fienberg, and Holland, 1975).
Originally proposed as automatic procedures for detecting interactions among
variables, tree-fitting methods are actually closely related to classical cluster analysis
(Hartigan, 1975).
This introduction will attempt to sort out some of the differences between
algorithms and illustrate their use on real data. In addition, tree analyses will be
compared to discriminant analysis and regression.

The Basic Tree Model


The figure below shows a tree for predicting decisions by a medical school admissions
committee (Milstein et al., 1975). It was based on data for a sample of 727 applicants.
We selected a tree procedure for this analysis because it was easy to present the results
to the Yale Medical School admissions committee and because the tree model could
serve as a basis for structuring their discussions about admissions policy.
Notice that the values of the predicted variable (the committee's decision to reject
or interview) are at the bottom of the tree and the predictors (Medical College
Admissions Test and college grade point average) come into the system at each node
of the tree.

The top node contains the entire sample. Each remaining node contains a subset of
the sample in the node directly above it. Furthermore, each node contains the sum of
the samples in the nodes connected to and directly below it. The tree thus splits
samples.
Each node can be thought of as a cluster of objects, or cases, that is to be split by
further branches in the tree. The numbers in parentheses below the terminal nodes
show how many cases are incorrectly classified by the tree. A similar tree data structure
is used for representing the results of single and complete linkage and other forms of
hierarchical cluster analysis (Hartigan, 1975). Tree prediction models add two
ingredients: the predictor and predicted variables labeling the nodes and branches.
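[Figure: binary tree for the admissions data. The root node (n=727) splits on GRADE POINT AVERAGE at 3.47 into subsamples of 342 and 385 cases; lower nodes split on MCAT VERBAL (at 555 and at 535) and on MCAT QUANTITATIVE (at 655); terminal nodes are labeled REJECT or INTERVIEW, with the number of misclassified cases in parentheses.]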

The tree is binary because each node is split into only two subsamples. Classification
or regression trees do not have to be binary, but most are. Despite the marketing claims
of some vendors, nonbinary, or multibranch, trees are not superior to binary trees. Each
is a permutation of the other, as shown in the figure below.
The tree on the left (ternary) is not more parsimonious than that on the right (binary).
Both trees have the same number of parameters, or split points, and any statistics
associated with the tree on the left can be converted trivially to fit the one on the right.
A computer program for scoring either tree (IF ... THEN ... ELSE) would look identical.
For display purposes, it is often convenient to collapse binary trees into multibranch
trees, but this is not necessary.
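
The equivalence of the two scoring programs is easy to see in code. In this Python sketch, a ternary node with two cut points and a pair of nested binary nodes using the same cut points assign every case to the same terminal node; the function names and cut points are hypothetical.

```python
def score_ternary(x, c1, c2):
    """One node with three branches, cut at c1 < c2."""
    if x < c1:
        return 1
    elif x < c2:
        return 2
    else:
        return 3

def score_binary(x, c1, c2):
    """Two nested binary nodes using the same two cut points."""
    if x < c1:
        return 1
    else:
        if x < c2:
            return 2
        else:
            return 3

# Both trees use two split points and classify identically
same = all(score_ternary(x, 10, 20) == score_binary(x, 10, 20)
           for x in range(0, 31))
```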

[Figure: a ternary tree (left) and a binary tree (right), each separating the same three categories, labeled 1, 2, and 3.]

Some programs that do multibranch splits do not allow further splitting on a predictor
once it has been used. This has an appealing simplicity. However, it can lead to
unparsimonious trees. It is unnecessary to make this restriction before fitting a tree.
The figure below shows an example of this problem. The upper right tree classifies
objects on an attribute by splitting once on shape, once on fill, and again on shape. This
allows the algorithm to separate the objects into only four terminal nodes having
common values. The upper left tree splits on shape and then only on fill. By not
allowing any other splits on shape, the tree requires five terminal nodes to classify
correctly. This problem cannot be solved by splitting first on fill, as the lower left tree
shows. In general, restricting splits to only one branch for each predictor results in
more terminal nodes.
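[Figure: three trees classifying objects by splits on shape and fill: upper left (shape, then fill only), upper right (shape, fill, then shape again), and lower left (fill first).]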

Categorical or Quantitative Predictors


The predictor variables in the admissions tree shown earlier are quantitative, so splits are created by
determining cut points on a scale. If predictor variables are categorical, as in the figure
above, splits are made between categorical values. It is not necessary to categorize
predictors before computing trees. This is as dubious a practice as recoding data
well-suited for regression into categories in order to use chi-square tests. Those who
recommend this practice are turning silk purses into sows' ears. In fact, if variables are
categorized before doing tree computations, then poorer fits are likely to result.
Algorithms are available for mixed quantitative and categorical predictors, analogous
to analysis of covariance.

Regression Trees
Morgan and Sonquist (1963) proposed a simple method for fitting trees to predict a
quantitative variable. They called the method Automatic Interaction Detection
(AID). The algorithm performs stepwise splitting. It begins with a single cluster of
cases and searches a candidate set of predictor variables for a way to split the cluster
into two clusters. Each predictor is tested for splitting as follows: sort all the n cases on
the predictor and examine all $n - 1$ ways to split the cluster in two. For each possible
split, compute the within-cluster sum of squares about the mean of the cluster on the
dependent variable. Choose the best of the $n - 1$ splits to represent the predictor's
contribution. Now do this for every other predictor. For the actual split, choose the
predictor and its cut point that yields the smallest overall within-cluster sum of squares.
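
The split search for quantitative predictors can be sketched in Python. This is an illustration of the procedure just described, not the TREES implementation; placing the cut point midway between adjacent sorted values and skipping cuts between tied values are assumptions made here, and the data are invented.

```python
def within_ss(ys):
    """Sum of squares of ys about their own mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys):
    """Return (cut, within-cluster SS) for the best of the n - 1 splits."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                        # cannot cut between tied values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ss = within_ss(left) + within_ss(right)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint (an assumption)
        if best is None or ss < best[1]:
            best = (cut, ss)
    return best

def best_predictor(predictors, ys):
    """Choose the predictor whose best split has the smallest SS."""
    results = {name: best_split(xs, ys) for name, xs in predictors.items()}
    usable = [name for name in results if results[name] is not None]
    name = min(usable, key=lambda k: results[k][1])
    cut, ss = results[name]
    return name, cut, ss

y = [1, 1, 1, 9, 9, 9]
predictors = {"x1": [1, 2, 3, 4, 5, 6],     # separates y perfectly at 3.5
              "x2": [6, 1, 5, 2, 4, 3]}     # does not
choice = best_predictor(predictors, y)
```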
Categorical predictors require a different approach. Since categories are unordered,
all possible splits between categories must be considered. For deciding on one split of
$k$ categories into two groups, this means that $2^{k-1} - 1$ possible splits must be considered.
Once a split is found, its suitability is measured on the same within-cluster sum of
squares as for a quantitative predictor.
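
The exhaustive search over category splits can be sketched in Python. Counting each division into two nonempty groups once (a split and its mirror image are the same split), $k$ categories give $2^{k-1} - 1$ candidates, for example 7 when $k = 4$. The function names and data below are hypothetical.

```python
from itertools import combinations

def binary_splits(categories):
    """Yield each division of the categories into two nonempty groups once."""
    cats = list(categories)
    first, rest = cats[0], cats[1:]
    # Keeping the first category on the left avoids generating mirror images;
    # stopping before r == len(rest) keeps the right group nonempty.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            left = [first, *combo]
            right = [c for c in rest if c not in combo]
            yield left, right

def within_ss(ys):
    """Sum of squares of ys about their own mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_categorical_split(xs, ys):
    """Return (left, right, SS) for the split with the smallest within SS."""
    best = None
    for left, right in binary_splits(sorted(set(xs))):
        ss = (within_ss([y for x, y in zip(xs, ys) if x in left])
              + within_ss([y for x, y in zip(xs, ys) if x in right]))
        if best is None or ss < best[2]:
            best = (left, right, ss)
    return best

xs = ["a", "b", "c", "d", "a", "b", "c", "d"]
ys = [1, 9, 1, 9, 2, 8, 2, 8]
n_splits = len(list(binary_splits("abcd")))
best = best_categorical_split(xs, ys)
```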
Morgan and Sonquist called their algorithm AID because it naturally incorporates
interaction among predictors. Interaction is not correlation. It has to do, instead, with
conditional discrepancies. In the analysis of variance, interaction means that a trend
within one level of a variable is not parallel to a trend within another level of the same
variable. In the ANOVA model, interaction is represented by cross-products between
predictors. In the tree model, it is represented by branches from the same node that
have different splitting predictors further down the tree.

The figure below shows a tree without interactions on the left and with interactions on
the right. Because interaction trees are a natural by-product of the AID splitting
algorithm, Morgan and Sonquist called the procedure automatic. In fact, AID trees
without interactions are quite rare for real data, so the procedure is indeed automatic. To
search for interactions using stepwise regression or ANOVA linear modeling, we would
have to generate $2^p$ interactions among $p$ predictors and compute partial correlations for
every one of them in order to decide which ones to include in our formal model.

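[Figure: a tree without interactions (left) and a tree with interactions (right); nodes are labeled with their splitting predictors.]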

Classification Trees
Regression trees parallel regression/ANOVA modeling, in which the dependent
variable is quantitative. Classification trees parallel discriminant analysis and
algebraic classification methods. Kass (1980) proposed a modification to AID called
CHAID for categorized dependent and independent variables. His algorithm
incorporated a sequential merge-and-split procedure based on a chi-square test
statistic. Kass was concerned about computation time (although this has since proved
an unnecessary worry), so he decided to settle for a suboptimal split on each predictor
instead of searching for all possible combinations of the categories. Kass's algorithm
is like sequential crosstabulation. For each predictor:
- Crosstabulate the $m$ categories of the predictor with the $k$ categories of the
  dependent variable.
- Find the pair of categories of the predictor whose $2 \times k$ subtable is least
  significantly different on a chi-square test and merge these two categories.
- If the chi-square test statistic is not significant according to a preset critical value,
  repeat this merging process for the selected predictor until no nonsignificant
  chi-square is found for a subtable.
- Choose the predictor variable whose chi-square is the largest and split the sample
  into $l$ subsets ($l \le m$), where $l$ is the number of categories resulting from the merging
  process on that predictor.
- Continue splitting, as with AID, until no significant chi-squares result.
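
The merging step for a single predictor can be sketched in Python. This is a simplified illustration, not Kass's full algorithm and not the TREES implementation: the critical value is supplied by the caller (3.84 is the familiar 0.05 critical value of chi-square with 1 degree of freedom, appropriate when the dependent variable has two categories), no Bonferroni adjustment is made, and the counts are invented.

```python
from itertools import combinations

def chi_square(table):
    """Pearson chi-square statistic for a table given as rows of counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for row, rt in zip(table, row_tot):
        for obs, ct in zip(row, col_tot):
            expected = rt * ct / n
            if expected > 0:
                stat += (obs - expected) ** 2 / expected
    return stat

def merge_categories(table, labels, critical):
    """Merge predictor categories until every remaining pair differs.

    table has one row of counts per predictor category and one column
    per category of the dependent variable.
    """
    table = [list(r) for r in table]
    groups = [[lab] for lab in labels]
    while len(table) > 1:
        # Find the pair whose 2 x k subtable is least different
        i, j = min(combinations(range(len(table)), 2),
                   key=lambda p: chi_square([table[p[0]], table[p[1]]]))
        if chi_square([table[i], table[j]]) >= critical:
            break                       # every pair is significantly different
        table[i] = [a + b for a, b in zip(table[i], table[j])]
        groups[i] = groups[i] + groups[j]
        del table[j], groups[j]
    return groups, table

counts = [[30, 10],    # category A
          [28, 12],    # category B
          [10, 30]]    # category C
groups, merged = merge_categories(counts, ["A", "B", "C"], critical=3.84)
```

Here A and B respond almost identically and are merged, leaving two groups that differ strongly on the dependent variable.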

The CHAID algorithm saves computer time, but it is not guaranteed to find the splits
that predict best at a given step. Only by searching all possible category subsets can we
do that. CHAID is also limited to categorical predictors, so it cannot be used for
quantitative or mixed categorical-quantitative models, as in the admissions tree shown earlier.
Nevertheless, it is an effective way to search heuristically through rather large tables
quickly.
Note: Within the computer science community, there is a categorical splitting literature
that often does not cite the statistical work and is, in turn, not frequently cited by
statisticians (although this has changed in recent years). Quinlan (1986, 1992), the best
known of these researchers, developed a set of algorithms based on information theory.
These methods, called ID3, iteratively build decision trees based on training samples
of attributes.

Stopping Rules, Pruning, and Cross-Validation


AID, CHAID, and other forward-sequential tree-fitting methods share a problem with
other tree-clustering methods: where do we stop? If we keep splitting, a tree will end
up with only one case, or object, at each terminal node. We need a method for
producing a smaller tree other than the exhaustive one. One way is to use stepwise
statistical tests, as in the F-to-enter or alpha-to-enter rule for forward stepwise
regression. We compute a test statistic (chi-square, F, etc.), choose a critical level for
the test (sometimes modifying it with the Bonferroni inequality), and stop splitting any
branch that fails to meet the test (see Wilkinson, 1979, for a review of this procedure
in forward selection regression).
Breiman et al. (1984) showed that this method tends to yield trees with too many
branches and can also fail to pursue branches that can add significantly to the overall
fit. They advocate, instead, pruning the tree. After computing an exhaustive tree, their
program eliminates nodes that do not contribute to the overall prediction. They add
another essential ingredient, however: the cost of complexity. This measure is similar
to other cost statistics, such as Mallows' $C_p$ (Neter, Wasserman, and Kutner, 1985),
which add a penalty for increasing the number of parameters in a model. Breiman's
method is not like backward elimination stepwise regression. It resembles forward
stepwise regression with a cutting back on the final number of steps using a different
criterion than the F-to-enter. This method still cannot do as well as an exhaustive
search, which would be prohibitive for most practical problems.
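
The idea of penalizing complexity can be made concrete with a small Python sketch. It uses the general form of the cost-complexity measure from Breiman et al. (1984), error plus a penalty alpha times the number of terminal nodes; the candidate subtrees and their error rates are invented, and nothing here describes how TREES itself prunes.

```python
def cost_complexity(error, n_terminal, alpha):
    """Error plus a penalty of alpha for each terminal node."""
    return error + alpha * n_terminal

def select_subtree(subtrees, alpha):
    """Pick the (name, error, n_terminal) triple with the smallest cost."""
    return min(subtrees, key=lambda t: cost_complexity(t[1], t[2], alpha))

# Hypothetical nested subtrees: (name, error rate, number of terminal nodes)
subtrees = [("full", 0.10, 20), ("pruned", 0.14, 6), ("stump", 0.30, 2)]

no_penalty = select_subtree(subtrees, alpha=0.0)[0]     # the largest tree wins
moderate = select_subtree(subtrees, alpha=0.01)[0]      # a smaller tree wins
heavy = select_subtree(subtrees, alpha=0.1)[0]          # the smallest tree wins
```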
Regardless of how a tree is pruned, it is important to cross-validate it. As with
stepwise regression, the prediction error for a tree applied to a new sample can be
considerably higher than for the training sample on which it was constructed.
Whenever possible, data should be reserved for cross-validation.

Loss Functions
Different loss functions are appropriate for different forms of data. TREES offers a
variety of functions that are scaled as proportional reduction in error (PRE) statistics.
This allows you to try different loss functions on a problem and compare their
predictive validity.
For regression trees, the most appropriate loss functions are least squares, trimmed
mean, and least absolute deviations. Least-squares loss yields the classic AID tree. At
each split, cases are classified so that the within-group sum of squares about the mean
of the group is as small as possible. The trimmed mean loss works the same way but
first trims 20% of outlying cases (10% at each extreme) in a splittable subset before
computing the mean and sum of squares. It can be useful when you expect outliers in
subgroups and don't want them to influence the split decisions. LAD loss computes
least absolute deviations about the mean rather than squares. It, too, gives less weight
to extreme cases in each potential group.
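
The three regression losses can be compared on one subgroup with a Python sketch. The details are assumptions rather than the TREES implementation: trimming drops the integer part of 10% of the cases at each extreme, and the absolute deviations are taken about the mean, as stated above (many LAD formulations use the median instead). The data are invented.

```python
import statistics

def least_squares_loss(ys):
    """Sum of squared deviations about the mean."""
    mean = statistics.fmean(ys)
    return sum((y - mean) ** 2 for y in ys)

def trimmed_loss(ys, trim=0.10):
    """Sum of squares after dropping a share `trim` of cases at each extreme."""
    ys = sorted(ys)
    k = int(len(ys) * trim)
    kept = ys[k:len(ys) - k] if k else ys
    return least_squares_loss(kept)

def lad_loss(ys):
    """Sum of absolute deviations about the mean."""
    mean = statistics.fmean(ys)
    return sum(abs(y - mean) for y in ys)

ys = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]     # one gross outlier
ls = least_squares_loss(ys)               # dominated by the outlier
tr = trimmed_loss(ys)                     # outlier and smallest case removed
lad = lad_loss(ys)                        # outlier counts linearly, not squared
```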
For classification trees, use the phi coefficient (the default), Gini index, or twoing.
The phi coefficient is $\chi^2 / n$ for a $2 \times k$ table formed by the split on $k$ categories of
the dependent variable. The Gini index is a variance estimate based on all comparisons
of possible pairs of values in a subgroup. Finally, "twoing" is a word coined by Breiman
et al. to describe splitting $k$ categories as if it were a two-category splitting problem.
For more information about the effects of Gini and twoing on computations, see
Breiman et al. (1984).
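
Two of these quantities are simple to compute directly. The Python sketch below evaluates $\chi^2 / n$ for the $2 \times k$ table formed by a split and the Gini index in its common form, one minus the sum of squared category proportions; how TREES rescales these into PRE statistics is not shown, and the function names and counts are hypothetical.

```python
from collections import Counter

def chi_square(table):
    """Pearson chi-square statistic for a table given as rows of counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for row, rt in zip(table, row_tot):
        for obs, ct in zip(row, col_tot):
            expected = rt * ct / n
            if expected > 0:
                stat += (obs - expected) ** 2 / expected
    return stat

def phi_statistic(table):
    """Chi-square divided by n for the 2 x k table formed by a split."""
    n = sum(sum(r) for r in table)
    return chi_square(table) / n

def gini(labels):
    """Chance that two cases drawn with replacement differ in category."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

perfect = phi_statistic([[10, 0], [0, 10]])     # split separates the categories
useless = phi_statistic([[5, 5], [5, 5]])       # split tells us nothing
mixed = gini(["a", "a", "b", "b"])              # evenly mixed subgroup
pure = gini(["a", "a", "a", "a"])               # homogeneous subgroup
```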

Geometry
Most discussions of trees versus other classifiers compare tree graphs and algebraic
equations. There is another graphic view of what a tree classifier does, however. If we
look at the cases embedded in the space of the predictor variables, we can ask how a
linear discriminant analysis partitions the cases and how a tree classifier partitions them.

The figure below shows how cases are split by a linear discriminant analysis. There
are three subgroups of cases in this example. The cutting planes are positioned
approximately halfway between each pair of group centroids. Their orientation is
determined by the discriminant analysis. With three predictors and four groups, there
are six cutting planes, although only four planes show in the figure. The fourth group
is assumed to be under the bottom plane in the figure. In general, if there are $g$ groups,
the linear discriminant model cuts them with $g(g-1)/2$ planes.

The figure below shows how a tree-fitting algorithm cuts the same data. Only the
nearest subgroup (dark spots) shows; the other three groups are hidden behind the rear
and bottom cutting planes. Notice that the cutting planes are parallel to the axes. While
this would seem to restrict the discrimination compared to the more flexible angles
allowed the discriminant planes, the tree model allows interactions between variables,
which do not appear in the ordinary linear discriminant model. Notice, for example,
that one plane splits on the X variable, but the second plane that splits on the Y variable
cuts only the values to the left of the X partition. The tree model can continue to cut
any of these subregions separately, unlike the discriminant model, which can cut only
globally and with $g(g-1)/2$ planes. This is a mixed blessing, however, since tree
methods, as we have seen, can over-fit the data. It is critical to test them on new
samples.

Tree models are not usually related by authors to dimensional plots in this way, but
it is helpful to see that they have a geometric interpretation. Alternatively, we can
construct algebraic expressions for trees. They would require dummy variables for any
categorical predictors and interaction (or product) terms for every split whose
descendants (or lower nodes) did not involve the same variables on both sides.

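[Figure: cases plotted in the space of the predictors, with axes X and Y, cut by planes parallel to the axes.]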

Classification and Regression Trees in SYSTAT


Trees Main Dialog Box
To open the Trees dialog box, from the menus choose:
Statistics
Classification
Trees

Model selection and estimation are available in the main Trees dialog box:
Dependent. The variable you want to examine. The dependent variable should be a
continuous or categorical numeric variable (for example, INCOME).
Independent(s). Select one or more continuous or categorical variables (grouping
variables).
Expand Model. Adds all possible sums and differences of the predictors to the model.
Loss. Select a loss function from the drop-down list.
- Least squares. The least-squares loss (AID) minimizes the sum of the squared
  deviations.
- Trimmed mean. The trimmed mean loss (TRIM) trims the extreme observations
  (20%) prior to computing the mean.
- Least absolute deviations. The least absolute deviations loss (LAD).
- Phi coefficient. The phi coefficient loss computes the correlation between two
  dichotomous variables.
- Gini index. The Gini index loss measures inequality or dispersion.
- Twoing. The twoing loss function.

Display nodes as. Select the type of density display. The following types are available:
- Box plot. Plot that uses boxes to show a distribution shape, central tendency, and
  variability.
- Dit plot. Dot histogram. Produces a density display that looks similar to a histogram.
  Unlike histograms, dot histograms represent every observation with a unique
  symbol, so they are especially suited for small- to moderate-size samples of
  continuous data.
- Dot plot. Plot that displays dots at the exact locations of data values.
- Jitter plot. Density plot that calculates the exact locations of the data values, but
  jitters points randomly on a short vertical axis to keep points from colliding.
- Stripe. Places vertical lines at the location of data values along a horizontal data
  scale and looks like supermarket bar codes.
- Text. Displays text output in the tree diagram including the mode, sample size, and
  impurity value.


Stopping Criteria
The Stopping Criteria dialog box contains the parameters for controlling stopping.

Specify the criteria for splitting to stop.


Number of splits. Maximum number of splits.
Minimum proportion. Minimum proportion reduction in error for the tree allowed at any
split.
Split minimum. Minimum split value allowed at any node.
Minimum objects at end of trees. Minimum count allowed at any node.

Using Commands
After selecting a file with USE filename, continue with:
TREES
MODEL yvar = xvarlist / EXPAND
ESTIMATE / PMIN=d, SMIN=d, NMIN=n, NSPLIT=n,
           LOSS=LSQ
                TRIM
                LAD
                PHI
                GINI
                TWOING,
           DENSITY=STRIPE
                   JITTER
                   DOT
                   DIT
                   BOX


Usage Considerations
Types of data. TREES uses rectangular data only.
Print options. The default output includes the splitting history and summary statistics.
PRINT=LONG adds a BASIC program for classifying new observations. You can cut
and paste this BASIC program into a text window and run it in the BASIC module to
classify new data on the same variables for cross-validation and prediction.
Quick Graphs. TREES produces a Quick Graph for the fitted tree. The nodes may
contain text describing split parameters or they may contain density graphs of the data
being split. A dashed line indicates that the split is not significant.
Saving files. TREES does not save files. Use the BASIC program under PRINT=LONG
to classify your data, compute residuals, etc., on old or new data.
BY groups. TREES analyzes data by groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. FREQ = <variable> increases the number of cases by the FREQ
variable.
Case weights. WEIGHT is not available in TREES.

Examples
The following examples illustrate the features of the TREES module. The first example
shows a classification tree for the Fisher-Anderson iris data set. The second example
is a regression tree on an example taken from Breiman et al. (1984), and the third is a
regression tree predicting the danger of a mammal being eaten by predators.


Example 1
Classification Tree
This example shows a classification tree analysis of the Fisher-Anderson iris data set
featured in Discriminant Analysis. We use the Gini loss function and display a
graphical tree, or mobile, with dot histograms, or dit plots. The input is:
USE IRIS
LAB SPECIES/1=SETOSA,2=VERSICOLOR,3=VIRGINICA
TREES
MODEL SPECIES=SEPALLEN,SEPALWID,PETALLEN,PETALWID
ESTIMATE/LOSS=GINI,DENSITY=DIT

Following is the output:


Variables in the SYSTAT Rectangular file are:
 SPECIES   SEPALLEN   SEPALWID   PETALLEN   PETALWID

Split  Variable   PRE    Improvement
  1    PETALLEN   0.500  0.500
  2    PETALWID   0.890  0.390

Fitting Method: Gini Index
Predicted variable: SPECIES
Minimum split index value:           0.050
Minimum improvement in PRE:          0.050
Maximum number of nodes allowed:     22
Minimum count allowed in each node:  5
The final tree contains 3 terminal nodes
Proportional reduction in error:     0.890

Node  from  Count  Mode        Impurity  Split Var  Cut Value  Fit
  1     0    150
  2     1     50   SETOSA      0.0
  3     1    100
  4     3     54   VERSICOLOR  0.084
  5     3     46   VIRGINICA   0.021

The PRE for the whole tree is 0.89 (similar to R² for a regression model), which is not
bad. Before exulting, however, we should keep in mind that while Fisher chose the iris
data set to demonstrate his discriminant model on real data, it is barely worthy of the
effort. We can classify the data almost perfectly by looking at a scatterplot of petal
length against petal width.
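The PRE of 0.500 reported for the first split can be reproduced from the node counts. The sketch below assumes that PRE is one minus the ratio of the count-weighted Gini impurity of the child nodes to the impurity of the parent node; this definition and the function names are illustrative assumptions, not SYSTAT's documented computation. The class counts for node 3 (50 VERSICOLOR and 50 VIRGINICA) follow from the iris data having 50 cases per species.

```python
def gini(counts):
    """Gini impurity, 1 - sum(p_k^2), for a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def pre_for_split(parent_counts, children_counts):
    """Proportional reduction in impurity for one split (assumed definition)."""
    n = sum(parent_counts)
    # Weight each child's impurity by its share of the parent's cases.
    weighted = sum(sum(child) / n * gini(child) for child in children_counts)
    return 1.0 - weighted / gini(parent_counts)


root = [50, 50, 50]    # SETOSA, VERSICOLOR, VIRGINICA at node 1
node2 = [50, 0, 0]     # PETALLEN < 3.000: pure SETOSA
node3 = [0, 50, 50]    # remaining 100 cases

first_split_pre = pre_for_split(root, [node2, node3])
print(round(first_split_pre, 3))
```

Under this assumed definition the first split removes half of the root impurity, which agrees with the value in the splitting history.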
The unique SYSTAT display of the tree is called a mobile (Wilkinson, 1995). The
dit plots are ideal for illustrating how it works. Imagine each case is a marble in a box
at each node. The mobile simply balances all of the boxes. The reason for doing this is
that we can easily see splits that cut only a few cases out of a group. These nodes will
hang out conspicuously. It is fairly evident in the first split, for example, which cuts the
population into half as many cases on the right (petal length less than 3) as on the left.


This display has a second important characteristic that is different from other tree
displays. The mobile coordinates the polarity of the terminal nodes (red on color
displays) rather than the direction of the splits. This design has three consequences: we
can evaluate the distributions of the subgroups on a common scale, we can see the
direction of the splits on each splitting variable, and we can look at the distributions on
the terminal nodes from left to right to see how the whole sample is split on the
dependent variable.
The first consequence means that every box containing data is a miniature density
display of the subgroup's values on a common scale (same limits and same direction).
We don't need to drill down on the data in a subgroup to see its distribution. It is
immediately apparent in the tree. If you prefer box plots or other density displays,
simply use
DENSITY = BOX

or another density as an ESTIMATE option. Dit plots are most suitable for classification
trees, however; because they spike at the category values, they look like bar charts for
categorical data. For continuous data, dit plots look like histograms. Although they are
my favorite density display for this purpose, they can be time consuming to draw on
large samples, so box plots are the default graphical display. If you omit DENSITY
altogether, you will get a text summary inside each box.
The second consequence of ordering the splits according to the polarity of the
dependent (rather than the independent) variable is that the direction of the split can be
recognized immediately by looking at which side (left or right) the split is displayed
on. Notice that PETALLEN < 3.000 occurs on the left side of the first split. This means
that the relation between petal length and species (coded 1..3) is positive. The same is
true for petal width within the second split group because the split banner occurs on the
left. Banners on the right side of a split indicate a negative relationship between the
dependent variable and the splitting variable within the group being split, as in the
regression tree examples.
The third consequence of ordering the splits is that we can look at the terminal nodes
from left to right and see the consequences of the split in order. In the present example,
notice that the three species are ordered from left to right in the same order that they
are coded. You can change this ordering for a categorical variable with the CATEGORY
and ORDER commands. Adding labels, as we did here, makes the output more
interpretable.


Example 2
Regression Tree with Box Plots
This example shows a simple AID model. The data set is Boston housing prices, cited
in Belsley, Kuh, and Welsch (1980) and used in Breiman et al. (1984). We are
predicting median home values (MEDV) from a set of demographic variables. The
input is:
USE BOSTON
TREES
MODEL MEDV=CRIM..LSTAT
ESTIMATE/PMIN=.005,DENSITY=BOX


Following is the output:


Variables in the SYSTAT Rectangular file are:
 CRIM   ZN   INDUS   CHAS   NOX   RM   AGE   DIS   RAD   TAX   PTRATIO   B   LSTAT   MEDV

Split  Variable  PRE    Improvement
  1    RM        0.453  0.453
  2    RM        0.524  0.072
  3    LSTAT     0.696  0.171
  4    PTRATIO   0.706  0.010
  5    LSTAT     0.723  0.017
  6    DIS       0.782  0.059
  7    CRIM      0.809  0.027
  8    NOX       0.815  0.006

Fitting Method: Least Squares
Predicted variable: MEDV
Minimum split index value:           0.050
Minimum improvement in PRE:          0.005
Maximum number of nodes allowed:     22
Minimum count allowed in each node:  5
The final tree contains 9 terminal nodes
Proportional reduction in error:     0.815

Node  from  Count  Mean    SD     Split Var  Cut Value  Fit
  1     0    506   22.533  9.197  RM           6.943    0.453
  2     1    430   19.934  6.353  LSTAT       14.430    0.422
  3     1     76   37.238  8.988  RM           7.454    0.505
  4     3     46   32.113  6.497  LSTAT       11.660    0.382
  5     3     30   45.097  6.156  PTRATIO     18.000    0.405
  6     2    255   23.350  5.110  DIS          1.413    0.380
  7     2    175   14.956  4.403  CRIM         7.023    0.337
  8     5     25   46.820  3.768
  9     5      5   36.480  8.841
 10     4     41   33.500  4.594
 11     4      5   20.740  9.080
 12     6      5   45.580  9.883
 13     6    250   22.905  3.866
 14     7    101   17.138  3.392  NOX          0.538    0.227
 15     7     74   11.978  3.857
 16    14     24   20.021  3.067
 17    14     77   16.239  2.975

The Quick Graph of the tree more clearly reveals the sample-size feature of the mobile
display. Notice that a number of the splits, because they separate out a few cases only,
are extremely unbalanced. This can be interpreted in two ways, depending on context.
On the one hand, it can mean that outliers are being separated so that subsequent splits
can be more powerful. On the other hand, it can mean that a split is wasted by focusing
on the outliers when further splits don't help to improve the prediction. The former
case appears to apply in our example. The first split separates out a few expensive
housing tracts (the median values have a positively skewed distribution for all tracts),
which makes subsequent splits more effective. The box plots in the terminal nodes are
narrow.
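The PRE of 0.453 reported for the first split can be checked from the node counts and standard deviations in the output. The sketch below assumes that each SD uses an n - 1 divisor and that the least squares fit of a split is one minus the ratio of the children's pooled sum of squares to the parent's sum of squares; both assumptions, and the function names, are ours rather than documented SYSTAT behavior.

```python
def sum_of_squares(count, sd):
    """Sum of squared deviations about the node mean (n - 1 divisor assumed)."""
    return (count - 1) * sd ** 2


def split_fit(parent, children):
    """Proportional reduction in squared error achieved by one split."""
    parent_ss = sum_of_squares(*parent)
    child_ss = sum(sum_of_squares(*child) for child in children)
    return 1.0 - child_ss / parent_ss


# (Count, SD) pairs taken from the node table for the Boston housing tree.
node1 = (506, 9.197)   # all tracts, split on RM at 6.943
node2 = (430, 6.353)
node3 = (76, 8.988)

fit_node1 = split_fit(node1, [node2, node3])
print(round(fit_node1, 3))
```

The same function applied to node 2 and its children (nodes 6 and 7) gives a value close to 0.422, so the two assumptions appear consistent with the printed output.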


Example 3
Regression Tree with Dit Plots
This example involves predicting the danger of a mammal being eaten by predators
(Allison and Cicchetti, 1976). The predictors are hours of dreaming and nondreaming
sleep, gestational age, body weight, and brain weight. Although the danger index has
only five values, we are treating it as a quantitative variable with meaningful numerical
values. The input is:
USE SLEEP
TREES
MODEL DANGER=BODY_WT,BRAIN_WT,
SLO_SLEEP,DREAM_SLEEP,GESTATE
ESTIMATE / DENSITY=DIT


The resulting output is:


Variables in the SYSTAT Rectangular file are:
 SPECIES$   BODY_WT   BRAIN_WT   SLO_SLEEP   DREAM_SLEEP   TOTAL_SLEEP
 LIFE   GESTATE   PREDATION   EXPOSURE   DANGER

18 cases deleted due to missing data.

Split  Variable     PRE    Improvement
  1    DREAM_SLEEP  0.404  0.404
  2    BRAIN_WT     0.479  0.074
  3    SLO_SLEEP    0.547  0.068

Fitting Method: Least Squares
Predicted variable: DANGER
Minimum split index value:           0.050
Minimum improvement in PRE:          0.050
Maximum number of nodes allowed:     22
Minimum count allowed in each node:  5
The final tree contains 4 terminal nodes
Proportional reduction in error:     0.547

Node  from  Count  Mean   SD     Split Var    Cut Value  Fit
  1     0     44   2.659  1.380  DREAM_SLEEP    1.200    0.404
  2     1     14   3.929  1.072  BRAIN_WT      58.000    0.408
  3     1     30   2.067  1.081  SLO_SLEEP     12.800    0.164
  4     2      6   3.167  1.169
  5     2      8   4.500  0.535
  6     3     23   2.304  1.105
  7     3      7   1.286  0.488


The prediction is fairly good (PRE = 0.547). The Quick Graph of this tree illustrates
another feature of mobiles. The dots in each terminal node are assigned a separate
color. This way, we can follow their path up the tree each time they are merged. If the
prediction is perfect, the top density plot will have colored dots perfectly separated.
The extent to which the colors are mixed in the top plot is a visual indication of the
badness-of-fit of the model. The fairly good separation of colors for the sleep data is
quite clear on the computer screen or with color printing but less evident in a
black-and-white figure.

Computation
Computations are in double precision.

Algorithms
TREES uses algorithms from Breiman et al. (1984) for its splitting computations.

Missing Data
Missing data are eliminated from the calculation of the loss function for each split
separately.

References
Allison, T. and Cicchetti, D. (1976). Sleep in mammals: Ecological and constitutional
correlates. Science, 194, 732-734.
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression diagnostics: Identifying
influential data and sources of collinearity. New York: John Wiley & Sons, Inc.
Bishop, Y. M., Fienberg, S. E., and Holland, P. W. (1975). Discrete multivariate analysis.
Cambridge, Mass.: MIT Press.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and
regression trees. Belmont, Calif.: Wadsworth.
Einhorn, H. (1972). Alchemy in the behavioral sciences. Public Opinion Quarterly, 3,
367-378.
Hartigan, J. A. (1975). Clustering algorithms. New York: John Wiley & Sons, Inc.
Kass, G. V. (1980). An exploratory technique for investigating large quantities of
categorical data. Applied Statistics, 29, 119-127.
Levine, M. (1991). Statistical analysis for the executive. Byte, 17, 183-184.
Milstein, R. M., Burrow, G. N., Wilkinson, L., and Kessen, W. (1975). Prediction of
screening decisions in a medical school admission process. Journal of Medical
Education, 51, 626-633.
Morgan, J. N. and Sonquist, J. A. (1963). Problems in the analysis of survey data, and a
proposal. Journal of the American Statistical Association, 58, 415-434.
Neter, J., Wasserman, W., and Kutner, M. (1985). Applied linear statistical models, 2nd ed.
Homewood, Ill.: Richard D. Irwin, Inc.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.
Quinlan, J. R. (1992). C4.5: Programs for machine learning. New York: Morgan
Kaufmann.
Simon, B. (1991). Knowledge seeker: Statistics for decision makers. PC Magazine
(January 29), 50.
Wilkinson, L. (1979). Tests of significance in stepwise regression. Psychological Bulletin,
86, 168-174.
Wilkinson, L. (1995). Mobiles. Department of Statistics, Northwestern University,
Evanston, Ill.

Chapter 4
Cluster Analysis
Leland Wilkinson, Laszlo Engelman, James Corter, and Mark Coward

SYSTAT provides a variety of cluster analysis methods on rectangular or symmetric
data matrices. Cluster analysis is a multivariate procedure for detecting natural
groupings in data. It resembles discriminant analysis in one respect: the researcher
seeks to classify a set of objects into subgroups, although neither the number nor
members of the subgroups are known.
Cluster provides three procedures for clustering: Hierarchical Clustering, K-means,
and Additive Trees. The Hierarchical Clustering procedure comprises hierarchical
linkage methods. The K-means Clustering procedure splits a set of objects into a
selected number of groups by maximizing between-cluster variation and minimizing
within-cluster variation. The Additive Trees Clustering procedure produces a
Sattath-Tversky additive tree clustering.
Hierarchical Clustering clusters cases, variables, or both cases and variables
simultaneously; K-means clusters cases only; and Additive Trees clusters a similarity
or dissimilarity matrix. Eight distance metrics are available with Hierarchical
Clustering and K-means, including metrics for quantitative and frequency count data.
Hierarchical Clustering has six methods for linking clusters and displays the results
as a tree (dendrogram) or a polar dendrogram. When the MATRIX option is used to
cluster cases and variables, SYSTAT uses a gray-scale or color spectrum to represent
the values.


Statistical Background
Cluster analysis is a multivariate procedure for detecting groupings in data. The objects
in these groups may be:
• Cases (observations or rows of a rectangular data file). For example, if health
  indicators (numbers of doctors, nurses, hospital beds, life expectancy, etc.) are
  recorded for countries (cases), then developed nations may form a subgroup or
  cluster separate from underdeveloped countries.
• Variables (characteristics or columns of the data). For example, if causes of death
  (cancer, cardiovascular, lung disease, diabetes, accidents, etc.) are recorded for
  each U.S. state (case), the results show that accidents are relatively independent of
  the illnesses.
• Cases and variables (individual entries in the data matrix). For example, certain
  wines are associated with good years of production. Other wines have other years
  that are better.

Types of Clustering
Clusters may be of two sorts: overlapping or exclusive. Overlapping clusters allow the
same object to appear in more than one cluster. Exclusive clusters do not. All of the
methods implemented in SYSTAT are exclusive.
There are three approaches to producing exclusive clusters: hierarchical,
partitioned, and additive trees. Hierarchical clusters consist of clusters that completely
contain other clusters that completely contain other clusters, and so on. Partitioned
clusters contain no other clusters. Additive trees use a graphical representation in
which distances along branches reflect similarities among the objects.
The cluster literature is diverse and contains many descriptive synonyms:
hierarchical clustering (McQuitty, 1960; Johnson, 1967), single linkage clustering
(Sokal and Sneath, 1963), and joining (Hartigan, 1975). Output from hierarchical
methods can be represented as a tree (Hartigan, 1975) or a dendrogram (Sokal and
Sneath, 1963). (The linkage of each object or group of objects is shown as a joining of
branches in a tree. The root of the tree is the linkage of all clusters into one set, and
the ends of the branches lead to each separate object.)


Correlations and Distances


To produce clusters, we must be able to compute some measure of dissimilarity
between objects. Similar objects should appear in the same cluster, and dissimilar
objects, in different clusters. All of the methods available in CORR for producing
matrices of association can be used in cluster analysis, but each has different
implications for the clusters produced. Incidentally, CLUSTER converts correlations to
dissimilarities by negating them.
In general, the correlation measures (Pearson, Mu2, Spearman, Gamma, Tau) are
not influenced by differences in scales between objects. For example, correlations
between states using health statistics will not in general be affected by some states
having larger average numbers or variation in their numbers. Use correlations when
you want to measure the similarity in patterns across profiles regardless of overall
magnitude.
On the other hand, the other measures such as Euclidean and City (city-block
distance) are significantly affected by differences in scale. For health data, two states
will be judged to be different if they have differing overall incidences even when they
follow a common pattern. Generally, you should use the distance measures when
variables are measured on common scales.
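The contrast between correlation and distance measures can be illustrated with two made-up profiles that follow the same pattern at very different levels. The data and function names below are hypothetical; the last line applies the negation mentioned above to turn the correlation into a dissimilarity.

```python
import math


def pearson(x, y):
    """Pearson correlation between two profiles of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def euclidean(x, y):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical health statistics for two states: same pattern, ten times the level.
state_a = [2.0, 4.0, 3.0, 5.0]
state_b = [10.0 * v for v in state_a]

r = pearson(state_a, state_b)      # pattern similarity, unaffected by scale
d = euclidean(state_a, state_b)    # grows with the difference in level
dissimilarity = -r                 # correlation negated to act as a dissimilarity

print(round(r, 3), round(d, 1), round(dissimilarity, 3))
```

The correlation treats the two states as identical in pattern, while the Euclidean distance judges them to be far apart.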

Standardizing Data
Before you compute a dissimilarity measure, you may need to standardize your data
across the measured attributes. Standardizing puts measurements on a common scale.
In general, standardizing makes overall level and variation comparable across
measurements. Consider the following data:
OBJECT    X1    X2    X3    X4
A         10     2    11   900
B         11     3    15   895
C         13     4    12   760
D         14     1    13   874

If we are clustering the four cases (A through D), variable X4 will determine almost
entirely the dissimilarity between cases, whether we use correlations or distances. If
we are clustering the four variables, whichever correlation measure we use will adjust
for the larger mean and standard deviation on X4. Thus, we should probably
standardize within columns if we are clustering rows and use a correlation measure if
we are clustering columns.
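The effect of standardizing within columns can be sketched with the four objects A through D and variables X1 through X4 tabled above. The code is an illustration with our own function names; it uses z-scores with the sample standard deviation, which is one common choice and not necessarily the one SYSTAT applies.

```python
import math
import statistics

# Objects A through D measured on X1 through X4.
data = {
    "A": [10, 2, 11, 900],
    "B": [11, 3, 15, 895],
    "C": [13, 4, 12, 760],
    "D": [14, 1, 13, 874],
}


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def standardize_columns(rows):
    """Convert each column to z-scores (sample standard deviation assumed)."""
    columns = list(zip(*rows.values()))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return {
        name: [(v - m) / s for v, m, s in zip(values, means, sds)]
        for name, values in rows.items()
    }


raw_ac = euclidean(data["A"], data["C"])   # dominated by the X4 difference of 140
z = standardize_columns(data)
z_ac = euclidean(z["A"], z["C"])           # all four variables now contribute

print(round(raw_ac, 2), round(z_ac, 2))
```

Before standardizing, nearly all of the distance between A and C comes from X4; afterward, no single variable determines it.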
In the example below, case A will have a disproportionate influence if we are
clustering columns.
OBJECT    X1    X2    X3    X4
A        410   311   613   514
B          1     3     2     4
C         10    11    12    10
D         12    13    13    11

We should probably standardize within rows before clustering columns. This requires
transposing the data before standardization. If we are clustering rows, on the other
hand, we should use a correlation measure to adjust for the larger mean and standard
deviation of case A.
These are not immutable laws. The suggestions are only to make you realize that
scales can influence distance and correlation measures.

Hierarchical Clustering
To understand hierarchical clustering, it's best to look at an example. The following
data reflect various attributes of selected performance cars.
ACCEL   BRAKE   SLALOM   MPG    SPEED   NAME$
  5.0    245     61.3    17.0    153    Porsche 911T
  5.3    242     61.9    12.0    181    Testarossa
  5.8    243     62.6    19.0    154    Corvette
  7.0    267     57.8    14.5    145    Mercedes 560
  7.6    271     59.8    21.0    124    Saab 9000
  7.9    259     61.7    19.0    130    Toyota Supra
  8.5    263     59.9    17.5    131    BMW 635
  8.7    287     64.2    35.0    115    Civic CRX
  9.3    258     64.1    24.5    129    Acura Legend
 10.8    287     60.8    25.0    100    VW Fox GL
 13.0    253     62.3    27.0     95    Chevy Nova


Cluster Displays
SYSTAT displays the output of hierarchical clustering in several ways. For joining
rows or columns, SYSTAT prints a tree. For matrix joining, it prints a shaded matrix.
Trees. A tree is printed with a unique ordering in which every branch is lined up such
that the most similar objects are closest to each other. If a perfect seriation
(one-dimensional ordering) exists in the data, the tree reproduces it. The algorithm for
ordering the tree is given in Gruvaeus and Wainer (1972). This ordering may differ
from that of trees printed by other clustering programs if they do not use a seriation
algorithm to determine how to order branches. The advantage of using seriation is most
apparent for single linkage clusterings.
If you join rows, the end branches of the tree are labeled with case numbers or
labels. If you join columns, the end branches of the tree are labeled with variable
names.
Direct display of a matrix. As an alternative to trees, SYSTAT can produce a shaded
display of the original data matrix in which rows and columns are permuted according
to an algorithm in Gruvaeus and Wainer (1972). Different characters represent the
magnitude of each number in the matrix (Ling, 1973). A legend showing the range of
data values that these characters represent appears with the display.
Cutpoints between these values and their associated characters are selected to
heighten contrast in the display. The method for increasing contrast is derived from
techniques used in computer pattern recognition, in which gray-scale histograms for
visual displays are modified to heighten contrast and enhance pattern detection. To
find these cutpoints, we sort the data and look for the largest gaps between adjacent
values. Tukey's gapping method (Wainer and Schacht, 1978) is used to determine how
many gaps (and associated characters) should be chosen to heighten contrast for a
given set of data. This procedure, time consuming for large matrices, is described in
detail in Wilkinson (1978).
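The basic idea of sorting the data and looking for the largest gaps can be sketched as follows. This is a simplified illustration only: it takes the requested number of widest gaps and places a cutpoint in the middle of each, and it does not implement Tukey's gapping method or any significance test. The function name and the grades are made up.

```python
def largest_gap_cutpoints(values, n_cuts):
    """Return midpoints of the n_cuts widest gaps between adjacent sorted values."""
    ordered = sorted(values)
    # Pair each gap width with the index of the value on its lower side.
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    widest = sorted(gaps, reverse=True)[:n_cuts]
    return sorted((ordered[i] + ordered[i + 1]) / 2 for _, i in widest)


# Hypothetical numeric grades for ten students.
grades = [52, 55, 58, 71, 73, 74, 76, 90, 92, 95]
cuts = largest_gap_cutpoints(grades, 2)
print(cuts)
```

With these grades, the two widest gaps fall between 58 and 71 and between 76 and 90, so the cutpoints separate three groups of students.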
If you have a course to grade and are looking for a way to find rational cutpoints in
the grade distribution, you might want to use this display to choose the cutpoints.
Cluster the n × 1 matrix of numeric grades (n students by 1 grade) and let SYSTAT
choose the cutpoints. Only cutpoints asymptotically significant at the 0.05 level are
chosen. If no cutpoints are chosen in the display, give everyone an A, flunk them all,
or hand out numeric grades (unless you teach at Brown University or Hampshire
College).


Clustering Rows
First, let's look at possible clusters of the cars in the example. Since the variables are
on such different scales, we will standardize them before doing the clustering. This will
give acceleration comparable influence to braking, for example. Then we select
Pearson correlations as the basis for dissimilarity between cars. The result is:
[Figure: Cluster Tree. Branch labels, top to bottom: Corvette, Porsche 911T,
Testarossa, Toyota Supra, Acura Legend, Chevy Nova, Civic CRX, VW Fox GL,
Saab 9000, Mercedes 560, BMW 635. Horizontal axis: Distances, 0.0 to 0.6.]

If you look at the correlation matrix for the cars, you will see how these clusters hang
together. Cars within the same cluster (for example, Corvette, Testarossa, Porsche)
generally correlate highly.

          Porsche   Testa   Corv    Merc    Saab
Porsche    1.00
Testa      0.94     1.00
Corv       0.94     0.87    1.00
Merc       0.09     0.21    0.24    1.00
Saab       0.51     0.52    0.76    0.66    1.00
Toyota     0.24     0.43    0.40    0.38    0.68
BMW        0.32     0.10    0.56    0.85    0.63
Civic      0.50     0.73    0.39    0.52    0.26
Acura      0.05     0.10    0.30    0.98    0.77
VW         0.96     0.93    0.98    0.08    0.70
Chevy      0.73     0.70    0.49    0.53    0.13

          Toyota    BMW     Civic   Acura   VW
Toyota     1.00
BMW        0.25     1.00
Civic      0.30     0.50    1.00
Acura      0.53     0.79    0.35    1.00
VW         0.35     0.39    0.55    0.16    1.00
Chevy      0.03     0.06    0.32    0.54    0.53

Clustering Columns
We can cluster the performance attributes of the cars more easily. Here, we do not need
to standardize within cars (by rows) because all of the values are comparable between
cars. Again, to give each variable comparable influence, we will use Pearson
correlations as the basis for the dissimilarities. The result based on the data
standardized by variable (column) is:
[Figure: Cluster Tree. Branch labels, top to bottom: BRAKE, MPG, ACCEL, SLALOM,
SPEED. Horizontal axis: Distances, 0.0 to 1.2.]

Clustering Rows and Columns


To cluster the rows and columns jointly, we should first standardize the variables to
give each of them comparable influence on the clustering of cars. Once we have
standardized the variables, we can use Euclidean distances because the scales are
comparable. We used single linkage to produce the following result:


[Figure: Permuted data matrix. Columns: BRAKE, MPG, ACCEL, SLALOM, SPEED. Rows, top
to bottom: Testarossa, Porsche 911T, Corvette, Acura Legend, Toyota Supra, BMW 635,
Saab 9000, Mercedes 560, VW Fox GL, Chevy Nova, Civic CRX. Legend: standardized
values from -2 to 3.]

This figure displays the standardized data matrix itself with rows and columns
permuted to reveal clustering and each data value replaced by one of three symbols.
Notice that the rows are ordered according to overall performance, with the fastest cars
at the top.
Matrix clustering is especially useful for displaying large correlation matrices. You
may want to cluster the correlation matrix this way and then use the ordering to
produce a scatterplot matrix that is organized by the multivariate structure.

Partitioning via K-Means


To produce partitioned clusters, you must decide in advance how many clusters you
want. K-means clustering searches for the best way to divide your objects into different
sections so that they are separated as well as possible. The procedure begins by picking
seed cases, one for each cluster, which are spread apart from the center of all of the
cases as much as possible. Then it assigns all cases to the nearest seed. Next, it attempts
to reassign each case to a different cluster in order to reduce the within-groups sum of
squares. This continues until the within-groups sum of squares can no longer be
reduced.
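The steps just described can be sketched in a few lines. This is a simplified illustration, not SYSTAT's algorithm: the seeding rule (the case farthest from the overall center, then the case farthest from the seeds already chosen) and the batch reassignment of all cases at once are our assumptions, and the data are made up.

```python
import math


def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean(points):
    n = len(points)
    return [sum(coord) / n for coord in zip(*points)]


def pick_seeds(points, k):
    """Choose k seed cases that are spread far apart (assumed rule)."""
    center = mean(points)
    seeds = [max(points, key=lambda p: dist(p, center))]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds


def kmeans(points, k, max_iter=100):
    centers = [list(s) for s in pick_seeds(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every case to its nearest center.
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # no case changed cluster, so within-groups SS cannot drop further
        labels = new_labels
        # Move each center to the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = mean(members)
    return labels, centers


# Two well-separated hypothetical groups of cases on two standardized variables.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels, centers = kmeans(points, 2)
print(labels)
```

Each pass either changes at least one assignment or stops, which mirrors the rule that reassignment continues until the within-groups sum of squares can no longer be reduced.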


K-means clustering does not search through every possible partitioning of the data,
so it is possible that some other solution might have smaller within-groups sums of
squares. Nevertheless, it has performed relatively well on globular data separated in
several dimensions in Monte Carlo studies of cluster algorithms.
Because it focuses on reducing within-groups sums of squares, k-means clustering
is like a multivariate analysis of variance in which the groups are not known in
advance. The output includes analysis of variance statistics, although you should be
cautious in interpreting them. Remember, the program is looking for large F ratios in
the first place, so you should not be too impressed by large values.
Following is a three-group analysis of the car data. The clusters are similar to those
we found by joining. K-means clustering uses Euclidean distances instead of Pearson
correlations, so there are minor differences because of scaling. To keep the influences
of all variables comparable, we standardized the data before running the analysis.
Summary Statistics for 3 Clusters

Variable    Between SS   DF   Within SS   DF   F-Ratio   Prob
ACCEL          7.825      2     2.175      8    14.389   0.002
BRAKE          5.657      2     4.343      8     5.211   0.036
SLALOM         5.427      2     4.573      8     4.747   0.044
MPG            7.148      2     2.852      8    10.027   0.007
SPEED          7.677      2     2.323      8    13.220   0.003
-------------------------------------------------------------------------------
Cluster Number: 1

Members                    |  Statistics
Case            Distance   |  Variable   Minimum    Mean   Maximum   St.Dev.
Mercedes 560      0.60     |  ACCEL       -0.45    -0.14     0.17     0.23
Saab 9000         0.31     |  BRAKE       -0.15     0.23     0.61     0.28
Toyota Supra      0.49     |  SLALOM      -1.95    -0.89     0.11     0.73
BMW 635           0.16     |  MPG         -1.01    -0.47    -0.01     0.37
                           |  SPEED       -0.34     0.00     0.50     0.31
-------------------------------------------------------------------------------
Cluster Number: 2

Members                    |  Statistics
Case            Distance   |  Variable   Minimum    Mean   Maximum   St.Dev.
Civic CRX         0.81     |  ACCEL        0.26     0.99     2.05     0.69
Acura Legend      0.67     |  BRAKE       -0.53     0.62     1.62     1.00
VW Fox GL         0.71     |  SLALOM      -0.37     0.72     1.43     0.74
Chevy Nova        0.76     |  MPG          0.53     1.05     2.15     0.65
                           |  SPEED       -1.50    -0.91    -0.14     0.53
-------------------------------------------------------------------------------
Cluster Number: 3

Members                    |  Statistics
Case            Distance   |  Variable   Minimum    Mean   Maximum   St.Dev.
Porsche 911T      0.25     |  ACCEL       -1.29    -1.13    -0.95     0.14
Testarossa        0.43     |  BRAKE       -1.22    -1.14    -1.03     0.08
Corvette          0.31     |  SLALOM      -0.10     0.23     0.59     0.28
                           |  MPG         -1.40    -0.78    -0.32     0.45
                           |  SPEED        0.82     1.21     1.94     0.52
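The F ratios in the summary table follow from the sums of squares and degrees of freedom in the usual analysis of variance manner. The sketch below recomputes them from the rounded values printed above, so small differences in the third decimal place are to be expected; the function and dictionary names are ours.

```python
def f_ratio(between_ss, between_df, within_ss, within_df):
    """Ratio of the between-cluster to the within-cluster mean square."""
    return (between_ss / between_df) / (within_ss / within_df)


# (Between SS, DF, Within SS, DF) as printed in the summary table.
summary = {
    "ACCEL": (7.825, 2, 2.175, 8),
    "BRAKE": (5.657, 2, 4.343, 8),
    "SLALOM": (5.427, 2, 4.573, 8),
    "MPG": (7.148, 2, 2.852, 8),
    "SPEED": (7.677, 2, 2.323, 8),
}

f_values = {name: f_ratio(*row) for name, row in summary.items()}
for name, value in f_values.items():
    print(name, round(value, 2))
```

Because the 11 cars were standardized on each variable, the between and within sums of squares for every variable add up to 10, the total sum of squares on n - 1 degrees of freedom.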


Additive Trees
Sattath and Tversky (1977) developed additive trees for modeling
similarity/dissimilarity data. Hierarchical clustering methods require objects in the
same cluster to have identical distances to each other. Moreover, these distances must
be smaller than the distances between clusters. These restrictions prove problematic for
similarity data, and as a result hierarchical clustering cannot fit this data well.
In contrast, additive trees use tree branch length to represent distances between
objects. Allowing the within-cluster distances to vary yields a tree diagram with
varying branch lengths. Objects within a cluster can be compared by focusing on the
horizontal distance along the branches connecting them. The additive tree for the car
data follows:
[Figure: Additive Tree. Branch labels, top to bottom: Chevy, VW, Civic, Acura,
Toyota, Corv, Testa, Porsche, Saab, BMW, Merc.]


The distances between nodes of the graph are:


Node   Length   Child
  1     0.10    Porsche
  2     0.49    Testa
  3     0.14    Corv
  4     0.52    Merc
  5     0.19    Saab
  6     0.13    Toyota
  7     0.11    BMW
  8     0.71    Civic
  9     0.30    Acura
 10     0.42    VW
 11     0.62    Chevy
 12     0.06    1,2
 13     0.08    8,10
 14     0.49    12,3
 15     0.18    13,11
 16     0.35    9,15
 17     0.04    14,6
 18     0.13    17,16
 19     0.0     5,18
 20     0.04    4,7
 21     0.0     20,19

Each object is a node in the graph. In this example, the first 11 nodes represent the cars.
Other graph nodes correspond to groupings of the objects. Here, the 12th node
represents Porsche and Testa.
The distance between any two nodes is the sum of the (horizontal) lengths between
them. The distance between Chevy and VW is 0.62 + 0.08 + 0.42 = 1.12. The
distance between Chevy and Civic is 0.62 + 0.08 + 0.71 = 1.41. Consequently,
Chevy is more similar to VW than to Civic.
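The two sums above can be reproduced from the node table by walking from each car toward the root and adding the branch lengths that the two paths do not share. The sketch below encodes the table by hand; the dictionary and function names are ours, and each node's length is taken to be the length of the branch joining it to its parent.

```python
# Branch length for each node, from the Length column.
lengths = {
    1: 0.10, 2: 0.49, 3: 0.14, 4: 0.52, 5: 0.19, 6: 0.13, 7: 0.11,
    8: 0.71, 9: 0.30, 10: 0.42, 11: 0.62, 12: 0.06, 13: 0.08, 14: 0.49,
    15: 0.18, 16: 0.35, 17: 0.04, 18: 0.13, 19: 0.0, 20: 0.04, 21: 0.0,
}

# The first 11 nodes are the cars.
leaves = {
    "Porsche": 1, "Testa": 2, "Corv": 3, "Merc": 4, "Saab": 5, "Toyota": 6,
    "BMW": 7, "Civic": 8, "Acura": 9, "VW": 10, "Chevy": 11,
}

# Internal nodes and their two children, from the Child column.
children = {
    12: (1, 2), 13: (8, 10), 14: (12, 3), 15: (13, 11), 16: (9, 15),
    17: (14, 6), 18: (17, 16), 19: (5, 18), 20: (4, 7), 21: (20, 19),
}
parent = {child: node for node, kids in children.items() for child in kids}


def path_to_root(node):
    """List the nodes visited when climbing from node to the root."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def tree_distance(a, b):
    """Sum the branch lengths on the path between two cars."""
    path_a = path_to_root(leaves[a])
    path_b = path_to_root(leaves[b])
    shared = set(path_a) & set(path_b)
    return sum(lengths[n] for n in path_a + path_b if n not in shared)


print(round(tree_distance("Chevy", "VW"), 2))
print(round(tree_distance("Chevy", "Civic"), 2))
```

Only the branches below the first node that the two paths have in common are counted, which is why the Chevy and VW distance uses the lengths of nodes 11, 13, and 10.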


Cluster Analysis in SYSTAT


Hierarchical Clustering Main Dialog Box
Hierarchical clustering produces hierarchical clusters that are displayed in a tree.
Initially, each object (case or variable) is considered a separate cluster. SYSTAT
begins by joining the two closest objects as a cluster and continues (in a stepwise
manner) joining an object with another object, an object with a cluster, or a cluster with
another cluster until all objects are combined into one cluster.
To obtain a hierarchical cluster analysis, from the menus choose:
Statistics
Classification
Hierarchical Clustering

You must select the elements of the data file to cluster (Join):
• Rows. Rows (cases) of the data matrix are clustered.
• Columns. Columns (variables) of the data matrix are clustered.
• Matrix. Rows and columns of the data matrix are clustered; they are permuted to
  bring similar rows and columns next to one another.


Linkage allows you to specify the type of joining algorithm used to amalgamate
clusters (that is, define how distances between clusters are measured).


• Single. Single linkage defines the distance between two objects or clusters as the
  distance between the two closest members of those clusters. This method tends to
  produce long, stringy clusters. If you use a SYSTAT file that contains a similarity
  or dissimilarity matrix, you get clustering via Johnson's min method.
• Complete. Complete linkage uses the most distant pair of objects in two clusters to
  compute between-cluster distances. This method tends to produce compact,
  globular clusters. If you use a similarity or dissimilarity matrix from a SYSTAT
  file, you get Johnson's max method.
• Centroid. Centroid linkage uses the average value of all objects in a cluster (the
  cluster centroid) as the reference point for distances to other objects or clusters.
• Average. Average linkage averages all distances between pairs of objects in
  different clusters to decide how far apart they are.
• Median. Median linkage uses the median distances between pairs of objects in
  different clusters to decide how far apart they are.
• Ward. Ward's method averages all distances between pairs of objects in different
  clusters, with adjustments for covariances, to decide how far apart the clusters are.
For some data, the last four methods cannot produce a hierarchical tree with strictly
increasing amalgamation distances. In these cases, you may see stray branches that do
not connect to others. If this happens, you should consider Single or Complete linkage.
For more information on these problems, see Fisher and Van Ness (1971). These
reviewers concluded that these and other problems made Centroid, Average, Median,
and Ward (as well as k-means) "inadmissible" clustering procedures. In practice and
in Monte Carlo simulations, however, they sometimes perform better than Single and
Complete linkage, which Fisher and Van Ness considered "admissible." Milligan
(1980) tested all of the hierarchical joining methods in a large Monte Carlo simulation
of clustering algorithms. Consult his paper for further details.
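
To make the linkage rules concrete, the following Python sketch implements three of them and a naive joining loop. It is an illustration under stated assumptions, not SYSTAT's algorithm: it uses plain (unnormalized) Euclidean distance, and the function names single, complete, average, and join are hypothetical.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)


def single(c1, c2):
    """Distance between the two closest members of the clusters."""
    return min(dist(p, q) for p, q in product(c1, c2))


def complete(c1, c2):
    """Distance between the two most distant members of the clusters."""
    return max(dist(p, q) for p, q in product(c1, c2))


def average(c1, c2):
    """Mean of all between-cluster pairwise distances."""
    pairs = [dist(p, q) for p, q in product(c1, c2)]
    return sum(pairs) / len(pairs)


def join(points, linkage):
    """Naive agglomeration: repeatedly merge the two closest clusters.

    Returns the joining history as (distance, members in new cluster).
    """
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        history.append((linkage(clusters[i], clusters[j]), len(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history


points = [(0, 0), (1, 0), (3, 0), (7, 0)]
print(join(points, single))    # [(1.0, 2), (2.0, 3), (4.0, 4)]
print(join(points, complete))  # [(1.0, 2), (3.0, 3), (7.0, 4)]
```

The two histories join the same points in the same order, but complete linkage reports larger amalgamation distances because it measures each pair of clusters by its most distant members.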
In addition, the following options can be specified:
Distance. Specifies the distance metric used to compare clusters.
Polar. Produces a polar (circular) cluster tree.
Save cluster identifier variable. Saves cluster identifiers to a SYSTAT file. You can
specify the number of clusters to identify for the saved file. If not specified, two
clusters are identified.


Clustering Distances
Both hierarchical clustering and k-means clustering allow you to select the type of
distance metric to use between objects. From the Distance drop-down list, you can
select:
• Gamma. Distances are computed using 1 minus the Goodman-Kruskal gamma
  correlation coefficient. Use this metric with rank order or ordinal scales. Missing
  values are excluded from computations.
• Pearson. Distances are computed using 1 minus the Pearson product-moment
  correlation coefficient for each pair of objects. Use this metric for quantitative
  variables. Missing values are excluded from computations.
• RSquared. Distances are computed using 1 minus the square of the Pearson
  product-moment correlation coefficient for each pair of objects. Use this metric
  with quantitative variables. Missing values are excluded from computations.
• Euclidean. Clustering is computed using normalized Euclidean distance (root mean
  squared distances). Use this metric with quantitative variables. Missing values are
  excluded from computations.
• Minkowski. Clustering is computed using the pth root of the mean pth powered
  distances of coordinates. Use this metric for quantitative variables. Missing values
  are excluded from computations. Use the Power text box to specify the value of p.
• Chisquare. Distances are computed as the chi-square measure of independence of
  rows and columns on 2-by-n frequency tables, formed by pairs of cases (or
  variables). Use this metric when the data are counts of objects or events.
• Phisquare. Distances are computed as the phi-square (chi-square/total) measure on
  2-by-n frequency tables, formed by pairs of cases (or variables). Use this metric
  when the data are counts of objects or events.
• Percent (available for hierarchical clustering only). Clustering uses a distance
  metric that is the percentage of comparisons of values resulting in disagreements
  in two profiles. Use this metric with categorical or nominal scales.
• MW (available for k-means clustering only). Distances are computed as the
  increment in within sum of squares of deviations, if the case (or variable) would
  belong to a cluster. The case (or variable) is moved into the cluster that minimizes
  the within sum of squares of deviations. Use this metric with quantitative variables.
  Missing values are excluded from computations.
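
Several of these metrics are easy to state in code. The Python sketch below follows the verbal definitions above and is not SYSTAT's implementation. Missing values are represented by None (an assumption), the helper names are hypothetical, and the 0 to 100 scale of the percent metric is also an assumption.

```python
from math import sqrt


def complete_pairs(x, y):
    """Pairwise deletion: keep coordinates where both values are present."""
    return [(a, b) for a, b in zip(x, y) if a is not None and b is not None]


def minkowski(x, y, p=2.0):
    """pth root of the mean pth-powered coordinate difference."""
    pairs = complete_pairs(x, y)
    return (sum(abs(a - b) ** p for a, b in pairs) / len(pairs)) ** (1.0 / p)


def euclidean(x, y):
    """Normalized Euclidean distance (root mean squared difference)."""
    return minkowski(x, y, p=2.0)


def pearson_distance(x, y):
    """1 minus the Pearson product-moment correlation."""
    pairs = complete_pairs(x, y)
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return 1.0 - sxy / sqrt(sxx * syy)


def percent(x, y):
    """Percentage of compared values that disagree (0-100 scale assumed)."""
    pairs = complete_pairs(x, y)
    return 100.0 * sum(a != b for a, b in pairs) / len(pairs)


# The missing coordinate is dropped, so only two differences (3 and 4) remain.
print(round(euclidean([0, None, 0], [3, 5, 4]), 4))  # 3.5355
print(percent(["a", "b", "c", "d"], ["a", "x", "c", "y"]))  # 50.0
```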


K-Means Main Dialog Box


K-means clustering splits a set of objects into a selected number of groups by
maximizing between-cluster variation relative to within-cluster variation. It is similar
to doing a one-way analysis of variance where the groups are unknown and the largest
F value is sought by reassigning members to each group.
K-means starts with one cluster and splits it into two clusters by picking the case
farthest from the center as a seed for a second cluster and assigning each case to the
nearest center. It continues splitting one of the clusters into two (and reassigning cases)
until a specified number of clusters are formed. K-means reassigns cases until the
within-groups sum of squares can no longer be reduced.
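
The splitting procedure just described can be sketched in Python as follows. This is a simplified reading of the paragraph above, not SYSTAT's code: it assumes Euclidean distance and complete data, and the names kmeans_split, centroid, and assign are hypothetical.

```python
from math import dist


def centroid(cases):
    """Coordinate-wise mean of a non-empty list of cases."""
    return tuple(sum(col) / len(cases) for col in zip(*cases))


def assign(cases, centers):
    """Index of the nearest center for every case."""
    return [min(range(len(centers)), key=lambda k: dist(case, centers[k]))
            for case in cases]


def kmeans_split(cases, groups=2, iterations=20):
    """Start with one cluster and split until `groups` clusters exist."""
    centers = [centroid(cases)]
    labels = [0] * len(cases)
    while len(centers) < groups:
        # Seed a new cluster with the case farthest from its own center.
        far = max(range(len(cases)),
                  key=lambda i: dist(cases[i], centers[labels[i]]))
        centers.append(tuple(cases[far]))
        for _ in range(iterations):
            new_labels = assign(cases, centers)
            if new_labels == labels:
                break  # no case moved, so the within SS cannot drop further
            labels = new_labels
            for k in range(len(centers)):
                members = [c for c, lab in zip(cases, labels) if lab == k]
                if members:  # keep the old center if a cluster empties
                    centers[k] = centroid(members)
    return labels, centers


cases = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans_split(cases, groups=2)
print(labels)  # [1, 1, 1, 0, 0, 0]
```

The default of 20 iterations mirrors the Iterations option described below; everything else about the sketch is illustrative.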
To obtain a k-means cluster analysis, from the menus choose:
Statistics
Classification
K-means Clustering

The following options can be specified:


Groups. Enter the number of desired clusters. If the number (Groups) of clusters is not
specified, two are computed (one split of the data).
Iterations. Enter the maximum number of iterations. If not stated, this maximum is 20.
Save identifier variable. Saves cluster identifiers to a SYSTAT file.
Distance. Specifies the distance metric used to compare clusters.


Additive Trees Main Dialog Box


Additive trees were developed by Sattath and Tversky (1977) for modeling
similarity/dissimilarity data, which are not fit well by hierarchical joining trees.
Hierarchical trees imply that all within-cluster distances are smaller than all
between-cluster distances and that within-cluster distances are equal. This so-called
"ultrametric" condition seldom applies to real similarity data from direct judgment.
Additive trees, on the other hand, represent similarities with a network model in the
shape of a tree. Distances between objects are represented by the lengths of the
branches connecting them in the tree.
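
One way to see the difference is to test a distance matrix against the ultrametric inequality, d(x,z) <= max(d(x,y), d(y,z)) for every triple of objects. The Python sketch below is a hypothetical helper, not part of SYSTAT, and both matrices are made-up illustrations.

```python
from itertools import permutations


def is_ultrametric(d, tol=1e-9):
    """True if d[i][k] <= max(d[i][j], d[j][k]) holds for every triple."""
    n = len(d)
    return all(d[i][k] <= max(d[i][j], d[j][k]) + tol
               for i, j, k in permutations(range(n), 3))


# Objects 0 and 1 form a cluster (distance 1); both are 3 away from object 2.
hierarchical = [[0, 1, 3],
                [1, 0, 3],
                [3, 3, 0]]

# Path lengths in a star-shaped tree with branches 0.5, 1.5, and 2.5:
# representable by an additive tree, but not ultrametric.
additive = [[0, 2, 3],
            [2, 0, 4],
            [3, 4, 0]]

print(is_ultrametric(hierarchical))  # True
print(is_ultrametric(additive))      # False
```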
To obtain additive trees, from the menus choose:
Statistics
Classification
Additive Tree Clustering

The following options can be specified:


Data. Display the raw data matrix.
Transformed. Include the transformed data (distance-like measures) with the output.
Model. Display the model (tree) distances between the objects.
Residuals. Show the differences between the distance-transformed data and the model
distances.
Nonumbers. Objects in the tree graph are not numbered.

Nosubtract. Use of an additive constant. Additive Trees assumes interval-scaled data,
which implies complete freedom in choosing an additive constant, so it adds or
subtracts to exactly satisfy the triangle inequality. Use Nosubtract to allow strict
inequality and subtract no constant.
Height. Prints the distance of each node from the root.
Minvar. Combines the last few remaining clusters into the root node by searching for
the root that minimizes the variances of the distances from the root to the leaves.

Using Commands
For the hierarchical tree method:
CLUSTER
USE filename
IDVAR var$
PRINT
SAVE filename / NUMBER=n DATA
JOIN varlist / POLAR DISTANCE=metric POWER=p
               LINKAGE=method

The distance metric is EUCLIDEAN, GAMMA, PEARSON, RSQUARED, MINKOWSKI,
CHISQUARE, PHISQUARE, or PERCENT. For MINKOWSKI, specify the root using
POWER=p.
The linkage methods include SINGLE, COMPLETE, CENTROID, AVERAGE,
MEDIAN, and WARD.

For the k-means splitting method:


CLUSTER
USE filename
IDVAR var$
PRINT
SAVE filename / NUMBER=n DATA
KMEANS varlist / NUMBER=n ITER=n DISTANCE=metric POWER=p

The distance metric is EUCLIDEAN, GAMMA, PEARSON, RSQUARED, MINKOWSKI,
CHISQUARE, PHISQUARE, or MW. For MINKOWSKI, specify the root using POWER=p.


For additive trees:


CLUSTER
USE filename
ADD varlist / DATA TRANSFORMED MODEL RESIDUALS
TREE NUMBERS NOSUBTRACT HEIGHT
MINVAR ROOT=n1,n2

Usage Considerations
Types of data. Hierarchical Clustering works on either rectangular SYSTAT files or
files containing a symmetric matrix, such as those produced with Correlations.
K-Means works only on rectangular SYSTAT files. Additive Trees works only on
symmetric (similarity or dissimilarity) matrices.
Print options. Using PRINT=LONG for Hierarchical Clustering yields an ASCII
representation of the tree diagram (instead of the Quick Graph). This option is useful
if you are joining more than 100 objects.
Quick Graphs. Cluster analysis includes Quick Graphs for each procedure. Hierarchical
Clustering and Additive Trees have tree diagrams. For each cluster, K-Means displays
a profile plot of the data and a display of the variable means and standard deviations.
To omit Quick Graphs, specify GRAPH NONE.
Saving files. CLUSTER saves cluster indices as a new variable.
BY groups. CLUSTER analyzes data by groups.
Bootstrapping. Bootstrapping is available in this procedure.
Labeling output. For Hierarchical Clustering and K-Means, be sure to consider using ID
Variable (on the Data menu) for labeling the output.


Examples
Example 1
K-Means Clustering
The data in the file SUBWORLD are a subset of cases and variables from the
OURWORLD file:

URBAN      Percentage of the population living in cities
BIRTH_RT   Births per 1000 people
DEATH_RT   Deaths per 1000 people
B_TO_D     Ratio of births to deaths
BABYMORT   Infant deaths during the first year per 1000 live births
GDP_CAP    Gross domestic product per capita (in U.S. dollars)
LIFEEXPM   Years of life expectancy for males
LIFEEXPF   Years of life expectancy for females
EDUC       U.S. dollars spent per person on education
HEALTH     U.S. dollars spent per person on health
MIL        U.S. dollars spent per person on the military
LITERACY   Percentage of the population who can read

The distributions of the economic variables (GDP_CAP, EDUC, HEALTH, and MIL)
are skewed with long right tails, so these variables are analyzed in log units.
This example clusters countries (cases). The input is:
CLUSTER
USE subworld
IDVAR = country$
LET (gdp_cap, educ, mil, health) = L10(@)
STANDARDIZE / SD
KMEANS urban birth_rt death_rt babymort lifeexpm,
lifeexpf gdp_cap b_to_d literacy educ,
mil health / NUMBER=4

Note that KMEANS must be specified last.
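
For readers who want to see the two preparation steps outside SYSTAT, here is a small Python sketch of the log transformation and the standardization. It assumes that STANDARDIZE / SD subtracts the mean and divides by the sample standard deviation (n - 1 denominator); the dollar values and function names are hypothetical.

```python
from math import log10
from statistics import mean, stdev


def log10_column(values):
    """Base-10 log of a right-skewed variable such as GDP_CAP."""
    return [log10(v) for v in values]


def standardize(values):
    """z scores: subtract the mean, divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


gdp_cap = [100, 1000, 10000]  # made-up dollar amounts
z = standardize(log10_column(gdp_cap))
print([round(v, 6) for v in z])            # [-1.0, 0.0, 1.0]
print(round(sum(v * v for v in z), 6))     # 2.0, that is, n - 1
```

With this scaling each variable has a total sum of squares of n - 1, which is why Between SS plus Within SS is 29 for the variables with 30 complete cases in the output that follows.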


The resulting output is:


Distance metric is Euclidean distance
k-means splitting cases into 4 groups
Summary statistics for all cases
Variable      Between SS   df   Within SS   df    F-ratio
URBAN            18.6065    3      9.3935   25    16.5065
BIRTH_RT         26.2041    3      2.7959   26    81.2260
DEATH_RT         23.6626    3      5.3374   26    38.4221
BABYMORT         26.0275    3      2.9725   26    75.8869
GDP_CAP          26.9585    3      2.0415   26   114.4464
EDUC             25.3712    3      3.6288   26    60.5932
HEALTH           24.9226    3      3.0774   25    67.4881
MIL              24.7870    3      3.2130   25    64.2893
LIFEEXPM         24.7502    3      4.2498   26    50.4730
LIFEEXPF         25.9270    3      3.0730   26    73.1215
LITERACY         24.8535    3      4.1465   26    51.9470
B_TO_D           22.2918    3      6.7082   26    28.7997
** TOTAL **     294.3624   36     50.6376  309

-------------------------------------------------------------------------------
Cluster 1 of 4 contains 12 cases

      Members                           Statistics
Case        Distance | Variable   Minimum     Mean  Maximum  St.Dev.
Austria         0.28 | URBAN        -0.17     0.60     1.59     0.54
Belgium         0.09 | BIRTH_RT     -1.14    -0.93    -0.83     0.10
Denmark         0.19 | DEATH_RT     -0.77     0.00     0.26     0.35
France          0.14 | BABYMORT     -0.85    -0.81    -0.68     0.05
Switzerland     0.26 | GDP_CAP       0.33     1.01     1.28     0.26
UK              0.14 | EDUC          0.47     0.95     1.28     0.28
Italy           0.16 | HEALTH        0.52     0.99     1.31     0.23
Sweden          0.23 | MIL           0.28     0.81     1.11     0.25
WGermany        0.31 | LIFEEXPM      0.23     0.75     0.99     0.23
Poland          0.39 | LIFEEXPF      0.43     0.79     1.07     0.18
Czechoslov      0.26 | LITERACY      0.54     0.72     0.75     0.06
Canada          0.30 | B_TO_D       -1.09    -0.91    -0.46     0.18

-------------------------------------------------------------------------------
Cluster 2 of 4 contains 5 cases

      Members                           Statistics
Case        Distance | Variable   Minimum     Mean  Maximum  St.Dev.
Ethiopia        0.40 | URBAN        -2.01    -1.69    -1.29     0.30
Guinea          0.52 | BIRTH_RT      1.46     1.58     1.69     0.10
Somalia         0.38 | DEATH_RT      1.28     1.85     3.08     0.76
Afghanistan     0.38 | BABYMORT      1.38     1.88     2.41     0.44
Haiti           0.30 | GDP_CAP      -2.00    -1.61    -1.27     0.30
                     | EDUC         -2.41    -1.58    -1.10     0.51
                     | HEALTH       -2.22    -1.64    -1.29     0.44
                     | MIL          -1.76    -1.51    -1.37     0.17
                     | LIFEEXPM     -2.78    -1.90    -1.38     0.56
                     | LIFEEXPF     -2.47    -1.91    -1.48     0.45
                     | LITERACY     -2.27    -1.83    -0.76     0.62
                     | B_TO_D       -0.38    -0.02     0.25     0.26
-------------------------------------------------------------------------------

Cluster 3 of 4 contains 11 cases

      Members                           Statistics
Case        Distance | Variable   Minimum     Mean  Maximum  St.Dev.
Argentina       0.45 | URBAN        -0.88     0.16     1.14     0.76
Brazil          0.32 | BIRTH_RT     -0.60     0.07     0.92     0.49
Chile           0.40 | DEATH_RT     -1.28    -0.70     0.00     0.42
Colombia        0.42 | BABYMORT     -0.70    -0.06     0.55     0.47
Uruguay         0.61 | GDP_CAP      -0.75    -0.38     0.04     0.28
Ecuador         0.36 | EDUC         -0.89    -0.39     0.14     0.36
ElSalvador      0.52 | HEALTH       -0.91    -0.47     0.28     0.38
Guatemala       0.65 | MIL          -1.25    -0.59     0.37     0.49
Peru            0.37 | LIFEEXPM     -0.63     0.06     0.77     0.49
Panama          0.51 | LIFEEXPF     -0.57     0.04     0.61     0.44
Cuba            0.58 | LITERACY     -0.94     0.20     0.73     0.51
                     | B_TO_D       -0.65     0.63     1.68     0.76
-------------------------------------------------------------------------------
Cluster 4 of 4 contains 2 cases

      Members                           Statistics
Case        Distance | Variable   Minimum     Mean  Maximum  St.Dev.
Iraq            0.29 | URBAN        -0.30     0.06     0.42     0.51
Libya           0.29 | BIRTH_RT      0.92     1.27     1.61     0.49
                     | DEATH_RT     -0.77    -0.77    -0.77     0.0
                     | BABYMORT      0.44     0.47     0.51     0.05
                     | GDP_CAP      -0.25     0.05     0.36     0.43
                     | EDUC         -0.04     0.44     0.93     0.68
                     | HEALTH       -0.51    -0.04     0.42     0.66
                     | MIL           1.34     1.40     1.46     0.08
                     | LIFEEXPM     -0.09    -0.04     0.02     0.08
                     | LIFEEXPF     -0.30    -0.21    -0.11     0.13
                     | LITERACY     -0.94    -0.86    -0.77     0.12
                     | B_TO_D        1.61     2.01     2.42     0.57

Cluster Parallel Coordinate Plots
[Figure: four panels, one per cluster. Each panel plots the standardized values of
GDP_CAP, BIRTH_RT, BABYMORT, LIFEEXPF, HEALTH, MIL, EDUC, LITERACY, LIFEEXPM,
DEATH_RT, B_TO_D, and URBAN against Index of Case.]

Cluster Profile Plots
[Figure: four panels, one per cluster. Each panel lists the same twelve variables
in the same order.]

For each variable, cluster analysis compares the between-cluster mean square (Between
SS/df) to the within-cluster mean square (Within SS/df) and reports the F-ratio. However,
do not use these F ratios to test significance because the clusters are formed to characterize
differences. Instead, use these statistics to characterize relative discrimination. For
example, the log of gross domestic product (GDP_CAP) and BIRTH_RT are better
discriminators between countries than URBAN or DEATH_RT. For a good graphical view
of the separation of the clusters, you might rotate the data using the three variables with
the highest F ratios.
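
As a check on how the F-ratio column is formed, the following Python sketch recomputes it for four variables from the sums of squares and degrees of freedom printed in the summary table. The function name f_ratio is hypothetical; small differences in the last digits come from rounding in the printed sums of squares.

```python
def f_ratio(between_ss, between_df, within_ss, within_df):
    """Between-cluster mean square divided by within-cluster mean square."""
    return (between_ss / between_df) / (within_ss / within_df)


# (Between SS, df, Within SS, df) copied from the summary statistics.
summary = {
    "URBAN": (18.6065, 3, 9.3935, 25),
    "BIRTH_RT": (26.2041, 3, 2.7959, 26),
    "DEATH_RT": (23.6626, 3, 5.3374, 26),
    "GDP_CAP": (26.9585, 3, 2.0415, 26),
}

# Rank the variables by how well they discriminate among the clusters.
ranked = sorted(summary, key=lambda name: f_ratio(*summary[name]), reverse=True)
for name in ranked:
    print(f"{name:<9} {f_ratio(*summary[name]):9.3f}")
```

The ranking puts GDP_CAP and BIRTH_RT ahead of DEATH_RT and URBAN, matching the comparison in the text.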
Following the summary statistics, for each cluster, cluster analysis prints the
distance from each case (country) in the cluster to the center of the cluster. Descriptive
statistics for these countries appear on the right. For the first cluster, the standard scores
for LITERACY range from 0.54 to 0.75 with an average of 0.72. B_TO_D ranges from
-1.09 to -0.46. Thus, for these predominantly European countries, literacy is well
above the average for the sample and the birth-to-death ratio is below average. In
cluster 2, LITERACY ranges from -2.27 to -0.76 for these five countries, and B_TO_D
ranges from -0.38 to 0.25. Thus, the countries in cluster 2 have a lower literacy rate
and a greater potential for population growth than those in cluster 1. The fourth cluster
(Iraq and Libya) has an average birth-to-death ratio of 2.01, the highest among the four
clusters.

Cluster Parallel Coordinates


The variables in this Quick Graph are ordered by their F ratios. In the top left plot, there
is one line for each country in cluster 1 that connects its z scores for each of the
variables. Zero marks the average for the complete sample. The lines for these 12
countries all follow a similar pattern: above average values for GDP_CAP, below for
BIRTH_RT, and so on. The lines in cluster 3 do not follow such a tight pattern.

Cluster Profiles
The variables in cluster profile plots are ordered by the F ratios. The vertical line under
each cluster number indicates the grand mean across all data. A variable mean within
each cluster is marked by a dot. The horizontal lines indicate one standard deviation
above or below the mean. The countries in cluster 1 have above average means of gross
domestic product, life expectancy, literacy, and urbanization, and spend considerable
money on health care and the military, while the means of their birth rates, infant
mortality rates, and birth-to-death ratios are low. The opposite is true for cluster 2.

Example 2
Hierarchical Clustering: Clustering Cases
This example uses the SUBWORLD data (see the k-means example for a description)
to cluster cases. The input is:
CLUSTER
USE subworld
IDVAR = country$
LET (gdp_cap, educ, mil, health) = L10(@)
STANDARDIZE / SD
JOIN urban birth_rt death_rt babymort lifeexpm,
lifeexpf gdp_cap b_to_d literacy educ mil health


The resulting output is:


Distance metric is Euclidean distance
Single linkage method (nearest neighbor)
Cluster         and   Cluster         Were joined    No. of members
containing            containing      at distance    in new cluster
------------          ------------    ------------   --------------
WGermany              Belgium              0.0869           2
WGermany              Denmark              0.1109           3
WGermany              UK                   0.1127           4
Sweden                WGermany             0.1275           5
Austria               Sweden               0.1606           6
Austria               France               0.1936           7
Austria               Italy                0.1943           8
Austria               Canada               0.2112           9
Uruguay               Argentina            0.2154           2
Switzerland           Austria              0.2364          10
Czechoslov            Poland               0.2411           2
Switzerland           Czechoslov           0.2595          12
Guatemala             ElSalvador           0.3152           2
Guatemala             Ecuador              0.3155           3
Uruguay               Chile                0.3704           3
Cuba                  Uruguay              0.3739           4
Haiti                 Somalia              0.3974           2
Switzerland           Cuba                 0.4030          16
Guatemala             Brazil               0.4172           4
Peru                  Guatemala            0.4210           5
Colombia              Peru                 0.4433           6
Ethiopia              Haiti                0.4743           3
Panama                Colombia             0.5160           7
Switzerland           Panama               0.5560          23
Libya                 Iraq                 0.5704           2
Afghanistan           Guinea               0.5832           2
Ethiopia              Afghanistan          0.5969           5
Switzerland           Libya                0.8602          25
Switzerland           Ethiopia             0.9080          30

The numerical results consist of the joining history. The countries at the top of the
panel are joined first at a distance of 0.087. The last entry represents the joining of the
largest two clusters to form one cluster of all 30 countries. Switzerland is in one of the
clusters and Ethiopia is in the other.
The clusters are best illustrated using a tree diagram. Because the example joins
rows (cases) and uses COUNTRY$ as an ID variable, the branches of the tree are labeled
with countries. If you join columns (variables), then variable names are used. The scale
for the joining distances is printed at the bottom. Notice that Iraq and Libya, which
form their own cluster as they did in the k-means example, are the second-to-last
cluster to link with others. They join with all the countries listed above them at a
distance of 0.860. Finally, at a distance of 0.908, the five countries at the bottom of the
display are added to form one large cluster.

Polar Dendrogram
Adding the POLAR option to JOIN yields a polar dendrogram.


Example 3
Hierarchical Clustering: Clustering Variables
This example joins columns (variables) instead of rows (cases) to see which variables
cluster together. The input is:
CLUSTER
USE subworld
IDVAR = country$
LET (gdp_cap, educ, mil, health) = L10(@)
STANDARDIZE / SD
JOIN urban birth_rt death_rt babymort lifeexpm,
lifeexpf gdp_cap b_to_d literacy,
educ mil health / COLS

The resulting output is:


Distance metric is Euclidean distance
Single linkage method (nearest neighbor)
Cluster         and   Cluster         Were joined    No. of members
containing            containing      at distance    in new cluster
------------          ------------    ------------   --------------
LIFEEXPF              LIFEEXPM             0.1444           2
HEALTH                GDP_CAP              0.2390           2
EDUC                  HEALTH               0.2858           3
LIFEEXPF              LITERACY             0.3789           3
BABYMORT              BIRTH_RT             0.3859           2
EDUC                  LIFEEXPF             0.4438           6
MIL                   EDUC                 0.4744           7
MIL                   URBAN                0.5414           8
B_TO_D                BABYMORT             0.8320           3
B_TO_D                DEATH_RT             0.8396           4
MIL                   B_TO_D               1.5377          12

The scale at the bottom of the tree for the distance (1 - r) ranges from 0.0 to 1.5. The
smallest distance is 0.011; thus, the correlation of LIFEEXPM with LIFEEXPF is
0.989.

Example 4
Hierarchical Clustering: Clustering Variables and Cases
To produce a shaded display of the original data matrix in which rows and columns are
permuted according to an algorithm in Gruvaeus and Wainer (1972), use the MATRIX
option. Different shadings or colors represent the magnitude of each number in the
matrix (Ling, 1973).
If you use the MATRIX option with Euclidean distance, be sure that the variables are
on comparable scales because both rows and columns of the matrix are clustered.
Joining a matrix containing inches of annual rainfall and annual growth of trees in feet,
for example, would split columns more by scales than by covariation. In cases like this,
you should standardize your data before joining.
The input is:
CLUSTER
USE subworld
IDVAR = country$
LET (gdp_cap, educ, mil, health) = L10(@)
STANDARDIZE / SD
JOIN urban birth_rt death_rt babymort lifeexpm,
lifeexpf gdp_cap b_to_d literacy educ,
mil health / MATRIX


The resulting output is:


Permuted Data Matrix
[Figure: shaded display of the standardized data matrix, with the twelve variables as
columns and the countries as rows. The shading legend runs from -3 to 4. Rows appear
in this order: Canada, Italy, France, Sweden, WGermany, Belgium, Denmark, UK,
Austria, Switzerland, Czechoslov, Poland, Cuba, Chile, Argentina, Uruguay, Panama,
Colombia, Peru, Brazil, Ecuador, ElSalvador, Guatemala, Iraq, Libya, Ethiopia, Haiti,
Somalia, Afghanistan, Guinea.]

This clustering reveals three groups of countries and two groups of variables. The
countries with more urban dwellers and literate citizens, longest life-expectancies,
highest gross domestic product, and most expenditures on health care, education, and
the military are on the top left of the data matrix; countries with the highest rates of
death, infant mortality, birth, and population growth (see B_TO_D) are on the lower
right. You can also see that, consistent with the k-means and join examples, Iraq and
Libya spend much more on military, education, and health than their immediate
neighbors.


Example 5
Hierarchical Clustering: Distance Matrix Input
This example clusters a matrix of distances. The data, stored as a dissimilarity matrix
in the CITIES data file, are airline distances in hundreds of miles between 10 global
cities. The data are adapted from Hartigan (1975).
The input is:
CLUSTER
USE cities
JOIN berlin bombay capetown chicago london,
montreal newyork paris sanfran seattle

Following is the output:


Single linkage method (nearest neighbor)
Cluster         and   Cluster         Were joined    No. of members
containing            containing      at distance    in new cluster
------------          ------------    ------------   --------------
PARIS                 LONDON               2.0000           2
NEWYORK               MONTREAL             3.0000           2
BERLIN                PARIS                5.0000           3
CHICAGO               NEWYORK              7.0000           3
SEATTLE               SANFRAN              7.0000           2
SEATTLE               CHICAGO             17.0000           5
BERLIN                SEATTLE             33.0000           8
BOMBAY                BERLIN              39.0000           9
BOMBAY                CAPETOWN            51.0000          10


The tree is printed in seriation order. Imagine a trip around the globe to these cities.
SYSTAT has identified the shortest path between cities. The itinerary begins at San
Francisco, leads to Seattle, Chicago, New York, and so on, and ends in Capetown.
Note that the CITIES data file contains the distances between the cities; SYSTAT
did not have to compute those distances. When you save the file, be sure to save it as
a dissimilarity matrix.
This example is used both to illustrate direct distance input and to give you an idea
of the kind of information contained in the order of the SYSTAT cluster tree. For
distance data, the seriation reveals shortest paths; for typical sample data, the seriation
is more likely to replicate in new samples so that you can recognize cluster structure.

Example 6
Additive Trees
This example uses the ROTHKOPF data file. The input is:
CLUSTER
USE rothkopf
ADD a .. z

The output includes:


Similarities linearly transformed into distances.
77.0000 needed to make distances positive.
104.0000 added to satisfy triangle inequality.
Checking 14950 quadruples.
Checking 1001 quadruples.
Checking 330 quadruples.
Checking 70 quadruples.
Checking 1 quadruples.
Stress formula 1       = 0.0609
Stress formula 2       = 0.3985
r(monotonic) squared   = 0.8412
r-squared (p.v.a.f.)   = 0.7880

Node    Length   Child
   1   23.3958   A
   2   15.3958   B
   3   14.8125   C
   4   13.3125   D
   5   24.1250   E
   6   34.8370   F
   7   15.9167   G
   8   27.8750   H
   9   25.6042   I
  10   19.8333   J
  11   13.6875   K
  12   28.6196   L
  13   21.8125   M
  14   22.1875   N
  15   19.0833   O
  16   14.1667   P
  17   18.9583   Q
  18   21.4375   R
  19   28.0000   S
  20   23.8750   T
  21   23.0000   U
  22   27.1250   V
  23   21.5625   W
  24   14.6042   X
  25   17.1875   Y
  26   18.0417   Z
  27   16.9432   1, 9
  28   15.3804   2, 24
  29   15.7159   3, 25
  30   19.5833   4, 11
  31   26.0625   5, 20
  32   23.8426   7, 15
  33    6.1136   8, 22
  34   17.1750   10, 16
  35   18.8068   13, 14
  36   13.7841   17, 26
  37   15.6630   18, 23
  38    8.8864   19, 21
  39    4.5625   27, 35
  40    1.7000   29, 36
  41    8.7995   33, 38
  42    4.1797   39, 31
  43    1.1232   12, 28
  44    5.0491   34, 40
  45    2.4670   42, 41
  46    4.5849   30, 43
  47    2.6155   32, 44
  48    2.7303   6, 37
  49    0.0      45, 48
  50    3.8645   46, 47
  51    0.0      50, 49

(SYSTAT also displays the raw data, as well as the model distances.)
Additive Tree
[Figure: additive tree of the 26 letters. Leaves appear in this order: W, R, F, U, S,
V, H, T, E, N, M, I, A, Z, Q, Y, C, P, J, O, G, X, B, L, K, D.]


Computation
Algorithms
JOIN follows the standard hierarchical amalgamation method described in Hartigan
(1975). The algorithm in Gruvaeus and Wainer (1972) is used to order the tree.
KMEANS follows the algorithm described in Hartigan (1975). Modifications from
Hartigan and Wong (1979) improve speed. There is an important difference between
SYSTAT's KMEANS algorithm and that of Hartigan (or implementations of Hartigan's
in BMDP, SAS, and SPSS). In SYSTAT, seeds for new clusters are chosen by finding
the case farthest from the centroid of its cluster. In Hartigan's algorithm, seeds forming
new clusters are chosen by splitting on the variable with largest variance.

Missing Data
In cluster analysis, all distances are computed with pairwise deletion of missing values.
Since missing data are excluded from distance calculations by pairwise deletion, they
do not directly influence clustering when you use the MATRIX option for JOIN. To use
the MATRIX display to analyze patterns of missing data, create a new file in which
missing values are recoded to 1, and all other values, to 0. Then use JOIN with MATRIX
to see whether missing values cluster together in a systematic pattern.
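
The recoding step can be sketched in Python as below. This only illustrates the 1/0 indicator file described above; representing missing values as None and the function name missing_indicators are assumptions.

```python
def missing_indicators(rows):
    """Recode each value to 1 if it is missing (None) and to 0 otherwise."""
    return [[1 if value is None else 0 for value in row] for row in rows]


# Hypothetical data matrix: three cases, three variables.
data = [
    [1.2, None, 3.4],
    [None, None, 0.7],
    [2.2, 5.1, 6.3],
]
print(missing_indicators(data))  # [[0, 1, 0], [1, 1, 0], [0, 0, 0]]
```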

References
Campbell, D. T. and Fiske, D. W. (1959). Convergent and discriminant validation by the
multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Fisher, L. and Van Ness, J. W. (1971). Admissible clustering procedures. Biometrika, 58,
91-104.
Gower, J. C. (1967). A comparison of some methods of cluster analysis. Biometrics, 23,
623-637.
Gruvaeus, G. and Wainer, H. (1972). Two additions to hierarchical cluster analysis. The
British Journal of Mathematical and Statistical Psychology, 25, 200-206.
Guttman, L. (1944). A basis for scaling qualitative data. American Sociological Review,
139-150.
Hartigan, J. A. (1975). Clustering algorithms. New York: John Wiley & Sons, Inc.
Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241-254.
Ling, R. F. (1973). A computer generated aid for cluster analysis. Communications of the
ACM, 16, 355-361.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate
observations. 5th Berkeley symposium on mathematics, statistics, and probability, Vol.
1, 281-298.
McQuitty, L. L. (1960). Hierarchical syndrome analysis. Educational and Psychological
Measurement, 20, 293-303.
Milligan, G. W. (1980). An examination of the effects of six types of error perturbation on
fifteen clustering algorithms. Psychometrika, 45, 325-342.
Sattath, S. and Tversky, A. (1977). Additive similarity trees. Psychometrika, 42, 319-345.
Sokal, R. R. and Michener, C. D. (1958). A statistical method for evaluating systematic
relationships. University of Kansas Science Bulletin, 38, 1409-1438.
Sokal, R. R. and Sneath, P. H. A. (1963). Principles of numerical taxonomy. San Francisco:
W. H. Freeman and Company.
Wainer, H. and Schacht, S. (1978). Gapping. Psychometrika, 43, 203-212.
Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association, 58, 236-244.
Wilkinson, L. (1978). Permuting a matrix to a simple structure. Proceedings of the
American Statistical Association.

Chapter 5
Conjoint Analysis
Leland Wilkinson

Conjoint analysis fits metric and nonmetric conjoint measurement models to observed
data. It is designed to be a general additive model program using a simple
optimization procedure. As such, conjoint analysis can handle measurement models
not normally amenable to other specialized conjoint programs.

Statistical Background
Conjoint measurement (Luce and Tukey, 1964; Krantz, 1964; Luce, 1966; Tversky,
1967; Krantz and Tversky, 1971) is an axiomatic theory of measurement that defines
the conditions under which there exist measurement scales for two or more variables
that jointly define a common scale under an additive composition rule. This theory
became the basis for a group of related numerical techniques for fitting additive
models, called conjoint analysis (Green and Rao, 1971; Green, Carmone, and Wind,
1972; Green and DeSarbo, 1978; Green and Srinivasan, 1978, 1990; Louviere, 1988,
1994). For an interesting historical comment on Sir Ronald Fisher's "appropriate
scores" method for fitting additive models, see Heiser and Meulman (1995).
To see how conjoint analysis is based on additive models, we'll first graph an
additive table and then examine a multiplicative table to encounter one example of a
non-additive table. Then we'll consider the problem of computing margins of a
general table based on an additive model.


Additive Tables
The following is an additive table. Notice that any cell (in roman) is the sum of the
corresponding row and column marginal values (in italic).
[Table: additive table with row and column marginal values]

A common way to represent a two-way table like this is with a graph. I made a file
(PCONJ.SYD) containing all possible ordered pairs of the row and column indices. Then
I formed Y values by adding the indices:
USE PCONJ
LET Y=A+B
LINE Y*A/GROUP=B,OVERLAY

The following graph of the additive table shows a plot of Y (the values in the cells)
against A (rows) stratified by B (columns) in the legend. Notice that the lines are
parallel.
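
The same construction can be tried outside SYSTAT. The Python sketch below builds a table from marginal indices and checks whether the stratified lines would be parallel, meaning every column differs from the first by a constant across rows. The index values and function names are hypothetical, since the contents of PCONJ are not reproduced here.

```python
def make_table(rows, cols, rule):
    """Build a two-way table whose cell (a, b) is rule(a, b)."""
    return [[rule(a, b) for b in cols] for a in rows]


def lines_parallel(table, tol=1e-9):
    """True if each column's offset from the first column is the same in every row."""
    first = table[0]
    return all(abs((row[j] - row[0]) - (first[j] - first[0])) <= tol
               for row in table for j in range(len(first)))


A = [1, 2, 3]     # row indices (assumed values)
B = [1, 2, 3, 4]  # column indices (assumed values)

additive = make_table(A, B, lambda a, b: a + b)
for row in additive:
    print(row)
print(lines_parallel(additive))  # True
```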

Since we really have a three-dimensional graph (Y*A*B), it is sometimes convenient
to represent a two-way table as a 3-D or contour plot rather than as a stratified line
graph. Following is the input to do so:
PLOT Y*A*B/SMOO=QUAD, CONTOUR,
XMIN=0,XMAX=4,YMIN=0,YMAX=5,INDENT

The following contour plot of the additive table shows the result. Notice that the lines
in the contour plot are parallel for additive tables. Furthermore, although I used a
quadratic smoother, the contours are linear because I used a simple linear combination
of A and B to make Y.

[Figure: contour plot of the additive table, Y against A and B]

Multiplicative Tables
Following is a multiplicative table. Notice that any cell is the product of the
corresponding marginal values. We commonly encounter these tables in cookbooks
(for sizing recipes) or in, well, multiplication tables. These tables are one instance of
two-way tables that are not additive.
[Table: multiplicative two-way table with row and column margins]

Let's look at a graph of this multiplicative table:


LET Y=A*B
LINE Y*A/GROUP=B,OVERLAY

Notice that the lines are not parallel.

And the following figure shows the contour plot for the multiplicative model. Notice,
again, that the contours are not parallel.

Multiplicative tables and graphs may be pleasing to look at, but they're not simple. We
all learned to add before multiplying. Scientists often simplify multiplicative functions
by logging them, since logs of products are sums of logs. This is also one of the reasons
we are told to be suspicious of fan-fold interactions (as in the line graph of the
multiplicative table) in the analysis of variance. If we can log the variables and remove
them (usually improving the residuals in the process), we should do so because it
leaves us with a simple linear model.
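The effect of logging a multiplicative table can be checked numerically. The following Python sketch is not SYSTAT code; the margins in `rows` and `cols` are hypothetical values chosen only for illustration. It measures departure from additivity by the largest double-centered residual (cell minus row mean minus column mean plus grand mean), which is zero for an exactly additive table.

```python
import math

# Hypothetical margins, chosen only for illustration.
rows = [1.0, 2.0, 3.0, 4.0]
cols = [1.0, 2.0, 3.0]


def additivity_residual(table):
    """Largest absolute double-centered residual; 0 means exactly additive."""
    n_r, n_c = len(table), len(table[0])
    grand = sum(sum(r) for r in table) / (n_r * n_c)
    row_means = [sum(r) / n_c for r in table]
    col_means = [sum(table[i][j] for i in range(n_r)) / n_r for j in range(n_c)]
    return max(
        abs(table[i][j] - row_means[i] - col_means[j] + grand)
        for i in range(n_r)
        for j in range(n_c)
    )


# Multiplicative table and its logged version.
product_table = [[a * b for b in cols] for a in rows]
logged_table = [[math.log(y) for y in row] for row in product_table]

raw_residual = additivity_residual(product_table)
log_residual = additivity_residual(logged_table)
print(raw_residual, log_residual)
```

The raw table leaves a clearly nonzero residual, while the logged table is additive up to floating-point rounding.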


Computing Table Margins Based on an Additive Model


If we believe in Occam's razor and assume that additive tables are generally preferable
to non-additive, we may want to fit additive models to a table of numbers before
accepting a more complex model. So far, we have been assuming that the marginal
indices are known. Testing for additivity is simply a matter of using these indices in a
formal model. What if the marginal indices are not known? All we have is a table of
numbers bordered by labeled categories. Can we find marginal values such that a linear
model based on these values would reproduce the table?
This is exactly what conjoint analysis does. Conjoint analysis originated in an
axiomatic approach to measurement (Luce and Tukey, 1964). An additive model
underlies a basic axiom of fundamental measurement: scale values of separate
measurements can be added to produce a joint measurement. This powerful property
allows us to say that, for all measurements $a$ and $b$ that we have made on a set of
objects, $(a + b) > a$ and $(a + b) > b$, assuming that $a$ and $b$ are positive.
The following table is an example of such data. How do we find values for $a_i$ and
$b_j$ such that $y_{ij} = a_i + b_j$? Luce and Tukey devised rules for computing these values
assuming that the cell values can be fit by the additive model.
          b1      b2      b3
a4      1.38    2.07    2.48
a3      1.10    1.79    2.20
a2       .69    1.38    1.79
a1       .00     .69    1.10

The following figure shows a solution. The values for $a$ are $a_1 = 0$, $a_2 = 0.69$,
$a_3 = 1.10$, and $a_4 = 1.38$. The values for $b$ are $b_1 = 0$, $b_2 = 0.69$, and
$b_3 = 1.10$.
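For a table that is exactly additive, such margins can be recovered from row and column means. The following Python sketch is an ordinary least-squares additive fit, not the Luce and Tukey rules and not what CONJOINT does internally; the function name `additive_margins` and the choice to anchor $a_1 = 0$ are assumptions made for illustration.

```python
# Cell values from the table above; rows are a1..a4, columns are b1..b3.
table = [
    [0.00, 0.69, 1.10],  # a1
    [0.69, 1.38, 1.79],  # a2
    [1.10, 1.79, 2.20],  # a3
    [1.38, 2.07, 2.48],  # a4
]


def additive_margins(y):
    """Least-squares additive fit y[i][j] ~ a[i] + b[j], anchored so a[0] = 0."""
    n_r, n_c = len(y), len(y[0])
    grand = sum(sum(r) for r in y) / (n_r * n_c)
    row_means = [sum(r) / n_c for r in y]
    col_means = [sum(y[i][j] for i in range(n_r)) / n_r for j in range(n_c)]
    # Row effects relative to the first row; the remaining constant goes to b.
    a = [m - row_means[0] for m in row_means]
    b = [m - grand + row_means[0] for m in col_means]
    return a, b


a, b = additive_margins(table)
# Largest discrepancy between the table and the reconstructed sums a[i] + b[j].
max_error = max(
    abs(table[i][j] - (a[i] + b[j]))
    for i in range(len(a))
    for j in range(len(b))
)
print([round(v, 2) for v in a], [round(v, 2) for v in b], max_error)
```

Because this table is exactly additive, the recovered margins agree with the solution given above and the reconstruction error is zero up to rounding.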


Applied Conjoint Analysis


In the last few decades, conjoint analysis has become popular, especially among
market researchers and some economists, for analyzing consumer preferences for
goods based on multiple attributes. Green and Srinivasan (1978, 1990), Crowe (1980),
and Louviere (1988) summarize this activity. The focus of most of these techniques has
been on the development of products with attributes ideally suited to consumer
preferences. Several trends in this area have been apparent.
First, psychometricians decided that the axiomatic approach was impractical for
large data sets and for data in which the conjoint measurement axioms were violated
or contained errors (for example, Emery and Barron, 1979). This trend was partly a
consequence of the development of numerical methods that could fit conjoint models
nonmetrically (Kruskal, 1965; Kruskal and Carmone, 1969; Srinivasan and Shocker,
1973; De Leeuw et al., 1976). Green and Srinivasan (1978) coined the term "conjoint
analysis" for the application of these numerical methods.
Second, applied researchers began to substitute linear methods (usually least-squares
linear regression or ANOVA) for nonmetric algorithms. The justification for
this was usually practical: the results appeared to be similar for all of the fitting
methods, so why not use the simple linear ones? Louviere (1988) articulates this
position, partly based on results from Green and Srinivasan (1978) and partly from his
own experience with real data sets. This argument is similar to one made by Weeks and
Bentler (1979), in which multidimensional scalings using a linear distance function
produced configurations almost indistinguishable from those using monotonic or
moderately nonlinear distance functions. This is a rather ad hoc conclusion, however,
and does not justify ignoring possible nonlinearities in the modeling process. We will
look at such a case in the examples.
Third, recent conjoint analysis applied methodology has moved toward designing
experiments rather than analyzing received ratings. Green and Srinivasan (1990) and
Louviere (1991) have pioneered this approach. Response surfaces for fractional
designs are analyzed to identify optimal combinations of product features. In
SYSTAT, this approach amounts to using DESIGN for setting up an experimental
design and then GLM for analyzing the results. With PRINT LONG, least-squares means
are produced for factorial designs. Otherwise, response surfaces can be plotted.
Fourth, discrete choice logistic regression has recently emerged as a rival to conjoint
analysis for modeling choice and preference behavior (Hensher and Johnson, 1981).
Steinberg (1992) describes the advantages and limitations of this approach. The LOGIT
procedure in SYSTAT offers this method.


Finally, a commercial industry supplying the practical tools for conjoint studies has
produced a variety of software packages. Oppewal (1995) reviews some of these. In
many cases, more efforts are devoted to card decks and other stimulus materials
management than to the actual analysis of the models. CONJOINT in SYSTAT
represents the opposite end of the spectrum from these approaches. CONJOINT
presents methods for fitting these models that are inspired more by Luce and Tukey's
and Green and Rao's original theoretical formulations than by the practical
requirements of data collection. The primary goal of SYSTAT CONJOINT is to provide
tools for scaling small- to moderate-sized data sets in which additive models can
simplify the presentation of data. Metric and nonmetric loss functions are available for
exploring the effects of nonlinearity on scaling. The examples highlight this
distinction.

Conjoint Analysis in SYSTAT


Conjoint Analysis Main Dialog Box
To open the Conjoint Analysis dialog box, from the menus choose:
Statistics
Classification
Conjoint Analysis


Conjoint analyses are computed by specifying and then estimating a model.


Dependent(s). Select the variable(s) you want to examine. The dependent variable(s)
should be continuous numeric variables (for example, INCOME).
Independent(s). Select one or more continuous or categorical variables (grouping
variables).
Iterations. Enter the maximum number of iterations. If not stated, the maximum is 50.
Convergence. Enter the relative change in estimates; if all such changes are less than
the specified value, convergence is assumed.
Polarity. Enter the polarity of the preferences when doing preference mapping. If the
smaller number indicates the least and the higher number the most, select Positive. For
example, a questionnaire may include the question "Please rate a list of movies, where
one star is the worst and five stars is the best." If the higher number indicates a lower
ranking and the lower number indicates a higher ranking, select Negative. For
example, a questionnaire may include the question "Please rank your favorite sports
team, where 1 is the best and 10 is the worst."
Loss. Specify a loss function to apply in model estimation:
• Stress. Conjoint analysis minimizes Kruskal's STRESS.
• Tau. Conjoint analysis maximizes Kendall's tau-b.

Regression. Specify the regression form:
• Monotonic. Regression function is monotonically increasing or decreasing. If
LOSS=STRESS, this is Kruskal's MONANOVA model.
• Linear. Regression function is ordinary linear regression.
• Log. Regression function is logarithmic.
• Power. Regression function is of the form $y = ax^b$. This is useful for Box-Cox
models.
Save file. Saves parameter estimates into filename.SYD.


Using Commands
To request a conjoint analysis:
CONJOINT
  MODEL depvarlist = indvarlist
  ESTIMATE / ITERATIONS=n CONVERGENCE=d ,
             LOSS = STRESS
                    TAU ,
             REGRESSION = MONOTONIC
                          LINEAR
                          LOG
                          POWER ,
             POLARITY = POSITIVE
                        NEGATIVE

Usage Considerations
Types of data. CONJOINT uses rectangular data only.
Print options. The output is standard for all print options.
Quick Graphs. Quick Graphs produced by CONJOINT are utility functions for each
predictor variable in the model.
Saving files. CONJOINT saves parameter estimates as one case into a file if you precede
ESTIMATE with SAVE.
BY groups. CONJOINT analyzes data by groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. FREQ=<variable> increases the number of cases by the FREQ
variable.
Case weights. WEIGHT is not available in CONJOINT.


Examples
Example 1
Choice Data
The classical application of conjoint analysis is to product choice. The following
example from Green and Rao (1971) shows how to fit a nonmetric conjoint model to
some typical choice data. The input is:
CONJOINT
USE BRANDS
MODEL RESPONSE=DESIGN$..GUARANT$
ESTIMATE / POLARITY=NEGATIVE

Following is the output:

Iterative Conjoint Analysis
Monotonic Regression Model
Data are ranks
Loss function is Kruskal STRESS

Factors and Levels
DESIGN$     A         B         C
BRAND$      Bissell   Glory     K2R
PRICE       1.19      1.39      1.59
SEAL$       NO        YES
GUARANT$    NO        YES

Convergence Criterion: 0.000010
Maximum Iterations:    50

Iteration    Loss         Max parameter change
    1        0.5389079    0.2641755
    2        0.4476390    0.2711012
    3        0.3170808    0.2482502
    4        0.1746641    0.3290621
    5        0.1285278    0.1702260
    6        0.1050734    0.1906332
    7        0.0877708    0.1261961
    8        0.0591691    0.2336527
    9        0.0407008    0.1665511
   10        0.0166571    0.1448756
   11        0.0101404    0.1399945
   12        0.0058237    0.2048317
   13        0.0013594    0.1900774
   14        0.0006314    0.0345039
   15        0.0001157    0.0466520
   16        0.0000065    0.0192437
   17        0.0000000    0.0155169
   18        0.0000000    0.0032732
   19        0.0000000    0.0000032
   20        0.0000000    0.0000000

Parameter Estimates (Part Worths)
        A          B          C    Bissell      Glory        K2R
   -0.331      0.400      0.209     -0.122     -0.226     -0.195
 PRICE(1)   PRICE(2)   PRICE(3)         NO        YES         NO
    0.302      0.159     -0.429     -0.131     -0.102     -0.039
      YES
    0.504

Goodness of Fit (Kendall tau)
 RESPONSE
    1.000

Root-mean-square deleted goodness of fit values, i.e. fit when param(i)=0
        A          B          C    Bissell      Glory        K2R
    0.856      0.699      0.935      0.922      0.843      0.856
 PRICE(1)   PRICE(2)   PRICE(3)         NO        YES         NO
    0.778      0.922      0.817      0.948      0.974      0.987
      YES
    0.791

[Figures: Shepard diagram (Joint Score against Data) and Quick Graph utility functions (Measure) for PRICE, BRAND$, DESIGN$, SEAL$, and GUARANT$]

The fitting method chosen for this example is the default nonmetric loss using
Kruskal's STRESS statistic. This is the same method used in the MONANOVA
program (Kruskal and Carmone, 1969). Although the minimization algorithm differs
from that program, the result should be comparable.
The iterations converged to a perfect fit (LOSS = 0). That is, there exists a set of
parameter estimates such that their sums fit the observed data perfectly when Kendall's
tau-b is used to measure fit. This rarely occurs with real data.
The parameter estimates are scaled to have zero sum and unit sum of squares. There
is a single goodness-of-fit value for this example because there is one response.
The root-mean-square deleted goodness-of-fit values are the goodness of fit when
each respective parameter is set to zero. This serves as an informal test of sensitivity.
The lowest value for this example is for the B parameter, indicating that the estimate
for B cannot be changed without substantially affecting the overall goodness of fit.
The Shepard diagram displays the goodness of fit in a scatterplot. The Data axis
represents the observed data values. The Joint Score axis represents the values of the
combined parameter estimates. For example, if we have parameters a1, a2, a3 and b1,
b2, then every case measured on, say, a2 and b1 will be represented by a point in the
plot whose ordinate (y value) is a2 + b1. This example involves only one condition per
card or case, so that the Shepard diagram has no duplicate values on the y axis.
Conjoint analysis can easily handle duplicate measurements either with multiple
dependent variables (multiple subjects exposed to common stimuli) or with duplicate
values for the same subject (replications).
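The joint score for a case is simply the sum of the part worths for the levels that describe it. The following Python sketch uses the part worths printed for this example; the card profile in `card` is a hypothetical one chosen for illustration, and the mapping of PRICE(1) to PRICE(3) onto 1.19, 1.39, and 1.59 assumes the levels are listed in the order shown under Factors and Levels.

```python
# Part worths as printed in the Example 1 output, keyed by factor and level.
part_worths = {
    "DESIGN$": {"A": -0.331, "B": 0.400, "C": 0.209},
    "BRAND$": {"Bissell": -0.122, "Glory": -0.226, "K2R": -0.195},
    "PRICE": {"1.19": 0.302, "1.39": 0.159, "1.59": -0.429},
    "SEAL$": {"NO": -0.131, "YES": -0.102},
    "GUARANT$": {"NO": -0.039, "YES": 0.504},
}


def joint_score(profile, worths):
    """Sum the part worths of the levels that make up one card or case."""
    return sum(worths[factor][level] for factor, level in profile.items())


# Hypothetical card, for illustration only.
card = {
    "DESIGN$": "B",
    "BRAND$": "K2R",
    "PRICE": "1.19",
    "SEAL$": "YES",
    "GUARANT$": "YES",
}
score = joint_score(card, part_worths)
print(round(score, 3))
```

Each point in the Shepard diagram pairs one such joint score (y axis) with the observed data value for the same case (x axis).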
The fitted jagged line is the best fitting monotonic regression of these fitted values
on the observed data. For a similar diagram, see the Multidimensional Scaling chapter
in SYSTAT 10 Statistics II. And note carefully the warnings about degenerate
solutions and other problems.
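A monotonic regression of this kind can be sketched with the pool-adjacent-violators algorithm, a standard way to compute the least-squares nondecreasing fit. The Python code below is an illustration under that assumption; the manual does not say which algorithm CONJOINT uses, and `joint_scores` holds made-up values assumed to be already sorted by the observed data.

```python
def isotonic_fit(values):
    """Least-squares nondecreasing fit by pool-adjacent-violators."""
    blocks = []  # each block is [sum of values, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backward while the previous block mean exceeds the last one.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


# Hypothetical joint scores, ordered by the observed data values.
joint_scores = [0.1, 0.3, 0.2, 0.6, 0.5, 0.9]
fitted = isotonic_fit(joint_scores)
print(fitted)
```

Runs of equal fitted values are the flat steps that give the line its jagged appearance.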
You may want to try this example with REGRESSION = LINEAR to see how the
results compare. The linear fit yields an almost perfect Pearson correlation. This also
means that GLM (MGLH) can produce nearly the same estimates:
GLM
MODEL RESPONSE = CONSTANT + DESIGN$..GUARANT$
CATEGORY DESIGN$..GUARANT$
PRINT LONG
ESTIMATE

The PRINT LONG statement causes GLM to print the least-squares estimates of the
marginal means that, for an additive model, are the parameters we seek. The GLM
parameter estimates will differ from the ones printed here only by a constant and
scaling parameter. Conjoint analysis always scales parameter estimates to have zero
sum and unit sum of squares. This way, they can be thought of as utilities over the
experimental domain: some negative, some positive.
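The scaling convention is easy to verify and to reproduce. The following Python sketch first checks the part worths printed for Example 1 and then rescales an arbitrary set of estimates to zero sum and unit sum of squares; the helper name `to_conjoint_scale` and the values in `raw_estimates` are hypothetical.

```python
import math


def to_conjoint_scale(estimates):
    """Center to zero sum, then scale to unit sum of squares."""
    mean = sum(estimates) / len(estimates)
    centered = [e - mean for e in estimates]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


# The 13 part worths printed in the Example 1 output.
printed = [-0.331, 0.400, 0.209, -0.122, -0.226, -0.195,
           0.302, 0.159, -0.429, -0.131, -0.102, -0.039, 0.504]
printed_sum = sum(printed)
printed_ss = sum(p * p for p in printed)

# Hypothetical unscaled estimates, for illustration only.
raw_estimates = [10.0, 12.5, 9.0, 14.0]
scaled = to_conjoint_scale(raw_estimates)
print(round(printed_sum, 3), round(printed_ss, 3), scaled)
```

The printed estimates satisfy both constraints up to the three-decimal rounding of the output.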

Example 2
Word Frequency
The data set WORDS contains the most frequently used words in American English
(Carroll et al., 1971). Three measures have been added to the data. The first is the (most
likely) part of speech (PART$). The second is the number of letters (LETTERS) in the
word. The third is a measure of the meaning (MEANING). This admittedly informal
measure represents the amount of harm done to comprehension (1 = a little, 4 = a lot)
by omitting the word from a sentence. While linguists may argue over these
classifications, they do reveal basic differences. Instead of using a measure of
frequency, we will work with the rank order itself to see if there is enough information
to fit a model. This time, we will maximize Kendall's tau-b directly.
Following is the input:
USE WORDS
CONJOINT
LET RANK=CASE
MODEL RANK = LETTERS PART$ MEANING
ESTIMATE / LOSS=TAU,POLARITY=NEGATIVE

Following is the output:

Iterative Conjoint Analysis
Monotonic Regression Model
Data are ranks
Loss function is 1-(1+tau)/2

Factors and Levels
LETTERS     1   2   3   4
PART$       adjective  adverb  conjunction  preposition  pronoun  verb
MEANING     1   2   3

Convergence Criterion: 0.000010
Maximum Iterations:    50

Iteration    Loss         Max parameter change
    1        0.2042177    0.0955367
    2        0.1988071    0.0911670
    3        0.1897893    0.0708985
    4        0.1861822    0.0308284
    5        0.1843787    0.0259976
    6        0.1825751    0.0131758
    7        0.1825751    0.0000175
    8        0.1825751    0.0000000

Parameter Estimates (Part Worths)
 LETTERS(1)   LETTERS(2)   LETTERS(3)   LETTERS(4)    adjective       adverb
      0.154        0.174       -0.076       -0.270       -0.119       -0.273
conjunction  preposition      pronoun         verb   MEANING(1)   MEANING(2)
     -0.262        0.215        0.173       -0.162        0.749       -0.121
 MEANING(3)
     -0.182

Goodness of Fit (Kendall tau)
       RANK
      0.635

Root-mean-square deleted goodness of fit values, i.e. fit when param(i)=0
 LETTERS(1)   LETTERS(2)   LETTERS(3)   LETTERS(4)    adjective       adverb
      0.628        0.610        0.635        0.606        0.635        0.617
conjunction  preposition      pronoun         verb   MEANING(1)   MEANING(2)
      0.602        0.613        0.610        0.631        0.494        0.617
 MEANING(3)
      0.610

[Figures: Shepard diagram (Joint Score against Data) and Quick Graph utility functions (Measure) for LETTERS, PART$, and MEANING]

The Shepard diagram reveals a slightly curvilinear relationship between the data and
the fitted values. We can parameterize that relationship by refitting the model as
follows:
ESTIMATE / REGRESSION=POWER,POLARITY=NEGATIVE

SYSTAT will then print Computed Exponent: 1.392. We will further examine this type
of power function in the Box-Cox example.
The output tells us that, in general, shorter words are higher on the list, adverbs are
lower, and prepositions are higher. Also, the most frequently occurring words are
generally the most disposable. These statements must be made in the context of the
model, however. To the extent that the separate statements are inaccurate when the
data are examined separately for each, the additive model is violated. This is another
way of saying that the additive model is appropriate when there are no interactions or
configural effects. Incidentally, when these data are analyzed with GLM using the
(inverse transformed) word frequencies themselves rather than rank order in the list,
the conclusions are substantially the same.
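The loss reported in the output for this example, 1-(1+tau)/2, maps Kendall's tau from the range (-1, 1) onto a loss between 1 and 0. The following Python sketch computes tau-b by direct pair counting and converts it to that loss. It is an illustration, not SYSTAT's implementation, and the `ranks` and `scores` data are made up.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b by counting concordant, discordant, and tied pairs."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: excluded from both terms
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom


def tau_loss(tau):
    """Loss of the form printed in the output: 0 for tau = 1, 1 for tau = -1."""
    return 1.0 - (1.0 + tau) / 2.0


# Hypothetical data: observed ranks and fitted joint scores.
ranks = [1, 2, 3, 4, 5]
scores = [1, 3, 2, 4, 5]
tau = kendall_tau_b(ranks, scores)
loss = tau_loss(tau)

# Tau implied by the final loss printed in the output above.
implied_tau = 1.0 - 2.0 * 0.1825751
print(tau, loss, round(implied_tau, 3))
```

Inverting the final loss of 0.1825751 gives the goodness-of-fit value printed for RANK.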

Example 3
Box-Cox Model
Box and Cox (1964) devised a maximum likelihood estimator for the exponent in the
following model:

$$E\{y^{(\lambda)}\} = X\beta$$

where $X$ is a matrix of known values, $\beta$ is a vector of unknown parameters associated
with the transformed observations, and the residuals of the model are assumed to be
normally distributed and independent. The transformation itself is assumed to take the
following form:

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & (\lambda \neq 0) \\ \log(y) & (\lambda = 0) \end{cases}$$

Following is a SYSTAT program (originally coded by Grant Blank) to compute the
Box-Cox exponent and its standard error. The comments document the program flow:
USE BOXCOX
REM First we need GLM to code dummy variables.
GLM
CATEGORY TREATMEN,POISON
MODEL Y=CONSTANT+TREATMEN+POISON
SAVE TEMP / MODEL
ESTIMATE
REM Now use STATS to compute geometric mean.
STATS
USE TEMP
SAVE GMEAN
LET LY=LOG(Y)
STATS LY / MEAN
REM Now duplicate the geometric mean for every case.
MERGE GMEAN(LY) TEMP (Y,X(1..5))
IF CASE=1 THEN LET GMEAN=EXP(LY)
IF CASE>1 THEN LET GMEAN=LAG(GMEAN)
REM Now estimate the exponent, following Box&Cox
NONLIN
MODEL Y = B0 + B1*X(1) + B2*X(2) + B3*X(3) + B4*X(4) + B5*X(5)
LOSS = ((Y^POWER-1) /(POWER*GMEAN^(POWER-1))-ESTIMATE)^2
ESTIMATE

This program produces an estimate of -0.750 for lambda, with a 95% Wald confidence
interval of (-1.181, -0.319). This is in agreement with the results in the original paper.
Box and Cox recommend rounding the exponent to -1 because of its natural
interpretation (rate of dying from poison). In general, it is wise to round such
transformations to interpretable values such as ..., -1, -0.5, 0, 0.5, 2, ... to facilitate the
interpretation of results.
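The transformation and the geometric-mean scaling used in the LOSS statement above can be written out directly. The following Python sketch is an illustration only; the function names are hypothetical, and the λ = 0 branch of `normalized_boxcox` uses the usual limiting form (geometric mean times log y), which the SYSTAT program does not show.

```python
import math


def geometric_mean(ys):
    """Geometric mean of positive values, computed on the log scale."""
    return math.exp(sum(math.log(y) for y in ys) / len(ys))


def boxcox(y, lam):
    """Box-Cox transform: (y^lam - 1)/lam, or log(y) when lam is 0."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def normalized_boxcox(y, lam, gmean):
    """Transform scaled as in the LOSS statement: divide by lam*gmean^(lam-1)."""
    if lam == 0:
        return gmean * math.log(y)  # assumed limiting form as lam -> 0
    return (y ** lam - 1.0) / (lam * gmean ** (lam - 1.0))


# Small checks on made-up values.
gm = geometric_mean([1.0, 4.0, 16.0])
sqrt_case = boxcox(4.0, 0.5)
reciprocal_case = boxcox(2.0, -1.0)
near_zero = boxcox(2.0, 1e-6)
print(gm, sqrt_case, reciprocal_case, near_zero, math.log(2.0))
```

With an exponent of -1, the transform is a linear function of 1/y, which is why a rounded value of -1 can be read as a rate.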
The Box-Cox procedure is based on a specific model that assumes normality in the
transformed data and that focuses on the dependent variable. We might ask whether it
is worthwhile to examine transformations of this sort without assuming normality and
resorting to maximum likelihood for our answer. This is especially appropriate if our
general method is to find an optimal estimate of the exponent and then round it to
the nearest interpretable value based on a confidence interval. Indeed, two discussants
of the Box and Cox paper, John Hartigan and John Tukey, asked just that.
The conjoint model offers one approach to this question. Specifically, we can use a
power function relating the y data values to the predictor variables in our model and
see how it converges.
Following is the input:
USE BOXCOX
CONJOINT
MODEL Y=POISON TREATMEN
ESTIMATE / REGRESS=POWER

Following is the output:

Iterative Conjoint Analysis
Power Regression Model
Data are dissimilarities
Loss function is least squares

Factors and Levels
POISON      1   2   3
TREATMEN    1   2   3   4

Convergence Criterion: 0.000010
Maximum Iterations:    50

Iteration    Loss         Max parameter change
    1        0.1977795    0.1024469
    2        0.1661894    0.0530742
    3        0.1594770    0.1473320
    4        0.1571216    0.0973117
    5        0.1562271    0.0156619
    6        0.1559910    0.0193429
    7        0.1559285    0.0149959
    8        0.1559166    0.0034746
    9        0.1559135    0.0024772
   10        0.1559131    0.0016637
   11        0.1559129    0.0005579
   12        0.1559134    0.0004575
   13        0.1559129    0.0000321
   14        0.1559130    0.0000188
   15        0.1559127    0.0000021

Computed Exponent: -1.015

Parameter Estimates (Part Worths)
  POISON(1)    POISON(2)    POISON(3)  TREATMEN(1)  TREATMEN(2)  TREATMEN(3)
     -0.375       -0.138        0.634        0.423       -0.414        0.133
TREATMEN(4)
     -0.264

Goodness of Fit (Pearson correlation)
          Y
     -0.919

Root-mean-square deleted goodness of fit values, i.e. fit when param(i)=0
  POISON(1)    POISON(2)    POISON(3)  TREATMEN(1)  TREATMEN(2)  TREATMEN(3)
      0.872        0.912        0.785        0.866        0.868        0.914
TREATMEN(4)
      0.898

[Figures: Shepard diagram (Joint Score against Data) and Quick Graph utility functions (Measure) for POISON and TREATMEN]

On each iteration, CONJOINT transforms the observed (y) values by the current
estimate of the exponent, regresses them on the currently weighted X variables (using
the conjoint parameter estimates), and computes the loss from the residuals of that
regression. Over iterations, this loss is minimized and we get to view the final fit in the
plotted Shepard diagram.
The CONJOINT program produced an estimate of -1.015 for the exponent. Draper and
Hunter (1969) reanalyzed the poison data using several criteria suggested in the
discussion to Box and Cox's paper and elsewhere (minimizing interaction F ratio,
maximizing main-effects F ratios, and minimizing Levene's test for heterogeneity of
within-group variances). They found the best exponent to be in the neighborhood of -1.


Example 4
Employment Discrimination
The following table shows the mean salaries (SALNOW) of employees at a Chicago
bank. These data are from the BANK.SYD data set used in many SYSTAT manuals.
The bank was involved in a discrimination lawsuit, and the focus of our interest is
whether we can represent the salaries by a simple additive model. At the time these
data were collected, there were no black females with a graduate school education
working at the bank. The education variable records the highest level reached.
                 High School    College    Grad School
White Males         11735        16215        28251
Black Males         11513        13341        20472
White Females        9600        13612        11640
Black Females        8874        10278

Let's regress beginning salary (SALBEG) and current salary (SALNOW) on the gender
and education data. To represent our model, we will code the categories with integers:
for gender/race, 1=black females, 2=white females, 3=black males, 4=white males; for
education, 1=high school, 2=college, 3=grad school. These codings order the salaries
for both racial/gender status and educational levels.
Following is the input:
USE BANK
IF SEX=1 AND MINORITY=1 THEN LET GROUP=1
IF SEX=1 AND MINORITY=0 THEN LET GROUP=2
IF SEX=0 AND MINORITY=1 THEN LET GROUP=3
IF SEX=0 AND MINORITY=0 THEN LET GROUP=4
LET EDUC=1
IF EDLEVEL>12 THEN LET EDUC=2
IF EDLEVEL>16 THEN LET EDUC=3
LABEL GROUP / 1=Black_Females,2=White_Females,
3=Black_Males,4=White_Males
LABEL EDUC / 1=High_School,2=College,3=Grad_School
CONJOINT
MODEL SALBEG,SALNOW=GROUP EDUC
ESTIMATE / REGRESS=POWER

Following is the output:

Iterative Conjoint Analysis
Power Regression Model
Data are dissimilarities
Loss function is least squares

Factors and Levels
GROUP       Black_Female  White_Female  Black_Males  White_Males
EDUC        High_School   College       Grad_School

Convergence Criterion: 0.000010
Maximum Iterations:    50

Iteration    Loss         Max parameter change
    1        0.3932757    0.0931128
    2        0.3734472    0.2973392
    3        0.3631769    0.2928259
    4        0.3606965    0.1416823
    5        0.3589525    0.0244544
    6        0.3585654    0.0090515
    7        0.3584647    0.0252027
    8        0.3584328    0.0068830
    9        0.3584239    0.0016764
   10        0.3584233    0.0047662
   11        0.3584215    0.0009750
   12        0.3584225    0.0001914
   13        0.3584253    0.0001697
   14        0.3584253    0.0000182
   15        0.3584231    0.0000021
   16        0.3584189    0.0000004

Computed Exponent: -0.072

Parameter Estimates (Part Worths)
 GROUP(1)   GROUP(2)   GROUP(3)   GROUP(4)    EDUC(1)    EDUC(2)
   -0.366     -0.200     -0.034      0.144     -0.356     -0.010
  EDUC(3)
    0.823

Goodness of Fit (Pearson correlation)
   SALBEG     SALNOW
    0.815      0.787

Root-mean-square deleted goodness of fit values, i.e. fit when param(i)=0
 GROUP(1)   GROUP(2)   GROUP(3)   GROUP(4)    EDUC(1)    EDUC(2)
    0.782      0.785      0.801      0.795      0.753      0.801
  EDUC(3)
    0.696

[Figures: Shepard diagrams (Joint Score against Data) for SALBEG and SALNOW, and Quick Graph utility functions (Measure) for GROUP and EDUC]

The computed exponent (-0.072) suggests that a log transformation would be
appropriate for fitting a parametric model. The two salary measurements (salary at
time of hire and at time of the study) perform similarly, although beginning salary
shows a slightly better fit to the additive model (0.815 versus 0.787). You can see the
difference in the two printed Shepard diagrams. The estimates of the parameters show
clear orderings in the categories.
Check for sensitivity of the parameter estimates by examining the root-mean-square
deleted goodness of fit values. The reported values are averages of the fits for both SALBEG
and SALNOW when the respective parameter is set to zero. Here we find that the greatest
change in goodness of fit corresponds to a change in the Grad School parameter.


Transformed Additive Model


The transformed additive model removes the highly significant interaction for
SALNOW and almost removes it for SALBEG in these data. You can see this by
recoding the education and gender/race variables with the parameter estimates from
the conjoint analysis:
IF GROUP=1 THEN LET G=-.365
IF GROUP=2 THEN LET G=-.2
IF GROUP=3 THEN LET G=-.033
IF GROUP=4 THEN LET G=.147
IF EDUC=1 THEN LET E=-.359
IF EDUC=2 THEN LET E=-.011
IF EDUC=3 THEN LET E=.822
LET LSALB=LOG(SALBEG)
LET LSALN=LOG(SALNOW)
GLM
MODEL LSALB,LSALN = CONSTANT+E+G+E*G
ESTIMATE
HYPOTHESIS
EFFECT=E*G
TEST

Following is the output:

Number of cases processed: 474

Dependent variable means
                LSALB      LSALN
                8.753      9.441

Regression coefficients B = (X'X)^-1 X'Y
                LSALB      LSALN
CONSTANT        8.829      9.531
E               0.576      0.653
G               0.723      0.722
E*G             0.558      0.351

Multiple correlations
                LSALB      LSALN
                0.817      0.789

Squared multiple correlations
                LSALB      LSALN
                0.667      0.622

Adjusted R^2 = 1-(1-R^2)*(N-1)/df, where N = 474, and df = 470
                LSALB      LSALN
                0.665      0.620
------------------------------------------------------------------------------------
*** WARNING ***
Case 297 has large leverage (Leverage = 0.128)

Test for effect called: E*G

Univariate F Tests
Effect          SS       df       MS        F        P
LSALB         0.275       1     0.275    6.596    0.011
Error        19.628     470     0.042
LSALN         0.109       1     0.109    1.818    0.178
Error        28.219     470     0.060

Multivariate Test Statistics
Wilks' Lambda          = 0.986
F-Statistic            = 3.447    df = 2, 469    Prob = 0.033

Pillai Trace           = 0.014
F-Statistic            = 3.447    df = 2, 469    Prob = 0.033

Hotelling-Lawley Trace = 0.015
F-Statistic            = 3.447    df = 2, 469    Prob = 0.033
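The adjusted squared multiple correlations in this output follow from the formula printed with them, 1-(1-R²)(N-1)/df. The short Python sketch below reproduces both values from the printed (rounded) squared multiple correlations; the function name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2, n, df):
    """Adjusted R-squared as printed in the output: 1-(1-R^2)*(N-1)/df."""
    return 1.0 - (1.0 - r2) * (n - 1) / df


# Values taken from the GLM output: N = 474 cases, df = 470.
adj_lsalb = adjusted_r2(0.667, 474, 470)
adj_lsaln = adjusted_r2(0.622, 474, 470)
print(round(adj_lsalb, 3), round(adj_lsaln, 3))
```

Here df is N minus the four estimated coefficients (CONSTANT, E, G, and E*G).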

Ordered Scatterplots
Finally, let's use SYSTAT to produce scatterplots of beginning and current salary
ordered by the conjoint coefficients. The SYSTAT code to do this can be found in the
file CONJO4.SYC. The spacing of the scatterplots should tell the story.


The story is mainly in this graph: regardless of educational level, minorities and
women received lower salaries. There are a few exceptions to the general pattern, but
overall the bank had reason to settle the lawsuit.

Computation
All computations are in double precision.

Algorithms
CONJOINT uses a direct search optimization method to minimize the loss function.
This makes it possible to optimize a loss based on Kendall's tau. There is no guarantee
that the program will find the global optimum of tau, so it is wise to try several
regression types and the STRESS loss to be sure that they all reach approximately the
same neighborhood.


Missing Data
Missing values are processed by omitting them from the loss function.

References
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. Journal of the Royal
Statistical Society, Series B, 26, 211-252.
Brogden, H. E. (1977). The Rasch model, the law of comparative judgment and additive
conjoint measurement. Psychometrika, 42, 631-634.
Carmone, F. J., Green, P. E., and Jain, A. K. (1978). Robustness of conjoint analysis: Some
Monte Carlo results. Journal of Marketing Research, 15, 300-303.
Carroll, J. B., Davies, P., and Richmond, B. (1971). The word frequency book. Boston,
Mass.: Houghton, Mifflin.
Carroll, J. D. and Green, P. E. (1995). Psychometric methods in marketing research: Part I,
conjoint analysis. Journal of Marketing Research, 32, 385-391.
Crowe, G. (1980). Conjoint measurement's design considerations. PMRS Journal, 1, 8-13.
De Leeuw, J., Young, F. W., and Takane, Y. (1976). Additive structure in qualitative data:
An alternating least squares method with optimal scaling features. Psychometrika, 41,
471-503.
Draper, N. R. and Hunter, W. G. (1969). Transformations: Some examples revisited.
Technometrics, 11, 23-40.
Emery, D. R. and Barron, F. H. (1979). Axiomatic and numerical conjoint measurement:
An evaluation of diagnostic efficacy. Psychometrika, 44, 195-210.
Green, P. E., Carmone, F. J., and Wind, Y. (1972). Subjective evaluation models and
conjoint measurement. Behavioral Science, 17, 288-299.
Green, P. E. and DeSarbo, W. S. (1978). Additive decomposition of perceptions data via
conjoint analysis. Journal of Consumer Research, 5, 58-65.
Green, P. E. and Rao, V. R. (1971). Conjoint measurement for quantifying judgmental data.
Journal of Marketing Research, 8, 355-363.
Green, P. E. and Srinivasan, V. (1978). Conjoint analysis in consumer research: Issues and
outlook. Journal of Consumer Research, 5, 103-123.
Green, P. E. and Srinivasan, V. (1990). Conjoint analysis in marketing: New developments
with implications for research and practice. Journal of Marketing, 54, 3-19.
Heiser, W. J. and Meulman, J. J. (1995). Nonlinear methods for the analysis of
homogeneity and heterogeneity. In W. J. Krzanowski (ed.), Recent advances in
descriptive multivariate analysis, 51-89. Oxford: Clarendon Press.
Hensher, D. A. and Johnson, L. W. (1981). Applied discrete choice modeling. London:
Croom Helm.
Krantz, D. H. (1964). Conjoint measurement: The Luce-Tukey axiomatization and some
extensions. Journal of Mathematical Psychology, 1, 248-277.
Krantz, D. H. and Tversky, A. (1971). Conjoint measurement analysis of composition rules
in psychology. Psychological Review, 78, 151-169.
Kruskal, J. B. (1965). Analysis of factorial experiments by estimating monotone
transformations of the data. Journal of the Royal Statistical Society, Series B, 27,
251-263.
Kruskal, J. B. and Carmone, F. J. (1969). MONANOVA: A Fortran-IV program for
monotone analysis of variance (non-metric analysis of factorial experiments).
Behavioral Science, 14, 165-166.
Louviere, J. J. (1988). Analyzing decision making: Metric conjoint analysis. Newbury
Park, Calif.: Sage Publications.
Louviere, J. J. (1991). Experimental choice analysis: Introduction and review. Journal of
Business Research, 23, 291-297.
Louviere, J. J. (1994). Conjoint analysis. In R. Bagozzi (ed.), Handbook of Marketing
Research, 223-259. Oxford: Blackwell Publishers.
Luce, R. D. (1966). Two extensions of conjoint measurement. Journal of Mathematical
Psychology, 3, 348-370.
Luce, R. D. and Tukey, J. W. (1964). Simultaneous conjoint measurement: A new type of
fundamental measurement. Journal of Mathematical Psychology, 1, 1-27.
Nygren, T. E. (1986). A two-stage algorithm for assessing violations
of additivity via axiomatic and numerical conjoint analysis. Psychometrika, 51,
483-491.
Oppewal, H. (1995). A review of conjoint software. Journal of Retailing and Consumer
Services, 2, 55-61.
Srinivasan, V. and Shocker, A. D. (1973). Linear programming techniques for
multidimensional analysis of preference. Psychometrika, 38, 337-369.
Steinberg, D. (1992). Applications of logit models in market research. 1992
Sawtooth/SYSTAT Software Conference Proceedings, 405-424. Ketchum, Idaho: Sawtooth
Software, Inc.
Tversky, A. (1967). A general theory of polynomial conjoint measurement. Journal of
Mathematical Psychology, 4, 1-20.
Umesh, U. N. and Mishra, S. (1990). A Monte Carlo investigation of conjoint analysis
index-of-fit: Goodness of fit, significance and power. Psychometrika, 55, 33-44.
Weeks, D. G. and Bentler, P. M. (1979). A comparison of linear and monotone
multidimensional scaling models. Psychological Bulletin, 86, 349-354.

Chapter 6
Correlations, Similarities, and Distance Measures
Leland Wilkinson, Laszlo Engelman, and Rick Marcantonio

Correlations computes correlations and measures of similarity and distance. It prints
the resulting matrix and, if requested, saves it in a SYSTAT file for further analysis,
such as multidimensional scaling, cluster, or factor analysis.
For continuous data, Correlations provides the Pearson correlation, covariances,
and sums of squares of deviations from the mean and sums of cross-products of
deviations (SSCP). In addition to the usual probabilities, the Bonferroni and
Dunn-Sidak adjustments are available with Pearson correlations. If distances are desired,
Euclidean or city-block distances are available. Similarity measures for continuous
data include the Bray-Curtis coefficient and the QSK quantitative symmetric
coefficient (or Kulczynski measure).
For rank-order data, Correlations provides Goodman-Kruskal's gamma,
Guttman's mu2, Spearman's rho, and Kendall's tau.
For binary data, Correlations provides S2, the positive matching dichotomy
coefficient; S3, Jaccard's dichotomy coefficient; S4, the simple matching dichotomy
coefficient; S5, Anderberg's dichotomy coefficient; and S6, Tanimoto's dichotomy
coefficient. When underlying distributions are assumed to be normal, the tetrachoric
correlation is available.
When data are missing, listwise and pairwise deletion methods are available for all
measures. An EM algorithm is an option for maximum likelihood estimates of
correlation, covariance, and cross-products of deviations matrices. For robust ML
estimates where outliers are downweighted, the user can specify the degrees of
freedom for the t distribution or contamination for a normal distribution. Correlations
includes a graphical display of the pattern of missing values. Little's MCAR test is
printed with the display. The EM algorithm also identifies cases with extreme
Mahalanobis distances.

Hadi's robust outlier detection and estimation procedure is an option for
correlations, covariances, and SSCP; cases identified as outliers by the procedure are
not used to compute estimates.

Statistical Background
SYSTAT computes many different measures of the strength of association between
variables. The most popular measure is the Pearson correlation, which is appropriate
for describing linear relationships between continuous variables. However, CORR
offers a variety of alternative measures of similarity and distance appropriate if the data
are not continuous.
Let's look at an example. The following data, from the CARS file, are taken from
various issues of Car and Driver and Road & Track magazine. They are the car
enthusiast's equivalent of Consumer Reports performance ratings. The cars rated
include some of the most expensive and exotic cars in the world (for example, Ferrari
Testarossa) as well as some of the least expensive but sporty cars (for example, Honda
Civic CRX). The attributes measured are 0–60 m.p.h. acceleration, braking distance in
feet from 60–0 m.p.h., slalom times (speed over a twisty course), miles per gallon, and
top speed in miles per hour.
ACCEL   BRAKE   SLALOM    MPG   SPEED   NAME$

  5.0     245     61.3   17.0     153   Porsche 911T
  5.3     242     61.9   12.0     181   Testarossa
  5.8     243     62.6   19.0     154   Corvette
  7.0     267     57.8   14.5     145   Mercedes 560
  7.6     271     59.8   21.0     124   Saab 9000
  7.9     259     61.7   19.0     130   Toyota Supra
  8.5     263     59.9   17.5     131   BMW 635
  8.7     287     64.2   35.0     115   Civic CRX
  9.3     258     64.1   24.5     129   Acura Legend
 10.8     287     60.8   25.0     100   VW Fox GL
 13.0     253     62.3   27.0      95   Chevy Nova


The Scatterplot Matrix (SPLOM)

A convenient way to summarize the relationships between the performance variables
is to arrange them in a matrix. A matrix is a rectangular array. We can put any sort of
numbers in the cells of the matrix, but we will focus on measures of association. Before
doing that, however, let's examine a graphical matrix, the scatterplot matrix
(SPLOM).

[Figure: scatterplot matrix (SPLOM) of ACCEL, BRAKE, SLALOM, MPG, and SPEED]

This matrix shows the histograms of each variable on the diagonal and the scatterplots
(x-y plots) of each variable against the others. For example, the scatterplot of
acceleration versus braking is at the top of the matrix. Since the matrix is symmetric,
only the bottom half is shown. In other words, the plot of acceleration versus braking
is the same as the transposed scatterplot of braking versus acceleration.

The Pearson Correlation Coefficient


Now, assume that we want a single number that summarizes how well we could predict
acceleration from braking using a straight line. For linear regression, we discuss how
we calculate such a line, but it is enough here to know that we are interested in drawing
a line through the area covered by the points in the scatterplot such that, on average,
the acceleration of a car could be predicted rather well by the value on the line
corresponding to its braking. The closer the points cluster around this line, the better
would be the prediction.

In addition, we want this number to represent simultaneously how well we can
predict braking from acceleration using a similar line. This symmetry we seek is
fundamental to all the measures available in CORR. It means that, whatever the scales
on which we measure our variables, the coefficient of association we compute will be
the same for either prediction. If this symmetry makes no sense for a certain data set,
then you probably should not be using CORR.
The most common measure of association is the Pearson correlation coefficient,
which varies between -1 and +1. A Pearson correlation of 0 indicates that neither of
two variables can be predicted from the other by using a linear equation. A Pearson
correlation of 1 indicates that one variable can be predicted perfectly by a positive
linear function of the other, and vice versa. And a value of -1 indicates the same,
except that the function has a negative sign for the slope of the line.
Following is the Pearson correlation matrix corresponding to this SPLOM:
Pearson Correlation Matrix

            ACCEL    BRAKE   SLALOM      MPG    SPEED
ACCEL       1.000
BRAKE       0.466    1.000
SLALOM      0.176   -0.097    1.000
MPG         0.651    0.622    0.597    1.000
SPEED      -0.908   -0.665   -0.115   -0.768    1.000

Number of Observations: 11

Try superimposing in your mind the correlation matrix on the SPLOM. The Pearson
correlation for acceleration versus braking is 0.466. This correlation is positive and
moderate in size. On the other hand, the correlation between acceleration and speed is
negative and quite large (-0.908). You can see in the lower left corner of the SPLOM
that the points cluster around a downward sloping line. In fact, all of the correlations
of speed with the other variables are negative, which makes sense since greater speed
implies greater performance. The same is true for slalom performance, but this is
clouded by the fact that some small but slower cars like the Honda Civic CRX are
extremely agile.
Keep in mind that the Pearson correlation measures linear predictability. Do not
assume that a Pearson correlation near 0 implies no relationship between variables.
Many nonlinear associations (U- and S-shaped curves, for example) can have Pearson
correlations of 0.
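
As a rough illustration of both points, the following Python sketch (not SYSTAT code) computes a Pearson correlation directly from its definition. It uses the ACCEL and BRAKE columns of the cars data, plus a small made-up U-shaped data set in which y is completely determined by x even though the Pearson correlation is 0. The function name `pearson` and the U-shaped values are illustrative assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# ACCEL and BRAKE columns from the cars data
accel = [5.0, 5.3, 5.8, 7.0, 7.6, 7.9, 8.5, 8.7, 9.3, 10.8, 13.0]
brake = [245, 242, 243, 267, 271, 259, 263, 287, 258, 287, 253]
r_cars = pearson(accel, brake)
print(round(r_cars, 3))

# Hypothetical U-shaped data: y = x squared, so y depends entirely on x,
# yet no straight line predicts it and the correlation is 0
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]
r_u = pearson(x, y)
print(r_u)
```

Because the formula treats x and y identically, `pearson(accel, brake)` and `pearson(brake, accel)` give the same value, which is the symmetry described earlier.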

Other Measures of Association

CORR offers a variety of other association measures. There is not room here to discuss
all of them, but let's review some briefly.

Measures for Rank-Order Data


Several measures are available for rank-order data: Goodman-Kruskal's gamma,
Guttman's mu2, Spearman's rho, and Kendall's tau. Each measures an aspect of
rank-order association. The one closest to Pearson is the Spearman. Spearman's rho is
simply a Pearson correlation computed on the same data after converting them to
ranks. Goodman-Kruskal's gamma and Kendall's tau reflect the tendency for two cases
to have similar orderings on two variables. However, the former focuses on cases
which are not tied in rank orderings. If no ties exist, these two measures will be equal.
Following is the same matrix computed for Spearman's rho:
Matrix of Spearman Correlation Coefficients

            ACCEL    BRAKE   SLALOM      MPG    SPEED
ACCEL       1.000
BRAKE       0.501    1.000
SLALOM      0.245   -0.305    1.000
MPG         0.815    0.502    0.487    1.000
SPEED      -0.891   -0.651   -0.109   -0.884    1.000

Number of observations: 11

It is often useful to compute both a Spearman and Pearson matrix on the same data.
The absolute difference between the two can reveal unusual features. For example, the
greatest difference for our data is on the slalom-braking correlation. This is because the
Honda Civic CRX is so fast through the slalom, despite its inferior brakes, that it
attenuates the Pearson correlation between slalom and braking. The Spearman
correlation reduces its influence.
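
The statement that Spearman's rho is a Pearson correlation computed on ranks can be sketched in a few lines of Python (an illustration, not SYSTAT code). The helper names `ranks`, `pearson`, and `spearman` are hypothetical, and giving tied values the average of their ranks is an assumption about tie handling rather than something stated in this chapter.

```python
from math import sqrt

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# ACCEL and SPEED columns from the cars data
accel = [5.0, 5.3, 5.8, 7.0, 7.6, 7.9, 8.5, 8.7, 9.3, 10.8, 13.0]
speed = [153, 181, 154, 145, 124, 130, 131, 115, 129, 100, 95]
rho = spearman(accel, speed)
print(round(rho, 3))
```

For these two columns the result agrees with the ACCEL-SPEED entry of the Spearman matrix above.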

Dissimilarity and Distance Measures


These measures include the Bray-Curtis (BC) dissimilarity measure, the quantitative
symmetric dissimilarity coefficient, the Euclidean distance, and the city-block
distance.


Euclidean and city-block distance measures have been widely available in software
packages for many years; Bray-Curtis and QSK are less common. For each pair of
variables,

\[
\text{Bray-Curtis} = \frac{\sum_k |x_{ik} - x_{jk}|}{\sum_k x_{ik} + \sum_k x_{jk}}
\]

\[
\text{QSK} = 1 - \frac{1}{2}\left(\frac{1}{\sum_k x_{ik}} + \frac{1}{\sum_k x_{jk}}\right)\sum_k \min(x_{ik}, x_{jk})
\]

where i and j are variables and k is cases. After an extensive computer simulation study,
Faith, Minchin, and Belbin (1987) concluded that BC and QSK were effective as
robust measures in terms of both rank and linear correlation. The use of these
measures is similar to that for Correlations (Pearson, Covariance, and SSCP), except
the EM, Prob, Bonferroni, Dunn-Sidak, and Hadi options are not available.
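
A minimal Python sketch of the two dissimilarity coefficients follows. It uses the usual textbook definitions of the Bray-Curtis and Kulczynski (quantitative symmetric) measures, which is an assumption about what CORR computes internally. The function names and the small data vectors are hypothetical.

```python
def bray_curtis(x, y):
    """Sum of absolute differences divided by the sum of all values."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    return numerator / (sum(x) + sum(y))

def qsk(x, y):
    """Quantitative symmetric (Kulczynski) dissimilarity coefficient."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 1 - 0.5 * (shared / sum(x) + shared / sum(y))

# Hypothetical nonnegative measurements of two variables over three cases
x = [1, 2, 3]
y = [2, 2, 5]
print(bray_curtis(x, y))   # 3 / 15
print(qsk(x, y))           # 1 - 0.5 * (6/6 + 6/9)
```

Both functions return 0 for identical variables and 1 for variables that never take positive values on the same case, so larger values mean less similar variables.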

Measures for Binary Data


Correlations offers the following association measures for binary data: positive
matching dichotomy coefficients (S2), Jaccard's dichotomy coefficients (S3), simple
matching dichotomy coefficients (S4), Anderberg's dichotomy coefficients (S5),
Tanimoto's dichotomy coefficients (S6), and tetrachoric correlations.
Dichotomy coefficients. These coefficients relate variables whose values may represent
the presence or absence of an attribute or simply two values. They are documented in
Gower (1985). These coefficients were chosen for SYSTAT because they are metric
and produce symmetric positive semidefinite (Gramian) matrices, provided that you do
not use the pairwise deletion option. This makes them suitable for multidimensional
scaling and factoring as well as clustering. The following table shows how the
similarity coefficients are computed:

                 x_j
               1      0
x_i     1      a      b      a+b
        0      c      d      c+d
              a+c    b+d

S2 = a / (a + b + c + d)             Proportion of pairs with both values present
S3 = a / (a + b + c)                 Proportion of pairs with both values present,
                                     given that at least one occurs
S4 = (a + d) / (a + b + c + d)       Proportion of pairs where the values of both
                                     variables agree
S5 = a / (a + 2(b + c))              S3 standardized by all possible patterns of
                                     agreement and disagreement
S6 = (a + d) / (a + 2(b + c) + d)    S4 standardized by all possible patterns of
                                     agreement and disagreement

When the absence of an attribute in both variables is deemed to convey no information,
d should not be included in the coefficient (see S3 and S5).
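
The five dichotomy coefficients can be computed from the cell counts a, b, c, and d of the 2 x 2 table. The Python sketch below (not SYSTAT code) counts the cells from two 0/1 variables and applies the formulas; the function name and the example data are hypothetical.

```python
def dichotomy_coefficients(x, y):
    """S2 through S6 for two variables scored 0 or 1 over the same cases."""
    pairs = list(zip(x, y))
    a = sum(1 for u, v in pairs if u == 1 and v == 1)  # both present
    b = sum(1 for u, v in pairs if u == 1 and v == 0)  # present in x only
    c = sum(1 for u, v in pairs if u == 0 and v == 1)  # present in y only
    d = sum(1 for u, v in pairs if u == 0 and v == 0)  # both absent
    # Note: S3 and S5 are undefined when a + b + c is 0 (division by zero)
    return {
        "S2": a / (a + b + c + d),
        "S3": a / (a + b + c),
        "S4": (a + d) / (a + b + c + d),
        "S5": a / (a + 2 * (b + c)),
        "S6": (a + d) / (a + 2 * (b + c) + d),
    }

# Hypothetical data: five cases, giving a = 2, b = 1, c = 1, d = 1
x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
coef = dichotomy_coefficients(x, y)
for name, value in coef.items():
    print(name, round(value, 3))
```

S3 and S5 ignore d, so adding more cases where both attributes are absent changes S2, S4, and S6 but leaves S3 and S5 unchanged.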
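Tetrachoric correlation. While the data for this measure are binary, they are assumed to
be a random sample from a bivariate normal distribution. For example, let's draw a
horizontal line and a vertical line on this bivariate normal distribution and count the
number of observations in each quadrant.

[Figure: scatterplot of a bivariate normal sample divided into four quadrants by a
vertical line at X0 and a horizontal line at Y0, with the count of observations shown
in each quadrant (the counts 19 and 17 are legible)]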

A large proportion of the observations fall in the upper right and lower left quadrants
because the relationship is positive (the Pearson correlation is approximately 0.70).
Correspondingly, if there were a strong negative relationship, the points would
concentrate in the upper left and lower right quadrants. If the original observations are
no longer available but you do have the frequency counts for the four quadrants, try a
tetrachoric correlation.
The computations for the tetrachoric correlation begin by finding estimates of the
inverse cumulative marginal distributions:

\[
z \text{ value for } x_0 = \Phi^{-1}\!\left(\frac{17 + 5}{45}\right)
\quad \text{and} \quad
z \text{ value for } y_0 = \Phi^{-1}\!\left(\frac{17 + 4}{45}\right)
\]

and using these values as limits when integrating the bivariate normal density
expressed in terms of ρ, the correlation, and then solving for ρ.
If you have the original data, don't bother dichotomizing them because the
tetrachoric correlation has an efficiency of 0.40 compared with the efficient Pearson
correlation estimate.
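
The computation described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not SYSTAT's algorithm: the quadrant counts (17 lower left, 5 upper left, 4 lower right, 19 upper right) are inferred from the marginal proportions (17 + 5)/45 and (17 + 4)/45; the bivariate normal probability is obtained from the standard identity that its derivative with respect to the correlation equals the bivariate density at the cut points, integrated with Simpson's rule; and the correlation is found by bisection. All function names are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def bvn_density(h, k, r):
    """Standard bivariate normal density at (h, k) with correlation r."""
    q = 1.0 - r * r
    return exp(-(h * h - 2.0 * r * h * k + k * k) / (2.0 * q)) / (2.0 * pi * sqrt(q))

def bvn_cdf(h, k, rho, steps=200):
    """P(X < h, Y < k): the value at rho = 0 plus the integral of the
    density over correlations from 0 to rho (Simpson's rule, even steps)."""
    width = rho / steps
    total = bvn_density(h, k, 0.0) + bvn_density(h, k, rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * bvn_density(h, k, i * width)
    std = NormalDist()
    return std.cdf(h) * std.cdf(k) + total * width / 3.0

def tetrachoric(lower_left, upper_left, lower_right, upper_right):
    """Correlation whose bivariate normal reproduces the lower-left proportion.
    Assumes all marginal proportions are strictly between 0 and 1."""
    n = lower_left + upper_left + lower_right + upper_right
    std = NormalDist()
    h = std.inv_cdf((lower_left + upper_left) / n)   # z value for the x cut
    k = std.inv_cdf((lower_left + lower_right) / n)  # z value for the y cut
    target = lower_left / n
    lo, hi = -0.999, 0.999
    for _ in range(60):  # bisection: the probability increases with rho
        mid = (lo + hi) / 2
        if bvn_cdf(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Quadrant counts as inferred above (an assumption about the figure)
rho_hat = tetrachoric(17, 5, 4, 19)
print(round(rho_hat, 2))
```

The estimate comes only from the four counts, so it need not equal the Pearson correlation of the original observations.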

Transposed Data
You can use CORR to compute measures of association on the rows or columns of your
data. Simply transpose the data and then use CORR. This makes sense when you want
to assess similarity between rows. We might be interested in identifying similar cars
from our performance measures, for example. Recall that you cannot transpose a file
that contains character data.
When you compute association measures across rows, however, be sure that the
variables are on comparable scales. Otherwise, a single variable will influence most of
the association. With the cars data, braking and speed are so large that they would
almost uniquely determine the similarity between cars. Consequently, we standardized
the data before transposing them. That way, the correlations measure the similarities
comparably across attributes.
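
The standardize-then-transpose idea can be sketched in Python (an illustration, not SYSTAT code). Each attribute of the cars data is converted to z-scores across the 11 cars, the table is transposed so that each car becomes a list of five z-scores, and a Pearson correlation is computed between two cars. The function names are hypothetical.

```python
from math import sqrt

def standardize(column):
    """Convert a column to z-scores (mean 0, sample standard deviation 1)."""
    n = len(column)
    mean = sum(column) / n
    sd = sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# The five attribute columns of the cars data, in the order listed earlier
accel = [5.0, 5.3, 5.8, 7.0, 7.6, 7.9, 8.5, 8.7, 9.3, 10.8, 13.0]
brake = [245, 242, 243, 267, 271, 259, 263, 287, 258, 287, 253]
slalom = [61.3, 61.9, 62.6, 57.8, 59.8, 61.7, 59.9, 64.2, 64.1, 60.8, 62.3]
mpg = [17.0, 12.0, 19.0, 14.5, 21.0, 19.0, 17.5, 35.0, 24.5, 25.0, 27.0]
speed = [153, 181, 154, 145, 124, 130, 131, 115, 129, 100, 95]

z_columns = [standardize(c) for c in (accel, brake, slalom, mpg, speed)]
# Transpose: one list of five z-scores per car
z_rows = [list(row) for row in zip(*z_columns)]

# Rows 0 and 1 are the Porsche 911T and the Testarossa
r_porsche_testarossa = pearson(z_rows[0], z_rows[1])
print(round(r_porsche_testarossa, 3))
```

Without the standardizing step, BRAKE and SPEED would dominate every row because their raw values are so much larger than the others.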


Following is the Pearson correlation matrix for our cars:


Pearson Correlation Matrix

          PORSCHE  FERRARI CORVETTE MERCEDES     SAAB
PORSCHE     1.000
FERRARI     0.940    1.000
CORVETTE    0.939    0.868    1.000
MERCEDES    0.093    0.212   -0.240    1.000
SAAB       -0.506   -0.523   -0.760    0.664    1.000
TOYOTA      0.238    0.429    0.402   -0.379   -0.680
BMW        -0.319   -0.095   -0.557    0.854    0.634
HONDA      -0.504   -0.730   -0.393   -0.519    0.265
ACURA      -0.046   -0.102    0.298   -0.978   -0.770
VW         -0.962   -0.928   -0.980    0.079    0.704
CHEVY      -0.731   -0.698   -0.491   -0.532   -0.131

           TOYOTA      BMW    HONDA    ACURA       VW
TOYOTA      1.000
BMW        -0.247    1.000
HONDA      -0.298   -0.500    1.000
ACURA       0.533   -0.788    0.349    1.000
VW         -0.353    0.391    0.552   -0.156    1.000
CHEVY      -0.034   -0.064    0.320    0.536    0.525

            CHEVY
CHEVY       1.000

Number of observations:

Hadi Robust Outlier Detection


Hadi robust outlier detection identifies specific cases as outliers (if there are any) and
then uses the acceptable cases to compute the requested measure in the usual way.
Following are the steps for this procedure:
• Compute a robust covariance matrix by finding the median (instead of the mean)
  for each variable and using (x_i - median)^2 in the calculation of each
  covariance. If the resulting matrix is singular, reconstruct another after inflating the
  smallest eigenvalues by a small amount.
• Use this robust estimate of the covariance matrix to compute Mahalanobis
  distances and then use the distance to rank the cases.
• Use the half of the sample with the lowest ranks to compute the usual covariance
  matrix (that is, deviations from the mean).
• Use this covariance matrix to compute new distances for the complete sample and
  rerank the cases.
• After ranking, select the same number of cases with small ranks as before but add
  the case with the next largest rank and repeat the process, each time updating the
  covariance matrix, computing and sorting new distances, and increasing the
  subsample size by one.
• Continue adding cases until the entering one exceeds an internal limit based on a
  chi-square statistic (see Hadi, 1994). The cases remaining (not entered) are
  identified as outliers.
• Use the cases that are not identified as outliers to compute the measure requested
  in the usual way.

Correlations in SYSTAT
Correlations Main Dialog Box
To open the Correlations dialog box, from the menus choose:
Statistics
Correlations
Simple

Variables. Available only if One is selected for Sets. All selected variables are
correlated with all other variables in the list, producing a triangular correlation matrix.


Rows. Available only if Two is selected for Sets. Selected variables are correlated with
all column variables, producing a rectangular matrix.
Columns. Available only if Two is selected for Sets. Selected variables are correlated
with all row variables, producing a rectangular matrix.
Sets. One set creates a single, triangular correlation matrix of all variables in the
Variable(s) list. Two sets creates a rectangular matrix of variables in the Row(s) list
correlated with variables in the Column(s) list.
Listwise. Listwise deletion of missing data. Any case with missing data for any variable
in the list is excluded.
Pairwise. Pairwise deletion of missing data. Only cases with missing data for one of
the variables in the pair being correlated are excluded.
Save file. Saves the correlation matrix to a file.
Types. Type of data or measure. You can select from a variety of distance measures, as
well as measures for continuous data, rank-order data, and binary data.

Measures for Continuous Data


The following measures are available for continuous data:
• Pearson. Produces a matrix of Pearson product-moment correlation coefficients.
  Pearson correlations vary between -1 and +1. A value of 0 indicates that neither of
  two variables can be predicted from the other by using a linear equation. A Pearson
  correlation of 1 or -1 indicates that one variable can be predicted perfectly by a
  linear function of the other.
• Covariance. Produces a covariance matrix.
• SSCP. Produces a sum of cross-products matrix. If the Pairwise option is chosen,
  sums are weighted by N/n, where n is the count for a pair.

The Pearson, Covariance, and SSCP measures are related. The entries in an SSCP
matrix are sums of squares of deviations (from the mean) and sums of cross-products
of deviations. If you divide each entry by (n - 1), variances result from the sums of
squares and covariances from the sums of cross-products. Divide each covariance by
the product of the standard deviations (of the two variables) and the result is a
correlation.
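
The chain from SSCP to covariance to correlation can be traced step by step in a short Python sketch (an illustration, not SYSTAT code). It uses the ACCEL and BRAKE columns of the cars data from the Statistical Background section; the function names are hypothetical.

```python
from math import sqrt

def sscp_matrix(columns):
    """Sums of squares and cross-products of deviations from the means."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    return [
        [sum((a - mi) * (b - mj) for a, b in zip(ci, cj))
         for cj, mj in zip(columns, means)]
        for ci, mi in zip(columns, means)
    ]

def covariance_from_sscp(sscp, n):
    """Divide every SSCP entry by (n - 1)."""
    return [[entry / (n - 1) for entry in row] for row in sscp]

def correlation_from_covariance(cov):
    """Divide each covariance by the product of the two standard deviations."""
    sd = [sqrt(cov[i][i]) for i in range(len(cov))]
    return [
        [cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
        for i in range(len(cov))
    ]

accel = [5.0, 5.3, 5.8, 7.0, 7.6, 7.9, 8.5, 8.7, 9.3, 10.8, 13.0]
brake = [245, 242, 243, 267, 271, 259, 263, 287, 258, 287, 253]

s = sscp_matrix([accel, brake])
cov = covariance_from_sscp(s, len(accel))
corr = correlation_from_covariance(cov)
print(round(corr[0][1], 3))
```

The diagonal of `cov` holds the variances, and the off-diagonal entry of `corr` is the Pearson correlation between the two variables.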


Distance and Dissimilarity Measures


Correlations offers two dissimilarity measures and two distance measures:
• Bray-Curtis. Produces a matrix of dissimilarity measures for continuous data.
• QSK. Produces a matrix of symmetric dissimilarity coefficients. Also called the
  Kulczynski measure.
• Euclidean. Produces a matrix of Euclidean distances normalized by the sample size.
• City. Produces a matrix of city-block, or first-power, distances (sum of absolute
  discrepancies) normalized by the sample size.

Measures for Rank-Order Data


If your data are simply ranks of attributes, or if you want to see how well variables are
associated when you pay attention to rank ordering, you should consider the following
measures available for ranked data:
• Spearman. Produces a matrix of Spearman rank-order correlation coefficients. This
  measure is a nonparametric version of the Pearson correlation coefficient, based on
  the ranks of the data rather than the actual values.
• Gamma. Produces a matrix of Goodman-Kruskal's gamma coefficients.
• MU2. Produces a matrix of Guttman's mu2 monotonicity coefficients.
• Tau. Produces a matrix of Kendall's tau-b rank-order coefficients.

Measures for Binary Data


These coefficients relate variables assuming only two values. The dichotomy
coefficients work only for dichotomous data scored as 0 or 1.
The following measures are available for binary data:
• Positive matching (S2). Produces a matrix of positive matching dichotomy
  coefficients.
• Jaccard (S3). Produces a matrix of Jaccard's dichotomy coefficients.
• Simple matching (S4). Produces a matrix of simple matching dichotomy
  coefficients.
• Anderberg (S5). Produces a matrix of Anderberg's dichotomy coefficients.
• Tanimoto (S6). Produces a matrix of Tanimoto's dichotomy coefficients.
• Tetra. Produces a matrix of tetrachoric correlations.

Correlations Options
To specify options for correlations, click Options in the Correlations dialog box.

The following options are available:


Probabilities. Requests the probability of each correlation coefficient to test that the
correlation is 0. Appropriate if you select only one correlation coefficient to test.
Bonferroni and Dunn-Sidak use adjusted probabilities. Available only for Pearson
product-moment correlations.
(EM) Estimation. Requests the EM algorithm to estimate Pearson correlation,
covariance, or SSCP matrices from data with missing values. Little's MCAR test is
displayed with a graphical display of the pattern of missing values. For robust
estimates where outliers are downweighted, select Normal or t.
• Normal produces maximum likelihood estimates for a contaminated multivariate
  normal sample. For the contaminated normal, SYSTAT assumes that the
  distribution is a mixture of two normal distributions (same mean, different
  variances) with a specified probability of contamination. The Probability value is
  the probability of contamination (for example, 0.10), and Variance is the variance
  of contamination. Downweighting for the normal model tends to be concentrated
  in a few outlying cases.
• t produces maximum likelihood estimates for a t distribution, where df is the
  degrees of freedom. Downweighting for the multivariate t model tends to be more
  spread out than for the normal model. The degree of downweighting is inversely
  related to the degrees of freedom.
Iterations. Specifies the maximum number of iterations for computing the estimates.
Convergence. Defines the convergence criterion. If the relative change of the covariance
entries is less than the specified value, convergence is assumed.
Hadi outlier identification and estimation. Requests the HADI multivariate outlier
detection algorithm to identify outliers and to compute the correlation, covariance,
or SSCP matrix from the remaining cases. Tolerance omits variables with a multiple
R-square value greater than (1 - n), where n is the specified tolerance value.
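
The Bonferroni and Dunn-Sidak adjustments listed under Probabilities can be sketched with their standard textbook definitions. In the Python illustration below, taking m to be the number of distinct correlations in the matrix is an assumption about how SYSTAT counts the tests, and the function names are hypothetical.

```python
def bonferroni(p, m):
    """Bonferroni-adjusted probability for one of m tests (capped at 1)."""
    return min(1.0, m * p)

def dunn_sidak(p, m):
    """Dunn-Sidak-adjusted probability for one of m tests."""
    return 1.0 - (1.0 - p) ** m

# With k variables there are k(k - 1)/2 distinct correlations
k = 9
m = k * (k - 1) // 2

p = 0.001  # hypothetical unadjusted probability for one correlation
print(m, bonferroni(p, m), round(dunn_sidak(p, m), 5))
```

The Dunn-Sidak value is never larger than the Bonferroni value, and both grow with the number of correlations tested, which is why an unadjusted probability suits only a single preselected correlation.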

Using Commands
First, specify your data with USE filename. Then type CORR and specify your measure:

Full matrix          MEASURE varlist / options
Portion of matrix    MEASURE rowlist * collist / options

MEASURE is one of:

BC           QSK    EUCLIDEAN   CITY    SPEARMAN
GAMMA        MU2    TAU         TETRA   S2
S3           S4     S5          S6      PEARSON
COVARIANCE   SSCP

For PEARSON, COVARIANCE, and SSCP, the following options are available:
EM
T=df
NORMAL=n1,n2
ITER=n
CONV=n
HADI
TOL=n

In addition, PEARSON offers BONF, DUNN, and PROB as options.


Usage Considerations
Types of data. CORR uses rectangular data only.
Print options. With PRINT=LONG, SYSTAT prints the mean of each variable. In
addition, for EM estimation, SYSTAT prints an iteration history, missing value
patterns, Little's MCAR test, and mean estimates.
Quick Graphs. CORR includes a SPLOM (matrix of scatterplots) where the data in each
plot correspond to a value in the matrix.
Saving files. CORR saves the correlation matrix or other measure computed. SYSTAT
automatically defines the type of file as CORR, DISS, COVA, SSCP, SIMI, or RECT.
BY groups. CORR analyzes data by groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. FREQ=<variable> increases the number of cases by the FREQ
variable.
Case weights. WEIGHT is available in CORR.

Examples
Example 1
Pearson Correlations
This example uses data from the OURWORLD file that contains records (cases) for 57
countries. We are interested in correlations among variables recording the percentage
of the population living in cities, birth rate, gross domestic product per capita, dollars
expended per person for the military, ratio of birth rates to death rates, life expectancy
(in years) for males and females, percentage of the population who can read, and gross
national product per capita in 1986. The input is:
CORR
USE ourworld
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86

The output follows:

Pearson correlation matrix

            URBAN BIRTH_RT  GDP_CAP      MIL   B_TO_D LIFEEXPM LIFEEXPF
URBAN       1.000
BIRTH_RT   -0.800    1.000
GDP_CAP     0.625   -0.762    1.000
MIL         0.597   -0.672    0.899    1.000
B_TO_D     -0.307    0.511   -0.659   -0.607    1.000
LIFEEXPM    0.776   -0.922    0.664    0.582   -0.211    1.000
LIFEEXPF    0.801   -0.949    0.704    0.619   -0.265    0.989    1.000
LITERACY    0.800   -0.930    0.637    0.562   -0.274    0.911    0.935
GNP_86      0.592   -0.689    0.964    0.873   -0.560    0.633    0.665

         LITERACY   GNP_86
LITERACY    1.000
GNP_86      0.611    1.000

Number of observations: 49

[Quick Graph: scatterplot matrix (SPLOM) of URBAN, BIRTH_RT, GDP_CAP, MIL, B_TO_D,
LIFEEXPM, LIFEEXPF, LITERACY, and GNP_86]

The correlations for all pairs of the nine variables are shown here. The bottom of the
output panel shows that the sample size is 49, but the data file has 57 countries. If a
country has one or more missing values, SYSTAT, by default, omits all of the data for
the case. This is called listwise deletion.
The Quick Graph is a matrix of scatterplots with one plot for each entry in the
correlation matrix and histograms of the variables on the diagonal. For example, the
plot of BIRTH_RT against URBAN is at the top left under the histogram for URBAN.


If linearity does not hold for your variables, your results may be meaningless. A
good way to assess linearity, the presence of outliers, and other anomalies is to
examine the plot for each pair of variables in the scatterplot matrix. The relationships
between GDP_CAP and BIRTH_RT, B_TO_D, LIFEEXPM, and LIFEEXPF do not
appear to be linear. Also, the points in the MIL versus GDP_CAP and GNP_86 versus
MIL displays clump in the lower left corner. It is not wise to use correlations for
describing these relations.

Altering the Format


The correlation matrix for this example wraps (the results for nine variables do not fit
in one panel). You can squeeze in more results by specifying a field width and the number
of decimal places. For example, the same correlations printed in a field 6 characters
wide are shown below. We request only 2 digits to the right of the decimal instead of 3.
CORR
USE ourworld
FORM 6 2
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86

(Using the command language, press F9 to retrieve the previous PEARSON statement
instead of retyping it.)
The output is:
Pearson correlation matrix

            URBAN BIRTH_ GDP_CA    MIL B_TO_D LIFEEX LIFEEX LITERA GNP_86
URBAN        1.00
BIRTH_RT    -0.80   1.00
GDP_CAP      0.62  -0.76   1.00
MIL          0.60  -0.67   0.90   1.00
B_TO_D      -0.31   0.51  -0.66  -0.61   1.00
LIFEEXPM     0.78  -0.92   0.66   0.58  -0.21   1.00
LIFEEXPF     0.80  -0.95   0.70   0.62  -0.26   0.99   1.00
LITERACY     0.80  -0.93   0.64   0.56  -0.27   0.91   0.93   1.00
GNP_86       0.59  -0.69   0.96   0.87  -0.56   0.63   0.67   0.61   1.00

Number of observations: 49

Notice that while the top row of variable names is truncated to fit within the field
specification, the row names remain complete.


Requesting a Portion of a Matrix


You can request that only a portion of the matrix be computed. The input follows:
CORR
USE ourworld
FORMAT
PEARSON lifeexpm lifeexpf literacy gnp_86 *,
urban birth_rt gdp_cap mil b_to_d

The resulting output is:


Pearson correlation matrix

            URBAN BIRTH_RT  GDP_CAP      MIL   B_TO_D
LIFEEXPM    0.776   -0.922    0.664    0.582   -0.211
LIFEEXPF    0.801   -0.949    0.704    0.619   -0.265
LITERACY    0.800   -0.930    0.637    0.562   -0.274
GNP_86      0.592   -0.689    0.964    0.873   -0.560

Number of observations: 49

These correlations correspond to the lower left corner of the first matrix.

Example 2
Transformations
If relationships between variables appear nonlinear, using a measure of linear
association is not advised. Fortunately, transformations of the variables may yield
linear relationships. You can then use the linear relation measures, but all conclusions
regarding the relationships are relative to the transformed variables instead of the
original variables.
In the Pearson correlations example, we observed nonlinear relationships involving
GDP_CAP, MIL, and GNP_86. Here we log transform these variables and compare the
resulting correlations to those for the untransformed variables. The input is:
CORR
USE ourworld
LET (gdp_cap,mil,gnp_86) = L10(@)
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86

Notice that we use SYSTAT's shortcut notation to make the transformation.
Alternatively, you could use:
LET gdp_cap = L10(gdp_cap)
LET mil = L10(mil)
LET gnp_86 = L10(gnp_86)

The output follows:

Means
               URBAN  BIRTH_RT   GDP_CAP       MIL    B_TO_D
             52.8776   25.9592    3.3696    1.6954    2.8855

            LIFEEXPM  LIFEEXPF  LITERACY    GNP_86
             65.4286   70.5714   74.7265    3.2791

Pearson correlation matrix

               URBAN  BIRTH_RT   GDP_CAP       MIL    B_TO_D
URBAN         1.0000
BIRTH_RT     -0.8002    1.0000
GDP_CAP       0.7636   -0.9189    1.0000
MIL           0.6801   -0.8013    0.8947    1.0000
B_TO_D       -0.3074    0.5106   -0.5293   -0.5374    1.0000
LIFEEXPM      0.7756   -0.9218    0.8599    0.7267   -0.2113
LIFEEXPF      0.8011   -0.9488    0.8954    0.7634   -0.2648
LITERACY      0.7997   -0.9302    0.8337    0.7141   -0.2737
GNP_86        0.7747   -0.8786    0.9736    0.8773   -0.4411

            LIFEEXPM  LIFEEXPF  LITERACY    GNP_86
LIFEEXPM      1.0000
LIFEEXPF      0.9887    1.0000
LITERACY      0.9110    0.9350    1.0000
GNP_86        0.8610    0.8861    0.8404    1.0000

Number of observations: 49

[Quick Graph: scatterplot matrix (SPLOM) of URBAN, BIRTH_RT, GDP_CAP, MIL, B_TO_D,
LIFEEXPM, LIFEEXPF, LITERACY, and GNP_86]


In the scatterplot matrix, linearity has improved in the plots involving GDP_CAP, MIL,
and GNP_86. Look at the difference between the correlations before and after
transformation.
                 Transformation
gdp_cap vs.       no       yes
urban            0.625    0.764
birth_rt        -0.762   -0.919
lifeexpm         0.664    0.860
lifeexpf         0.704    0.895
literacy         0.637    0.834

                 Transformation
mil vs.           no       yes
urban            0.597    0.680
birth_rt        -0.672   -0.801
lifeexpm         0.582    0.727
lifeexpf         0.619    0.763
literacy         0.562    0.714

                 Transformation
gnp_86 vs.        no       yes
urban            0.592    0.775
birth_rt        -0.689   -0.879
lifeexpm         0.633    0.861
lifeexpf         0.665    0.886
literacy         0.611    0.840

After log transforming the variables, linearity has improved in the plots, and many of
the correlations are stronger.
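
The effect of a log transformation on a Pearson correlation can be seen with a small made-up data set. In the Python sketch below (not SYSTAT code, and not the OURWORLD data), y grows exponentially with x, so the raw correlation understates the relationship, while the correlation with log10(y) is essentially 1. The data and the function name are hypothetical.

```python
from math import log10, sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: y increases exponentially with x
x = list(range(1, 11))
y = [10 ** (0.3 * v) for v in x]

r_raw = pearson(x, y)                       # curved relationship
r_log = pearson(x, [log10(v) for v in y])   # straight line after the transform
print(round(r_raw, 3), round(r_log, 3))
```

As in the example above, the conclusion then applies to the transformed variable: x is linearly related to log10(y), not to y itself.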

Example 3
Missing Data: Pairwise Deletion
To specify pairwise deletion, the input is:
USE ourworld
CORR
LET (gdp_cap,mil,gnp_86) = L10(@)
GRAPH NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86 / PAIR

The output is:


Means
      URBAN   BIRTH_RT    GDP_CAP        MIL     B_TO_D
     52.821     26.351      3.372      1.775      2.873

   LIFEEXPM   LIFEEXPF   LITERACY     GNP_86
     65.088     70.123     73.563      3.293

Pearson correlation matrix
                URBAN   BIRTH_RT    GDP_CAP        MIL     B_TO_D
URBAN           1.000
BIRTH_RT       -0.781      1.000
GDP_CAP         0.778     -0.895      1.000
MIL             0.683     -0.687      0.857      1.000
B_TO_D         -0.248      0.535     -0.472     -0.377      1.000
LIFEEXPM        0.796     -0.892      0.854      0.696     -0.172
LIFEEXPF        0.816     -0.924      0.891      0.721     -0.230
LITERACY        0.807     -0.930      0.832      0.646     -0.291
GNP_86          0.775     -0.881      0.974      0.881     -0.455

             LIFEEXPM   LIFEEXPF   LITERACY     GNP_86
LIFEEXPM        1.000
LIFEEXPF        0.989      1.000
LITERACY        0.911      0.937      1.000
GNP_86          0.863      0.888      0.842      1.000

Pairwise frequency table
                URBAN   BIRTH_RT    GDP_CAP        MIL     B_TO_D
URBAN              56
BIRTH_RT           56         57
GDP_CAP            56         57         57
MIL                55         56         56         56
B_TO_D             56         57         57         56         57
LIFEEXPM           56         57         57         56         57
LIFEEXPF           56         57         57         56         57
LITERACY           56         57         57         56         57
GNP_86             49         50         50         50         50

             LIFEEXPM   LIFEEXPF   LITERACY     GNP_86
LIFEEXPM           57
LIFEEXPF           57         57
LITERACY           57         57         57
GNP_86             50         50         50         50

The sample size for each variable is reported as the diagonal of the pairwise frequency
table; sample sizes for complete pairs of cases are reported off the diagonal. There are
57 countries in this sample: 56 reported the percentage living in cities (URBAN), and
50 reported the gross national product per capita in 1986 (GNP_86). There are 49
countries that have values for both URBAN and GNP_86.
The means are printed because we specified PRINT=LONG. Since pairwise deletion
is requested, all available values are used to compute each mean; that is, these means
are the same as those computed by the Statistics procedure.
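The bookkeeping behind a pairwise frequency table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SYSTAT's implementation: the five cases are hypothetical, `None` stands for a missing value, and the helper names are invented for the sketch.

```python
# Hypothetical cases; None marks a missing value.
data = {
    "urban":  [40.0, None, 75.0, 60.0, 22.0],
    "gnp_86": [2.9, 3.1, None, 3.8, None],
    "mil":    [1.2, 1.5, 2.0, None, 0.9],
}

def pairwise_n(a, b):
    """Number of cases with both values present."""
    return sum(1 for u, v in zip(a, b) if u is not None and v is not None)

names = list(data)
# Diagonal entries are per-variable sample sizes; off-diagonal entries
# are the numbers of complete pairs.
freq = {(r, c): pairwise_n(data[r], data[c]) for r in names for c in names}

# Listwise deletion keeps only cases complete on every variable.
listwise_n = sum(1 for row in zip(*data.values()) if None not in row)

def mean_available(values):
    """Mean over all nonmissing values, as used with pairwise deletion."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

for r in names:
    print(r, [freq[(r, c)] for c in names])
print("listwise n:", listwise_n)
```

With pairwise deletion each correlation uses its own count from `freq`, whereas listwise deletion would use `listwise_n` cases for every entry of the matrix.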

Example 4
Missing Data: EM Estimation
This example uses the same variables used in the transformations example. To specify
EM estimation, the input is:
CORR
USE ourworld
LET (gdp_cap,mil,gnp_86) = L10(@)
IDVAR = country$
GRAPH = NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm,
lifeexpf literacy gnp_86 / EM


The output follows:


EM Algorithm

Iteration   Maximum Error   -2*log(likelihood)
---------   -------------   ------------------
        1        1.092328         24135.483249
        2        1.023878          7625.491302
        3        0.643113          6932.605472
        4        0.666125          6691.458724
        5        0.857590          6573.199525
        6        2.718236          6538.852550
        7        0.728468          6531.689766
        8        0.196577          6530.369252
        9        0.077590          6530.167056
       10        0.034510          6530.159651
       11        0.016278          6530.176410
       12        0.007986          6530.190050
       13        0.004050          6530.198695
       14        0.002120          6530.203895
       15        0.001145          6530.207008
       16        0.000637          6530.208887

No. of    Missing value patterns
Cases     (X=nonmissing; .=missing)
    49    XXXXXXXXX
     1    .XXXXXXXX
     6    XXXXXXXX.
     1    XXX.XXXX.

Little MCAR test statistic: 35.757   df = 23   prob = 0.044

EM estimate of means
      URBAN   BIRTH_RT    GDP_CAP        MIL     B_TO_D
     53.152     26.351      3.372      1.754      2.873

   LIFEEXPM   LIFEEXPF   LITERACY     GNP_86
     65.088     70.123     73.563      3.284

EM estimated correlation matrix
                URBAN   BIRTH_RT    GDP_CAP        MIL     B_TO_D
URBAN           1.000
BIRTH_RT       -0.782      1.000
GDP_CAP         0.779     -0.895      1.000
MIL             0.700     -0.697      0.863      1.000
B_TO_D         -0.259      0.535     -0.472     -0.357      1.000
LIFEEXPM        0.796     -0.892      0.854      0.713     -0.172
LIFEEXPF        0.816     -0.924      0.891      0.738     -0.230
LITERACY        0.808     -0.930      0.832      0.668     -0.291
GNP_86          0.796     -0.831      0.968      0.874     -0.342

             LIFEEXPM   LIFEEXPF   LITERACY     GNP_86
LIFEEXPM        1.000
LIFEEXPF        0.989      1.000
LITERACY        0.911      0.937      1.000
GNP_86          0.863      0.885      0.828      1.000

SYSTAT prints missing-value patterns for the data. Forty-nine cases in the sample are
complete (an X is printed for each of the nine variables). Periods are inserted where
data are missing. The value of the first variable, URBAN, is missing for one case, while
the value of the last variable, GNP_86, is missing for six cases. The last row of the
pattern indicates that the values of the fourth variable, MIL, and the last variable,
GNP_86, are both missing for one case.


Little's MCAR (missing completely at random) test has a probability less than 0.05,
indicating that we reject the hypothesis that the nine missing values are randomly
missing. This test has limited power when the sample of incomplete cases is small and
it also offers no direct evidence on the validity of the MAR assumption.

Example 5
Probabilities Associated with Correlations
To request the usual (uncorrected) probabilities for a correlation matrix using pairwise
deletion:
USE ourworld
CORR
LET (gdp_cap,mil,gnp_86) = L10(@)
GRAPH NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86 / PAIR PROB

The output is:


Bartlett Chi-square statistic: 815.067   df=36   Prob= 0.000

Matrix of Probabilities
                URBAN   BIRTH_RT    GDP_CAP        MIL     B_TO_D
URBAN             0.0
BIRTH_RT        0.000        0.0
GDP_CAP         0.000      0.000        0.0
MIL             0.000      0.000      0.000        0.0
B_TO_D          0.065      0.000      0.000      0.004        0.0
LIFEEXPM        0.000      0.000      0.000      0.000      0.202
LIFEEXPF        0.000      0.000      0.000      0.000      0.085
LITERACY        0.000      0.000      0.000      0.000      0.028
GNP_86          0.000      0.000      0.000      0.000      0.001

             LIFEEXPM   LIFEEXPF   LITERACY     GNP_86
LIFEEXPM          0.0
LIFEEXPF        0.000        0.0
LITERACY        0.000      0.000        0.0
GNP_86          0.000      0.000      0.000        0.0

The p values that are appropriate for making statements regarding one specific
correlation are shown here. By themselves, these values are not very informative.
These p values are pseudo-probabilities because they do not reflect the number of
correlations being tested. If pairwise deletion is used, the problem is even worse,
although many statistics packages print probabilities as if they meant something in this
case, too.


SYSTAT computes the Bartlett chi-square test whenever you request probabilities
for more than one correlation. This tests a global hypothesis concerning the
significance of all of the correlations in the matrix:

$$ \chi^2 = -\left( N - 1 - \frac{2p + 5}{6} \right) \ln |R| $$

where N is the total sample size (or the smallest sample size for any pair in the matrix
if pairwise deletion is used), p is the number of variables, and |R| is the determinant of
the correlation matrix. This test is sensitive to non-normality, and the test statistic is
only asymptotically distributed (for large samples) as chi-square. Nevertheless, it can
serve as a guideline.
If the Bartlett test is not significant, don't even look at the significance of individual
correlations. In this example, the test is significant, which indicates that there may be
some real correlations among the variables. The Bartlett test is sensitive to
nonnormality and can be used only as a guide. Even if the Bartlett test is significant, you
cannot accept the nominal p values as the true family probabilities associated with each
correlation.

Bonferroni Probabilities with Pairwise Deletion


Let's now examine the probabilities adjusted by the Bonferroni method, which provides
protection for multiple tests. Remember that the log-transformed values from the
transformations example are still in effect. The input is:
USE ourworld
CORR
LET (gdp_cap,mil,gnp_86) = L10(@)
GRAPH NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86 / PAIR BONF


The output follows:


Bartlett Chi-square statistic: 815.067   df=36   Prob= 0.000

Matrix of Bonferroni Probabilities
                URBAN   BIRTH_RT    GDP_CAP        MIL     B_TO_D
URBAN             0.0
BIRTH_RT        0.000        0.0
GDP_CAP         0.000      0.000        0.0
MIL             0.000      0.000      0.000        0.0
B_TO_D          1.000      0.001      0.008      0.150        0.0
LIFEEXPM        0.000      0.000      0.000      0.000      1.000
LIFEEXPF        0.000      0.000      0.000      0.000      1.000
LITERACY        0.000      0.000      0.000      0.000      1.000
GNP_86          0.000      0.000      0.000      0.000      0.032

             LIFEEXPM   LIFEEXPF   LITERACY     GNP_86
LIFEEXPM          0.0
LIFEEXPF        0.000        0.0
LITERACY        0.000      0.000        0.0
GNP_86          0.000      0.000      0.000        0.0

Compare these results with those for the 36 tests using uncorrected probabilities.
Notice that some correlations, such as those for B_TO_D with MIL and LITERACY, are no
longer significant at the 0.05 level.
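The adjustment itself is simple arithmetic. The sketch below assumes the usual Bonferroni rule (multiply each nominal probability by the number of tests and cap the result at 1); the function name is invented for the illustration. Because the printed probabilities are rounded to three decimals, multiplying them by 36 only approximates the tabled Bonferroni values.

```python
def bonferroni(p, n_tests):
    """Bonferroni-adjusted probability, capped at 1."""
    return min(1.0, p * n_tests)

p_vars = 9
n_tests = p_vars * (p_vars - 1) // 2    # 36 distinct correlations among 9 variables

# Rounded nominal probabilities for B_TO_D with MIL, LITERACY, and LIFEEXPM.
for p in (0.004, 0.028, 0.202):
    print(f"nominal {p:.3f} -> adjusted {bonferroni(p, n_tests):.3f}")
```

A nominal probability has to be below 0.05 / 36 (about 0.0014) to remain significant at the 0.05 level after the adjustment.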

Bonferroni Probabilities for EM Estimates


You can request the Bonferroni adjusted probabilities for an EM estimated matrix by
specifying:
USE ourworld
CORR
LET (gdp_cap,mil,gnp_86) = L10(@)
GRAPH NONE
PRINT = LONG
PEARSON urban birth_rt gdp_cap mil b_to_d lifeexpm lifeexpf,
literacy gnp_86 / EM BONF

The probabilities follow:


Bartlett Chi-square statistic: 821.288   df=36   Prob= 0.000

Matrix of Bonferroni Probabilities
                URBAN   BIRTH_RT    GDP_CAP        MIL     B_TO_D
URBAN             0.0
BIRTH_RT        0.000        0.0
GDP_CAP         0.000      0.000        0.0
MIL             0.000      0.000      0.000        0.0
B_TO_D          1.000      0.001      0.008      0.248        0.0
LIFEEXPM        0.000      0.000      0.000      0.000      1.000
LIFEEXPF        0.000      0.000      0.000      0.000      1.000
LITERACY        0.000      0.000      0.000      0.000      1.000
GNP_86          0.000      0.000      0.000      0.000      0.537

             LIFEEXPM   LIFEEXPF   LITERACY     GNP_86
LIFEEXPM        0.000
LIFEEXPF        0.000        0.0
LITERACY        0.000      0.000        0.0
GNP_86          0.000      0.000      0.000      0.000

Example 6
Hadi Robust Outlier Detection
If only one or two variables have outliers among many well-behaved variables, the
outliers may be masked. Let's look for outliers among four variables. The input is:
USE ourworld
CORR
LET (gdp_cap, mil) = L10(@)
GRAPH = NONE
PRINT = LONG
IDVAR = country$
PEARSON gdp_cap mil b_to_d literacy / HADI
PLOT GDP_CAP*B_TO_D*LITERACY / SPIKE XGRID YGRID AXES=BOOK,
SCALE=L SYMBOL=GROUP$ SIZE= 1.250 ,1.250 ,1.250

The output is:


These 15 outliers are identified:
Case          Distance
------------  ------------
Venezuela     4.48653
CostaRica     4.55336
Senegal       4.66615
Sudan         4.74882
Ethiopia      4.82013
Pakistan      5.05827
Libya         5.10295
Haiti         5.44901
Bangladesh    5.47974
Yemen         5.84027
Gambia        5.84202
Iraq          5.84507
Guinea        6.12308
Somalia       6.18465
Mali          6.30091

Means of variables of non-outlying cases
    GDP_CAP        MIL     B_TO_D   LITERACY
      3.634      1.967      2.533     88.183

HADI estimated correlation matrix
              GDP_CAP        MIL     B_TO_D   LITERACY
GDP_CAP         1.000
MIL             0.860      1.000
B_TO_D         -0.839     -0.753      1.000
LITERACY        0.729      0.642     -0.698      1.000

Number of observations: 56

[Figure: three-dimensional spike plot of GDP_CAP, B_TO_D, and LITERACY, with each country
plotted as the first letter of its GROUP$ value (E, I, or N)]

Fifteen countries are identified as outliers. We suspect that the sample may not be
homogeneous, so we request a plot labeled by GROUP$. The panel is set to
PRINT=LONG; the country names appear because we specified COUNTRY$ as an ID
variable. The correlations at the end of the output are computed using the cases
that are not identified as outliers (41 of the 56, given the 15 outliers listed).
In the plot, we see that Islamic countries tend to fall between New World and
European countries with respect to birth-to-death ratio and have the lowest literacy.
European countries have the highest literacy and GDP_CAP values.

Stratifying the Analysis


We'll use Hadi for each of the three groups separately:
USE ourworld
CORR
LET (gdp_cap, mil) = L10(@)
GRAPH = NONE
PRINT = LONG
IDVAR = country$
BY group$
PEARSON gdp_cap mil b_to_d literacy / HADI
BY


For clarity, we edited the following output by moving the panels of means to the end:
The following results are for:
  GROUP$ = Europe

These 1 outliers are identified:
Case          Distance
------------  ------------
Portugal      5.72050

HADI estimated correlation matrix
              GDP_CAP        MIL     B_TO_D   LITERACY
GDP_CAP         1.000
MIL             0.474      1.000
B_TO_D         -0.092     -0.173      1.000
LITERACY        0.259      0.263      0.136      1.000

Number of observations: 20

The following results are for:
  GROUP$ = Islamic

HADI estimated correlation matrix
              GDP_CAP        MIL     B_TO_D   LITERACY
GDP_CAP         1.000
MIL             0.877      1.000
B_TO_D          0.781      0.882      1.000
LITERACY        0.600      0.605      0.649      1.000

Number of observations: 15

The following results are for:
  GROUP$ = NewWorld

HADI estimated correlation matrix
              GDP_CAP        MIL     B_TO_D   LITERACY
GDP_CAP         1.000
MIL             0.674      1.000
B_TO_D         -0.246     -0.287      1.000
LITERACY        0.689      0.561     -0.045      1.000

Number of observations: 21

Means of variables of non-outlying cases (Europe)
    GDP_CAP        MIL     B_TO_D   LITERACY
      4.059      2.404      1.260     98.316

Means of variables of non-outlying cases (Islamic)
    GDP_CAP        MIL     B_TO_D   LITERACY
      2.764      1.400      3.547     36.733

Means of variables of non-outlying cases (NewWorld)
    GDP_CAP        MIL     B_TO_D   LITERACY
      3.214      1.466      3.951     79.957

When computations are done separately for each group, Portugal is the only outlier,
and the within-groups correlations differ markedly from group to group and from those
for the complete sample. By scanning the means, we also see that the centroids for the
three groups are quite different.


Example 7
Spearman Correlations
As an example, we request Spearman correlations for the same data used in the Pearson
correlation and Transformations examples. It is often useful to compute both a
Spearman and a Pearson matrix using the same data. The absolute difference between
the two can reveal unusual features such as outliers and highly skewed distributions.
The input is:
USE ourworld
CORR
GRAPH = NONE
SPEARMAN urban birth_rt gdp_cap mil b_to_d,
lifeexpm lifeexpf literacy gnp_86 / PAIR

The correlation matrix follows:


Spearman correlation matrix
                URBAN   BIRTH_RT    GDP_CAP        MIL     B_TO_D
URBAN           1.000
BIRTH_RT       -0.749      1.000
GDP_CAP         0.777     -0.874      1.000
MIL             0.678     -0.670      0.848      1.000
B_TO_D         -0.381      0.689     -0.597     -0.498      1.000
LIFEEXPM        0.731     -0.856      0.834      0.633     -0.410
LIFEEXPF        0.771     -0.902      0.910      0.709     -0.501
LITERACY        0.760     -0.868      0.882      0.696     -0.576
GNP_86          0.767     -0.847      0.973      0.867     -0.543

             LIFEEXPM   LIFEEXPF   LITERACY     GNP_86
LIFEEXPM        1.000
LIFEEXPF        0.965      1.000
LITERACY        0.813      0.866      1.000
GNP_86          0.834      0.901      0.909      1.000

Note that many of these correlations are closer to the Pearson correlations for the
log-transformed data than they are to the correlations for the raw data.
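Why Spearman correlations behave like Pearson correlations on log-transformed data can be seen from how they are computed: the data are ranked, and ranks do not change under a monotone transformation such as a logarithm. The Python sketch below uses hypothetical values and helper names invented for the illustration; tied values receive their average rank, which is an assumption about the ranking rule.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical, strongly skewed data.
x = [1, 2, 3, 4, 5, 6]
y = [10, 100, 1000, 10000, 100000, 1000000]
log_y = [math.log10(v) for v in y]

print(f"Pearson raw:  {pearson(x, y):.3f}")
print(f"Pearson log:  {pearson(x, log_y):.3f}")
print(f"Spearman raw: {spearman(x, y):.3f}")
print(f"Spearman log: {spearman(x, log_y):.3f}")
```

The two Spearman values agree because the ranks of `y` and `log_y` are identical, while the Pearson value for the raw data is pulled down by the skewness.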

Example 8
S2 and S3 Coefficients
The choice among the binary S measures depends on what you want to state about your
variables. In this example, we request S2 and S3 to study responses made by 256
subjects to a depression inventory (Afifi and Clark, 1984). These data are stored in the
SURVEY2 data file that has one record for each respondent with answers to 20
questions about depression. Each subject was asked, for example, "Last week, did you
cry less than 1 day (code 0), 1 to 2 days (code 1), 3 to 4 days (code 2), or 5 to 7 days
(code 3)?" The distributions of the answers appear to be Poisson, so they are not
satisfactory for Pearson correlations. Here we dichotomize the behaviors or feelings as
"Did it occur or did it not?" by using transformations of the form:
LET blue = blue <> 0

The result is true (1) when the behavior or feeling is present or false (0) when it is
absent. We use SYSTAT's shortcut notation to do this for 7 of the 20 questions. For
each pair of feelings or behaviors, S2 indicates the proportion of subjects with both,
and S3 indicates the proportion of times both occurred given that one occurs. To
perform this example:
USE survey2
CORR
LET (blue,depress,cry,sad,no_eat,getgoing,talkless) = @ <> 0
GRAPH = NONE
S2 blue depress cry sad no_eat getgoing talkless
S3 blue depress cry sad no_eat getgoing talkless

The matrices follow:


S2 (Russell and Rao) binary similarity coefficients
                 BLUE    DEPRESS        CRY        SAD     NO_EAT
BLUE            0.254
DEPRESS         0.207      0.422
CRY             0.090      0.113      0.133
SAD             0.188      0.313      0.117      0.391
NO_EAT          0.117      0.129      0.051      0.137      0.246
GETGOING        0.180      0.309      0.086      0.258      0.152
TALKLESS        0.117      0.156      0.059      0.145      0.098

             GETGOING   TALKLESS
GETGOING        0.520
TALKLESS        0.172      0.246

Number of observations: 256

S3 (Jaccard) binary similarity coefficients
                 BLUE    DEPRESS        CRY        SAD     NO_EAT
BLUE            1.000
DEPRESS         0.442      1.000
CRY             0.303      0.257      1.000
SAD             0.410      0.625      0.288      1.000
NO_EAT          0.306      0.239      0.155      0.273      1.000
GETGOING        0.303      0.488      0.152      0.395      0.248
TALKLESS        0.306      0.305      0.183      0.294      0.248

             GETGOING   TALKLESS
GETGOING        1.000
TALKLESS        0.289      1.000

Number of observations: 256


The frequencies for DEPRESS and SAD are:

                    Sad
                  1      0
Depress   1      80     28
          0      20    128

For S2, the result is 80/256 = 0.313; for S3, 80/128 = 0.625.
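The same arithmetic can be written as a small Python function. This is a sketch with names chosen for the illustration: `a` is the number of subjects with both items present, `b` and `c` are the counts with exactly one item present, and `d` is the count with neither.

```python
def s2_s3(a, b, c, d):
    """Return (S2, S3) for a 2 x 2 table of presence/absence counts."""
    n = a + b + c + d
    s2 = a / n                # Russell and Rao: both present, over all subjects
    s3 = a / (a + b + c)      # Jaccard: both present, over subjects with at least one
    return s2, s3

# DEPRESS by SAD: 80 with both, 28 + 20 with only one, 128 with neither.
s2, s3 = s2_s3(a=80, b=28, c=20, d=128)
print(f"S2 = {s2:.4f}, S3 = {s3:.4f}")
```

S3 ignores the 128 subjects who reported neither feeling, which is why it is so much larger than S2 for the same pair of items.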

Example 9
Tetrachoric Correlation
As an example, we use the bivariate normal data in the SYSTAT data file named
TETRA. The input is:
USE tetra
FREQ = count
CORR
TETRA x y

The output follows:


Tetrachoric correlations
                 X          Y
X            1.000
Y            0.810      1.000

Number of observations: 45

For our single pair of variables, the tetrachoric correlation is 0.81.

Computation
All computations are implemented in double precision.

Algorithms
The computational algorithms use provisional means, sums of squares, and
cross-products (Spicer, 1972). Starting values for the EM algorithm use all available values
(see Little and Rubin, 1987, p. 42).


For the rank-order coefficients (Gamma, Mu2, Spearman, and Tau), keep in mind
that these are time consuming. Spearman requires sorting and ranking the data before
doing the same work done by Pearson. The Gamma and Mu2 items require
computations between all possible pairs of observations. Thus, their computing time is
combinatoric.
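To see why the pair-based coefficients are expensive, consider a direct computation of Goodman and Kruskal's gamma. The Python sketch below is a textbook-style illustration, not SYSTAT's algorithm; the data are hypothetical and the function name is invented here. It visits every pair of observations, so the work grows as n(n - 1)/2.

```python
from itertools import combinations

def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D), counted over all pairs of observations."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1      # pair ordered the same way on x and y
        elif s < 0:
            discordant += 1      # pair ordered in opposite ways
        # pairs tied on x or on y count toward neither total
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical ordered categories for five cases (10 pairs in all).
x = [1, 2, 2, 3, 4]
y = [1, 3, 2, 2, 4]
gamma = goodman_kruskal_gamma(x, y)
print(gamma)
```

Doubling the number of cases roughly quadruples the number of pairs, whereas ranking the data for Spearman only requires a sort.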

Missing Data
If you have missing data, CORR can handle them in three ways: listwise deletion,
pairwise deletion, and EM estimation. Listwise deletion is the default. If there are
missing data and pairwise deletion is used, SYSTAT displays a table of frequencies
between all possible pairs of variables after the correlation matrix.
Pairwise deletion takes considerably more computer time because the sums of
cross-products for each pair must be saved in a temporary disk file. If you use the
pairwise deletion to compute an SSCP matrix, the sums of squares and cross-products
are weighted by N/n, where N is the number of cases in the whole file and n is the
number of cases with nonmissing values in a given pair.
See Chapter II-1 for a complete discussion of handling missing values.

References
Afifi, A. A. and Clark, V. (1984). Computer-aided multivariate analysis. Belmont, Calif.:
Lifetime Learning Publications.
Faith, D. P., Minchin, P., and Belbin, L. (1987). Compositional dissimilarity as a robust
measure of ecological distance. Vegetatio, 69, 57–68.
Goodman, L. A. and Kruskal, W. H. (1954). Measures of association for
cross-classification. Journal of the American Statistical Association, 49, 732–764.
Gower, J. C. (1985). Measures of similarity, dissimilarity, and distance. In Kotz, S. and
Johnson, N. L. Encyclopedia of Statistical Sciences, vol. 5. New York: John Wiley &
Sons, Inc.
Hadi, A. S. (1994). A modification of a method for the detection of outliers in multivariate
samples. In Journal of the Royal Statistical Society, Series (B), 56, No. 2.
Little, R. J. A. and Rubin, D. B. (1987). Statistical analyses with missing data. New York:
John Wiley & Sons, Inc.
Shye, S., ed. (1978). Theory construction and data analysis in the behavioral sciences. San
Francisco: Jossey-Bass, Inc.

Chapter

7
Correspondence Analysis
Leland Wilkinson

Correspondence analysis allows you to examine the relationship between categorical
variables graphically. It computes simple and multiple correspondence analysis for
two-way and multiway tables of categorical variables, respectively. Tables are
decomposed into row and column coordinates, which are displayed in a graph.
Categories that are similar to each other appear close to each other in the graphs.

Statistical Background
Correspondence analysis is a method for decomposing a table of data into row and
column coordinates that can be displayed graphically. With this technique, a two-way
table can be represented in a two-dimensional graph with points for rows and
columns. These coordinates are computed with a Singular Value Decomposition
(SVD), which factors a matrix into the product of three matrices: a collection of left
singular vectors, a matrix of singular values, and a collection of right singular
vectors. Greenacre (1984) is the most comprehensive reference. Hill (1974) and
Jobson (1992) cover the major topics more briefly.

The Simple Model


The simple correspondence analysis model decomposes a two-way table. This
decomposition begins with a matrix of standardized deviates, computed for each cell
in the table as follows:

$$ z_{ij} = \frac{1}{\sqrt{N}} \cdot \frac{o_{ij} - e_{ij}}{\sqrt{e_{ij}}} $$

where $N$ is the sum of the table counts $n_{ij}$ over all cells, $o_{ij}$ is the observed
count for cell $ij$, and $e_{ij}$ is the expected count for cell $ij$ based on an
independence model. The second term in this equation, when squared, is a cell's
contribution to the $\chi^2$ test-for-independence statistic. Thus, the sum of the squared
$z_{ij}$ over all cells in the table is the same as $\chi^2 / N$.
Finally, the row mass for row $i$ is $n_{i\cdot} / N$ and the column mass for column $j$
is $n_{\cdot j} / N$.
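The standardized deviates, masses, and total inertia can be checked with a few lines of Python. The sketch below makes one loud assumption: the counts are the smoking table as it is commonly reproduced from Greenacre (1984), which this chapter does not list. Variable names are chosen for the illustration.

```python
import math

# Assumed counts (staff group by smoking level: None, Light, Moderate, Heavy).
table = [
    [4, 2, 3, 2],       # Sr.Managers
    [4, 3, 7, 4],       # Jr.Managers
    [25, 10, 12, 4],    # Sr.Employees
    [18, 24, 33, 13],   # Jr.Employees
    [10, 6, 7, 2],      # Secretaries
]

N = sum(map(sum, table))
row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]

# z_ij = (1 / sqrt(N)) * (o_ij - e_ij) / sqrt(e_ij), with e_ij = n_i. * n_.j / N
z = [[(o - ri * cj / N) / math.sqrt(ri * cj / N) / math.sqrt(N)
      for o, cj in zip(row, col_tot)]
     for row, ri in zip(table, row_tot)]

total_inertia = sum(v * v for row in z for v in row)   # sum of squared z_ij
chi_square = N * total_inertia

row_mass = [ri / N for ri in row_tot]
col_mass = [cj / N for cj in col_tot]

print(f"N = {N}, chi-square = {chi_square:.3f}, total inertia = {total_inertia:.3f}")
print("row masses:", [f"{m:.3f}" for m in row_mass])
print("column masses:", [f"{m:.3f}" for m in col_mass])
```

If the assumed counts are right, the chi-square and masses should agree with the values CORAN prints for the smoking example (chi-square 16.442, total inertia 0.085).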
The next step is to compute the matrix of cross-products from this matrix of
deviates:

$$ S = Z'Z $$

This S matrix has $t = \min(r - 1, c - 1)$ nonzero eigenvalues, where r and c are the
row and column dimensions of the original table, respectively. The sum of these
eigenvalues is $\chi^2 / N$ (which is termed total inertia). It is this matrix that is
decomposed as follows:

$$ S = UDV' $$

where U is a matrix of row vectors, V is a matrix of column vectors, and D is a diagonal
matrix of the eigenvalues. The coordinates actually plotted are standardized from U
(for rows), so that

$$ \frac{\chi^2}{N} = \sum_{i=1}^{r} \frac{n_{i\cdot}}{N} \sum_{j=1}^{t} x_{ij}^2 $$

where $x_{ij}$ denotes the plotted coordinate of row i on dimension j.

The coordinates are similarly standardized from V (for columns).

The Multiple Model


The multiple correspondence model decomposes higher-way tables. Suppose we have
a multiway table of dimension $k_1$ by $k_2$ by $k_3$ by .... The multiple model begins
with an n by p matrix Z of dummy-coded profiles, where n = the total number of cases in
the table and $p = k_1 + k_2 + k_3 + \ldots$. This matrix is used to create a
cross-products matrix:

$$ S = Z'Z $$

which is rescaled and decomposed with a singular value decomposition, as before. See
Jobson (1992) for further information.
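The dummy-coded matrix Z and its cross-products can be built by hand for a tiny data set. The Python sketch below uses four hypothetical cases on three categorical variables; the rescaling and decomposition steps are not shown, and the variable names are invented for the illustration.

```python
# Hypothetical cases: (injury, driver, seatbelt)
cases = [
    ("None", "Normal", "Yes"),
    ("Minor", "Drunk", "No"),
    ("None", "Normal", "No"),
    ("Major", "Drunk", "No"),
]

# Collect the categories of each variable in order of first appearance.
levels = []
for j in range(len(cases[0])):
    seen = []
    for case in cases:
        if case[j] not in seen:
            seen.append(case[j])
    levels.append(seen)

# One column per (variable, category) combination: p = k1 + k2 + k3.
columns = [(j, lev) for j, levs in enumerate(levels) for lev in levs]
p = len(columns)

# n-by-p indicator matrix Z: each case has exactly one 1 per variable.
Z = [[1 if case[j] == lev else 0 for j, lev in columns] for case in cases]

# p-by-p cross-products matrix S = Z'Z (before any rescaling).
S = [[sum(row[a] * row[b] for row in Z) for b in range(p)] for a in range(p)]

for (j, lev), row in zip(columns, S):
    print(f"{lev:8s}", row)
```

The diagonal of S holds the category frequencies, and each off-diagonal block is the two-way table for a pair of variables, which is why a single decomposition of S summarizes all of the pairwise tables at once.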


Correspondence Analysis in SYSTAT


Correspondence Analysis Main Dialog Box
To open the Correspondence Analysis dialog box, from the menus choose:
Statistics
Data Reduction
Correspondence Analysis

A correspondence analysis is conducted by specifying a model and estimating it.


Dependent(s). Select the variable(s) you want to examine. The dependent variable(s)
should be categorical. To analyze a two-way table (simple correspondence analysis),
select a variable defining the rows. Selecting multiple dependent variables (and no
independent variables) yields a multiple correspondence model.
Independent(s). To analyze a two-way table, select a categorical variable defining the
columns of the table.
Save coordinates to file. Saves coordinates and labels to a data file.

You can specify one of two methods for handling missing data:
- Pairwise deletion. Pairwise deletion examines each pair of variables and uses all
  cases with both values present.
- Listwise deletion. Listwise deletion deletes any case with missing data for any
  variable in the list.


Using Commands
First, specify your data with USE filename. For a simple correspondence analysis,
continue with:
CORAN
MODEL depvar=indvar
ESTIMATE

For a multiple correspondence analysis:


CORAN
MODEL varlist
ESTIMATE

If data are aggregated and there is a variable in the file representing frequency of
profiles, use FREQ to identify that variable.

Usage Considerations
Types of data. CORAN uses rectangular data only.
Print options. There are no print options.
Quick Graphs. Quick Graphs produced by CORAN are correspondence plots for the
simple or multiple models.
Saving files. For simple correspondence analysis, CORAN saves the row variable
coordinates in DIM(1)...DIM(N) and the column variable coordinates in
FACTOR(1)...FACTOR(N), where the subscript indicates the dimension number. For
multiple correspondence analysis, DIM(1)...DIM(N) contain the variable coordinates
and FACTOR(1)...FACTOR(N) contain the case coordinates. Label information is
saved to LABEL$.
BY groups. CORAN analyzes data by groups. Your file need not be sorted on the BY
variable(s).
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. FREQ=variable increases the number of cases by the FREQ variable.
Case weights. WEIGHT is not available in CORAN.


Examples
The examples begin with a simple correspondence analysis of a two-way table from
Greenacre (1984). This is followed by a multiple correspondence analysis example.

Example 1
Correspondence Analysis (Simple)
Here we illustrate a simple correspondence analysis model. The data comprise a
hypothetical smoking survey in a company (Greenacre, 1984). Notice that we use
value labels to describe the categories in the output and plot. The FREQ command
codes the cell frequencies. The input is:
USE SMOKE
LABEL STAFF / 1="Sr.Managers",2="Jr.Managers",3="Sr.Employees",
4="Jr.Employees",5="Secretaries"
LABEL SMOKE / 1="None",2="Light",3="Moderate",4="Heavy"
FREQ=FREQ
CORAN
MODEL STAFF=SMOKE
ESTIMATE

The resulting output is:


Variables in the SYSTAT Rectangular file are:
 STAFF     SMOKE     FREQ

Case frequencies determined by value of variable FREQ.

Categorical values encountered during processing are:
STAFF (5 levels)
   Sr.Managers, Jr.Managers, Sr.Employees, Jr.Employees, Secretaries
SMOKE (4 levels)
   None, Light, Moderate, Heavy

Simple Correspondence Analysis

Chi-Square = 16.442.
Degrees of freedom = 12.
Probability = 0.172.

Factor    Eigenvalue   Percent   Cum Pct
   1           0.075     87.76     87.76
   2           0.010     11.76     99.51
   3           0.000       .49    100.00
 Sum           0.085  (Total Inertia)

Row Variable Coordinates
Name                Mass   Quality   Inertia  Factor 1  Factor 2
Sr.Managers        0.057     0.893     0.003     0.066     0.194
Jr.Managers        0.093     0.991     0.012    -0.259     0.243
Sr.Employees       0.264     1.000     0.038     0.381     0.011
Jr.Employees       0.456     1.000     0.026    -0.233    -0.058
Secretaries        0.130     0.999     0.006     0.201    -0.079

Row variable contributions to factors
Name            Factor 1  Factor 2
Sr.Managers        0.003     0.214
Jr.Managers        0.084     0.551
Sr.Employees       0.512     0.003
Jr.Employees       0.331     0.152
Secretaries        0.070     0.081

Row variable squared correlations with factors
Name            Factor 1  Factor 2
Sr.Managers        0.092     0.800
Jr.Managers        0.526     0.465
Sr.Employees       0.999     0.001
Jr.Employees       0.942     0.058
Secretaries        0.865     0.133

Column variable coordinates
Name                Mass   Quality   Inertia  Factor 1  Factor 2
None               0.316     1.000     0.049     0.393     0.030
Light              0.233     0.984     0.007    -0.099    -0.141
Moderate           0.321     0.983     0.013    -0.196    -0.007
Heavy              0.130     0.995     0.016    -0.294     0.198

Column variable contributions to factors
Name            Factor 1  Factor 2
None               0.654     0.029
Light              0.031     0.463
Moderate           0.166     0.002
Heavy              0.150     0.506

Column variable squared correlations with factors
Name            Factor 1  Factor 2
None               0.994     0.006
Light              0.327     0.657
Moderate           0.982     0.001
Heavy              0.684     0.310


For the simple correspondence model, CORAN prints the basic statistics and
eigenvalues of the decomposition. Next are the row and column coordinates, with
mass, quality, and inertia values. Mass equals the marginal total divided by the grand
total. Quality is a measure (between 0 and 1) of how well a row or column point is
represented by the first two factors. It is a proportion-of-variance statistic. See
Greenacre (1984) for further information. Inertia is a row's (or column's) contribution
to the total inertia. Contributions to the factors and squared correlations with the factors
are the last reported statistics.

Example 2
Multiple Correspondence Analysis
This example uses automobile accident data in Alberta, Canada, reprinted in Jobson
(1992). The categories are ordered with the ORDER command so that the output will
show them in increasing order of severity. The data are in tabular form, so we use the
FREQ command. The input is:
USE ACCIDENT
FREQ=FREQ
ORDER INJURY$ / SORT="None","Minimal","Minor","Major"
ORDER DRIVER$ / SORT="Normal","Drunk"
ORDER SEATBELT$ / SORT="Yes","No"
CORAN
MODEL INJURY$,DRIVER$,SEATBELT$
ESTIMATE

The resulting output is:


Variables in the SYSTAT Rectangular file are:
 SEATBELT$    IMPACT$    INJURY$    DRIVER$    FREQ

Case frequencies determined by value of variable FREQ.

Categorical values encountered during processing are:
INJURY$ (4 levels)
   None, Minimal, Minor, Major
DRIVER$ (2 levels)
   Normal, Drunk
SEATBELT$ (2 levels)
   Yes, No

Multiple Correspondence Analysis

Factor    Eigenvalue   Percent   Cum Pct
   1           0.373     22.37     22.37
   2           0.334     20.02     42.39
   3           0.333     20.00     62.39
   4           0.325     19.50     81.89
   5           0.302     18.11    100.00
 Sum           1.667  (Total Inertia)

Variable Coordinates
Name                Mass   Quality   Inertia  Factor 1  Factor 2
None               0.303     0.351     0.031     0.189     0.008
Minimal            0.018     0.251     0.315    -1.523    -1.454
Minor              0.012     0.552     0.322    -2.134     3.294
Major              0.001     0.544     0.332    -3.962   -10.976
Normal             0.313     0.496     0.020     0.179     0.014
Drunk              0.020     0.496     0.313    -2.758    -0.211
Yes                0.053     0.279     0.280     1.143    -0.402
No                 0.280     0.279     0.053    -0.217     0.076

Variable contributions to factors
Name            Factor 1  Factor 2
None               0.029     0.000
Minimal            0.111     0.113
Minor              0.141     0.375
Major              0.056     0.478
Normal             0.027     0.000
Drunk              0.414     0.003
Yes                0.187     0.026
No                 0.036     0.005

Variable squared correlations with factors
Name            Factor 1  Factor 2
None               0.350     0.001
Minimal            0.131     0.120
Minor              0.163     0.389
Major              0.063     0.481
Normal             0.493     0.003
Drunk              0.493     0.003
Yes                0.249     0.031
No                 0.249     0.031

Case coordinates
Name            Factor 1  Factor 2
1                  0.825    -0.219
2                 -0.779    -0.349
3                 -0.110    -1.063
4                 -1.713    -1.193
5                 -0.443     1.676
6                 -2.047     1.547
7                 -1.441    -6.558
8                 -3.045    -6.687
9                  0.825    -0.219
10                -0.779    -0.349
11                -0.110    -1.063
12                -1.713    -1.193
13                -0.443     1.676
14                -2.047     1.547
15                -1.441    -6.558
16                -3.045    -6.687
17                 0.825    -0.219
18                -0.779    -0.349
19                -0.110    -1.063
20                -1.713    -1.193
21                -0.443     1.676
22                -2.047     1.547
23                -1.441    -6.558
24                 0.825    -0.219
25                -0.779    -0.349
26                -0.110    -1.063
27                -1.713    -1.193
28                -0.443     1.676
29                -1.441    -6.558
30                -3.045    -6.687
31                 0.082     0.057
32                -1.521    -0.073
33                -0.853    -0.787
34                -2.456    -0.916
35                -1.186     1.953
36                -2.790     1.823
37                -2.184    -6.281
38                -3.788    -6.411
39                 0.082     0.057
40                -1.521    -0.073
41                -0.853    -0.787
42                -2.456    -0.916
43                -1.186     1.953
44                -2.790     1.823
45                -2.184    -6.281
46                -3.788    -6.411
47                 0.082     0.057
48                -1.521    -0.073
49                -0.853    -0.787
50                -2.456    -0.916
51                -1.186     1.953
52                -2.790     1.823
53                -2.184    -6.281
54                -3.788    -6.411
55                 0.082     0.057
56                -1.521    -0.073
57                -0.853    -0.787
58                -2.456    -0.916
59                -1.186     1.953
60                -2.790     1.823
61                -2.184    -6.281
62                -3.788    -6.411

This time, we get case coordinates instead of column coordinates. These are not
included in the following Quick Graph because the focus of the graph is on the tabular
variables and we don't want to clutter the display. If you want to plot case coordinates,
cut and paste them into the editor and plot them directly.
Following is the Quick Graph:


The graph reveals a principal axis of major versus minor injuries. This axis is related
to drunk driving and seat belt use.

Computation
All computations are in double precision.

Algorithms
CORAN uses a singular value decomposition of the cross-products matrix computed
from the data.

Missing Data
Cases with missing data are deleted from all analyses.

References
Greenacre, M. J. (1984). Theory and applications of correspondence analysis. New York:
Academic Press.
Hill, M. O. (1974). Correspondence analysis: A neglected multivariate method. Applied
Statistics, 23, 340354.
Jobson, J. D. (1992). Applied multivariate data analysis, Vol. II: Categorical and
multivariate methods. New York: Springer-Verlag.

Chapter

8
Crosstabulation

When variables are categorical, frequency tables (crosstabulations) provide useful
summaries. For a report, you may need only the number or percentage of cases falling
in specified categories or cross-classifications. At times, you may require a test of
independence or a measure of association between two categorical variables. Or, you
may want to model relationships among two or more categorical variables by fitting
a loglinear model to the cell frequencies.
Both Crosstabs and Loglinear Model can make, analyze, and save frequency tables
that are formed by categorical variables (or table factors). The values of the factors
can be character or numeric. Both procedures form tables using data read from a
cases-by-variables rectangular file or recorded as frequencies (for example, from a
table in a report) with cell indices. In Crosstabs, you can request percentages of row
totals, column totals, or the total sample size.
Crosstabs (on the Statistics menu) provides three types of frequency tables:

One-way     Frequency counts, percentages, and confidence intervals on cell
            proportions for single table factors or categorical variables
Two-way     Frequency counts, percentages, tests, and measures of association
            for the crosstabulation of two factors
Multiway    Frequency counts and percentages for series of two-way tables
            stratified by all combinations of values of a third, fourth, etc.,
            table factor


Statistical Background
Tables report results as counts or the number of cases falling in specific categories or
cross-classifications. Categories may be unordered (democrat, republican, and
independent), ordered (low, medium, and high), or formed by defining intervals on a
continuous variable like AGE (child, teen, adult, and elderly).

Making Tables
There are many formats for displaying tabular data. Let's examine basic layouts for
counts and percentages.

One-Way Tables
Here is an example of a table showing the number of people of each gender surveyed
about depression at UCLA in 1980.
    Female    Male       Total
 +-----------------+
 |     152     104 |       256
 +-----------------+

The categorical variable producing this table is SEX$. Sometimes, you may define
categories as intervals of a continuous variable. Here is an example showing the 256
people broken down by age.
   18 to 30  30 to 45  46 to 60   Over 60       Total
 +-------------------------------------+
 |       79        80        64      33 |         256
 +-------------------------------------+

Two-Way Tables
A crosstabulation is a table that displays one cell for every combination of values on
two or more categorical variables. Here is a two-way table that crosses the gender and
age distributions of the tables above.
              Female     Male      Total
            +-------------------+
 18 to 30   |    49        30   |     79
 30 to 45   |    48        32   |     80
 46 to 60   |    38        26   |     64
 Over 60    |    17        16   |     33
            +-------------------+
 Total          152       104        256


This crosstabulation shows relationships between age and gender, which were
invisible in the separate tables. Notice, for example, that the sample contains a large
number of females below the age of 46.

Standardizing Tables with Percentages


As with other statistical procedures such as Correlation, it sometimes helps to have
numbers standardized on a recognizable scale. Correlations vary between -1 and 1, for
example. A convenient scale for table counts is percentage, which varies between 0 and
100.
With tables, you must choose a facet on which to standardize: rows, columns, or
the total count in the table. For example, if we are interested in looking at the difference
between the genders within age groups, we might want to standardize by rows. Here is
that table:
              Female      Male        Total       N
            +--------------------+
 18 to 30   |  62.025    37.975  |   100.000     79
 30 to 45   |  60.000    40.000  |   100.000     80
 46 to 60   |  59.375    40.625  |   100.000     64
 Over 60    |  51.515    48.485  |   100.000     33
            +--------------------+
 Total         59.375    40.625      100.000
 N                152       104                 256

Here we see that as age increases, the sample becomes more evenly dispersed across
the two genders.
On the other hand, if we are interested in the overall distribution of age for each
gender, we might want to standardize within columns:
              Female      Male        Total       N
            +--------------------+
 18 to 30   |  32.237    28.846  |    30.859     79
 30 to 45   |  31.579    30.769  |    31.250     80
 46 to 60   |  25.000    25.000  |    25.000     64
 Over 60    |  11.184    15.385  |    12.891     33
            +--------------------+
 Total        100.000   100.000      100.000
 N                152       104                 256

For each gender, the oldest age group appears underrepresented.
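
The two standardizations can be checked with a few lines of arithmetic. The following Python sketch is only an illustration and is not SYSTAT code; the function names `row_percents` and `column_percents` are our own. It recomputes the row and column percentages from the counts in the two-way table above.

```python
# Counts from the gender-by-age table above (rows: age group; columns: Female, Male).
counts = {
    "18 to 30": (49, 30),
    "30 to 45": (48, 32),
    "46 to 60": (38, 26),
    "Over 60": (17, 16),
}


def row_percents(table):
    """Percent of each row total that falls in each column."""
    return {
        label: tuple(round(100 * n / sum(row), 3) for n in row)
        for label, row in table.items()
    }


def column_percents(table):
    """Percent of each column total that falls in each row."""
    totals = [sum(col) for col in zip(*table.values())]
    return {
        label: tuple(round(100 * n / t, 3) for n, t in zip(row, totals))
        for label, row in table.items()
    }


row_pct = row_percents(counts)
col_pct = column_percents(counts)
for label in counts:
    print(f"{label:<9} row % {row_pct[label]}  column % {col_pct[label]}")
```

Standardizing by rows divides each count by its age-group total, while standardizing by columns divides by the total for each gender, which is why the two tables answer different questions.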


Significance Tests and Measures of Association


After producing a table, you may want to consider a population model that accounts
for the structure you see in the observed table. You should have a population in mind
when you make such inferences. Many published statistical analyses of tables do not
explicitly deal with the sampling problem.

One-Way Tables
A model for these data might be that the proportion of the males and females is equal
in the population. The null hypothesis corresponding to the model is:
$H_0: p_{\text{males}} = p_{\text{females}}$

The sampling model for testing this hypothesis requires that a population contains
equal numbers of males and females and that each member of the population has an
equal chance of being chosen. After choosing each person, we identify the person as
male or female. There is no other category possible and one person cannot fit under
both categories (exhaustive and mutually exclusive).
There is an exact way to test our null hypothesis (called a permutation test). We
can tally every possible sample of size 256 (including one with no females and one
with no males). Then we can sort our samples into two piles: samples in which there
are between 40.625% and 59.375% females and samples in which there are
not. If the latter pile is extremely small relative to the former, we can reject the null
hypothesis.
Needless to say, this would be a tedious undertaking, particularly on a
microcomputer. Fortunately, there is an approximation using a continuous probability
distribution that works quite well. First, we need to calculate the expected count of
males and females, respectively, in a sample of size 256 if p is 0.5. This is 128, or half
the sample N. Next, we subtract the observed counts from these expected counts,
square them, and divide by the expected:

$$\chi^2 = \frac{(152 - 128)^2}{128} + \frac{(104 - 128)^2}{128} = 9$$

If our assumptions about the population and the structure of the table are correct, then
this statistic will be distributed as a mathematical chi-square variable. We can look up
the area under the tail of the chi-square statistic beyond the sample value we calculate
and if this area is small (say, less than 0.05), we can reject the null hypothesis.
To look up the value, we need a degrees of freedom (df) value. This is the number
of independent values being added together to produce the chi-square. In our case, it is
1, since the observed proportion of men is simply 1 minus the observed proportion of
women. If there were three categories (men, women, other?), then the degrees of
freedom would be 2. Anyway, if you look up the value 9 with one degree of freedom
in your chi-square table, you will find that the probability of exceeding this value is
exceedingly small. Thus, we reject our null hypothesis that the proportion of males
equals the proportion of females in the population.
This chi-square approximation is good only for large samples. A popular rule of
thumb is that the expected counts should be greater than 5, although they should be
even greater if you want to be comfortable with your test. With our sample, the
difference between the approximation and the exact result is negligible. For both, the
probability is small.
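
To make the comparison concrete, here is a minimal Python sketch (an illustration only, not part of SYSTAT). It computes the chi-square statistic and its one-degree-of-freedom tail area, then carries out the exact tally under the assumption that the count of females follows a binomial distribution with p = 0.5. The variable names are our own.

```python
from math import comb, erfc, sqrt

observed = {"Female": 152, "Male": 104}
n = sum(observed.values())
expected = n / len(observed)  # 128 in each cell if p = 0.5

# Pearson chi-square: sum of (observed - expected)^2 / expected.
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())

# With 1 degree of freedom the upper-tail area has a closed form.
p_approx = erfc(sqrt(chi_square / 2))

# Exact tally (binomial assumption): all samples at least as lopsided as 152 versus 104.
low, high = min(observed.values()), max(observed.values())
extreme = sum(comb(n, k) for k in range(n + 1) if k <= low or k >= high)
p_exact = extreme / 2 ** n

print(f"chi-square = {chi_square:.3f}")
print(f"approximate p = {p_approx:.4f}, exact p = {p_exact:.4f}")
```

Both probabilities fall well below 0.05, so either route leads to rejecting the hypothesis of equal proportions.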
Our hypothesis test has an associated confidence interval. You can use SYSTAT to
compute this interval on the population data. Here is the result:
95 percent approximate confidence intervals scaled as cell percents
Values for SEX$
    Female    Male
 +-----------------+
 | 66.150   47.687 |
 | 52.064   33.613 |
 +-----------------+

The lower limit for each gender is on the bottom; the upper limits are on the top. Notice
that these two intervals do not overlap.

Two-Way Tables
The most familiar test available for two-way tables is the Pearson chi-square test for
independence of table rows and columns. When the table has only two rows or two
columns, the chi-square test is also a test for equality of proportions. The concept of
interaction in a two-way frequency table is similar to the one in analysis of variance. It
is easiest to see in an example. Schachter (1959) randomly assigned 30 subjects to one
of two groups: High Anxiety (17 subjects), who were told that they would be
experiencing painful shocks, and Low Anxiety (13 subjects), who were told that they
would experience painless shocks. After the assignment, each subject was given the
choice of waiting alone or with the other subjects. The following tables illustrate two
possible outcomes of this study.
                No Interaction              Interaction
                     WAIT                       WAIT
ANXIETY         Alone   Together           Alone   Together
High                8          9               5         12
Low                 6          7               9          4

Notice in the table on the left that the number choosing to wait together relative to those
choosing to wait alone is similar for both High and Low Anxiety groups. In the table on
the right, however, more of the High Anxiety group chose to wait together.
We are interpreting these numbers relatively, so we should compute row
percentages to understand the differences better. Here are the same tables standardized
by rows:
                No Interaction              Interaction
                     WAIT                       WAIT
ANXIETY         Alone   Together           Alone   Together
High             47.1       52.9            29.4       70.6
Low              46.2       53.8            69.2       30.8

Now we can see that the percentages are similar in the two rows in the table on the left
(No Interaction) and quite different in the table on the right (Interaction). A simple
graph reveals these differences even more strongly. In the following figure, the No
Interaction row percentages are plotted on the left.


Notice that the lines cross in the Interaction plot, showing that the rows differ. There
is almost complete overlap in the No Interaction plot.
Now, in the one-way table example above, we tested the hypothesis that the cell
proportions were equal in the population. We can test an analogous hypothesis in this
context: that each of the four cells contains 25 percent of the population. The problem
with this assumption is that we already know that Schachter randomly assigned more
people to the High Anxiety group. In other words, we should take the row marginal
percentages (or totals) as fixed when we determine what proportions to expect in the
cells from a random model.
Our No Interaction model is based on these fixed marginals. In fact, we can fix
either the row or column margins to compute a No Interaction model because the total
number of subjects is fixed at 30. You can verify that the row and column sums in the
above tables are the same.
Now we are ready to compute our chi-square test of interaction (often called a test
of independence) in the two-way table by using the No Interaction counts as expected
counts in our chi-square formula above. This time, our degrees of freedom are still 1
because the marginal counts are fixed. If you know the marginal counts, then one cell
count determines the remaining three. In general, the degrees of freedom for this test
are (rows - 1) times (columns - 1).
Here is the result of our chi-square test. The chi-square is 4.693, with a p of 0.03.
On this basis, we reject our No Interaction hypothesis.
ANXIETY (rows) by WAIT$ (columns)
              Alone   Together       Total
          +-------------------+
High      |   5.000    12.000 |     17.000
Low       |   9.000     4.000 |     13.000
          +-------------------+
Total        14.000    16.000       30.000

Test statistic                      Value        df      Prob
Pearson Chi-square                  4.693     1.000     0.030
Likelihood ratio Chi-square         4.810     1.000     0.028
McNemar Symmetry Chi-square         0.429     1.000     0.513
Yates corrected Chi-square          3.229     1.000     0.072
Fisher exact test (two-tail)                            0.063

Actually, we cheated. The program computed the expected counts from the observed
data. These are not exactly the ones we showed you in the No Interaction table. They
differ by rounding error in the first decimal place. You can compute them exactly. The
popular method is to multiply the total row count by the total column count
corresponding to a cell and divide by the total sample size. For the upper left cell,
this would be 17*14/30 = 7.93.


There is one other interesting problem with these data. The chi-square is only an
approximation and it does not work well for small samples. Although these data meet
the minimum expected count of 5, they are nevertheless problematic. Look at the
Fisher's exact test result in the output above. Like our permutation test above, which
was so cumbersome for large data files, Fisher's test counts all possible outcomes
exactly, including the ones that produce interaction greater than what we observed. The
Fisher exact test p value is not significant (0.063). On this basis, we could not reject
the null hypothesis of no interaction, or independence.
Yates' chi-square test in the output is an attempt to adjust the Pearson chi-square
statistic for small samples. While it has come into disfavor for being unnecessarily
conservative in many instances, nevertheless, the Yates p value is consistent with
Fisher's in this case (0.072). Likelihood-ratio chi-square is an alternative to the
Pearson chi-square and is used as a test statistic for log linear models.
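
The statistics discussed above can be reproduced from the four observed counts. The following Python sketch is an illustration using textbook formulas, not SYSTAT's own code; the helper names `p_one_df` and `fisher_two_tail` are our own. For Fisher's test it assumes the common two-tail rule of summing the probabilities of every table, with margins fixed, that is no more likely than the observed one.

```python
from math import comb, erfc, log, sqrt

# Observed counts from the Interaction table (rows: High, Low; columns: Alone, Together).
table = [[5, 12], [9, 4]]
rows = [sum(r) for r in table]        # row totals: 17, 13
cols = [sum(c) for c in zip(*table)]  # column totals: 14, 16
n = sum(rows)                         # 30 subjects

# Expected count for a cell = row total * column total / sample size.
expected = [[r * c / n for c in cols] for r in rows]
cells = [(table[i][j], expected[i][j]) for i in range(2) for j in range(2)]

pearson = sum((o - e) ** 2 / e for o, e in cells)
likelihood_ratio = 2 * sum(o * log(o / e) for o, e in cells)
yates = sum((abs(o - e) - 0.5) ** 2 / e for o, e in cells)


def p_one_df(x):
    """Upper-tail area of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(x / 2))


def fisher_two_tail(tab):
    """Sum the probabilities of all tables (margins fixed) no more likely than tab."""
    (a, b), (c, d) = tab
    r1, c1, total = a + b, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability that the upper left cell equals k.
        return comb(r1, k) * comb(total - r1, c1 - k) / comb(total, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - (total - r1)), min(r1, c1)
    # The small tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


fisher = fisher_two_tail(table)
print(f"expected upper left cell = {expected[0][0]:.2f}")
print(f"Pearson          {pearson:.3f}  p = {p_one_df(pearson):.3f}")
print(f"Likelihood ratio {likelihood_ratio:.3f}  p = {p_one_df(likelihood_ratio):.3f}")
print(f"Yates            {yates:.3f}  p = {p_one_df(yates):.3f}")
print(f"Fisher exact (two-tail) p = {fisher:.3f}")
```

Under these assumptions the sketch agrees with the printed output to three decimals, which makes it easy to see how the continuity correction and the exact test pull the probability above 0.05.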
Selecting a Test or Measure

Other tests and measures are appropriate for specific table structures and also depend
on whether or not the categories of the factor are ordered. We use 2 × 2 to denote a
table with two rows and two columns, and r × c for a table with r rows and c columns.
The Pearson and likelihood-ratio chi-square statistics apply to r × c tables;
categories need not be ordered.
McNemar's test of symmetry is used for r × r square tables (the number of rows
equals the number of columns). This structure arises when the same subjects are
measured twice as in a paired comparisons t test (say before and after an event) or when
subjects are paired or matched (cases and controls). So the row and column categories
are the same, but they are measured at different times or circumstances (like the paired
t) or for different groups of subjects (cases and controls). This test ignores the counts
along the diagonal of the table and tests whether the counts in cells above the diagonal
differ from those below the diagonal. A significant result indicates a greater change in
one direction than another. (The counts along the diagonal are for subjects who did not
change.)
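
For a 2 × 2 table the symmetry test reduces to a comparison of the two off-diagonal counts. The Python sketch below is an illustration with a function name of our own choosing, not SYSTAT code. It applies the usual formula to the High/Low by Alone/Together counts shown earlier, only to show where the value printed on the McNemar line of that output comes from; those data are not actually paired.

```python
from math import erfc, sqrt


def mcnemar_2x2(table):
    """McNemar symmetry chi-square for a 2 x 2 table; diagonal counts are ignored."""
    (_, b), (c, _) = table
    statistic = (b - c) ** 2 / (b + c)
    p_value = erfc(sqrt(statistic / 2))  # chi-square tail area, 1 degree of freedom
    return statistic, p_value


statistic, p_value = mcnemar_2x2([[5, 12], [9, 4]])
print(f"McNemar chi-square = {statistic:.3f}, p = {p_value:.3f}")
```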
The table structure for Cohen's kappa looks like that of McNemar's in that the row
and column categories are the same. But here the focus shifts to the diagonal: Are the
counts along the diagonal significantly greater than those expected by chance alone?
Because each subject is classified or rated twice, kappa is a measure of interrater
agreement.
Another difference between McNemar and kappa is that the former is a test with
a chi-square statistic, degrees of freedom, and an associated p value, while the latter is
a measure. Its size is judged by using an asymptotic standard error to construct a
t statistic (that is, measure divided by standard error) to test whether kappa differs from
0. Values of kappa greater than 0.75 indicate strong agreement beyond chance, values
between 0.40 and 0.75 indicate fair to good agreement, and values below 0.40 indicate
poor agreement.
Phi, Cramér's V, and the contingency coefficient are measures suitable for testing
independence of table factors as you would with Pearson's chi-square. They are designed
for comparing results of r × c tables with different sample sizes. (Note that the expected
value of the Pearson chi-square is proportional to the total table size.) The three measures
are scaled differently, but all test the same null hypothesis. Use the probability printed
with the Pearson chi-square to test that these measures are zero. For tables with two rows
and two columns (a 2 × 2 table), phi and Cramér's V are the same.
Five of the measures for two-way tables are appropriate when both categorical
variables have ordered categories (always, sometimes, never or none, minimal,
moderate, severe). These are Goodman-Kruskal's gamma, Kendall's tau-b, Stuart's
tau-c, Spearman's rho, and Somers' d. The first three measures differ only in how ties
are treated; the fourth is like the usual Pearson correlation except that the rank order of
each value is used in the computations instead of the value itself. Somers' d is an
asymmetric measure: in SYSTAT, the column variable is considered to be the
dependent variable.
For 2 × 2 tables, Fisher's exact test (if n ≤ 50) and Yates' corrected chi-square are
also printed. When expected cell sizes are small in a 2 × 2 table (any expected value
less than 5), use Fisher's exact test as described above.
In larger contingency tables, we do not want to see any expected values less than 1.0
or more than 20% of the values less than 5. For large tables with too many small
expected values, there is no remedy except to combine categories or possibly omit a
category that has very few observations.
Yule's Q and Yule's Y measure dominance in a 2 × 2 table. If either off-diagonal
cell is 0, both statistics are equal to 1 (otherwise they are less than 1). These statistics
are 0 if and only if the chi-square statistic is 0. Therefore, the null hypothesis that the
measure is 0 can be tested by the chi-square test.
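
As a rough illustration of how the chi-square based measures and Yule's statistics relate to the cell counts, the following Python sketch applies the usual textbook definitions to the 2 × 2 anxiety table. These formulas are an assumption on our part; SYSTAT's output may differ in sign conventions or scaling.

```python
from math import sqrt

# Observed counts (rows: High, Low; columns: Alone, Together).
(a, b), (c, d) = [[5, 12], [9, 4]]
n = a + b + c + d
n_rows, n_cols = 2, 2

# Shortcut form of the Pearson chi-square for a 2 x 2 table.
chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

phi = sqrt(chi_square / n)
cramers_v = sqrt(chi_square / (n * min(n_rows - 1, n_cols - 1)))
contingency = sqrt(chi_square / (chi_square + n))

odds_ratio = (a * d) / (b * c)
yule_q = (a * d - b * c) / (a * d + b * c)
yule_y = (sqrt(a * d) - sqrt(b * c)) / (sqrt(a * d) + sqrt(b * c))

print(f"phi = {phi:.3f}, Cramer's V = {cramers_v:.3f}, contingency = {contingency:.3f}")
print(f"odds ratio = {odds_ratio:.3f}, Yule's Q = {yule_q:.4f}, Yule's Y = {yule_y:.4f}")
```

Because min(rows - 1, columns - 1) is 1 for a 2 × 2 table, phi and Cramér's V coincide here, and the contingency coefficient stays below 1 because its denominator always exceeds the chi-square.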


Crosstabulations in SYSTAT
One-Way Frequency Tables Main Dialog Box
To open the One-Way Frequency Tables dialog box, from the menus choose:
Statistics
Tables
Crosstabs
One-way

One-way frequency tables provide frequency counts, percentages, tests, etc., for
single table factors or categorical variables.
• Tables. Tables can include frequency counts, percentages, and confidence
  intervals. You can specify any confidence level between 0 and 1.
• Pearson chi-square. Tests the equality of the cell frequencies. This test assumes all
  categories are equally likely.
• Options. You can include a category for cases with missing data. SYSTAT treats
  this category in the same fashion as the other categories. In addition, you can
  display output in a listing format instead of a tabular display. The listing includes
  counts, cumulative counts, percentages, and cumulative percentages.
• Save last table as data file. Saves the table for the last variable in the Variable(s) list
  as a SYSTAT data file.

Two-Way Frequency Tables Main Dialog Box


To open the Two-Way Frequency Tables dialog box, from the menus choose:
Statistics
Tables
Crosstabs
Two-way

Two-way frequency tables crosstabulate one or more categorical row variables with a
categorical column variable.
• Row variable(s). The variables displayed in the rows of the crosstabulation. Each
  row variable is crosstabulated with the column variable.
• Column variable. The variable displayed in the columns of the crosstabulation. The
  column variable is crosstabulated with each row variable.
• Tables. Tables can include frequency counts, percentages (row, column, or total),
  expected counts, deviates (Observed-Expected), and standardized deviates
  (Observed-Expected) / SQR (Expected).
• Options. You can include counts and percentages for cases with missing data. In
  addition, you can display output in a listing format instead of a tabular display. The
  listing includes counts, cumulative counts, percentages, and cumulative
  percentages for each combination of row and column variable categories.
• Save last table as data file. Saves the crosstabulation of the column variable with
  the last variable in the row variable(s) list as a SYSTAT data file. For each cell of
  the table, SYSTAT saves a record with the cell frequency and the row and column
  category values.

Two-Way Frequency Tables Statistics


A wide variety of statistics is available for testing the association between variables in
a crosstabulation. Each statistic is appropriate for a particular table structure (rows by
columns), and a few assume that categories are ordered (ordinal data).

Pearson chi-square. For tables with any number of rows and columns, tests for
independence of the row and column variables.
2 x 2 tables. For tables with two rows and two columns, available tests are:
• Yates' corrected chi-square. Adjusts the Pearson chi-square statistic for small
  samples.
• Fisher's exact test. Counts all possible outcomes exactly. When the expected cell
  sizes are small (less than 5), use this test as an alternative to the Pearson chi-square.
• Odds ratio. A measure of association in which a value near 1 indicates no relation
  between the variables.
• Yule's Q and Y. Measures of association in which values near -1 or +1 indicate a
  strong relation. Values near 0 indicate no relation. Yule's Y is less sensitive to
  differences in the margins of the table than Q.
2 x k tables. For tables with only two rows and any number of ordered column
categories (or vice versa), Cochran's test of linear trend is available to reveal whether
proportions increase (or decrease) linearly across the ordered categories.
r x r tables. For square tables, available tests include:
• McNemar's test for symmetry. Used for paired (or matched) variables. Tests whether
  the counts above the table diagonal differ from those below the diagonal. Small
  probability values indicate a greater change in one direction.
• Cohen's kappa. Commonly used to measure agreement between two judges rating
  the same objects. Tests whether the diagonal counts are larger than expected.
  Values of kappa greater than 0.75 indicate strong agreement beyond chance, values
  between 0.40 and 0.75 indicate fair to good, and values below 0.40 indicate poor
  agreement.
r x c tables, unordered levels. For tables with any number of rows or columns with no
assumed category order, available tests are:
• Phi. A chi-square based measure of association. Values may exceed 1.
• Cramér's V. A measure of association based on the chi-square. The value ranges
  between 0 and 1, with 0 indicating independence between the row and column
  variables and values close to 1 indicating dependence between the variables.
• Contingency coefficient. A measure of association based on the chi-square. Similar
  to Cramér's V, but values of 1 cannot be attained.
• Uncertainty coefficient and Goodman-Kruskal's lambda. Measures of association that
  indicate the proportional reduction in error when values of one variable are used to
  predict values of the other variable. Values near 0 indicate that the row variable is
  no help in predicting the column variable.
• Likelihood-ratio chi-square. An alternative to the Pearson chi-square, primarily
  used as a test statistic for loglinear models.


r x c tables, ordered levels. For tables with any number of rows or columns in which
categories for both variables represent ordered levels (for example, low, medium,
high), available tests are:
• Spearman's rho. Similar to the Pearson correlation coefficient, but uses the ranks of
  the data rather than the actual values.
• Goodman-Kruskal's gamma, Kendall's tau-b, and Stuart's tau-c. Measures of
  association between two ordinal variables that range between -1 and +1, differing
  only in the method of dealing with ties. Values close to 0 indicate little or no
  relationship.
• Somers' d. An asymmetric measure of association between two ordinal variables
  that ranges from -1 to +1. Values close to -1 or +1 indicate a strong relationship
  between the variables. The column variable is treated as the dependent variable.

Multiway Frequency Tables Main Dialog Box


Multiway frequency tables provide frequency counts and percentages for series of
two-way tables stratified by all combinations of values of a third, fourth, etc., table factor.
To open the Multiway Frequency Tables dialog box, from the menus choose:
Statistics
Tables
Crosstabs
Multiway

• Row variable. The variable displayed in the rows of the crosstabulation.
• Column variable. The variable displayed in the columns of the crosstabulation.
• Strata variable(s). If strata are separate, a separate crosstabulation is produced for
  each value of each strata variable. If strata are crossed, a separate crosstabulation
  is produced for each unique combination of strata variable values. For example, if
  you have two strata variables, each with five categories, Separate will produce 10
  tables and Crossed will produce 25 tables.
• Options. You can include counts and percentages for cases with missing data and
  save the last table produced as a SYSTAT data file. In addition, you can display
  output in a listing format, including percentages and cumulative percentages,
  instead of a tabular display.
• Display. You can display frequencies, total percentages, row percentages, and
  column percentages. Furthermore, you can use the Mantel-Haenszel test for 2 × 2
  subtables to test for an association between two binary variables while controlling
  for another variable.

Using Commands
For one-way tables in XTAB, specify:
XTAB
USE filename
PRINT / FREQ CHISQ LIST PERCENT ROWPCT COLPCT
TABULATE varlist / CONFI=n MISS

For two-way tables in XTAB, specify:


XTAB
USE filename
PRINT / FREQ CHISQ LRCHI YATES FISHER ODDS YULE COCHRAN,
MCKEM KAPPA PHI CRAMER CONT UNCE LAMBDA RHO GAMMA,
TAUB TAUC SOMERS EXPECT DEVI STAND LIST PERCENT,
ROWPCT COLPCT
TABULATE rowvar * colvar / MISS

For multiway tables in XTAB, specify:


XTAB
USE filename
PRINT / FREQ MANTEL LIST PERCENT ROWPCT COLPCT
TABULATE varlist * rowvar * colvar / MISS


Usage Considerations
Types of data. There are two ways to organize data for tables:
• The usual cases-by-variables rectangular data file
• Cell counts with cell identifiers

For example, you may want to analyze the following table reflecting application results
by gender for business schools:
            Admitted    Denied
Male             420        90
Female           150        25

A cases-by-variables data file has the following form:


PERSON    GENDER$    STATUS$
1         female     admit
2         male       deny
3         male       admit
(etc.)
684       female     deny
685       male       admit

Instead of entering one case for each of the 685 applicants, you could use the second
method to enter four cases:
GENDER$    STATUS$    COUNT
male       admit        420
male       deny          90
female     admit        150
female     deny          25

For this method, the cell counts in the third column are identified by designating
COUNT as a FREQUENCY variable.
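
The equivalence of the two layouts can be sketched outside SYSTAT. The Python fragment below is an illustration only, with variable names of our own. It expands the four aggregated records into one record per applicant and then tallies them back into cell counts.

```python
from collections import Counter

# Aggregated layout: one record per cell, with the count as a frequency variable.
cells = [
    ("male", "admit", 420),
    ("male", "deny", 90),
    ("female", "admit", 150),
    ("female", "deny", 25),
]

# Repeating each cell COUNT times gives the cases-by-variables layout.
cases = [(gender, status) for gender, status, count in cells for _ in range(count)]

# Tabulating the expanded cases recovers the original cell counts.
tally = Counter(cases)

print(f"{len(cases)} cases")
for (gender, status), count in tally.items():
    print(f"{gender:<7} {status:<6} {count}")
```

Either layout leads to the same table; the aggregated form simply stores each distinct combination once.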
Print options. Three levels of output are available. Statistics produced depend on the
dimensionality of the table. PRINT SHORT yields frequency tables for all tables and
Pearson chi-square for one-way and two-way tables. The MEDIUM length yields all
statistics appropriate for the dimensionality of a two-way or multiway table. LONG
adds expected cell values, deviates, and standardized deviates to the SHORT and
MEDIUM output.
Quick Graphs. Frequency tables produce no Quick Graphs.
Saving files. You can save the frequency counts to a file. For two-way tables, cell
values, deviates, and standardized deviates are also saved.
BY groups. Use of a BY variable yields separate frequency tables (and corresponding
statistics) for each level of the BY variable.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. XTAB uses the FREQUENCY variable to duplicate cases. This is the
preferred method of input when the data are aggregated.
Case weights. WEIGHT is available for frequency tables.

Examples
Example 1
One-Way Tables
This example uses questionnaire data from a community survey (Afifi and Clark,
1984). The SURVEY2 data file includes a record (case) for each of the 256 subjects in
the sample. We request frequencies for gender, marital status, and religion. The values
of these variables are numbers, so we add character identifiers for the categories. The
input is:
USE survey2
XTAB
LABEL sex      / 1='Male', 2='Female'
LABEL marital  / 1='Never', 2='Married', 3='Divorced',
                 4='Separated'
LABEL religion / 1='Protestant', 2='Catholic', 3='Jewish',
                 4='None', 6='Other'
PRINT NONE / FREQ
TABULATE sex marital religion

If the words male and female were stored in the variable SEX$, you would omit LABEL
and tabulate SEX$ directly. If you omit LABEL and specify SEX, the numbers would
label the output.

• When using the Label dialog box, you can omit quotation marks around category
  names. With commands, you can omit them if the name has no embedded blanks
  or symbols (the name, however, is displayed in uppercase letters).
The output follows:
Frequencies
Values for SEX
      Male  Female       Total
 +---------------+
 |     104     152 |       256
 +---------------+

Frequencies
Values for MARITAL
      Never   Married  Divorced Separated       Total
 +-----------------------------------------+
 |       73       127        43        13 |       256
 +-----------------------------------------+

Frequencies
Values for RELIGION
   Protestant   Catholic     Jewish       None      Other       Total
 +--------------------------------------------------------+
 |        133         46         23         52          2 |       256
 +--------------------------------------------------------+

In this sample of 256 subjects, 152 are females, 127 are married, and 133 are
Protestants.

List Layout
List layout produces an alternative layout for the same information. Percentages and
cumulative percentages are part of the display. The input is:
USE survey2
XTAB
LABEL sex      / 1='Male', 2='Female'
LABEL marital  / 1='Never', 2='Married', 3='Divorced',
                 4='Separated'
LABEL religion / 1='Protestant', 2='Catholic', 3='Jewish',
                 4='None', 6='Other'
PRINT NONE / LIST
TABULATE sex marital religion
PRINT


You can also use TABULATE varlist / LIST as an alternative to PRINT NONE / LIST. The
output follows:
           Cum            Cum
 Count   Count    Pct     Pct  SEX
  104.    104.   40.6    40.6  Male
  152.    256.   59.4   100.0  Female

           Cum            Cum
 Count   Count    Pct     Pct  MARITAL
   73.     73.   28.5    28.5  Never
  127.    200.   49.6    78.1  Married
   43.    243.   16.8    94.9  Divorced
   13.    256.    5.1   100.0  Separated

           Cum            Cum
 Count   Count    Pct     Pct  RELIGION
  133.    133.   52.0    52.0  Protestant
   46.    179.   18.0    69.9  Catholic
   23.    202.    9.0    78.9  Jewish
   52.    254.   20.3    99.2  None
    2.    256.     .8   100.0  Other

Almost 60% (59.4) of the subjects are female, approximately 50% (49.6) are married,
and more than half (52%) are Protestants.
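
The four columns of the list layout are simple running totals. As an illustration only (the function name `list_layout` is ours, not a SYSTAT command), the following Python sketch rebuilds the MARITAL listing from its frequencies.

```python
def list_layout(frequencies):
    """Return (count, cumulative count, percent, cumulative percent, label) rows."""
    total = sum(count for _, count in frequencies)
    rows, running = [], 0
    for label, count in frequencies:
        running += count
        rows.append((
            count,
            running,
            round(100 * count / total, 1),
            round(100 * running / total, 1),
            label,
        ))
    return rows


marital = [("Never", 73), ("Married", 127), ("Divorced", 43), ("Separated", 13)]
listing = list_layout(marital)
for count, cum, pct, cum_pct, label in listing:
    print(f"{count:>5}. {cum:>5}. {pct:>5.1f} {cum_pct:>5.1f}  {label}")
```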

Example 2
Two-Way Tables
This example uses the SURVEY2 data to crosstabulate marital status against religion.
The input is:
USE survey2
XTAB
LABEL marital  / 1='Never', 2='Married', 3='Divorced',
                 4='Separated'
LABEL religion / 1='Protestant', 2='Catholic', 3='Jewish',
                 4='None', 6='Other'
PRINT NONE / FREQ
TABULATE marital * religion

The table follows:


Frequencies
MARITAL (rows) by RELIGION (columns)
             Protestant   Catholic     Jewish       None      Other       Total
           +--------------------------------------------------------+
Never      |         29         16          8         20          0 |        73
Married    |         75         21         11         19          1 |       127
Divorced   |         21          6          3         13          0 |        43
Separated  |          8          3          1          0          1 |        13
           +--------------------------------------------------------+
Total               133         46         23         52          2         256


In the sample of 256 people, 73 never married. Of the people that have never married,
29 are Protestants (the cell in the upper left corner), and none are in the Other category
(their religion is not among the first four categories). The Totals (or marginals) along
the bottom row and down the far right column are the same as the values displayed for
one-way tables.

Omitting Sparse Categories


There are only two counts in the last column, and the counts in the last row are fairly
sparse. It is easy to omit rows and/or columns. You can:
• Omit the category codes from the LABEL request.
• Select cases to use.

Note that LABEL and SELECT remain in effect until you turn them off. If you request
several different tables, use SELECT to ensure that the same cases are used in all tables.
The subset of cases selected via LABEL applies only to those tables that use the
variables specified with LABEL. To turn off the LABEL specification for RELIGION, for
example, specify:
LABEL religion

We continue from the last table, eliminating the last category codes for MARITAL and
RELIGION:
SELECT marital <> 4 AND religion <> 6
TABULATE marital * religion
SELECT

The table is:


Frequencies
MARITAL (rows) by RELIGION (columns)
             Protestant   Catholic     Jewish       None       Total
           +---------------------------------------------+
Never      |         29         16          8         20 |        73
Married    |         75         21         11         19 |       126
Divorced   |         21          6          3         13 |        43
           +---------------------------------------------+
Total               125         43         22         52         242
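
The effect of dropping the sparse row and column on the marginal totals can be verified directly. The following Python sketch is an illustration, not SYSTAT code, and the helper name `drop_categories` is our own. It removes the Separated row and the Other column from the full table and recomputes the totals.

```python
# Cell counts from the full MARITAL by RELIGION table shown earlier.
religions = ["Protestant", "Catholic", "Jewish", "None", "Other"]
counts = {
    "Never": [29, 16, 8, 20, 0],
    "Married": [75, 21, 11, 19, 1],
    "Divorced": [21, 6, 3, 13, 0],
    "Separated": [8, 3, 1, 0, 1],
}


def drop_categories(table, columns, omit_rows, omit_columns):
    """Return the table and column names with the named rows and columns removed."""
    keep = [j for j, name in enumerate(columns) if name not in omit_columns]
    kept_columns = [columns[j] for j in keep]
    kept_table = {
        row: [cells[j] for j in keep]
        for row, cells in table.items()
        if row not in omit_rows
    }
    return kept_table, kept_columns


subtable, kept = drop_categories(counts, religions, {"Separated"}, {"Other"})
row_totals = {row: sum(cells) for row, cells in subtable.items()}
column_totals = [sum(col) for col in zip(*subtable.values())]
grand_total = sum(row_totals.values())

print(kept, column_totals)
print(row_totals, grand_total)
```

Note that the Married total falls from 127 to 126 because one married respondent was in the omitted Other category.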


List Layout
Following is the panel for marital status crossed with religious preference:
USE survey2
XTAB
LABEL marital  / 1='Never', 2='Married', 3='Divorced'
LABEL religion / 1='Protestant', 2='Catholic', 3='Jewish',
                 4='None'
PRINT NONE / LIST
TABULATE marital * religion
PRINT

The listing is:


            Cum            Cum
  Count   Count    Pct     Pct   MARITAL    RELIGION
    29.     29.   12.0    12.0   Never      Protestant
    16.     45.    6.6    18.6   Never      Catholic
     8.     53.    3.3    21.9   Never      Jewish
    20.     73.    8.3    30.2   Never      None
    75.    148.   31.0    61.2   Married    Protestant
    21.    169.    8.7    69.8   Married    Catholic
    11.    180.    4.5    74.4   Married    Jewish
    19.    199.    7.9    82.2   Married    None
    21.    220.    8.7    90.9   Divorced   Protestant
     6.    226.    2.5    93.4   Divorced   Catholic
     3.    229.    1.2    94.6   Divorced   Jewish
    13.    242.    5.4   100.0   Divorced   None
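
The four numeric columns of a list layout can be checked by hand. The following Python sketch is not SYSTAT code; the function name list_layout is hypothetical, and the rounding to one decimal place is an assumption chosen to match the listing. It rebuilds Count, Cum Count, Pct, and Cum Pct from the twelve cell counts of the MARITAL by RELIGION table.

```python
from itertools import accumulate

# Cell counts of the MARITAL by RELIGION table, read row by row
# (Never, Married, Divorced) x (Protestant, Catholic, Jewish, None).
counts = [29, 16, 8, 20, 75, 21, 11, 19, 21, 6, 3, 13]


def list_layout(cell_counts):
    """Return (count, cum_count, pct, cum_pct) for each non-empty cell."""
    total = sum(cell_counts)
    rows = []
    for count, cum in zip(cell_counts, accumulate(cell_counts)):
        if count == 0:
            continue  # a list layout has no entry for an empty cell
        pct = round(100 * count / total, 1)
        cum_pct = round(100 * cum / total, 1)
        rows.append((count, cum, pct, cum_pct))
    return rows


rows = list_layout(counts)
for row in rows:
    print("%6d %6d %6.1f %6.1f" % row)
```

Each percentage is taken against the table total (242 here), so the last Cum Pct entry is always 100.0.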

Example 3
Frequency Input
Crosstabs, like other SYSTAT procedures, reads cases-by-variables data from a
SYSTAT file. However, if you want to analyze a table from a report or a journal article,
you can enter the cell counts directly. This example uses counts from a four-way table
of a breast cancer study of 764 women. The data are from Morrison et al. (1973), cited
in Bishop, Fienberg, and Holland (1975). There is one record for each of the 72 cells
in the table, with the count (NUMBER) of women in the cell and codes or category
names to identify their age group (under 50, 50 to 69, and 70 or over), treatment center
(Tokyo, Boston, or Glamorgan), survival status (dead or alive), and tumor diagnosis
(minimal inflammation and benign, maximum inflammation and benign, minimal
inflammation and malignant, and maximum inflammation and malignant). This
example illustrates how to form a two-way table of AGE by CENTER$.


The input is:


USE cancer
XTAB
FREQ = number
LABEL age / 50=Under 50, 60=50 to 69, 70=70 & Over
TABULATE center$ * age

The resulting two-way table is:


Frequencies
CENTER$ (rows) by AGE (columns)
            Under 50  50 to 69  70 & Over      Total
          +-------------------------------+
Boston    |       58       122         73 |      253
Glamorgn  |       71       109         41 |      221
Tokyo     |      151       120         19 |      290
          +-------------------------------+
Total            280       351        133        764

Test statistic                  Value        df      Prob
Pearson Chi-square             74.039     4.000     0.000

Of the 764 women studied, 290 were treated in Tokyo. Of these women, 151 were in
the youngest age group, and 19 were in the 70 or over age group.

Example 4
Missing Category Codes
You can choose whether or not to include a separate category for missing codes. For
example, if some subjects did not check male or female on a form, there would be
three categories for SEX$: male, female, and blank (missing). By default, when values
of a table factor are missing, SYSTAT does not include a category for missing values.
In the OURWORLD data file, some countries did not report the GNP to the United
Nations. In this example, we first include a category for missing values and then follow
this request with a table that omits the category for missing. The input follows:
USE ourworld
XTAB
TABULATE group$ * gnp$ / MISS
LABEL gnp$ / D=Developed, U=Emerging
TABULATE group$ * gnp$


The tables are:


Frequencies
GROUP$ (rows) by GNP$ (columns)
                             D         U      Total
          +----------------------------+
Europe    |        3        17         0 |       20
Islamic   |        2         4        10 |       16
NewWorld  |        1        15         5 |       21
          +----------------------------+
Total              6        36        15         57

Frequencies
GROUP$ (rows) by GNP$ (columns)
           Developed  Emerging      Total
          +---------------------+
Europe    |       17         0 |       17
Islamic   |        4        10 |       14
NewWorld  |       15         5 |       20
          +---------------------+
Total             36        15         51

List Layout
To create a listing of the counts in each cell of the table:
PRINT / LIST
TAB group$ * gnp$
PRINT

The output is:


            Cum            Cum
  Count   Count    Pct     Pct   GROUP$     GNP$
    17.     17.   33.3    33.3   Europe     Developed
     4.     21.    7.8    41.2   Islamic    Developed
    10.     31.   19.6    60.8   Islamic    Emerging
    15.     46.   29.4    90.2   NewWorld   Developed
     5.     51.    9.8   100.0   NewWorld   Emerging

Note that there is no entry for the empty cell.

Example 5
Percentages
Percentages are helpful for describing categorical variables and interpreting relations
between table factors. Crosstabs prints tables of percentages in the same layout as
described for frequency counts. That is, each frequency count is replaced by the
percentage. Percentages are computed by dividing each cell frequency by:


• The total frequency in its row (row percents)
• The total frequency in its column (column percents)
• The total table frequency or sample size (table percents)
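
As a cross-check on the three definitions, here is a minimal Python sketch (not SYSTAT code; the function name percents is hypothetical) that computes row, column, and table percents for the GROUP$ by GNP$ counts shown in the previous example, with the missing category omitted.

```python
def percents(table):
    """Return (row_pct, col_pct, table_pct) for a two-way table of counts.

    Assumes every row total and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    row_pct = [[100 * cell / rt for cell in row]
               for row, rt in zip(table, row_totals)]
    col_pct = [[100 * cell / ct for cell, ct in zip(row, col_totals)]
               for row in table]
    table_pct = [[100 * cell / grand_total for cell in row] for row in table]
    return row_pct, col_pct, table_pct


# Rows: Europe, Islamic, NewWorld; columns: Developed, Emerging.
counts = [[17, 0], [4, 10], [15, 5]]
row_pct, col_pct, table_pct = percents(counts)

print(round(row_pct[1][1], 3))    # Islamic, Emerging as a share of its row
print(round(col_pct[1][1], 3))    # Islamic, Emerging as a share of its column
print(round(table_pct[1][1], 3))  # Islamic, Emerging as a share of the table
```

The same cell count (10 Islamic emerging nations) yields three different percentages depending on the denominator, which is why the choice of percent type matters for interpretation.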

In this example, we request all three percentages using the following input:
USE ourworld
XTAB
LABEL gnp$ / 'D'='Developed', 'U'='Emerging'
PRINT NONE / ROWP COLP PERCENT
TABULATE group$ * gnp$

The output is:


Percents of total count
GROUP$ (rows) by GNP$ (columns)
           Developed  Emerging      Total        N
          +---------------------+
Europe    |   33.333     0.0   |   33.333       17
Islamic   |    7.843    19.608 |   27.451       14
NewWorld  |   29.412     9.804 |   39.216       20
          +---------------------+
Total         70.588    29.412    100.000
N                 36        15                  51

Row percents
GROUP$ (rows) by GNP$ (columns)
           Developed  Emerging      Total        N
          +---------------------+
Europe    |  100.000     0.0   |  100.000       17
Islamic   |   28.571    71.429 |  100.000       14
NewWorld  |   75.000    25.000 |  100.000       20
          +---------------------+
Total         70.588    29.412    100.000
N                 36        15                  51

Column percents
GROUP$ (rows) by GNP$ (columns)
           Developed  Emerging      Total        N
          +---------------------+
Europe    |   47.222     0.0   |   33.333       17
Islamic   |   11.111    66.667 |   27.451       14
NewWorld  |   41.667    33.333 |   39.216       20
          +---------------------+
Total        100.000   100.000    100.000
N                 36        15                  51


Missing Categories
Notice how the row percentages change when we include a category for the missing
GNP:
PRINT NONE / ROWP
LABEL gnp$ / =Missing, D=Developed, U=Emerging
TABULATE group$ * gnp$
PRINT

The new table is:


Row percents
GROUP$ (rows) by GNP$ (columns)
             MISSING Developed  Emerging      Total        N
          +-------------------------------+
Europe    |   15.000    85.000     0.0   |  100.000       20
Islamic   |   12.500    25.000    62.500 |  100.000       16
NewWorld  |    4.762    71.429    23.810 |  100.000       21
          +-------------------------------+
Total         10.526    63.158    26.316    100.000
N                  6        36        15                  57

Here we see that 62.5% of the Islamic nations are classified as emerging. However,
from the earlier table of row percentages, it might be better to say that among the
Islamic nations reporting the GNP, 71.43% are emerging.

Example 6
Multiway Tables
When you have three or more table factors, Crosstabs forms a series of two-way tables
stratified by all combinations of values of the third, fourth, and so on, table factors. The
order in which you choose the table factors determines the layout. Your input can be
the usual cases-by-variables data file or the cell counts with category values.
The input is:
USE cancer
XTAB
FREQ = number
LABEL age / 50=Under 50, 60=50 to 69,
            70=70 & Over
ORDER center$ / SORT=none
ORDER tumor$ / SORT=MinBengn, MaxBengn, MinMalig,
MaxMalig
TABULATE survive$ * tumor$ * center$ * age


The last two factors selected (CENTER$ and AGE) define two-way tables. The levels
of the first two factors define the strata. After the tables were produced, we edited the
output and moved the four tables for SURVIVE$ = Dead next to those for Alive.
Frequencies
CENTER$ (rows) by AGE (columns)

SURVIVE$ = Alive
TUMOR$   = MinBengn
            Under 50  50 to 69  70 & Over      Total
          +-------------------------------+
Tokyo     |       68        46          6 |      120
Boston    |       24        58         26 |      108
Glamorgn  |       20        39         11 |       70
          +-------------------------------+
Total            112       143         43        298

SURVIVE$ = Dead
TUMOR$   = MinBengn
            Under 50  50 to 69  70 & Over      Total
          +-------------------------------+
Tokyo     |        7         9          3 |       19
Boston    |        7        20         18 |       45
Glamorgn  |        7        12          7 |       26
          +-------------------------------+
Total             21        41         28         90

SURVIVE$ = Alive
TUMOR$   = MaxBengn
            Under 50  50 to 69  70 & Over      Total
          +-------------------------------+
Tokyo     |        9         5          1 |       15
Boston    |        0         3          1 |        4
Glamorgn  |        1         4          1 |        6
          +-------------------------------+
Total             10        12          3         25

SURVIVE$ = Dead
TUMOR$   = MaxBengn
            Under 50  50 to 69  70 & Over      Total
          +-------------------------------+
Tokyo     |        3         2          0 |        5
Boston    |        0         2          0 |        2
Glamorgn  |        0         0          0 |        0
          +-------------------------------+
Total              3         4          0          7

SURVIVE$ = Alive
TUMOR$   = MinMalig
            Under 50  50 to 69  70 & Over      Total
          +-------------------------------+
Tokyo     |       26        20          1 |       47
Boston    |       11        18         15 |       44
Glamorgn  |       16        27         12 |       55
          +-------------------------------+
Total             53        65         28        146

SURVIVE$ = Dead
TUMOR$   = MinMalig
            Under 50  50 to 69  70 & Over      Total
          +-------------------------------+
Tokyo     |        9         9          2 |       20
Boston    |        6         8          9 |       23
Glamorgn  |       16        14          3 |       33
          +-------------------------------+
Total             31        31         14         76

SURVIVE$ = Alive
TUMOR$   = MaxMalig
            Under 50  50 to 69  70 & Over      Total
          +-------------------------------+
Tokyo     |       25        18          5 |       48
Boston    |        4        10          1 |       15
Glamorgn  |        8        10          4 |       22
          +-------------------------------+
Total             37        38         10         85

SURVIVE$ = Dead
TUMOR$   = MaxMalig
            Under 50  50 to 69  70 & Over      Total
          +-------------------------------+
Tokyo     |        4        11          1 |       16
Boston    |        6         3          3 |       12
Glamorgn  |        3         3          3 |        9
          +-------------------------------+
Total             13        17          7         37

List Layout
To create a listing of the counts in each cell of the table:
PRINT / LIST
TABULATE survive$ * center$ * age * tumor$

The output follows:


Case frequencies determined by value of variable NUMBER.


            Cum            Cum
  Count   Count    Pct     Pct   SURVIVE$   CENTER$    AGE         TUMOR$
    68.     68.    8.9     8.9   Alive      Tokyo      Under 50    MinBengn
     9.     77.    1.2    10.1   Alive      Tokyo      Under 50    MaxBengn
    26.    103.    3.4    13.5   Alive      Tokyo      Under 50    MinMalig
    25.    128.    3.3    16.8   Alive      Tokyo      Under 50    MaxMalig
    46.    174.    6.0    22.8   Alive      Tokyo      50 to 69    MinBengn
     5.    179.     .7    23.4   Alive      Tokyo      50 to 69    MaxBengn
    20.    199.    2.6    26.0   Alive      Tokyo      50 to 69    MinMalig
    18.    217.    2.4    28.4   Alive      Tokyo      50 to 69    MaxMalig
     6.    223.     .8    29.2   Alive      Tokyo      70 & Over   MinBengn
     1.    224.     .1    29.3   Alive      Tokyo      70 & Over   MaxBengn
     1.    225.     .1    29.5   Alive      Tokyo      70 & Over   MinMalig
     5.    230.     .7    30.1   Alive      Tokyo      70 & Over   MaxMalig
    24.    254.    3.1    33.2   Alive      Boston     Under 50    MinBengn
    11.    265.    1.4    34.7   Alive      Boston     Under 50    MinMalig
     4.    269.     .5    35.2   Alive      Boston     Under 50    MaxMalig
    58.    327.    7.6    42.8   Alive      Boston     50 to 69    MinBengn
     3.    330.     .4    43.2   Alive      Boston     50 to 69    MaxBengn
    18.    348.    2.4    45.5   Alive      Boston     50 to 69    MinMalig
    10.    358.    1.3    46.9   Alive      Boston     50 to 69    MaxMalig
    26.    384.    3.4    50.3   Alive      Boston     70 & Over   MinBengn
     1.    385.     .1    50.4   Alive      Boston     70 & Over   MaxBengn
    15.    400.    2.0    52.4   Alive      Boston     70 & Over   MinMalig
     1.    401.     .1    52.5   Alive      Boston     70 & Over   MaxMalig
    20.    421.    2.6    55.1   Alive      Glamorgn   Under 50    MinBengn
     1.    422.     .1    55.2   Alive      Glamorgn   Under 50    MaxBengn
    16.    438.    2.1    57.3   Alive      Glamorgn   Under 50    MinMalig
     8.    446.    1.0    58.4   Alive      Glamorgn   Under 50    MaxMalig
    39.    485.    5.1    63.5   Alive      Glamorgn   50 to 69    MinBengn
     4.    489.     .5    64.0   Alive      Glamorgn   50 to 69    MaxBengn
    27.    516.    3.5    67.5   Alive      Glamorgn   50 to 69    MinMalig
    10.    526.    1.3    68.8   Alive      Glamorgn   50 to 69    MaxMalig
    11.    537.    1.4    70.3   Alive      Glamorgn   70 & Over   MinBengn
     1.    538.     .1    70.4   Alive      Glamorgn   70 & Over   MaxBengn
    12.    550.    1.6    72.0   Alive      Glamorgn   70 & Over   MinMalig
     4.    554.     .5    72.5   Alive      Glamorgn   70 & Over   MaxMalig
     7.    561.     .9    73.4   Dead       Tokyo      Under 50    MinBengn
     3.    564.     .4    73.8   Dead       Tokyo      Under 50    MaxBengn
     9.    573.    1.2    75.0   Dead       Tokyo      Under 50    MinMalig
     4.    577.     .5    75.5   Dead       Tokyo      Under 50    MaxMalig
     9.    586.    1.2    76.7   Dead       Tokyo      50 to 69    MinBengn
     2.    588.     .3    77.0   Dead       Tokyo      50 to 69    MaxBengn
     9.    597.    1.2    78.1   Dead       Tokyo      50 to 69    MinMalig
    11.    608.    1.4    79.6   Dead       Tokyo      50 to 69    MaxMalig
     3.    611.     .4    80.0   Dead       Tokyo      70 & Over   MinBengn
     2.    613.     .3    80.2   Dead       Tokyo      70 & Over   MinMalig
     1.    614.     .1    80.4   Dead       Tokyo      70 & Over   MaxMalig
     7.    621.     .9    81.3   Dead       Boston     Under 50    MinBengn
     6.    627.     .8    82.1   Dead       Boston     Under 50    MinMalig
     6.    633.     .8    82.9   Dead       Boston     Under 50    MaxMalig
    20.    653.    2.6    85.5   Dead       Boston     50 to 69    MinBengn
     2.    655.     .3    85.7   Dead       Boston     50 to 69    MaxBengn
     8.    663.    1.0    86.8   Dead       Boston     50 to 69    MinMalig
     3.    666.     .4    87.2   Dead       Boston     50 to 69    MaxMalig
    18.    684.    2.4    89.5   Dead       Boston     70 & Over   MinBengn
     9.    693.    1.2    90.7   Dead       Boston     70 & Over   MinMalig
     3.    696.     .4    91.1   Dead       Boston     70 & Over   MaxMalig
     7.    703.     .9    92.0   Dead       Glamorgn   Under 50    MinBengn
    16.    719.    2.1    94.1   Dead       Glamorgn   Under 50    MinMalig
     3.    722.     .4    94.5   Dead       Glamorgn   Under 50    MaxMalig
    12.    734.    1.6    96.1   Dead       Glamorgn   50 to 69    MinBengn
    14.    748.    1.8    97.9   Dead       Glamorgn   50 to 69    MinMalig
     3.    751.     .4    98.3   Dead       Glamorgn   50 to 69    MaxMalig
     7.    758.     .9    99.2   Dead       Glamorgn   70 & Over   MinBengn
     3.    761.     .4    99.6   Dead       Glamorgn   70 & Over   MinMalig
     3.    764.     .4   100.0   Dead       Glamorgn   70 & Over   MaxMalig


The 35 cells for the women who survived are listed first (the cell for Boston women
under 50 years old with MaxBengn tumors is empty). In the Cum Pct column, we see
that these women make up 72.5% of the sample. Thus, 27.5% did not survive.

Percentages
While list layout provides percentages of the total table count, you might want others.
Here we specify COLPCT in Crosstabs to print the percentage surviving within each
age-by-center stratum. The input is:
PRINT NONE / COLPCT
TABULATE age * center$ * survive$ * tumor$
PRINT

The tables follow:


Column percents
SURVIVE$ (rows) by TUMOR$ (columns)

AGE      = Under 50
CENTER$  = Tokyo
          MinBengn  MaxBengn  MinMalig  MaxMalig      Total        N
        +-----------------------------------------+
Alive   |   90.667    75.000    74.286    86.207 |   84.768      128
Dead    |    9.333    25.000    25.714    13.793 |   15.232       23
        +-----------------------------------------+
Total      100.000   100.000   100.000   100.000    100.000
N               75        12        35        29                 151

AGE      = Under 50
CENTER$  = Boston
          MinBengn  MaxBengn  MinMalig  MaxMalig      Total        N
        +-----------------------------------------+
Alive   |   77.419     0.0      64.706    40.000 |   67.241       39
Dead    |   22.581     0.0      35.294    60.000 |   32.759       19
        +-----------------------------------------+
Total      100.000   100.000   100.000   100.000    100.000
N               31         0        17        10                  58

AGE      = Under 50
CENTER$  = Glamorgn
          MinBengn  MaxBengn  MinMalig  MaxMalig      Total        N
        +-----------------------------------------+
Alive   |   74.074   100.000    50.000    72.727 |   63.380       45
Dead    |   25.926     0.0      50.000    27.273 |   36.620       26
        +-----------------------------------------+
Total      100.000   100.000   100.000   100.000    100.000
N               27         1        32        11                  71

AGE      = 50 to 69
CENTER$  = Tokyo
          MinBengn  MaxBengn  MinMalig  MaxMalig      Total        N
        +-----------------------------------------+
Alive   |   83.636    71.429    68.966    62.069 |   74.167       89
Dead    |   16.364    28.571    31.034    37.931 |   25.833       31
        +-----------------------------------------+
Total      100.000   100.000   100.000   100.000    100.000
N               55         7        29        29                 120

AGE      = 50 to 69
CENTER$  = Boston
          MinBengn  MaxBengn  MinMalig  MaxMalig      Total        N
        +-----------------------------------------+
Alive   |   74.359    60.000    69.231    76.923 |   72.951       89
Dead    |   25.641    40.000    30.769    23.077 |   27.049       33
        +-----------------------------------------+
Total      100.000   100.000   100.000   100.000    100.000
N               78         5        26        13                 122

AGE      = 50 to 69
CENTER$  = Glamorgn
          MinBengn  MaxBengn  MinMalig  MaxMalig      Total        N
        +-----------------------------------------+
Alive   |   76.471   100.000    65.854    76.923 |   73.394       80
Dead    |   23.529     0.0      34.146    23.077 |   26.606       29
        +-----------------------------------------+
Total      100.000   100.000   100.000   100.000    100.000
N               51         4        41        13                 109

AGE      = 70 & Over
CENTER$  = Tokyo
          MinBengn  MaxBengn  MinMalig  MaxMalig      Total        N
        +-----------------------------------------+
Alive   |   66.667   100.000    33.333    83.333 |   68.421       13
Dead    |   33.333     0.0      66.667    16.667 |   31.579        6
        +-----------------------------------------+
Total      100.000   100.000   100.000   100.000    100.000
N                9         1         3         6                  19

AGE      = 70 & Over
CENTER$  = Boston
          MinBengn  MaxBengn  MinMalig  MaxMalig      Total        N
        +-----------------------------------------+
Alive   |   59.091   100.000    62.500    25.000 |   58.904       43
Dead    |   40.909     0.0      37.500    75.000 |   41.096       30
        +-----------------------------------------+
Total      100.000   100.000   100.000   100.000    100.000
N               44         1        24         4                  73

AGE      = 70 & Over
CENTER$  = Glamorgn
          MinBengn  MaxBengn  MinMalig  MaxMalig      Total        N
        +-----------------------------------------+
Alive   |   61.111   100.000    80.000    57.143 |   68.293       28
Dead    |   38.889     0.0      20.000    42.857 |   31.707       13
        +-----------------------------------------+
Total      100.000   100.000   100.000   100.000    100.000
N               18         1        15         7                  41


The percentage of women surviving for each age-by-center combination is reported in
the first row of each panel. In the marginal Total down the right column, we see that
the youngest women treated in Tokyo have the best survival rate (84.77%). This is the
row total (128) divided by the total for the stratum (151).

Example 7
Two-Way Table Statistics
For the SURVEY2 data, you study the relationship between marital status and age. This
is a general table: while the categories for AGE are ordered, those for MARITAL are
not. The usual Pearson chi-square statistic is used to test the association between the
two factors. This statistic is the default for Crosstabs.
The data file is the usual cases-by-variables rectangular file with one record for each
person. We split the continuous variable AGE into four categories and add names such
as 30 to 45 for the output. There are too few separated people to tally, so here we
eliminate them and reorder the categories of MARITAL that remain. To supplement the
results, we request row percentages. The input is:
USE survey2
XTAB
LABEL age / .. 29=18 to 29, 30 .. 45=30 to 45,
            46 .. 60=46 to 60, 60 .. =Over 60
LABEL marital / 2=Married, 3=Divorced, 1=Never
PRINT / ROWPCT
TABULATE age * marital

The output follows:


Frequencies
AGE (rows) by MARITAL (columns)
             Married  Divorced     Never      Total
          +----------------------------+
18 to 29  |       17         5        53 |       75
30 to 45  |       48        21         9 |       78
46 to 60  |       39        12         8 |       59
Over 60   |       23         5         3 |       31
          +----------------------------+
Total            127        43        73        243

Row percents
AGE (rows) by MARITAL (columns)
             Married  Divorced     Never      Total        N
          +----------------------------+
18 to 29  |   22.667     6.667    70.667 |  100.000       75
30 to 45  |   61.538    26.923    11.538 |  100.000       78
46 to 60  |   66.102    20.339    13.559 |  100.000       59
Over 60   |   74.194    16.129     9.677 |  100.000       31
          +----------------------------+
Total         52.263    17.695    30.041    100.000
N                127        43        73                 243

Test statistic                  Value        df      Prob
Pearson Chi-square             87.761     6.000     0.000

Even though the chi-square statistic is highly significant (87.761; p value < 0.0005), in
the Row percentages table, you see that 70.67% of the youngest age group fall into the
never-married category. Many of these people may be too young to consider marriage.

Eliminating a Stratum
If you eliminate the subjects in the youngest group, is there an association between
marital status and age? To address this question, the input is:
SELECT age > 29
PRINT / CHISQ PHI CRAMER CONT ROWPCT
TABULATE age * marital
SELECT

The resulting output is:


Frequencies
AGE (rows) by MARITAL (columns)
             Married  Divorced     Never      Total
          +----------------------------+
30 to 45  |       48        21         9 |       78
46 to 60  |       39        12         8 |       59
Over 60   |       23         5         3 |       31
          +----------------------------+
Total            110        38        20        168

Row percents
AGE (rows) by MARITAL (columns)
             Married  Divorced     Never      Total        N
          +----------------------------+
30 to 45  |   61.538    26.923    11.538 |  100.000       78
46 to 60  |   66.102    20.339    13.559 |  100.000       59
Over 60   |   74.194    16.129     9.677 |  100.000       31
          +----------------------------+
Total         65.476    22.619    11.905    100.000
N                110        38        20                 168

Test statistic                  Value        df      Prob
Pearson Chi-square              2.173     4.000     0.704

Coefficient                     Value    Asymptotic Std Error
Phi                             0.114
Cramer V                        0.080
Contingency                     0.113

The proportion of married people is larger within the Over 60 group than for the 30 to
45 group: 74.19% of the former are married while 61.54% of the latter are married.
The youngest stratum has the most divorced people. However, you cannot say these
proportions differ significantly (chi-square = 2.173, p value = 0.704).
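
The three coefficients printed with this table are simple functions of the chi-square statistic. The Python sketch below is not SYSTAT code; it uses the standard textbook definitions of phi, Cramer's V, and the contingency coefficient (the function name is hypothetical), which reproduce the values in the output above.

```python
import math


def chi_square_coefficients(chi_square, n, n_rows, n_cols):
    """Phi, Cramer's V, and the contingency coefficient (textbook formulas)."""
    phi = math.sqrt(chi_square / n)
    cramer_v = math.sqrt(chi_square / (n * min(n_rows - 1, n_cols - 1)))
    contingency = math.sqrt(chi_square / (chi_square + n))
    return phi, cramer_v, contingency


# Values reported for the 3 x 3 AGE by MARITAL table (168 cases).
phi, cramer_v, contingency = chi_square_coefficients(2.173, 168, 3, 3)
print(round(phi, 3), round(cramer_v, 3), round(contingency, 3))
```

All three rescale chi-square by the sample size, so they stay small here even though the table has 168 cases.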

Example 8
Two-Way Table Statistics (Long Results)
This example illustrates LONG results and table input. It uses the AGE by CENTER$
table from the cancer study described in the frequency input example. The input is:
USE cancer
XTAB
FREQ = number
PRINT LONG
LABEL age / 50=Under 50, 60=50 to 69, 70=70 & Over
TABULATE center$ * age

The output follows:


Frequencies
CENTER$ (rows) by AGE (columns)
            Under 50  50 to 69  70 & Over      Total
          +-------------------------------+
Boston    |       58       122         73 |      253
Glamorgn  |       71       109         41 |      221
Tokyo     |      151       120         19 |      290
          +-------------------------------+
Total            280       351        133        764

Expected values
CENTER$ (rows) by AGE (columns)
            Under 50  50 to 69  70 & Over
          +-------------------------------+
Boston    |   92.723   116.234     44.043 |
Glamorgn  |   80.995   101.533     38.473 |
Tokyo     |  106.283   133.233     50.484 |
          +-------------------------------+

Standardized deviates: (Observed-Expected)/SQR(Expected)
CENTER$ (rows) by AGE (columns)
            Under 50  50 to 69  70 & Over
          +-------------------------------+
Boston    |   -3.606     0.535      4.363 |
Glamorgn  |   -1.111     0.741      0.407 |
Tokyo     |    4.338    -1.146     -4.431 |
          +-------------------------------+

Test statistic                  Value        df      Prob
Pearson Chi-square             74.039     4.000     0.000
Likelihood ratio Chi-square    76.963     4.000     0.000
McNemar Symmetry Chi-square    79.401     3.000     0.000

Coefficient                         Value    Asymptotic Std Error
Phi                                 0.311
Cramer V                            0.220
Contingency                         0.297
Goodman-Kruskal Gamma              -0.417         0.043
Kendall Tau-B                      -0.275         0.030
Stuart Tau-C                       -0.265         0.029
Cohen Kappa                        -0.113         0.022
Spearman Rho                       -0.305         0.033
Somers D    (column dependent)     -0.267         0.030
Lambda      (column dependent)      0.075         0.038
Uncertainty (column dependent)      0.049         0.011

The null hypothesis for the Pearson chi-square test is that the table factors are
independent. You reject the hypothesis (chi-square = 74.039, p value < 0.0005). We
are concerned about the analysis of the full table with four factors in the cancer study
because we see an imbalance between AGE and study CENTER. The researchers in
Tokyo entered a much larger proportion of younger women than did the researchers in
the other cities.
Notice that with LONG, SYSTAT reports all statistics for an r x c table, including
those that are appropriate when both factors have ordered categories (gamma, tau-b,
tau-c, Somers' D, and Spearman's rho).
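
The expected values, standardized deviates, and the two chi-square statistics in this output follow directly from the observed counts. The Python sketch below is not SYSTAT code; it applies the usual definitions under independence (expected = row total * column total / N), and the function name independence_fit is hypothetical.

```python
import math

# Observed counts: rows Boston, Glamorgn, Tokyo; columns Under 50, 50 to 69, 70 & Over.
observed = [
    [58, 122, 73],
    [71, 109, 41],
    [151, 120, 19],
]


def independence_fit(table):
    """Expected values, standardized deviates, Pearson and LR chi-square, df."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    deviates = [[(o - e) / math.sqrt(e) for o, e in zip(o_row, e_row)]
                for o_row, e_row in zip(table, expected)]
    # Pearson chi-square is the sum of squared standardized deviates.
    pearson = sum(d * d for row in deviates for d in row)
    # Likelihood ratio chi-square; empty cells contribute nothing.
    likelihood_ratio = 2 * sum(o * math.log(o / e)
                               for o_row, e_row in zip(table, expected)
                               for o, e in zip(o_row, e_row) if o > 0)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, deviates, pearson, likelihood_ratio, df


expected, deviates, pearson, likelihood_ratio, df = independence_fit(observed)
print(round(pearson, 3), round(likelihood_ratio, 3), df)
```

The largest deviates (Boston and Tokyo in the youngest and oldest age groups) account for most of the chi-square, which is the imbalance between AGE and CENTER noted above.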

Example 9
Odds Ratios
For a table with cell counts a, b, c, and d:

                     Exposure
                    yes     no
Disease    yes       a       b
           no        c       d


where, if you designate the Disease yes people sick and the Disease no people well, the
odds ratio (or cross-product ratio) equals the odds that a sick person is exposed divided
by the odds that a well person is exposed, or:

(a / b) / (c / d) = (ad) / (bc)

If the odds for the sick and disease-free people are the same, the value of the odds ratio
is 1.0.
As an example, use the SURVEY2 file and study the association between gender and
depressive illness. Be careful to order your table factors so that your odds ratio is
constructed correctly (we use LABEL to do this). The input is:
USE survey2
XTAB
LABEL casecont / 1=Depressed, 0=Normal
PRINT / FREQ ODDS
TABULATE sex$ * casecont

The output is:


Frequencies
SEX$ (rows) by CASECONT (columns)
           Depressed    Normal      Total
          +---------------------+
Female    |       36       116 |      152
Male      |        8        96 |      104
          +---------------------+
Total             44       212        256

Test statistic                  Value        df      Prob
Pearson Chi-square             11.095     1.000     0.001

Coefficient                     Value    Asymptotic Std Error
Odds Ratio                      3.724
Ln(Odds)                        1.315         0.415

The odds that a female is depressed are 36 to 116, the odds for a male are 8 to 96, and
the odds ratio is 3.724. Thus, in this sample, the odds of depression for females are
almost four times those for males. But does our sample estimate differ significantly from
1.0? Because the distribution of the odds ratio is very skewed, significance is
determined by examining Ln(Odds), the natural logarithm of the ratio, and the standard
error of the transformed ratio. Note the symmetry when ratios are transformed:
Ratio     Ln(Ratio)
3           Ln 3
2           Ln 2
1           0
1/2        -Ln 2
1/3        -Ln 3

The value of Ln(Odds) here is 1.315 with a standard error of 0.415. Constructing an
approximate 95% confidence interval using the statistic plus or minus two times its
standard error:

1.315 ± 2 * 0.415 = 1.315 ± 0.830


results in:

0.485 < Ln ( Odds ) < 2.145


Because 0 is not included in the interval, Ln(Odds) differs significantly from 0, and the
odds ratio differs from 1.0.
Using the calculator to take antilogs of the limits. You can use SYSTAT's calculator to
take antilogs of the limits, EXP(0.485) and EXP(2.145), and obtain a confidence interval
for the odds ratio:

e^(0.485) < odds ratio < e^(2.145)

1.624 < odds ratio < 8.542


That is, for the lower limit, type CALC EXP(0.485).
Notice that the proportion of females who are depressed is 0.2368 (from a table of
row percentages not displayed here) and the proportion of males is 0.0769, so you also
reject the hypothesis of equality of proportions (chi-square = 11.095, p value = 0.001).
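
The odds ratio arithmetic can be reproduced outside SYSTAT. The Python sketch below is not SYSTAT code; it assumes the common large-sample standard error for Ln(Odds), sqrt(1/a + 1/b + 1/c + 1/d), which matches the 0.415 printed above, and the function name odds_ratio_interval is hypothetical.

```python
import math


def odds_ratio_interval(a, b, c, d, multiplier=2.0):
    """Odds ratio, Ln(Odds), its standard error, and an approximate interval.

    a, b are the counts in the first row of the 2 x 2 table; c, d the second.
    The interval is Ln(Odds) +/- multiplier * SE, transformed back with exp.
    """
    odds_ratio = (a * d) / (b * c)
    ln_odds = math.log(odds_ratio)
    std_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(ln_odds - multiplier * std_error)
    upper = math.exp(ln_odds + multiplier * std_error)
    return odds_ratio, ln_odds, std_error, (lower, upper)


# Female: 36 depressed, 116 normal; Male: 8 depressed, 96 normal.
odds_ratio, ln_odds, std_error, (lower, upper) = odds_ratio_interval(36, 116, 8, 96)
print(round(odds_ratio, 3), round(ln_odds, 3), round(std_error, 3))
print(round(lower, 2), round(upper, 2))
```

Because this sketch carries unrounded values through the calculation, its limits can differ from 1.624 and 8.542 in the second or third decimal place; the text's limits come from the rounded figures 1.315 and 0.415.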


Example 10
Fisher's Exact Test
Let's say that you are interested in how salaries of female executives compare with
those of male executives at a particular firm. The accountant there will not give you
salaries in dollar figures but does tell you whether the executives' salaries are low or
high:

            Low    High
Male          2       7
Female        5       1

The sample size is very small. When a table has only two rows and two columns and
PRINT=MEDIUM is set as the length, SYSTAT reports results of five additional tests
and measures: Fisher's exact test, the odds ratio (and Ln(Odds)), Yates' corrected
chi-square, and Yule's Q and Y. By setting PRINT=SHORT, you request three of these:
Fisher's exact test, the chi-square test, and Yates' corrected chi-square. The input is:
USE salary
XTAB
FREQ = count
LABEL sex / 1=male, 2=female
LABEL earnings / 1=low, 2=high
PRINT / FISHER CHISQ YATES
TABULATE sex * earnings

The output follows:


Frequencies
SEX (rows) by EARNINGS (columns)
              low     high     Total
        +---------------+
male    |       2        7 |       9
female  |       5        1 |       6
        +---------------+
Total           7        8        15

WARNING: More than one-fifth of fitted cells are sparse (frequency < 5).
Significance tests computed on this table are suspect.

Test statistic                  Value        df      Prob
Pearson Chi-square              5.402     1.000     0.020
Yates corrected Chi-square      3.225     1.000     0.073
Fisher exact test (two-tail)                        0.041


Notice that SYSTAT warns you that the results are suspect because the counts in the
table are too low (sparse). Technically, the message states that more than one-fifth of
the cells have expected values (fitted values) of less than 5.
The p value for the Pearson chi-square (0.020) leads you to believe that SEX and
EARNINGS are not independent. But there is a warning about suspect results. This
warning applies to the Pearson chi-square test but not to Fisher's exact test. Fisher's
test counts all possible outcomes exactly, including the ones that produce an
interaction greater than what you observe. The Fisher exact test p value is also
significant. On this basis, you reject the null hypothesis of independence (no
interaction between SEX and EARNINGS).

Sensitivity
Results for small samples, however, can be fairly sensitive. One case can matter. What
if the accountant forgets one well-paid male executive?
Frequencies
SEX (rows) by EARNINGS (columns)
              low     high     Total
        +---------------+
male    |       2        6 |       8
female  |       5        1 |       6
        +---------------+
Total           7        7        14

WARNING: More than one-fifth of fitted cells are sparse (frequency < 5).
Significance tests computed on this table are suspect.

Test statistic                  Value        df      Prob
Pearson Chi-square              4.667     1.000     0.031
Yates corrected Chi-square      2.625     1.000     0.105
Fisher exact test (two-tail)                        0.103

The results of the Fisher exact test indicate that you cannot reject the null hypothesis
of independence. It is too bad that you do not have the actual salaries. Much
information is lost when a quantitative variable like salary is dichotomized into LOW
and HIGH.
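
To see how one case moves the exact p value, the two tables can be evaluated directly. The Python sketch below is not SYSTAT code; it enumerates every 2 x 2 table with the observed margins and, as an assumption about the two-tail convention, sums the probabilities of all tables that are no more likely than the observed one. That convention reproduces both p values shown above. The function name fisher_two_tail is hypothetical.

```python
from math import comb


def fisher_two_tail(a, b, c, d):
    """Two-tail Fisher exact p value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def ways(x):
        # Number of ways to obtain x in the upper left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = ways(a)
    smallest = max(0, col1 - row2)
    largest = min(row1, col1)
    # Integer counts are compared, so there are no floating-point ties.
    extreme = sum(ways(x) for x in range(smallest, largest + 1)
                  if ways(x) <= observed)
    return extreme / comb(n, col1)


p_original = fisher_two_tail(2, 7, 5, 1)   # all 15 executives
p_one_fewer = fisher_two_tail(2, 6, 5, 1)  # one well-paid male omitted
print(round(p_original, 3), round(p_one_fewer, 3))
```

Dropping a single case moves the p value from below 0.05 to above 0.10, which illustrates the sensitivity described in this section.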

What Is a Small Expected Value?


In larger contingency tables, you do not want to see any expected values less than 1.0
or more than 20% of the values less than 5. For large tables with too many small
expected values, there is no remedy but to combine categories or possibly omit a
category that has very few observations.


Example 11
Cochran's Test of Linear Trend
When one table factor is dichotomous and the other has three or more ordered
categories (for example, low, medium, and high), Cochran's test of linear trend is used
to test the null hypothesis that the slope of a regression line across the proportions is 0.
For example, in studying the relation of depression to education, you form this table
for the SURVEY2 data and plot the proportion depressed:

If you regress the proportions on scores 1, 2, 3, and 4 assigned by SYSTAT to the
ordered categories, you can test whether the slope is significant.
This is what we do in this example. We also explore the relation of depression to
health. The input is:
USE survey2
XTAB
LABEL casecont / 1=Depressed, 0=Normal
LABEL educatn / 1,2=Dropout, 3=HS grad, 4,5=College,
                6,7=Degree +
LABEL healthy / 1=Excellent, 2=Good, 3,4=Fair/Poor
PRINT / FREQ COLPCT COCHRAN
TABULATE casecont * educatn
TABULATE casecont * healthy


The output is:


Frequencies
CASECONT (rows) by EDUCATN (columns)
              Dropout   HS grad   College  Degree +      Total
           +-----------------------------------------+
Depressed  |       14        18        11         1 |       44
Normal     |       36        80        75        21 |      212
           +-----------------------------------------+
Total              50        98        86        22        256

Column percents
CASECONT (rows) by EDUCATN (columns)
              Dropout   HS grad   College  Degree +      Total        N
           +-----------------------------------------+
Depressed  |   28.000    18.367    12.791     4.545 |   17.187       44
Normal     |   72.000    81.633    87.209    95.455 |   82.813      212
           +-----------------------------------------+
Total         100.000   100.000   100.000   100.000    100.000
N                  50        98        86        22                 256

Test statistic                  Value        df      Prob
Pearson Chi-square              7.841     3.000     0.049
Cochran's Linear Trend          7.681     1.000     0.006

Frequencies
CASECONT (rows) by HEALTHY (columns)
            Excellent      Good Fair/Poor      Total
           +-------------------------------+
Depressed  |       16        15        13 |       44
Normal     |      105        78        29 |      212
           +-------------------------------+
Total             121        93        42        256

Column percents
CASECONT (rows) by HEALTHY (columns)
            Excellent      Good Fair/Poor      Total        N
           +-------------------------------+
Depressed  |   13.223    16.129    30.952 |   17.187       44
Normal     |   86.777    83.871    69.048 |   82.813      212
           +-------------------------------+
Total         100.000   100.000   100.000    100.000
N                 121        93        42                 256

Test statistic                  Value        df      Prob
Pearson Chi-square              7.000     2.000     0.030
Cochran's Linear Trend          5.671     1.000     0.017

As the level of education increases, the proportion of depressed subjects decreases
(Cochran's Linear Trend = 7.681, df = 1, and Prob (p value) = 0.006). Of those not
graduating from high school (Dropout), 28% are depressed, and 4.55% of those with
advanced degrees are depressed. Notice that the Pearson chi-square is marginally
significant (p value = 0.049). It simply tests the hypothesis that the four proportions are
equal rather than decreasing linearly.


In contrast to education, the proportion of depressed subjects tends to increase
linearly as health deteriorates (p value = 0.017). Only 13% of those in excellent health
are depressed, whereas 31% of cases with fair or poor health report depression.
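
The trend statistic can be recomputed from the Depressed row and the column totals. The Python sketch below is not SYSTAT code; it uses the standard one-degree-of-freedom trend chi-square for proportions (often called the Cochran-Armitage form) with scores 1, 2, 3, ... assigned to the ordered categories, which reproduces both values in the output. The function name cochran_trend is hypothetical.

```python
def cochran_trend(events, totals, scores=None):
    """Chi-square (1 df) for a linear trend in proportions.

    events: count of the first row category (here Depressed) in each column.
    totals: column totals.
    scores: category scores; defaults to 1, 2, 3, ...
    """
    if scores is None:
        scores = list(range(1, len(totals) + 1))
    n = sum(totals)
    p = sum(events) / n  # overall proportion depressed
    # Score-weighted difference between observed and expected event counts.
    t = sum(x * (d - m * p) for x, d, m in zip(scores, events, totals))
    sum_x = sum(m * x for x, m in zip(scores, totals))
    sum_xx = sum(m * x * x for x, m in zip(scores, totals))
    variance = p * (1 - p) * (sum_xx - sum_x ** 2 / n)
    return t * t / variance


trend_education = cochran_trend([14, 18, 11, 1], [50, 98, 86, 22])
trend_health = cochran_trend([16, 15, 13], [121, 93, 42])
print(round(trend_education, 3), round(trend_health, 3))
```

Because the statistic concentrates on a single direction of departure (a linear slope), it has 1 degree of freedom and can be significant even when the overall Pearson chi-square is only marginal.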

Example 12
Tables with Ordered Categories
In this example, we focus on statistics for studies in which both table factors have a
few ordered categories. For example, a teacher evaluating the activity level of
schoolchildren may feel that she can't score them from 1 to 20 but that she could
categorize the activity of each child as sedentary, normal, or hyperactive. Here you
study the relation of health status to age. If the category codes are character-valued,
you must indicate the correct ordering (as opposed to the default alphabetical
ordering).
For Spearman's rho, instead of using actual data values, the indices of the categories
are used to compute the usual correlation. Gamma is the difference between the
probabilities of getting like and unlike orders of values for a pair of cases, ignoring
ties. Its numerator is identical to that of Kendall's tau-b and Stuart's tau-c. The input is:
USE survey2
XTAB
LABEL healthy / 1=Excellent, 2=Good, 3,4=Fair/Poor
LABEL age / .. 29=18 to 29, 30 .. 45=30 to 45,
            46 .. 60=46 to 60, 60 .. =Over 60
PRINT / FREQ ROWP GAMMA RHO
TABULATE healthy * age

The output follows:


Frequencies
HEALTHY (rows) by AGE (columns)
             18 to 29  30 to 45  46 to 60   Over 60      Total
           +-----------------------------------------+
Excellent  |       43        48        25         5  |     121
Good       |       30        23        24        16  |      93
Fair/Poor  |        6         9        15        12  |      42
           +-----------------------------------------+
Total              79        80        64        33        256

Row percents
HEALTHY (rows) by AGE (columns)
             18 to 29  30 to 45  46 to 60   Over 60      Total      N
           +-----------------------------------------+
Excellent  |   35.537    39.669    20.661     4.132  |  100.000    121
Good       |   32.258    24.731    25.806    17.204  |  100.000     93
Fair/Poor  |   14.286    21.429    35.714    28.571  |  100.000     42
           +-----------------------------------------+
Total          30.859    31.250    25.000    12.891     100.000
N                  79        80        64        33                256


Test statistic               Value        df      Prob
Pearson Chi-square          29.380     6.000     0.000

Coefficient                  Value   Asymptotic Std Error
Goodman-Kruskal Gamma        0.346                  0.072
Spearman Rho                 0.274                  0.058

Not surprisingly, as age increases, health status tends to deteriorate. In the table of row
percentages, notice that among those with EXCELLENT health, 4.13% are in the oldest
age group; in the GOOD category, 17.2% are in the oldest group; and in the
FAIR/POOR category, 28.57% are in the oldest group.
The value of gamma is 0.346; rho is 0.274. Here are approximate confidence intervals
(Value ± 2 * Asymptotic Std Error) for each statistic:

Gamma:  0.202 <= 0.346 <= 0.490
Rho:    0.158 <= 0.274 <= 0.390

Because 0 is in neither interval, you conclude that there is an association between
health and age.
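
As a cross-check on the numbers above, the following Python sketch recomputes gamma from the frequency table by counting concordant and discordant pairs, then forms the interval Value ± 2 * Asymptotic Std Error. It is an illustration only and not SYSTAT's implementation: the function names are hypothetical, and the standard error 0.072 is copied from the output rather than computed.

```python
# Counts from the HEALTHY (rows) by AGE (columns) table above.
table = [
    [43, 48, 25, 5],    # Excellent
    [30, 23, 24, 16],   # Good
    [6, 9, 15, 12],     # Fair/Poor
]


def goodman_kruskal_gamma(t):
    """Return (P - Q) / (P + Q), where P counts concordant pairs and Q discordant pairs."""
    rows, cols = len(t), len(t[0])
    concordant = discordant = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):      # only rows below row i
                for m in range(cols):
                    if m > j:                 # lower row, higher column: like order
                        concordant += t[i][j] * t[k][m]
                    elif m < j:               # lower row, lower column: unlike order
                        discordant += t[i][j] * t[k][m]
    return (concordant - discordant) / (concordant + discordant)


def approx_interval(value, ase, width=2.0):
    """Value plus or minus `width` asymptotic standard errors."""
    return value - width * ase, value + width * ase


gamma = goodman_kruskal_gamma(table)
# The asymptotic standard error (0.072) is taken from the printed output.
low, high = approx_interval(round(gamma, 3), 0.072)
print(f"gamma = {gamma:.3f}, interval = ({low:.3f}, {high:.3f})")
```

Because the interval excludes 0, the sketch leads to the same conclusion as the text.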

Example 13
McNemar's Test of Symmetry
In November of 1993, the U.S. Congress approved the North American Free Trade
Agreement (NAFTA). Let's say that two months before the approval and before the
televised debate between Vice President Al Gore and businessman Ross Perot,
political pollsters queried a sample of 350 people, asking "Are you for, unsure, or
against NAFTA?" Immediately after the debate, the pollsters contacted the same
people and asked the question a second time. Here are the responses:
                      After
Before         For   Unsure   Against
For             51       22        28
Unsure          46       18        27
Against         52       49        57

The pollsters wonder, "Is there a shift in opinion about NAFTA?" The study design for
the answer is similar to a paired t test: each subject has two responses. The row and
column categories of our table are the same variable measured at different points in time.

The file NAFTA contains these data. To test for an opinion shift, the input is:
USE nafta
XTAB
FREQ = count
ORDER before$ after$ / SORT=for,unsure,against
PRINT / FREQ MCNEMAR CHI PERCENT
TABULATE before$ * after$

We use ORDER to ensure that the row and column categories are ordered the same. The
output follows:
Frequencies
BEFORE$ (rows) by AFTER$ (columns)
               for   unsure  against      Total
         +---------------------------+
for      |      51       22       28 |      101
unsure   |      46       18       27 |       91
against  |      52       49       57 |      158
         +---------------------------+
Total          149       89      112        350

Percents of total count
BEFORE$ (rows) by AFTER$ (columns)
               for   unsure  against      Total      N
         +---------------------------+
for      |  14.571    6.286    8.000 |   28.857    101
unsure   |  13.143    5.143    7.714 |   26.000     91
against  |  14.857   14.000   16.286 |   45.143    158
         +---------------------------+
Total       42.571   25.429   32.000    100.000
N              149       89      112               350

Test statistic                   Value        df      Prob
Pearson Chi-square              11.473     4.000     0.022
McNemar Symmetry Chi-square     22.039     3.000     0.000

The McNemar test of symmetry focuses on the counts in the off-diagonal cells (those
along the diagonal are not used in the computations). We are investigating the direction
of change in opinion. First, how many respondents became more negative about
NAFTA?
• Among those who initially responded For, 22 (6.29%) are now Unsure and 28
  (8%) are now Against.
• Among those who were Unsure before the debate, 27 (7.71%) answered Against
  afterwards.

The three cells in the upper right contain counts for those who became more
unfavorable and comprise 22% (6.29 + 8.00 + 7.71) of the sample. The three cells in
the lower left contain counts for people who became more positive about NAFTA (46,
52, and 49), or 42% of the sample.
The null hypothesis for the McNemar test is that the table is symmetric: each change
in opinion is as likely as its reverse (for example, For to Against is as likely as Against
to For). The chi-square statistic for this test is 22.039 with 3 df and p < 0.0005. You reject the
null hypothesis. The pro-NAFTA shift in opinion is significantly greater than the
anti-NAFTA shift.
You also reject the null hypothesis that the row (BEFORE$) and column
(AFTER$) factors are independent (chi-square = 11.473; p = 0.022). However, a test
of independence does not answer your original question about change of opinion and
its direction.
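
The symmetry statistic can be reproduced by hand from the off-diagonal cells. The Python sketch below assumes the usual form of the test for a square table (for each pair of mirrored cells, the squared difference divided by the sum); the function name is hypothetical, and the code illustrates the idea rather than SYSTAT's own routine.

```python
# Counts from the BEFORE$ (rows) by AFTER$ (columns) table, ordered for, unsure, against.
table = [
    [51, 22, 28],
    [46, 18, 27],
    [52, 49, 57],
]


def symmetry_chi_square(t):
    """Sum (n_ij - n_ji)^2 / (n_ij + n_ji) over cell pairs above the diagonal."""
    k = len(t)
    statistic = 0.0
    df = 0
    for i in range(k):
        for j in range(i + 1, k):
            pair_total = t[i][j] + t[j][i]
            if pair_total > 0:                # skip pairs with no observations
                statistic += (t[i][j] - t[j][i]) ** 2 / pair_total
                df += 1
    return statistic, df


statistic, df = symmetry_chi_square(table)
n = sum(sum(row) for row in table)
more_negative = sum(table[i][j] for i in range(3) for j in range(i + 1, 3))
more_positive = sum(table[i][j] for i in range(3) for j in range(i))

print(f"chi-square = {statistic:.3f} with {df} df")
print(f"more negative: {more_negative} ({100 * more_negative / n:.0f}%)")
print(f"more positive: {more_positive} ({100 * more_positive / n:.0f}%)")
```

The diagonal counts (51, 18, 57) never enter the calculation, which is why the test speaks only to change of opinion.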

Example 14
Confidence Intervals for One-Way Table Percentages
If your data are binomially or multinomially distributed, you may want confidence
intervals on the cell proportions. SYSTAT's confidence intervals are based on an
approximation by Bailey (1980). Crosstabs uses that reference's approximation
number 6 with a continuity correction, which closely fits the real intervals for the
binomial on even small samples and performs well when population proportions are
near 0 or 1. The confidence intervals are scaled on a percentage scale for compatibility
with the other Crosstabs output.
Here is an example using data from Davis (1977) on the number of buses failing
after driving a given distance (1 of 10 distances). Print the percentages of the 191 buses
failing in each distance category to see the coverage of the intervals. The input follows:
USE buses
XTAB
FREQ = count
PRINT NONE / FREQ PERCENT
TABULATE distance / CONFI=.95


The resulting output is:


Frequencies
Values for DISTANCE
       1      2      3      4      5      6      7      8      9     10      Total
 +----------------------------------------------------------------------+
 |     6     11     16     25     34     46     33     16      2      2 |      191
 +----------------------------------------------------------------------+

Percents of total count
Values for DISTANCE
         1        2        3        4        5        6        7
 +----------------------------------------------------------------+
 |   3.141    5.759    8.377   13.089   17.801   24.084   17.277  |
 +----------------------------------------------------------------+
         8        9       10      Total      N
 +----------------------------+
 |   8.377    1.047    1.047  |  100.000    191
 +----------------------------+

95 percent approximate confidence intervals scaled as cell percents
Values for DISTANCE
         1        2        3        4        5        6        7
 +----------------------------------------------------------------+
 |   8.234   11.875   15.259   20.996   26.447   33.420   25.852  |
 |   0.548    1.903    3.552    6.905   10.560   15.737   10.142  |
 +----------------------------------------------------------------+
         8        9       10
 +----------------------------+
 |  15.259    4.914    4.914  |
 |   3.552    0.0      0.0    |
 +----------------------------+

There are 6 buses in the first distance category; this is 3.14% of the 191 buses. The
confidence interval for this percentage ranges from 0.55 to 8.23%.

Example 15
Mantel-Haenszel Test
For any (k × 2 × 2) table, if the output mode is MEDIUM or if you select the
Mantel-Haenszel test, SYSTAT produces the Mantel-Haenszel statistic without continuity
correction. This tests the association between two binary variables controlling for a
stratification variable. The Mantel-Haenszel test is often used to test the effectiveness
of a treatment on an outcome, to test the degree of association between the presence or
absence of a risk factor and the occurrence of a disease, or to compare two survival
distributions.


A study by Ansfield, et al. (1977) examined the responses of two different groups of
patients (colon or rectum cancer and breast cancer) to two different treatments:
CANCER$        TREAT$   RESPONSE$    NUMBER
Colon-Rectum   a        Positive     16.000
Colon-Rectum   b        Positive      7.000
Colon-Rectum   a        Negative     32.000
Colon-Rectum   b        Negative     45.000
Breast         a        Positive     14.000
Breast         b        Positive      9.000
Breast         a        Negative     28.000
Breast         b        Negative     29.000

Here are the data rearranged:

                  Breast Cancer           Colon-Rectum
               Positive  Negative      Positive  Negative
Treatment A          14        28            16        32
Treatment B           9        29             7        45

The odds ratio (cross-product ratio) for the first table is:

\[
\frac{\text{odds (biopsy positive, given treatment A)}}{\text{odds (biopsy positive, given treatment B)}}
= \frac{14/28}{9/29} \approx 1.6
\]

Similarly, for the second table, the odds ratio is:

\[
\frac{16/32}{7/45} \approx 3.2
\]

If the odds for treatments A and B are identical, the ratios would both be 1.0. For these
data, the breast cancer patients on treatment A are 1.6 times more likely to have a
positive biopsy than patients on treatment B; while, for the colon-rectum, those on
treatment A are 3.2 times more likely to have a positive biopsy than those on treatment
B. But can you say these estimates differ significantly from 1.0? After adjusting for the
total frequency in each table, the Mantel-Haenszel statistic combines odds ratios across
tables. The input is:
USE ansfield
XTAB
FREQ = number
ORDER response$ / SORT=Positive,Negative
PRINT / MANTEL
TABULATE cancer$ * treat$ * response$

The stratification variable (CANCER$) must be the first variable listed on TABULATE.
The output is:
Frequencies
TREAT$ (rows) by RESPONSE$ (columns)

CANCER$ = Breast
            Positive   Negative      Total
        +---------------------------+
a       |         14         28     |     42
b       |          9         29     |     38
        +---------------------------+
Total             23         57           80

CANCER$ = Colon-Rectum
            Positive   Negative      Total
        +---------------------------+
a       |         16         32     |     48
b       |          7         45     |     52
        +---------------------------+
Total             23         77          100

Test statistic                    Value     df     Prob
Mantel-Haenszel statistic =       2.277
Mantel-Haenszel Chi-square =      4.739  Probability = 0.029

SYSTAT prints a chi-square test for testing whether this combined estimate equals 1.0
(that odds for A and B are the same). The probability associated with this chi-square is
0.029, so you reject the hypothesis that the odds ratio is 1.0 and conclude that treatment
A is less effective: more patients on treatment A have positive biopsies after treatment
than patients on treatment B.
One assumption required for the Mantel-Haenszel chi-square test is that the odds
ratios are homogeneous across tables. For your example, the second odds ratio is twice
as large as the first. You can use loglinear models to test if a cancer-by-treatment
interaction is needed to fit the cells of the three-way table defined by cancer, treatment,
and response. The difference between this model and one without the interaction was
not significant (a chi-square of 0.36 with 1 df).
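
To make the combination of the two tables concrete, here is a Python sketch using the textbook Mantel-Haenszel formulas: the pooled odds ratio is the sum of a*d/n divided by the sum of b*c/n over strata, and the chi-square compares the observed and expected counts in the first cell. The function name and the `continuity` parameter are assumptions made for illustration; this is not SYSTAT's code. In this sketch the printed chi-square of 4.739 is reproduced when a 0.5 continuity correction is applied, while omitting the correction gives about 5.51.

```python
# Each stratum is (a, b, c, d):
#   a = treatment A positive, b = treatment A negative,
#   c = treatment B positive, d = treatment B negative.
strata = {
    "Breast": (14, 28, 9, 29),
    "Colon-Rectum": (16, 32, 7, 45),
}


def mantel_haenszel(tables, continuity=0.5):
    """Return the pooled odds ratio and the Mantel-Haenszel chi-square (1 df)."""
    numerator = denominator = 0.0
    observed = expected = variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        observed += a
        expected += row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    odds_ratio = numerator / denominator
    chi_square = (abs(observed - expected) - continuity) ** 2 / variance
    return odds_ratio, chi_square


# Odds ratio within each stratum, as in the hand calculation above.
stratum_odds_ratios = {
    name: (a / b) / (c / d) for name, (a, b, c, d) in strata.items()
}

pooled_or, chi_corrected = mantel_haenszel(strata.values(), continuity=0.5)
_, chi_uncorrected = mantel_haenszel(strata.values(), continuity=0.0)

print(stratum_odds_ratios)
print(f"pooled odds ratio = {pooled_or:.3f}")
print(f"chi-square with correction = {chi_corrected:.3f}")
print(f"chi-square without correction = {chi_uncorrected:.3f}")
```

The pooled estimate (about 2.28) falls between the two stratum odds ratios, as expected for a weighted combination.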


Computation
All computations are in double precision.

References
Afifi, A. A. and Clark, V. (1984). Computer-aided multivariate analysis. Belmont, Calif.:
Lifetime Learning.
Ansfield, F., et al. (1977). A phase III study comparing the clinical utility of four regimens
of 5-fluorouracil. Cancer, 39, 34–40.
Bailey, B. J. R. (1980). Large sample simultaneous confidence intervals for the
multinomial probabilities based on transformations of the cell frequencies.
Technometrics, 22, 583–589.
Davis, D. J. (1977). An analysis of some failure data. Journal of the American Statistical
Association, 72, 113–150.
Fleiss, J. L. (1981). Statistical methods for rates and proportions. 2nd ed. New York: John
Wiley & Sons, Inc.
Morrison, A. S., Black, M. M., Lowe, C. R., MacMahon, B., and Yuasa, S. Y. (1973). Some
international differences in histology and survival in breast cancer. International
Journal of Cancer, 11, 261–267.

Chapter 9
Descriptive Statistics
Leland Wilkinson and Laszlo Engelman

There are many ways to describe data, although not all descriptors are appropriate for
a given sample. Means and standard deviations are useful for data that follow a
normal distribution, but are poor descriptors when the distribution is highly skewed
or has outliers, subgroups, or other anomalies. Some statistics, such as the mean and
median, describe the center of a distribution. These estimates are called measures of
location. Others, such as the standard deviation, describe the spread of the
distribution.
Before deciding what you want to describe (location, spread, and so on), you
should consider what type of variables are present. Are the values of a variable
unordered categories, ordered categories, counts, or measurements?
For many statistical purposes, counts are treated as measured variables. Such
variables are called quantitative if it makes sense to do arithmetic on their values.
Means and standard deviations are appropriate for quantitative variables that follow a
normal distribution. Often, however, real data do not meet this assumption of
normality. A descriptive statistic is called robust if the calculations are insensitive to
violations of the assumption of normality. Robust measures include the median,
quartiles, frequency counts, and percentages.
Before requesting descriptive statistics, first scan graphical displays to see if the
shape of the distribution is symmetric, if there are outliers, and if the sample has
subpopulations. If the latter is true, then the sample is not homogeneous, and the
statistics should be calculated for each subgroup separately.
Descriptive Statistics offers the usual mean, standard deviation, and standard error
appropriate for data that follow a normal distribution. It also provides the median,
minimum, maximum, and range. A confidence interval for the mean and standard
errors for skewness and kurtosis can be requested. A stem-and-leaf plot is available
for assessing distributional shape and identifying outliers. Moreover, Descriptive
Statistics provides stratified analyses; that is, you can request results separately for
each level of a grouping variable (such as SEX$) or for each combination of levels of
two or more grouping variables.

Statistical Background
Descriptive statistics are numerical summaries of batches of numbers. Inevitably, these
summaries are misleading, because they mask details of the data. Without them,
however, we would be lost in particulars.
There are many ways to describe a batch of data. Not all are appropriate for every
batch, however. Let's look at the Who's Who data from Chapter 1 to see what this
means. First of all, here is a stem-and-leaf diagram of the ages of 50 randomly sampled
people from Who's Who. A stem-and-leaf diagram is a tally; it shows us the
distribution of the AGE values.
Stem and leaf plot of variable: AGE, N = 50
Minimum:        34.000
Lower hinge:    49.000
Median:         56.000
Upper hinge:    66.000
Maximum:        81.000

   3    4
   3    689
   4    14
   4 H  556778999
   5    0011112
   5 M  556688889
   6    0023
   6 H  55677789
   7    04
   7    5668
   8    1

Notice that these data look fairly symmetric and lumpy in the middle. A natural way
to describe this type of distribution would be to report its center and the amount of
spread.

Location
How do we describe the center, or central location of the distribution, on a scale? One
way is to pick the value above which half of the data values fall and, by implication,
below which half of the data values fall. This measure is called the median. For our
AGE data, the median age is 56 years. Another measure of location is the center of
gravity of the numbers. Think of turning the stem-and-leaf diagram on its side and
balancing it. The balance point would be the mean. For a batch of numbers, the mean
is computed by averaging the values. In our sample, the mean age is 56.7 years. It is
quite close to the median.

Spread
One way to measure spread is to take the difference between the largest and smallest
value in the data. This is called the range. For the age data, the range is 47 years.
Another measure, called the interquartile range or midrange, is the difference between
the values at the limits of the middle 50% of the data. For AGE, this is 17 years. (Using
the statistics at the top of the stem-and-leaf display, subtract the lower hinge from the
upper hinge.) Still another way to measure would be to compute the average variability
in the values. The standard deviation is the square root of the average squared
deviation of values from the mean. For the AGE variable, the standard deviation is
11.62. Following is some output from STATS:
                     AGE
N of cases            50
Mean              56.700
Standard Dev      11.620

The Normal Distribution


All of these measures of location and spread have their advantages and disadvantages,
but the mean and standard deviation are especially useful for describing data that
follow a normal distribution. The normal distribution is a mathematical curve with
only two parameters in its equation: the mean and standard deviation. As you recall
from Chapter 1, a parameter defines a family of mathematical functions, all of which
have the same general shape. Thus, if data come from a normal distribution, we can
describe them completely (except for random variation) with only a mean and standard
deviation.
Let's see how this works for our AGE data. Shown in the next figure is a histogram
of AGE with the normal curve superimposed. The location (center) of this curve is at
the mean age of the sample (56.7), and its spread is determined by the standard
deviation (11.62).

[Figure: histogram of AGE with a superimposed normal curve. Horizontal axis: AGE (30 to 90); vertical axes: Count and Proportion per Bar.]

The fit of the curve to the data looks excellent. Let's examine the fit in more detail. For
a normal distribution, we would expect 68% of the observations to fall between one
standard deviation below the mean and one standard deviation above the mean (45.1
to 68.3 years). By counting values in the stem-and-leaf diagram, we find 34 cases, right
on target (68% of 50 is 34). This is not to say that every number follows a normal distribution exactly,
however. If we looked further, we would find that the tails of this distribution are
slightly shorter than those from a normal distribution, but not enough to worry.
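
The summaries quoted in the Location, Spread, and Normal Distribution discussions can all be recovered from the stem-and-leaf diagram. The following Python sketch rebuilds the 50 ages from the stems and leaves and recomputes them with the standard library. The `hinges` helper is a hypothetical name, written here under the common definition of hinges as the medians of the lower and upper halves of the sorted data.

```python
import statistics

# (stem, leaves) rows read from the stem-and-leaf diagram of AGE.
rows = [
    (3, "4"), (3, "689"), (4, "14"), (4, "556778999"), (5, "0011112"),
    (5, "556688889"), (6, "0023"), (6, "55677789"), (7, "04"), (7, "5668"),
    (8, "1"),
]
ages = sorted(10 * stem + int(leaf) for stem, leaves in rows for leaf in leaves)


def hinges(sorted_values):
    """Medians of the lower and upper halves (the middle value is shared when n is odd)."""
    n = len(sorted_values)
    half = (n + 1) // 2
    lower = statistics.median(sorted_values[:half])
    upper = statistics.median(sorted_values[n - half:])
    return lower, upper


mean = statistics.mean(ages)
median = statistics.median(ages)
sd = statistics.stdev(ages)                 # divisor is n - 1
value_range = max(ages) - min(ages)
lower_hinge, upper_hinge = hinges(ages)
hspread = upper_hinge - lower_hinge

# Number of cases within one standard deviation of the mean.
within_one_sd = sum(mean - sd <= age <= mean + sd for age in ages)

print(f"n = {len(ages)}, mean = {mean:.1f}, median = {median}, sd = {sd:.2f}")
print(f"range = {value_range}, hinges = ({lower_hinge}, {upper_hinge}), hspread = {hspread}")
print(f"within one sd: {within_one_sd} of {len(ages)}")
```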

Non-Normal Shape
Before you compute means and standard deviations on everything in sight, however,
let's take a look at some more data: the USDATA data. Following are histograms for
the first two variables, ACCIDENT and CARDIO:

[Figure: histograms of ACCIDENT and CARDIO, each with a superimposed normal curve. Vertical axes: Count and Proportion per Bar.]

Notice that the normal curves fit the distributions poorly. ACCIDENT is positively
skewed. That is, it has a long right tail. CARDIO, on the other hand, is negatively
skewed. It has a long left tail. The means (44.3 and 398.5) clearly do not fall in the
centers of the distributions. Furthermore, if you calculate the medians using the Stem
display, you will see that the mean for ACCIDENT is pulled away from the median
(41.9) toward the upper tail and the mean for CARDIO is pulled to the left of the
median (416.2).
In short, means and standard deviations are not good descriptors for non-normal
data. In these cases, you have two alternatives: either transform your data to look
normal, or find other descriptive statistics that characterize the data. If you log the
values of ACCIDENT, for example, the histogram looks quite normal. If you square the
values of CARDIO, the normal fit similarly improves.
If a transformation doesn't work, then you may be looking at data that come from a
different mathematical distribution or are mixtures of subpopulations (see below). The
probability plots in SYSTAT can help you identify certain mathematical distributions.
There is not room here to discuss parameters for more complex probability
distributions. Otherwise, you should turn to distribution-free summary statistics to
characterize your data: the median, range, minimum, maximum, midrange, quartiles,
and percentiles.

Subpopulations
Sometimes, distributions can look non-normal because they are mixtures of different
normal distributions. Let's look at the Fisher/Anderson IRIS flower measurements.
Following is a histogram of PETALLEN (petal length) smoothed by a normal curve:
[Figure: histogram of PETALLEN with a superimposed normal curve. Vertical axes: Count and Proportion per Bar.]

We forgot to notice that the petal length measurements involve three different flower
species. You can see one of them at the left. The other two are blended at the right.
Computing a mean and standard deviation on the mixed data is misleading.
The following box plot, split by species, shows how different the subpopulations are:
[Figure: box plots of PETALLEN (vertical axis) for each SPECIES (1, 2, 3).]

When there are such differences, you should compute basic statistics by group. If you
want to go on to test whether the differences in subpopulation means are significant,
use analysis of variance.
But first notice that the Setosa flowers (Group 1) have the shortest petals and the
smallest spread; while the Virginica flowers (Group 3) have the longest petals and
widest spread. That is, the size of the cell mean is related to the size of the cell standard
deviation. This violates the assumption of equal variances necessary for a valid
analysis of variance.

Here, we log transform the plot scale:

[Figure: box plots of PETALLEN for each SPECIES (1, 2, 3), with PETALLEN on a log scale.]

The spreads of the three distributions are now more similar. For the analysis, we should
log transform the data.

Descriptive Statistics in SYSTAT


Basic Statistics Main Dialog Box
To open the Basic Statistics main dialog box, from the menus choose:
Statistics
Descriptive Statistics
Basic Statistics

The following options are available:
• All Options. Calculate all available statistics.
• N. The number of nonmissing values for the variable.
• Minimum. The smallest nonmissing value.
• Maximum. The largest nonmissing value.
• Sum. The total of all nonmissing values of a variable.
• Mean. The arithmetic mean of a variable: the sum of the values divided by the
  number of (nonmissing) values.
• SEM. The standard error of the mean is the standard deviation divided by the square
  root of the sample size. It is the estimation error, or the average deviation of sample
  means from the expected value of a variable.
• CI of Mean. Endpoints for the confidence interval of the mean. You can specify
  confidence values between 0 and 1.
• Median. The median estimates the center of a distribution. If the data are sorted in
  increasing order, the median is the value above which half of the values fall.
• SD. Standard deviation, a measure of spread, is the square root of the sum of the
  squared deviations of the values from the mean divided by (n − 1).
• CV. The coefficient of variation is the standard deviation divided by the sample
  mean.
• Range. The difference between the maximum and the minimum values.
• Variance. The sum of the squared deviations of values from the mean divided by
  (n − 1). (Variance is the standard deviation squared.)
• Skewness. A measure of the symmetry of a distribution about its mean. If skewness
  is significantly nonzero, the distribution is asymmetric. A significant positive value
  indicates a long right tail; a negative value, a long left tail. A skewness coefficient is
  considered significant if the absolute value of SKEWNESS / SES is greater than 2.
• SES. The standard error of skewness, SQR(6/n).
• Kurtosis. A value of kurtosis significantly greater than 0 indicates that the variable
  has longer tails than those for a normal distribution; less than 0 indicates that the
  distribution is flatter than a normal distribution. A kurtosis coefficient is considered
  significant if the absolute value of KURTOSIS / SEK is greater than 2.
• SEK. The standard error of kurtosis, SQR(24/n).
• Confidence. Confidence level for the confidence interval of the mean. Enter a value
  between 0 and 1 (0.95 and 0.99 are typical values).
• N-tiles. Values that divide a sample of data into N groups containing (as far as possible)
  equal numbers of observations.
• Percentiles. Values that divide a sample of data into one hundred groups containing (as
  far as possible) equal numbers of observations.
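
The rule of thumb for judging skewness and kurtosis is simple enough to sketch in Python. The code below uses the approximate standard errors given in the list, SQR(6/n) and SQR(24/n), and flags a coefficient when its ratio to the standard error exceeds 2 in absolute value. The function names and the sample coefficient of 1.0 are hypothetical, chosen only to illustrate the arithmetic.

```python
import math


def ses(n):
    """Approximate standard error of skewness, SQR(6/n)."""
    return math.sqrt(6 / n)


def sek(n):
    """Approximate standard error of kurtosis, SQR(24/n)."""
    return math.sqrt(24 / n)


def is_significant(coefficient, standard_error, cutoff=2.0):
    """Apply the rule |coefficient / standard error| > cutoff."""
    return abs(coefficient / standard_error) > cutoff


n = 57                      # for example, a file with 57 cases
hypothetical_skewness = 1.0  # illustrative value, not taken from any output

print(f"SES = {ses(n):.3f}, SEK = {sek(n):.3f}")
print(is_significant(hypothetical_skewness, ses(n)))   # ratio is about 3.1
print(is_significant(hypothetical_skewness, sek(n)))   # ratio is about 1.5
```

Note that the same coefficient can be significant as a skewness but not as a kurtosis, because the standard error of kurtosis is twice as large.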

Saving Basic Statistics to a File


If you are saving statistics to a file, you must select the format in which the statistics
are to be saved:
• Variables. Use with a By Groups variable to save selected statistics to a data file.
  Each selected statistic is a case in the new data file (both the statistic and the
  group(s) are identified). The file contains the variable STATISTIC$ identifying the
  statistics.
• Aggregate. Saves aggregate statistics to a data file. For each By Groups category, a
  record (case) in the new data file contains all requested statistics. Three characters
  are appended to the first eight letters of the variable name to identify the statistics.
  The first two characters identify the statistic. The third character represents the
  order in which the variables are selected. The statistics correspond to the following
  two-letter combinations:

N of cases    NU        Std. Error        SE
Minimum       MI        Std. Deviation    SD
Maximum       MA        Variance          VA
Range         RA        C.V.              CV
Sum           SU        Skewness          SK
Median        MD        SE Skewness       ES
Mean          ME        Kurtosis          KU
CI Upper      CU        SE Kurtosis       EK
CI Lower      CL

Stem Main Dialog Box


To open the Stem main dialog box, from the menus choose:
Statistics
Descriptive Statistics
Stem-and-Leaf

Stem creates a stem-and-leaf plot for one or more variables. The plot shows the
distribution of a variable graphically. In a stem-and-leaf plot, the digits of each number
are separated into a stem and a leaf. The stems are listed as a column on the left, and
the leaves for each stem are in a row on the right. Stem-and-leaf plots also list the
minimum, lower-hinge, median, upper-hinge, and maximum values of the sample.

Unlike histograms, stem-and-leaf plots show actual numeric values to the precision of
the leaves.
The stem-and-leaf plot is useful for assessing distributional shape and identifying
outliers. Values that are markedly different from the others in the sample are labeled
as outside values; that is, the value is more than 1.5 hspreads outside its hinge (the
hspread is the distance between the lower and upper hinges, or quartiles). Under
normality, this translates into roughly 2.7 standard deviations from the mean.
The following must be specified to obtain a stem-and-leaf plot:
• Variable(s). A separate stem-and-leaf plot is created for each selected variable.

In addition, you can indicate how many lines (stems) to include in the plot.
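
The 1.5-hspread rule and its "roughly 2.7 standard deviations" equivalent can be checked with a few lines of Python. This sketch is illustrative only: `outside_fences` is a hypothetical helper, and the hinges 49 and 66 are the AGE hinges from the stem-and-leaf display in the Statistical Background section.

```python
from statistics import NormalDist


def outside_fences(lower_hinge, upper_hinge, multiplier=1.5):
    """Return the limits beyond which a value counts as an outside value."""
    hspread = upper_hinge - lower_hinge
    return lower_hinge - multiplier * hspread, upper_hinge + multiplier * hspread


# AGE example: hinges of 49 and 66 give an hspread of 17.
low_fence, high_fence = outside_fences(49, 66)
print(low_fence, high_fence)

# Under a standard normal distribution the hinges sit near the quartiles,
# so the upper fence is quartile + 1.5 * (2 * quartile) standard deviations
# above the mean.
z_quartile = NormalDist().inv_cdf(0.75)
fence_in_sd = z_quartile + 1.5 * (2 * z_quartile)
print(f"fence is about {fence_in_sd:.2f} standard deviations from the mean")
```

For AGE, the minimum (34) and maximum (81) both lie inside the fences, so no values would be flagged.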

Cronbach Main Dialog Box


To open the Cronbach main dialog box, from the menus choose:
Statistics
Scale
Cronbach's Alpha

Cronbach computes Cronbach's alpha. This statistic is a lower bound for test reliability
and ranges in value from 0 to 1 (negative values can occur when items are negatively
correlated). Alpha can be viewed as the correlation between the items (variables)
selected and all other possible tests or scales (with the same number of items)
constructed to measure the characteristic of interest. The formula used to calculate
alpha is:

\[
\alpha = \frac{k \cdot \mathrm{avg(cov)} / \mathrm{avg(var)}}{1 + (k - 1) \cdot \mathrm{avg(cov)} / \mathrm{avg(var)}}
\]

where k is the number of items, avg(cov) is the average covariance among the items,
and avg(var) is the average variance. Note that alpha depends on both the number of
items and the correlations among them. Even when the average correlation is small, the
reliability coefficient can be large if the number of items is large.
The following must be specified to obtain Cronbach's alpha:
• Variable(s). To obtain Cronbach's alpha, at least two variables must be selected.
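
The alpha formula above translates directly into code. The Python sketch below is a minimal illustration, not SYSTAT's implementation: the function names and the tiny two-item data set are hypothetical, and sample variances and covariances use the n − 1 divisor.

```python
from itertools import combinations
from statistics import mean, variance


def covariance(x, y):
    """Sample covariance of two equal-length sequences (divisor n - 1)."""
    mean_x, mean_y = mean(x), mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def alpha_from_ratio(k, ratio):
    """Alpha for k items, where ratio = avg(cov) / avg(var)."""
    return k * ratio / (1 + (k - 1) * ratio)


def cronbach_alpha(items):
    """items is a list of k item-score lists, one list per item (variable)."""
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are required")
    avg_var = mean([variance(item) for item in items])
    avg_cov = mean([covariance(x, y) for x, y in combinations(items, 2)])
    return alpha_from_ratio(k, avg_cov / avg_var)


# Hypothetical scores: two items that move together perfectly.
items = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]]
print(cronbach_alpha(items))

# When the item variances are equal, the ratio is the average correlation.
# A small ratio of 0.1 still yields a high alpha when there are 50 items.
print(round(alpha_from_ratio(50, 0.1), 3))
print(round(alpha_from_ratio(5, 0.1), 3))
```

The last two lines illustrate the remark about the number of items: with the same ratio of 0.1, alpha is far larger for 50 items than for 5.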

Using Commands
To generate descriptive statistics, choose your data by typing USE filename, and
continue with:
STATISTICS
STEM varlist / LINES=n
CRONBACH varlist
SAVE / AG
STATISTICS varlist / ALL N MIN MAX SUM MEAN SEM CIM,
CONFI=n MEDIAN SD CV RANGE VARIANCE,
SKEWNESS SES KURTOSIS SEK

Usage Considerations
Types of data. STATS uses only numeric data.
Print options. The output is standard for all PRINT options.
Quick Graphs. STATS does not create Quick Graphs.
Saving files. STATS saves basic statistics as either records (cases) or as variables.
BY groups. STATS analyzes data by groups.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. STATS uses the FREQ variable, if present, to duplicate cases.
Case weights. STATS uses the WEIGHT variable, if present, to weight cases. However,
STEM is not affected by the WEIGHT variable.


Examples
Example 1
Basic Statistics
This example uses the OURWORLD data file, containing one record for each of 57
countries, and requests the default set of statistics for BABYMORT (infant mortality),
GNP_86 (GNP per capita in 1986), LITERACY (percentage of the population who can
read), and POP_1990 (population, in millions, in 1990).
The Statistics procedure knows only that these are numeric variables; it does not
know if the mean and standard deviation are appropriate descriptors for their
distributions. In other examples, we learned that the distribution of infant mortality is
right-skewed and has distinct subpopulations, the GNP is missing for 12.3% of the
countries, the distribution of LITERACY is left-skewed and has distinct subgroups, and
a log transformation markedly improves the symmetry of the population values. This
example ignores those findings.
The input is:
STATISTICS
USE ourworld
STATISTICS babymort gnp_86 literacy pop_1990

Following is the output:


                 BABYMORT       GNP_86     LITERACY     POP_1990
N of cases             57           50           57           57
Minimum            5.0000     120.0000      11.6000       0.2627
Maximum          154.0000   17680.0000     100.0000     152.5051
Mean              48.1404    4310.8000      73.5632      22.8003
Standard Dev      47.2355    4905.8773      29.7646      30.3655

For each variable, SYSTAT prints the number of cases (N of cases) with data present.
Notice that the sample size for GNP_86 is 50, or 7 fewer than the total number of
observations. For each variable, Minimum is the smallest value and Maximum, the largest. Thus, the
lowest infant mortality rate is 5 deaths (per 1,000 live births), and the highest is 154
deaths. In a symmetric distribution, the mean and median are approximately the same.
The median for POP_1990 is 10.354 million people (see the stem-and-leaf plot
example). Here, the mean is 22.8 million, more than double the median. The mean
is quite sensitive to the extreme values in the right tail.
Standard Dev, or standard deviation, measures the spread of the values in each
distribution. When the data follow a normal distribution, we expect roughly 95% of the
values to fall within two standard deviations of the mean.


Example 2
Saving Basic Statistics: One Statistic and One Grouping Variable
For European, Islamic, and New World countries, we save the median infant mortality
rate, gross national product, literacy rate, and 1990 population using the OURWORLD
data file. The input is:
STATISTICS
USE ourworld
BY group$
SAVE mystats
STATISTICS babymort gnp_86 literacy pop_1990 / MEDIAN
BY

The text results that appear on the screen are shown below (they can also be sent to a
text file).
The following results are for:
GROUP$ = Europe
               BABYMORT     GNP_86   LITERACY   POP_1990
N of cases           20         18         20         20
Median            6.000   9610.000     99.000     10.462

GROUP$ = Islamic
               BABYMORT     GNP_86   LITERACY   POP_1990
N of cases           16         12         16         16
Median          113.000    335.000     28.550     16.686

GROUP$ = NewWorld
               BABYMORT     GNP_86   LITERACY   POP_1990
N of cases           21         20         21         21
Median           32.000   1275.000     85.600      7.241

The MYSTATS data file (created in the SAVE step) is shown below:
Case  GROUP$     STATISTIC$   BABYMORT   GNP_86   LITERACY   POP_1990
1     Europe     N of cases         20       18         20         20
2     Europe     Median              6     9610         99     10.462
3     Islamic    N of cases         16       12         16         16
4     Islamic    Median            113      335     28.550     16.686
5     NewWorld   N of cases         21       20         21         21
6     NewWorld   Median             32     1275       85.6      7.241

Use a statement such as this to eliminate the sample size records:
SELECT statistic$ <> "N of cases"

Example 3
Saving Basic Statistics: Multiple Statistics and Grouping Variables
If you want to save two or more statistics for each unique cross-classification of the
values of the grouping variables, SYSTAT can write the results in two ways:
n A separate record for each statistic. The values of a new variable named

STATISTICS$ identify the statistics.


n One record containing all the requested statistics. SYSTAT generates variable

names to label the results.


The first layout is the default; the second is obtained using:
SAVE filename / AG

As examples, we save the median, mean, and standard error of the mean for the
cross-classification of type of country with government for the OURWORLD data. The nine
cells for which we compute statistics are shown below (the number of countries is
displayed in each cell):
             Democracy   Military   One Party
Europe              16                      4
Islamic              4          7           5
New World           12          6           3

Note the empty cell in the first row. We illustrate both file layouts: a separate record
for each statistic and one record for all results.
One record per statistic. The following commands are used to compute and save
statistics for the combinations of GROUP$ and GOV$ shown in the table above:
STATISTICS
USE ourworld
BY group$ gov$
SAVE mystats2
STATISTICS babymort gnp_86 literacy pop_1990 / MEDIAN MEAN SEM
BY


The MYSTATS2 file with 32 cases and seven variables is shown below:
Case  GROUP$    GOV$       STATISTC$   BABYMORT     GNP_86  LITERACY  POP_1990
   1  Europe    Democracy  N of Cases    16.000     16.000    16.000    16.000
   2  Europe    Democracy  Mean           6.875   9770.000    97.250    22.427
   3  Europe    Democracy  Std. Error     0.547   1057.226     1.055     5.751
   4  Europe    Democracy  Median         6.000  10005.000    99.000     9.969
   5  Europe    OneParty   N of Cases     4.000      2.000     4.000     4.000
   6  Europe    OneParty   Mean          11.500   2045.000    98.750    20.084
   7  Europe    OneParty   Std. Error     1.708     25.000     0.250     6.036
   8  Europe    OneParty   Median        12.000   2045.000    99.000    15.995
   9  Islamic   Democracy  N of Cases     4.000      4.000     4.000     4.000
  10  Islamic   Democracy  Mean          91.000    700.000    37.300    12.761
  11  Islamic   Democracy  Std. Error    23.083    378.660     9.312     5.315
  12  Islamic   Democracy  Median        97.000    370.000    29.550    12.612
  13  Islamic   OneParty   N of Cases     5.000      3.000     5.000     5.000
  14  Islamic   OneParty   Mean         109.800   1016.667    29.720    15.355
  15  Islamic   OneParty   Std. Error    15.124    787.196     9.786     3.289
  16  Islamic   OneParty   Median       116.000    280.000    18.000    15.862
  17  Islamic   Military   N of Cases     7.000      5.000     7.000     7.000
  18  Islamic   Military   Mean         110.857    458.000    37.886    51.444
  19  Islamic   Military   Std. Error    11.801    180.039     7.779    18.678
  20  Islamic   Military   Median       116.000    350.000    29.000    51.667
  21  NewWorld  Democracy  N of Cases    12.000     12.000    12.000    12.000
  22  NewWorld  Democracy  Mean          44.667   2894.167    85.800    26.490
  23  NewWorld  Democracy  Std. Error     9.764   1085.810     3.143    11.926
  24  NewWorld  Democracy  Median        35.000   1645.000    86.800    15.102
  25  NewWorld  OneParty   N of Cases     3.000      2.000     3.000     3.000
  26  NewWorld  OneParty   Mean          14.667   2995.000    90.500     4.441
  27  NewWorld  OneParty   Std. Error     1.333   2155.000     8.251     3.153
  28  NewWorld  OneParty   Median        16.000   2995.000    98.500     2.441
  29  NewWorld  Military   N of Cases     6.000      6.000     6.000     6.000
  30  NewWorld  Military   Mean          53.167   1045.000    63.000     6.886
  31  NewWorld  Military   Std. Error    13.245    287.573    10.820     1.515
  32  NewWorld  Military   Median        55.000    780.000    60.500     5.726

The average infant mortality rate for European democratic nations is 6.875 (case 2),
while the median is 6.0 (case 4).
One record for all statistics. Instead of four records (cases) for each combination of
GROUP$ and GOV$, we specify AG (aggregate) to prompt SYSTAT to write one
record for each cell:


STATISTICS
USE ourworld
BY group$ gov$
SAVE mystats3 / AG
STATISTICS babymort gnp_86 literacy pop_1990 / MEDIAN MEAN SEM
BY

The MYSTATS3 file, with 8 cases and 18 variables, is shown below. (We separated
them into three panels and shortened the variable names):
Case  GROUP$    GOV$       NU1BABYM  ME1BABYM  SE1BABYM  MD1BABYM
   1  Europe    Democracy        16     6.875     0.547       6.0
   2  Europe    OneParty          4    11.500     1.708      12.0
   3  Islamic   Democracy         4    91.000    23.083      97.0
   4  Islamic   OneParty          5   109.800    15.124     116.0
   5  Islamic   Military          7   110.857    11.801     116.0
   6  NewWorld  Democracy        12    44.667     9.764      35.0
   7  NewWorld  OneParty          3    14.667     1.333      16.0
   8  NewWorld  Military          6    53.167    13.245      55.0

Case  NU2GNP_8  ME2GNP_8  SE2GNP_8  MD2GNP_8  NU3LITER  ME3LITER
   1        16  9770.000  1057.226     10005        16    97.250
   2         2  2045.000    25.000      2045         4    98.750
   3         4   700.000   378.660       370         4    37.300
   4         3  1016.667   787.196       280         5    29.720
   5         5   458.000   180.039       350         7    37.886
   6        12  2894.167  1085.810      1645        12    85.800
   7         2  2995.000  2155.000      2995         3    90.500
   8         6  1045.000   287.573       780         6    63.000

Case  SE3LITER  MD3LITER  NU4POP_1  ME4POP_1  SE4POP_1  MD4POP_1
   1     1.055      99.0        16    22.427     5.751     9.969
   2     0.250      99.0         4    20.084     6.036    15.995
   3     9.312      29.5         4    12.761     5.315    12.612
   4     9.786      18.0         5    15.355     3.289    15.862
   5     7.779      29.0         7    51.444    18.678    51.667
   6     3.143      86.8        12    26.490    11.926    15.102
   7     8.251      98.5         3     4.441     3.153     2.441
   8    10.820      60.5         6     6.886     1.515     5.726

Note that there are no European countries with Military governments, so no record is
written.
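The relationship between the two layouts can be sketched in Python. This is an illustration only: the naming rule in wide_name (a two-letter statistic prefix, the position of the variable in the request, and the variable name, cut to eight characters) is inferred from the names displayed above and is an assumption, not a documented SYSTAT rule. The functions wide_name and aggregate are hypothetical helpers.

```python
# Statistic labels mapped to the two-letter prefixes seen in the MYSTATS3 listing.
PREFIX = {"N of Cases": "NU", "Mean": "ME", "Std. Error": "SE", "Median": "MD"}

def wide_name(stat, var_index, var_name):
    """Build a name such as ME1BABYM (assumed rule, truncated to 8 characters)."""
    return (PREFIX[stat] + str(var_index) + var_name)[:8]

def aggregate(long_records, variables):
    """Collapse one-record-per-statistic rows into one record per cell."""
    cells = {}
    for rec in long_records:
        key = (rec["GROUP$"], rec["GOV$"])
        row = cells.setdefault(key, {"GROUP$": key[0], "GOV$": key[1]})
        for i, var in enumerate(variables, start=1):
            row[wide_name(rec["STATISTC$"], i, var)] = rec[var]
    return list(cells.values())

# BABYMORT values for two cells, copied from the MYSTATS2 listing.
long_records = [
    {"GROUP$": "Europe", "GOV$": "Democracy", "STATISTC$": s, "BABYMORT": v}
    for s, v in [("N of Cases", 16.0), ("Mean", 6.875),
                 ("Std. Error", 0.547), ("Median", 6.0)]
] + [
    {"GROUP$": "Europe", "GOV$": "OneParty", "STATISTC$": s, "BABYMORT": v}
    for s, v in [("N of Cases", 4.0), ("Mean", 11.5),
                 ("Std. Error", 1.708), ("Median", 12.0)]
]

wide = aggregate(long_records, ["BABYMORT"])
print(wide[0])
```

Because a cell with no cases produces no long records, it also produces no aggregated record, which matches the missing Europe/Military row.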


Example 4
Stem-and-Leaf Plot
We request robust statistics for BABYMORT (infant mortality), POP_1990 (1990
population in millions), and LITERACY (percentage of the population who can read)
from the OURWORLD data file. The input is:
STATISTICS
USE ourworld
STEM babymort pop_1990 literacy

The output follows:


Stem and Leaf Plot of variable: BABYMORT, N = 57
Minimum:          5.0000
Lower hinge:      7.0000
Median:          22.0000
Upper hinge:     74.0000
Maximum:        154.0000

   0 H 5666666666677777
   1   00123456668
   2 M 227
   3   028
   4   9
   5
   6   11224779
   7 H 4
   8   77
   9
  10   77
  11   066
  12   559
  13   6
  14   07
  15   4

Stem and Leaf Plot of variable: POP_1990, N = 57
Minimum:          0.2627
Lower hinge:      6.1421
Median:          10.3545
Upper hinge:     25.5665
Maximum:        152.5051

   0   00122333444
   0 H 5556667777788899
   1 M 0000034
   1   556789
   2   14
   2 H 56
   3   23
   3   79
   4   4
   5   1
       * * * Outside Values * * *
   5   6677
   6   2
  11   48
  15   2


Stem and Leaf Plot of variable: LITERACY, N = 57
Minimum:         11.6000
Lower hinge:     55.0000
Median:          88.0000
Upper hinge:     99.0000
Maximum:        100.0000

   1   1258
   2   035689
   3   1
   4
   5 H 002556
   6   355
   7   0446
   8 M 03558
   9 H 03344457888889999999999999
  10   00

In a stem-and-leaf plot, the digits of each number are separated into a stem and a leaf.
The stems are listed as a column on the left, and the leaves for each stem are in a row
on the right. For infant mortality (BABYMORT), the Maximum number of babies who
die in their first year of life is 154 (out of 1,000 live births). Look for this value at the
bottom of the BABYMORT display. The stem for 154 is 15, and the leaf is 4. The
Minimum value for this variable is 5; its leaf is 5 with a stem of 0.
The median value of 22 is printed here as the Median in the top panel and marked
by an M in the plot. The hinges, marked by Hs in the plot, are 7 and 74 deaths, meaning
that 25% of the countries in our sample have a death rate of 7 or less, and another 25%
have a rate of 74 or higher. Furthermore, the gaps between 49 and 61 deaths and
between 87 and 107 indicate that the sample does not appear homogeneous.
Focusing on the second plot, the median population size is 10.354, or more than 10
million people. One-quarter of the countries have a population of 6.142 million or less.
The largest country (Brazil) has more than 152 million people. The largest stem for
POP_1990 is 15, like that for BABYMORT. This 15 comes from 152.505, so the 2 is
the leaf and the 0.505 is lost.
The plot for POP_1990 is very right-skewed. Notice that a real number line extends
from the minimum stem of 0 (0.263) to the stem of 5 for 51 million. The values below
Outside Values (stems of 5, 6, 11, and 15 with 8 leaves) do not fall along a number line,
so the right tail of this distribution extends further than one would think at first glance.
The median in the final plot indicates that half of the countries in our sample have
a literacy rate of 88% or better. The upper hinge is 99%, so more than one-quarter of
the countries have a rate of 99% or better. In the country with the lowest rate (Somalia),
only 11.6% of the people can read. The stem for 11.6 is 1 (the 10s digit), and the leaf
is 1 (the units digit). The 0.6 is not part of the display. For stem 10, there are two leaves
that are 0, so two countries have 100% literacy rates (Finland and Norway). Notice
the 11 countries (at the top of the plot) with very low rates. Is there a separate subgroup
here?
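The stem and leaf rule described above (truncate the value, use the last retained digit as the leaf and the remaining digits as the stem) can be sketched in a few lines of Python. This is a simplified illustration, not SYSTAT's display routine: it assumes non-negative values and one line per stem, and it does not mark hinges, the median, or outside values. The sample list is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return the lines of a basic stem-and-leaf display for non-negative values.

    Each value is truncated to a whole number of leaf units; the last digit
    is the leaf and the remaining digits form the stem.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        units = int(v / leaf_unit)          # truncate, e.g. 11.6 -> 11
        stem, leaf = divmod(units, 10)      # 154 -> stem 15, leaf 4
        leaves[stem].append(str(leaf))
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):   # keep empty stems visible
        lines.append(f"{stem:>3} | {''.join(leaves.get(stem, []))}")
    return lines

# Hypothetical sample chosen to echo values discussed in the text.
sample = [5, 11.6, 22, 74, 100, 154]
for line in stem_and_leaf(sample):
    print(line)
```

Printing the empty stems is what makes gaps in the data visible, as with the gap between 49 and 61 deaths noted above.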


Transformations
Because the distribution of POP_1990 is very skewed, it may not be suited for analyses
based on normality. To find out, we transform the population values to log base 10 units
using the L10 function. The input is:
STATISTICS
USE ourworld
LET logpop90=L10(pop_1990)
STEM logpop90

Following is the output:


Stem and Leaf Plot of variable: LOGPOP90, N = 57
Minimum:         -0.5806
Lower hinge:      0.7883
Median:           1.0151
Upper hinge:      1.4077
Maximum:          2.1833

  -0   5
       * * * Outside Values * * *
   0   01
   0   33
   0   445
   0 H 6667777
   0   888888899999
   1 M 00000111
   1   2222233
   1 H 445555
   1   777777
   1
   2   001
   2

For the untransformed values of the population, the stem-and-leaf plot identifies eight
outliers. Here, there is only one outlier. More important, however, is the fact that the
shape of the distribution for these transformed values is much more symmetric.
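One rough way to quantify the change in symmetry is to compare the distance from the median to each hinge, using the values printed in the two displays. The Python sketch below does this; the function name hinge_skew and the choice of this ratio as a symmetry measure are assumptions made for illustration, not a SYSTAT statistic.

```python
import math

def hinge_skew(lower, median, upper):
    """Ratio of upper to lower hinge spread; 1.0 means symmetric about the median."""
    return (upper - median) / (median - lower)

# Hinges and medians as printed in the POP_1990 and LOGPOP90 displays.
raw = hinge_skew(6.1421, 10.3545, 25.5665)
logged = hinge_skew(0.7883, 1.0151, 1.4077)
print(round(raw, 2), round(logged, 2))

# The transformed extremes are the base 10 logs of the raw extremes.
print(round(math.log10(0.2627), 4), round(math.log10(152.5051), 4))
```

A ratio closer to 1 after the transformation is consistent with the more symmetric shape seen in the plot.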

Subpopulations
Here, we stratify the values of LITERACY for countries grouped as European, Islamic,
and New World. The input is:
STATISTICS
USE ourworld
BY group$
STEM babymort pop_1990 literacy
BY


The output follows:


The following results are for:
GROUP$ = Europe
Stem and Leaf Plot of variable: LITERACY, N = 20
Minimum:         83.0000
Lower hinge:     98.0000
Median:          99.0000
Upper hinge:     99.0000
Maximum:        100.0000

  83   0
  93   0
  95   0
       * * * Outside Values * * *
  97   0
  98 H 000
  99 M 00000000000
 100   00

The following results are for:
GROUP$ = Islamic
Stem and Leaf Plot of variable: LITERACY, N = 16
Minimum:         11.6000
Lower hinge:     19.0000
Median:          28.5500
Upper hinge:     53.5000
Maximum:         70.0000

   1 H 1258
   2 M 05689
   3   1
   4
   5 H 0255
   6   5
   7   0

The following results are for:
GROUP$ = NewWorld
Stem and Leaf Plot of variable: LITERACY, N = 21
Minimum:         23.0000
Lower hinge:     74.0000
Median:          85.6000
Upper hinge:     94.0000
Maximum:         99.0000

   2   3
       * * * Outside Values * * *
   5   0
   5   6
   6   3
   6   5
   7 H 44
   7   6
   8   0
   8 M 558
   9 H 03444
   9   8899


The literacy rates for Europe and the Islamic nations do not even overlap. The rates
range from 83% to 100% for the European nations and 11.6% to 70% for the Islamic nations. Earlier,
11 countries were identified that have rates of 31% or less. From these stratified results,
we learn that 10 of the countries are Islamic and 1 (Haiti) is from the New World. The
Haitian rate (23%) is identified as an outlier with respect to the values of the other New
World countries.

Computation
All computations are in double precision.

Algorithms
SYSTAT uses a one-pass provisional algorithm (Spicer, 1972). Wilkinson and Dallal
(1977) summarize the performance of this algorithm versus those used in several
statistical packages.
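The idea of a one-pass provisional algorithm is that the mean and the sum of squared deviations are updated as each case is read, so the data need not be stored or read twice. The Python sketch below uses the widely known provisional-means update; it is offered as an illustration of the general technique and is not claimed to reproduce the exact formulas of Spicer (1972) or SYSTAT's internal code. The function name and the sample data are hypothetical.

```python
import math

def one_pass_moments(values):
    """Return (n, mean, sample variance) using a single pass over the data."""
    n = 0
    mean = 0.0
    ss = 0.0  # running sum of squared deviations about the current mean
    for x in values:
        n += 1
        delta = x - mean          # deviation from the provisional mean
        mean += delta / n         # update the provisional mean
        ss += delta * (x - mean)  # update the sum of squares
    variance = ss / (n - 1) if n > 1 else float("nan")
    return n, mean, variance

n, mean, variance = one_pass_moments([2, 4, 4, 4, 5, 5, 7, 9])
sem = math.sqrt(variance / n)     # standard error of the mean
print(n, mean, variance, sem)
```

Updating deviations about a provisional mean avoids subtracting two large, nearly equal sums, which is the main source of rounding error in the naive sum-of-squares formula.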

References
Spicer, C. C. (1972). Calculation of power sums of deviations about the mean. Applied
Statistics, 21, 226-227.
Wilkinson, L. and Dallal, G. E. (1977). Accuracy of sample moments calculations among
widely used statistical programs. The American Statistician, 31, 128-131.

Chapter 10
Design of Experiments
Herb Stenson

Design of Experiments (DOE) generates design matrices for a variety of ANOVA and
mixture models. You can use Design of Experiments as both an online library and a
search engine for experimental designs, saving any design to a SYSTAT file. You can
run the associated experiment, add the values of a dependent variable to the same file,
and analyze the experimental data by using General Linear Model (or another
SYSTAT statistical procedure).
SYSTAT offers three methods for generating experimental designs: Classic DOE,
the DOE Wizard, and the DESIGN command.
• Classic DOE provides a standard dialog interface for generating the most popular
  complete (full) and incomplete (fractional) factorial designs. Complete factorial
  designs can have two or three levels of each factor, with two-level designs limited
  to two to seven factors, and three-level designs limited to two to five factors.
  Incomplete designs include: Latin square designs with 3 to 12 levels per factor;
  selected two-level designs described by Box, Hunter, and Hunter (1978) with 3 to
  11 factors and from 4 to 128 runs; 13 of the most popular Taguchi (1987) designs;
  all of the Plackett and Burman (1946) two-level designs with 4 to 100 runs; the 6
  three-, five-, and seven-level designs described by Plackett and Burman; and the
  set of 10 three-level designs described by Box and Behnken (1960) in both their
  blocked and unblocked versions. In addition, the Lattice, Centroid, Axial, and
  Screening mixture designs can be generated. The number of factors (components
  of a mixture) can be as large as your computer's memory allows.
• The DOE Wizard provides an alternative interface consisting of a series of
  questions defining the structure of the design. The wizard offers more designs
  than Classic DOE, including response surface and optimal designs. Optimization
  methods include the Fedorov, k-exchange, and coordinate exchange algorithms
  with three optimality criteria available. The coordinate exchange algorithms
  accommodate both continuous and categorical variables. The search algorithms for
  fractional factorial designs allow any number of levels for any factor and search for
  orthogonal, incomplete blocks if requested. The number of factors for factorial,
  central composite, and optimal designs is restricted only by your computer's
  memory.
• The DESIGN command generates all designs found in Classic DOE using
  SYSTAT's command language.


Designs can be replicated as many times as you want, and the runs can be randomized.

Statistical Background
The Research Problem
As an investigator interested in solving problems, you are faced with the task of
identifying good solutions. You do this by using what you already know about the
problem area to make a judgment about the solution(s). If you possess in-depth process
knowledge, then there is little work to be done; you simply apply that knowledge to the
problem at hand and derive a solution.
More common is the situation in which you have limited knowledge about the
factors involved and their interrelationships, so that any conjecture would be quite
uncertain and far from optimal. In these situations, the first step would be to enhance
your knowledge. This is usually done by empirical investigation, that is, by
systematically observing the factors and how they affect the outcome of interest. The
results of these observations become the data in your study.
Process problems usually have factors, or variables, that may affect the outcome,
and responses that measure the outcome of interest. The basic problem-solving
approach is to develop a model that helps you understand the specific relationships
between factors and responses. Such a model allows you to predict which factor values
will lead to a desired response, or outcome. These empirical data provide the statistical
basis used to generate models of your process.


Types of Investigation
You can think of any empirical investigation as falling into one of two broad classes:
experiment or observational study. The two classes have different properties and are
used to approach different types of problems.

Experiments
Experiments are studies in which the factors are under the direct control of the
experimenter. That is, the experimenter assigns certain values of the factors to each
run, or observation. The response(s) are recorded for each chosen combination of
factor levels.
Because the factors are being manipulated by the experimenter, the experimenter
can make inferences about causality. If assigning a certain temperature leads to a
decrease in the output of a chemical process, you can be fairly certain that temperature
really did cause the decrease because you assigned the temperature value while holding
other factors constant.
Unfortunately, experiments do have a drawback in that there are some situations in
which it is either impossible or impractical, or even unethical, to exercise control over
the factors of interest. In those situations, an observational study must be used.

Observational Studies
Observational studies use only minimal, if any, intervention by the observer on the
process. The observer merely observes and records changes in the response as the
factors undergo their natural variation. No attempt is made to control the factors.
Because the factors are not under the control of the experimenter, observational
studies are very limited in their ability to explain causal relationships. For example, if
you observe that shoe size and scholastic achievement show a strong relationship
among school children, can you infer that larger feet cause achievement? Of course
not. The truth of the matter is that both variables are most likely caused by a third
(unmeasured) variable: age. Older students have larger feet, and they have been in
school longer. If you could have some control over shoe size, you could make sure that
shoe sizes were evenly distributed across students of different ages, and you would be
in a much better position to make inferences about the causal relationship between
shoe size and achievement.


But of course it's silly to speak of controlling shoe size, since you can't change the
size of people's feet. This illustrates the strength of observational studies: they can be
employed where true experimental studies are impossible, for either ethical or practical
reasons.
Because the focus of this chapter is the design and analysis of experimental studies,
further references to observational studies will be minimal.

The Importance of Having a Strategy


Controlling the factors in an experiment is only the beginning of effective experimental
research. Once you determine that you have a problem that can be addressed by
experimentation, you need to answer other crucial questions: what will your
experiment look like? What levels of which factors will you measure? How will you
analyze the results to convert your data to knowledge? These are the questions that
SYSTAT can help you answer.
Careful planning of your experiment will give you many advantages over a poorly
designed, haphazard approach to data collection. As Box, Hunter, and Hunter (1978)
point out,
Frequently conclusions are easily drawn from a well-designed experiment, even
when rather elementary methods of analysis are employed. Conversely, even the
most sophisticated statistical analysis cannot salvage a badly designed experiment.
(p. vii.)

Completeness
By using a well-designed experiment, you will be able to discover the most important
relationships in your process. Lack of planning can lead to incomplete designs that
leave certain questions unanswered, confounding that causes confusion of two or more
effects so that they become statistically indistinguishable, and poor precision of
estimates.

Efficiency
Carefully planned experiments allow you to get the information you need at a fraction
of the cost of a poorly planned design. Content knowledge can be applied to select
specific effects of interest, and your experimental runs can be targeted to answer just
those effects. Runs are not wasted on testing effects you already understand well.


Insight
A well-designed experiment allows you to see patterns in the data that would be
difficult to spot in a simple table of hastily collected values. The mathematical model
you build based on your observations will be more reliable, more accurate, and more
informative if you use well-chosen run points from an appropriate experimental
design.

The Role of Experimental Design in Research


Experimental design is the interface between your question and the real world. The
design tells you how much data you will need to collect, what factor levels to use for
the run points, and how to analyze the results to get a useful model of the process. The
model you derive from your experiment can then be applied to the problem at hand,
enhancing your knowledge and allowing you to confidently formulate a solution.
The figure below illustrates the flow of knowledge in experimental research. Notice
that the diagram is circular: you start with some knowledge, formulate a research
question, and perform the research; then the knowledge you gained from the research is
used to formulate new research questions, new designs, and so on. As you go through the
iterations, you should find that your information increases in both quantity and quality.
[Figure: the cycle of experimental research, with stages Prior Knowledge, Research
Question, Experimental Design, Data Collection, Analysis, Interpretation, and New
Knowledge.]

Types of Experimental Designs


There is a wide variety of experimental designs, each of which addresses a different
type of research problem. These designs tend to fall into broad classes, which can be
summarized as follows:
• Factorial designs. These designs are used to identify important effects in your
  process.
• Response surface designs. These designs are useful when you want to find the
  combination of factor values that gives the highest (or lowest) response.
• Mixture designs. These designs are useful when you want to find the ideal
  proportions of ingredients for a mixture process. Mixture designs take into account
  the fact that all the component proportions must sum to 1.0.
• Optimal designs. These designs are useful when you have enough information
  available to give a very detailed specification of the model you want to test.
  Because optimal designs are very flexible, you can use them in situations where no
  standard design is available. Optimal designs are also useful when you want to
  have explicit control over the type of efficiency maximized by the design.

Factorial Designs
In investigating the factors that affect a certain process, the basic building blocks of
your investigation are observations of the system under different conditions. You vary
the factors under your control and measure what happens to the outcome of the process.
The naive inquirer might use a haphazard, trial-and-error approach to testing the
factors. Of course, this approach can take a long time and many observations, or runs,
to give reasonable results (if it does at all), and, in fact, it may fail to reveal important
effects because of the lack of an investigative strategy.
Someone more familiar with scientific methodology might make systematic
comparisons of various levels of each factor, holding the others constant. However,
while this approach is more reliable than the trial-and-error approach, it can still cause
you to overlook important effects. Consider the following hypothetical response plot.
The contours indicate points of equal response.
[Figure: contour plot of the hypothetical response, with X1 (10 to 100) on the
horizontal axis and X2 (20 to 220) on the vertical axis.]


If you tried the one-at-a-time approach, your ability to accurately measure the effects
of the variables would depend on the initial settings you chose. For example, if you
chose the point indicated by the horizontal line as your fixed starting value for x2 as
you varied x1, you would conclude that the maximum response occurs when x1 = 47.
Then, you would fix x1 at 47 and vary x2, concluding that the maximum response
occurs when x2 = 98. The two following figures illustrate this problem. However, it
is clear from the previous contours that the maximum response occurs where x1 = 100
and x2 = 220, or perhaps even somewhere outside the range that you've measured.
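The same failure can be reproduced numerically. The Python sketch below compares a one-at-a-time search with a search over all combinations of levels. The response function is a hypothetical ridge invented for this illustration; it is not the surface in the figure, and the function names are likewise assumptions.

```python
def response(x1, x2):
    """Hypothetical ridge-shaped response; not the surface from the figure."""
    return -(x1 - x2) ** 2 + 0.1 * (x1 + x2)

levels = range(0, 11)  # candidate settings 0..10 for each factor

def one_at_a_time(start_x2):
    """Vary x1 with x2 fixed, then vary x2 with x1 fixed at its best value."""
    best_x1 = max(levels, key=lambda x1: response(x1, start_x2))
    best_x2 = max(levels, key=lambda x2: response(best_x1, x2))
    return best_x1, best_x2

def all_combinations():
    """Evaluate every combination of levels and return the best one."""
    return max(((a, b) for a in levels for b in levels),
               key=lambda p: response(*p))

print(one_at_a_time(start_x2=2))   # stalls on the ridge near the start
print(all_combinations())          # finds the best point on the grid
```

Because the two factors interact, the best setting of x1 depends on x2, so the one-at-a-time search stops at the starting ridge point while the search over combinations reaches the corner of the region.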
[Figure: two one-at-a-time response plots: response versus X1 (X2 held constant at 98.0)
and response versus X2 (X1 held constant at 47.0).]

This illustrates the importance of considering the factors simultaneously. The only way
to find the true effects of the factors on the response variable is to take measurements
at carefully planned combinations of the factor levels, as shown below. Such designs
are called factorial designs. A factorial design that could be used to explore the
hypothetical process would take measurements at high, medium, and low levels of
each factor, with all combinations of levels used in the design.


[Figure: contour plot of the hypothetical response over X1 and X2, showing the
factorial design points.]

Factorial designs can be classified into two broad types: full (or complete) factorials
and fractional factorials, shown below. Full factorials (a) use observations at all
combinations of all factor levels. Full factorials give a lot of insight into the effects of
the factors, particularly interactions, or joint effects of variables. Unfortunately, they
often require a large number of runs, which means that they can be expensive.
Fractional factorials (b) use only some combinations of factor levels. This means that
they are efficient, requiring fewer runs than their full factorial counterpart. However,
to gain this efficiency, they sacrifice some (or all) of their ability to measure interaction
effects. This makes them ill-suited to exploring the details of complex processes.
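The difference in run counts is easy to see by enumerating the designs. The Python sketch below builds full factorials with itertools.product and takes a half fraction of a two-level, three-factor design by keeping the runs where the product of the coded levels is +1 (the common defining relation I = ABC). The helper name full_factorial and the choice of fraction are assumptions for illustration; this is not the search procedure SYSTAT uses.

```python
from itertools import product

def full_factorial(levels_per_factor):
    """All combinations of the given factor levels, one tuple per run."""
    return list(product(*levels_per_factor))

# Two-level, three-factor full factorial in coded units: 2^3 = 8 runs.
full = full_factorial([(-1, 1)] * 3)

# Half fraction defined by I = ABC: keep runs whose coded levels multiply to +1.
half = [run for run in full if run[0] * run[1] * run[2] == 1]

# Three-level, two-factor full factorial: 3^2 = 9 runs.
three_level = full_factorial([("low", "medium", "high")] * 2)

print(len(full), len(half), len(three_level))
```

In the half fraction the level of A always equals the product of the levels of B and C, so the main effect of A cannot be separated from the BC interaction. This is the loss of interaction information described above.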

Fractional Factorial Design Types


The following types of fractional factorial designs can be generated:


• Homogeneous fractional. These are fractional designs in which all factors have the
  same number of levels.
• Mixed-level fractional. These are fractional designs in which factors have different
  numbers of levels.
• Box-Hunter. This is a set of fractional designs for two-level factors that can be
  specified based on the number of factors and the number of runs (as a power of 2).
• Plackett-Burman. These designs are saturated (or nearly saturated) fractional
  factorial designs based on orthogonal arrays. They are very efficient for estimating
  main effects but rely on the absence of two-factor interactions.
• Taguchi. These designs are orthogonal arrays allowing for a maximum number of
  main effects to be estimated from a minimum number of runs in the experiment
  while allowing for differences in the number of factor levels.
• Latin square. These designs are useful when there are restrictions on randomization,
  where you need to isolate the effects of one or more blocking (or "nuisance")
  factors. In Latin square designs, all factors must have the same number of levels.
  Graeco-Latin squares and hyper-Graeco-Latin squares can also be generated when
  you need to isolate the effects of more than one "nuisance" variable.

Analysis of Factorial Designs


Factorial designs are usually analyzed as linear models. The models available for a
design depend on the number of factors and their levels and whether the design is full
or fractional.
The simplest models are main-effects models. A main-effects model is summarized
by the following equation:

$y = \mu + \alpha_i + \beta_j + \cdots + \varepsilon$
where y is the response variable and $\alpha_i, \beta_j, \ldots$ represent the treatment effects of the
factors. This model assumes that all interactions are negligible. These models are
useful for describing very simple processes and for analyzing fractional designs of low
resolution. They are also useful for analyzing screening designs, where the goal is not
necessarily to model all effects realistically but merely to identify influential factors for
further study.
The next level of model complexity, the second-order model, involves adding
two-factor interaction terms to the equation. Following is an example for a
two-factor model:


$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$
This model allows you to explore joint effects of factors taken in pairs. For example,
the term $(\alpha\beta)_{ij}$ allows you to see whether the effect of the factor $\alpha$ on y depends on
the level of $\beta$. If this term is significant, you can conclude that the effect of $\alpha$ does
indeed depend on the level of $\beta$.
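What "the effect of one factor depends on the level of another" means can be shown with a two-level, two-factor example. In the Python sketch below the cell means are hypothetical numbers, and the factors are called A and B with levels coded -1 and +1; the halved contrasts are one common convention for reporting effects in two-level designs.

```python
# Hypothetical cell means, keyed by (level of A, level of B) in coded units.
cell_means = {(-1, -1): 20.0, (1, -1): 30.0, (-1, 1): 25.0, (1, 1): 55.0}

def effect_of_a_at(b):
    """Change in the response when A moves from low to high, with B held at b."""
    return cell_means[(1, b)] - cell_means[(-1, b)]

# Main effect of A: the average of its effect over the two levels of B.
main_a = (effect_of_a_at(-1) + effect_of_a_at(1)) / 2

# Interaction: half the difference between the effect of A at high B and at low B.
interaction_ab = (effect_of_a_at(1) - effect_of_a_at(-1)) / 2

print(effect_of_a_at(-1), effect_of_a_at(1), main_a, interaction_ab)
```

Here A raises the response by 10 when B is low but by 30 when B is high, so a main-effects model, which assumes a single effect of A, would describe neither condition well.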

Response Surface Designs


There are many situations in which it is not enough to know simply which factors affect
a process. You need to know exactly what combination of values for the factors
produces the desired result. In other words, you want to optimize your process in terms
of the outcome of interest. For example, you may want to find the best combination of
temperature and pressure for a chemical process, or you may want to identify the ideal
soak time and developer concentration for a photographic development process.
This is typically done by calculating a model of the response based on the factors of
interest. The shape of the surface is examined in order to identify the point of
maximum response (or minimum response for minimization problems). Such a model
is called a response surface, and experimental designs for finding such models are
called response surface designs.

In many cases, the response surface must be considered in parts because when you
consider all of the possible values for the factors involved, the surface can be quite
complex. Because of this complexity, it is often not possible to build a mathematical
model that truly reflects the shape of the surface. Fortunately, when you look at
restricted portions of the response surface, they can usually be modeled successfully
with relatively simple equations.

To take advantage of this, experimenters often use a two-stage approach to modeling
response surfaces. In the first stage, a neighborhood in the space defined by the
factors is chosen and a simple linear model is constructed. If the linear model fits the
data in that neighborhood, the model is used to find a direction of steepest ascent (or
descent for minimization problems). The factor limits that define the neighborhood are
then adjusted in the appropriate direction, defining a new neighborhood, and another
linear design is used. This continues until the simple design no longer fits the data in
that region. Then a more complex model is calculated, and an estimate of the maximum
(or minimum) response point can be found. (Occasionally, it may happen that the
surface is linear up to the boundary of your factor space, in which case you simply use
the linear model to choose the boundary point that maximizes your response.)
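The first stage can be sketched for a two-level, two-factor design in coded units. Because the design columns are orthogonal, each least-squares coefficient is simply the average of the response weighted by the coded level, and the direction of steepest ascent is proportional to the vector of slope coefficients. The responses below are hypothetical, and fit_first_order is an illustrative helper, not a SYSTAT function.

```python
import math

# Coded 2^2 factorial runs (x1, x2, y) with hypothetical responses y.
runs = [(-1, -1, 54.0), (1, -1, 60.0), (-1, 1, 64.0), (1, 1, 70.0)]

def fit_first_order(runs):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 for an orthogonal +/-1 design."""
    n = len(runs)
    b0 = sum(y for _, _, y in runs) / n
    b1 = sum(x1 * y for x1, _, y in runs) / n
    b2 = sum(x2 * y for _, x2, y in runs) / n
    return b0, b1, b2

b0, b1, b2 = fit_first_order(runs)

# Unit vector pointing in the direction of steepest ascent (coded units).
length = math.hypot(b1, b2)
direction = (b1 / length, b2 / length)
print(b0, b1, b2, direction)
```

The next neighborhood would be centered further along this direction. For a minimization problem the sign of the direction is reversed.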

Variance of Estimates and Rotatability


In most cases, the purpose of building a mathematical model of a process is to make
predictions about what would happen given a particular set of conditions that you have
not measured directly. This is particularly true in the case of a response surface
experiment: the surface you calculate is essentially a set of predictions for all possible
combinations within the limits of your factor measurements. With an adequate model
and careful measurements, you can usually do a reasonably good job of predicting
response throughout the response surface neighborhood of interest.
When you make such predictions, however, you must accept the fact that the model
is not perfect: there are often imperfections in your measurements, and the
mathematical model almost never fits the true response function exactly. Thus, if you
were to conduct the experiment repeatedly, you would get slightly different answers
each time. The degree to which your predictions are expected to differ across multiple
experiments is known as the variance of prediction, or $V(\hat{y})$. The value of $V(\hat{y})$
depends on the design used and on where in the factor space you are calculating a
prediction. $V(\hat{y})$ increases as you get further from the observed data points. Of course,
you would like the portion of the design that produces the most precise predictions to
be near the optimum that you are trying to locate. Unfortunately, you usually don't
really know where the optimal value is (or in what direction it lies) when you start.
To deal with the fact that you don't know exactly where the optimum is, you can
use designs in which the variance of prediction depends only on the distance from the
center of the design, not on the direction from the center. Such designs are called
rotatable designs. First-order (linear) orthogonal models are always rotatable. Some
central composite designs are rotatable. (In SYSTAT, the distance from the center is
automatically chosen to ensure rotatability for unblocked designs. However, for
blocked designs, the distance is chosen to ensure orthogonality of blocks, which may
lead to nonrotatable designs.) In addition, some Box-Behnken designs are rotatable,
and most are nearly rotatable (meaning that directional differences in prediction
variance exist, but they are small). In general, three-level factorial response surface
designs are not rotatable. This means that care should be used before employing such
a design: the precision of your predictions may depend on the direction in which the
optimum lies.
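
The effect of rotatability can be checked numerically. The following is a minimal sketch, not SYSTAT output: it assumes a two-factor central composite design with one center point, a full quadratic model, and prediction variance expressed in units of the error variance as f(x)'(X'X)^-1 f(x). The function names and the choice of radius are illustrative assumptions. With an axial distance of sqrt(2) the variance is the same in every direction; with an axial distance of 1 (a three-level factorial) it is not.

```python
import numpy as np

def quadratic_terms(x1, x2):
    """Model row for a full quadratic model in two factors."""
    return np.array([1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2])

def central_composite(alpha, n_center=1):
    """Two-factor design: 4 factorial runs, 4 axial runs, n_center center runs."""
    factorial = [(-1, -1), (1, -1), (-1, 1), (1, 1)]
    axial = [(-alpha, 0), (alpha, 0), (0, -alpha), (0, alpha)]
    center = [(0, 0)] * n_center
    return np.array(factorial + axial + center, dtype=float)

def prediction_variance(design, point):
    """Relative variance of prediction f(x)' (X'X)^-1 f(x) at one point."""
    X = np.array([quadratic_terms(*run) for run in design])
    f = quadratic_terms(*point)
    return float(f @ np.linalg.solve(X.T @ X, f))

def variances_on_circle(design, radius, angles_deg):
    """Prediction variance at several directions, all at the same distance."""
    return [
        prediction_variance(
            design,
            (radius * np.cos(np.radians(a)), radius * np.sin(np.radians(a))),
        )
        for a in angles_deg
    ]

angles = [0, 30, 45, 90]
rotatable = variances_on_circle(central_composite(alpha=np.sqrt(2.0)), 1.0, angles)
three_level = variances_on_circle(central_composite(alpha=1.0), 1.0, angles)
print("alpha = sqrt(2):", np.round(rotatable, 4))
print("alpha = 1      :", np.round(three_level, 4))
```

The first row of output is constant across angles, while the second varies with direction.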

Response Surface Design Types


Two types of response surface designs are available:
Central composite. These designs combine a 2^k factorial design (or a fraction thereof)
with a set of 2k axial points and one or more center points, which allow quadratic
surfaces to be modeled. These designs are efficient, requiring fewer runs than the
corresponding full factorial design. However, they require each factor to be measured
at five different levels.
Box-Behnken. These are second-order designs (which allow estimation of quadratic
effects) based on combining a two-level factorial design with an incomplete block
design. For these designs, factors need to be measured at only three levels.
Box-Behnken designs are also quite efficient in requiring relatively few runs.

Analysis of Response Surface Designs


Response surface designs are analyzed with either a linear or a quadratic model,
depending on the purpose of the design. If the purpose is hill-climbing, a linear model
is usually adequate. If the purpose is to locate the optimum, then a quadratic model is
needed.
The linear model takes the form

$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$
where k is the number of factors in the design. Similarly, the quadratic model is
expressed as

$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \beta_{11} x_1^2 + \beta_{12} x_1 x_2 + \cdots + \beta_{(k-1)k} x_{k-1} x_k + \beta_{kk} x_k^2$

In either case, the estimated equation defines the response surface. This surface is often
plotted, either as a 3-D surface plot or a 2-D contour plot, to help the investigator
visualize the shape of the response surface.

Mixture Designs
Suppose that you are trying to determine the best blend of ingredients or components
for your product. Initially, this appears to be a problem that can be addressed with a
straightforward response surface design. However, upon closer examination, you
discover that there is an additional consideration in this problem: the amounts of the
ingredients are inextricably linked with each other. For example, suppose that you are
trying to determine the best combination of pineapple and orange juices for a fruit
punch. Increasing the amount of orange juice means that the amount of pineapple juice
must be decreased, relative to the whole. (Of course, you could add more pineapple
juice as you add more orange juice, but this would simply increase the total amount of
punch. It would not alter the fundamental quality of the punch.) The problem is shown
in the following plot.

[Plot: Pineapple Juice (PJ) on the vertical axis versus Orange Juice (OJ) on the horizontal axis, each from 0.0 to 1.0.]

By specifying that the components are ingredients in a mixture, you limit the values
that the amounts of the components can take. All of the points corresponding to
one-gallon blends lie on the line shown in the plot. You can describe the constraint with the
equation OJ + PJ = 1 gallon.
Now, suppose that you decide to add a third type of juice, watermelon juice, to the
blend. Of course, you still want the total amount of juice to be one gallon, but with
three factors you have a bit more flexibility in the mixtures. For example, if you want
to increase the amount of orange juice, you can decrease the amount of pineapple juice,
the amount of watermelon juice, or both. The constraint now becomes OJ + PJ + WJ
= 1 gallon. The combinations of juice amounts that satisfy this constraint lie in a
triangular plane within the unconstrained factor space.

[Plot: three-dimensional factor space with axes Orange Juice (OJ), Pineapple Juice (PJ), and Watermelon Juice (WJ), each from 0.0 to 1.0.]

The feasible values for a mixture comprise a (k - 1)-dimensional region within the
k-dimensional factor space (indicated by the shaded triangle). This region is called a
simplex. The pure mixtures (made of only one component) are at the corners of the
simplex, and binary mixtures (mixtures of only two components) are along the edges.
The concept of the mixture simplex extends to higher-dimensional problems as well:
the simplex for a four-component problem is a three-dimensional regular tetrahedron
and so on.
To generalize, you measure component amounts as proportions of the whole rather
than as absolute amounts. When you take this approach, it is clear that increasing the
proportion of one ingredient necessarily decreases the proportion(s) of one or more of
the others. There is a constraint that the sum of the ingredient proportions must equal
the whole. In the case of proportions, the whole would be denoted by 1.0, and the
constraint is expressed as

$x_1 + x_2 + \cdots + x_k = 1.0$
where x1, ..., xk are the proportions of each of the k components in the mixture.
Because of this constraint, such problems require a special approach. This approach
includes using a special class of experimental designs, called mixture designs. These
designs take into account the fact that the component amounts must sum to 1.0.

Unconstrained Mixture Designs


Unconstrained mixture designs allow factor levels to vary from the minimum to the
maximum value for the mixture. Four unconstrained designs are available. See Cornell
(1990) for more information on each.
Lattice. Lattice designs allow you to specify the number of levels or the number of
values that each component (factor) assumes, including 0 and 1. The selection of levels
has no effect for the other three types of designs available because the number of
factors determines the number of levels for each of them. As Cornell (1990) points out,
the vast majority of mixture research employs lattice models; however, the other three
types included here are useful in specific situations.
Centroid. Centroid designs consist of every (non-empty) subset of the components, but
only with mixtures in which the components appear in equal proportions. Thus, if we
asked for a centroid design with four factors (components), the mixtures in the model
would consist of all permutations of the set (1,0,0,0), all permutations of the set
(1/2,1/2,0,0), all permutations of the set (1/3,1/3,1/3,0), and the set (1/4,1/4,1/4,1/4).

Thus, the number of distinct points is 1 less than 2 raised to the q power, where q is the
number of components. Centroid designs are useful for investigating mixtures where
incomplete mixtures (with at least one component absent) are of primary importance.
Axial. In an axial design with m components, each run consists of at least (m - 1) equal
proportions of the components. These designs include: mixtures composed of one
component; mixtures composed of (m - 1) components in equal proportions; and
mixtures with equal proportions of all components. Thus, if we asked for an axial
design with four factors (components), the mixtures in the model would consist of all
permutations of the set (1,0,0,0), all permutations of the set (5/8,1/8,1/8,1/8), all
permutations of the set (0,1/3,1/3,1/3), and the set (1/4,1/4,1/4,1/4).
Screen. Screening designs are reduced axial designs, omitting the mixtures that contain
all but one of the components. Thus, if we asked for a screening design with four factors
(components), the mixtures in the model would consist of all permutations of the set
(1,0,0,0), all permutations of the set (5/8,1/8,1/8,1/8), and the set (1/4,1/4,1/4,1/4).
Screening designs enable you to single out unimportant components from an array of
many potential components.
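
The point sets described above for centroid, axial, and screening designs can be enumerated directly. The sketch below is illustrative only and is not SYSTAT's implementation. It assumes that the interior axial points lie halfway between a pure component and the overall centroid, which reproduces the (5/8,1/8,1/8,1/8) blends quoted for four components. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def centroid_design(q):
    """Every non-empty subset of q components, blended in equal proportions."""
    points = []
    for size in range(1, q + 1):
        for subset in combinations(range(q), size):
            points.append(
                tuple(Fraction(1, size) if i in subset else Fraction(0) for i in range(q))
            )
    return points

def screening_design(q):
    """Pure components, interior axial points, and the overall centroid."""
    center = tuple(Fraction(1, q) for _ in range(q))
    vertices = [tuple(Fraction(int(i == j)) for i in range(q)) for j in range(q)]
    # Assumption: an interior axial point is the midpoint of a vertex and the centroid.
    interior = [tuple((v + c) / 2 for v, c in zip(vertex, center)) for vertex in vertices]
    return vertices + interior + [center]

def axial_design(q):
    """Screening design plus the blends in which exactly one component is absent."""
    absent_one = [
        tuple(Fraction(0) if i == j else Fraction(1, q - 1) for i in range(q))
        for j in range(q)
    ]
    return screening_design(q) + absent_one

for name, design in [("centroid", centroid_design(4)),
                     ("axial", axial_design(4)),
                     ("screening", screening_design(4))]:
    print(name, len(design), "distinct blends")
```

For four components, the centroid design contains 2^4 - 1 = 15 blends, in agreement with the count given above.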

Constrained Mixture Designs


You can also consider mixture problems with additional constraints on the mixture
values. For example, suppose that orange juice is much cheaper than the other kinds of
juice, and you therefore decide that your punch must contain at least 50% orange juice.
However, you also want to make sure that your punch is sufficiently distinct from pure
orange juice, so you place another restriction: that orange juice can make up no more
than 75% of the punch. These criteria place additional constraints on your mixture,
specifically, 0.5 ≤ OJ ≤ 0.75. This restricts the range of feasible solutions in the
simplex, as shown below by the outlined area.

[Plot: triangular mixture simplex with axes OJ, WJ, and PJ, each scaled from 0.0 to 1.0.]
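
To see how bounds on one component shrink the feasible region, the following sketch enumerates a simplex lattice for three juices and keeps only the blends with 0.5 ≤ OJ ≤ 0.75. It is an illustration under stated assumptions, not the algorithm SYSTAT uses for constrained mixture designs: the lattice spacing of 1/4 (five levels per component) and the function names are chosen purely for the example.

```python
from fractions import Fraction
from itertools import product

def simplex_lattice(q, m):
    """All blends of q components whose proportions are multiples of 1/m and sum to 1."""
    return [
        tuple(Fraction(c, m) for c in counts)
        for counts in product(range(m + 1), repeat=q)
        if sum(counts) == m
    ]

def within_bounds(points, component, lower, upper):
    """Keep the blends whose chosen component lies between lower and upper."""
    return [p for p in points if lower <= p[component] <= upper]

# Components are ordered (OJ, PJ, WJ); m = 4 gives the levels 0, 1/4, 1/2, 3/4, 1.
blends = simplex_lattice(q=3, m=4)
feasible = within_bounds(blends, 0, Fraction(1, 2), Fraction(3, 4))

print(len(blends), "lattice blends,", len(feasible), "satisfy 0.5 <= OJ <= 0.75")
for oj, pj, wj in feasible:
    print(f"OJ={float(oj):.2f}  PJ={float(pj):.2f}  WJ={float(wj):.2f}")
```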

Analysis of Mixture Designs


In mixture experiments, you are usually trying to find an optimal mixture, according
to some criterion. In this sense, mixture models are related to response surface models.
However, the constraint on the sum of the component values takes away one degree of
freedom from the model. This can be accommodated by reparameterizing the linear
model so that there is no intercept term. (This is also known as a Scheffé model.) Thus,
the linear model is specified as

$y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$
and the quadratic form is

$y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \beta_{12} x_1 x_2 + \cdots + \beta_{(k-1)k} x_{k-1} x_k$
for mixtures with k components. Notice that the quadratic form does not include
squared terms. Such terms would be redundant, since the square of a component can
be reexpressed as a function of the linear and cross-product terms. For example,

$x_1^2 = x_1 (1 - x_2 - \cdots - x_k) = x_1 - x_1 x_2 - \cdots - x_1 x_k$

The model is estimated using standard general linear modeling techniques. The
parameters can be tested (with a sufficient number of observations), and they can be
used to define the response function. The plot of this function can give visual insights
into the process under investigation and allow you to select the optimal combination
of components for your mixture.
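
As a rough illustration of estimating a no-intercept quadratic mixture model, the sketch below fits it by ordinary least squares with NumPy. The design is a three-component centroid design, and the coefficient values are hypothetical numbers used only to simulate a noise-free response; nothing here reproduces SYSTAT's GLM procedure. The last check confirms numerically that a squared term is redundant with the linear and cross-product terms.

```python
import numpy as np
from itertools import combinations

def scheffe_quadratic_matrix(blends):
    """Columns x1..xk followed by all cross products xi*xj (i < j); no intercept."""
    blends = np.asarray(blends, dtype=float)
    k = blends.shape[1]
    cross = [blends[:, i] * blends[:, j] for i, j in combinations(range(k), 2)]
    return np.column_stack([blends] + cross)

# Three-component centroid design: pure blends, 50/50 blends, overall centroid.
blends = np.array([
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [0.5, 0.5, 0], [0.5, 0, 0.5], [0, 0.5, 0.5],
    [1 / 3, 1 / 3, 1 / 3],
])

# Hypothetical coefficients (b1, b2, b3, b12, b13, b23) for a simulated response.
true_beta = np.array([4.0, 6.0, 5.0, 2.0, -1.0, 3.0])
X = scheffe_quadratic_matrix(blends)
y = X @ true_beta

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", np.round(beta_hat, 6))

# Redundancy of the squared term: x1^2 = x1 - x1*x2 - x1*x3 on the simplex.
x1, x2, x3 = blends.T
print("x1^2 identity holds:", np.allclose(x1 ** 2, x1 - x1 * x2 - x1 * x3))
```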

Optimal Designs
In going through the process of designing experiments, you might ask yourself, "What
is the advantage of a designed experiment over a more haphazard approach to
collecting data?" The answer is that a carefully designed experiment will allow you to
estimate the specific model you have in mind for the process, and it will allow you to
do so efficiently. Efficiently in this context means that the model can be estimated with
high (or at least adequate) precision, with a manageable number of runs.
Through the years, statisticians have identified useful classes of research problems
and developed efficient experimental designs for each. Such classes of problems
include identifying important effects within a set of two-level factors (Box-Hunter
designs), optimizing a process using a quadratic surface (central composite or
Box-Behnken designs), or optimizing a mixture process (mixture designs).
One of the standard designs may be appropriate for your research needs.
Sometimes, however, your research problem doesn't quite fit into the mold of these
standard designs. Perhaps you have specific ideas about which terms you want to
include in your model, or perhaps you can't afford the number of runs called for in the
standard design. The standard design's efficiency is based on assumptions about the
model to be estimated, the number of runs to be collected, and so on. When you try to
run experiments that violate these assumptions, you lose some of the efficiency of the
design.
You may now be asking yourself, "Well, then, how do I find a design for my
idiosyncratic experiment? Isn't there a way that I can specify exactly what I want and
get an efficient design to test it?" The answer is yes: this is where the techniques of
optimal experimental design (often abbreviated to optimal design) come in. Optimal
design methods allow you to specify your model exactly (including number of runs)
and to choose a criterion for measuring efficiency. The design problem is then solved
by mathematical programming to find a design that maximizes the efficiency of the
design, given your specifications. The use of the word "optimal" to describe designs
generated in this manner means that we are optimizing the design for maximum
efficiency relative to the desired efficiency criterion.


Optimization Methods
First, you need to choose an optimization method. Different mathematical methods
(algorithms) are available for finding the design that optimizes the efficiency criterion.
Some of these methods require a candidate set of design points from which to choose
the points for the optimal design. Other methods do not require such a candidate set.
Three optimization methods are available:
Fedorov method. This method requires a predefined candidate set. It starts with an initial
design, and at each step it identifies a pair of points, one from the design and one from
the candidate set, to be exchanged. That is, the candidate point replaces the selected
design point to form a new design. The pair exchanged is the pair that shows the
greatest reduction in the optimality criterion when exchanged. This process repeats
until the algorithm converges.
K-exchange method. This method starts with a set of candidate points and an initial
design and exchanges the worst k points at each iteration in order to minimize the
objective function. Candidate points must come from a previously generated design.
Coordinate exchange method. This method does not require a candidate set. It starts
with an initial design based on a random starting point. At each iteration, k design
points are identified for exchange, and the coordinates of these points are adjusted one
by one to minimize the objective function. The fact that this method does not require a
candidate set makes it useful for problems with large factor spaces. Another advantage
of this method is that one can use either continuous or categorical variables, or a
mixture of both, in the model.
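
The exchange idea can be sketched in a few lines. What follows is a simplified, Fedorov-style exchange written for illustration; it is not SYSTAT's algorithm, and the function names, the seed, and the stopping tolerance are assumptions. The candidate set is the 2^3 factorial with levels coded 0 and 1, the model contains an intercept plus A, B, C, A*B, and A*C, and the search maximizes det(X'X), which is equivalent to minimizing the determinant of the dispersion matrix.

```python
import numpy as np
from itertools import product

def model_matrix(points):
    """Columns for the intercept, A, B, C, A*B, and A*C."""
    p = np.asarray(points, dtype=float)
    a, b, c = p[:, 0], p[:, 1], p[:, 2]
    return np.column_stack([np.ones(len(p)), a, b, c, a * b, a * c])

def information_det(points):
    """det(X'X); larger values correspond to a smaller generalized variance."""
    X = model_matrix(points)
    return float(np.linalg.det(X.T @ X))

def exchange(candidates, n_runs, seed=0, tol=1e-9):
    """Repeatedly apply the single design/candidate swap that most improves det(X'X)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(candidates), size=n_runs, replace=False)
    design = [candidates[i] for i in start]
    best = information_det(design)
    improved = True
    while improved:
        improved = False
        best_swap = None
        for i in range(n_runs):
            for cand in candidates:
                trial = design[:i] + [cand] + design[i + 1:]
                value = information_det(trial)
                if value > best + tol:
                    best, best_swap = value, (i, cand)
        if best_swap is not None:
            i, cand = best_swap
            design[i] = cand
            improved = True
    return design, best

candidates = list(product([0, 1], repeat=3))
design, det_value = exchange(candidates, n_runs=6)
print("det(X'X) =", round(det_value, 6))
for run in design:
    print(run)
```

Because the search is greedy and starts from a random design, different seeds can end at different designs, which is why several starting designs are normally tried.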

For the designs that require a candidate set, that set must be defined before you
generate your optimal design. The set of points must be in a file that was generated and
saved by the Design Wizard. You may eliminate undesirable rows before using the file
in the Fedorov or k-exchange method to generate an optimal design based on the
candidate design. The same requirements hold for any so-called starting design in a file
that is submitted by the user.
It is important to remember that these methods are iterative, based on starting
designs with a random component to them. Therefore, they will not always converge
on a design that is absolutely optimal: they may fall into a local minimum or saddle
point, or they may simply fail to converge within the allowed number of iterations.
That is why each method allows you to generate a design multiple times based on
different starting designs.


Efficiency Criteria for Optimal Designs


You may have noticed that no explicit mathematical definition of efficiency was given
in the discussion above. This is because there are several different ways of defining and
measuring efficiency of designs. Because the object of optimal design is to minimize
a specific efficiency criterion, the values used to measure efficiency are also called
optimality criteria in this context. You can choose from three optimality criteria:
D-optimality. This criterion measures the generalized variance of the parameter
estimates in the model. The generalized variance is the determinant of the parameter
dispersion matrix: $D = |(X'X)^{-1}|$, where X is the design matrix. The square root of
this value is proportional to the volume of the confidence ellipsoid about the parameter
estimates. The design is generated to minimize D. (The D stands for determinant.)
A-optimality. This criterion measures the average (or, equivalently, the sum) of the
variances for the parameter estimates. Minimizing this criterion, measured as the trace
of the parameter dispersion matrix, $A = \mathrm{trace}[(X'X)^{-1}]$, yields the design with the
smallest average variance for the parameter estimates. The design is generated to
minimize A. (The A stands for average.)
G-optimality. This criterion focuses on the variance of predicted response values rather
than the variance of the parameter estimates. The variance of predictions varies across
the factor space (that is, as different levels of the factors are examined). This criterion
specifically measures the maximum variance of prediction within the factor space, and
seeks to minimize this maximum value, $G = \max_x v(x)$, where v(x) is the variance of
the prediction at design point x. (The G stands for global.)

In most circumstances, these methods will give similar results. Using G-optimality can
take more time to compute, since each iteration involves both maximization and
minimization. In many situations, D-optimality will be a good choice because it is fast
and invariant to linear transformations. A-optimality is especially sensitive to the scale
of continuous factors, such that a design with factors having very different scales may
lead to problems generating a design.
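
The three criteria can be computed for a small design as follows. This is a sketch under stated assumptions rather than SYSTAT's computation: the design is a 2^2 factorial coded -1 and +1 with a first-order model, the prediction variance is taken in units of the error variance as f(x)'(X'X)^-1 f(x), and the maximum for G is searched over a coarse grid of the factor space. All names are illustrative.

```python
import numpy as np
from itertools import product

def model_matrix(points):
    """Intercept plus one linear column per factor."""
    p = np.asarray(points, dtype=float)
    return np.column_stack([np.ones(len(p)), p])

def optimality_criteria(design, region):
    """Return the D, A, and G values for a design, with G searched over region."""
    X = model_matrix(design)
    dispersion = np.linalg.inv(X.T @ X)
    F = model_matrix(region)
    # Row-wise quadratic forms f(x)' (X'X)^-1 f(x), one per point in the region.
    pred_var = np.einsum("ij,jk,ik->i", F, dispersion, F)
    return {
        "D": float(np.linalg.det(dispersion)),
        "A": float(np.trace(dispersion)),
        "G": float(pred_var.max()),
    }

design = list(product([-1, 1], repeat=2))
grid = list(product(np.linspace(-1, 1, 5), repeat=2))
crit = optimality_criteria(design, grid)
print(crit)
```

For this design X'X is 4 times the identity matrix, so D = 1/64, A = 3/4, and the largest prediction variance, reached at the corners, is also 3/4.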

Analysis of Optimal Designs


Analysis of optimal designs closely parallels the analysis of other experimental
designs. The general linear model (GLM) is used to build an equation for the model
and estimate and test effects.


There is one important difference, however. For an optimal experiment, you specify
the model for the experiment before you generate the design. This is necessary to
ensure that the design is optimized for your particular model, rather than an assumed
model (such as a complete factorial or a full quadratic model). This means that for
optimal designs, the form of the equation to be estimated is an integral part of the
experimental design.
Let's consider a simple example: suppose that you have three two-level factors (call
them A, B, and C), and you want to perform tests of the following effects: A, B, C, AB,
and AC. You could use the usual 2^3 factorial design, which would give you the
following runs:
A  B  C
0  0  0
0  0  1
0  1  0
0  1  1
1  0  0
1  0  1
1  1  0
1  1  1

Now, suppose that you want to estimate the model in only six runs. There is no standard
design for this, so you must use an optimal design. Using the coordinate exchange
method with the D-optimality criterion yields the following design:
A  B  C
1  1  0
0  0  0
1  1  1
1  0  0
0  0  1
0  1  0


However, if we change the form of the model slightly, so that we are asking for the A,
B, C, AB, and BC effects, we get a slightly different design:
A  B  C
1  0  1
0  1  1
0  1  0
0  0  0
0  0  1
1  1  0

In general, a design generated based on one model will not be a good design for a
different model. The implication of this is that the model used to generate the design
places limits on the model that you estimate in your data analysis. In most cases, the
two models will be the same, although you may sometimes want to omit terms from
the analysis that were in the original model used to generate the design.

Choosing a Design
Deciding which design to use is an important part of the experimental design process.
The answer will depend on various aspects of the research problem at hand, such as:
- What type of knowledge do you want to gain from the research?
- How much background knowledge can you bring to bear on the question?
- How many factors are involved in the process of interest?
- How many different ways do you want to measure the outcome?
- What are the constraints, if any, on your factors?
- What is your criterion for the "best" design?
- Will you have to use the results of the experiment to convince others of your
  conclusions? What will they find convincing?
- What are the constraints on your research process in terms of time, money, human
  resources, and so forth?


Defining the Question


Successful research depends on how well the problem is formulated. It does no good
to run an elaborate, highly efficient experiment if it gives you the answer to the wrong
question. Spend some time and effort carefully considering your problem. Doing so
will help to ensure that your experimental design will give you the information you
need to solve your problem.

Identifying Candidates for Factors and Responses


In most cases, it is most efficient to focus on only the important factors and ignore the
inconsequential ones. However, you shouldn't be too eager to eliminate factors from
your experiment. Leaving out even one crucial factor can seriously hinder your ability
to find true effects and can lead to highly misleading results. If there is any doubt about
a factor, it is usually best to include it. Once you have empirical confirmation that its
effect really is negligible, you can delete it from subsequent models.
If there is not much background knowledge available to help in your factor
selection, you should consider employing a screening design. These designs allow you
to test for main effects with a small number of runs. Such designs allow you to examine
a large number of candidate factors without exhausting your resources. Once you have
identified a set of interesting factors, you can move on to a fuller design to test for more
complex effects.

Setting Priorities
Consider what is really important in your study. Do you need the highest precision
possible, regardless of what it takes? Or are you more concerned about controlling
costs, even if it means settling for an approximate model? Would the cost of
overlooking an effect be greater than the cost of including the effect in your model?
Giving careful thought to questions like these will help you choose a design that
satisfies your criteria and helps you accomplish your research goals.


Design of Experiments in SYSTAT


Design of Experiments Wizard
To access the Design of Experiments Wizard, from the menus choose:
Statistics
Design of Experiments
Wizard

The Design of Experiments Wizard offers nine different design types: General
Factorial, Box-Hunter, Latin Square, Taguchi, Plackett-Burman, Box-Behnken,
Central Composite, Optimal, and Mixture Model. After selecting a design type, a series
of dialogs prompts for design specifications before generating a final design matrix.
These specifications typically include the number of factors involved, as well as the
number of levels for each factor.
Replications. For any design created by the Design Wizard, replications can be saved
to a file. By default, SYSTAT saves the design without replications. If you request n
copies of a design, the complete design will be repeated n times in the saved file (global
replication). If local replications are desired, simply sort the saved file on the variable
named RUN to group replications by run number. Replications do not appear on the
output screen.

Note: It is not necessary to have a data file open to use Design of Experiments.


Classic Design of Experiments


To access the classic Design of Experiments dialog box, from the menus choose:
Statistics
Design of Experiments
Classic

Classic DOE offers a subset of the designs available using the Design Wizard,
including factorial, Box-Hunter, Latin Square, Taguchi, Plackett, Box-Behnken, and
mixture designs. In contrast to the wizard, classic DOE uses a single dialog to define
all design settings. The following options are available:
Levels. For factorial, Latin, and mixture designs, this is the number of levels for the
factors. Factorial designs are limited to either two or three levels per factor.
Factors. For factorial, Box-Hunter, Box-Behnken, and lattice mixture designs, this is the
number of factors, or independent variables.
Runs. For Plackett and Box-Hunter designs, this is the number of runs.
Replications. For all designs except Box-Behnken and mixture, this is the number of
replications.
Mixture type. For mixture designs, you can specify a mixture type from the drop-down
list. Select either Centroid, Lattice, Axial, or Screen.
Taguchi type. For Taguchi designs, you can select a Taguchi type from the drop-down
list.


Save file. This option saves the design to a file.


Print Options. The following two options are available:
- Use letters for labels. Labels the design factors with letters instead of numbers.
- Print Latin square. For Latin square designs, you can print the Latin square.

Design Options. The following two options are available:
- Randomize. Randomizes the order of experimentation.
- Include blocking factor. For Box-Behnken designs, you can include a blocking factor.

Using Commands
With commands:
DESIGN
SAVE filename
FACTORIAL / FACTORS=n REPS=n LETTERS RAND,
LEVELS = 2 or 3
BOXHUNTER / FACTORS=n RUNS=n REPS=n LETTERS RAND
LATIN / LEVELS=n SQUARE REPS=n LETTERS RAND
TAGUCHI / TYPE=design REPS=n LETTERS RAND
PLACKETT / RUNS=n REPS=n LETTERS RAND
BOXBEHNKEN / FACTORS=n BLOCK LETTERS RAND
MIXTURE / TYPE=LATTICE or CENTROID or AXIAL
or SCREEN,
FACTORS=n LEVELS=n RAND LETTERS

Note: Some designs generated by the Design Wizard cannot be created using
commands.

Usage Considerations
Types of data. No data file is needed to use Design of Experiments.
Print options. For Box-Hunter designs, using PRINT=LONG in Classic DOE yields a
listing of the generators (confounded effects) for the design. For Taguchi designs, a
table defining the interaction is available.
Quick Graphs. No Quick Graphs are produced.
Saving files. The design can be saved to a file.


BY groups. Analysis by groups is not available.


Bootstrapping. Bootstrapping is not available in this procedure.
Case weights. Case weighting is not available in Design of Experiments.

Examples
Example 1
Full Factorial Designs
The DOE Wizard input for a (2 x 2 x 2) design is:
Wizard Prompt                                              Response
Design Type                                                General Factorial
Choose a type of design:                                   Full Factorial Design
Divide the design into incomplete blocks?                  No
Enter the number of factors desired:                       3
Is the number of levels to be the same for all factors?    Yes
Enter number of levels:                                    2
Display the factors for this design?                       Yes
Save the design to a file?                                 No

The output is:


Factorial Design: 3 Factors, 8 Runs
Factor
RUN


To generate this design using commands, the input is:


DESIGN
FACTORIAL / FACTORS=3 LEVELS=2

Example 2
Fractional Factorial Design
The DOE Wizard input for a (2 x 2 x 2 x 2) fractional factorial design in which the
two-way interactions A*B and A*C must be estimable is:
Wizard Prompt                                                 Response
Design Type                                                   General Factorial
Choose a type of design:                                      Fractional Factorial Design
Divide the design into incomplete blocks?                     No
Enter the number of factors desired:                          4
Is the number of levels to be the same for all factors?       Yes
Enter number of levels:                                       2
Please choose:                                                Automatically find the smallest
                                                              design consistent with my criteria
Choose a Search Criterion                                     Require that specific effects be
                                                              estimable
May main effects be confounded with 2-factor interactions?    Yes
Are there any specific effects to be estimated other than     Yes
  the effects already cited?
List them by using the appropriate factor letters             A*B
  separated by asterisks for interactions.                    A*C
Are there any effects that are not to be estimated, but yet   Yes
  should not be confounded with effects that are to be
  estimated?
List them by using the appropriate factor letters             A*D
  separated by asterisks for interactions.
Display the factors for this design?                          Yes
Save the design to a file?                                    No
Display another fraction of this design?                      No
Find another design with same parameters?                     No


The output is:


Complete Defining Relation
Identity = B * C * D
The design resolution is 3
Design Generators
Identity = B * C * D
Fractional Factorial Design: 4 Factors, 8 Runs

Factor
Run

SYSTAT assumes that the main effects of any design should always be estimated.
Notice, however, that the defining relation avoids confounding the interaction of A
with any of the other factors, as requested by specifying the effects to be estimated
(A*B, A*C) and effects that should not be confounded even though they are not to be
estimated (A*D).


Example 3
Box-Hunter Fractional Factorial Design
To generate a (2 x 2 x 2) Box-Hunter fractional factorial, the input is:
Wizard Prompt                                              Response
Design Type                                                Box-Hunter
Enter the number of factors desired:                       3
Enter the total number of cells for the entire design:     4
Display the factors for this design?                       Yes
Save the design to a file?                                 No

The resulting output is:


Complete Defining Relation
Identity = A * B * C
The design resolution is 3
Design Generators
Identity = A * B * C
Box-Hunter Design: 4 Runs, 3 Factors

Factor
RUN

-1 -1

-1

1 -1

1 -1 -1

To generate this design using commands, enter the following:


DESIGN
BOXHUNTER / FACTORS=3


Aliases
For 7 two-level factors, the number of cells (runs) for a complete factorial is 2^7 = 128.
The following example shows the smallest fractional factorial for estimating main
effects. The design codes for the first three factors generate the last four. The input is:
Wizard Prompt                                              Response
Design Type                                                Box-Hunter
Enter the number of factors desired:                       7
Enter the total number of cells for the entire design:     8
Display the factors for this design?                       Yes
Save the design to a file?                                 No

The output is:

Complete Defining Relation
Identity =
A * B * D =
A * C * E =
B * C * D * E =
B * C * F =
A * C * D * F =
A * B * E * F =
D * E * F =
A * B * C * G =
C * D * G =
B * E * G =
A * D * E * G =
A * F * G =
B * D * F * G =
C * E * F * G =
A * B * C * D * E * F * G
The design resolution is 3
Design Generators
Identity =
A * B * D =
A * C * E =
B * C * F =
A * B * C * G
Box-Hunter Design: 8 Runs, 7 Factors

Factor
RUN

-1 -1 -1

1 -1

-1 -1

1 -1 -1

-1

1 -1 -1

-1

1 -1

1 -1 -1

1
1

1 -1

1 -1 -1 -1 -1

1 -1

1 -1

1 -1 -1 -1

1 -1
1

1 -1 -1
1

The main effect for factor D is confounded with the interaction between factors A and
B; the main effect for factor E is confounded with the interaction between factors A and
C; and so on.
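
The way the first three factors generate the last four can be reproduced from the design generators listed above (D = A*B, E = A*C, F = B*C, G = A*B*C). The sketch below is illustrative only: the run order and the -1/+1 coding are assumptions and need not match the order in which SYSTAT lists the runs.

```python
import numpy as np
from itertools import product

# Full 2^3 factorial in the first three factors, coded -1 and +1.
base = np.array(list(product([-1, 1], repeat=3)))
A, B, C = base[:, 0], base[:, 1], base[:, 2]

# Remaining factors are products of the first three, following the generators.
D, E, F, G = A * B, A * C, B * C, A * B * C
design = np.column_stack([A, B, C, D, E, F, G])

print(design)
# Main-effect columns are mutually orthogonal: X'X is 8 times the identity.
print(design.T @ design)
```

Because the column for D is identical to the A*B product column, the main effect of D cannot be separated from the A*B interaction in this design.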

Example 4
Latin Squares
To generate a Latin square when each factor has four levels, enter the following DOE
Wizard responses:
Wizard Prompt                                              Response
Design Type                                                Latin Square
The types available are:                                   Ordinary Latin Square
Number of levels:                                          4
Randomize the design?                                      No
Display the square?                                        Yes
Display the factors for this design?                       Yes
Save the design to a file?                                 No


The output is:


Latin Square: 4 levels.
A B C D
B C D A
C D A B
D A B C
Latin Square Design: 4 Levels.

Factor
RUN

10

11

12

13

14

15

16

To generate this design using commands, enter the following:


DESIGN
LATIN / LEVELS=4 SQUARE LETTERS

Omitting SQUARE prevents the Latin square from appearing in the output.
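
The unrandomized square shown above has a simple cyclic structure: each row is the previous row shifted one position to the left. A minimal sketch that builds such a square and checks the Latin property is given below. The function names are hypothetical, and the cyclic construction is an observation about the printed square, not a statement of how SYSTAT generates it.

```python
def cyclic_latin_square(n):
    """n x n square in which row r is the alphabet shifted left by r positions."""
    letters = [chr(ord("A") + i) for i in range(n)]
    return [[letters[(row + col) % n] for col in range(n)] for row in range(n)]

def is_latin(square):
    """True if every symbol appears exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return len(symbols) == n and rows_ok and cols_ok

square = cyclic_latin_square(4)
for row in square:
    print(" ".join(row))
print("Latin property holds:", is_latin(square))
```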


Permutations
To randomly assign the factors to the cells, the input is:
Wizard Prompt                                              Response
Design Type                                                Latin Square
The types available are:                                   Ordinary Latin Square
Number of levels:                                          4
Randomize the design?                                      Yes
Display the square?                                        Yes
Display the factors for this design?                       No
Save the design to a file?                                 No

The resulting output is:


Latin Square: 4 levels.
D C A B
C A B D
B D C A
A B D C

Using commands:
DESIGN
LATIN / LEVELS=4 SQUARE LETTERS RAND

Example 5
Taguchi Design
To obtain a Taguchi L12 design with 11 factors, the DOE Wizard input is:
Wizard Prompt                                              Response
Design Type                                                Taguchi
Taguchi Design Type:                                       L12
Display the factors for this design?                       Yes
Save the design to a file?                                 No
Display confounding matrix?                                No


The output is:


Taguchi L12 Design (12 Runs, 11 Factors, 2 Levels Each)
Factor
RUN

10

11

12

To generate this design using commands, enter the following:


DESIGN
TAGUCHI / TYPE=L12

Design L16 with 15 Two-Level Factors Plus Aliases


To obtain a Taguchi L16 design with 15 factors, the input is:
Wizard Prompt                                              Response
Design Type                                                Taguchi
Taguchi Design Type:                                       L16
Display the factors for this design?                       Yes
Save the design to a file?                                 No
Display confounding matrix?                                Yes


The output is:


Taguchi L16 Design (16 Runs, 15 Factors, 2 Levels Each)
[Table: levels of the 15 factors for runs 1-16]

Confoundings For Each Pairwise Interaction
(Note that partial confoundings do not appear.)
[Table: for each pair of factors, the number of the factor whose pattern matches their
interaction]


This design can also be generated by the following commands:


DESIGN
PRINT=LONG
TAGUCHI / TYPE=L16

The matrix of confoundings identifies the factor pattern associated with the interaction
between the row and column factors. For example, the factor pattern for the interaction
between factors 6 and 8 is identical to the pattern for factor 14 (N).
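
The 6, 8, 14 example can be checked numerically. The sketch below is not SYSTAT code; it assumes the conventional column numbering of the two-level L16 array, in which column j is the product of the basic columns selected by the binary digits of j. Under that assumption, the interaction of columns i and j falls on the column numbered i XOR j. The helper names and the -1/+1 coding are ours, and SYSTAT's run order and level labels may differ.

```python
def l16_column(j):
    """Column j (1..15) of a 16-run two-level array, coded -1/+1.

    The sign for a run is -1 raised to the number of binary digits that
    the column number and the run index (0..15) have in common.
    """
    return [(-1) ** bin(j & run).count("1") for run in range(16)]


def interaction_column(i, j):
    """Number of the column that carries the i-by-j interaction pattern."""
    return i ^ j


col6, col8 = l16_column(6), l16_column(8)
product = [a * b for a, b in zip(col6, col8)]

print(interaction_column(6, 8))      # column confounded with the 6 x 8 interaction
print(product == l16_column(14))     # the elementwise product reproduces that column
```

Because every pairwise interaction lands exactly on another column, a saturated L16 design with 15 factors cannot separate main effects from two-factor interactions.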

Example 6
Plackett-Burman Design
To generate a Plackett-Burman design consisting of 11 two-level factors, the DOE
Wizard input is:
Wizard Prompt                              Response

Design Type                                Plackett-Burman
Number of levels in design:                2
Runs per replication                       12
Display the factors for this design?       Yes
Save the design to a file?                 No

The output follows:


Plackett-Burman Design: 12 Runs, 11 Factors
[Table: levels of the 11 factors for runs 1-12]


To generate this design using commands, the input is:


DESIGN
PLACKETT / RUNS=12
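
One standard way to construct a 12-run Plackett-Burman design is to cycle a generating row of 11 signs and then append a row of all minus signs. The sketch below is not SYSTAT code; it assumes the generating row + + - + + + - - - + - given by Plackett and Burman (1946), and the run order and level coding that SYSTAT prints may differ.

```python
# Generating row for the 12-run design, coded +1 / -1.
GENERATOR = [1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1]


def plackett_burman_12():
    """Return a 12 x 11 two-level design as a list of rows coded -1/+1."""
    n = len(GENERATOR)
    # Each of the first 11 runs is the generator shifted cyclically to the right.
    rows = [GENERATOR[n - shift:] + GENERATOR[:n - shift] for shift in range(n)]
    rows.append([-1] * n)   # the final run sets every factor to its low level
    return rows


def column(design, j):
    """Extract column j of the design."""
    return [row[j] for row in design]


design = plackett_burman_12()
for run in design:
    print(" ".join(f"{level:+d}" for level in run))
```

Each column has six high and six low settings, and every pair of columns is orthogonal, which is what allows 11 main effects to be estimated from only 12 runs.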

Example 7
Box-Behnken Design
Each factor in this example has three levels. The DOE wizard input is:
Wizard Prompt                              Response

Design Type                                Box-Behnken
Number of Factors                          3
Display the factors for this design?       Yes
Save the design to a file?                 No

The output is:


Box-Behnken Design: 3 Factors, 15 Runs
[Table: coded levels (-1, 0, 1) of the three factors for runs 1-15]


To generate this design using commands, the input is:


DESIGN
BOXBEHNKEN / FACTORS=3
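
For three factors, a Box-Behnken design can be built by running each pair of factors through a 2 x 2 factorial while the remaining factor is held at its center level, and then adding center points. The sketch below is not SYSTAT code. It assumes three center points, which gives the 15 runs reported above, and the row order may differ from SYSTAT's output. The pairwise construction shown here is the one used for three factors; designs for larger numbers of factors can use other blocks.

```python
from itertools import combinations, product


def box_behnken(k, center_points):
    """Pairwise Box-Behnken construction with coded levels -1, 0, +1."""
    runs = []
    for i, j in combinations(range(k), 2):
        # 2 x 2 factorial in factors i and j; all other factors stay at 0.
        for a, b in product((-1, 1), repeat=2):
            run = [0] * k
            run[i], run[j] = a, b
            runs.append(run)
    runs.extend([0] * k for _ in range(center_points))
    return runs


design = box_behnken(3, center_points=3)
for number, run in enumerate(design, start=1):
    print(number, run)
```

No run sets all three factors to an extreme at once, so the design avoids the corners of the cube.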

Example 8
Mixture Design
We illustrate a lattice mixture design in which each of the three factors has five levels;
that is, each component of the mixture is 0%, 25%, 50%, 75%, or 100% of the mixture
for a given run, subject to the restriction that the sum of the percentages is 100. To
generate the design for this situation using the DOE Wizard, enter the following
responses at the corresponding prompt:
Wizard Prompt                                        Response

Design Type                                          Mixture Model
Are there to be constraints for any component(s)?    No
The possible kinds of unconstrained design are:      Lattice
Enter the number of mixture components:              3
Enter the number of levels for each component:       5
Display the factors for this design?                 Yes
Save the design to a file?                           No

The resulting mixture design follows:


Lattice Design: 3 Factors, 15 Runs, 5 Levels

           Component
 RUN       A       B       C
   1   1.000    .000    .000
   2    .000   1.000    .000
   3    .000    .000   1.000
   4    .750    .250    .000
   5    .750    .000    .250
   6    .000    .750    .250
   7    .500    .500    .000
   8    .500    .000    .500
   9    .000    .500    .500
  10    .250    .750    .000
  11    .250    .000    .750
  12    .000    .250    .750
  13    .500    .250    .250
  14    .250    .500    .250
  15    .250    .250    .500

To generate this design using commands, the input is:


DESIGN
MIXTURE / TYPE=LATTICE FACTORS=3,
          LEVELS=5

After collecting your data, you may want to display it in a triangular scatterplot.
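
The 15 runs are simply all the ways of dividing the mixture into quarters among three components. As a minimal sketch (plain Python, not SYSTAT code; the function name is ours), the lattice can be enumerated like this:

```python
from itertools import product


def simplex_lattice(components, levels):
    """All mixtures whose proportions are multiples of 1/(levels - 1) and sum to 1."""
    steps = levels - 1
    points = []
    for counts in product(range(steps + 1), repeat=components):
        if sum(counts) == steps:
            points.append(tuple(count / steps for count in counts))
    return points


lattice = simplex_lattice(components=3, levels=5)
for point in lattice:
    print("  ".join(f"{proportion:.3f}" for proportion in point))
```

The enumeration order differs from the SYSTAT listing, but the set of points is the same: three pure blends, nine two-component blends, and three blends that use all three components.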

Example 9
Mixture Design with Constraints
This example is adapted from an experiment reported in Cornell (1990, p. 265). The
problem concerns the mixture of three plasticizers in the production of vinyl for car
seats. We know that the combination of plasticizers must make up 79.5% of the
mixture. There are further constraints on each of the plasticizers:
32.5% <= P1 <= 67.5%
0% <= P2 <= 20.0%
12.0% <= P3 <= 21.8%
Because we are interested in only the plasticizers, we can model them separately from
the other components in the overall process. Taking this approach, we can
reparameterize the components by dividing by 79.5%, giving
0.409 <= A <= 0.849
0 <= B <= 0.252
0.151 <= C <= 0.274
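
The rescaled bounds follow directly from dividing each percentage by 79.5. A short check of the arithmetic (plain Python, not SYSTAT code; the variable names are ours):

```python
TOTAL_PLASTICIZER = 79.5   # percent of the full mixture made up by the plasticizers

# Lower and upper bounds on each plasticizer, in percent of the full mixture.
bounds_percent = {
    "P1": (32.5, 67.5),
    "P2": (0.0, 20.0),
    "P3": (12.0, 21.8),
}

# Express each bound as a proportion of the plasticizer blend alone.
bounds_scaled = {
    name: (round(low / TOTAL_PLASTICIZER, 3), round(high / TOTAL_PLASTICIZER, 3))
    for name, (low, high) in bounds_percent.items()
}

for name, (low, high) in bounds_scaled.items():
    print(f"{low:.3f} <= {name} <= {high:.3f}")
```

After rescaling, the three proportions A, B, and C sum to 1, which is the form a mixture design requires.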


We want to be sure that the design points span the feasible region adequately. To
generate the design using the DOE Wizard, the responses to the prompts follow:
Wizard Prompt                                                  Response

Design Type                                                    Mixture Model
Are there to be constraints for any component(s)?              Yes
The possible kind of constrained design are:                   Extreme vertices plus centroids
Enter the number of mixture components:                        3
Enter the maximum dimension to be used to compute centroids:   1
How many such constraints do you wish to have?                 5
Constraint 1: Enter the coefficient for factor 1:              1
Constraint 1: Enter the coefficient for factor 2:              0
Constraint 1: Enter the coefficient for factor 3:              0
Constraint 1: Enter an additive constant:                      -.409
Constraint 2: Enter the coefficient for factor 1:              -1
Constraint 2: Enter the coefficient for factor 2:              0
Constraint 2: Enter the coefficient for factor 3:              0
Constraint 2: Enter an additive constant:                      .849
Constraint 3: Enter the coefficient for factor 1:              0
Constraint 3: Enter the coefficient for factor 2:              -1
Constraint 3: Enter the coefficient for factor 3:              0
Constraint 3: Enter an additive constant:                      .252
Constraint 4: Enter the coefficient for factor 1:              0
Constraint 4: Enter the coefficient for factor 2:              0
Constraint 4: Enter the coefficient for factor 3:              1
Constraint 4: Enter an additive constant:                      -.151
Constraint 5: Enter the coefficient for factor 1:              0
Constraint 5: Enter the coefficient for factor 2:              0
Constraint 5: Enter the coefficient for factor 3:              -1
Constraint 5: Enter an additive constant:                      .274
Specify the tolerance for checking constraints
and duplication of points:                                     .00001
Display the factors for this design?                           Yes
Save the design to a file?                                     No


The constrained mixture design output follows:


The following are index numbers of input constraints found to be redundant:
1 2

Extreme Vertices + Centroids Design: 3 Factors, 9 Runs, 4 Vertices

           Component
 RUN       A       B       C
   1    .849    .000    .151
   2    .597    .252    .151
   3    .726    .000    .274
   4    .474    .252    .274
   5    .787    .000    .213
   6    .535    .252    .213
   7    .723    .126    .151
   8    .600    .126    .274
   9    .661    .126    .213

The design contains nine runs: four points at the extreme vertices of the feasible region,
four points at the edge centroids, and one point at the overall centroid. The following
plot displays the constrained region for the mixture as a blue parallelogram with the
actual design points represented as red filled circles.
[Figure: triangular mixture plot, axes scaled from 0.0 to 1.0, showing the constrained
region and the nine design points]


Example 10
Central Composite Response Surface Design
In an industrial experiment reported by Aia et al. (1961), the authors investigated the
response surface of a chemical process for producing dihydrated calcium hydrogen
orthophosphate (CaHPO4·2H2O). The factors of interest are the ratio of NH3 to CaCl2
in the calcium chloride solution, the addition time of the NH3-CaCl2 mixture, and the
beginning pH of the NH4H2PO4 solution used. We will now see how this experiment
would be designed using the DOE Wizard.
For efficiency and rotatability, we use a central composite design with three factors.
The central composite design consists of a 2^k factorial (or fraction thereof), a set of 2k
axial (or star) points on the axes of the design space, and some number of center
points. The distance between the axial points and the center of the design determines
important properties of the design. In SYSTAT, the distance used ensures rotatability
for unblocked designs. For blocked designs, the distance ensures orthogonality of
blocks.
The choice of number of center points hinges on desired properties of the design.
Orthogonal designs (designs in which the factors are uncorrelated) minimize the
average variance of prediction of the response surface equation. However, in some
cases, you may decide that it is more important to have the variance of predictions be
nearly constant throughout most of the experimental region, even if the overall
variance of predictions is increased somewhat. In such situations, we can use designs
in which the variance of predictions is the same at the center of the design as it is at any
point one unit distant from the center. This property of equal variance between the
center of the design and points one unit from the center is called uniform precision. In
this example, we sacrifice orthogonality in favor of uniform precision. Therefore, we
use six center points instead of the nine points required to make the design nearly
orthogonal. (A table of orthogonal and uniform precision designs with appropriate
numbers of center points can be found in Montgomery, 1991, p. 546.)
The input to generate the central composite design follows:
Wizard Prompt                                                        Response

Design Type                                                          Central Composite
Enter the number of factors desired:                                 3
Are the cube and star portions of the design to be separate blocks?  No
Enter number of center points desired:                               6
Display the factors for this design?                                 Yes
Save the design to a file?                                           No


The resulting design is:


Second-order Composite Design: 3 Factors 20 Runs

            Factor
 RUN        A        B        C
   1   -1.000   -1.000   -1.000
   2   -1.000   -1.000    1.000
   3   -1.000    1.000   -1.000
   4   -1.000    1.000    1.000
   5    1.000   -1.000   -1.000
   6    1.000   -1.000    1.000
   7    1.000    1.000   -1.000
   8    1.000    1.000    1.000
   9   -1.682     .000     .000
  10    1.682     .000     .000
  11     .000   -1.682     .000
  12     .000    1.682     .000
  13     .000     .000   -1.682
  14     .000     .000    1.682
  15     .000     .000     .000
  16     .000     .000     .000
  17     .000     .000     .000
  18     .000     .000     .000
  19     .000     .000     .000
  20     .000     .000     .000

In the central composite design, each factor is measured at five different levels. The
runs with no zeros for the factors are the factorial (cube) points, the runs with only
one nonzero factor are the axial (star) points, and the runs with all zeros are the center
points.
After collecting data according to this design, fit a response surface to analyze the
results.
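
The axial distance of 1.682 in the listing is what the usual rotatability condition gives for a full 2^3 cube: the fourth root of the number of cube points, 8^(1/4). The sketch below (plain Python, not SYSTAT code; the function name is ours) assembles the three parts of an unblocked design under that assumption. It does not cover blocked designs, where SYSTAT chooses the distance for orthogonal blocking instead.

```python
from itertools import product


def central_composite(k, center_points):
    """Unblocked rotatable central composite design in coded units."""
    alpha = (2 ** k) ** 0.25          # axial distance for a full 2^k cube
    cube = [list(point) for point in product((-1.0, 1.0), repeat=k)]
    star = []
    for i in range(k):
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[i] = sign * alpha   # one factor at +/- alpha, the rest at 0
            star.append(point)
    center = [[0.0] * k for _ in range(center_points)]
    return cube + star + center, alpha


design, alpha = central_composite(k=3, center_points=6)
print(f"axial distance = {alpha:.3f}")
for number, run in enumerate(design, start=1):
    print(number, "  ".join(f"{level:7.3f}" for level in run))
```

With three factors this gives 8 cube points, 6 star points, and 6 center points, for 20 runs in all.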

Example 11
Optimal Designs: Coordinate Exchange
Consider a situation in which you want to compute a response surface but your
resources are very limited. Assume that you have three continuous factors but can
afford only 12 runs. This number of runs is not enough for any of the standard response
surface models. However, you can generate a design with 12 runs that will allow you
to estimate the effects of interest using an optimal design.
To generate the design using the DOE Wizard, the responses to the prompts follow:
Wizard Prompt                                                      Response

Design Type                                                        Optimal
Choose the method to use:                                          Coordinate Exchange
Choose the type of optimality desired:                             D-optimality
Specify the number of points to replace in a single iteration:     1
Specify the maximum number of iterations within a trial:           100
Specify the relative convergence tolerance:                        .00001
Specify the number of trials to be run:                            3
Random number seed:                                                131
The starting design is to be:                                      Generated by the program.
Enter the number of factors desired:                               3
How many points (runs) are desired?                                12
The variables in the design are:                                   All continuous
Limits for factor A                                                lower limit = -1
                                                                   upper limit = 1
Limits for factor B                                                lower limit = -1
                                                                   upper limit = 1
Limits for factor C                                                lower limit = -1
                                                                   upper limit = 1
Does the model for your designed design contain an additive
constant?                                                          Yes
Define other effects to be included in the model:                  A*A
                                                                   B*B
                                                                   C*C
                                                                   A*B
                                                                   A*C
                                                                   B*C
                                                                   A*B*C
Display the factors for this design?                               Yes
Save the design to a file?                                         No
Display the factors for this design?                               Yes
Save the design to a file?                                         No
Display the factors for this design?                               Yes
Save the design to a file?                                         No


The design that was output on the third trial follows:


Design from Coordinate-exchange Algorithm: 12 Runs, 3 Factors

            Factor
 RUN        A        B        C
   1   -1.000    1.000   -0.046
   2   -0.038   -0.000   -1.000
   3   -1.000    1.000   -1.000
   4    1.000    1.000   -1.000
   5    1.000    1.000    1.000
   6   -1.000   -1.000   -1.000
   7    1.000   -1.000   -1.000
   8   -1.000    1.000    1.000
   9    1.000   -0.046    0.001
  10   -1.000   -1.000    1.000
  11    0.081   -1.000   -0.036
  12    1.000   -1.000    1.000

The points shown here were generated from a particular run of the algorithm. Since the
initial design depends on a randomly chosen starting point, your design may vary
slightly from the design shown here. Your design should share several characteristics
with this one, however. First, notice that most values appear to be very close to one of
three values: -1, 0, or +1. For the purposes of conceptual discussion, we can act as if
the values were rounded to the nearest integer. We can see that the design includes the
eight corners of the design space (the runs where all values are either -1 or +1). The
design also includes three points that are face centers (runs where two values are near
0), and one edge point (where only one value is near 0).
This design will allow you to estimate all first- and second-order effects in your
model. Of course, you will not have as much precision as you would if you had used a
Box-Behnken or central composite design, because you don't have as much
information to work with. You also lose some of the other advantages of the standard
designs, such as rotatability. However, because the design is optimized with respect to
generalized variance of parameter estimates, you will be getting as much information
as you can out of your 12 runs.


References
Aia, M. A., Goldsmith, R. L., and Mooney, R. W. (1961). Precipitating Stoichiometric
CaHPO4·2H2O. Industrial and Engineering Chemistry, 53, 55-57.
Box, G. E. P., and Behnken, D. W. (1960). Some new three level designs for the study of
quantitative variables. Technometrics, 2, 455-476.
Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978). Statistics for Experimenters. New
York: Wiley.
Cochran, W. G., and Cox, G. M. (1957). Experimental Designs, 2nd ed. New York: John
Wiley & Sons, Inc.
Cornell, J. A. (1990). Experiments with Mixtures. New York: Wiley.
Fedorov, V. V. (1972). Theory of Optimal Experiments. New York: Academic Press.
Galil, Z., and Kiefer, J. (1980). Time- and space-saving computer methods, related to
Mitchell's DETMAX, for finding D-optimum designs. Technometrics, 21, 301-313.
John, P. W. M. (1971). Statistical Design and Analysis of Experiments. New York:
Macmillan.
_____. (1990). Statistical Methods in Engineering and Quality Assurance. New York:
Wiley.
Johnson, M. E., and Nachtsheim, C. J. (1983). Some guidelines for constructing exact
D-optimal designs on convex design spaces. Technometrics, 25, 271-277.
Meyer, R. K., and Nachtsheim, C. J. (1995). The coordinate-exchange algorithm for
constructing exact optimal designs. Technometrics, 37, 60-69.
Montgomery, D. C. (1991). Design and Analysis of Experiments. New York: Wiley.
Plackett, R. L., and Burman, J. P. (1946). The design of optimum multifactorial
experiments. Biometrika, 33, 305-325.
Schneider, A. M., and Stockett, A. L. (1963). An experiment to select optimum operating
conditions on the basis of arbitrary preference ratings. Chemical Engineering Progress
Symposium Series, No. 42, Vol. 59.
Taguchi, G. (1986). Introduction to Quality Engineering. Tokyo: Asian Productivity
Organization.
Taguchi, G. (1987). System of Experimental Design (2 volumes). New York:
UNIPUB/Kraus International Publications.

Chapter 11
Discriminant Analysis
Laszlo Engelman

Discriminant Analysis performs linear and quadratic discriminant analysis, providing
linear or quadratic functions of the variables that best separate cases into two or
more predefined groups. The variables in the linear function can be selected in a
forward or backward stepwise manner, either interactively by the user or
automatically by SYSTAT. For the latter, at each step, SYSTAT enters the variable
that contributes most to the separation of the groups (or removes the variable that is
the least useful).
The command language allows you to emphasize the difference between specific
groups; contrasts can be used to guide variable selection. Cases can be classified even
if they are not used in the computations.
Discriminant analysis is related to both multivariate analysis of variance and
multiple regression. The cases are grouped in cells like a one-way multivariate
analysis of variance and the predictor variables form an equation like that for multiple
regression. In discriminant analysis, Wilks' lambda, the same test statistic used in
multivariate ANOVA, is used to test the equality of group centroids. Discriminant
analysis can be used not only to test multivariate differences among groups, but also
to explore:
- Which variables are most useful for discriminating among groups
- If one subset of variables performs equally well as another
- Which groups are most alike and most different


Statistical Background
When we have categorical variables in a model, it is often because we are trying to
classify cases; that is, what group does someone or something belong to? For example,
we might want to know whether someone with a grade point average (GPA) of 3.5 and
an Advanced Psychology Test score of 600 is more like the group of graduate students
successfully completing a Ph.D. or more like the group that fails. Or, we might want
to know whether an object with a plastic handle and no concave surfaces is more like
a wrench or a screwdriver.
Once we attempt to classify, our attention turns from parameters (coefficients) in a
model to the consequences of classification. We now want to know what proportion of
subjects will be classified correctly and what proportion incorrectly. Discriminant
analysis is one method for answering these questions.

Linear Discriminant Model


If we know that our classifying variables are normally distributed within groups, we
can use a classification procedure called linear discriminant analysis (Fisher, 1936).
Before we present the method, however, we should warn you that the procedure
requires you to know that the groups share a common covariance matrix and you must
know what the covariance matrix values are. We have not found an example of
discriminant analysis in the social sciences where this was true. The most appropriate
applications we have found are in engineering, where a covariance matrix can be
deduced from physical measurements. Discriminant analysis is used, for example, in
automated vision systems for detecting objects on moving conveyer belts.
Why do we need to know the covariance matrix? We are going to use it to calculate
Mahalanobis distances (developed by the Indian statistician Prasanta C.
Mahalanobis). These distances are calculated between cases we want to classify and
the center of each group in a multidimensional space. The closer a case is to the center
of one group (relative to its distance to other groups), the more likely it is to be
classified as belonging to that group. The figure on p. 277 shows what we are doing.
The borders of this graph comprise the two predictors GPA and GRE. The two
hills are centered at the mean values of the two groups (No Ph.D. and Ph.D.). Most
of the data in each group are supposed to be under the highest part of each hill. The
hills, in other words, mathematically represent the concentration of data values in the
scatterplot beneath.


The shape of the hills was computed from a bivariate normal distribution using the
covariance matrix averaged within groups. We've plotted this figure this way to show
you that this model is like pie-in-the-sky if you use the information in the data below
to compute the shape of these hills. As you can see, there is a lot of smoothing of the
data going on, and if one or two data values in the scatterplot influence unduly the
shape of the hills above, you will have an unrepresentative model when you try to use
it on new samples.
How do we classify a new case into one group or another? Look at the figure again.
The new case could belong to one or the other group. It's more likely to belong to
the closer group, however. The simple way to find how far this case is from the center
of each group would be to take a direct walk from the new case to the center of each
group in the data plot.


Instead of walking in sample data space below, however, we must climb the hills of
our theoretical model above when using the normal classification model. In other
words, we will use our theoretical model to calculate distances. The covariance matrix
we used to draw the hills in the figure makes distances depend on the direction we are
heading. The distance to a group is thus proportional to the altitude (not the horizontal
distance) we must climb to get to the top of the corresponding hill.
Because these hills can be oblong in shape, it is possible to be quite far from the top
of the hill as the crow flies, yet have little altitude to cover in a climb. Conversely, it is
possible to be close to the center of the hill and have a steep climb to get to the top.
Discriminant analysis adjusts for the covariance that causes these eccentricities in hill
shape. That is why we need the covariance matrix in the first place.
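
To make the idea of a covariance-adjusted distance concrete, here is a minimal sketch in plain Python (not SYSTAT code). It takes the group means and the pooled within-groups covariance matrix from the SYSTAT output for the ADMIT example shown later in this section, and computes the squared Mahalanobis distance from a case to each group center. The function names and the two test cases are our own, chosen only for illustration.

```python
# Group means (GPA, GRE) and pooled within-groups covariance from the ADMIT output.
MEANS = {1: (4.423, 590.490), 2: (4.639, 643.448)}
POOLED_COV = ((0.095, 1.543),
              (1.543, 4512.409))


def mahalanobis_sq(case, mean, cov=POOLED_COV):
    """Squared Mahalanobis distance for two variables (explicit 2 x 2 inverse)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = case[0] - mean[0], case[1] - mean[1]
    # Quadratic form (dx, dy) * inverse(cov) * (dx, dy)'
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def nearest_group(case):
    """Group whose center is closest to the case in Mahalanobis distance."""
    return min(MEANS, key=lambda group: mahalanobis_sq(case, MEANS[group]))


for case in [(5.0, 700.0), (4.0, 500.0)]:
    distances = {group: mahalanobis_sq(case, MEANS[group]) for group in MEANS}
    print(case, distances, "-> group", nearest_group(case))
```

With equal prior probabilities, assigning each case to the nearest center in this metric is the classification rule described above.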
So much for the geometric representation. What do the numbers look like? Let's
look at how to set up the problem with SYSTAT. The input is:
DISCRIM
USE ADMIT
PRINT LONG
MODEL PHD = GRE,GPA
ESTIMATE

The output is:


Group frequencies
-----------------
                      1         2
Frequencies          51        29

Group means
-----------
                      1         2
GPA               4.423     4.639
GRE             590.490   643.448

Pooled within covariance matrix -- DF= 78
------------------------------------------------
                    GPA       GRE
GPA               0.095
GRE               1.543  4512.409

Within correlation matrix
-------------------------
                    GPA       GRE
GPA               1.000
GRE               0.075     1.000

Total covariance matrix -- DF= 79
------------------------------------------------
                    GPA       GRE
GPA               0.104
GRE               4.201  5111.610

Total correlation matrix
------------------------
                    GPA       GRE
GPA               1.000
GRE               0.182     1.000

Between groups F-matrix -- df = 2 77
----------------------------------------------
                      1         2
1                   0.0
2                 9.469       0.0

Wilks' lambda
Lambda    =     0.8026    df =   2    1   78
Approx. F =     9.4690    df =   2   77        prob = 0.0002

Classification functions
------------------------
                      1         2
Constant       -133.910  -150.231
GPA              44.818    46.920
GRE               0.116     0.127

Classification matrix (cases in row categories classified into columns)
---------------------
                      1         2  %correct
1                    38        13        75
2                     7        22        76
Total                45        35        75

Jackknifed classification matrix
--------------------------------
                      1         2  %correct
1                    37        14        73
2                     7        22        76
Total                44        36        74

Eigen         Canonical       Cumulative proportion
values        correlations    of total dispersion
---------     ------------    ---------------------
0.246         0.444           1.000

Wilks' lambda=              0.803
  Approx.F=                 9.469    DF= 2, 77    p-tail= 0.0002

Pillai's trace=             0.197
  Approx.F=                 9.469    DF= 2, 77    p-tail= 0.0002

Lawley-Hotelling trace=     0.246
  Approx.F=                 9.469    DF= 2, 77    p-tail= 0.0002

Canonical discriminant functions
--------------------------------
                      1
Constant        -15.882
GPA               2.064
GRE               0.011

Canonical discriminant functions -- standardized by within variances
--------------------------------------------------------------------
                      1
GPA               0.635
GRE               0.727

Canonical scores of group means
-------------------------------
1                 -.369
2                  .649

There's a lot to follow on this output. The counts and means per group are shown first.
Next comes the Pooled within covariance matrix, computed by averaging the separate-group
covariance matrices, weighting by group size. The Total covariance matrix
ignores the groups. It includes variation due to the group separation. These are the
same matrices found in the MANOVA output with PRINT=LONG. The Between groups
F-matrix shows the F value for testing the difference between each pair of groups on
all the variables (GPA and GRE). The Wilks' lambda is for the multivariate test of
dispersion among all the groups on all the variables, just as in MANOVA. Each case
is classified by our model into the group whose classification function yields the
largest score. Each function is like a regression equation. We compute the predicted
value of each equation for a case's values on GPA and GRE and classify the case into
the group whose function yields the largest value.
Next come the separate F statistics for each variable and the Classification matrix.
The goodness of classification is comparable to that for the PROBIT model. We did a
little worse with the No Ph.D. group and a little better with the Ph.D. The Jackknifed
classification matrix is an attempt to approximate cross-validation. It will tend to be
somewhat optimistic, however, because it uses only information from the current
sample, leaving out each case in turn and classifying it with functions computed from
the remaining cases. There is no substitute for trying the model on new data.
Finally, the program prints the same information produced in a MANOVA by
SYSTAT's MGLH (GLM and ANOVA). The multivariate test statistics show the
groups are significantly different on GPA and GRE taken together.
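
As a concrete illustration of the classification functions, the sketch below (plain Python, not SYSTAT code) evaluates both functions for a case and picks the larger score. The coefficients are copied from the Classification functions table in the output; the function names and the two example cases are ours. Because the printed coefficients are rounded to three decimals, cases that fall very near the boundary may not be classified exactly as SYSTAT classifies them.

```python
# Coefficients from the "Classification functions" table (group 1 and group 2).
CLASSIFICATION_FUNCTIONS = {
    1: {"constant": -133.910, "GPA": 44.818, "GRE": 0.116},
    2: {"constant": -150.231, "GPA": 46.920, "GRE": 0.127},
}


def scores(gpa, gre):
    """Score of the case on each group's classification function."""
    return {
        group: f["constant"] + f["GPA"] * gpa + f["GRE"] * gre
        for group, f in CLASSIFICATION_FUNCTIONS.items()
    }


def classify(gpa, gre):
    """Assign the case to the group with the largest score."""
    case_scores = scores(gpa, gre)
    return max(case_scores, key=case_scores.get)


for gpa, gre in [(5.0, 700.0), (4.0, 500.0)]:
    print(gpa, gre, scores(gpa, gre), "-> group", classify(gpa, gre))
```

Only the difference between the two scores matters for classification, so any term common to both functions has no effect on the result.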

Linear Discriminant Function


We mentioned in the last section that the canonical coefficients are like a regression
equation for computing distances up the hills. Let's look more closely at these
coefficients. The following figure shows the plot underlying the surface in the last
figure. Superimposed at the top of the GRE axis are two normal distributions centered
at the means for the two groups. The standard deviations of these normal distributions
are computed within groups. The within-group standard deviation is the square root of
the diagonal GRE variance element of the residual covariance matrix (4512.409). The
same is done for GPA on the right, using square root of the within-groups variance
(0.095) for the standard deviation and the group means for centering the normals.


[Figure: scatterplot of GPA (vertical axis, 3.5 to 5.5) against GRE (horizontal axis,
400 to 800), with each case plotted as Y (Ph.D.) or N (No Ph.D.)]

Either of these variables separates the groups somewhat. The diagonal line underlying
the two diagonal normal distributions represents a linear combination of these two
variables. It is computed using the canonical discriminant functions in the output.
These are the same as the canonical coefficients produced by MGLH. Before applying
these coefficients, the variables must be standardized by the within-group standard
deviations. Finally, the dashed line perpendicular to this diagonal cuts the observations
into two groups: those to the left and those to the right of the dashed line.
You can see that this new canonical variable and its perpendicular dashed line are
an orthogonal (right-angle-preserving) rotation of the original axes. The separation of
the two groups using normal distributions drawn on the rotated canonical variable is
slightly better than that for either variable alone. To classify on the linear discriminant
axis, make the mean on this new variable 0 (halfway between the two diagonal normal
curves). Then add a scale along the diagonal, running from negative to positive. If we
do this, then any observations with negative scores on this diagonal scale will be
classified into the No Ph.D. group (to the left of the dashed perpendicular bisector) and
those with positive scores into the Ph.D. (to the right). All Y's to the left of the dashed
line and N's to the right are misclassifications. Try rotating these axes any other way
to get a better count of correctly classified cases (watch out for ties). The linear
discriminant function is the best rotation.
Using this linear discriminant function variable, we get the same classifications we
got with the Mahalanobis distance method. Before computers, this was the preferred
method for classifying because the computations are simpler.


We just use the equation:


Fz = 0.635*ZGPA + 0.727*ZGRE

The two Z variables are the raw scores minus the overall mean, divided by the
within-groups standard deviations. If Fz is less than 0, classify No Ph.D.; otherwise,
classify Ph.D.
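
The equation can be checked against the output. The sketch below (plain Python, not SYSTAT code) standardizes a case using the within-groups standard deviations and evaluates Fz. The overall means are not printed in the output, so we assume they are the group means weighted by the group frequencies; the names used are our own. Under that assumption, evaluating Fz at the two group means reproduces the Canonical scores of group means (-.369 and .649) to within rounding.

```python
from math import sqrt

# Values copied from the SYSTAT output above.
N = {1: 51, 2: 29}
MEANS = {1: {"GPA": 4.423, "GRE": 590.490}, 2: {"GPA": 4.639, "GRE": 643.448}}
WITHIN_SD = {"GPA": sqrt(0.095), "GRE": sqrt(4512.409)}
STANDARDIZED_COEF = {"GPA": 0.635, "GRE": 0.727}

# Assumption: overall mean = frequency-weighted average of the group means.
TOTAL_N = sum(N.values())
OVERALL_MEAN = {
    var: sum(N[g] * MEANS[g][var] for g in N) / TOTAL_N for var in ("GPA", "GRE")
}


def fz(gpa, gre):
    """Linear discriminant score from within-groups standardized variables."""
    case = {"GPA": gpa, "GRE": gre}
    return sum(
        STANDARDIZED_COEF[var] * (case[var] - OVERALL_MEAN[var]) / WITHIN_SD[var]
        for var in case
    )


def classify(gpa, gre):
    """Apply the rule in the text: negative scores go to No Ph.D."""
    return "No Ph.D." if fz(gpa, gre) < 0 else "Ph.D."


for group in MEANS:
    score = fz(MEANS[group]["GPA"], MEANS[group]["GRE"])
    print(f"group {group} mean: Fz = {score:.3f}")
```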
As we mentioned, the Mahalanobis method and the linear discriminant function
method are equivalent. This is somewhat evident in the figure. The intersection of the
two hills is a straight line running from the northwest to the southeast corner in the
same orientation as the dashed line. Any point to the left of this line will be closer to
the top of the left hill, and any point to the right will be closer to the top of the right hill.

Prior Probabilities
Our sample contained fewer Ph.D.s than No Ph.D.s. If we want to use our discriminant
model to classify new cases and if we believe that this difference in sample sizes
reflects proportions in the population, then we can adjust our formula to favor No
Ph.D.s. In other words, we can make the prior probabilities (assuming we know
nothing about GRE and GPA scores) favor a No Ph.D. classification. We can do this
by adding the option
PRIORS = 0.625, 0.375

to the MODEL command. Do not be tempted to use this method as a way of improving
your classification table. If the probabilities you choose do not reflect real population
differences, then new samples will on average be classified worse. It would make sense
in our case because we happen to know that more people in our department tend to drop
out than stay for the Ph.D.
You might have guessed that the default setting is for prior probabilities to be equal
(both 0.5). In the last figure, this makes the dashed line run halfway between the means
of the two groups on the discriminant axis. By changing the priors, we move this
dashed line (the normal distributions stay in the same place).
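
The effect of the priors can be sketched numerically. Under the normal model with a common covariance matrix, the posterior probability of each group is proportional to its prior times exp(-D²/2), where D² is the squared Mahalanobis distance from the case to the group center. The code below is plain Python, not SYSTAT code; the function name is ours, and the distances in the second example are illustrative values rather than SYSTAT output.

```python
from math import exp, log


def posteriors(squared_distances, priors):
    """Posterior group probabilities from squared Mahalanobis distances and priors."""
    log_weights = [log(p) - 0.5 * d2 for d2, p in zip(squared_distances, priors)]
    largest = max(log_weights)                     # subtract the max for stability
    weights = [exp(value - largest) for value in log_weights]
    total = sum(weights)
    return [weight / total for weight in weights]


# A case equally far from both centers: the posteriors simply equal the priors.
print(posteriors([2.0, 2.0], [0.5, 0.5]))
print(posteriors([2.0, 2.0], [0.625, 0.375]))

# A case somewhat closer to group 2: unequal priors can change the decision.
print(posteriors([2.6, 2.0], [0.5, 0.5]))
print(posteriors([2.6, 2.0], [0.625, 0.375]))
```

In the last pair of calls, the case is assigned to group 2 under equal priors but to group 1 under the 0.625/0.375 priors, which is the shift of the dashed line described above.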

Multiple Groups
The discriminant model generalizes to more than two groups. Imagine, for example,
three hills in the first figure. All the distances and classifications are computed in the
same manner. The posterior probabilities for classifying cases are computed by
comparing three distances rather than two.
The multiple group (canonical) discriminant model yields more than one
discriminant axis. For three groups, we get two sets of canonical discriminant
coefficients. For four groups, we get three. If we have fewer variables than groups, then
we get only as many sets as there are variables. The group classification function
coefficients are handy for classifying new cases with the multiple group model. Simply
multiply each coefficient times each variable and add in the constant. Then assign the
case to the group whose set yields the largest value.

Discriminant Analysis in SYSTAT


Discriminant Analysis Main Dialog Box
To open the Discriminant Analysis dialog box, from the menus choose:
Statistics
Classification
Discriminant Analysis...

The following options can be specified:


Quadratic. The Quadratic check box requests quadratic discriminant analysis. If not
selected, linear discriminant analysis is performed.
Save. For each case, Distances saves the Mahalanobis distances to each group centroid
and the posterior probability of the membership in each group. Scores saves the
canonical variable scores. Scores/Data and Distances/Data save scores and distances
along with the data.

Discriminant Analysis Options


SYSTAT includes several controls for stepwise model building and tolerance. To
access these options, click Options in the main dialog box.

The following can be specified:


Tolerance. The tolerance sets the matrix inversion tolerance limit. Tolerance = 0.001 is
the default.

Two estimation options are available:


- Complete. All variables are used in the model.
- Stepwise. Variables can be selected in a forward or backward stepwise manner,
  either interactively by the user or automatically by SYSTAT.


If you select stepwise estimation, you can specify the direction in which the estimation
should proceed, whether SYSTAT should control variable entry and elimination, and
any desired criteria for variable entry and elimination.
- Backward. In backward stepping, all variables are entered, irrespective of their
  F-to-enter values (if a variable fails the Tolerance limit, however, it is excluded).
  F-to-remove and F-to-enter values are reported. When Backward is selected along
  with Automatic, at each step, SYSTAT removes the variable with the lowest
  F-to-remove value that passes the Remove limit of the F statistic (or reenters the
  variable with the largest F-to-enter above the Remove limit of the F statistic).
- Forward. In forward stepping, the variables are entered in the model. F-to-enter
  values are reported for all candidate variables, and F-to-remove values are reported
  for forced variables. When Forward is selected along with Automatic, at each step,
  SYSTAT enters the variable with the highest F-to-enter that passes the Enter limit
  of the F statistic (or removes the variable with the lowest F-to-remove below the
  Remove limit of the F statistic).
- Automatic. SYSTAT enters or removes variables automatically. F-to-enter and
  F-to-remove limits are used.
- Interactive. Variables are interactively removed from and/or added to the model at
  each step. In the Command pane, type a STEP command to enter and remove
  variables interactively.
STEP                One variable is entered into or removed from the model (based
                    on the Enter and Remove limits of the F statistic).
STEP +              Variable with the largest F-to-enter is entered into the model
                    (irrespective of the Enter limit of the F statistic).
STEP -              Variable with the smallest F-to-remove is removed from the
                    model (irrespective of the Remove limit of the F statistic).
STEP c, e           Variables named c and e are stepped into/out of the model
                    (irrespective of the Enter and Remove limits of the F statistic).
STEP 3, 5           Third and fifth variables are stepped into/out of the model
                    (irrespective of the Enter and Remove limits of the F statistic).
STEP/NUMBER = 3     Three variables are entered into or removed from the model.
STOP                Stops the stepping and generates final output (classification
                    matrices, eigenvalues, canonical variables, etc.).

Variables are added to or eliminated from the model based on one of two possible
criteria.
- Probability. Variables with probability (F-to-enter) smaller than the Enter
  probability are entered into the model if Tolerance permits. The default Enter value
  is 0.15. For highly correlated predictors, you may want to set Enter = 0.01.
  Variables with probability (F-to-remove) larger than the Remove probability are
  removed from the model. The default Remove value is 0.15.
- F-statistic. Variables with F-to-enter values larger than the Enter F value are
  entered into the model if Tolerance permits. The default Enter value is 4. Variables
  with F-to-remove values smaller than the Remove F value are removed from the
  model. The default Remove value is 3.9.


You can also specify variables to include in the model, regardless of whether they meet
the criteria for entry into the model. In the Force text box, enter the number of
variables, in the order in which they appear in the Variables list, to force into the model
(for example, Force = 2 means include the first two variables on the Variables list in
the main dialog box). Force = 0 is the default.

Discriminant Analysis Statistics


You can select any desired output elements by clicking Statistics in the main dialog
box.

All selected statistics will be displayed in the output. Depending on the specified length
of your output, you may also see additional statistics. By default, the print length is set
to Short (you will see all of the statistics on the Short Statistics list). To change the
length of your output, choose Options from the Edit menu. Select Short, Medium, or
Long from the Length drop-down list. Again, all selected statistics will be displayed in
the output, regardless of the print setting.
Short Statistics. Options for Short Statistics are FMatrix (between-groups F matrix),
FStats (F-to-enter/remove statistics), Eigen (eigenvalues and canonical correlation),
CMeans (canonical scores of group means), and Sum (summary panel).
Medium Statistics. Options for Medium Statistics are those for Short Statistics plus
Means (group frequencies and means), Wilks (Wilks' lambda and approximate F),
CFunc (discriminant functions), Traces (Lawley-Hotelling and Pillai and Wilks
traces), CDFunc (canonical discriminant functions), SCDFunc (standardized canonical
discriminant functions), Class (classification matrix), and JClass (Jackknifed
classification matrix).
Long Statistics. Options for Long Statistics are those for Medium Statistics plus WCov
(within covariance matrix), WCorr (within correlation matrix), TCov (total covariance
matrix), TCorr (total correlation matrix), GCov (groupwise covariance matrix), and
GCorr (groupwise correlation matrix).

Mahalanobis distances, posterior probabilities (Mahal), and canonical scores (CScore)
for each case must be specified individually.

Using Commands
Select your data by typing USE filename and continue as follows:
Basic

DISCRIM
MODEL grpvar = varlist / QUADRATIC
CONTRAST [matrix]
PRINT / length element
SAVE / DATA SCORES DISTANCES
ESTIMATE / TOL=n

Stepwise (Instead of ESTIMATE, specify START)

PRIORS=n1,n2,
START / FORWARD     TOL=n ENTER=p REMOVE=p FENTER=n FREMOVE=n
        BACKWARD    FORCE=n
STEP no argument or / NUMBER=n AUTO ENTER=p REMOVE=p
     + or                        FENTER=n FREMOVE=n
     - or
     varlist or
     nvari, nvarj,
(sequence of STEPs)
STOP

In addition to indicating a length for the PRINT output, you can select elements not
included in the output for the specified length. Elements for each length include:
Length     Element

SHORT      FMATRIX FSTATS EIGEN CMEANS SUM
MEDIUM     MEANS WILKS CFUNC TRACES CDFUNC SCDFUNC CLASS JCLASS
LONG       WCOV WCOR TCOV TCOR GCOV GCOR

MAHAL and CSCORE must be specified individually. No length specification includes
these statistics.

Usage Considerations
Types of data. DISCRIM uses rectangular data only.
Print options. Print options allow the user to select panels of output to display, including
group means, variances, covariances, and correlations.
Quick Graphs. For two canonical variables, SYSTAT produces a canonical scores plot,
in which the axes are the canonical variables and the points are the canonical variable
scores. This plot includes confidence ellipses for each group. For analyses involving
more than two canonical variables, SYSTAT displays a SPLOM of the first three
canonical variables.
Saving files. You can save the Mahalanobis distances to each group centroid (with the
posterior probability of the membership in each group) or the canonical variable
scores.
BY groups. DISCRIM analyzes data by groups.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. DISCRIM uses a FREQ variable to increase the number of cases.
Case weights. You can weight each case in a discriminant analysis using a weight
variable. Use a binary weight variable coded 0 and 1 for cross-validation. Cases that
have a zero weight do not influence the estimation of the discriminant functions but are
classified into groups.

Examples
Example 1
Discriminant Analysis Using Complete Estimation
In this example, we examine measurements made on 150 iris flowers: sepal length,
sepal width, petal length, and petal width (in centimeters). The data are from Fisher
(1936) and are grouped by species: Setosa, Versicolor, and Virginica (coded as 1, 2,
and 3, respectively).
The goal of the discriminant analysis is to find a linear combination of the four
measures that best classifies or discriminates among the three species (groups of
flowers). Here is a SPLOM of the four measures with within-group bivariate
confidence ellipses and normal curves. The input is:
DISCRIM
USE iris
SPLOM sepallen..petalwid / HALF GROUP=species ELL,
                           DENSITY=NORM OVERLAY

The plot follows:

[Figure: scatterplot matrix (SPLOM) of SEPALLEN, SEPALWID, PETALLEN, and PETALWID,
with points grouped by SPECIES (1, 2, 3).]

Let's see what a default analysis tells us about the separation of the groups and the
usefulness of the variables for the classification. The input is:
USE iris
LABEL species / 1=Setosa, 2=Versicolor, 3=Virginica
DISCRIM
MODEL species = sepallen .. petalwid
PRINT / MEANS
ESTIMATE

Note the shortcut notation (..) in the MODEL statement for listing consecutive variables
in the file (otherwise, simply list each variable name separated by a space).


The output follows:


Group frequencies
-----------------
                   Setosa   Versicolor    Virginica
Frequencies            50           50           50

Group means
-----------
                   Setosa   Versicolor    Virginica
SEPALLEN           5.0060       5.9360       6.5880
SEPALWID           3.4280       2.7700       2.9740
PETALLEN           1.4620       4.2600       5.5520
PETALWID           0.2460       1.3260       2.0260

Between groups F-matrix -- df =     4   144
----------------------------------------------
                   Setosa   Versicolor    Virginica
Setosa                0.0
Versicolor       550.1889          0.0
Virginica       1098.2738     105.3127          0.0

   Variable    F-to-remove  Tolerance |    Variable     F-to-enter  Tolerance
--------------------------------------+--------------------------------------
 2 SEPALLEN           4.72   0.347993 |
 3 SEPALWID          21.94   0.608859 |
 4 PETALLEN          35.59   0.365126 |
 5 PETALWID          24.90   0.649314 |

Classification matrix (cases in row categories classified into columns)
---------------------
                   Setosa    Versicolo    Virginica     %correct
Setosa                 50            0            0          100
Versicolor              0           48            2           96
Virginica               0            1           49           98

Total                  50           49           51           98

Jackknifed classification matrix
--------------------------------
                   Setosa    Versicolo    Virginica     %correct
Setosa                 50            0            0          100
Versicolor              0           48            2           96
Virginica               0            1           49           98

Total                  50           49           51           98

     Eigen         Canonical       Cumulative proportion
     values       correlations      of total dispersion
    ---------     ------------     ---------------------
       32.192            0.985                     0.991
        0.285            0.471                     1.000

Canonical scores of group means
-------------------------------
Setosa              7.608         .215
Versicolor         -1.825        -.728
Virginica          -5.783         .513


Group Frequencies
The Group frequencies panel shows the count of flowers within each group and the
means for each variable. If the group code or one or more measures are missing, the
case is not used in the analysis.

Between Groups F-Matrix


For each pair of groups, use these F statistics to test the equality of group means. These
values are proportional to distance measures and are computed from Mahalanobis D²
statistics. Thus, the centroids for Versicolor and Virginica are closest (105.3); those for
Setosa and Virginica (1098.3) are farthest apart. If you explore differences among
several pairs, don't use the probabilities associated with these Fs as a test because of
the simultaneous inference problem. Compare the relative size of these values with the
distances between group means in the canonical variable plot.
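
As a rough check on how these F statistics relate to Mahalanobis distances, the sketch
below rebuilds two entries of the F-matrix from the canonical scores of group means
printed in the output. It assumes the standard two-group conversion
F = [(n - g - p + 1) / (p(n - g))] * [ni*nj / (ni + nj)] * D², and it takes D² as the
squared distance between centroids in the space of the two canonical variables. The
formula and the function names are our assumptions for illustration; they are not part
of the SYSTAT output.

```python
def pairwise_f(d2, n_i, n_j, n_total, n_groups, n_vars):
    """Convert a squared Mahalanobis distance between two centroids to an F value."""
    df2 = n_total - n_groups - n_vars + 1            # denominator df (144 for iris)
    scale = df2 / (n_vars * (n_total - n_groups))
    return scale * (n_i * n_j) / (n_i + n_j) * d2


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Canonical scores of group means, copied from the output (rounded to 3 decimals).
centroids = {
    "Setosa": (7.608, 0.215),
    "Versicolor": (-1.825, -0.728),
    "Virginica": (-5.783, 0.513),
}

f_versicolor_virginica = pairwise_f(
    squared_distance(centroids["Versicolor"], centroids["Virginica"]), 50, 50, 150, 3, 4)
f_setosa_virginica = pairwise_f(
    squared_distance(centroids["Setosa"], centroids["Virginica"]), 50, 50, 150, 3, 4)

# Both values should land close to the printed 105.3127 and 1098.2738; small
# differences come from the rounding of the canonical scores.
print(f_versicolor_virginica, f_setosa_virginica)
```

Because the multiplier is the same for every pair when the group sizes are equal, the
ordering of the F values here is simply the ordering of the distances between centroids.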

F Statistics and Tolerance


Use F-to-remove statistics to determine the relative importance of variables included
in the model. The numerator degrees of freedom for each F is the number of groups
minus 1, and the denominator df is (total sample size) - (number of groups) -
(number of variables in the model) + 1; for example, for these data, 3 - 1 and
150 - 3 - 4 + 1, or 2 and 144. Because you may be scanning Fs for several variables, do not
use the probabilities from the usual F tables for a test. Here we conclude that
SEPALLEN is least helpful for discriminating among the species (F = 4.72).
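
The degrees-of-freedom rule can be written as a two-line calculation. The function name
below is hypothetical; the arithmetic is the rule stated in the text.

```python
def f_to_remove_df(n_cases, n_groups, n_vars_in_model):
    """Degrees of freedom for an F-to-remove statistic."""
    numerator = n_groups - 1
    denominator = n_cases - n_groups - n_vars_in_model + 1
    return numerator, denominator


# Iris example: 150 flowers, 3 species, 4 variables in the model.
iris_df = f_to_remove_df(150, 3, 4)
print(iris_df)
```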

Classification Tables
In the Classification matrix, each case is classified into the group where the value of
its classification function is largest. For Versicolor (row name), 48 flowers are
classified correctly and 2 are misclassified (classified as Virginica), so 96% of the
Versicolor flowers are classified correctly. Overall, 98% of the flowers are classified
correctly (see the last row of the table). The results in the first table can be misleading
because we evaluated the classification rule using the same cases used to compute it.
They may provide an overly optimistic estimate of the rule's success. The Jackknifed
classification matrix attempts to remedy the problem by using functions computed
from all of the data except the case being classified. The method of leaving out one case
at a time is called the jackknife and is one form of cross-validation.
For these data, the results are the same. If the percentage for correct classification is
considerably lower in the Jackknifed panel than in the first matrix, you may have too
many predictors in your model.
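
To make the difference between the two matrices concrete, here is a minimal sketch of
the leave-one-out idea. It is not SYSTAT's computation: for brevity it swaps in a
one-variable nearest-group-mean rule in place of discriminant functions, and the six
data values are invented for illustration.

```python
def nearest_mean_label(value, cases):
    """Classify value into the group whose mean (computed from cases) is closest."""
    sums = {}
    for x, label in cases:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    means = {label: total / count for label, (total, count) in sums.items()}
    return min(means, key=lambda label: abs(value - means[label]))


def resubstitution_accuracy(cases):
    """Evaluate the rule on the same cases used to build it (optimistic)."""
    hits = sum(nearest_mean_label(x, cases) == label for x, label in cases)
    return hits / len(cases)


def jackknife_accuracy(cases):
    """Evaluate each case with a rule built from all of the other cases."""
    hits = 0
    for i, (x, label) in enumerate(cases):
        training = cases[:i] + cases[i + 1:]      # leave case i out
        hits += nearest_mean_label(x, training) == label
    return hits / len(cases)


# Hypothetical one-variable data, not taken from the iris file.
cases = [(0.5, "A"), (0.7, "A"), (2.0, "A"),
         (3.0, "B"), (3.2, "B"), (2.8, "B")]

print(resubstitution_accuracy(cases), jackknife_accuracy(cases))
```

In this toy data set the borderline case (2.0) is classified correctly only when it is
allowed to pull its own group mean toward itself, so the jackknifed rate is lower than
the resubstitution rate.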

Eigenvalues, Canonical Correlations, Cumulative Proportion of
Total Dispersion, and Canonical Scores of Group Means
The first canonical variable is the linear combination of the variables that best
discriminates among the groups, the second canonical variable is orthogonal to the first
and is the next best combination of variables, and so on. For our data, the first
eigenvalue (32.2) is very large relative to the second, indicating that the first canonical
variable captures most of the difference among the groups; at the right of this panel,
notice that it accounts for more than 99% of the total dispersion of the groups.
The Canonical correlation between the first canonical variable and a set of two
dummy variables representing the groups is 0.985; the correlation between the second
canonical variable and the dummy variables is 0.471. (The number of dummy variables
is the number of groups minus 1.) Finally, the canonical variables are evaluated at the
group means. That is, in the canonical variable plot, the centroid for the Setosa flowers
is (7.608, 0.215), Versicolor is (-1.825, -0.728), and so on, where the first canonical
variable is the x coordinate and the second, the y coordinate.
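
The three columns of this panel are tied together by simple arithmetic. The sketch
below assumes the usual relations (cumulative proportion = running sum of eigenvalues
divided by their total; canonical correlation = sqrt(eigenvalue / (1 + eigenvalue))) and
applies them to the two eigenvalues printed for the iris data. The function names are
ours.

```python
import math


def cumulative_proportions(eigenvalues):
    """Running share of the total dispersion accounted for by each canonical variable."""
    total = sum(eigenvalues)
    running, shares = 0.0, []
    for value in eigenvalues:
        running += value
        shares.append(running / total)
    return shares


def canonical_correlations(eigenvalues):
    """Canonical correlation implied by each eigenvalue."""
    return [math.sqrt(value / (1.0 + value)) for value in eigenvalues]


iris_eigenvalues = [32.192, 0.285]          # copied from the output
iris_shares = [round(s, 3) for s in cumulative_proportions(iris_eigenvalues)]
iris_correlations = [round(r, 3) for r in canonical_correlations(iris_eigenvalues)]
print(iris_shares, iris_correlations)
```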


Canonical Scores Plot


The axes of this Quick Graph are the first two canonical variables, and the points are
the canonical variable scores. The confidence ellipses are centered on the centroid of
each group. The Setosa flowers are well differentiated from the others. There is some
overlap between the other two groups. Look for outliers in these displays because they
can affect your analysis.

Example 2
Discriminant Analysis Using Automatic Forward Stepping
Our problem for this example is to derive a rule for classifying countries as European,
Islamic, or New World. We know that strong correlations exist among the candidate
predictor variables, so we are curious about just which subset will be useful. Here are
the candidate predictors:
URBAN        Percentage of the population living in cities
BIRTH_RT     Births per 1000 people in 1990
DEATH_RT     Deaths per 1000 people in 1990
B_TO_D       Ratio of births to deaths in 1990
BABYMORT     Infant deaths during the first year per 1000 live births
GDP_CAP      Gross domestic product per capita (in U.S. dollars)
LIFEEXPM     Years of life expectancy for males
LIFEEXPF     Years of life expectancy for females
EDUC         U.S. dollars spent per person on education in 1986
HEALTH       U.S. dollars spent per person on health in 1986
MIL          U.S. dollars spent per person on the military in 1986
LITERACY     Percentage of the population who can read

Because the distributions of the economic variables are skewed with long right tails,
we log transform GDP_CAP and take the square root of EDUC, HEALTH, and MIL.
LET gdp_cap = L10(gdp_cap)
LET educ = SQR(educ)
LET health = SQR(health)
LET mil = SQR(mil)

Alternatively, you could also use shortcut notation to request the square root
transformations:
LET (educ, health, mil) = SQR(@)


We use automatic forward stepping in an effort to identify the best subset of predictors.
After stepping stops, you need to type STOP to ask SYSTAT to produce the summary
table, classification matrices, and information about canonical variables. The input is:
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy
PRINT / MEANS
START / FORWARD
STEP / AUTO FENTER=4 FREMOVE=3.9
STOP

Notice that the initial results appear after START / FORWARD is specified. STEP /
AUTO and STOP are selected later, as indicated in the output that follows:
Group frequencies
-----------------
                   Europe      Islamic     NewWorld
Frequencies            19           15           21

Group means
-----------
                   Europe      Islamic     NewWorld
URBAN             68.7895      30.0667      56.3810
BIRTH_RT          12.5789      42.7333      26.9524
DEATH_RT          10.1053      13.4000       7.4762
BABYMORT           7.8947     102.3333      42.8095
GDP_CAP            4.0431       2.7640       3.2139
EDUC              21.5275       6.4156       8.9619
HEALTH            21.9537       3.1937       6.8898
MIL               15.9751       7.5431       6.0903
B_TO_D             1.2658       3.5472       3.9509
LIFEEXPM          72.3684      54.4000      66.6190
LIFEEXPF          79.5263      57.1333      71.5714
LITERACY          97.5263      36.7333      79.9571

   Variable    F-to-remove  Tolerance |    Variable     F-to-enter  Tolerance
--------------------------------------+--------------------------------------
                                      |  6 URBAN             23.20   1.000000
                                      |  8 BIRTH_RT         103.50   1.000000
                                      | 10 DEATH_RT          14.41   1.000000
                                      | 12 BABYMORT          53.62   1.000000
                                      | 16 GDP_CAP           59.12   1.000000
                                      | 19 EDUC              27.12   1.000000
                                      | 21 HEALTH            49.62   1.000000
                                      | 23 MIL               19.30   1.000000
                                      | 34 B_TO_D            31.54   1.000000
                                      | 30 LIFEEXPM          37.08   1.000000
                                      | 31 LIFEEXPF          50.30   1.000000
                                      | 32 LITERACY          63.64   1.000000


Using commands, type STEP / AUTO.


**************** Step 1 -- Variable BIRTH_RT Entered ****************

Between groups F-matrix -- df =     1    52
----------------------------------------------
                   Europe      Islamic     NewWorld
Europe                0.0
Islamic          206.5877          0.0
NewWorld          55.8562      59.0625          0.0

   Variable    F-to-remove  Tolerance |    Variable     F-to-enter  Tolerance
--------------------------------------+--------------------------------------
 8 BIRTH_RT         103.50   1.000000 |  6 URBAN              1.26   0.724555
                                      | 10 DEATH_RT          19.41   0.686118
                                      | 12 BABYMORT           2.13   0.443802
                                      | 16 GDP_CAP            4.56   0.581395
                                      | 19 EDUC               5.12   0.831381
                                      | 21 HEALTH             9.52   0.868614
                                      | 23 MIL                8.55   0.907501
                                      | 34 B_TO_D            14.94   0.987994
                                      | 30 LIFEEXPM           4.31   0.437850
                                      | 31 LIFEEXPF           3.58   0.371618
                                      | 32 LITERACY          10.32   0.324635

**************** Step 2 -- Variable DEATH_RT Entered ****************

Between groups F-matrix -- df =     2    51
----------------------------------------------
                   Europe      Islamic     NewWorld
Europe                0.0
Islamic          120.1297          0.0
NewWorld          59.7595      29.7661          0.0

   Variable    F-to-remove  Tolerance |    Variable     F-to-enter  Tolerance
--------------------------------------+--------------------------------------
 8 BIRTH_RT         118.41   0.686118 |  6 URBAN              0.07   0.694384
10 DEATH_RT          19.41   0.686118 | 12 BABYMORT           1.83   0.279580
                                      | 16 GDP_CAP            7.88   0.520784
                                      | 19 EDUC               5.03   0.812622
                                      | 21 HEALTH             6.47   0.864170
                                      | 23 MIL               13.21   0.789555
                                      | 34 B_TO_D             0.82   0.186108
                                      | 30 LIFEEXPM           3.34   0.158185
                                      | 31 LIFEEXPF           5.20   0.120507
                                      | 32 LITERACY           2.22   0.265285

**************** Step 3 -- Variable MIL Entered ****************

Between groups F-matrix -- df =     3    50
----------------------------------------------
                   Europe      Islamic     NewWorld
Europe                0.0
Islamic           80.7600          0.0
NewWorld          55.6502      24.6740          0.0

   Variable    F-to-remove  Tolerance |    Variable     F-to-enter  Tolerance
--------------------------------------+--------------------------------------
 8 BIRTH_RT          77.85   0.683054 |  6 URBAN              3.87   0.509585
10 DEATH_RT          25.39   0.596945 | 12 BABYMORT           1.02   0.258829
23 MIL               13.21   0.789555 | 16 GDP_CAP            0.67   0.304330
                                      | 19 EDUC               0.01   0.534243
                                      | 21 HEALTH             1.24   0.652294
                                      | 34 B_TO_D             0.81   0.186064
                                      | 30 LIFEEXPM           0.28   0.135010
                                      | 31 LIFEEXPF           1.34   0.091911
                                      | 32 LITERACY           3.51   0.252509


When using commands, type STOP.


Variable      F-to-enter   Number of
entered or    or           variables    Wilks'     Approx.
removed       F-to-remove  in model     lambda     F-value    df1    df2   p-tail
------------  -----------  ---------  ---------  ---------  -----  -----  -------
BIRTH_RT          103.495          1     0.2008   103.4953      2     52  0.00000
DEATH_RT           19.406          2     0.1140    50.0200      4    102  0.00000
MIL                13.212          3     0.0746    44.3576      6    100  0.00000

Classification matrix (cases in row categories classified into columns)
---------------------
                   Europe      Islamic     NewWorld     %correct
Europe                 19            0            0          100
Islamic                 0           13            2           87
NewWorld                2            2           17           81

Total                  21           15           19           89

Jackknifed classification matrix
--------------------------------
                   Europe      Islamic     NewWorld     %correct
Europe                 19            0            0          100
Islamic                 0           13            2           87
NewWorld                2            3           16           76

Total                  21           16           18           87

     Eigen         Canonical       Cumulative proportion
     values       correlations      of total dispersion
    ---------     ------------     ---------------------
        5.247            0.916                     0.821
        1.146            0.731                     1.000

Canonical scores of group means
-------------------------------
Europe             -2.938         .409
Islamic             2.481        1.243
NewWorld             .886       -1.258

Canonical Scores Plot


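[Figure: canonical scores plot with FACTOR(1) on the horizontal axis and FACTOR(2) on
the vertical axis; points and confidence ellipses are shown for each GROUP (Europe,
Islamic, NewWorld).]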


From the panel of Group means, note that, on the average, the percentage of the
population living in cities (URBAN) is 68.8% in Europe, 30.1% in Islamic nations, and
56.4% in the New World. The LITERACY rates for these same groups are 97.5%,
36.7%, and 80.0%, respectively.
After the group means, you will find the F-to-enter statistics for each variable not
in the functions. When no variables are in the model, each F is the same as that for a
one-way analysis of variance. Thus, group differences are the strongest for BIRTH_RT
(F = 103.5) and weakest for DEATH_RT (F = 14.41). At later steps, each F
corresponds to the F for a one-way analysis of covariance where the covariates are the
variables already included.
At step 1, SYSTAT enters BIRTH_RT because its F-to-enter is largest in the last
panel and now displays the same F in the F-to-remove panel. BIRTH_RT is correlated
with several candidate variables, so notice how their F-to-enter values drop when
BIRTH_RT enters (for example, for GDP_CAP, from 59.1 to 4.6). DEATH_RT now
has the highest F-to-enter, so SYSTAT will enter it at step 2. From the between-groups
F-matrix, note that when BIRTH_RT is used alone, Europe and Islamic countries are
the groups that differ most (206.6), and Europe and the New World are the groups that
differ least (55.9).
After DEATH_RT enters, the F-to-enter for MIL (money spent per person on the
military) is largest, so SYSTAT enters it at step 3. The SYSTAT default limit for
F-to-enter values is 4.0. After step 3, no remaining variable has an F-to-enter above
the limit, so the stepping stops.
Also, all F-to-remove values are greater than 3.9, so no variables are removed.
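
The entry decision at each automatic forward step can be sketched as follows. This is
only the selection rule described in the text (largest F-to-enter, provided it exceeds
the Enter limit); it does not compute the F statistics, and it ignores the Tolerance
check and the removal test. The function name is hypothetical; the F values are copied
from the output.

```python
def next_to_enter(f_to_enter, enter_limit=4.0):
    """Return the candidate with the largest F-to-enter above the limit, or None."""
    name = max(f_to_enter, key=f_to_enter.get)
    return name if f_to_enter[name] > enter_limit else None


# F-to-enter values before any variable is in the model.
step_0 = {"URBAN": 23.20, "BIRTH_RT": 103.50, "DEATH_RT": 14.41, "BABYMORT": 53.62,
          "GDP_CAP": 59.12, "EDUC": 27.12, "HEALTH": 49.62, "MIL": 19.30,
          "B_TO_D": 31.54, "LIFEEXPM": 37.08, "LIFEEXPF": 50.30, "LITERACY": 63.64}

# F-to-enter values after BIRTH_RT, DEATH_RT, and MIL have entered.
after_step_3 = {"URBAN": 3.87, "BABYMORT": 1.02, "GDP_CAP": 0.67, "EDUC": 0.01,
                "HEALTH": 1.24, "B_TO_D": 0.81, "LIFEEXPM": 0.28, "LIFEEXPF": 1.34,
                "LITERACY": 3.51}

first_entry = next_to_enter(step_0)
fourth_entry = next_to_enter(after_step_3)
print(first_entry, fourth_entry)
```

URBAN comes closest to entering after step 3 (3.87), but it falls just short of the
default limit of 4.0, which is why the model stops at three variables.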
The summary table contains one row for each variable moved into the model. The
F-to-enter (F-to-remove) is printed for each, along with Wilks' lambda and its
approximate F statistic, numerator and denominator degrees of freedom, and tail
probability.
After the summary table, SYSTAT prints the classification matrices. From the
biased estimate in the first matrix, our three-variable rule classifies 89% of the
countries correctly. For the jackknifed results, this percentage drops to 87%. All of the
European nations are classified correctly (100%), while almost one-fourth of the New
World countries are misclassified (two as Europe and three as Islamic). These
countries can be identified by using MAHAL; the posterior probability for each case
belonging to each group is printed. You will find, for example, that Canada is
misclassified as European and that Haiti and Bolivia are misclassified as Islamic.
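
The percentages quoted above follow directly from the counts in the two matrices. A
minimal sketch (the function name is ours; the counts are copied from the output):

```python
def percent_correct(matrix):
    """Row-wise and overall percent correct for a square classification matrix.

    Rows are the actual groups and columns are the assigned groups, so the
    diagonal holds the correctly classified cases.
    """
    per_group = [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]
    n_cases = sum(sum(row) for row in matrix)
    overall = 100.0 * sum(row[i] for i, row in enumerate(matrix)) / n_cases
    return per_group, overall


# Rows and columns in the order Europe, Islamic, NewWorld.
classification = [[19, 0, 0], [0, 13, 2], [2, 2, 17]]
jackknifed = [[19, 0, 0], [0, 13, 2], [2, 3, 16]]

rows_1, overall_1 = percent_correct(classification)
rows_2, overall_2 = percent_correct(jackknifed)
print([round(p) for p in rows_1], round(overall_1))
print([round(p) for p in rows_2], round(overall_2))
```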
If you focus on the canonical results, you notice that the first canonical variable
accounts for 82.1% of the dispersion, and in the Canonical scores of group means panel,
the groups are ordered from left to right: Europe, New World, and then Islamic. The
second canonical variable contrasts Islamic versus New World (1.243 versus -1.258).


In the canonical variable plot, the European nations (on the left) are well separated
from the other groups. The plus sign (+) next to the European confidence ellipse is
Canada. If you are unsure about which ellipse corresponds to what group, look at the
Canonical scores of group means.

Example 3
Discriminant Analysis Using Automatic Backward Stepping
It is possible that classification rules for other subsets of the variables perform better
than that found using forward stepping, especially when there are correlations among
the variables. We try backward stepping. The input is:
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy
PRINT SHORT / CFUNC
IDVAR = country$
START / BACKWARD
STEP / AUTO FENTER=4 FREMOVE=3.9
PRINT / TRACES CDFUNC SCDFUNC
STOP

Notice that we request STEP after an initial report and PRINT and STOP later.
The output follows:
Between groups F-matrix -- df =    12    41
----------------------------------------------
                   Europe      Islamic     NewWorld
Europe                0.0
Islamic           25.3059          0.0
NewWorld          18.0596       7.3754          0.0

Classification functions
------------------------
                   Europe      Islamic     NewWorld
Constant       -4408.4004   -4396.8904   -4408.5297
URBAN             -2.4175      -2.3572      -2.2871
BIRTH_RT          41.9790      43.1675      43.1322
DEATH_RT          50.0202      48.1539      48.1950
BABYMORT           9.3190       9.3806       9.3461
GDP_CAP          243.6686     234.5165     237.0805
EDUC               2.0078       4.0450       3.4276
HEALTH           -17.9706     -19.8527     -19.3068
MIL               -9.8420     -10.1746     -10.6076
B_TO_D           -59.6547     -62.2446     -61.8195
LIFEEXPM          -9.8216      -9.1537      -9.4952
LIFEEXPF          93.5933      93.0934      93.4108
LITERACY           7.5909       7.5834       7.7178

   Variable    F-to-remove  Tolerance |    Variable     F-to-enter  Tolerance
--------------------------------------+--------------------------------------
 6 URBAN              2.17   0.436470 |
 8 BIRTH_RT           2.01   0.059623 |
10 DEATH_RT           2.26   0.091463 |
12 BABYMORT           0.10   0.083993 |
16 GDP_CAP            0.62   0.143526 |
19 EDUC               6.12   0.065095 |
21 HEALTH             5.36   0.083198 |
23 MIL                7.11   0.323519 |
34 B_TO_D             0.55   0.136148 |
30 LIFEEXPM           0.26   0.036088 |
31 LIFEEXPF           0.07   0.012280 |
32 LITERACY           1.45   0.177756 |

Using commands, type STEP / AUTO.


**************** Step 1 -- Variable LIFEEXPF Removed ****************

Between groups F-matrix -- df =    11    42
----------------------------------------------
                   Europe      Islamic     NewWorld
Europe                0.0
Islamic           28.2000          0.0
NewWorld          20.1693       8.2086          0.0

Classification functions
------------------------
                   Europe      Islamic     NewWorld
Constant       -2135.2865   -2147.9924   -2144.2709
URBAN             -0.8690      -0.8170      -0.7416
BIRTH_RT          20.1471      21.4523      21.3429
DEATH_RT          29.3876      27.6314      27.6026
BABYMORT           3.7505       3.8419       3.7885
GDP_CAP          292.1240     282.7130     285.4413
EDUC              -3.8832      -1.8145      -2.4518
HEALTH            -5.8347      -7.7816      -7.1945
MIL               -6.9769      -7.3247      -7.7480
B_TO_D           -13.7461     -16.5811     -16.0004
LIFEEXPM          32.7200      33.1607      32.9634
LITERACY           5.5340       5.5374       5.6648

   Variable    F-to-remove  Tolerance |    Variable     F-to-enter  Tolerance
--------------------------------------+--------------------------------------
 6 URBAN              2.45   0.466202 | 31 LIFEEXPF           0.07   0.012280
 8 BIRTH_RT           3.04   0.077495 |
10 DEATH_RT           2.45   0.100658 |
12 BABYMORT           0.41   0.140589 |
16 GDP_CAP            0.68   0.144854 |
19 EDUC               6.71   0.066537 |
21 HEALTH             6.78   0.092071 |
23 MIL                7.39   0.328943 |
34 B_TO_D             0.70   0.148030 |
30 LIFEEXPM           0.24   0.077817 |
32 LITERACY           1.48   0.185492 |


(We omit the output for steps 2 through 6.)


**************** Step 7 -- Variable URBAN Removed ****************

Between groups F-matrix -- df =     5    48
----------------------------------------------
                   Europe      Islamic     NewWorld
Europe                0.0
Islamic           61.5899          0.0
NewWorld          40.9350      15.6004          0.0

Classification functions
------------------------
                   Europe      Islamic     NewWorld
Constant         -22.4825     -38.4306     -17.6982
BIRTH_RT           0.3003       1.3372       0.9382
DEATH_RT           1.4220       0.6592       0.2591
EDUC              -0.1787       1.3011       0.8506
HEALTH             0.7483      -0.8816      -0.3976
MIL                0.7537       0.4181       0.1794

   Variable    F-to-remove  Tolerance |    Variable     F-to-enter  Tolerance
--------------------------------------+--------------------------------------
 8 BIRTH_RT          27.89   0.622699 |  6 URBAN              3.65   0.504724
10 DEATH_RT          15.51   0.583392 | 12 BABYMORT           1.12   0.243722
19 EDUC               5.20   0.083925 | 16 GDP_CAP            1.20   0.171233
21 HEALTH             6.67   0.102470 | 34 B_TO_D             1.24   0.180347
23 MIL                7.42   0.501019 | 30 LIFEEXPM           0.02   0.123573
                                      | 31 LIFEEXPF           0.49   0.076049
                                      | 32 LITERACY           3.42   0.250341

Variable      F-to-enter   Number of
entered or    or           variables    Wilks'     Approx.
removed       F-to-remove  in model     lambda     F-value    df1    df2   p-tail
------------  -----------  ---------  ---------  ---------  -----  -----  -------
LIFEEXPF            0.068         11     0.0405    15.1458     22     84  0.00000
LIFEEXPM            0.237         10     0.0410    16.9374     20     86  0.00000
BABYMORT            0.219          9     0.0414    19.1350     18     88  0.00000
B_TO_D              0.849          8     0.0430    21.4980     16     90  0.00000
GDP_CAP             1.429          7     0.0457    24.1542     14     92  0.00000
LITERACY            2.388          6     0.0505    27.0277     12     94  0.00000
URBAN               3.655          5     0.0583    30.1443     10     96  0.00000

Classification matrix (cases in row categories classified into columns)
---------------------
                   Europe      Islamic     NewWorld     %correct
Europe                 19            0            0          100
Islamic                 0           13            2           87
NewWorld                1            2           18           86

Total                  20           15           20           91

Jackknifed classification matrix
--------------------------------
                   Europe      Islamic     NewWorld     %correct
Europe                 19            0            0          100
Islamic                 0           13            2           87
NewWorld                1            2           18           86

Total                  20           15           20           91

     Eigen         Canonical       Cumulative proportion
     values       correlations      of total dispersion
    ---------     ------------     ---------------------
        6.984            0.935                     0.859
        1.147            0.731                     1.000


Using commands, type PRINT / TRACES CDFUNC SCDFUNC, then STOP.


Wilks' lambda=                 0.058
  Approx.F=                   30.144    df=   10,   96    p-tail=  0.0000

Pillai's trace=                1.409
  Approx.F=                   23.360    df=   10,   98    p-tail=  0.0000

Lawley-Hotelling trace=        8.131
  Approx.F=                   38.215    df=   10,   94    p-tail=  0.0000

Canonical discriminant functions
--------------------------------
                        1            2
Constant          -1.9836      -5.4022
URBAN                   .            .
BIRTH_RT           0.1603       0.0414
DEATH_RT          -0.1588       0.2771
BABYMORT                .            .
GDP_CAP                 .            .
EDUC               0.2358       0.0063
HEALTH            -0.2604      -0.0015
MIL               -0.0736       0.1497
B_TO_D                  .            .
LIFEEXPM                .            .
LIFEEXPF                .            .
LITERACY                .            .

Canonical discriminant functions -- standardized by within variances
--------------------------------------------------------------------
                        1            2
URBAN                   .            .
BIRTH_RT           0.9737       0.2512
DEATH_RT          -0.5188       0.9050
BABYMORT                .            .
GDP_CAP                 .            .
EDUC               1.5574       0.0413
HEALTH            -1.5572      -0.0091
MIL               -0.3910       0.7952
B_TO_D                  .            .
LIFEEXPM                .            .
LIFEEXPF                .            .
LITERACY                .            .

Canonical scores of group means
-------------------------------
Europe             -3.389         .410
Islamic             2.864        1.243
NewWorld            1.020       -1.259


Before stepping starts, SYSTAT uses all candidate variables to compute classification
functions. The output includes the coefficients for these functions used to classify
cases into groups. A variable is omitted only if it fails the Tolerance limit. For each
case, SYSTAT computes three functions. The first is:
-4408.4 - 2.417*urban + 41.979*birth_rt + ... + 7.591*literacy

Each case is assigned to the group with the largest value.
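
The same rule applies at any step. As a small illustration, the sketch below evaluates
the five-variable classification functions from the final (step 7) panel of the output
and assigns a case to the group with the largest value. The two test cases are simply
the Europe and Islamic group means from the forward stepping example (with EDUC, HEALTH,
and MIL already square-root transformed); the function and variable names are ours, not
SYSTAT's.

```python
# Coefficients copied from the step 7 "Classification functions" panel.
CLASSIFICATION_FUNCTIONS = {
    "Europe": {"constant": -22.4825, "birth_rt": 0.3003, "death_rt": 1.4220,
               "educ": -0.1787, "health": 0.7483, "mil": 0.7537},
    "Islamic": {"constant": -38.4306, "birth_rt": 1.3372, "death_rt": 0.6592,
                "educ": 1.3011, "health": -0.8816, "mil": 0.4181},
    "NewWorld": {"constant": -17.6982, "birth_rt": 0.9382, "death_rt": 0.2591,
                 "educ": 0.8506, "health": -0.3976, "mil": 0.1794},
}


def classification_scores(case):
    """Multiply each coefficient by its variable and add in the constant."""
    scores = {}
    for group, coef in CLASSIFICATION_FUNCTIONS.items():
        scores[group] = coef["constant"] + sum(
            coef[name] * value for name, value in case.items())
    return scores


def classify(case):
    """Assign the case to the group whose function yields the largest value."""
    scores = classification_scores(case)
    return max(scores, key=scores.get)


# Illustrative cases: the Europe and Islamic group means.
europe_means = {"birth_rt": 12.5789, "death_rt": 10.1053, "educ": 21.5275,
                "health": 21.9537, "mil": 15.9751}
islamic_means = {"birth_rt": 42.7333, "death_rt": 13.4000, "educ": 6.4156,
                 "health": 3.1937, "mil": 7.5431}

print(classify(europe_means), classify(islamic_means))
```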


Tolerance measures the correlation of a candidate variable with the variables
included in the model, and its values range from 0 to 1.0. If a variable is highly
correlated with one or more of the others, the value of Tolerance is very small and the
resulting estimates of the discriminant function coefficients may be very unstable. To
avoid a loss of accuracy in the matrix inversion computations, rarely should you set the
value of this limit to a lower value (the default is 0.001). LIFEEXPF, female life
expectancy, has a very low Tolerance value, so it may be redundant or highly
correlated with another variable or a linear combination of other variables. The
Tolerance value of LIFEEXPM, male life expectancy, is also low; these two measures
of life expectancy may be highly correlated with one another. Notice also that the value
for BIRTH_RT is very low (0.059623) and its F-to-remove value is 2.01; its F-to-enter
at step 0 in the forward stepping example was 103.5.
At step 7, no variable has an F-to-remove value less than 3.9, so the stepping stops.
The final model found by backward stepping includes five variables: BIRTH_RT,
DEATH_RT, EDUC, HEALTH, and MIL. We are not happy, however, with the low
Tolerance values for two of these variables. The model found via automatic forward
stepping did not include EDUC or HEALTH (their F-to-enter statistics at step 3 are
0.01 and 1.24, respectively). URBAN and LITERACY appear more likely candidates,
but their Fs are still less than 4.0.
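
The removal decision in automatic backward stepping mirrors the forward rule. The
sketch below shows only that decision (smallest F-to-remove, provided it is below the
Remove limit); it does not recompute the F statistics and it ignores the re-entry
test. The function name is hypothetical; the F values are copied from the output.

```python
def next_to_remove(f_to_remove, remove_limit=3.9):
    """Return the variable with the smallest F-to-remove below the limit, or None."""
    name = min(f_to_remove, key=f_to_remove.get)
    return name if f_to_remove[name] < remove_limit else None


# F-to-remove values with all twelve variables in the model.
step_0 = {"URBAN": 2.17, "BIRTH_RT": 2.01, "DEATH_RT": 2.26, "BABYMORT": 0.10,
          "GDP_CAP": 0.62, "EDUC": 6.12, "HEALTH": 5.36, "MIL": 7.11,
          "B_TO_D": 0.55, "LIFEEXPM": 0.26, "LIFEEXPF": 0.07, "LITERACY": 1.45}

# F-to-remove values for the five variables left after step 7.
step_7 = {"BIRTH_RT": 27.89, "DEATH_RT": 15.51, "EDUC": 5.20, "HEALTH": 6.67,
          "MIL": 7.42}

first_removed = next_to_remove(step_0)
removed_after_step_7 = next_to_remove(step_7)
print(first_removed, removed_after_step_7)
```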
In both classification matrices, 91% of the countries are classified correctly using
the five-variable discriminant functions. This is a slight improvement over the threevariable model from the forward stepping example, where the percentages were 89%
for the first matrix and 87% for the jackknifed results. The improvement from 87% to
91% is because two New World countries are now classified correctly. We add two
variables and gain two correct classifications.
Wilks' lambda (or U statistic), a multivariate analysis of variance statistic that varies
between 0 and 1, tests the equality of group means for the variables in the discriminant
functions. Wilks' lambda is transformed to an approximate F statistic for comparison
with the F distribution. Here, the associated probability is less than 0.00005, indicating
a highly significant difference among the groups. The Lawley-Hotelling trace and its
F approximation are documented in Morrison (1976). When there are only two groups,
the Lawley-Hotelling trace and Wilks' lambda are equivalent. Pillai's trace and its
F approximation are taken from Pillai (1960).
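The transformation from Wilks' lambda to F can be sketched for the special case of three groups, where Rao's F transformation takes a simple closed form. The Python sketch below is an illustration only, not SYSTAT's implementation. It uses values printed in the interactive stepping example later in this chapter (lambda = 0.0680 with four variables in the model, df = 8 and 98); the case count of 55 is an assumption inferred from those degrees of freedom.

```python
import math


def wilks_to_f_three_groups(wilks_lambda, n_vars, n_cases):
    """Rao's F transformation of Wilks' lambda when there are three groups."""
    n_groups = 3
    error_df = n_cases - n_groups        # within-groups degrees of freedom
    m = error_df - n_vars + 1
    root = math.sqrt(wilks_lambda)
    f_value = (1.0 - root) / root * m / n_vars
    return f_value, 2 * n_vars, 2 * m


f_value, df1, df2 = wilks_to_f_three_groups(0.0680, n_vars=4, n_cases=55)
print(round(f_value, 2), df1, df2)
```

The result agrees with the printed Approx. F-value of 34.7213 to within the rounding of lambda.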
The canonical discriminant functions list the coefficients of the canonical variables
computed first for the data as input and then for the standardized values. For the
unstandardized data, the first canonical variable is:
1.984 + 0.160*birth_rt - 0.159*death_rt + 0.236*educ - 0.260*health - 0.074*mil

The coefficients are adjusted so that the overall mean of the corresponding scores is 0
and the pooled within-group variances are 1. After standardizing, the first canonical
variable is:
0.974*birth_rt - 0.519*death_rt + 1.557*educ - 1.557*health - 0.391*mil

Usually, one uses the latter set of coefficients to interpret what variables drive each
canonical variable. Here, EDUC and HEALTH, the variables with low tolerance
values, have the largest coefficients, and they appear to cancel one another. Also, in the
final model, the size of their F-to-remove values indicates they are the least useful
variables in the model. This indicates that we do not have an optimum set of variables.
These two variables contribute little alone, while together they enhance the separation
of the groups. This suggests that the difference between EDUC and HEALTH could be
a useful variable (for example, LET diff = educ - health). We did this, and the following
is the first canonical variable for standardized values (we omit the constant):
1.024*birth_rt - 0.539*death_rt - 0.480*mil + 0.553*diff

From the Canonical scores of group means for the first canonical variable, the groups
line up with Europe first, then New World in the middle, and Islamic on the right. In
the second dimension, DEATH_RT and MIL (military expenditures) appear to separate
Islamic and New World countries.

Mahalanobis Distances and Posterior Probabilities


Even if you have already specified PRINT=LONG, you must type PRINT / MAHAL to
obtain Mahalanobis distances. The output is:
Mahalanobis distance-square from group means and
Posterior probabilities for group membership

Priors =               .333         .333         .333
                     Europe      Islamic     NewWorld

Europe
------------
Ireland            3.0 1.00    33.7  .00    13.6  .00
Austria            4.0 1.00    37.7  .00    19.8  .00
Belgium     *       .3 1.00    42.7  .00    26.0  .00
Denmark            9.1 1.00    37.6  .00    24.9  .00
Finland            2.1 1.00    40.5  .00    22.3  .00
France             2.3 1.00    45.5  .00    29.1  .00
Greece             5.7 1.00    48.6  .00    28.3  .00
Switzerland       11.9 1.00    71.7  .00    48.3  .00
Spain              3.6 1.00    42.8  .00    18.9  .00
UK                 2.1 1.00    42.8  .00    29.9  .00
Italy               .6 1.00    44.7  .00    23.0  .00
Sweden             4.3 1.00    51.7  .00    35.9  .00
Portugal           3.6 1.00    40.4  .00    18.8  .00
Netherlands        2.1 1.00    43.9  .00    24.2  .00
WGermany           6.0 1.00    65.8  .00    45.5  .00
Norway             5.3 1.00    38.5  .00    28.4  .00
Poland             2.7  .99    29.5  .00    12.5  .01
Hungary            4.4 1.00    39.8  .00    24.3  .00
EGermany           8.0 1.00    42.4  .00    31.9  .00
Czechoslov         1.8 1.00    40.9  .00    25.1  .00

Islamic
------------
Gambia            43.2  .00     2.9 1.00    15.3  .00
Iraq              71.3  .00    23.5 1.00    41.7  .00
Pakistan          38.7  .00      .5  .98     8.6  .02
Bangladesh        37.2  .00     2.0  .91     6.8  .09
Ethiopia          40.5  .00     1.1  .99    10.0  .01
Guinea            41.2  .00     8.0 1.00    24.1  .00
Malaysia    -->   36.6  .00     7.7  .17     4.5  .83
Senegal           42.8  .00      .9  .98     9.1  .02
Mali              49.3  .00     5.5 1.00    23.5  .00
Libya             60.3  .00    15.6 1.00    30.1  .00
Somalia           50.0  .00     1.1 1.00    13.1  .00
Afghanistan *        .    .       .    .       .    .
Sudan             43.8  .00      .3  .99    10.1  .01
Turkey      -->   25.0  .00     7.2  .05     1.5  .95
Algeria           43.1  .00     4.1  .79     6.7  .21
Yemen             57.4  .00     3.1 1.00    23.2  .00

NewWorld
------------
Argentina         11.5  .03    19.8  .00     4.4  .97
Barbados          16.4  .00    20.9  .00     4.7 1.00
Bolivia     -->   27.7  .00     3.4  .56     3.8  .44
Brazil            27.4  .00    11.5  .00      .6 1.00
Canada      -->    6.7 1.00    35.9  .00    19.3  .00
Chile             21.1  .00    15.7  .00     1.5 1.00
Colombia          35.2  .00    13.9  .00     1.9 1.00
CostaRica         34.8  .00    21.1  .00     5.5 1.00
Venezuela         41.2  .00    13.4  .01     4.6  .99
DominicanR.       26.0  .00    13.2  .00     1.3 1.00
Uruguay           13.6  .07    22.9  .00     8.6  .93
Ecuador           32.8  .00     8.6  .02     1.0  .98
ElSalvador        35.3  .00     7.5  .07     2.5  .93
Jamaica           25.6  .00    19.1  .00     1.9 1.00
Guatemala         37.6  .00     4.5  .33     3.1  .67
Haiti       -->   37.9  .00     2.0  .99    10.6  .01
Honduras          39.8  .00     6.4  .27     4.5  .73
Trinidad          34.1  .00    11.4  .03     4.1  .97
Peru              20.2  .00    10.5  .02     2.4  .98
Panama            23.8  .00    16.5  .00     2.4 1.00
Cuba              12.0  .03    18.5  .00     5.1  .97

-->  case misclassified
*    case not used in computation

For each case (up to 250 cases), the Mahalanobis distance squared (D^2) is computed
to each group mean. The closer a case is to a particular mean, the more likely it belongs
to that group. The posterior probability for the distance of a case to a mean is the ratio
of EXP(-0.5 * D^2) for the group divided by the sum of EXP(-0.5 * D^2) for all groups
(prior probabilities, if specified, affect these computations).
An arrow (-->) marks incorrectly classified cases, and an asterisk (*) flags cases with
missing values. New World countries Bolivia and Haiti are classified as Islamic, and
Canada is classified as Europe. Note that even though an asterisk marks Belgium,
results are printed because the missing value is for URBAN, a candidate variable that
is not in the model. No results are printed for Afghanistan because MIL, a variable in
the final model, is missing.
You can identify cases with all large distances as outliers. A case can have a 1.0
probability of belonging to a particular group but still have a large distance. Look at
Iraq. It is correctly classified as Islamic, but its distance is 23.5. The distances in this
panel are distributed approximately as a chi-square with degrees of freedom equal to
the number of variables in the function.
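The posterior probability rule described above can be sketched in Python (this is an illustration, not SYSTAT code). The function name and the equal-priors default are assumptions made for the example; the three distances are the printed values for Bolivia.

```python
import math


def posterior_probabilities(d2, priors=None):
    """Posterior probability of each group from squared Mahalanobis distances.

    Each group's weight is prior * exp(-0.5 * D^2); the weights are then
    normalized so that they sum to 1.
    """
    if priors is None:
        priors = [1.0 / len(d2)] * len(d2)   # equal priors
    weights = [p * math.exp(-0.5 * d) for p, d in zip(priors, d2)]
    total = sum(weights)
    return [w / total for w in weights]


# Bolivia's printed distances to the Europe, Islamic, and NewWorld means.
bolivia = posterior_probabilities([27.7, 3.4, 3.8])
print([round(p, 2) for p in bolivia])
```

The Islamic and NewWorld probabilities come out at about 0.55 and 0.45, close to the printed .56 and .44; the small difference arises because the printed distances are rounded to one decimal place.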

Example 4
Discriminant Analysis Using Interactive Stepping
Automatic forward and backward stepping can produce different sets of predictor
variables, and still other subsets of the variables may perform equally well or
possibly better. Here we use interactive stepping to explore alternative sets of
variables.
Using the OURWORLD data, let's say we decide not to include birth and death
rates in the model because the rates are changing rapidly for several nations. We also
add the difference between EDUC and HEALTH as a candidate variable.
SYSTAT provides several ways to specify which variables to move into (or out of)
the model. The input is:
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LET diffrnce = educ - health
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy diffrnce
PRINT SHORT / SCDFUNC
GRAPH=NONE
START / BACK

After interpreting these commands and printing the output below, SYSTAT waits for
us to enter STEP instructions.
Between groups F-matrix -- df = 12 41
----------------------------------------------
                Europe     Islamic    NewWorld
Europe             0.0
Islamic        25.3059         0.0
NewWorld       18.0596      7.3754         0.0

   Variable  F-to-remove  Tolerance |    Variable   F-to-enter  Tolerance
------------------------------------+------------------------------------
 6 URBAN            2.17   0.436470 | 40 DIFFRNCE   0000000.00   0.000000
 8 BIRTH_RT         2.01   0.059623 |
10 DEATH_RT         2.26   0.091463 |
12 BABYMORT         0.10   0.083993 |
16 GDP_CAP          0.62   0.143526 |
19 EDUC             6.12   0.065095 |
21 HEALTH           5.36   0.083198 |
23 MIL              7.11   0.323519 |
34 B_TO_D           0.55   0.136148 |
30 LIFEEXPM         0.26   0.036088 |
31 LIFEEXPF         0.07   0.012280 |
32 LITERACY         1.45   0.177756 |

A summary of the STEP arguments (variable numbers are visible in the output)
follows:
a. STEP birth_rt death_rt       Remove two variables
b. STEP lifeexpf                Remove one variable
c. STEP                         Remove lifeexpm
d. STEP                         Remove babymort
e. STEP                         Remove urban
f. STEP                         Remove gdp_cap
g. STEP educ health diffrnce    Remove educ and health; add diffrnce
h. STEP +                       Enter gdp_cap
   STOP

Notice that the seventh STEP specification (g) removes EDUC and HEALTH and
enters DIFFRNCE. Remember, after the last step, type STOP for the canonical variable
results and other summaries.

Steps 1 and 2
Input:
STEP birth_rt death_rt

Output:
****************  Step 1 -- Variable BIRTH_RT Removed  ****************

Between groups F-matrix -- df = 11 42
----------------------------------------------
                Europe     Islamic    NewWorld
Europe             0.0
Islamic        26.3672         0.0
NewWorld       18.0391      8.2404         0.0

   Variable  F-to-remove  Tolerance |    Variable   F-to-enter  Tolerance
------------------------------------+------------------------------------
 6 URBAN            2.64   0.437926 |  8 BIRTH_RT         2.01   0.059623
10 DEATH_RT         2.00   0.092765 | 40 DIFFRNCE      0000.00   0.000000
12 BABYMORT         0.14   0.091364 |
16 GDP_CAP          1.40   0.150944 |
19 EDUC             5.99   0.065824 |
21 HEALTH           4.24   0.090886 |
23 MIL              5.92   0.384992 |
34 B_TO_D           0.35   0.329976 |
30 LIFEEXPM         0.42   0.036548 |
31 LIFEEXPF         0.96   0.015962 |
32 LITERACY         1.79   0.292005 |

****************  Step 2 -- Variable DEATH_RT Removed  ****************

Between groups F-matrix -- df = 10 43
----------------------------------------------
                Europe     Islamic    NewWorld
Europe             0.0
Islamic        27.8162         0.0
NewWorld       18.1733      9.2794         0.0

   Variable  F-to-remove  Tolerance |    Variable   F-to-enter  Tolerance
------------------------------------+------------------------------------
 6 URBAN            2.20   0.452548 |  8 BIRTH_RT         1.75   0.060472
12 BABYMORT         0.23   0.108992 | 10 DEATH_RT         2.00   0.092765
16 GDP_CAP          1.14   0.153540 | 40 DIFFRNCE         0.00   0.000000
19 EDUC             6.52   0.065850 |
21 HEALTH           6.28   0.093470 |
23 MIL              6.69   0.385443 |
34 B_TO_D           6.48   0.651944 |
30 LIFEEXPM         0.51   0.036592 |
31 LIFEEXPF         0.28   0.019231 |
32 LITERACY         1.89   0.312350 |

Step 3
Input:
STEP lifeexpf

Output:
****************  Step 3 -- Variable LIFEEXPF Removed  ****************

Between groups F-matrix -- df = 9 44
----------------------------------------------
                Europe     Islamic    NewWorld
Europe             0.0
Islamic        31.1645         0.0
NewWorld       20.4611     10.4752         0.0

   Variable  F-to-remove  Tolerance |    Variable   F-to-enter  Tolerance
------------------------------------+------------------------------------
 6 URBAN            2.27   0.472161 |  8 BIRTH_RT         1.88   0.086049
12 BABYMORT         0.79   0.147553 | 10 DEATH_RT         1.31   0.111768
16 GDP_CAP          1.80   0.171189 | 31 LIFEEXPF         0.28   0.019231
19 EDUC             7.51   0.066995 | 40 DIFFRNCE     00000.00   0.000000
21 HEALTH           7.37   0.095626 |
23 MIL              6.88   0.389511 |
34 B_TO_D           6.49   0.683545 |
30 LIFEEXPM         0.28   0.151179 |
32 LITERACY         2.44   0.338715 |

Steps 4 through 7
Input:
STEP

Output:
****************  Step 4 -- Variable LIFEEXPM Removed  ****************

Between groups F-matrix -- df = 8 45
----------------------------------------------
                Europe     Islamic    NewWorld
Europe             0.0
Islamic        35.3422         0.0
NewWorld       23.3116     11.9720         0.0

   Variable  F-to-remove  Tolerance |    Variable   F-to-enter  Tolerance
------------------------------------+------------------------------------
 6 URBAN            2.48   0.486188 |  8 BIRTH_RT         0.68   0.138508
12 BABYMORT         0.52   0.249802 | 10 DEATH_RT         1.38   0.182210
16 GDP_CAP          1.71   0.173599 | 30 LIFEEXPM         0.28   0.151179
19 EDUC             7.32   0.069441 | 31 LIFEEXPF         0.04   0.079455
21 HEALTH           7.18   0.099905 | 40 DIFFRNCE       000.00   0.000000
23 MIL              7.05   0.391379 |
34 B_TO_D           9.06   0.769167 |
32 LITERACY         2.40   0.346292 |

(We omit steps 5, 6, and 7. Each step corresponds to a STEP command with no arguments.)

Steps 8, 9, and 10
Input:
STEP educ health diffrnce

Output:
****************  Step 8 -- Variable EDUC Removed  ****************

Between groups F-matrix -- df = 4 49
----------------------------------------------
                Europe     Islamic    NewWorld
Europe             0.0
Islamic        49.9302         0.0
NewWorld       34.1490     20.8722         0.0

   Variable  F-to-remove  Tolerance |    Variable   F-to-enter  Tolerance
------------------------------------+------------------------------------
21 HEALTH           2.44   0.652730 |  6 URBAN            2.32   0.520120
23 MIL              6.67   0.601236 |  8 BIRTH_RT         3.24   0.248104
34 B_TO_D          16.14   0.887452 | 10 DEATH_RT         0.40   0.241846
32 LITERACY        33.24   0.761872 | 12 BABYMORT         2.09   0.326834
                                    | 16 GDP_CAP          1.12   0.277122
                                    | 19 EDUC             5.14   0.083616
                                    | 30 LIFEEXPM         0.88   0.313546
                                    | 31 LIFEEXPF         2.03   0.250043
                                    | 40 DIFFRNCE         5.14   0.743192

****************  Step 9 -- Variable HEALTH Removed  ****************

Between groups F-matrix -- df = 3 50
----------------------------------------------
                Europe     Islamic    NewWorld
Europe             0.0
Islamic        61.6708         0.0
NewWorld       41.4085     28.1939         0.0

   Variable  F-to-remove  Tolerance |    Variable   F-to-enter  Tolerance
------------------------------------+------------------------------------
23 MIL             14.70   0.771975 |  6 URBAN            2.55   0.523182
34 B_TO_D          27.09   0.914822 |  8 BIRTH_RT         3.91   0.248706
32 LITERACY        52.35   0.805675 | 10 DEATH_RT         0.42   0.241913
                                    | 12 BABYMORT         3.11   0.337422
                                    | 16 GDP_CAP          3.02   0.391015
                                    | 19 EDUC             0.33   0.538428
                                    | 21 HEALTH           2.44   0.652730
                                    | 30 LIFEEXPM         1.58   0.327654
                                    | 31 LIFEEXPF         3.33   0.269779
                                    | 40 DIFFRNCE         6.98   0.772114

****************  Step 10 -- Variable DIFFRNCE Entered  ****************

Between groups F-matrix -- df = 4 49
----------------------------------------------
                Europe     Islamic    NewWorld
Europe             0.0
Islamic        60.8974         0.0
NewWorld       38.7925     22.4751         0.0

   Variable  F-to-remove  Tolerance |    Variable   F-to-enter  Tolerance
------------------------------------+------------------------------------
23 MIL             16.65   0.683968 |  6 URBAN            2.50   0.522963
34 B_TO_D          13.97   0.900149 |  8 BIRTH_RT         3.89   0.246110
32 LITERACY        47.38   0.792219 | 10 DEATH_RT         0.41   0.241913
40 DIFFRNCE         6.98   0.772114 | 12 BABYMORT         3.26   0.333341
                                    | 16 GDP_CAP          4.30   0.372308
                                    | 19 EDUC             0.94   0.514966
                                    | 21 HEALTH           0.94   0.628279
                                    | 30 LIFEEXPM         0.98   0.326826
                                    | 31 LIFEEXPF         2.40   0.269658

Step 11
Input:
STEP +

Output:
****************  Step 11 -- Variable GDP_CAP Entered  ****************

Between groups F-matrix -- df = 5 48
----------------------------------------------
                Europe     Islamic    NewWorld
Europe             0.0
Islamic        57.5419         0.0
NewWorld       35.7426     18.6879         0.0

   Variable  F-to-remove  Tolerance |    Variable   F-to-enter  Tolerance
------------------------------------+------------------------------------
16 GDP_CAP          4.30   0.372308 |  6 URBAN            2.72   0.513543
23 MIL              5.88   0.478530 |  8 BIRTH_RT         1.04   0.189556
34 B_TO_D           9.46   0.887953 | 10 DEATH_RT         1.00   0.215879
32 LITERACY        12.31   0.609614 | 12 BABYMORT         0.71   0.256567
40 DIFFRNCE         8.37   0.735173 | 19 EDUC             0.36   0.324618
                                    | 21 HEALTH           0.36   0.396047
                                    | 30 LIFEEXPM         0.04   0.259888
                                    | 31 LIFEEXPF         0.24   0.180725

Final Model
Input:
STOP

Output:
Variable      F-to-enter  Number of
entered or        or      variables     Wilks    Approx.
removed      F-to-remove   in model    lambda    F-value   df1   df2    p-tail
------------ ----------- ---------- --------- ---------- ----- ----- ---------
BIRTH_RT           2.011         11    0.0444    14.3085    22    84   0.00000
DEATH_RT           2.002         10    0.0486    15.2053    20    86   0.00000
LIFEEXPF           0.275          9    0.0492    17.1471    18    88   0.00000
LIFEEXPM           0.277          8    0.0498    19.5708    16    90   0.00000
BABYMORT           0.524          7    0.0510    22.5267    14    92   0.00000
URBAN              2.615          6    0.0568    25.0342    12    94   0.00000
GDP_CAP            3.583          5    0.0655    27.9210    10    96   0.00000
EDUC               5.143          4    0.0795    31.1990     8    98   0.00000
HEALTH             2.438          3    0.0874    39.7089     6   100   0.00000
DIFFRNCE           6.983          4    0.0680    34.7213     8    98   0.00000
GDP_CAP            4.299          5    0.0577    30.3710    10    96   0.00000

Classification matrix (cases in row categories classified into columns)
---------------------
              Europe   Islamic  NewWorld  %correct
Europe            19         0         0       100
Islamic            0        14         1        93
NewWorld           1         1        19        90

Total             20        15        20        95

Jackknifed classification matrix
--------------------------------
              Europe   Islamic  NewWorld  %correct
Europe            19         0         0       100
Islamic            0        12         3        80
NewWorld           1         3        17        81

Total             20        15        20        87

    Eigen       Canonical     Cumulative proportion
    values     correlations    of total dispersion
  ---------    ------------   ---------------------
      6.319           0.929                   0.822
      1.369           0.760                   1.000

Canonical discriminant functions -- standardized by within variances
--------------------------------------------------------------------
                   1         2
URBAN              .         .
BIRTH_RT           .         .
DEATH_RT           .         .
BABYMORT           .         .
GDP_CAP       0.6868    0.0377
EDUC               .         .
HEALTH             .         .
MIL           0.0676    0.8395
B_TO_D       -0.4461   -0.5037
LIFEEXPM           .         .
LIFEEXPF           .         .
LITERACY      0.3903   -0.8573
DIFFRNCE     -0.6378   -0.0291

Canonical scores of group means
-------------------------------
Europe         3.162      .535
Islamic       -2.890     1.281
NewWorld       -.796    -1.399

A summary of results for the models estimated by forward, backward, and interactive
stepping follows:
Model                                           % Correct   % Correct
                                                 (Class)    (Jackknife)

Forward (automatic)
1. BIRTH_RT DEATH_RT MIL                            89          87
Backward (automatic)
2. BIRTH_RT DEATH_RT MIL EDUC HEALTH                91          91
Interactive (ignoring BIRTH_RT and DEATH_RT)
3. MIL B_TO_D LITERACY                              84          84
4. MIL B_TO_D LITERACY EDUC HEALTH                  91          89
5. MIL B_TO_D LITERACY DIFFRNCE                     91          89
6. MIL B_TO_D LITERACY DIFFRNCE GDP_CAP             95          87

Notice that the largest difference between the two classification methods (95% versus
87%) occurs for the last model, which includes the most variables. A difference like
this one (8%) can indicate overfitting of correlated candidate variables. Since the
jackknifed results can still be overly optimistic, cross-validation should be considered.
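The percentages compared above come directly from the classification matrices: the correctly classified cases lie on the diagonal. A minimal Python sketch (not SYSTAT code; the function name is an assumption for illustration) using the two matrices printed for the final interactive model:

```python
def percent_correct(matrix):
    """Percentage of cases on the diagonal of a square classification matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


# Rows are actual groups, columns are predicted groups
# (Europe, Islamic, NewWorld), as printed for the final model.
classification = [[19, 0, 0], [0, 14, 1], [1, 1, 19]]
jackknifed = [[19, 0, 0], [0, 12, 3], [1, 3, 17]]

gap = percent_correct(classification) - percent_correct(jackknifed)
print(round(percent_correct(classification)),
      round(percent_correct(jackknifed)),
      round(gap, 1))
```

The gap between the two figures is the quantity the text treats as a warning sign of overfitting.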

Example 5
Contrasts
Contrasts are available with commands only. When you have specific hypotheses
about differences among particular groups, you can specify one or more contrasts to
direct the entry (or removal) of variables in the model.
According to the jackknifed classification results in the stepwise examples, the
European countries are always classified correctly (100% correct). All of the
misclassifications are New World countries classified as Islamic or vice versa. In order
to maximize the difference between the second (Islamic) and third groups (New
World), we specify contrast coefficients with commands:
CONTRAST [0 -1 1]

If we want to specify linear and quadratic contrasts across four groups, we could
specify:
CONTRAST [-3 -1 1 3; -1 1 1 -1]

or
CONTRAST [-3 -1  1  3
          -1  1  1 -1]
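As a quick check on contrast coefficients like those above, each row should sum to zero, and the linear and quadratic rows should be orthogonal to each other. A small Python sketch of that check (an illustration only; SYSTAT does not require you to run it):

```python
def is_contrast(coefficients):
    """A contrast's coefficients sum to zero."""
    return sum(coefficients) == 0


def dot(a, b):
    """Inner product; zero means the two contrasts are orthogonal."""
    return sum(x * y for x, y in zip(a, b))


linear = [-3, -1, 1, 3]
quadratic = [-1, 1, 1, -1]
islamic_vs_newworld = [0, -1, 1]

print(is_contrast(linear), is_contrast(quadratic), dot(linear, quadratic))
```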

Here, we use the first contrast and request interactive forward stepping. The input is:
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy
CONTRAST [0 -1 1]
PRINT / SHORT
START / FORWARD
STEP literacy
STEP mil
STEP urban
STOP

After viewing the results, remember to cancel the contrast if you plan to do other
discriminant analyses:
CONTRAST / CLEAR

The output follows:


   Variable  F-to-remove  Tolerance |    Variable   F-to-enter  Tolerance
------------------------------------+------------------------------------
                                    |  6 URBAN           21.87   1.000000
                                    |  8 BIRTH_RT        59.06   1.000000
                                    | 10 DEATH_RT        28.79   1.000000
                                    | 12 BABYMORT        44.12   1.000000
                                    | 16 GDP_CAP         14.32   1.000000
                                    | 19 EDUC             1.30   1.000000
                                    | 21 HEALTH           3.34   1.000000
                                    | 23 MIL              0.65   1.000000
                                    | 34 B_TO_D           1.12   1.000000
                                    | 30 LIFEEXPM        35.00   1.000000
                                    | 31 LIFEEXPF        43.16   1.000000
                                    | 32 LITERACY        64.84   1.000000

****************  Step 1 -- Variable LITERACY Entered  ****************

(We omit results for steps 1, 2, and 3.)

Variable      F-to-enter  Number of
entered or        or      variables     Wilks    Approx.
removed      F-to-remove   in model    lambda    F-value   df1   df2    p-tail
------------ ----------- ---------- --------- ---------- ----- ----- ---------
LITERACY          64.844          1    0.4450    64.8444     1    52   0.00000
MIL                9.963          2    0.3723    42.9917     2    51   0.00000
URBAN              2.953          3    0.3515    30.7433     3    50   0.00000

Classification matrix (cases in row categories classified into columns)
---------------------
              Europe   Islamic  NewWorld  %correct
Europe            18         0         1        95
Islamic            0        14         1        93
NewWorld           2         3        16        76

Total             20        17        18        87

Jackknifed classification matrix
--------------------------------
              Europe   Islamic  NewWorld  %correct
Europe            18         0         1        95
Islamic            0        14         1        93
NewWorld           2         3        16        76

Total             20        17        18        87

    Eigen       Canonical     Cumulative proportion
    values     correlations    of total dispersion
  ---------    ------------   ---------------------
      1.845           0.805                   1.000

Canonical scores of group means
-------------------------------
Europe          .882
Islamic       -2.397
NewWorld        .914

Compare the F-to-enter values with those in the forward stepping example. The
statistics here indicate that for the economic variables (GDP_CAP, EDUC, HEALTH,
and MIL), differences between the second and third groups are much smaller than those
when European countries are included.
The Jackknifed classification matrix indicates that when LITERACY, MIL, and
URBAN are used, 87% of the countries are classified correctly. This is the same
percentage correct as in the forward stepping example for the model with BIRTH_RT,
DEATH_RT, and MIL. Here, however, one fewer Islamic country is misclassified, and
one European country is now classified incorrectly.
When you look at the canonical results, you see that because a single contrast has
one degree of freedom, only one dimension is defined; that is, there is only one
eigenvalue and one canonical variable.

Example 6
Quadratic Model
One of the assumptions necessary for linear discriminant analysis is equality of
covariance matrices. Within-group scatterplot matrices (SPLOMs) provide a picture
of how measures co-vary. Here we add 85% ellipses of concentration to enhance our
view of the bivariate relations. Since our sample sizes do not differ markedly (15 to 21
countries per group), the ellipses for each pair of variables should have approximately
the same shape and tilt across groups if the equality of covariance assumption holds.
The input is:
DISCRIM
USE ourworld
LET(educ, health, mil) = SQR(@)
STAND
SPLOM birth_rt death_rt educ health mil / HALF ROW=1,
GROUP=group$ ELL=.85 DENSITY=NORMAL

[Figure: within-group scatterplot matrices (SPLOMs) of BIRTH_RT, DEATH_RT, EDUC, HEALTH, and MIL, one panel each for the NewWorld, Islamic, and Europe groups]

Because the length, width, and tilt of the ellipses for most pairs of variables vary
markedly across groups, the assumption of equal covariance matrices has not been met.
Fortunately, the quadratic model does not require equality of covariances. However,
it has a different problem: it requires a larger minimum sample size than that needed
for the linear model. For five variables, for example, the linear and quadratic models,
respectively, for each group are:

f = a + b*x1 + c*x2 + d*x3 + e*x4 + f*x5

f = a + b*x1 + c*x2 + d*x3 + e*x4 + f*x5 + g*x1*x2 + ... + p*x4*x5 + q*x1^2 + ... + u*x5^2

So the linear model has six parameters to estimate for each group, and the quadratic
has 21. These parameters aren't all independent, so we don't require as many as
(3 * 21) cases for a quadratic fit.
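The parameter counts quoted above follow from counting the terms in each function: a constant, one linear term per variable, and, for the quadratic model, one term per pair of variables plus one squared term per variable. A short Python sketch of the count (function names are assumptions for illustration):

```python
def n_linear_params(p):
    """Constant plus one coefficient per variable."""
    return 1 + p


def n_quadratic_params(p):
    """Linear terms plus all cross-products and all squares."""
    cross_products = p * (p - 1) // 2
    squares = p
    return n_linear_params(p) + cross_products + squares


print(n_linear_params(5), n_quadratic_params(5))
```

For five variables this gives 6 and 21, matching the counts in the text; the quadratic count grows quickly as variables are added, which is why the quadratic model needs larger samples.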
In this example, we fit a quadratic model using the subset of variables identified in
the backward stepping example. Following this, we examine results for the subset
identified in the interactive stepping example before EDUC and HEALTH are
removed. The input is:
DISCRIM
USE ourworld
LET (educ, health, mil) = SQR(@)
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = birth_rt death_rt educ health mil / QUAD
PRINT SHORT / GCOV WCOV GCOR CFUNC MAHAL
IDVAR = country$
ESTIMATE
MODEL group = educ health mil b_to_d literacy / QUAD
ESTIMATE

Output for the first model follows:


Pooled within covariance matrix -- df= 53
------------------------------------------------
             BIRTH_RT   DEATH_RT       EDUC     HEALTH        MIL
BIRTH_RT      36.2044
DEATH_RT      10.8948    10.4790
EDUC         -16.1749    -7.2497    42.8231
HEALTH       -12.9261    -4.9333    36.5504    35.0939
MIL           -9.6390    -7.7297    22.0789    16.9130    27.7095

Group Europe covariance matrix
------------------------------------
             BIRTH_RT   DEATH_RT       EDUC     HEALTH        MIL
BIRTH_RT       1.7342
DEATH_RT       0.0184     1.8184
EDUC           2.0051     1.3359    47.1696
HEALTH         1.3943    -0.3625    44.2594    47.3538
MIL            0.8255     1.2689    15.2891    14.7387    15.7686

Group Europe correlation matrix
------------------------------------
             BIRTH_RT   DEATH_RT       EDUC     HEALTH        MIL
BIRTH_RT       1.0000
DEATH_RT       0.0104     1.0000
EDUC           0.2217     0.1442     1.0000
HEALTH         0.1539    -0.0391     0.9365     1.0000
MIL            0.1579     0.2370     0.5606     0.5394     1.0000

Ln( Det(COV of group Europe) )= 8.67105970

Group Europe discriminant function coefficients
-------------------------------------------------
             BIRTH_RT   DEATH_RT       EDUC     HEALTH        MIL   Constant
BIRTH_RT      -0.1588
DEATH_RT      -0.0487    -0.2038
EDUC           0.0498     0.1140    -0.0627
HEALTH        -0.0408    -0.1196     0.1162    -0.0617
MIL            0.0104     0.0367     0.0011     0.0144    -0.0249
Constant       4.1354     4.3332    -1.6504     1.7008    -0.0468   -51.1780

Group Islamic covariance matrix
------------------------------------
             BIRTH_RT   DEATH_RT       EDUC     HEALTH        MIL
BIRTH_RT      48.6381
DEATH_RT      27.5429    25.5429
EDUC         -19.8729   -20.3689    33.7508
HEALTH       -10.9262   -10.6192    18.8309    10.8603
MIL          -15.5902   -28.4991    36.6788    19.3235    66.0183

Group Islamic correlation matrix
------------------------------------
             BIRTH_RT   DEATH_RT       EDUC     HEALTH        MIL
BIRTH_RT       1.0000
DEATH_RT       0.7814     1.0000
EDUC          -0.4905    -0.6937     1.0000
HEALTH        -0.4754    -0.6376     0.9836     1.0000
MIL           -0.2751    -0.6940     0.7770     0.7217     1.0000

Ln( Det(COV of group Islamic) )= 10.34980794

Group Islamic discriminant function coefficients
-------------------------------------------------
             BIRTH_RT   DEATH_RT       EDUC     HEALTH        MIL   Constant
BIRTH_RT      -0.0236
DEATH_RT       0.0703    -0.0751
EDUC           0.0099    -0.0726    -0.3578
HEALTH        -0.0424     0.1331     1.0933    -0.8951
MIL            0.0261    -0.0469     0.0485    -0.0360    -0.0190
Constant       0.9492    -0.5959     1.2818    -0.9994    -0.3956   -20.4487

Group NewWorld covariance matrix
------------------------------------
             BIRTH_RT   DEATH_RT       EDUC     HEALTH        MIL
BIRTH_RT      60.2476
DEATH_RT       9.5738     8.1619
EDUC         -30.8573    -6.2226    45.0446
HEALTH       -27.9303    -5.2955    41.6304    40.4104
MIL          -15.4143    -1.7399    18.3092    17.2913    12.2372

Group NewWorld correlation matrix
------------------------------------
             BIRTH_RT   DEATH_RT       EDUC     HEALTH        MIL
BIRTH_RT       1.0000
DEATH_RT       0.4317     1.0000
EDUC          -0.5923    -0.3245     1.0000
HEALTH        -0.5661    -0.2916     0.9758     1.0000
MIL           -0.5677    -0.1741     0.7798     0.7776     1.0000

Ln( Det(COV of group NewWorld) )= 11.46371023

Group NewWorld discriminant function coefficients
-------------------------------------------------
             BIRTH_RT   DEATH_RT       EDUC     HEALTH        MIL   Constant
BIRTH_RT      -0.0077
DEATH_RT       0.0121    -0.0401
EDUC          -0.0079    -0.0213    -0.1260
HEALTH         0.0040     0.0114     0.2418    -0.1331
MIL           -0.0115     0.0196     0.0225     0.0210    -0.0580
Constant       0.4354     0.2643     0.8264    -0.6543     0.5229   -13.3124

Ln( Det(Pooled covariance matrix) )= 13.05914566

Test for equality of covariance matrices
Chisquare= 139.5799   df= 30   prob= 0.0000

Between groups F-matrix -- df = 5 49
---------------------------------------------
                Europe     Islamic    NewWorld
Europe             0.0
Islamic        64.4526         0.0
NewWorld       43.1437     15.9199         0.0

Mahalanobis distance-square from group means and
Posterior probabilities for group membership

Priors =               .333         .333         .333
                     Europe      Islamic     NewWorld

(We omit the distances and probabilities for the Europe and Islamic groups.)

NewWorld
------------
Argentina         48.1  .00    45.2  .00     3.8 1.00
Barbados          31.8  .00    65.0  .00     6.5 1.00
Bolivia     -->  369.3  .00     4.1  .65     4.2  .35
Brazil           133.3  .00     9.4  .03     1.1  .97
Canada      -->   14.5  .88   533.6  .00    15.7  .12
Chile             66.6  .00    16.6  .00     1.8 1.00
Colombia         161.1  .00     9.2  .04     1.8  .96
CostaRica        181.6  .00    93.2  .00     7.8 1.00
Venezuela        180.9  .00    16.6  .01     6.0  .99
DominicanR.      175.3  .00    21.5  .00     2.3 1.00
Uruguay           23.1  .00    38.4  .00     5.8 1.00
Ecuador          212.2  .00     5.8  .13      .8  .87
ElSalvador       312.9  .00    10.0  .03     2.0  .97
Jamaica           73.8  .00    20.2  .00     2.5 1.00
Guatemala        404.9  .00     6.0  .17     1.7  .83
Haiti       -->  792.1  .00     3.9  .99    11.2  .01
Honduras         395.9  .00    16.1  .00     4.1 1.00
Trinidad         164.1  .00    38.0  .00     5.6 1.00
Peru             167.6  .00    18.9  .00     4.9 1.00
Panama           133.9  .00    97.7  .00     3.4 1.00
Cuba              33.6  .00    39.7  .00     6.8 1.00

-->  case misclassified
*    case not used in computation

Classification matrix (cases in row categories classified into columns)
---------------------
              Europe   Islamic  NewWorld  %correct
Europe            20         0         0       100
Islamic            0        14         1        93
NewWorld           1         2        18        86

Total             21        16        19        93

Jackknifed classification matrix
--------------------------------
              Europe   Islamic  NewWorld  %correct
Europe            20         0         0       100
Islamic            0        13         2        87
NewWorld           1         2        18        86

Total             21        15        20        91

(We omit the eigenvalues, etc.)


Look at the quadratic function displayed at the beginning of this example. For our data,
the coefficients for the European group are:
a = -51.178, b = 4.135, c = 4.333, d = -1.650, e = 1.701, f = -0.047, g = -0.049, ...,
p = 0.014, q = -0.159, ..., and u = -0.025

or
f = -51.178 + 4.135*birth_rt + ... - 0.049*birth_rt*death_rt + ... - 0.159*birth_rt^2 + ...
- 0.025*mil^2

Similar functions exist for the other two groups.
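To make the layout of the coefficient panel concrete, the Python sketch below (not SYSTAT code) evaluates the European quadratic function from the printed values. It assumes, as the coefficients a through u listed above indicate, that the Constant row holds the linear coefficients, the lower triangle holds the coefficients of the squares and cross-products, and the corner entry is the constant term. The function name is hypothetical, and the input values in the usage lines are arbitrary, not taken from the OURWORLD file.

```python
# Order of variables: birth_rt, death_rt, educ, health, mil
CONSTANT = -51.1780
LINEAR = [4.1354, 4.3332, -1.6504, 1.7008, -0.0468]
# Lower triangle: row i, column j holds the coefficient of x[i] * x[j]
QUADRATIC = [
    [-0.1588],
    [-0.0487, -0.2038],
    [0.0498, 0.1140, -0.0627],
    [-0.0408, -0.1196, 0.1162, -0.0617],
    [0.0104, 0.0367, 0.0011, 0.0144, -0.0249],
]


def europe_quadratic_score(x):
    """Evaluate the printed quadratic function for the Europe group."""
    score = CONSTANT
    for i, xi in enumerate(x):
        score += LINEAR[i] * xi
        for j in range(i + 1):
            score += QUADRATIC[i][j] * xi * x[j]
    return score


# Arbitrary inputs chosen only to show which terms contribute.
print(europe_quadratic_score([0, 0, 0, 0, 0]))
print(round(europe_quadratic_score([1, 1, 0, 0, 0]), 4))
```

With all inputs at zero only the constant remains; with birth_rt = death_rt = 1, the result adds b, c, q, g, and the death_rt squared coefficient to the constant.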

The output also includes the chi-square test for equality of covariance matrices. The
results are highly significant ( p < 0.00005 ). Thus, we reject the hypothesis of equal
covariance matrices.
The Mahalanobis distances reveal that only four cases are misclassified: Turkey as
a New World country, Canada as European, and Haiti and Bolivia as Islamic.
The classification matrix indicates that 93% of the countries are correctly classified;
using the jackknifed results, the percentage drops to 91%. The latter percentage agrees
with that for the linear model using the same variables.
The output for the second model follows:
Between groups F-matrix -- df = 5 49
---------------------------------------------
                Europe     Islamic    NewWorld
Europe             0.0
Islamic        51.5154         0.0
NewWorld       33.6025     17.9915         0.0

Mahalanobis distance-square from group means and
Posterior probabilities for group membership

Priors =               .333         .333         .333
                     Europe      Islamic     NewWorld

NewWorld
------------
Argentina         30.9  .00    48.3  .00     4.3 1.00
Barbados          35.5  .00    68.7  .00     7.4 1.00
Bolivia          186.2  .00    10.1  .08     2.2  .92
Brazil           230.8  .00     8.1  .13     1.2  .87
Canada      -->   19.4  .74   524.3  .00    16.3  .26
Chile            144.3  .00    17.2  .00     1.6 1.00
Colombia         475.1  .00    29.8  .00     1.9 1.00
CostaRica        834.5  .00   190.5  .00    10.3 1.00
Venezuela        932.5  .00    83.6  .00     8.8 1.00
DominicanR.      267.4  .00    18.6  .00     2.0 1.00
Uruguay           15.2  .04    60.5  .00     3.9  .96
Ecuador          276.0  .00    11.5  .02     1.0  .98
ElSalvador       498.0  .00    17.6  .00     1.7 1.00
Jamaica          312.0  .00    15.5  .00      .7 1.00
Guatemala        501.3  .00     7.9  .24     2.5  .76
Haiti       -->  648.4  .00     4.6  .99    10.2  .01
Honduras         688.1  .00    31.8  .00     4.0 1.00
Trinidad         315.4  .00    43.1  .00     4.6 1.00
Peru             179.9  .00    16.3  .02     5.1  .98
Panama           411.0  .00   109.7  .00     3.6 1.00
Cuba              54.7  .00    54.5  .00     6.8 1.00

-->  case misclassified
*    case not used in computation

Classification matrix (cases in row categories classified into columns)
---------------------
              Europe   Islamic  NewWorld  %correct
Europe            20         0         0       100
Islamic            0        15         0       100
NewWorld           1         1        19        90

Total             21        16        19        96

Jackknifed classification matrix
--------------------------------
              Europe   Islamic  NewWorld  %correct
Europe            19         0         1        95
Islamic            0        14         1        93
NewWorld           1         1        19        90

Total             20        15        21        93

    Eigen       Canonical     Cumulative proportion
    values     correlations    of total dispersion
  ---------    ------------   ---------------------
      5.585           0.921                   0.801
      1.391           0.763                   1.000

Canonical scores of group means
-------------------------------
Europe        -2.916      .501
Islamic        2.725     1.322
NewWorld        .831    -1.422

This model does slightly better than the first one: the classification matrices here
show that 96% and 93%, respectively, are classified correctly. This is because Turkey
and Bolivia are classified correctly here and misclassified with the first model.

Example 7
Cross-Validation
At the end of the interactive stepping example, we reported the percentage of correct
classification for six models. The same sample was used to compute the estimates and
evaluate the success of the rules. We also reported results for the jackknifed
classification procedure that removes and replaces one case at a time. This approach,
however, may still give an overly optimistic picture. Ideally, we should try the rules on
a new sample and compare results with those for the original data. Since this usually
isn't practical, researchers often use a cross-validation procedure; that is, they
randomly split the data into two samples, use the first sample to estimate the
classification functions, and then use the resulting functions to classify the second
sample. The first sample is often called the learning sample and the second, the test
sample. The proportion of correct classification for the test sample is an empirical
measure for the success of the discrimination.
Cross-validation is easy to implement in discriminant analysis. Cases assigned a
weight of 0 are not used to estimate the discriminant functions but are classified into
groups. In this example, we generate a uniform random number (values range from 0
to 1.0) for each case, and when it is less than 0.65, the value 1.0 is stored in a new
weight variable named CASE_USE. If the random number is equal to or greater than
0.65, a 0 is placed in the weight variable. So, approximately 65% of the cases have a
weight of 1.0, and 35%, a weight of 0.
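The weighting scheme just described can be sketched outside SYSTAT as follows. This is a minimal illustration in Python with NumPy, not SYSTAT's own random number generator; the seed is arbitrary and the variable names are chosen only to mirror the text.

```python
import numpy as np

rng = np.random.default_rng(seed=12345)    # arbitrary seed, only for repeatability
n_cases = 56                               # size of the original sample in this example

urn = rng.uniform(0.0, 1.0, size=n_cases)  # one uniform random number per case
case_use = (urn < 0.65).astype(float)      # 1.0 = learning sample, 0.0 = test sample

learning = case_use == 1.0
test = case_use == 0.0
print(f"learning cases: {learning.sum()}, test cases: {test.sum()}")
```

Because the split is random, the learning sample holds roughly, not exactly, 65% of the cases.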


We now request a cross-validation for each of the following six models using the
OURWORLD data:
1. BIRTH_RT DEATH_RT MIL
2. BIRTH_RT DEATH_RT MIL EDUC HEALTH
3. MIL B_TO_D LITERACY
4. MIL B_TO_D LITERACY EDUC HEALTH
5. MIL B_TO_D LITERACY DIFFRNCE
6. MIL B_TO_D LITERACY DIFFRNCE GDP_CAP

Use interactive forward stepping to toggle variables in and out of the model subsets.
The input is:
DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LET diffrnce = educ - health
LET case_use = URN < .65
WEIGHT = case_use
LABEL group / 1=Europe, 2=Islamic, 3=NewWorld
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy diffrnce
PRINT NONE / FSTATS CLASS JCLASS
GRAPH NONE
START / FORWARD
STEP birth_rt death_rt mil
STEP educ health
STEP birth_rt death_rt educ health b_to_d literacy
STEP educ health
STEP educ health diffrnce
STEP gdp_cap
STOP

Here are the results from the first STEP after MIL enters:
   Variable    F-to-remove  Tolerance |    Variable     F-to-enter  Tolerance
--------------------------------------+-------------------------------------
 8 BIRTH_RT         57.86   0.640126  |  6 URBAN             7.41   0.415097
10 DEATH_RT         24.56   0.513344  | 12 BABYMORT          0.20   0.234804
23 MIL              13.43   0.760697  | 16 GDP_CAP           3.22   0.394128
                                      | 19 EDUC              2.00   0.673136
                                      | 21 HEALTH            4.68   0.828565
                                      | 34 B_TO_D            0.16   0.209796
                                      | 30 LIFEEXPM          0.42   0.136526
                                      | 31 LIFEEXPF          0.83   0.104360
                                      | 32 LITERACY          1.54   0.244547
                                      | 40 DIFFRNCE          5.23   0.784797


Classification matrix (cases in row categories classified into columns)

             Europe   Islamic  NewWorld  %correct
Europe           13         0         0       100
Islamic           0         8         1        89
NewWorld          0         1        15        94

Total            13         9        16        95

Classification of cases with zero weight or frequency

             Europe   Islamic  NewWorld  %correct
Europe            6         0         0       100
Islamic           0         4         2        67
NewWorld          2         0         3        60

Total             8         4         5        76

Jackknifed classification matrix

             Europe   Islamic  NewWorld  %correct
Europe           13         0         0       100
Islamic           0         8         1        89
NewWorld          1         1        14        88

Total            14         9        15        92

Three classification matrices result. The first presents results for the learning sample,
the cases with CASE_USE values of 1.0. Overall, 95% of these countries are classified
correctly. The sample size is 13 + 9 + 16 = 38, or 67.9% of the original sample of 56
countries. The second classification table reflects those cases not used to compute
estimates, the test sample. The percentage of correct classification drops to 76% for
these 17 countries. The final classification table presents the jackknifed results for the
learning sample. Notice that the percentages of correct classification are closer to those
for the learning sample than for the test sample.
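The overall percentages quoted above can be checked directly from the three classification matrices: the correctly classified cases lie on the diagonal. The following Python sketch is only an arithmetic check; the function name percent_correct is illustrative and not part of SYSTAT.

```python
import numpy as np

def percent_correct(matrix):
    """Overall percentage of cases on the diagonal of a classification matrix."""
    m = np.asarray(matrix, dtype=float)
    return 100.0 * np.trace(m) / m.sum()

# Rows are actual groups, columns are predicted groups (Europe, Islamic, NewWorld).
learning  = [[13, 0, 0], [0, 8, 1], [0, 1, 15]]   # cases with CASE_USE = 1
test      = [[6, 0, 0], [0, 4, 2], [2, 0, 3]]     # cases with CASE_USE = 0
jackknife = [[13, 0, 0], [0, 8, 1], [1, 1, 14]]   # learning sample, jackknifed

for name, m in [("learning", learning), ("test", test), ("jackknifed", jackknife)]:
    print(name, int(np.sum(m)), "cases,", round(percent_correct(m)), "% correct")
```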
Now we add the variables EDUC and HEALTH, with the following results:
   Variable    F-to-remove  Tolerance |    Variable     F-to-enter  Tolerance
--------------------------------------+-------------------------------------
 8 BIRTH_RT         21.13   0.588377  |  6 URBAN             6.50   0.414511
10 DEATH_RT         16.52   0.508827  | 12 BABYMORT          0.07   0.221475
19 EDUC              2.24   0.103930  | 16 GDP_CAP           3.06   0.242491
21 HEALTH            4.88   0.127927  | 34 B_TO_D            0.32   0.198963
23 MIL               5.68   0.567128  | 30 LIFEEXPM          0.05   0.117494
                                      | 31 LIFEEXPF          0.04   0.080161
                                      | 32 LITERACY          1.75   0.238831
                                      | 40 DIFFRNCE          0.00   0.000000

Classification matrix (cases in row categories classified into columns)

             Europe   Islamic  NewWorld  %correct
Europe           13         0         0       100
Islamic           0         8         1        89
NewWorld          0         1        15        94

Total            13         9        16        95


Classification of cases with zero weight or frequency

             Europe   Islamic  NewWorld  %correct
Europe            6         0         0       100
Islamic           0         5         1        83
NewWorld          1         0         4        80

Total             7         5         5        88

Jackknifed classification matrix

             Europe   Islamic  NewWorld  %correct
Europe           13         0         0       100
Islamic           0         8         1        89
NewWorld          0         1        15        94

Total            13         9        16        95

After we add EDUC and HEALTH, the results here for the learning sample do not
differ from those for the previous model. However, for the test sample, the addition of
EDUC and HEALTH increases the percentage correct from 76% to 88%.
We continue by issuing the STEP specifications listed above, each time noting the
total percentage correct as well as the percentages for the Islamic and New World
groups. After scanning the classification results from both the test sample and the
learning sample jackknifed panel, we conclude that model 2 (BIRTH_RT, DEATH_RT,
MIL, EDUC, and HEALTH) is best and that model 1 performs the worst.

Classification of New Cases


Group membership is known in the current example. What if you have cases where the
group membership is unknown? For example, you might want to apply the rules
developed for one sample to a new sample.
When the value of the grouping variable is missing, SYSTAT still classifies the
case. For example, we set the group code for New World countries to missing:
IF group = 3 THEN LET group = .

and request automatic forward stepping for the model containing BIRTH_RT,
DEATH_RT, MIL, EDUC, and HEALTH:


DISCRIM
USE ourworld
LET gdp_cap = L10 (gdp_cap)
LET (educ, health, mil) = SQR(@)
LET diffrnce = educ - health
IF group = 3 THEN LET group = .
LABEL group / 1=Europe, 2=Islamic
MODEL group = urban birth_rt death_rt babymort,
gdp_cap educ health mil b_to_d,
lifeexpm lifeexpf literacy diffrnce
IDVAR = country$
PRINT / MAHAL
START / FORWARD
STEP / AUTO
STOP

The following are the Mahalanobis distances and posterior probabilities for the
countries with missing group codes and also the classification matrix. The weight
variable is not used here.
Mahalanobis distance-square from group means and
Posterior probabilities for group membership

Priors =                  .500             .500
                        Europe          Islamic
Not Grouped
-----------
Argentina     *       28.6 1.00        59.6  .00
Barbados      *       25.9 1.00        71.9  .00
Bolivia       *      120.7  .00         2.7 1.00
Brazil        *      115.7  .00        10.0 1.00
Canada        *        2.1 1.00       124.1  .00
Chile         *       63.2  .00        35.5 1.00
Colombia      *      204.0  .00        22.4 1.00
CostaRica     *      306.5  .00        60.8 1.00
Venezuela     *      297.0  .00        49.2 1.00
DominicanR.   *      129.9  .00        10.8 1.00
Uruguay       *       12.5 1.00        91.4  .00
Ecuador       *      149.3  .00         8.4 1.00
ElSalvador    *      183.7  .00        10.1 1.00
Jamaica       *      100.2  .00        32.7 1.00
Guatemala     *      155.3  .00         5.5 1.00
Haiti         *      136.8  .00         1.4 1.00
Honduras      *      216.6  .00        13.2 1.00
Trinidad      *      132.6  .00        14.0 1.00
Peru          *       99.4  .00         7.4 1.00
Panama        *      160.5  .00        18.9 1.00
Cuba          *       19.4 1.00        70.7  .00

-->  case misclassified
*    case not used in computation

Classification matrix (cases in row categories classified into columns)

               Europe   Islamic  %correct
Europe             19         0       100
Islamic             0        15       100

Total              19        15       100
Not Grouped         5        16

Argentina, Barbados, Canada, Uruguay, and Cuba are classified as European; the other
16 countries are classified as Islamic.
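To see how the distances in the table relate to the posterior probabilities, the sketch below applies the usual rule for a linear discriminant model with normally distributed groups and a common covariance matrix: each posterior is proportional to the prior times exp(-D²/2). This rule is an assumption made for illustration; the text does not state the exact formula SYSTAT uses, and the function name posteriors is hypothetical.

```python
import numpy as np

def posteriors(d2, priors):
    """Posterior probabilities from squared Mahalanobis distances (assumed LDA rule)."""
    d2 = np.asarray(d2, dtype=float)
    priors = np.asarray(priors, dtype=float)
    # Subtract the smallest distance before exponentiating to avoid underflow.
    weights = priors * np.exp(-0.5 * (d2 - d2.min()))
    return weights / weights.sum()

priors = [0.5, 0.5]                              # Europe, Islamic
argentina = posteriors([28.6, 59.6], priors)     # distances taken from the table above
bolivia = posteriors([120.7, 2.7], priors)

print("Argentina:", np.round(argentina, 2))      # Europe is the closer group
print("Bolivia:  ", np.round(bolivia, 2))        # Islamic is the closer group
```

With distances this far apart, the posterior for the nearer group rounds to 1.00, which agrees with the printed table.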


References
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7, 179-188.
Hill, M. A. and Engelman, L. (1992). Graphical aids for nonlinear regression and
discriminant analysis. Computational Statistics, Vol. 2, Y. Dodge and J. Whittaker, eds.
Proceedings of the 10th Symposium on Computational Statistics. Physica-Verlag,
111-126.
Morrison, D. F. (1976). Multivariate statistical methods. New York: McGraw-Hill.
Pillai, K. C. S. (1960). Statistical table for tests of multivariate hypotheses. Manila: The
Statistical Center, University of Philippines.

Chapter 12
Factor Analysis
Herb Stenson and Leland Wilkinson

Factor analysis provides principal components analysis and common factor analysis
(maximum likelihood and iterated principal axis). SYSTAT has options to rotate, sort,
plot, and save factor loadings. With the principal components method, you can also
save the scores and coefficients. Orthogonal methods of rotation include varimax,
equamax, quartimax, and orthomax. A direct oblimin method is also available for
oblique rotation. Users can explore other rotations by interactively rotating a 3-D
Quick Graph plot of the factor loadings. Various inferential statistics (for example,
confidence intervals, standard errors, and chi-square tests) are provided, depending on
the nature of the analysis that is run.

Statistical Background
Principal components (PCA) and common factor (MLA for maximum likelihood and
IPA for iterated principal axis) analyses are methods of decomposing a correlation or
covariance matrix. Although principal components and common factor analyses are
based on different mathematical models, they can be used on the same data and both
usually produce similar results. Factor analysis is often used in exploratory data
analysis to:
• Study the correlations of a large number of variables by grouping the variables in
  factors so that variables within each factor are more highly correlated with
  variables in that factor than with variables in other factors.
• Interpret each factor according to the meaning of the variables.
• Summarize many variables by a few factors. The scores from the factors can be
  used as input data for t tests, regression, ANOVA, discriminant analysis, and so on.

Often the users of factor analysis are overwhelmed by the gap between theory and
practice. In this chapter, we try to offer practical hints. It is important to realize that
you may need to make several passes through the procedure, changing options each
time, until the results give you the necessary information for your problem.
If you understand the component model, you are on the way toward understanding
the factor model, so let's begin with the former.

A Principal Component
What is a principal component? The simplest way to see is through real data. The
following data consist of Graduate Record Examination verbal and quantitative scores.
These scores are from 25 applicants to a graduate psychology department.
VERBAL   QUANTITATIVE
   590            530
   620            620
   640            620
   650            550
   620            610
   610            660
   560            570
   610            730
   600            650
   740            790
   560            580
   680            710
   600            540
   520            530
   660            650
   750            710
   630            640
   570            660
   600            650
   570            570
   600            550
   690            540
   770            670
   610            660
   600            640


Now, we could decide to try linear regression to predict verbal scores from
quantitative. Or, we could decide to predict quantitative from verbal by the same
method. The data don't suggest which is a dependent variable; either will do. What if
we aren't interested in predicting either one separately but instead want to know how
both variables hang together jointly? This is what a principal component does. Karl
Pearson, who developed principal component analysis in 1901, described a component
as a "line of closest fit to systems of points in space." In short, the regression line
indicates best prediction, and the component line indicates best association.
The following figure shows the regression and component lines for our GRE data.
The regression of y on x is the line with the smallest slope (flatter than diagonal). The
regression of x on y is the line with the largest slope (steeper than diagonal). The
component line is between the other two. Interestingly, when most people are asked to
draw a line relating two variables in a scatterplot, they tend to approximate the
component line. It takes a lot of explaining to get them to realize that this is not the best
line for predicting the vertical axis variable (y) or the horizontal axis variable (x).

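[Figure: scatterplot of Quantitative GRE Score (vertical axis, 500 to 800) against
Verbal GRE Score (horizontal axis, 500 to 800), showing the two regression lines and
the component line]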

Notice that the slope of the component line is approximately 1, which means that the
two variables are weighted almost equally (assuming the axis scales are the same). We
could make a new variable called GRE that is the sum of the two tests:
GRE = VERBAL + QUANTITATIVE

This new variable could summarize, albeit crudely, the information in the other two. If
the points clustered almost perfectly around the component line, then the new
component variable could summarize almost perfectly both variables.
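The three lines discussed above can be compared numerically. The sketch below, in Python with NumPy rather than SYSTAT, assumes the VERBAL and QUANTITATIVE columns listed earlier are paired in the order shown and computes the slope of each line in the plane with VERBAL on the horizontal axis.

```python
import numpy as np

verbal = np.array([590, 620, 640, 650, 620, 610, 560, 610, 600, 740, 560, 680, 600,
                   520, 660, 750, 630, 570, 600, 570, 600, 690, 770, 610, 600], float)
quant = np.array([530, 620, 620, 550, 610, 660, 570, 730, 650, 790, 580, 710, 540,
                  530, 650, 710, 640, 660, 650, 570, 550, 540, 670, 660, 640], float)

cov = np.cov(verbal, quant)                 # 2 x 2 sample covariance matrix (n - 1 divisor)
s_xx, s_xy, s_yy = cov[0, 0], cov[0, 1], cov[1, 1]

slope_y_on_x = s_xy / s_xx                  # regression of quantitative on verbal
slope_x_on_y = s_yy / s_xy                  # regression of verbal on quantitative, same axes

eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
first = eigvecs[:, -1]                      # direction of the first principal component
slope_component = first[1] / first[0]       # the ratio does not depend on the vector's sign

print(slope_y_on_x, slope_component, slope_x_on_y)
```

The component slope falls between the two regression slopes, as the figure describes.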


Multiple Principal Components


The goal of principal components analysis is to summarize a multivariate data set as
accurately as possible using a few components. So far, we have seen only one
component. It is possible, however, to draw a second component perpendicular to the
first. The first component will summarize as much of the joint variation as possible.
The second will summarize what's left. If we do this with the GRE data, of course, we
will have as many components as original variables, which is not much of a saving. We usually
seek fewer components than variables, so that the variation left over is negligible.

Component Coefficients
In the above equation for computing the first principal component on our test data, we
made both coefficients equal. In fact, when you run the sample covariance matrix using
factor analysis in SYSTAT, the coefficients are as follows:
GRE = 0.008 * VERBAL + 0.01 * QUANTITATIVE

They are indeed nearly equal. Their magnitude is considerably less than 1 because
principal components are usually scaled to conserve variance. That is, once you
compute the components with these coefficients, the total variance on the components
is the same as the total variance on the original variables.

Component Loadings
Most researchers want to know the relation between the original variables and the
components. Some components may be nearly identical to an original variable; in other
words, their coefficients may be nearly 0 for all variables except one. Other
components may be a more even amalgam of several original variables.
Component loadings are the covariances of the original variables with the
components. In our example, these loadings are 51.085 for VERBAL and 62.880 for
QUANTITATIVE. You may have noticed that these are proportional to the coefficients;
they are simply scaled differently. If you square each of these loadings and add them
up separately for each component, you will have the variance accounted for by each
component.
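The coefficients and loadings quoted in this section and the previous one can be reproduced from an eigendecomposition of the sample covariance matrix. The following Python sketch is an illustration, not SYSTAT's implementation; it assumes the two GRE columns are paired in the order listed and uses the n - 1 divisor for covariances.

```python
import numpy as np

verbal = np.array([590, 620, 640, 650, 620, 610, 560, 610, 600, 740, 560, 680, 600,
                   520, 660, 750, 630, 570, 600, 570, 600, 690, 770, 610, 600], float)
quant = np.array([530, 620, 620, 550, 610, 660, 570, 730, 650, 790, 580, 710, 540,
                  530, 650, 710, 640, 660, 650, 570, 550, 540, 670, 660, 640], float)
data = np.column_stack([verbal, quant])

cov = np.cov(data, rowvar=False)            # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
lam, vec = eigvals[-1], eigvecs[:, -1]      # first principal component
if vec.sum() < 0:                           # the sign is arbitrary; make loadings positive
    vec = -vec

coefficients = vec / np.sqrt(lam)           # weights that give a unit-variance component
loadings = vec * np.sqrt(lam)               # covariances of each variable with that component

scores = (data - data.mean(axis=0)) @ coefficients
print("coefficients:", np.round(coefficients, 3))
print("loadings:    ", np.round(loadings, 3))
print("variance accounted for:", np.sum(loadings ** 2), "eigenvalue:", lam)
```

The squared loadings sum to the eigenvalue, which is the variance accounted for by the component.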


Correlations or Covariances
Most researchers prefer to analyze the correlation rather than covariance structure
among their variables. Sample correlations are simply covariances of sample
standardized variables. Thus, if your variables are measured on very different scales or
if you feel the standard deviations of your variables are not theoretically significant,
you will want to work with correlations instead of covariances. In our test example,
working with correlations yields loadings of 0.879 for each variable instead of 51.085
and 62.880. When you factor the correlation instead of the covariance matrix, then the
loadings are the correlations of each component with each original variable.
For our test data, loadings of 0.879 mean that if you created a GRE component by
standardizing VERBAL and QUANTITATIVE and adding them together weighted by
the coefficients, you would find the correlation between these component scores and
the original VERBAL scores to be 0.879. The same would be true for QUANTITATIVE.
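The same check can be made for the correlation matrix. This Python sketch (an illustration under the same pairing assumption for the GRE data) factors the correlation matrix and confirms that the loading equals the correlation between the component scores and each original variable.

```python
import numpy as np

verbal = np.array([590, 620, 640, 650, 620, 610, 560, 610, 600, 740, 560, 680, 600,
                   520, 660, 750, 630, 570, 600, 570, 600, 690, 770, 610, 600], float)
quant = np.array([530, 620, 620, 550, 610, 660, 570, 730, 650, 790, 580, 710, 540,
                  530, 650, 710, 640, 660, 650, 570, 550, 540, 670, 660, 640], float)
data = np.column_stack([verbal, quant])

corr = np.corrcoef(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)     # ascending eigenvalues
lam, vec = eigvals[-1], eigvecs[:, -1]
if vec.sum() < 0:                           # the sign is arbitrary; make loadings positive
    vec = -vec
loadings = vec * np.sqrt(lam)

# Standardize the variables, then weight them by the component coefficients.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
scores = z @ (vec / np.sqrt(lam))

r_verbal = np.corrcoef(scores, verbal)[0, 1]
r_quant = np.corrcoef(scores, quant)[0, 1]
print(np.round(loadings, 3), round(r_verbal, 3), round(r_quant, 3))
```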

Signs of Component Loadings


The signs of loadings within components are arbitrary. If a component (or factor) has
more negative than positive loadings, you may change minus signs to plus and plus to
minus. SYSTAT does this automatically for components that have more negative than
positive loadings, and thus will occasionally produce components or factors that have
different signs from those in other computer programs. This occasionally confuses
users. In mathematical terms, Ax = λx and A(-x) = λ(-x) are equivalent.

Factor Analysis
We have seen how principal components analysis is a method for computing new
variables that summarize variation in a space parsimoniously. For our test variables,
the equation for computing the first component was:
GRE = 0.008 * VERBAL + 0.01 * QUANTITATIVE

This component equation is linear, of the form:


Component = Linear combination of {Observed variables}

Factor analysts turn this equation around:


Observed variable = Linear combination of {Factors} + Error


This model was presented by Spearman near the turn of the twentieth century in the context of a
single intelligence factor and extended to multiple mental measurement factors by
Thurstone several decades later. Notice that the factor model makes observed variables
a function of unobserved factors. Even though this looks like a linear regression model,
none of the graphical and analytical techniques used for regression can be applied to
the factor model because there is no unique, observable set of factor scores or residuals
to examine.
Factor analysts are less interested in prediction than in decomposing a covariance
matrix. This is why the fundamental equation of factor analysis is not the above linear
model, but rather its quadratic form:
Observed covariances = Factor covariances + Error covariances

The covariances in this equation are usually expressed in matrix form, so that the
model decomposes an observed covariance matrix into a hypothetical factor
covariance matrix plus a hypothetical error covariance matrix. The diagonals of these
two hypothetical matrices are known, respectively, as communalities and
specificities.
In ordinary language, then, the factor model expresses variation within and relations
among observed variables as partly common variation among factors and partly
specific variation among random errors.
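A small numerical illustration of this decomposition may help. The Python sketch below uses a hypothetical loading matrix for four standardized variables and two uncorrelated factors (the numbers are invented for illustration) and builds the observed matrix as factor covariances plus a diagonal error matrix.

```python
import numpy as np

# Hypothetical loadings: rows are observed variables, columns are factors.
loadings = np.array([[0.8, 0.0],
                     [0.7, 0.1],
                     [0.1, 0.6],
                     [0.0, 0.9]])

factor_cov = loadings @ loadings.T          # covariance reproduced by the common factors
communalities = np.diag(factor_cov)         # diagonal of the factor covariance matrix
specificities = 1.0 - communalities         # chosen so each observed variance equals 1
error_cov = np.diag(specificities)          # errors are assumed mutually uncorrelated

observed = factor_cov + error_cov           # Observed = Factor + Error covariances
print(np.round(observed, 3))
print("communalities:", np.round(communalities, 2))
print("specificities:", np.round(specificities, 2))
```

Only the diagonal is split between the two parts; every off-diagonal relation among the observed variables comes from the common factors.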

Estimating Factors
Factor analysis involves several steps:
• First, the correlation or covariance matrix is computed from the usual
  cases-by-variables data file or it is input as a matrix.
• Second, the factor loadings are estimated. This is called initial factor extraction.
  Extraction methods are described in this section.
• Third, the factors are rotated to make the loadings more interpretable; that is,
  rotation methods make the loadings for each factor either large or small, not
  in between. These methods are described in the next section.
Factors must be estimated iteratively in a computer. There are several methods
available. The most popular approach, available in SYSTAT, is to modify the diagonal
of the observed covariance matrix and calculate factors the same way components are
computed. This procedure is repeated until the communalities reproduced by the factor
covariances are indistinguishable from the diagonal of the modified matrix.
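The iteration described above can be sketched as follows. This is a simplified Python illustration of an iterated principal axis loop, not SYSTAT's code: it starts from the principal components solution (unit diagonal), and the function name, the stopping rule on the largest change in communalities, and the test matrix are assumptions made for the example.

```python
import numpy as np

def iterated_principal_axis(r, n_factors, max_iter=25, tol=0.001):
    """Replace the diagonal with communalities and refactor until they stabilize."""
    r = np.asarray(r, dtype=float)
    h2 = np.ones(r.shape[0])                     # start from the components solution
    for _ in range(max_iter):
        reduced = r.copy()
        np.fill_diagonal(reduced, h2)            # modify the diagonal
        eigvals, eigvecs = np.linalg.eigh(reduced)
        top = np.argsort(eigvals)[::-1][:n_factors]
        lam = np.clip(eigvals[top], 0.0, None)   # guard against tiny negative roots
        loadings = eigvecs[:, top] * np.sqrt(lam)
        new_h2 = np.sum(loadings ** 2, axis=1)   # communalities reproduced by the factors
        converged = np.max(np.abs(new_h2 - h2)) < tol
        h2 = new_h2
        if converged:
            break
    return loadings, h2

# Hypothetical correlation matrix generated by a single factor.
true_loadings = np.array([0.8, 0.7, 0.6, 0.5])
r = np.outer(true_loadings, true_loadings)
np.fill_diagonal(r, 1.0)

loadings, h2 = iterated_principal_axis(r, n_factors=1, max_iter=200, tol=1e-8)
print(np.round(np.abs(loadings[:, 0]), 3), np.round(h2, 3))
```

The absolute value is printed because the sign of an extracted factor is arbitrary.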


Rotation
Usually the initial factor extraction does not give interpretable factors. One of the
purposes of rotation is to obtain factors that can be named and interpreted. That is, if
you can make the large loadings larger than before and the smaller loadings smaller,
then each variable is associated with a minimal number of factors. Hopefully, the
variables that load strongly together on a particular factor will have a clear meaning
with respect to the subject area at hand.
It helps to study plots of loadings for one factor against those for another. Ideally,
you want to see clusters of loadings at extreme values for each factor: like what A and
C are for factor 1, and B and D are for factor 2 in the left plot, and not like E and F in
the middle plot.
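[Figure: three plots (left, middle, right) of loadings on factor 1 against loadings on
factor 2, each with axes running from -1 to 1]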

In the middle plot, the loadings in groups E and F are sizeable for both factors 1 and 2.
However, if you lift the plot axes away from E and F, rotating them 45 degrees, and
then set them down as on the right, you achieve the desired effect. Sounds easy for two
factors. For three factors, imagine that the loadings are balls floating in a room and that
you rotate the floor and walls so that each loading is as close to the floor or a wall as it
can be. This concept generalizes to more dimensions.
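The 45-degree rotation described above is easy to verify numerically. In this Python sketch the loadings for groups E and F are hypothetical values chosen to be sizeable on both factors; multiplying by an orthogonal rotation matrix moves each group onto a single axis without changing any communality.

```python
import numpy as np

# Hypothetical unrotated loadings: group E loads (+, +), group F loads (+, -).
loadings = np.array([[0.60,  0.60],
                     [0.65,  0.65],
                     [0.60, -0.60],
                     [0.65, -0.65]])

theta = np.deg2rad(45.0)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])   # orthogonal rotation matrix

rotated = loadings @ rotation
print(np.round(rotated, 3))

# Communalities (row sums of squared loadings) are unchanged by the rotation.
print(np.sum(loadings ** 2, axis=1), np.sum(rotated ** 2, axis=1))
```

After rotation each variable has one large loading and one loading of essentially zero.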
Researchers let the computer do the rotation automatically. There are many criteria
for achieving a simple structure among component loadings, although Thurstone's are
most widely cited. For p variables and m components:
• Each component should have at least m near-zero loadings.
• Few components should have nonzero loadings on the same variable.

SYSTAT provides five methods of rotating loadings: varimax, equamax, quartimax,
orthomax, and oblimin.


Principal Components versus Factor Analysis


SYSTAT can perform both principal components and common factor analysis. Some
view principal components analysis as a method of factor analysis, although there is a
theoretical distinction between the two. Principal components are weighted linear
composites of observed variables. Common factors are unobserved variables that are
hypothesized to account for the intercorrelations among observed variables.
One significant practical difference is that common factor scores are indeterminate,
whereas principal component scores are not. There are no sufficient estimators of
scores for subjects on common factors (rotated or unrotated, maximum likelihood, or
otherwise). Some computer models provide regression estimates of factor scores,
but these are not estimates in the usual statistical sense. This problem arises not
because factors can be arbitrarily rotated (so can principal components), but because
the common factor model is based on more unobserved parameters than observed data
points, an unusual circumstance in statistics.
In recent years, maximum likelihood factor analysis algorithms have been
devised to estimate common factors. The implementation of these algorithms in
popular computer packages has led some users to believe that the factor indeterminacy
problem does not exist for maximum likelihood factor estimates. It does.
Mathematicians and psychometricians have known about the factor indeterminacy
problem for decades. For a historical review of the issues, see Steiger (1979); for a
general review, see Rozeboom (1982). For further information on principal
components, consult Harman (1976), Mulaik (1978), Gnanadesikan (1977), or Mardia,
Kent, and Bibby (1979).
Because of the indeterminacy problem, SYSTAT computes subjects' scores only
for the principal components model where subjects' scores are a simple linear
transformation of scores on the factored variables. SYSTAT does not save scores from
a common factor model.

Applications and Caveats


While there is not room here to discuss more statistical issues, you should realize that
there are several myths about factors versus components:
Myth. The factor model allows hypothesis testing; the component model doesn't.
Fact. Morrison (1967) and others present a full range of formal statistical tests for
components.


Myth. Factor loadings are real; principal component loadings are approximations.
Fact. This statement is too ambiguous to have any meaning. It is easy to define things
so that factors are approximations of components.
Myth. Factor analysis is more likely to uncover lawful structure in your data; principal
components are more contaminated by error.
Fact. Again, this statement is ambiguous. With further definition, it can be shown to be
true for some data, false for others. It is true that, in general, factor solutions will have
lower dimensionality than corresponding component solutions. This can be an
advantage when searching for simple structure among noisy variables, as long as you
compare the result to a principal components solution to avoid being fooled by the sort
of degeneracies illustrated above.

Factor Analysis in SYSTAT


Factor Analysis Main Dialog Box
For factor analysis, from the menus choose:
Statistics
Data Reduction
Factor Analysis


The following options are available:


Model variables. Variables used to create factors.
Method. SYSTAT offers three estimation methods:
• Principal components analysis (PCA) is the default method of analysis.
• Iterated principal axis (IPA) provides an iterative method to extract common
  factors by starting with the principal components solution and iteratively solving
  for communalities.
• Maximum likelihood analysis (MLA) iteratively finds communalities and common
  factors.
Display. You can sort factor loadings by size or display extended results. Selecting
Extended results displays all possible Factor output.
Sample size for matrix input. If your data are in the form of a correlation or covariance
matrix, you must specify the sample size on which the input matrix is based so that
inferential statistics (available with extended results) can be computed.
Matrix for extraction. You can factor a correlation matrix or a covariance matrix. Most
frequently, the correlation matrix is used. You can also delete missing cases pairwise
instead of listwise. Listwise deletes any case with missing data for any variable in the
list. Pairwise examines each pair of variables and uses all cases with both values
present.
Extraction parameters. You can limit the results by specifying extraction parameters.
• Minimum eigenvalue. Specify the smallest eigenvalue to retain. The default is 1.0
  for PCA and IPA (not available with maximum likelihood). Incidentally, if you
  specify 0, factor analysis ignores components with negative eigenvalues (which
  can occur with pairwise deletion).
• Number of factors. Specify the number of factors to compute. If you specify both
  the number of factors and the minimum eigenvalue, factor analysis uses whichever
  criterion results in the smaller number of components.
• Iterations. Specify the number of iterations SYSTAT should perform (not available
  for principal components). The default is 25.
• Convergence. Specify the convergence criterion (not available for principal
  components). The default is 0.001.
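The way the two extraction limits combine can be sketched as below. The function name n_retained and the eigenvalues are hypothetical, and treating an eigenvalue exactly equal to the minimum as retained is an assumption; the text only says that the criterion giving the smaller number of components wins.

```python
import numpy as np

def n_retained(eigenvalues, min_eigen=1.0, number=None):
    """Number of components kept under the minimum-eigenvalue and number limits."""
    eig = np.asarray(eigenvalues, dtype=float)
    by_eigen = int(np.sum(eig >= min_eigen))     # components passing the eigenvalue limit
    if number is None:
        return by_eigen
    return min(by_eigen, int(number))            # whichever criterion keeps fewer

eigenvalues = [3.2, 1.4, 0.9, 0.3, 0.2]          # hypothetical latent roots
print(n_retained(eigenvalues))                   # eigenvalue limit only
print(n_retained(eigenvalues, number=3))         # eigenvalue limit is the stricter one
print(n_retained(eigenvalues, number=1))         # number limit is the stricter one
```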


Rotation Parameters
This dialog box specifies the factor rotation method.

The following methods are available:


• No rotation. Factors are not rotated.
• Varimax. An orthogonal rotation method that minimizes the number of variables
  that have high loadings on each factor. It simplifies the interpretation of the factors.
• Equamax. A rotation method that is a combination of the varimax method, which
  simplifies the factors, and the quartimax method, which simplifies the variables.
  The number of variables that load highly on a factor and the number of factors
  needed to explain a variable are minimized.
• Quartimax. A rotation method that minimizes the number of factors needed to
  explain each variable. It simplifies the interpretation of the observed variables.
• Orthomax. Specifies families of orthogonal rotations. Gamma specifies the member
  of the family to use. Varying Gamma changes maximization of the variances of the
  loadings from columns (Varimax) to rows (Quartimax).
• Oblimin. Specifies families of oblique (non-orthogonal) rotations. Gamma
  specifies the member of the family to use. For Gamma, specify 0 for moderate
  correlations, positive values to allow higher correlations, and negative values to
  restrict correlations.
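For readers who want to see how an orthomax rotation can be computed, here is a minimal Python sketch of a widely used SVD-based iteration. It is not SYSTAT's algorithm, and the parameterization (gamma = 1 for varimax, gamma = 0 for quartimax), the function name, and the test loadings are assumptions made for the example.

```python
import numpy as np

def orthomax(loadings, gamma=1.0, max_iter=500, tol=1e-10):
    """Orthogonal rotation by the orthomax criterion (assumed: gamma=1 is varimax)."""
    a = np.asarray(loadings, dtype=float)
    p, k = a.shape
    t = np.eye(k)                                # accumulated rotation matrix
    objective = 0.0
    for _ in range(max_iter):
        b = a @ t
        col_ss = np.sum(b ** 2, axis=0)          # column sums of squared loadings
        target = b ** 3 - (gamma / p) * b @ np.diag(col_ss)
        u, s, vt = np.linalg.svd(a.T @ target)
        t = u @ vt                               # nearest orthogonal matrix
        new_objective = s.sum()
        if new_objective <= objective * (1.0 + tol):
            break
        objective = new_objective
    return a @ t, t

# Hypothetical simple structure, deliberately rotated 30 degrees away from it.
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                   [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
angle = np.deg2rad(30.0)
mix = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle),  np.cos(angle)]])
unrotated = simple @ mix

rotated, t = orthomax(unrotated, gamma=1.0)
print(np.round(rotated, 3))
```

On this example the rotation recovers the simple structure up to the order and sign of the columns, and the communalities are unchanged because the rotation matrix is orthogonal.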


Save
You can save factor analysis results for further analyses.

For the maximum likelihood and iterated principal axis methods, you can save only
loadings. For the principal components method, select from these options:
• Do not save results. Results are not saved.
• Factor scores. Standardized factor scores.
• Residuals. Residuals for each case. For a correlation matrix, the residual is the
  actual z score minus the predicted z score using the factor scores times the loadings
  to get the predicted scores. For a covariance matrix, the residuals are from
  unstandardized predictions. With an orthogonal rotation, Q and PROB are also
  saved. Q is the sum of the squared residuals, and PROB is its probability.
• Principal components. Unstandardized principal components scores with mean 0
  and variance equal to the eigenvalue for the factor (only for PCA without rotation).
• Factor coefficients. Coefficients that produce standardized scores. For a correlation
  matrix, multiply the coefficients by the standardized variables; for a covariance
  matrix, use the original variables.
• Eigenvectors. Eigenvectors (only for PCA without a rotation). Use to produce
  unstandardized scores.
• Factor loadings. Factor loadings.
• Save data with scores. Saves the selected item and all the variables in the working
  data file as a new data file. Use with options for scores (not loadings, coefficients,
  or other similar options).


If you save scores, the variables in the file are labeled FACTOR(1), FACTOR(2), and
so on. Any observations with missing values on any of the input variables will have
missing values for all scores. The scores are normalized to have zero mean and, if the
correlation matrix is used, unit variance. If you use the covariance matrix and perform
no rotations, SYSTAT does not standardize the component scores. The sum of their
variances is the same as for the original data.
If you want to use the score coefficients to get component scores for new data,
multiply the coefficients by the standardized data. SYSTAT does this when it saves
scores. Another way to do cross-validation is to assign a zero weight to those cases not
used in the factoring and to assign a unit weight to those cases used. The zero-weight
cases are not used in the factoring, but scores are computed for them.
When Factor scores or Principal components is requested, T2 and PROB are also
saved. The former is the Hotelling T² statistic that squares the standardized distance
from each case to the centroid of the factor space (that is, the sum of the squared,
standardized factor scores). PROB is the upper-tail probability of T2. Use this statistic
to identify outliers within the factor space. T2 is not computed with an oblique rotation.
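The T2 statistic as defined here is simple to compute once standardized component scores are available. The Python sketch below uses randomly generated data (purely hypothetical) and two components from the correlation matrix; PROB is omitted because the text does not state which reference distribution is used.

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed for repeatability
data = rng.normal(size=(25, 4))                 # hypothetical cases-by-variables data
n, k = data.shape[0], 2                         # number of cases, components kept

z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
corr = np.corrcoef(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1][:k]           # indices of the k largest eigenvalues

coefficients = eigvecs[:, order] / np.sqrt(eigvals[order])
scores = z @ coefficients                       # standardized scores: mean 0, variance 1
t2 = np.sum(scores ** 2, axis=1)                # squared distance to the centroid

print("largest T2 values:", np.round(np.sort(t2)[-3:], 2))
```

Cases with unusually large T2 values lie far from the centroid of the component space and are candidates for closer inspection.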

Using Commands
After selecting a data file with USE filename, continue with:
FACTOR
MODEL varlist
SAVE filename / SCORES DATA LOAD COEF VECTORS PC RESID
ESTIMATE / METHOD = PCA or IPA or MLA ,
           LISTWISE or PAIRWISE N=n CORR or COVA ,
           NUMBER=n EIGEN=n ITER=n CONV=n SORT ,
           ROTATE = VARIMAX or EQUAMAX or QUARTIMAX ,
                    or ORTHOMAX or OBLIMIN
           GAMMA=n

Usage Considerations
Types of data. Data for factor analysis can be a cases-by-variables data file, a correlation
matrix, or a covariance matrix.
Print options. Factor analysis offers three categories of output: short (the default),
medium, and long. Each has specific output panels associated with it.
For Short, the default, panels are: Latent roots or eigenvalues (not MLA), initial and
final communality estimates (not PCA), component loadings (PCA) or factor pattern
(MLA, IPA), variance explained by components (PCA) or factors (MLA, IPA),
percentage of total variance explained, change in uniqueness and log likelihood at each
iteration (MLA only), and canonical correlations (MLA only). When a rotation is
requested: rotated loadings (PCA) or pattern (MLA, IPA) matrix, variance explained
by rotated components, percentage of total variance explained, and correlations among
oblique components or factors (oblimin only).
By specifying Medium, you get the panels listed for Short, plus: the matrix to factor,
the chi-square test that all eigenvalues are equal (PCA only), the chi-square test that
last k eigenvalues are equal (PCA only), and differences of original correlations or
covariances minus fitted values. For covariance matrix input (not MLA or IPA):
asymptotic 95% confidence limits for the eigenvalues and estimates of the population
eigenvalues with standard errors.
With Long, you get the panels listed for Short and Medium, plus: latent vectors
(eigenvectors) with standard errors (not MLA) and the chi-square test that the number
of factors is k (MLA only). With an oblimin rotation: direct and indirect contribution
of factors to variances and the rotated structure matrix.
Quick Graphs. Factor analysis produces a scree plot and a factor loadings plot.
Saving files. You can save factor scores, residuals, principal components, factor
coefficients, eigenvectors, or factor loadings as a new data file. For the iterated
principal axis and maximum likelihood methods, you can save only factor loadings.
You can save only eigenvectors and principal components for unrotated solutions
using the principal components method.
BY groups. Factor analysis produces separate analyses for each level of any BY
variables.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. Factor analysis uses FREQUENCY variables to duplicate cases for
rectangular data files.
Case weights. For rectangular data, you can weight cases using a WEIGHT variable.

Examples
Example 1
Principal Components
Principal components (PCA, the default method) is a good way to begin a factor
analysis (and possibly the only method you may need). If one variable is a linear
combination of the others, the program will not stop (MLA and IPA both require a
nonsingular correlation or covariance matrix). The PCA output can also provide
indications that:
• One or more variables have little relation to the others and, therefore, are not suited
  for factor analysis, so in your next run, you might consider omitting them.
• The final number of factors may be three or four and not double or triple this
  number.
To illustrate this method of factor extraction, we borrow data from Harman (1976),
who borrowed them from a 1937 unpublished thesis by Mullen. This classic data set is
widely used in the literature. For example, Jackson (1991) reports loadings for the
PCA, MLA, and IPA methods. The data are measurements recorded for 305 youth
aged seven to seventeen: height, arm span, length of forearm, length of lower leg,
weight, bitrochanteric diameter (the upper thigh), girth, and width. Because the units
of these measurements differ, we analyze a correlation matrix:
           Height  Arm_Span  Forearm  Lowerleg  Weight  Bitro  Girth  Width
Height      1.000
Arm_Span    0.846     1.000
Forearm     0.805     0.881    1.000
Lowerleg    0.859     0.826    0.801     1.000
Weight      0.473     0.376    0.380     0.436   1.000
Bitro       0.398     0.326    0.319     0.329   0.762  1.000
Girth       0.301     0.277    0.237     0.327   0.730  0.583  1.000
Width       0.382     0.415    0.345     0.365   0.629  0.577  0.539  1.000

The correlation matrix is stored in the YOUTH file. SYSTAT knows that the file contains
a correlation matrix, so no special instructions are needed to read the matrix. The input is:
FACTOR
USE youth
MODEL height .. width
ESTIMATE / METHOD=PCA N=305 SORT ROTATE=VARIMAX


Notice the shortcut notation (..) for listing consecutive variables in a file.
The output follows:
Latent Roots (Eigenvalues)

                     1         2         3         4         5
                4.6729    1.7710    0.4810    0.4214    0.2332

                     6         7         8
                0.1867    0.1373    0.0965

Component loadings

                     1         2
 HEIGHT         0.8594    0.3723
 ARM_SPAN       0.8416    0.4410
 LOWERLEG       0.8396    0.3953
 FOREARM        0.8131    0.4586
 WEIGHT         0.7580   -0.5247
 BITRO          0.6742   -0.5333
 WIDTH          0.6706   -0.4185
 GIRTH          0.6172   -0.5801

Variance Explained by Components

                     1         2
                4.6729    1.7710

Percent of Total Variance Explained

                     1         2
               58.4110   22.1373

Rotated Loading Matrix ( VARIMAX, Gamma = 1.0000)

                     1         2
 ARM_SPAN       0.9298    0.1955
 FOREARM        0.9191    0.1638
 HEIGHT         0.8998    0.2599
 LOWERLEG       0.8992    0.2295
 WEIGHT         0.2507    0.8871
 BITRO          0.1806    0.8404
 GIRTH          0.1068    0.8403
 WIDTH          0.2509    0.7496

"Variance" Explained by Rotated Components

                     1         2
                3.4973    2.9465

Percent of Total Variance Explained

                     1         2
               43.7165   36.8318

[Figure: Scree Plot (Eigenvalue versus Number of Factors) and Factor Loadings Plot (FACTOR(2) versus FACTOR(1))]


Notice that we did not specify how many factors we wanted. For PCA, the default
is to compute as many factors as there are eigenvalues greater than 1.0, so in this run
you study results for two factors. After examining the output, you may want to specify
a minimum eigenvalue or, very rarely, a lower limit.
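
As a rough illustration of this default rule, the following Python sketch computes the eigenvalues of the YOUTH correlation matrix shown earlier and counts those greater than 1.0. NumPy and all variable names are assumptions made for illustration; this is not SYSTAT's own code.

```python
import numpy as np

# Lower triangle of the YOUTH correlation matrix, in file order:
# height, arm_span, forearm, lowerleg, weight, bitro, girth, width.
lower = [
    [1.000],
    [0.846, 1.000],
    [0.805, 0.881, 1.000],
    [0.859, 0.826, 0.801, 1.000],
    [0.473, 0.376, 0.380, 0.436, 1.000],
    [0.398, 0.326, 0.319, 0.329, 0.762, 1.000],
    [0.301, 0.277, 0.237, 0.327, 0.730, 0.583, 1.000],
    [0.382, 0.415, 0.345, 0.365, 0.629, 0.577, 0.539, 1.000],
]
R = np.zeros((8, 8))
for i, row in enumerate(lower):
    R[i, : i + 1] = row
R = R + R.T - np.diag(np.diag(R))  # mirror the lower triangle, keep the unit diagonal

eigenvalues = np.linalg.eigvalsh(R)[::-1]   # latent roots, largest first
n_factors = int(np.sum(eigenvalues > 1.0))  # eigenvalue-greater-than-1.0 rule

print(np.round(eigenvalues, 4))
print("factors retained:", n_factors)
```

Because the matrix is entered to three decimals, the roots should agree with the Latent Roots panel only approximately.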
Unrotated loadings (and orthogonally rotated loadings) are correlations of the
variables with the principal components (factors). They are also the eigenvectors of the
correlation matrix multiplied by the square roots of the corresponding eigenvalues.
Usually these loadings are not useful for interpreting the factors. For some industrial
applications, researchers prefer to examine the eigenvectors alone.
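
The relation between eigenvectors and loadings can be sketched in Python as follows. NumPy and the variable names are assumptions for illustration. The sign of an eigenvector is arbitrary, so each column is flipped here to make the HEIGHT loading positive, which matches the orientation of the printed panel.

```python
import numpy as np

# Lower triangle of the YOUTH correlation matrix, in file order:
# height, arm_span, forearm, lowerleg, weight, bitro, girth, width.
lower = [
    [1.000],
    [0.846, 1.000],
    [0.805, 0.881, 1.000],
    [0.859, 0.826, 0.801, 1.000],
    [0.473, 0.376, 0.380, 0.436, 1.000],
    [0.398, 0.326, 0.319, 0.329, 0.762, 1.000],
    [0.301, 0.277, 0.237, 0.327, 0.730, 0.583, 1.000],
    [0.382, 0.415, 0.345, 0.365, 0.629, 0.577, 0.539, 1.000],
]
R = np.zeros((8, 8))
for i, row in enumerate(lower):
    R[i, : i + 1] = row
R = R + R.T - np.diag(np.diag(R))

values, vectors = np.linalg.eigh(R)          # ascending order
order = np.argsort(values)[::-1]             # re-sort, largest root first
values, vectors = values[order], vectors[:, order]

# Loadings: eigenvectors scaled by the square roots of their eigenvalues.
loadings = vectors[:, :2] * np.sqrt(values[:2])
loadings = loadings * np.sign(loadings[0])   # make the HEIGHT loading positive

# The sum of squared loadings in each column returns the eigenvalue.
column_ss = (loadings ** 2).sum(axis=0)
print(np.round(loadings, 4))
print(np.round(column_ss, 4))
```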
The Variance explained for each component is the eigenvalue for the factor. The
first factor accounts for 58.4% of the variance; the second, 22.1%. The Total Variance
is the sum of the diagonal elements of the correlation (or covariance) matrix. By
summing the Percent of Total Variance Explained for the two factors
( 58.411 + 22.137 = 80.548 ), you can say that more than 80% of the variance of all
eight variables is explained by the first two factors.
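
The percentages can be checked by hand from the latent roots. The short Python sketch below (variable names are illustrative) divides each printed eigenvalue by the total variance, which for a correlation matrix of eight variables is 8.

```python
# Latent roots as printed in the output for the YOUTH correlation matrix.
eigenvalues = [4.6729, 1.7710, 0.4810, 0.4214, 0.2332, 0.1867, 0.1373, 0.0965]

total_variance = sum(eigenvalues)  # trace of the correlation matrix, about 8
percent = [100.0 * ev / total_variance for ev in eigenvalues]
first_two = percent[0] + percent[1]

print([round(p, 3) for p in percent[:2]])
print(round(first_two, 3))
```

Small differences from the printed percentages in the last decimal arise because the eigenvalues used here are already rounded.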
In the Rotated Loading Matrix, the rows of the display have been sorted, placing the
loadings > 0.5 for factor 1 first, and so on. These are the coefficients of the factors after
rotation, so notice that large values for the unrotated loadings are larger here and the
small values are smaller. The sums of squares of these coefficients (for each factor or
column) are printed below under the heading Variance Explained by Rotated
Components. Together, the two rotated factors explain more than 80% of the variance.
Factor analysis offers five types of rotation. Here, by default, the orthogonal varimax
method is used.
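
To see how the rotated loadings relate to the Variance Explained by Rotated Components panel, the following Python sketch sums the squared rotated loadings in each column. The numbers are copied from the output above; the variable names are illustrative.

```python
# Rotated loading matrix (varimax) copied from the output, rows in printed order:
# ARM_SPAN, FOREARM, HEIGHT, LOWERLEG, WEIGHT, BITRO, GIRTH, WIDTH.
rotated = [
    (0.9298, 0.1955),
    (0.9191, 0.1638),
    (0.8998, 0.2599),
    (0.8992, 0.2295),
    (0.2507, 0.8871),
    (0.1806, 0.8404),
    (0.1068, 0.8403),
    (0.2509, 0.7496),
]

n_variables = len(rotated)
# Sum of squared loadings for each factor (column).
variance = [sum(row[j] ** 2 for row in rotated) for j in range(2)]
# For a correlation matrix the total variance equals the number of variables.
percent = [100.0 * v / n_variables for v in variance]

print([round(v, 4) for v in variance])
print([round(p, 2) for p in percent])
```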
To interpret each factor, look for variables with high loadings. The four variables
that load highly on factor 1 can be said to measure lankiness; while the four that load
highly on factor 2, stockiness. Other data sets may include variables that do not load
highly on any specific factor.
In the factor scree plot, the eigenvalues are plotted against their order (or associated
component). Use this display to identify large values that separate well from smaller
eigenvalues. This can help to identify a useful number of factors to retain. Scree is the
rubble at the bottom of a cliff; the large retained roots are the cliff, and the deleted ones
are the rubble.
The points in the factor loadings plot are variables, and the coordinates are the
rotated loadings. Look for clusters of loadings at the extremes of the factors. The four
variables at the right of the plot load highly on factor 1 and all reflect length. The
variables at the top of the plot load highly on factor 2 and reflect width.

Example 2
Maximum Likelihood
This example uses maximum likelihood for initial factor extraction and 2 as the
number of factors. Other options remain as in the principal components example. The
input is:
FACTOR
USE youth
MODEL height .. width
ESTIMATE / METHOD=MLA N=305 NUMBER=2 SORT ROTATE=VARIMAX


The output follows:


Initial Communality Estimates

                     1         2         3         4         5
                0.8162    0.8493    0.8006    0.7884    0.7488

                     6         7         8
                0.6041    0.5622    0.4778

Iterative Maximum Likelihood Factor Analysis: Convergence = 0.001000.

 Iteration    Maximum Change in     Negative log of
 Number       SQRT(uniqueness)      Likelihood
     1           0.722640              0.384050
     2           0.243793              0.273332
     3           0.051182              0.253671
     4           0.010359              0.253162
     5           0.000493              0.253162

Final Communality Estimates

                     1         2         3         4         5
                0.8302    0.8929    0.8338    0.8006    0.9109

                     6         7         8
                0.6363    0.5837    0.4633

Canonical Correlations

                     1         2
                0.9823    0.9489

Factor pattern

                     1         2
 HEIGHT         0.8797    0.2375
 ARM_SPAN       0.8735    0.3604
 LOWERLEG       0.8551    0.2633
 FOREARM        0.8458    0.3442
 WEIGHT         0.7048   -0.6436
 BITRO          0.5887   -0.5383
 WIDTH          0.5743   -0.3653
 GIRTH          0.5265   -0.5536

Variance Explained by Factors

                     1         2
                4.4337    1.5179

Percent of Total Variance Explained

                     1         2
               55.4218   18.9742

Rotated Pattern Matrix ( VARIMAX, Gamma = 1.0000)

                     1         2
 ARM_SPAN       0.9262    0.1873
 FOREARM        0.8942    0.1853
 HEIGHT         0.8628    0.2928
 LOWERLEG       0.8569    0.2576
 WEIGHT         0.2268    0.9271
 BITRO          0.1891    0.7750
 GIRTH          0.1289    0.7530
 WIDTH          0.2734    0.6233

"Variance" Explained by Rotated Factors

                     1         2
                3.3146    2.6370

Percent of Total Variance Explained

                     1         2
               41.4331   32.9628

Percent of Common Variance Explained

                     1         2
               55.6927   44.3073

[Figure: Factor Loadings Plot (FACTOR(2) versus FACTOR(1))]

The first panel of output contains the communality estimates. The communality of a
variable is its theoretical squared multiple correlation with the factors extracted. For
MLA (and IPA), the assumption for the initial communalities is the observed squared
multiple correlation with all the other variables.
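
The initial communality estimates can be reproduced approximately from the correlation matrix alone. The Python sketch below uses the standard identity that the squared multiple correlation of variable i with all the others is 1 - 1/r(ii), where r(ii) is the i-th diagonal element of the inverse correlation matrix. NumPy and the variable names are assumptions for illustration.

```python
import numpy as np

# Lower triangle of the YOUTH correlation matrix, in file order:
# height, arm_span, forearm, lowerleg, weight, bitro, girth, width.
lower = [
    [1.000],
    [0.846, 1.000],
    [0.805, 0.881, 1.000],
    [0.859, 0.826, 0.801, 1.000],
    [0.473, 0.376, 0.380, 0.436, 1.000],
    [0.398, 0.326, 0.319, 0.329, 0.762, 1.000],
    [0.301, 0.277, 0.237, 0.327, 0.730, 0.583, 1.000],
    [0.382, 0.415, 0.345, 0.365, 0.629, 0.577, 0.539, 1.000],
]
R = np.zeros((8, 8))
for i, row in enumerate(lower):
    R[i, : i + 1] = row
R = R + R.T - np.diag(np.diag(R))

# Squared multiple correlation of each variable with the remaining seven.
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
print(np.round(smc, 4))
```

This computation needs the inverse of the matrix, which is one reason MLA and IPA require a nonsingular correlation or covariance matrix.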

The canonical correlations are the largest multiple correlations for successive
orthogonal linear combinations of factors with successive orthogonal linear
combinations of variables. These values are comfortably high. If, for other data, some
of the factors have values that are much lower, you might want to request fewer factors.
The loadings and amount of variance explained are similar to those found in the
principal components example. In addition, maximum likelihood reports the
percentage of common variance explained. Common variance is the sum of the
communalities. If A is the unrotated MLA factor pattern matrix, common variance is
the trace of AA'.
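
As a check on this definition, the Python sketch below computes the communalities and the common variance from the unrotated MLA factor pattern printed above, and then the percentage of common variance for the first rotated factor. The variable names are illustrative.

```python
# Unrotated MLA factor pattern A, rows in printed order:
# HEIGHT, ARM_SPAN, LOWERLEG, FOREARM, WEIGHT, BITRO, WIDTH, GIRTH.
A = [
    (0.8797, 0.2375),
    (0.8735, 0.3604),
    (0.8551, 0.2633),
    (0.8458, 0.3442),
    (0.7048, -0.6436),
    (0.5887, -0.5383),
    (0.5743, -0.3653),
    (0.5265, -0.5536),
]

# Diagonal of A times its transpose: one communality per variable.
communalities = [a1 ** 2 + a2 ** 2 for a1, a2 in A]
# Trace of A times its transpose: the common variance.
common_variance = sum(communalities)

# "Variance" explained by the first rotated factor, taken from the output.
rotated_variance_1 = 3.3146
percent_common_1 = 100.0 * rotated_variance_1 / common_variance

print(round(common_variance, 4))
print(round(percent_common_1, 2))
```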

Number of Factors
In this example, we specified two factors to extract. If you were to omit this
specification and rerun the example, SYSTAT adds this report to the output:
The Maximum Number of Factors for Your Data is 4

SYSTAT will also report this message if you request more than four factors for these
data. This result is due to a theorem by Ledermann and indicates that the degrees of
freedom allow estimates of loadings and communalities for only four factors.
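
A minimal Python sketch of this bound follows, assuming the usual degrees-of-freedom formula for the maximum likelihood factor model with p variables and k factors, ((p - k)^2 - (p + k)) / 2. Whether SYSTAT applies exactly this rule is an assumption; the function names are hypothetical.

```python
def ml_factor_df(p, k):
    """Degrees of freedom for a k-factor maximum likelihood model on p variables."""
    return ((p - k) ** 2 - (p + k)) / 2


def max_factors(p):
    """Largest number of factors whose degrees of freedom are not negative."""
    return max(k for k in range(p) if ml_factor_df(p, k) >= 0)


for k in range(1, 6):
    print(k, ml_factor_df(8, k))
print("maximum number of factors:", max_factors(8))
```

For the eight YOUTH variables this gives a maximum of four factors, with two degrees of freedom remaining for the four-factor model.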
If we set the print length to long, SYSTAT reports:
Chi-square Test that the Number of Factors is 4
CSQ = 4.3187 P = 0.1154 DF = 2.00

The results of this chi-square test indicate that you do not reject the hypothesis that
there are four factors (p value > 0.05). Technically, the hypothesis is that no more than
four factors are required. This, of course, does not negate 2 as the right number. For
the YOUTH data, here are rotated loadings for four factors:
Rotated Pattern Matrix ( VARIMAX, Gamma = 1.0000)

                     1         2         3         4
 ARM_SPAN       0.9372    0.1984   -0.2831    0.0465
 LOWERLEG       0.8860    0.2142    0.1878    0.1356
 HEIGHT         0.8776    0.2819    0.1134   -0.0077
 FOREARM        0.8732    0.1957   -0.0851   -0.0065
 WEIGHT         0.2414    0.8830    0.1077    0.1080
 BITRO          0.1823    0.8233    0.0163   -0.0784
 GIRTH          0.1133    0.7315   -0.0048    0.5219
 WIDTH          0.2597    0.6459   -0.1400    0.0819

The loadings for the last two factors do not make sense. Possibly, the fourth factor has
one variable, GIRTH, but it still has a healthier loading on factor 2. This test is based
on an assumption of multivariate normality (as is MLA itself). If not true, then the test
is invalid.
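
The probability reported for the chi-square test above can be verified without a statistics library, because the upper-tail probability of a chi-square variable has a closed form when the degrees of freedom are even. The Python sketch below is an illustration with a hypothetical function name.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail chi-square probability for even degrees of freedom.

    Uses exp(-x/2) * sum_{j < df/2} (x/2)^j / j!, which holds only when df is even.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


# CSQ and DF from the test that the number of factors is 4.
p_value = chi2_upper_tail_even_df(4.3187, 2)
print(round(p_value, 4))
```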

Example 3
Iterated Principal Axis
This example continues with the YOUTH data described in the principal components
example, this time using the IPA (iterated principal axis) method to extract factors. The
input is:
FACTOR
USE youth
MODEL height .. width
ESTIMATE / METHOD=IPA SORT ROTATE=VARIMAX


The output is:


Initial Communality Estimates

                     1         2         3         4         5
                0.8162    0.8493    0.8006    0.7884    0.7488

                     6         7         8
                0.6041    0.5622    0.4778

Iterative Principal Axis Factor Analysis: Convergence = 0.001000.

 Iteration    Maximum Change in
 Number       SQRT(communality)
     1           0.308775
     2           0.039358
     3           0.017077
     4           0.008751
     5           0.004934
     6           0.002923
     7           0.001776
     8           0.001093
     9           0.000677

Final Communality Estimates

                     1         2         3         4         5
                0.8381    0.8887    0.8205    0.8077    0.8880

                     6         7         8
                0.6403    0.5835    0.4921

Latent Roots (Eigenvalues)

                     1         2         3         4         5
                4.4489    1.5100    0.1016    0.0551    0.0150

                     6         7         8
               -0.0374   -0.0602   -0.0743

Factor pattern

                     1         2
 HEIGHT         0.8561    0.3244
 ARM_SPAN       0.8482    0.4114
 LOWERLEG       0.8309    0.3424
 FOREARM        0.8082    0.4090
 WEIGHT         0.7500   -0.5706
 BITRO          0.6307   -0.4924
 WIDTH          0.6074   -0.3509
 GIRTH          0.5688   -0.5098

Variance Explained by Factors

                     1         2
                4.4489    1.5100

Percent of Total Variance Explained

                     1         2
               55.6110   18.8753

Rotated Pattern Matrix ( VARIMAX, Gamma = 1.0000)

                     1         2
 ARM_SPAN       0.9203    0.2045
 FOREARM        0.8874    0.1815
 HEIGHT         0.8724    0.2775
 LOWERLEG       0.8639    0.2478
 WEIGHT         0.2334    0.9130
 BITRO          0.1884    0.7777
 GIRTH          0.1291    0.7529
 WIDTH          0.2581    0.6523

"Variance" Explained by Rotated Factors

                     1         2
                3.3150    2.6439

Percent of Total Variance Explained

                     1         2
               41.4377   33.0485

Percent of Common Variance Explained

                     1         2
               55.6314   44.3686

[Figure: Factor Loadings Plot (FACTOR(2) versus FACTOR(1))]

Before the first iteration, the communality of a variable is its multiple correlation
squared with the remaining variables. At each iteration, communalities are estimated
from the loadings matrix, A, by taking the diagonal of AA', where the number of
columns in A is the number of factors. Iterations continue until the largest change in
any communality is less than that specified with Convergence. Replacing the diagonal
of the correlation (or covariance) matrix with these final communality estimates and
computing the eigenvalues yields the latent roots in the next panel.
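
The iteration just described can be sketched in Python as follows. This is a minimal textbook version of iterated principal axis factoring under stated assumptions (NumPy, squared multiple correlations as starting values, change measured on the square root of the communality, a hypothetical function name); it is not SYSTAT's implementation, so the iteration count and the last decimals may differ from the output above.

```python
import numpy as np


def iterated_principal_axis(R, n_factors, tol=1e-3, max_iter=100):
    """Return loadings, communalities, and the number of iterations used."""
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))  # initial communalities (SMC)
    for iteration in range(1, max_iter + 1):
        reduced = R.copy()
        np.fill_diagonal(reduced, h2)            # communalities on the diagonal
        values, vectors = np.linalg.eigh(reduced)
        top = np.argsort(values)[::-1][:n_factors]
        A = vectors[:, top] * np.sqrt(np.clip(values[top], 0.0, None))
        new_h2 = (A ** 2).sum(axis=1)            # communalities from the loadings
        change = np.max(np.abs(np.sqrt(new_h2) - np.sqrt(h2)))
        h2 = new_h2
        if change < tol:
            break
    return A, h2, iteration


# Lower triangle of the YOUTH correlation matrix, in file order.
lower = [
    [1.000],
    [0.846, 1.000],
    [0.805, 0.881, 1.000],
    [0.859, 0.826, 0.801, 1.000],
    [0.473, 0.376, 0.380, 0.436, 1.000],
    [0.398, 0.326, 0.319, 0.329, 0.762, 1.000],
    [0.301, 0.277, 0.237, 0.327, 0.730, 0.583, 1.000],
    [0.382, 0.415, 0.345, 0.365, 0.629, 0.577, 0.539, 1.000],
]
R = np.zeros((8, 8))
for i, row in enumerate(lower):
    R[i, : i + 1] = row
R = R + R.T - np.diag(np.diag(R))

A, h2, n_iter = iterated_principal_axis(R, n_factors=2)
print(np.round(h2, 4))
print("iterations:", n_iter)
```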

Example 4
Rotation
Let's compare the unrotated and orthogonally rotated loadings from the principal
components example with those from an oblique rotation. The input is:
FACTOR
USE youth
PRINT = LONG
MODEL height .. width
ESTIMATE / METHOD=PCA N=305 SORT
MODEL height .. width
ESTIMATE / METHOD=PCA N=305 SORT ROTATE=VARIMAX
MODEL height .. width
ESTIMATE / METHOD=PCA N=305 SORT ROTATE=OBLIMIN

We focus on the output directly related to the rotations:


Component loadings

                     1         2
 HEIGHT         0.8594    0.3723
 ARM_SPAN       0.8416    0.4410
 LOWERLEG       0.8396    0.3953
 FOREARM        0.8131    0.4586
 WEIGHT         0.7580   -0.5247
 BITRO          0.6742   -0.5333
 WIDTH          0.6706   -0.4185
 GIRTH          0.6172   -0.5801

Variance Explained by Components

                     1         2
                4.6729    1.7710

Percent of Total Variance Explained

                     1         2
               58.4110   22.1373

Rotated Loading Matrix ( VARIMAX, Gamma = 1.0000)

                     1         2
 ARM_SPAN       0.9298    0.1955
 FOREARM        0.9191    0.1638
 HEIGHT         0.8998    0.2599
 LOWERLEG       0.8992    0.2295
 WEIGHT         0.2507    0.8871
 BITRO          0.1806    0.8404
 GIRTH          0.1068    0.8403
 WIDTH          0.2509    0.7496

"Variance" Explained by Rotated Components

                     1         2
                3.4973    2.9465

Percent of Total Variance Explained

                     1         2
               43.7165   36.8318

Rotated Pattern Matrix (OBLIMIN, Gamma = 0.0)

                     1         2
 ARM_SPAN       0.9572   -0.0166
 FOREARM        0.9533   -0.0482
 LOWERLEG       0.9157    0.0276
 HEIGHT         0.9090    0.0604
 WEIGHT         0.0537    0.8975
 GIRTH         -0.0904    0.8821
 BITRO         -0.0107    0.8642
 WIDTH          0.0876    0.7487

"Variance" Explained by Rotated Components

                     1         2
                3.5273    2.9166

Percent of Total Variance Explained

                     1         2
               44.0913   36.4569

Direct and Indirect Contributions of Factors To Variance

                     1         2
 1              3.5087
 2              0.0186    2.8979

Rotated Structure Matrix

                     1         2
 ARM_SPAN       0.9500    0.3962
 FOREARM        0.9325    0.3629
 LOWERLEG       0.9277    0.4225
 HEIGHT         0.9350    0.4523
 WEIGHT         0.4407    0.9206
 GIRTH          0.2900    0.8431
 BITRO          0.3620    0.8596
 WIDTH          0.4104    0.7865

[Figure: Factor Loadings Plots in three panels (No rotation, Varimax, Oblimin), each showing FACTOR(2) versus FACTOR(1)]

The values in Direct and Indirect Contributions of Factors to Variance are useful for
determining if part of a factor's contribution to Variance Explained is due to its
correlation with another factor. Notice that

3.509 + 0.019 = 3.528 (or 3.527)

is the Variance Explained for factor 1, and

2.898 + 0.019 = 2.917

is the Variance Explained for factor 2 (differences in the last digit are due to
rounding error).
Think of the values in the Rotated Structure Matrix as correlations of the variable
with the factors. Here we see that the first four variables are highly correlated with the
first factor. The remaining variables are highly correlated with the second factor.
The factor loading plots illustrate the effects of the rotation methods. While the
unrotated factor loadings form two distinct clusters, they both have strong positive
loadings for factor 1. The lanky variables have moderate positive loadings on factor
2 while the stocky variables have negative loadings on factor 2. With the varimax
rotation, the lanky variables load highly on factor 1 with small loadings on factor 2;
the stocky variables load highly on factor 2. The oblimin rotation does a much better
job of centering each cluster at 0 on its minor factor.
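
To make the effect of an orthogonal rotation concrete, the following Python sketch applies a widely used SVD-based varimax iteration to the unrotated component loadings from the output above. It is a swapped-in textbook algorithm, not the variant of Kaiser's algorithm that SYSTAT uses, and it omits row normalization, so the rotated values, column order, and signs may differ somewhat from the printed Rotated Loading Matrix. NumPy and the function names are assumptions for illustration.

```python
import numpy as np


def varimax(A, gamma=1.0, max_iter=100, tol=1e-8):
    """Rotate loadings A orthogonally; gamma=1.0 gives the varimax criterion."""
    p, k = A.shape
    T = np.eye(k)                      # rotation matrix, starts as the identity
    objective = 0.0
    for _ in range(max_iter):
        L = A @ T
        gradient = A.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0)))
        u, s, vt = np.linalg.svd(gradient)
        T = u @ vt                     # nearest orthogonal matrix to the gradient
        new_objective = s.sum()
        if new_objective < objective * (1.0 + tol):
            break
        objective = new_objective
    return A @ T, T


def varimax_criterion(L):
    """Sum over factors of the variance of the squared loadings."""
    return float(np.sum(np.var(L ** 2, axis=0)))


# Unrotated component loadings, rows in printed order:
# HEIGHT, ARM_SPAN, LOWERLEG, FOREARM, WEIGHT, BITRO, WIDTH, GIRTH.
unrotated = np.array([
    [0.8594, 0.3723],
    [0.8416, 0.4410],
    [0.8396, 0.3953],
    [0.8131, 0.4586],
    [0.7580, -0.5247],
    [0.6742, -0.5333],
    [0.6706, -0.4185],
    [0.6172, -0.5801],
])

rotated, T = varimax(unrotated)
print(np.round(rotated, 4))
print(varimax_criterion(unrotated), varimax_criterion(rotated))
```

An orthogonal rotation leaves each variable's communality (the row sum of squared loadings) unchanged; only the way that variance is divided between the factors changes.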

Example 5
Factor Analysis Using a Covariance Matrix
Jackson (1991) describes a project in which the maximum thrust of ballistic missiles
was measured. For a specific measure called total impulse, it is necessary to calculate
the area under a curve. Originally, a planimeter was used to obtain the area, and later
an electronic device performed the integration directly but unreliably in its early usage.
As data, two strain gauges were attached to each of 40 Nike rockets, and both types of
measurements were recorded in parallel (making four measurements per rocket). The
covariance matrix of the measures is stored in the MISSLES file.
In this example, we illustrate features associated with covariance matrix input
(asymptotic 95% confidence limits for the eigenvalues, estimates of the population
eigenvalues with standard errors, and latent vectors (eigenvectors or characteristic
vectors) with standard errors).

The input is:


FACTOR
USE missles
MODEL integra1 planmtr1 integra2 planmtr2
PRINT = LONG
ESTIMATE / METHOD=PCA COVA N=40

The output is:


Latent Roots (Eigenvalues)

                     1         2         3         4
              335.3355   48.0344   29.3305   16.4096

Empirical upper bound for the first Eigenvalue = 398.0000.

Asymptotic 95% Confidence Limits for the Eigenvalues, N = 40.

Upper Limits:
                     1         2         3         4
              596.9599   85.5102   52.2138   29.2122

Lower Limits:
                     1         2         3         4
              233.1534   33.3975   20.3930   11.4093

Unbiased Estimates of Population Eigenvalues

                     1         2         3         4
              332.6990   46.9298   31.0859   18.3953

Unbiased Estimates of Standard Errors of Eigenvalues

                     1         2         3         4
               74.9460   10.1768    5.7355    3.2528

Chi-Square Test that all Eigenvalues are Equal, N = 40

 CSQ = 110.6871    P = 0.0000    df = 9.00

Latent Vectors (Eigenvectors)

                     1         2         3         4
 INTEGRA1       0.4681    0.6215    0.5716    0.2606
 PLANMTR1       0.6079    0.1788   -0.7595    0.1473
 INTEGRA2       0.4590   -0.1387    0.1677   -0.8614
 PLANMTR2       0.4479   -0.7500    0.2615    0.4104

Standard Error for Each Eigenvector Element

                     1         2         3         4
 INTEGRA1       0.0532    0.1879    0.2106    0.1773
 PLANMTR1       0.0412    0.2456    0.0758    0.2066
 INTEGRA2       0.0342    0.1359    0.2366    0.0519
 PLANMTR2       0.0561    0.1058    0.2633    0.1276

Component loadings

                     1         2         3         4
 INTEGRA1       8.5727    4.3072    3.0954    1.0559
 PLANMTR1      11.1325    1.2389   -4.1131    0.5965
 INTEGRA2       8.4051   -0.9616    0.9084   -3.4893
 PLANMTR2       8.2017   -5.1983    1.4165    1.6625

Variance Explained by Components

                     1         2         3         4
              335.3355   48.0344   29.3305   16.4096

Percent of Total Variance Explained

                     1         2         3         4
               78.1467   11.1940    6.8352    3.8241

Differences: Original Minus Fitted Correlations or Covariances

              INTEGRA1  PLANMTR1  INTEGRA2  PLANMTR2
 INTEGRA1       0.0000
 PLANMTR1       0.0000    0.0000
 INTEGRA2       0.0000    0.0000    0.0000
 PLANMTR2       0.0000    0.0000    0.0000    0.0000

[Figure: Scree Plot (Eigenvalue versus Number of Factors) and Factor Loadings Plot (SPLOM of FACTOR(1) through FACTOR(4))]

SYSTAT performs a test to determine if all eigenvalues are equal. The null hypothesis
is that all eigenvalues are equal against an alternative hypothesis that at least one root
is different. The results here indicate that you reject the null hypothesis (p < 0.00005).
At least one of the eigenvalues differs from the others.
The size and sign of the loadings reflect how the factors and variables are related.
The first factor has fairly similar loadings for all four variables. You can interpret this
factor as an overall average of the area under the curve across the four measures. The
second factor represents gauge differences because the signs are different for each. The
third factor is primarily a comparison between the first planimeter and the first
integration device. The last factor has no simple interpretation.
When there are four or more factors, the Quick Graph of the loadings is a SPLOM.
The first component represents 78% of the variability of the product, so plots of
loadings for factors 2 through 4 convey little information (notice that values in the
stripe displays along the diagonal concentrate around 0, while those for factor 1 fall to
the right).
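
The asymptotic 95% confidence limits printed in this example are reproduced closely by the large-sample formula eigenvalue / (1 ± z * sqrt(2 / N)) with z = 1.96 and N = 40. That SYSTAT uses exactly this formula is an assumption based on the agreement of the numbers. A short Python sketch with a hypothetical function name:

```python
import math


def eigenvalue_limits(eigenvalue, n, z=1.959964):
    """Return (lower, upper) asymptotic confidence limits for one eigenvalue."""
    half_width = z * math.sqrt(2.0 / n)
    return eigenvalue / (1.0 + half_width), eigenvalue / (1.0 - half_width)


# Latent roots of the MISSLES covariance matrix, as printed in the output.
roots = [335.3355, 48.0344, 29.3305, 16.4096]
limits = [eigenvalue_limits(root, n=40) for root in roots]

for root, (lower, upper) in zip(roots, limits):
    print(root, round(lower, 4), round(upper, 4))
```

The formula is meaningful only when z * sqrt(2 / N) is less than 1, that is, for samples that are not very small.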

Example 6
Factor Analysis Using a Rectangular File
Begin this analysis from the OURWORLD cases-by-variables data file. Each case
contains information for one of 57 countries. We will study the interrelations among a
subset of 13 variables including economic measures (gross domestic product per capita
and U.S. dollars spent per person on education, health, and the military), birth and
death rates, population estimates for 1983, 1986, and 1990 plus predictions for 2020,
and the percentages of the population who can read and who live in cities.
We request principal components extraction with an oblique rotation. As a first step,
SYSTAT computes the correlation matrix. Correlations measure linear relations.
However, plots of the economic measures and population values as recorded indicate
a lack of linearity, so you use base 10 logarithms to transform six variables, and you
use square roots to transform two others. The input is:
FACTOR
USE ourworld
LET (gdp_cap, gnp_86, pop_1983, pop_1986, pop_1990, pop_2020),
= L10(@)
LET (mil,educ) = SQR(@)
MODEL urban birth_rt death_rt gdp_cap gnp_86 mil,
educ b_to_d literacy pop_1983 pop_1986,
pop_1990 pop_2020
PRINT=MEDIUM
SAVE pcascore / SCORES
ESTIMATE / METHOD=PCA SORT ROTATE=OBLIMIN

The output is:


Matrix to be factored

                 URBAN  BIRTH_RT  DEATH_RT   GDP_CAP    GNP_86
 URBAN          1.0000
 BIRTH_RT      -0.8002    1.0000
 DEATH_RT      -0.5126    0.5110    1.0000
 GDP_CAP        0.7636   -0.9189   -0.4012    1.0000
 GNP_86         0.7747   -0.8786   -0.4518    0.9736    1.0000
 MIL            0.6453   -0.7547   -0.1482    0.8657    0.8514
 EDUC           0.6238   -0.7528   -0.2151    0.8996    0.9207
 B_TO_D        -0.3074    0.5106   -0.4340   -0.5293   -0.4411
 LITERACY       0.7997   -0.9302   -0.6601    0.8337    0.8404
 POP_1983       0.2133   -0.0836    0.0152    0.0583    0.0090
 POP_1986       0.1898   -0.0523    0.0291    0.0248   -0.0215
 POP_1990       0.1700   -0.0252    0.0284   -0.0015   -0.0447
 POP_2020       0.0054    0.1880    0.0743   -0.2116   -0.2484

                   MIL      EDUC    B_TO_D  LITERACY  POP_1983
 MIL            1.0000
 EDUC           0.8869    1.0000
 B_TO_D        -0.6184   -0.5252    1.0000
 LITERACY       0.6421    0.6869   -0.2737    1.0000
 POP_1983       0.2206   -0.0062   -0.1526   -0.0050    1.0000
 POP_1986       0.1942   -0.0306   -0.1358   -0.0327    0.9984
 POP_1990       0.1727   -0.0513   -0.1070   -0.0534    0.9966
 POP_2020      -0.0339   -0.2555    0.0617   -0.2360    0.9531

              POP_1986  POP_1990  POP_2020
 POP_1986       1.0000
 POP_1990       0.9992    1.0000
 POP_2020       0.9605    0.9673    1.0000

Latent Roots (Eigenvalues)

                     1         2         3         4         5
                6.3950    4.0165    1.6557    0.4327    0.2390

                     6         7         8         9        10
                0.0966    0.0812    0.0403    0.0251    0.0110

                    11        12        13
                0.0054    0.0012    0.0002

Empirical upper bound for the first Eigenvalue = 7.4817.

Chi-Square Test that all Eigenvalues are Equal, N = 49

 CSQ = 1542.2903    P = 0.0000    df = 78.00

Chi-Square Test that the Last 10 Eigenvalues Are Equal

 CSQ = 636.4350    P = 0.0000    df = 59.89

Component loadings

                     1         2         3
 GDP_CAP        0.9769   -0.0366   -0.0606
 GNP_86         0.9703   -0.0846    0.0040
 BIRTH_RT      -0.9512    0.0136   -0.0774
 LITERACY       0.8972   -0.1008    0.3004
 EDUC           0.8927   -0.0857   -0.2296
 MIL            0.8770    0.1501   -0.2909
 URBAN          0.8393    0.1425    0.2300
 B_TO_D        -0.5166   -0.1225    0.7762
 POP_1990       0.0382    0.9972    0.0394
 POP_1986       0.0636    0.9966    0.0253
 POP_1983       0.0945    0.9940    0.0248
 POP_2020      -0.1796    0.9748    0.1002
 DEATH_RT      -0.4533    0.0820   -0.8662

Variance Explained by Components

                     1         2         3
                6.3950    4.0165    1.6557

Percent of Total Variance Explained

                     1         2         3
               49.1924   30.8964   12.7361

Rotated Pattern Matrix (OBLIMIN, Gamma = 0.0)

                     1         2         3
 GDP_CAP        0.9779   -0.0399    0.0523
 GNP_86         0.9714   -0.0816   -0.0146
 BIRTH_RT      -0.9506    0.0040    0.0843
 EDUC           0.8961   -0.1049    0.2194
 LITERACY       0.8956   -0.0700   -0.3112
 MIL            0.8777    0.1242    0.2924
 URBAN          0.8349    0.1658   -0.2285
 B_TO_D        -0.5224   -0.0501   -0.7787
 POP_1990       0.0236    0.9977    0.0095
 POP_1986       0.0491    0.9958    0.0234
 POP_1983       0.0801    0.9932    0.0235
 POP_2020      -0.1945    0.9805   -0.0510
 DEATH_RT      -0.4459   -0.0011    0.8730

"Variance" Explained by Rotated Components

                     1         2         3
                6.3946    4.0057    1.6669

Percent of Total Variance Explained

                     1         2         3
               49.1895   30.8129   12.8225

Correlations among Oblique Factors or Components

                     1         2         3
 1              1.0000
 2              0.0127    1.0000
 3             -0.0020    0.0452    1.0000

[Figure: Factor Loadings Plot of the 13 variables]

By default, SYSTAT extracts three factors because three eigenvalues are greater than
1.0. On factor 1, seven or eight variables have high loadings. The eighth, B_TO_D (a
ratio of birth-to-death rate), has a higher loading on factor 3. With the exception of
BIRTH_RT, the other variables are economic measures, so let's identify this as the
"economic" factor. Clearly, the second factor can be named "population," and the
third, less clearly, "death rates."
The economic and population factors account for 80% (49.19 + 30.81) of the total
variance, so a plot of the scores for these factors should be useful for characterizing
differences among the countries. The third factor accounts for 13% of the total
variance, a much smaller amount than the other two factors. Notice, too, that only 7%
of the total variance is not accounted for by these three factors.

Revisiting the Correlation Matrix


Let's examine the correlation matrix for these variables. In an effort to group the
variables contributing to each factor, we order the variables according to their factor
loadings for the factor on which they load the highest. The input is:
CORR
USE ourworld
LET (gdp_cap, gnp_86, pop_1983, pop_1986, pop_1990,
pop_2020) = L10(@)
LET (mil,educ) = SQR(@)
PEARSON gdp_cap gnp_86 birth_rt educ literacy mil urban ,
pop_1990 pop_1986 pop_1983 pop_2020 b_to_d death_rt

The resulting matrix is:


Pearson correlation matrix

            GDP_CAP   GNP_86  BIRTH_RT     EDUC  LITERACY      MIL    URBAN
 GDP_CAP      1.000
 GNP_86       0.974    1.000
 BIRTH_RT    -0.919   -0.879     1.000
 EDUC         0.900    0.921    -0.753    1.000
 LITERACY     0.834    0.840    -0.930    0.687     1.000
 MIL          0.866    0.851    -0.755    0.887     0.642    1.000
 URBAN        0.764    0.775    -0.800    0.624     0.800    0.645    1.000
 ---------------------------------------------------------------------------
 POP_1990    -0.002   -0.045    -0.025   -0.051    -0.053    0.173    0.170
 POP_1986     0.025   -0.021    -0.052   -0.031    -0.033    0.194    0.190
 POP_1983     0.058    0.009    -0.084   -0.006    -0.005    0.221    0.213
 POP_2020    -0.212   -0.248     0.188   -0.255    -0.236   -0.034    0.005
 B_TO_D      -0.529   -0.441     0.511   -0.525    -0.274   -0.618   -0.307
 DEATH_RT    -0.401   -0.452     0.511   -0.215    -0.660   -0.148   -0.513

           POP_1990 POP_1986  POP_1983 POP_2020
 POP_1990     1.000
 POP_1986     0.999    1.000
 POP_1983     0.997    0.998     1.000
 POP_2020     0.967    0.960     0.953    1.000
 ----------------------------------------------
 B_TO_D      -0.107   -0.136    -0.153    0.062
 DEATH_RT     0.028    0.029     0.015    0.074

             B_TO_D DEATH_RT
 B_TO_D       1.000
 DEATH_RT    -0.434    1.000

Use an editor to insert the dotted lines.


The top triangle of the matrix shows the correlations of the variables within the
"economic" factor. BIRTH_RT has strong negative correlations with the other
variables. Correlations of the population variables with the economic variables are
displayed in the four rows below this top portion, and correlations of the death rates
variables with the economic variables are in the next two rows. Correlations within the
population factor are displayed in the top triangle of the bottom panel. The correlation
between the variables in factor 3 (B_TO_D and DEATH_RT) is -0.434 and is smaller
in magnitude than any of the other within-factor correlations.

Factor Scores
Look at the scores just stored in PCASCORE. First, merge the name of each country
and the grouping variable GROUP$ with the scores. The values of GROUP$ identify
each country as Europe, Islamic, or New World. Next, plot factor 2 against factor 1
(labeling points with country names) and factor 3 against factor 1 (labeling points with
the first letter of their group membership). Finally, use SPLOMs to display the scores,
adding 75% confidence ellipses for each subgroup in the plots and normal curves for
the univariate distributions. Repeat the latter using kernel density estimators.

The input is:


MERGE "C:\SYSTAT\PCASCORE.SYD"(FACTOR(1) FACTOR(2) FACTOR(3)),
      "C:\SYSTAT\DATA\OURWORLD.SYD"(GROUP$ COUNTRY$)
PLOT FACTOR(2)*FACTOR(1) / XLABEL='Economic' ,
     YLABEL='Population' SYMBOL=4,2,3,
     SIZE= 1.250 LABEL=COUNTRY$ CSIZE=1.250
PLOT FACTOR(3)*FACTOR(1) / XLABEL='Economic' ,
     YLABEL='Death Rate' COLOR=2,1,10,
     SYMBOL=GROUP$ SIZE= 1.250 ,1.250 ,1.250
SPLOM FACTOR(1) FACTOR(2) FACTOR(3)/ GROUP=GROUP$ OVERLAY,
      DENSITY=NORMAL ELL =0.750,
      COLOR=2,1,10 SYMBOL=4,2,3,
      DASH=1,1,4
SPLOM FACTOR(1) FACTOR(2) FACTOR(3)/ GROUP=GROUP$ OVERLAY,
      DENSITY=KERNEL COLOR=2,1,10,
      SYMBOL=4,2,3 DASH=1,1,4

The output is:

[Figure: Plot of Population (factor 2) versus Economic (factor 1) with points labeled by country name; plot of Death Rate (factor 3) versus Economic (factor 1) with points labeled by the first letter of the group (E, I, N); two SPLOMs of FACTOR(1), FACTOR(2), and FACTOR(3) grouped by GROUP$ (NewWorld, Islamic, Europe)]

High scores on the economic factor identify countries that are strong economically
(Germany, Canada, Netherlands, Sweden, Switzerland, Denmark, and Norway)
relative to those with low scores (Bangladesh, Ethiopia, Mali, and Gambia). Not
surprisingly, the population factor identifies Barbados as the smallest and Bangladesh,
Pakistan, and Brazil as largest. The questionable third factor (death rate) does help to
separate the New World countries from the others.
In each SPLOM, the dashed lines marking curves, ellipses, and kernel contours
identify New World countries. The kernel contours in the plot of factor 3 against factor
1 identify a pocket of Islamic countries within the New World group.

Computation
Algorithms
Provisional methods are used for computing covariance or correlation matrices (see
Correlations for references). Components are computed by using a Householder
tridiagonalization and implicit QL iterations. Rotations are computed with a variant of
Kaiser's iterative algorithm, described in Mulaik (1972).

Missing Data
Ordinarily, Factor Analysis and other multivariate procedures delete all cases having
missing values on any variable selected for analysis. This is listwise deletion. For data
with many missing values, you may end up with too few complete cases for analysis.
Select Pairwise deletion if you want covariances or correlations computed separately
for each pair of variables selected for analysis. Pairwise deletion takes more time than
the standard listwise deletion because all possible pairs of variances and covariances
are computed. The same option is offered for Correlations, should you decide to create
a symmetric matrix for use in factor analysis that way. Also note that Correlations
provides an EM algorithm for estimating correlation or covariance matrices when data
are missing.
Be careful. When you use pairwise deletion, you can end up with negative
eigenvalues for principal components or be unable to compute common factors at all.
With either method, it is desirable that the pattern of missing data be random.
Otherwise, the factor structure you compute will be influenced systematically by the
pattern of how values are missing.
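
The difference between the two deletion rules can be sketched in a few lines of Python. This is an illustration only, not SYSTAT code: the data set, the variable names A, B, and C, and the helper functions are all hypothetical.

```python
import math

# Hypothetical data set (not from the manual); None marks a missing value.
data = {
    "A": [1.0, 2.0, 3.0, 4.0, None, 6.0],
    "B": [2.0, None, 6.5, 8.0, 10.0, 12.5],
    "C": [5.0, 3.0, None, 2.0, 1.0, 0.5],
}
names = list(data)
n_cases = len(data["A"])


def complete_cases(variables):
    """Indices of cases with no missing value on any of the given variables."""
    return [i for i in range(n_cases)
            if all(data[v][i] is not None for v in variables)]


def pearson(x, y):
    """Pearson correlation of two equally long, fully observed lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Listwise deletion: one set of complete cases is shared by every correlation.
listwise_rows = complete_cases(names)

# Pairwise deletion: each correlation uses the cases complete on that pair only.
pairwise_n = {}
pairwise_r = {}
for i, u in enumerate(names):
    for v in names[i + 1:]:
        rows = complete_cases([u, v])
        pairwise_n[(u, v)] = len(rows)
        pairwise_r[(u, v)] = pearson([data[u][k] for k in rows],
                                     [data[v][k] for k in rows])

print("listwise cases:", len(listwise_rows))
for pair, r in pairwise_r.items():
    print(pair, "n =", pairwise_n[pair], "r =", round(r, 3))
```

In this toy example listwise deletion keeps only three of the six cases, while each pairwise correlation rests on four. Because every entry of a pairwise matrix can come from a different subset of cases, the matrix need not be positive semidefinite, which is where the negative eigenvalues mentioned above come from.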


References
Afifi, A. A. and Clark, V. (1984). Computer-aided multivariate analysis. Belmont, Calif.:
Lifetime Learning Publications.
Clarkson, D. B. and Jennrich, R. I. (1988). Quartic rotation criteria and algorithms.
Psychometrika, 53, 251-259.
Dixon, W. J. et al. (1985). BMDP statistical software manual. Berkeley: University of
California Press.
Gnanadesikan, R. (1977). Methods for statistical data analysis of multivariate
observations. New York: John Wiley & Sons, Inc.
Harman, H. H. (1976). Modern factor analysis, 3rd ed. Chicago: University of Chicago
Press.
Jackson, J. E. (1991). A user's guide to principal components. New York: John Wiley &
Sons, Inc.
Jennrich, R. I. and Robinson, S. M. (1969). A Newton-Raphson algorithm for maximum
likelihood factor analysis. Psychometrika, 34, 111-123.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate analysis. London:
Academic Press.
Morrison, D. F. (1976). Multivariate statistical methods, 2nd ed. New York: McGraw-Hill.
Mulaik, S. A. (1972). The foundations of factor analysis. New York: McGraw-Hill.
Rozeboom, W. W. (1982). The determinacy of common factors in large item domains.
Psychometrika, 47, 281-295.
Steiger, J. H. (1979). Factor indeterminacy in the 1930s and 1970s: some interesting
parallels. Psychometrika, 44, 157-167.

Chapter 13
Linear Models

Each chapter in this manual normally has its own statistical background section. In
this part, however, Regression, ANOVA, and General Linear Models are grouped
together. There are two reasons for doing this. First, while some introductory
textbooks treat regression and analysis of variance as distinct, statisticians know that
they are based on the same underlying mathematical model. When you study what
these procedures do, therefore, it is helpful to understand that model and learn the
common terminology underlying each method. Second, although SYSTAT has three
commands (REGRESS, ANOVA, and GLM) and menu settings, it is a not-so-well-guarded
secret that these all lead to the same program, originally called MGLH (for
Multivariate General Linear Hypothesis). Having them organized this way means that
SYSTAT can use tools designed for one approach (for example, dummy variables in
ANOVA) in another (such as computing within-group correlations in multivariate
regression). This synergy is not usually available in packages that treat these models
independently.

Simple Linear Models


Linear models are models based on lines. More generally, they are based on linear
surfaces, such as lines, planes, and hyperplanes. Linear models are widely applied
because lines and planes often appear to describe well the relations among variables
measured in the real world. We will begin by examining the equation for a straight
line, and then move to more complex linear models.


Equation for a Line


A linear model looks like this:

y = a + bx
This is the equation for a straight line that you learned in school. The quantities in this
equation are:
y    a dependent variable
x    an independent variable

Variables are quantities that can vary (have different numerical values) in the same
equation. The remaining quantities are called parameters. A parameter is a quantity
that is constant in a particular equation, but that can be varied to produce other
equations in the same general family. The parameters are:
a    The value of y when x is 0. This is sometimes called a y-intercept (where a line
     intersects the y axis in a graph when x is 0).
b    The slope of the line, or the number of units y changes when x changes by one unit.

Let's look at an example. Here are some data showing the yearly earnings a partner
should theoretically get in a certain large law firm, based on annual personal billings
over quota (both in thousands of dollars):
EARNINGS   BILLINGS
      60         20
      70         40
      80         60
      90         80
     100        100
     120        140
     140        180
     150        200
     175        250
     190        280


We can plot these data with EARNINGS on the vertical axis (dependent variable) and
BILLINGS on the horizontal (independent variable). Notice in the following figure that
all the points lie on a straight line.

What is the equation for this line? Look at the vertical axis value on the sloped line
where the independent variable has a value of 0. Its value is 50. A lawyer is paid
$50,000 even when billing nothing. Thus, a is 50 in our equation. What is b? Notice
that the line rises by $10,000 when billings change by $20,000. The line rises half as
fast as it runs. You can also look at the data and see that the earnings change by $1 as
billing changes by $2. Thus, b is 0.5, or a half, in our equation.
Why bother with all these calculations? We could use the table to determine a
lawyer's compensation, but the formula and the line graph allow us to determine wages
not found in the table. For example, we now know that $30,000 in billings would yield
earnings of $65,000:

EARNINGS = 50000 + 0.5 × 30000 = 65000


When we do this, however, we must be sure that we can use the same equation on these
new values. We must be careful when interpolating, or estimating, wages for billings
between the ones we have been given. Does it make sense to compute earnings for
$25,000 in billings, for example? It probably does. Similarly, we must be careful when
extrapolating, or estimating from units outside the domain of values we have been
given. What about negative billings, for example? Would we want to pay an
embezzler? Be careful. Equations and graphs usually are meaningful only within or
close to the range of y values and domain of x values in the data.
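
As a small illustration (in Python rather than SYSTAT), the line can be written as a function and checked against the table. The function names and the domain check are assumptions made for this sketch; the parameter values 50 and 0.5 and the data are the ones given above, in thousands of dollars.

```python
# Parameters read off the graph in the text; units are thousands of dollars.
A = 50.0   # y-intercept: earnings when billings are 0
B = 0.5    # slope: change in earnings per unit change in billings

billings_table = [20, 40, 60, 80, 100, 140, 180, 200, 250, 280]
earnings_table = [60, 70, 80, 90, 100, 120, 140, 150, 175, 190]


def earnings(billings):
    """Earnings predicted by the straight line y = a + bx."""
    return A + B * billings


def within_data_domain(billings):
    """True when a billings value lies inside the domain of the table."""
    return min(billings_table) <= billings <= max(billings_table)


# Every point in the table lies exactly on the line.
reproduces_table = all(earnings(x) == y
                       for x, y in zip(billings_table, earnings_table))

print(reproduces_table)                                  # True
print(earnings(30))                                      # 65.0
print(within_data_domain(30), within_data_domain(-10))   # True False
```

The domain check is only a reminder of the caution above: the function will happily return a value for negative billings, but nothing in the data supports using it there.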


Regression
Data are seldom this clean unless we design them to be that way. Law firms typically
fine-tune their partners' earnings according to many factors. Here are the real billings
and earnings for our law firm (these lawyers predate Reagan, Bush, Clinton, and
Gates):
EARNINGS   BILLINGS
      86         20
      67         40
      95         60
     105         80
      86        100
      82        140
     140        180
     145        200
     144        250
     184        280

Our techniques for computing a linear equation won't work with these data. Look at
the following graph. There is no way to draw a straight line through all the data.

Given the irregularities in our data, the line drawn in the figure is a compromise. How
do we find a best fitting line? If we are interested in predicting earnings from the billing
data values rather well, a reasonable method would be to place a line through the points
so that the vertical deviations between the points and the line (errors in predicting
earnings) are as small as possible. In other words, these deviations (absolute
discrepancies, or residuals) should be small, on average, for a good-fitting line.
The procedure of fitting a line or curve to data such that residuals on the dependent
variable are minimized in some way is called regression. Because we are minimizing
vertical deviations, the regression line often appears to be more horizontal than we
might place it by eye, especially when the points are fairly scattered. It regresses
toward the mean value of y across all the values of x, namely, a horizontal line through
the middle of all the points. The regression line is not intended to pass through as many
points as possible. It is for predicting the dependent variable as accurately as possible,
given each value of the independent variable.

Least Squares
There are several ways to draw the line so that, on average, the deviations are small.
We could minimize the mean, the median, or some other measure of the typical
behavior of the absolute values of the residuals. Or we can minimize the sum (or mean)
of the squared residuals, which yields almost the same line in most cases. Using
squared instead of absolute residuals gives more influence to points whose y value is
farther from the average of all y values. This is not always desirable, but it makes the
mathematics simpler. This method is called ordinary least squares.
By specifying EARNINGS as the dependent variable and BILLINGS as the
independent variable in a MODEL statement, we can compute the ordinary
least-squares regression y-intercept as about $62,800 and the slope as 0.375. These
values do not predict any single lawyer's earnings exactly. They describe the whole
firm well, in the sense that, on the average, the line predicts a given earnings value
fairly closely from a given billings value.
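
For readers who want to see the arithmetic, the following Python sketch computes the least-squares line from the textbook closed-form formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x). It is not SYSTAT's algorithm, and the function name is an assumption; the data are the ten lawyers listed above, in thousands of dollars.

```python
billings = [20, 40, 60, 80, 100, 140, 180, 200, 250, 280]
earnings = [86, 67, 95, 105, 86, 82, 140, 145, 144, 184]


def least_squares(x, y):
    """Return (intercept, slope) minimizing the sum of squared vertical deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = least_squares(billings, earnings)
print(round(intercept, 3), round(slope, 3))
```

The result agrees with the coefficients reported in the output shown later in this chapter (62.838 and 0.375).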

Estimation and Inference


We often want to do more with such data than draw a line on a picture. In order to
generalize, formulate a policy, or test a hypothesis, we need to make an inference.
Making an inference implies that we think a model describes a more general
population from which our data have been randomly sampled. In the present example,
this population is all possible lawyers who might work for this firm. To make an
inference about compensation, we need to construct a linear model for our population
that includes a parameter for random error. In addition, we need to change our notation
to avoid confusion later. We are going to use Greek to denote parameters and italic
Roman letters for variables. The error parameter is usually called ε.

y = α + βx + ε

Notice that ε is a random variable. It varies like any other variable (for example, x),
but it varies randomly, like the tossing of a coin. Since ε is random, our model forces
y to be random as well because adding fixed values (α and βx) to a random variable
produces another random variable. In ordinary language, we are saying with our model
that earnings are only partly predictable from billings. They vary slightly according to
many other factors, which we assume are random.
We do not know all of the factors governing the firm's compensation decisions, but
we assume:
• All the salaries are derived from the same linear model.
• The error in predicting a particular salary from billings using the model is
  independent of (not in any way predictable from) the error in predicting other
  salaries.
• The errors in predicting all the salaries come from the same random distribution.

Our model for predicting in our population contains parameters, but unlike our perfect
straight line example, we cannot compute these parameters directly from the data. The
data we have are only a small sample from a much larger population, so we can only
estimate the parameter values using some statistical method on our sample data. Those
of you who have heard this story before may not be surprised that ordinary least
squares is one reasonable method for estimating parameters when our three
assumptions are appropriate. Without going into all the details, we can be reasonably
assured that if our population assumptions are true and if we randomly sample some
cases (that is, each case has an equal chance of being picked) from the population, the
least-squares estimates of α and β will, on average, be close to their values in the
population.
So far, we have done what seems like a sleight of hand. We delved into some
abstruse language and came up with the same least-squares values for the slope and
intercept as before. There is something new, however. We have now added conditions
that define our least-squares values as sample estimates of population values. We now
regard our sample data as one instance of many possible samples. Our compensation
model is like Plato's cave metaphor; we think it typifies how this law firm makes
compensation decisions about any lawyer, not just the ones we sampled. Before, we
were computing descriptive statistics about a sample. Now, we are computing
inferential statistics about a population.
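
The idea of "one instance of many possible samples" can be made concrete with a small simulation. The Python sketch below is purely illustrative: the "population" values ALPHA, BETA, and SIGMA are assumptions chosen to resemble the law firm example, the errors are drawn from a normal distribution for convenience, and the billings values are held fixed from sample to sample.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Assumed population parameters (hypothetical, in thousands of dollars).
ALPHA, BETA, SIGMA = 62.8, 0.375, 17.6
billings = [20, 40, 60, 80, 100, 140, 180, 200, 250, 280]


def least_squares(x, y):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercepts, slopes = [], []
for _ in range(2000):
    # Draw a new sample of earnings from the population model y = alpha + beta*x + error.
    sample = [ALPHA + BETA * x + random.gauss(0, SIGMA) for x in billings]
    a, b = least_squares(billings, sample)
    intercepts.append(a)
    slopes.append(b)

mean_intercept = sum(intercepts) / len(intercepts)
mean_slope = sum(slopes) / len(slopes)
print(round(mean_intercept, 2), round(mean_slope, 3))
```

Individual samples give estimates that scatter around the population values, but their averages over many samples land close to ALPHA and BETA, which is what "on average, close to their values in the population" means.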


Standard Errors
There are several statistics relevant to the estimation of α and β. Perhaps most
important is a measure of how variable we could expect our estimates to be if we
continued to sample data from our population and used least squares to get our
estimates. A statistic calculated by SYSTAT shows what we could expect this variation
to be. It is called, appropriately, the standard error of estimate, or Std Error in the
output. The standard error of the y-intercept, or regression constant, is in the first row
of the coefficients: 10.440. The standard error of the billing coefficient or slope is
0.065. Look for these numbers in the following output:
Dep Var: EARNINGS   N: 10   Multiple R: 0.897   Squared multiple R: 0.804

Adjusted squared multiple R: 0.779   Standard error of estimate: 17.626

Effect      Coefficient   Std Error   Std Coef   Tolerance       t   P(2 Tail)
CONSTANT         62.838      10.440      0.0             .   6.019       0.000
BILLINGS          0.375       0.065      0.897       1.000   5.728       0.000

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square   F-ratio       P
Regression        10191.109    1     10191.109    32.805   0.000
Residual           2485.291    8       310.661
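
The standard errors and t values in this output can be reproduced from the standard simple-regression formulas. The Python sketch below is an illustration under that assumption, not a description of SYSTAT's internal computations; variable names are hypothetical.

```python
import math

billings = [20, 40, 60, 80, 100, 140, 180, 200, 250, 280]
earnings = [86, 67, 95, 105, 86, 82, 140, 145, 144, 184]

n = len(billings)
mean_x = sum(billings) / n
mean_y = sum(earnings) / n
sxx = sum((x - mean_x) ** 2 for x in billings)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(billings, earnings))

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual sum of squares and mean square (n - 2 degrees of freedom).
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(billings, earnings))
mse = sse / (n - 2)

se_estimate = math.sqrt(mse)                                # "Standard error of estimate"
se_slope = math.sqrt(mse / sxx)                             # Std Error for BILLINGS
se_intercept = math.sqrt(mse * (1 / n + mean_x ** 2 / sxx)) # Std Error for CONSTANT

t_slope = slope / se_slope
t_intercept = intercept / se_intercept

print(round(se_estimate, 3), round(se_intercept, 3), round(se_slope, 3))
print(round(t_intercept, 3), round(t_slope, 3))
```

Each t value is simply a coefficient divided by its standard error, which is why a small standard error relative to the coefficient produces a large t.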

Hypothesis Testing
From these standard errors, we can construct hypothesis tests on these coefficients.
Suppose a skeptic approached us and said, "Your estimates look as if something is
going on here, but in this firm, salaries have nothing to do with billings. You just
happened to pick a sample that gives the impression that billings matter. It was the luck
of the draw that provided you with such a misleading picture. In reality, β is 0 in the
population because billings play no role in determining earnings."
We can reply, "If salaries had nothing to do with billings but are really just a mean
value plus random error for any billing level, then would it be likely for us to find a
coefficient estimate for β at least this different from 0 in a sample of 10 lawyers?"
To represent these alternatives as a bet between us and the skeptic, we must agree
on some critical level for deciding who will win the bet. If the likelihood of a sample
result at least this extreme occurring by chance is less than or equal to this critical level
(say, five times out of a hundred), we win; otherwise, the skeptic wins.
This logic might seem odd at first because, in almost every case, our skeptic's null
hypothesis would appear ridiculous, and our alternative hypothesis (that the skeptic is
wrong) seems plausible. Two scenarios are relevant here, however. The first is the
lawyer's. We are trying to make a case here. The only way we will prevail is if we
convince our skeptical jury beyond a reasonable doubt. In statistical practice, that
reasonable doubt level is relatively liberal: fewer than five times in a hundred. The
second scenario is the scientist's. We are going to stake our reputation on our model.
If someone sampled new data and failed to find nonzero coefficients, much less
coefficients similar to ours, few would pay attention to us in the future.
To compute probabilities, we must count all possibilities or refer to a mathematical
probability distribution that approximates these possibilities well. The most widely
used approximation is the normal curve, which we reviewed briefly in Chapter 1. For
large samples, the regression coefficients will tend to be normally distributed under the
assumptions we made above. To allow for smaller samples, however, we will add the
following condition to our list of assumptions:
• The errors in predicting the salaries come from a normal distribution.

If we estimate the standard errors of the regression coefficients from the data instead
of knowing them in advance, then we should use the t distribution instead of the
normal. The two-tail value for the probability represents the area under the theoretical
t probability curve corresponding to coefficient estimates whose absolute values are
more extreme than the ones we obtained. For both parameters in the model of lawyers'
earnings, these values (given as P(2 Tail)) are less than 0.001, leading us to reject our
null hypothesis at well below the 0.05 level.
At the bottom of our output, we get an analysis of variance table that tests the
goodness of fit of our entire model. The null hypothesis corresponding to the F ratio
(32.805) and its associated p value is that the billing variable coefficient is equal to 0.
This test overwhelmingly rejects the null hypothesis that β is 0.
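
The skeptic's "luck of the draw" argument can also be examined by brute force. The sketch below uses a permutation test, which is a different technique from the t and F tests in the output: it shuffles the earnings values among the lawyers (so that, by construction, earnings have nothing to do with billings) and counts how often a slope at least as extreme as the observed one turns up. The number of shuffles and the seed are arbitrary choices for this illustration.

```python
import random

billings = [20, 40, 60, 80, 100, 140, 180, 200, 250, 280]
earnings = [86, 67, 95, 105, 86, 82, 140, 145, 144, 184]


def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


observed = slope(billings, earnings)

random.seed(0)          # fixed seed so the sketch is repeatable
n_shuffles = 5000
shuffled = earnings[:]  # copy, so the original data are left untouched
count = 0
for _ in range(n_shuffles):
    random.shuffle(shuffled)
    # Two-tailed: count slopes at least as far from 0 as the observed slope.
    if abs(slope(billings, shuffled)) >= abs(observed):
        count += 1

# Add 1 to numerator and denominator so the estimate is never exactly 0.
p_value = (count + 1) / (n_shuffles + 1)
print(round(observed, 3), p_value)
```

If the estimated probability is below the agreed critical level (0.05 in the text), we win the bet. The permutation approach does not rely on the normality assumption, but it answers the same question the two-tail probability answers.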

Multiple Correlation
In the same output is a statistic called the squared multiple correlation. This is the
proportion of the total variation in the dependent variable (EARNINGS) accounted for by
the linear prediction using BILLINGS. The value here (0.804) tells us that approximately
80% of the variation in earnings can be accounted for by a linear prediction from billings.
The rest of the variation, as far as this model is concerned, is random error. The square
root of this statistic is called, not surprisingly, the multiple correlation. The adjusted
squared multiple correlation (0.779) is what we would expect the squared multiple
correlation to be if we used the model we just estimated on a new sample of 10 lawyers
in the firm. It is smaller than the squared multiple correlation because the coefficients
were optimized for this sample rather than for the new one.
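
The following Python sketch shows how these summary statistics relate to the sums of squares in the analysis of variance table. It assumes the usual definitions (squared multiple R = 1 − SSE/SST, and the adjustment factor (N − 1)/(N − p) with p = 2 estimated coefficients); the variable names are hypothetical.

```python
import math

billings = [20, 40, 60, 80, 100, 140, 180, 200, 250, 280]
earnings = [86, 67, 95, 105, 86, 82, 140, 145, 144, 184]

n = len(billings)
p = 2  # estimated coefficients: constant and slope
mean_x = sum(billings) / n
mean_y = sum(earnings) / n
sxx = sum((x - mean_x) ** 2 for x in billings)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(billings, earnings))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

sst = sum((y - mean_y) ** 2 for y in earnings)            # total variation
sse = sum((y - (intercept + slope * x)) ** 2
          for x, y in zip(billings, earnings))            # residual variation
ssr = sst - sse                                           # variation explained

r_squared = ssr / sst
multiple_r = math.sqrt(r_squared)
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p)
f_ratio = (ssr / (p - 1)) / (sse / (n - p))

print(round(r_squared, 3), round(multiple_r, 3),
      round(adjusted_r_squared, 3), round(f_ratio, 3))
```

With a single predictor, the F ratio is the square of the t value for the slope, which is why the two tests agree.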


Regression Diagnostics
We do not need to understand the mathematics of how a line is fitted in order to use
regression. You can fit a line to any x-y data by the method of least squares. The
computer doesn't care where the numbers come from. To have a model and estimates
that mean something, however, you should be sure the assumptions are reasonable and
that the sample data appear to be sampled from a population that meets the
assumptions.
The sample analogues of the errors in the population model are the residuals: the
differences between the observed and predicted values of the dependent variable.
There are many diagnostics you can perform on the residuals. Here are the most
important ones:

The errors are normally distributed. Draw a normal probability plot (PPLOT) of the
residuals.

[Figure: normal probability plot, with RESIDUAL on the horizontal axis and Expected Value for Normal Distribution on the vertical axis.]

The residuals should fall approximately on a diagonal straight line in this plot. When
the sample size is small, as in our law example, the line may be quite jagged. It is
difficult to tell by any method whether a small sample is from a normal population. You
can also plot a histogram or stem-and-leaf diagram of the residuals to see if they are
lumpy in the middle with thin, symmetric tails.
The errors have constant variance. Plot the residuals against the estimated values. The
following plot shows studentized residuals (STUDENT) against estimated values
(ESTIMATE). Studentized residuals are the true external kind discussed in Velleman
and Welsch (1981). Use these statistics to identify outliers in the dependent variable
space. Under normal regression assumptions, they have a t distribution with
(N − p − 1) degrees of freedom, where N is the total sample size and p is the number
of predictors (including the constant). Large values (greater than 2 or 3 in absolute
magnitude) indicate possible problems.
[Figure: studentized residuals (STUDENT) plotted against estimated values (ESTIMATE).]

Our residuals should be arranged in a horizontal band within two or three units around
0 in this plot. Again, since there are so few observations, it is difficult to tell whether
they violate this assumption in this case. There is only one particularly large residual,
and it is toward the middle of the values. This lawyer billed $140,000 and is earning
only $82,000. He or she might have a gripe about supporting a higher share of the
firm's overhead.
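
A minimal Python sketch of externally studentized residuals for the law firm data follows. It assumes the standard deletion formula, in which each residual is scaled by an error estimate computed without that case; it is an illustration, not SYSTAT's code, and the variable names are hypothetical.

```python
import math

billings = [20, 40, 60, 80, 100, 140, 180, 200, 250, 280]
earnings = [86, 67, 95, 105, 86, 82, 140, 145, 144, 184]

n = len(billings)
p = 2  # number of estimated coefficients, including the constant
mean_x = sum(billings) / n
mean_y = sum(earnings) / n
sxx = sum((x - mean_x) ** 2 for x in billings)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(billings, earnings))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

residuals = [y - (intercept + slope * x) for x, y in zip(billings, earnings)]
sse = sum(e * e for e in residuals)

# Leverage of each case in a simple regression.
leverage = [1 / n + (x - mean_x) ** 2 / sxx for x in billings]

student = []
for e, h in zip(residuals, leverage):
    # Error variance estimated with this case deleted (n - p - 1 degrees of freedom).
    s2_deleted = (sse - e * e / (1 - h)) / (n - p - 1)
    student.append(e / math.sqrt(s2_deleted * (1 - h)))

worst = max(range(n), key=lambda i: abs(student[i]))
print(billings[worst], earnings[worst], round(student[worst], 2))
```

The case that stands out is the one described above: billings of 140 with earnings well below what the line predicts.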
The errors are independent. Several plots can be done. Look at the plot of residuals
against estimated values above. Make sure that the residuals are randomly scattered
above and below the 0 horizontal and that they do not track in a snaky way across the
plot. If they look as if they were shot at the plot by a horizontally moving machine gun,
then they are probably not independent of each other. You may also want to plot
residuals against other variables, such as time, orientation, or other ways that might
influence the variability of your dependent measure. ACF PLOT in SERIES measures
whether the residuals are serially correlated. Here is an autocorrelation plot:


All the bars should be within the confidence bands if each residual is not predictable
from the one preceding it, and the one preceding that, and the one preceding that, and
so on.
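
The quantities behind an autocorrelation plot can be sketched as follows. This is an illustration in Python, not the SERIES procedure: the band of plus or minus 2 divided by the square root of N is a common large-sample approximation and may differ from the bands SYSTAT draws, and the cases here are ordered by billings rather than by time, so the result is only a demonstration of the arithmetic.

```python
import math

billings = [20, 40, 60, 80, 100, 140, 180, 200, 250, 280]
earnings = [86, 67, 95, 105, 86, 82, 140, 145, 144, 184]

n = len(billings)
mean_x = sum(billings) / n
mean_y = sum(earnings) / n
sxx = sum((x - mean_x) ** 2 for x in billings)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(billings, earnings))
slope = sxy / sxx
intercept = mean_y - slope * mean_x
residuals = [y - (intercept + slope * x) for x, y in zip(billings, earnings)]


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    mean = sum(series) / len(series)
    dev = [v - mean for v in series]
    denominator = sum(d * d for d in dev)
    numerator = sum(dev[t] * dev[t - lag] for t in range(lag, len(dev)))
    return numerator / denominator


band = 2 / math.sqrt(n)  # approximate confidence band (an assumption, see above)
acf = {lag: autocorrelation(residuals, lag) for lag in range(1, 4)}

for lag, value in acf.items():
    print(lag, round(value, 3), abs(value) <= band)
```

A bar that escapes the band at lag 1 would suggest that each residual is partly predictable from the one before it.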
All the members of the population are described by the same linear model. Plot Cook's
distance (COOK) against the estimated values.
[Figure: Cook's distance (COOK) plotted against estimated values (ESTIMATE).]

Cook's distance measures the influence of each sample observation on the coefficient
estimates. Observations that are far from the average of all the independent variable
values or that have large residuals tend to have a large Cook's distance value (say,
greater than 2). Cook's D actually follows closely an F distribution, so aberrant values
depend on the sample size. As a rule of thumb, under the normal regression
assumptions, COOK can be compared to an F distribution with p and N − p degrees of
freedom. We don't want to find a large Cook's D value for an observation because it
would mean that the coefficient estimates would change substantially if we deleted that
observation. While none of the COOK values are extremely large in our example, could
it be that the largest one in the upper right corner is the founding partner in the firm?
Despite large billings, this partner is earning more than the model predicts.
Another diagnostic statistic useful for assessing the model fit is leverage, discussed
in Belsley, Kuh, and Welsch (1980) and Velleman and Welsch (1981). Leverage helps
to identify outliers in the independent variable space. Leverage has an average value
of p/N, where p is the number of estimated parameters (including the constant) and
N is the number of cases. What is a high value of leverage? In practice, it is useful to
examine the values in a stem-and-leaf plot and identify those that stand apart from the
rest of the sample. However, various rules of thumb have been suggested. For example,
values of leverage less than 0.2 appear to be safe; between 0.2 and 0.5, risky; and above
0.5, to be avoided. Another says that if p > 6 and (N − p) > 12, use 3p/N as a
cutoff. SYSTAT uses an F approximation to determine this value for warnings
(Belsley, Kuh, and Welsch, 1980).
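
Leverage and Cook's distance for the simple regression of EARNINGS on BILLINGS can be sketched in Python as below. The formulas are the standard textbook ones (leverage from the distance of each x value to the mean of x, and Cook's D from the residual, the leverage, and the residual mean square); they are assumed here for illustration and are not taken from SYSTAT's source.

```python
billings = [20, 40, 60, 80, 100, 140, 180, 200, 250, 280]
earnings = [86, 67, 95, 105, 86, 82, 140, 145, 144, 184]

n = len(billings)
p = 2  # estimated parameters, including the constant
mean_x = sum(billings) / n
mean_y = sum(earnings) / n
sxx = sum((x - mean_x) ** 2 for x in billings)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(billings, earnings))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

residuals = [y - (intercept + slope * x) for x, y in zip(billings, earnings)]
mse = sum(e * e for e in residuals) / (n - p)

# Leverage: how far each case is from the center of the predictor values.
leverage = [1 / n + (x - mean_x) ** 2 / sxx for x in billings]
mean_leverage = sum(leverage) / n  # equals p / n

# Cook's distance: combines the size of the residual with the leverage.
cook = [e * e * h / (p * mse * (1 - h) ** 2)
        for e, h in zip(residuals, leverage)]

most_influential = max(range(n), key=lambda i: cook[i])
print(round(mean_leverage, 3))
print(billings[most_influential], round(cook[most_influential], 3))
```

The most influential case is the lawyer with the largest billings, matching the point in the upper right corner of the plot, and the average leverage equals p/N as stated above.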
In conclusion, keep in mind that all our diagnostic tests are themselves a form of
inference. We can assess theoretical errors only through the dark mirror of our
observed residuals. Despite this caveat, testing assumptions graphically is critically
important. You should never publish regression results until you have examined these
plots.

Multiple Regression
A multiple linear model has more than one independent variable; that is:

y = a + bx + cz
This is the equation for a plane in three-dimensional space. The parameter a is still an
intercept term. It is the value of y when x and z are 0. The parameters b and c are still
slopes. One gives the slope of the plane along the x dimension; the other, along the
z dimension.
The statistical model has the same form:

y = α + βx + γz + ε

Before we run out of letters for independent variables, let's switch to a more frequently
used notation:

y = β₀ + β₁x₁ + β₂x₂ + ε
Notice that we are still using Greek letters for unobservables and Roman letters for
observables.
Now, let's look at our law firm data again. We have learned that there is another
variable that appears to determine earnings: the number of hours billed per year by
each lawyer. Here is an expanded listing of the data:
EARNINGS   BILLINGS   HOURS
      86         20    1771
      67         40    1556
      95         60    1749
     105         80    1754
      86        100    1594
      82        140    1400
     140        180    1780
     145        200    1737
     144        250    1645
     184        280    1863

For our model, β₁ is the coefficient for BILLINGS, and β₂ is the coefficient for
HOURS. Let's look first at its graphical representation. The following figure shows the
plane fit by least squares to the points representing each lawyer. Notice how the plane
slopes upward on both variables. BILLINGS and HOURS both contribute positively to
EARNINGS in our sample.


Fitting this model involves no more work than fitting the simple regression model. We
specify one dependent and two independent variables and estimate the model as
before. Here is the result:
Dep Var: EARNINGS   N: 10   MULTIPLE R: .998   Squared Multiple R: .996

Adjusted squared Multiple R: .995   Standard Error of Estimate: 2.678

Variable    Coefficient   Std Error   Std Coef   Tolerance         T   P(2 tail)
CONSTANT       -139.925      11.116      0.000           .   -12.588       0.000
BILLINGS          0.333       0.010      0.797    .9510698    32.690       0.000
HOURS             0.124       0.007      0.449    .9510698    18.429       0.000

Analysis of Variance
Source       Sum-of-Squares   DF   Mean Square   F-ratio       P
Regression        12626.210    2      6313.105   880.493   0.000
Residual             50.190    7         7.170

This time, we have one more row in our regression table for HOURS. Notice that its
coefficient (0.124) is smaller than that for BILLINGS (0.333). This is due partly to the
different scales of the variables. HOURS are measured in larger numbers than
BILLINGS. If we wish to compare the influence of each independent of scales, we
should look at the standardized coefficients. Here, we still see that BILLINGS (0.797)
play a greater role in predicting EARNINGS than do HOURS (0.449). Notice also that
both coefficients are highly significant and that our overall model is highly significant,
as shown in the analysis of variance table.
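
For two predictors, the least-squares coefficients can be obtained by solving a pair of normal equations on the centered variables. The Python sketch below does this for the law firm data; it is an illustration of the arithmetic under that textbook formulation, not SYSTAT's algorithm, and the helper names are hypothetical.

```python
import math

billings = [20, 40, 60, 80, 100, 140, 180, 200, 250, 280]
hours = [1771, 1556, 1749, 1754, 1594, 1400, 1780, 1737, 1645, 1863]
earnings = [86, 67, 95, 105, 86, 82, 140, 145, 144, 184]


def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


s11 = cross(billings, billings)
s22 = cross(hours, hours)
s12 = cross(billings, hours)
s1y = cross(billings, earnings)
s2y = cross(hours, earnings)
syy = cross(earnings, earnings)

# Solve the 2 x 2 normal equations for the two slopes.
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det   # coefficient for BILLINGS
b2 = (s11 * s2y - s12 * s1y) / det   # coefficient for HOURS
b0 = mean(earnings) - b1 * mean(billings) - b2 * mean(hours)  # constant

# Standardized coefficients remove the effect of the measurement scales.
std_b1 = b1 * math.sqrt(s11 / syy)
std_b2 = b2 * math.sqrt(s22 / syy)

print(round(b0, 3), round(b1, 3), round(b2, 3))
print(round(std_b1, 3), round(std_b2, 3))
```

The standardized coefficients are the raw coefficients rescaled by the ratio of each predictor's spread to the spread of EARNINGS, which is what makes them comparable across predictors measured in different units.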


Variable Selection
In applications, you may not know which subset of predictor variables in a larger set
constitutes a good model. Strategies for identifying a good subset are many and
varied: forward selection, backward elimination, stepwise (either a forward or
backward type), and all subsets. Forward selection begins with the best predictor,
adds the next best and continues entering variables to improve the fit. Backward
selection begins with all candidate predictors in an equation and removes the least
useful one at a time as long as the fit is not substantially worsened. Stepwise begins
as either forward or backward, but allows poor predictors to be removed from the
candidate model or good predictors to re-enter the model at any step. Finally, all
subsets methods compute all possible subsets of predictors for each model of a given
size (number of predictors) and choose the best one.
Bias and variance tradeoff. Submodel selection is a tradeoff between bias and variance.
By decreasing the number of parameters in the model, its predictive capability is
enhanced. This is because the variance of the parameter estimates decreases. On the
other side, bias may increase because the true model may have a higher dimension.
So we'd like to balance smaller variance against increased bias. There are two aspects
to variable selection: selecting the dimensionality of the submodel (how many
variables to include) and evaluating the model selected. After you determine the
dimension, there may be several alternative subsets that perform equally well. Then,
knowledge of the subject matter, how accurately individual variables are measured,
and what a variable communicates may guide selection of the model to report.
A strategy. If you are in an exploratory phase of research, you might try this version of
backwards stepping. First, fit a model using all candidate predictors. Then identify the
least useful variable, remove it from the model list, and fit a smaller model. Evaluate
your results and select another variable to remove. Continue removing variables. For a
given size model, you may want to remove alternative variables (that is, first remove
variable A, evaluate results, replace A and remove B, etc.).
Entry and removal criteria. Decisions about which variable to enter or remove should be
based on statistics and diagnostics in the output, especially graphical displays of these
values, and your knowledge of the problem at hand.
You can specify your own alpha-to-enter and alpha-to-remove values. Do not make
alpha-to-remove less than alpha-to-enter, or variables can cycle in and out of the
equation (stepping automatically stops if this happens). The default values for these
options are Enter = 0.15 and Remove = 0.15. These values are appropriate for predictor
variables that are relatively independent. If your predictor variables are highly
correlated, you should consider lowering the Enter and Remove values well below
0.05.
When there are high correlations among the independent variables, the estimates of
the regression coefficients can become unstable. Tolerance is a measure of this
condition. It is (1 − R²); that is, one minus the squared multiple correlation between a
predictor and the other predictors included in the model. (Note that the dependent
variable is not used.) By setting a minimum tolerance value, variables highly correlated
with others already in the model are not allowed to enter.
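
With only two predictors, the squared multiple correlation between one predictor and "the others" is just their squared simple correlation, so tolerance is easy to compute by hand. The Python sketch below does so for BILLINGS and HOURS and, for contrast, for a hypothetical near-duplicate of BILLINGS invented for this example.

```python
billings = [20, 40, 60, 80, 100, 140, 180, 200, 250, 280]
hours = [1771, 1556, 1749, 1754, 1594, 1400, 1780, 1737, 1645, 1863]


def squared_correlation(u, v):
    """Squared Pearson correlation between two variables."""
    mu = sum(u) / len(u)
    mv = sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv ** 2 / (suu * svv)


# Tolerance of each predictor in the two-predictor law firm model.
tolerance = 1 - squared_correlation(billings, hours)

# Hypothetical predictor that nearly duplicates BILLINGS (made up for this sketch).
wiggle = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]
near_copy = [x + w for x, w in zip(billings, wiggle)]
tolerance_near_copy = 1 - squared_correlation(billings, near_copy)

print(round(tolerance, 7))
print(tolerance_near_copy < 0.1)
```

The first value matches the Tolerance column of the two-predictor output earlier in the chapter. The near-duplicate predictor has a tolerance close to 0, which is the situation a minimum tolerance setting is meant to screen out.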
As a rough guideline, consider models that include only variables that have absolute
"t" values well above 2.0 and "tolerance" values greater than 0.1. (We use quotation
marks here because t and other statistics do not have their usual distributions when you
are selecting subset models.)
Evaluation criteria. There is no one test to identify the dimensionality of the best
submodel. Recent research by Leo Breiman emphasizes the usefulness of
cross-validation techniques involving 80% random subsamples. Sample 80% of your file, fit
a model, use the resulting coefficients on the remaining 20% to obtain predicted values,
and then compute R² for this smaller sample. In over-fitting situations, the discrepancy
between the R² for the 80% sample and the 20% sample can be dramatic.
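
The mechanics of an 80/20 split can be sketched in Python as follows. Everything here is an assumption made for illustration: the data are simulated, the model has a single predictor, and R² on the holdout cases is taken to be one minus the ratio of the prediction error sum of squares to the total sum of squares of the holdout values.

```python
import random

random.seed(3)  # fixed seed so the sketch is repeatable

# Simulated data (hypothetical): 50 cases from a noisy straight line.
x = [random.uniform(0, 100) for _ in range(50)]
y = [10 + 0.5 * xi + random.gauss(0, 10) for xi in x]

# Randomly assign 80% of the cases to the fitting sample, the rest to the holdout.
indices = list(range(len(x)))
random.shuffle(indices)
cut = len(indices) * 8 // 10
train, test = indices[:cut], indices[cut:]


def fit(xs, ys):
    """Least-squares intercept and slope."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(xs, ys, intercept, slope):
    """1 - SSE/SST for the given cases, using coefficients fitted elsewhere."""
    my = sum(ys) / len(ys)
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(xs, ys))
    sst = sum((b - my) ** 2 for b in ys)
    return 1 - sse / sst


train_x, train_y = [x[i] for i in train], [y[i] for i in train]
test_x, test_y = [x[i] for i in test], [y[i] for i in test]

a, b = fit(train_x, train_y)                  # coefficients from the 80% sample
r2_train = r_squared(train_x, train_y, a, b)
r2_test = r_squared(test_x, test_y, a, b)     # same coefficients, 20% sample

print(round(r2_train, 3), round(r2_test, 3))
```

With one predictor and 40 fitting cases there is little room to over-fit, so the two values should usually be similar; the dramatic discrepancies described above arise when many candidate predictors are screened on a small sample.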
A warning. If you do not have extensive knowledge of your variables and expect this
strategy to help you to find a true model, you can get into a lot of trouble. Automatic
stepwise regression programs cannot do your work for you. You must be able to
examine graphics and make intelligent choices based on theory and prior knowledge;
otherwise, you will be arriving at nonsense.
Moreover, if you are thinking of testing hypotheses after automatically fitting a
subset model, don't bother. Stepwise regression programs are the most notorious
source of pseudo p values in the field of automated data analysis. Statisticians seem
to be the only ones who know these are not real p values. The automatic stepwise
option is provided to select a subset model for prediction purposes. It should never be
used without cross-validation.
If you still want some sort of confidence estimate on your subset model, you might
look at tables in Wilkinson (1979), Rencher and Pun (1980), and Wilkinson and Dallal
2
(1982). These tables provide null hypothesis R values for selected subsets given the
number of candidate predictors and final subset size. If you dont know this literature
already, you will be surprised at how large multiple correlations from stepwise
regressions on random data can be. For a general summary of these and other
problems, see Hocking (1983). For more specific discussions of variable selection
problems, see the previous references and Flack and Chang (1987), Freedman (1983),
and Lovell (1983). Stepwise regression is probably the most abused computerized
statistical technique ever devised. If you think you need automated stepwise regression
to solve a particular problem, it is almost certain that you do not. Professional
statisticians rarely use automated stepwise regression because it does not necessarily
find the best fitting model, the real model, or alternative plausible models.
Furthermore, the order in which variables enter or leave a stepwise program is usually
of no theoretical significance. You are always better off thinking about why a model
could generate your data and then testing that model.

Using an SSCP, a Covariance, or a Correlation Matrix as Input
Normally for a regression analysis, you use a cases-by-variables data file. You can,
however, use a covariance or correlation matrix saved (from Correlations) as input. If
you use a matrix as input, specify the sample size that generated the matrix; the
number you type must be an integer greater than 2.
You can enter an SSCP, a covariance, or a correlation matrix by typing it into the
Data Editor Worksheet, by using BASIC, or by saving it in a SYSTAT file. Be sure to
include the dependent as well as independent variables.
SYSTAT needs the sample size to calculate degrees of freedom, so you need to
enter the original sample size. Linear Regression determines the type of matrix (SSCP,
covariance, etc.) and adjusts appropriately. With a correlation matrix, the raw and
standardized coefficients are the same. Therefore, the Include constant option is
disabled when using SSCP, covariance, or correlation matrices. Because these
matrices are centered, the constant term has already been removed.
The following two analyses of the same data file produce identical results (except
that you don't get residuals with the second). In the first, we use the usual
cases-by-variables data file. In the second, we use the CORR command to save a covariance
matrix and then analyze that matrix file with the REGRESS command.
Here are the usual instructions for a regression analysis:
REGRESS
USE filename
MODEL Y = CONSTANT + X(1) + X(2) + X(3)
ESTIMATE

Here, we compute a covariance matrix and use it in the regression analysis:


CORR
USE filename1
SAVE filename2
COVARIANCE X(1) X(2) X(3) Y
REGRESS
USE filename2
MODEL Y = X(1) + X(2) + X(3) / N=40
ESTIMATE
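The two analyses agree because the slope coefficients depend on the data only through the variances and covariances. The following Python sketch illustrates this for two predictors. The data are hypothetical (constructed so that y = 2 + 3*x1 - x2 exactly), and the helper `covariance` is ours, not a SYSTAT function. The sample size has to be carried along separately, which is the role of the N= option.

```python
def covariance(u, v):
    """Sample covariance with divisor n - 1."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

# Hypothetical raw data in which y = 2 + 3*x1 - x2 exactly
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [2 + 3 * a - b for a, b in zip(x1, x2)]

# Only the covariance matrix is carried forward, as in the second analysis
s11, s22, s12 = covariance(x1, x1), covariance(x2, x2), covariance(x1, x2)
s1y, s2y = covariance(x1, y), covariance(x2, y)

# Solve the 2 x 2 system  S_xx b = s_xy  by Cramer's rule
det = s11 * s22 - s12 * s12
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

n = len(y)                 # must be supplied separately (the N= option)
df_error = n - 2 - 1       # cases minus predictors minus the constant
print(b1, b2, df_error)
```

The slopes recovered from the covariance matrix match the ones used to build the data (3 and -1, up to rounding error). The constant cannot be recovered from the covariance matrix alone, which is consistent with the constant being unavailable for matrix input.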

The triangular matrix input facility is useful for meta-analysis of published data and
missing-value computations. There are a few warnings, however. First, if you input
correlation matrices from textbooks or articles, you may not get the same regression
coefficients as those printed in the source. Because of round-off error, printed and raw
data can lead to different results. Second, if you use pairwise deletion with CORR, the
degrees of freedom for hypotheses will not be appropriate. You may not even be able
to estimate the regression coefficients because of singularities.
In general, when an incomplete data procedure is used to estimate the correlation
matrix, the estimate of regression coefficients and hypothesis tests produced from it are
optimistic. You can correct for this by specifying a sample size smaller than the
number of actual observations (preferably, set it equal to the smallest number of cases
used for any pair of variables), but this is a crude guess that you could refine only by
doing Monte Carlo simulations. There is no simple solution. Beware, especially, of
multivariate regressions (or MANOVA, etc.) with missing data on the dependent
variables. You can usually compute coefficients, but results from hypothesis tests are
particularly suspect.

Analysis of Variance
Often, you will want to examine the influence of categorical variables (such as gender,
species, country, and experimental group) on continuous variables. The model
equations for this case, called analysis of variance, are equivalent to those used in
linear regression. However, in the latter, you have to figure out a numerical coding for
categories so that you can use the codes in an equation as the independent variable(s).

Effects Coding
The following data file, EARNBILL, shows the breakdown of lawyers sampled by sex.
Because SEX is a categorical variable (numerical values assigned to MALE or
FEMALE are arbitrary), a code variable with the values 1 or -1 is used. It doesn't
matter which group is assigned 1, as long as the other is assigned -1.
EARNINGS    SEX       CODE
    86      female     -1
    67      female     -1
    95      female     -1
   105      female     -1
    86      female     -1
    82      male        1
   140      male        1
   145      male        1
   144      male        1
   184      male        1

There is nothing wrong with plotting earnings against the code variable, as long as you
realize that the slope of the line is arbitrary because it depends on how you assign your
codes. By changing the values of the code variable, you can change the slope. Here is
a plot with the least-squares regression line superimposed.

Let's do a regression on the data using these codes. Here are the coefficients as
computed by ANOVA:

Variable    Coefficients
Constant        113.400
Code             25.600

Notice that Constant (113.4) is the mean of all the data. It is also the regression
intercept because the codes are symmetrical about 0. The coefficient for Code (25.6)
is the slope of the line. It is also one half the difference between the means of the
groups. This is because the codes are exactly two units apart. This slope is often called
an "effect" in the analysis of variance because it represents the amount that the
categorical variable SEX affects EARNINGS. In other words, the effect of SEX can be
represented by the amount that the mean for males differs from the overall mean.
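You can check these coefficients by hand. The following Python sketch fits the least-squares line to the ten earnings values, assuming females are coded -1 and males 1 (the variable names are ours). It is meant only to show where 113.4 and 25.6 come from, not how SYSTAT computes them.

```python
earnings = [86, 67, 95, 105, 86, 82, 140, 145, 144, 184]
sex = ["female"] * 5 + ["male"] * 5
code = [-1 if s == "female" else 1 for s in sex]   # effects codes

n = len(earnings)
mean_x = sum(code) / n
mean_y = sum(earnings) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(code, earnings))
         / sum((x - mean_x) ** 2 for x in code))
constant = mean_y - slope * mean_x

female_mean = sum(earnings[:5]) / 5
male_mean = sum(earnings[5:]) / 5
print(round(constant, 3), round(slope, 3))          # 113.4 25.6
print(round((male_mean - female_mean) / 2, 3))      # 25.6
```

Swapping the codes (females 1, males -1) would change only the sign of the slope.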

Means Coding
The effects coding model is useful because the parameters (constant and slope) can be
interpreted as an overall level and as the effect(s) of treatment, respectively. Another
model, however, that yields the means of the groups directly is called the means model.
Here are the codes for this model:

EARNINGS    SEX       CODE1    CODE2
    86      female      1        0
    67      female      1        0
    95      female      1        0
   105      female      1        0
    86      female      1        0
    82      male        0        1
   140      male        0        1
   145      male        0        1
   144      male        0        1
   184      male        0        1


Notice that CODE1 is nonzero for all females, and CODE2 is nonzero for all males. To
estimate a regression model with these codes, you must leave out the constant. With
only two groups, only two distinct pieces of information are needed to distinguish
them. Here are the coefficients for these codes in a model without a constant:
Variable    Coefficient
Code1           87.800
Code2          139.000

Notice that the coefficients are now the means of the groups.

Models
Let's look at the algebraic models for each of these codings. Recall that the regression
model looks like this:

$y = \beta_0 + \beta_1 x_1 + \epsilon$

For the effects model, it is convenient to modify this notation as follows:

$y_j = \mu + \alpha_j + \epsilon$

When x (the code variable) is 1, $\alpha_j$ is equivalent to $\alpha_1$; when x is -1, $\alpha_j$ is equivalent to
$\alpha_2$. This shorthand will help you later when dealing with models with many categories.
For this model, the $\mu$ parameter stands for the grand (overall) mean, and the $\alpha$ parameter
stands for the effect. In this model, our best prediction of the score of a group member
is derived from the grand mean plus or minus the deviation of that group from this
grand mean.
The means model looks like this:

$y_j = \mu_j + \epsilon$

In this model, our best prediction of the score of a group member is the mean of that
group.

Hypotheses
As with regression, we are usually interested in testing hypotheses concerning the
parameters of the model. Here are the hypotheses for the two models:
$H_0: \alpha_1 = \alpha_2 = 0$    (effects model)
$H_0: \mu_1 = \mu_2$    (means model)

The tests of this hypothesis compare variation between the means to variation within
each group, which is mathematically equivalent to testing the significance of
coefficients in the regression model. In our example, the F ratio in the analysis of
variance table tells you that the coefficient for SEX is significant at p = 0.019, which is
less than the conventional 0.05 value. Thus, on the basis of this sample and the validity
of our usual regression assumptions, you can conclude that women earn significantly
less than men in this firm.
Dep Var: earnings   N: 10   Multiple R: .719   Squared Multiple R: .517

Analysis Of Variance
Source    Sum-of-squares    Df    Mean-square    F-ratio        P
Sex             6553.600     1       6553.600      8.563    0.019
Error           6122.800     8        765.350

The nice thing about realizing that ANOVA is specially-coded regression is that the
usual assumptions and diagnostics are appropriate in this context. You can plot
residuals against estimated values, for example, to check for homogeneity of variance.
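The entries in the analysis of variance table can be reproduced from the raw earnings. The sketch below, in plain Python with our own variable names, compares variation between the group means to variation within the groups. It stops at the F ratio because the probability requires the F distribution with 1 and 8 degrees of freedom.

```python
groups = {
    "female": [86, 67, 95, 105, 86],
    "male": [82, 140, 145, 144, 184],
}

all_values = [v for g in groups.values() for v in g]
grand_mean = sum(all_values) / len(all_values)

# Variation of the group means around the grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups.values())
# Variation of the cases around their own group mean
ss_within = sum((v - sum(g) / len(g)) ** 2
                for g in groups.values() for v in g)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_ratio = (ss_between / df_between) / (ss_within / df_within)
r_squared = ss_between / (ss_between + ss_within)

print(round(ss_between, 3), round(ss_within, 3))   # 6553.6 6122.8
print(round(f_ratio, 3), round(r_squared, 3))      # 8.563 0.517
```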

Multigroup ANOVA
When there are more groups, the coding of categories becomes more complex. For the
effects model, there is one fewer coding variable than the number of categories. For two
categories, you need only one coding variable; for three categories, you need two
coding variables:

Category    Code
    1         1     0
    2         0     1
    3        -1    -1

For the means model, the extension is straightforward:


Category    Code
    1         1     0     0
    2         0     1     0
    3         0     0     1

For multigroup ANOVA, the models have the same form as for the two-group ANOVA
above. The corresponding hypotheses for testing whether there are differences between
means are:
$H_0: \alpha_1 = \alpha_2 = \alpha_3 = 0$    (effects model)
$H_0: \mu_1 = \mu_2 = \mu_3$    (means model)

You do not need to know how to produce coding variables to do ANOVA. SYSTAT
does this for you automatically. All you need is a single variable that contains different
values for each group. SYSTAT translates these values into different codes. It is
important to remember, however, that regression and analysis of variance are not
fundamentally different models. They are both instances of the general linear model.
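To make the translation from group values to codes concrete, here is a minimal Python sketch that generates both kinds of coding for any number of categories. The function names `effects_codes` and `means_codes` are hypothetical, and this is not a description of SYSTAT's internal routine; it only reproduces the two tables shown above.

```python
def effects_codes(k):
    """k - 1 effects-coded variables for each of k categories.

    Category i (for i < k) gets a 1 in column i and 0 elsewhere;
    the last category gets -1 in every column.
    """
    rows = [[1 if j == i else 0 for j in range(k - 1)] for i in range(k - 1)]
    rows.append([-1] * (k - 1))
    return rows

def means_codes(k):
    """k indicator variables, one per category (used without a constant)."""
    return [[1 if j == i else 0 for j in range(k)] for i in range(k)]

for category, row in enumerate(effects_codes(3), start=1):
    print(category, row)
for category, row in enumerate(means_codes(3), start=1):
    print(category, row)
```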

Factorial ANOVA
It is possible to have more than one categorical variable in ANOVA. When this
happens, you code each categorical variable exactly the same way as you do with
multi-group ANOVA. The coded design variables are then added as a full set of
predictors in the model.
ANOVA factors can interact. For example, a treatment may enhance bar pressing
by male rats, yet suppress bar pressing by female rats. To test for this possibility, you
can add (to your model) variables that are the product of the main effect variables
already coded. This is similar to what you do when you construct polynomial models.
For example, this is a model without an interaction:
y = CONSTANT + treat + sex

This is a model that contains interaction:


y = CONSTANT + treat + sex + treat*sex

If the hypothesis test of the coefficients for the TREAT*SEX term is significant, then
you must qualify your conclusions by referring to the interaction. You might say, "It
works one way for males and another for females."

Data Screening and Assumptions


Most analyses have assumptions. If your data do not meet the necessary assumptions,
then the resulting probabilities for the statistics may be suspect. Before an ANOVA,
look for:
• Violations of the equal variance assumption. Your groups should have the same
  dispersion or spread (their shapes do not differ markedly).
• Symmetry. The mean of each group should fall roughly in the middle of the spread
  (the within-group distributions are not extremely skewed).
• Independence of the group means and standard deviations (the size of the group
  means is not related to the size of their standard deviations).
• Gross outliers (no values stand apart from the others in the batch).

Graphical displays are useful for checking assumptions. For analysis of variance, try
dit plots, box-and-whisker displays, or bar charts with standard error bars.

Levene Test
Analysis of variance assumes that the data within cells are independent and normally
distributed with equal variances. This is the ANOVA equivalent of the regression
assumptions for residuals. When the homogeneous variance part of the assumptions is
false, it is sometimes possible to adjust the degrees of freedom to produce
statistics that approximately follow the F distribution.
Levene (1960) proposed a test for unequal variances. You can use this test to
determine whether you need an unequal variance F test. Simply fit your model in
ANOVA and save residuals. Then transform the residuals into their absolute values.
Merge these with your original grouping variable(s). Then redo your ANOVA on the
absolute residuals. If it is significant, then you should consider using the separate
variances test.
Before doing all this work, you should do a box plot by groups to see whether the
distributions differ. If you see few differences in the spread of the boxes, Levene's test
is unlikely to be significant.
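The steps of the Levene procedure (fit, save residuals, take absolute values, redo the ANOVA) can be sketched in Python as follows. The function names are ours, the data are the two groups of lawyers from the earlier example, and the sketch returns only the F ratio; judging its significance still requires the F distribution with the appropriate degrees of freedom.

```python
def one_way_f(groups):
    """F ratio from a one-way analysis of variance on a list of groups."""
    values = [v for g in groups for v in g]
    grand = sum(values) / len(values)
    ss_b = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_w = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_b / (len(groups) - 1)) / (ss_w / (len(values) - len(groups)))

def levene_f(groups):
    """ANOVA F ratio computed on absolute residuals from the group means."""
    abs_resid = [[abs(v - sum(g) / len(g)) for v in g] for g in groups]
    return one_way_f(abs_resid)

female = [86, 67, 95, 105, 86]
male = [82, 140, 145, 144, 184]
print(round(levene_f([female, male]), 3))
```

For a one-way design, the residuals are simply deviations from the group means, so no separate model-fitting step is needed in the sketch.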

Pairwise Mean Comparisons


The results in an ANOVA table serve only to indicate whether means differ
significantly or not. They do not indicate which mean differs from another.
To report which pairs of means differ significantly, you might think of computing a
two-sample t test for each pair; however, do not do this. The probability associated
with the two-sample t test assumes that only one test is performed. When several means
are tested pairwise, the probability of finding one significant difference by chance
alone increases rapidly with the number of pairs. If you use a 0.05 significance level to
test that means A and B are equal and to test that means C and D are equal, the overall
acceptance region is now 0.95 x 0.95, or 0.9025. Thus, the acceptance region for two
independent comparisons carried out simultaneously is about 90%, and the critical
region is 10% (instead of the desired 5%). For six pairs of means tested at the 0.05
significance level, the probability of a difference falling in the critical region is not 0.05
but
$1 - (0.95)^6 = 0.265$

For 10 pairs, this probability increases to 0.40. The result of following such a strategy
is to declare differences as significant when they are not.
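The arithmetic behind these figures is easy to reproduce. The short Python function below (the name `familywise_error` is ours) assumes, as the text does, that the comparisons are independent.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false rejection in n independent tests."""
    return 1 - (1 - alpha) ** n_tests

for k in (6, 10):
    print(k, round(familywise_error(0.05, k), 3))
# 6 0.265
# 10 0.401
```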
As an alternative to the situation described above, SYSTAT provides four
techniques to perform pairwise mean comparisons: Bonferroni, Scheffé, Tukey, and
Fisher's LSD. The first three methods provide protection for multiple tests. To
determine significant differences, simply look for pairs with probabilities below
your critical value (for example, 0.05 or 0.01).
There is an abundance of literature covering multiple comparisons (see Miller, 1985);
however, a few points are worth noting here:
• If you have a small number of groups, the Bonferroni pairwise procedure will often
  be more powerful (sensitive). For more groups, consider the Tukey method. Try all
  the methods in ANOVA (except Fisher's LSD) and pick the best one.
• All possible pairwise comparisons are a waste of power. Think about a meaningful
  subset of comparisons and test this subset with Bonferroni levels. To do this, divide
  your critical level, say 0.05, by the number of comparisons you are making. You
  will almost always have more power than with any other pairwise multiple
  comparison procedures.

• Some popular multiple comparison procedures are not found in SYSTAT.
  Duncan's test, for example, does not maintain its claimed protection level. Other
  stepwise multiple range tests, such as Newman-Keuls, have not been conclusively
  demonstrated to maintain overall protection levels for all possible distributions of
  means.

Linear and Quadratic Contrasts


Contrasts are used to test relationships among means. A contrast is a linear
combination of means $\mu_i$ with coefficients $\alpha_i$:

$\alpha_1\mu_1 + \alpha_2\mu_2 + \cdots + \alpha_k\mu_k$

where $\alpha_1 + \alpha_2 + \cdots + \alpha_k = 0$. In SYSTAT, hypotheses can be specified about contrasts
and tests performed. Typically, the hypothesis has the form:

$H_0: \alpha_1\mu_1 + \alpha_2\mu_2 + \cdots + \alpha_k\mu_k = 0$

The test statistic for a contrast is similar to that for a two-sample t test; the result of the
contrast (a relation among means, such as mean A minus mean B) is in the numerator
of the test statistic, and an estimate of within-group variability (the pooled variance
estimate or the error term from the ANOVA) is part of the denominator.
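A minimal Python sketch of this test statistic follows. It uses the standard form in which the contrast estimate is divided by the square root of the error mean square times the sum of squared coefficients over group sizes. The function name `contrast_t` is ours, and the inputs are the group means, group sizes, and error mean square from the two-group lawyers example.

```python
import math

def contrast_t(coefs, means, ns, mse):
    """t statistic for the contrast sum(c_i * mean_i), using the ANOVA error term."""
    if abs(sum(coefs)) > 1e-12:
        raise ValueError("contrast coefficients must sum to zero")
    estimate = sum(c * m for c, m in zip(coefs, means))
    std_error = math.sqrt(mse * sum(c * c / n for c, n in zip(coefs, ns)))
    return estimate / std_error

# Group means, sizes, and error mean square from the lawyers example
t = contrast_t(coefs=[-1, 1], means=[87.8, 139.0], ns=[5, 5], mse=765.35)
print(round(t, 3), round(t * t, 3))   # 2.926 8.563
```

With only two groups, the squared contrast statistic equals the F ratio from the analysis of variance table, which is a useful check on the computation.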
You can select contrast coefficients to test:
• Pairwise comparisons (test for a difference between two particular means)
• A linear combination of means that is meaningful to the study at hand (compare
  two treatments versus a control mean)
• Linear, quadratic, or the like increases (decreases) across a set of ordered means
  (that is, you might test a linear increase in sales by comparing people with no
  training, those with moderate training, and those with extensive training)
Many experimental design texts place coefficients for linear and quadratic contrasts for
three groups, four groups, and so on, in a table. SYSTAT allows you to type your
contrasts or select a polynomial option. A polynomial contrast of order 1 is linear; of
order 2, quadratic; of order 3, cubic; and so on.

Unbalanced Designs
An unbalanced factorial design occurs when the numbers of cases in cells are unequal
and not proportional across rows or columns. The following is an example of a
2 x 2 design:

          A1      A2
B1         1       5
           2       3
                   4
B2         6       2
           7       1
           9       5
           8       3
           4

Unbalanced designs require a least-squares procedure like the General Linear Model
because the usual maximum likelihood method of adding up sums of squared
deviations from cell means and the grand mean does not yield maximum likelihood
estimates of effects. The General Linear Model adjusts for unbalanced designs when
you get an ANOVA table to test hypotheses.
However, the estimates of effects in the unbalanced design are no longer orthogonal
(and thus statistically independent) across factors and their interactions. This means
that the sum of squares associated with one factor depends on the sum of squares for
another or its interaction.
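One way to see the loss of orthogonality is to compute the correlation between the effects-coded columns for the two factors. The Python sketch below does this from cell counts alone. The function name `code_correlation` is ours, and the assignment of the cell sizes 2, 3, 5, and 4 to particular code combinations is an assumption made for illustration.

```python
def code_correlation(cell_counts):
    """Correlation between the +1/-1 effects codes of two factors.

    cell_counts maps (a_code, b_code) to the number of cases in that cell.
    """
    n = sum(cell_counts.values())
    mean_a = sum(a * c for (a, b), c in cell_counts.items()) / n
    mean_b = sum(b * c for (a, b), c in cell_counts.items()) / n
    cov = sum((a - mean_a) * (b - mean_b) * c
              for (a, b), c in cell_counts.items())
    var_a = sum((a - mean_a) ** 2 * c for (a, b), c in cell_counts.items())
    var_b = sum((b - mean_b) ** 2 * c for (a, b), c in cell_counts.items())
    return cov / (var_a * var_b) ** 0.5

balanced = {(1, 1): 3, (1, -1): 3, (-1, 1): 3, (-1, -1): 3}
proportional = {(1, 1): 2, (1, -1): 4, (-1, 1): 3, (-1, -1): 6}
unbalanced = {(1, 1): 2, (1, -1): 5, (-1, 1): 3, (-1, -1): 4}

print(code_correlation(balanced))       # 0.0
print(code_correlation(proportional))   # zero up to rounding error
print(code_correlation(unbalanced))     # nonzero: the two factors overlap
```

Equal or proportional cell sizes leave the two columns uncorrelated; the unbalanced counts do not, which is why the sum of squares for one factor then depends on the other.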
Analysts accustomed to using multiple regression have no problem with this
situation because they assume that their independent variables in a model are
correlated. Experimentalists, however, often have difficulty speaking of a main effect
conditioned on another. Consequently, there is extensive literature on hypothesis
testing methodology for unbalanced designs (for example, Speed and Hocking, 1976,
and Speed, Hocking, and Hackney, 1978), and there is no consensus on how to test
hypotheses with non-orthogonal designs.
Some statisticians advise you to do a series of hierarchical tests beginning with
interactions. If the highest-order interactions are insignificant, drop them from the
model and recompute the analysis. Then, examine the lower-order interactions. If they
are insignificant, recompute the model with main effects only. Some computer
programs automate this process and print sums of squares and F tests according to the
hierarchy (ordering of effects) you specify in the model. SAS and SPSS GLM, for
example, call these Type I sums of squares.

This procedure is analogous to stepwise regression in which hierarchical subsets of
models are tested. This example assumes you have specified the following model:
Y = CONSTANT + a + b + c + ab + ac + bc + abc

The hierarchical approach tests the following models:

Y = CONSTANT + a + b + c + ab + ac + bc + abc
Y = CONSTANT + a + b + c + ab + ac + bc
Y = CONSTANT + a + b + c + ab + ac
Y = CONSTANT + a + b + c + ab
Y = CONSTANT + a + b + c
Y = CONSTANT + a + b
Y = CONSTANT + a


The problem with this approach, however, is that plausible subsets of effects are
ignored if you examine only one hierarchy. The following model, which may be the
best fit to the data, is never considered:
Y = CONSTANT + a + b + ab

Furthermore, if you decide to examine all the other plausible subsets, you are really
doing all possible subsets regression, and you should use Bonferroni confidence levels
before rejecting a null hypothesis. The example above has 127 possible subset models
(excluding ones without a CONSTANT). Interactive stepwise regression allows you to
explore subset models under your control.
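The count of 127 is the number of nonempty subsets of the seven effects (2^7 - 1). A short Python sketch, using only the standard library, enumerates them and shows the per-model Bonferroni level that would follow from a 0.05 overall level.

```python
from itertools import combinations

effects = ["a", "b", "c", "ab", "ac", "bc", "abc"]

# Every nonempty subset of the seven effects, each added to the CONSTANT
subsets = [combo for size in range(1, len(effects) + 1)
           for combo in combinations(effects, size)]

print(len(subsets))                 # 127
print(0.05 / len(subsets))          # Bonferroni level per model, roughly 0.0004
print(("a", "b", "ab") in subsets)  # True
```

The last line confirms that the model Y = CONSTANT + a + b + ab, which the single hierarchy never visits, is among the subsets.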
If you have done an experiment and have decided that higher-order effects
(interactions) are of enough theoretical importance to include in your model, you
should condition every test on all other effects in the model you selected. This is the
classical approach of Fisher and Yates. It amounts to using the default F values on the
ANOVA output, which are the same as the SAS and SPSS Type III sums of squares.
Probably the most important reason to stay with one model is that if you eliminate
a series of effects that are not quite significant (for example, p = 0.06), you could end
up with an incorrect subset model because of the dependencies among the sums of
squares. In summary, if you want other sums of squares, compute them. You can
supply the mean square error to customize sums of squares by using a hypothesis test
in GLM, selecting MSE, and specifying the mean square error and degrees of freedom.

Repeated Measures
In factorial ANOVA designs, each subject is measured once. For example, the
assumption of independence would be violated if a subject is measured first as a
control group member and later as a treatment group member. However, in a repeated
measures design, the same variable is measured several times for each subject (case).
A paired-comparison t test is the simplest form of a repeated measures design (for
example, each subject has a "before" and an "after" measure).
Usually, it is not necessary for you to understand how SYSTAT carries out
calculations; however, repeated measures is an exception. It is helpful to understand
the quantities SYSTAT derives from your data. First, remember how to calculate a
paired-comparison t test by hand:
• For each subject, compute the difference between the two measures.
• Calculate the average of the differences.
• Calculate the standard deviation of the differences.
• Calculate the test statistic using this mean and standard deviation.
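The four steps above can be written out as a small Python function. The before and after scores are hypothetical, and the function name `paired_t` is ours. The statistic is the mean difference divided by its standard error, with degrees of freedom one less than the number of subjects.

```python
import math
import statistics

def paired_t(before, after):
    """Paired-comparison t statistic and its degrees of freedom."""
    diffs = [a - b for b, a in zip(before, after)]   # step 1: differences
    mean_d = statistics.mean(diffs)                  # step 2: average
    sd_d = statistics.stdev(diffs)                   # step 3: standard deviation (n - 1)
    t = mean_d / (sd_d / math.sqrt(len(diffs)))      # step 4: test statistic
    return t, len(diffs) - 1

# Hypothetical before/after scores for four subjects
t, df = paired_t(before=[10, 12, 9, 11], after=[12, 14, 10, 15])
print(round(t, 3), df)
```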

SYSTAT derives similar values from your repeated measures and uses them in
analysis-of-variance computations to test changes across the repeated measures
(within subjects) as well as differences between groups of subjects (between subjects).
Tests of the within-subjects values are called polynomial tests of order 1, 2,..., up to k,
where k is one less than the number of repeated measures. The first polynomial is used
to test linear changes (for example, do the repeated responses increase (or decrease)
around a line with a significant slope?). The second polynomial tests if the responses
fall along a quadratic curve, and so on.
For each case, SYSTAT uses orthogonal contrast coefficients to derive one
number for each polynomial. For the coefficients of the linear polynomial, SYSTAT
uses (-1, 0, 1) when there are three measures; (-3, -1, 1, 3) when there are four
measures; and so on. When there are three repeated measures, SYSTAT multiplies the
first by -1, the second by 0, and the third by 1, and sums these products (this sum is
then multiplied by a constant to make the sum of squares of the coefficients equal to
1). Notice that when the responses are the same, the result of the polynomial contrast
is 0; when the responses fall closely along a line with a steep slope, the polynomial
differs markedly from 0.
For the coefficients of the quadratic polynomial, SYSTAT uses (1, -2, 1) when
there are three measures; (1, -1, -1, 1) when there are four measures; and so on. The
cubic and higher-order polynomials are computed in a similar way.
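The derived values can be sketched in Python as follows. The coefficient sets are the ones given above for three and four measures; the function name `component` and the example responses are ours. Each score is divided by the square root of the sum of squared coefficients, which is one way to apply the scaling described in the text.

```python
import math

LINEAR = {3: (-1, 0, 1), 4: (-3, -1, 1, 3)}
QUADRATIC = {3: (1, -2, 1), 4: (1, -1, -1, 1)}

def component(responses, coefs):
    """Contrast score for one case, scaled so the squared coefficients sum to 1."""
    scale = math.sqrt(sum(c * c for c in coefs))
    return sum(c * r for c, r in zip(coefs, responses)) / scale

flat = [5, 5, 5]           # identical responses
rising = [1, 2, 3]         # hypothetical responses on a perfect line
print(component(flat, LINEAR[3]), component(flat, QUADRATIC[3]))
print(component(rising, LINEAR[3]), component(rising, QUADRATIC[3]))
```

Identical responses give 0 on both components, and responses that lie exactly on a line give a nonzero linear component with a quadratic component of 0.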

Let's continue the discussion for a design with three repeated measures. Assume
that you record body weight once a month for three months for rats grouped by diet.
(Diet A includes a heavy concentration of alcohol and Diet B consists of normal lab
chow.) For each rat, SYSTAT computes a linear component and a quadratic
component. SYSTAT also sums the weights to derive a total response. These derived
values are used to compute two analysis of variance tables:
• The total response is used to test between-group differences; that is, the total is
  used as the dependent variable in the usual factorial ANOVA computations. In the
  example, this test compares total weight for Diet A against that for Diet B. This is
  analogous to a two-sample t test using total weight as the dependent variable.
• The linear and quadratic components are used to test changes across the repeated
  measures (within subjects) and also to test the interaction of the within factor with
  the grouping factor. If the test for the linear component is significant, you can
  report a significant linear increase in weight over the three months. If the test for
  the quadratic component is also significant (but much less so than the linear
  component), you might report that growth is predominantly linear, but there is a
  significant curve in the upward trend.
• A significant interaction between Diet (the between-group factor) and the linear
  component across time might indicate that the slopes for Diet A and Diet B differ.
  This test may be the most important one for the experiment.

Assumptions in Repeated Measures


SYSTAT computes both univariate and multivariate statistics. Like all standard
ANOVA procedures, the univariate repeated measures approach requires that the
distributions within cells be normal. The univariate repeated measures approach also
requires that the covariances between all possible pairs of repeated measures be equal.
(Actually, the requirement is slightly less restrictive, but this difference is of little
practical importance.) Of course, the usual ANOVA requirement that all variances
within cells are equal still applies; thus, the covariance matrix of the measures should
have a constant diagonal and equal elements off the diagonal. This assumption is called
compound symmetry.
The multivariate analysis does not require compound symmetry. It requires that the
covariance matrices within groups (there is only one group in this example) be
equivalent and that they be based on multivariate normal distributions. If the classical
assumptions hold, then you should generally ignore the multivariate tests at the bottom
of the output and stay with the classical univariate ANOVA table because the
multivariate tests will be generally less powerful.
There is a middle approach. The Greenhouse-Geisser and Huynh-Feldt statistics are
used to adjust the probability for the classical univariate tests when compound
symmetry fails. (Huynh-Feldt is a more recent adjustment to the conservative
Greenhouse-Geisser statistic.) If the Huynh-Feldt p values are substantially different
from those under the column directly to the right of the F statistic, then you should be
aware that compound symmetry has failed. In this case, compare the adjusted p values
under Huynh-Feldt to those for the multivariate tests.
If all else fails, single degree-of-freedom polynomial tests can always be trusted. If
there are several to examine, however, remember that you may want to use Bonferroni
adjustments to the probabilities; that is, divide the normal value (for example, 0.05) by
the number of polynomial tests you want to examine. You need to make a Bonferroni
adjustment only if you are unable to use the summary univariate or multivariate tests
to protect the overall level; otherwise, you can examine the polynomials without
penalty if the overall test is significant.

Issues in Repeated Measures Analysis


Repeated measures designs can be generated in SYSTAT with a single procedure. You
need not worry about weighting cases in unbalanced designs or selecting error terms.
The program does this automatically; however, you should keep the following in mind:
• The sums of squares for the univariate F tests are pooled across subjects within
  groups and their interactions with trials. This means that the traditional analysis
  method has highly restrictive assumptions. You must assume that the variances
  within cells are homogeneous and that the covariances across all pairs of cells are
  equivalent (compound symmetry). There are some mathematical exceptions to this
  requirement, but they rarely occur in practice. Furthermore, the compound
  symmetry assumption rarely holds for real data.
• Compound symmetry is not required for the validity of the single degree-of-freedom
  polynomial contrasts. These polynomials partition sums of squares into
  orthogonal components. You should routinely examine the magnitude of these
  sums of squares relative to the hypothesis sum of squares for the corresponding
  univariate repeated measures F test when your trials are ordered on a scale.
• Think of the repeated measures output as an expanded traditional ANOVA table.
  The effects are printed in the same order as they appear in Winer (1971) and other
  texts, but they include the single degree-of-freedom and multivariate tests to
  protect you from false conclusions. If you are satisfied that both are in agreement,
  you can delete the additional lines in the output file.
• You can test any hypothesis after you have estimated a repeated measures design
  and examined the output. For example, you can use polynomial contrasts to test
  single degree-of-freedom components in an unevenly spaced design. You can also
  use difference contrasts to do post hoc tests on adjacent trials.

Types of Sums of Squares


Some other statistics packages print several types of sums of squares for testing
hypotheses. The following names for these sums of squares are not statistical terms,
but they were popularized originally by SAS GLM.
Type I. Type I sums of squares are computed from the difference between the residual
sums of squares of two different models. The particular models needed for the
computation depend on the order of the variables in the MODEL statement. For
example, if the model is
MODEL y = CONSTANT + a + b + a*b

then the sum of squares for AB is produced from the difference between SSE (sum of
squared error) in the two following models:
MODEL y = CONSTANT + a + b
MODEL y = CONSTANT + a + b + a*b

Similarly, the Type I sums of squares for B in this model are computed from the
difference in SSE between the following models:
MODEL y = CONSTANT + a
MODEL y = CONSTANT + a + b

Finally, the Type I sums of squares for A is computed from the difference in residual
sums of squares for the following:
MODEL y = CONSTANT
MODEL y = CONSTANT + a

In summary, to compute sums of squares, move from right to left and construct models
which differ by the right-most term only.

Type II. Type II sums of squares are computed similarly to Type I except that main
effects and interactions determine the ordering of differences instead of the MODEL
statement order. For the above model, Type II sums of squares for the interaction are
computed from the difference in residual sums of squares for the following models:
MODEL y = CONSTANT + a + b
MODEL y = CONSTANT + a + b + a*b

For the B effect, difference the following models:


MODEL y = CONSTANT + a + b
MODEL y = CONSTANT + a

For the A effect, difference the following (this is not the same as for
Type I):
MODEL y = CONSTANT + a + b
MODEL y = CONSTANT + b

In summary, include interactions of the same order as well as all lower order
interactions and main effects when differencing to get an interaction. When getting
sums of squares for a main effect, difference against all other main effects only.
Type III. Type III sums of squares are the default for ANOVA and are much simpler to
understand. Simply difference from the full model, leaving out only the term in
question. For example, the Type III sum of squares for A is taken from the following
two models:
MODEL y = CONSTANT + b + a*b
MODEL y = CONSTANT + a + b + a*b

Type IV. Type IV sums of squares are designed for missing cells designs and are not
easily presented in the above terminology. They are produced by balancing over the
means of nonmissing cells not included in the current hypothesis.

SYSTAT's Sums of Squares


Printing more than one sum of squares in a table is potentially confusing to users. There
is a strong temptation to choose the most significant sum of squares without
understanding the hypothesis being tested.
A Type I test is produced by first estimating the full model and noting the error
term. Then, each effect is entered sequentially and tested with the error term from the
full model. Later, effects are conditioned on earlier effects, but earlier effects are not
conditioned on later effects. A Type II test is produced most easily with interactive
stepping (STEP). Type III is printed in the regression and ANOVA table. Finally, Type
IV is produced by the careful use of SPECIFY in testing means models. The advantage
of this approach is that the user is always aware that sums of squares depend on explicit
mathematical models rather than additions and subtractions of dimensionless
quantities.

Chapter 14
Linear Models I: Linear Regression
Leland Wilkinson and Mark Coward

The model for simple linear regression is:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where y is the dependent variable, x is the independent variable, and the $\beta$s are the
regression parameters (the intercept and the slope of the line of best fit). The model
for multiple linear regression is:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$

Both Regression and General Linear Model can estimate and test simple and multiple
linear regression models. Regression is easier to use than General Linear Model when
you are doing simple regression, multiple regression, or stepwise regression because it
has fewer options. To include interaction terms in your model or for mixture models, use
General Linear Model. With Regression, all independent variables must be continuous;
in General Linear Model, you can identify categorical independent variables and
SYSTAT will generate a set of design variables for each. Both General Linear Model
and Regression allow you to save residuals. In addition, you can test a variety of
hypotheses concerning the regression coefficients using General Linear Model.
The ability to do stepwise regression is available in three ways: use the default
values, specify your own selection criteria, or at each step, interactively select a
variable to add or remove from the model.
For each model you fit in REGRESS, SYSTAT reports R², adjusted R², the
standard error of the estimate, and an ANOVA table for assessing the fit of the model.

For each variable in the model, the output includes the estimate of the regression
coefficient, the standard error of the coefficient, the standardized coefficient, tolerance,
and a t statistic for measuring the usefulness of the variable in the model.

Linear Regression in SYSTAT


Regression Main Dialog Box
To obtain a regression analysis, from the menus choose:
Statistics
Regression
Linear

The following options can be specified:


Include constant. Includes the constant in the regression equation. Deselect this option
to remove the constant. You almost never want to remove the constant, and you should
be familiar with no-constant regression terminology before considering it.
Cases. If your data are in the form of a correlation matrix, enter the number of cases
used to compute the correlation matrix.
Save. You can save residuals and other data to a new data file. The following
alternatives are available:
- Residuals. Saves predicted values, residuals, Studentized residuals, leverage for
  each observation, Cook's distance measure, and the standard error of predicted
  values.
- Residuals/Data. Saves the residual statistics given by Residuals plus all the
  variables in the working data file, including any transformed data values.


- Partial. Saves partial residuals. Suppose your model is:

  Y = CONSTANT + X1 + X2 + X3

  The saved file contains:

  YPARTIAL(1):  Residual of Y = CONSTANT + X2 + X3
  XPARTIAL(1):  Residual of X1 = CONSTANT + X2 + X3
  YPARTIAL(2):  Residual of Y = CONSTANT + X1 + X3
  XPARTIAL(2):  Residual of X2 = CONSTANT + X1 + X3
  YPARTIAL(3):  Residual of Y = CONSTANT + X1 + X2
  XPARTIAL(3):  Residual of X3 = CONSTANT + X1 + X2

- Partial/Data. Saves partial residuals plus all the variables in the working data file,
  including any transformed data values.
- Model. Saves statistics given in Residuals and the variables used in the model.
- Coefficients. Saves the estimates of the regression coefficients.

Regression Options
To open the Options dialog box, click Options in the Regression dialog box.

You can specify a tolerance level, select complete or stepwise entry, and specify entry
and removal criteria.


Tolerance. Prevents the entry of a variable that is highly correlated with the
independent variables already included in the model. Enter a value between 0 and 1.
Typical values are 0.01 or 0.001. The higher the value (closer to 1), the lower the
correlation required to exclude a variable.
Estimation. Controls the method used to enter and remove variables from the equation.
- Complete. All independent variables are entered in a single step.
- Mixture model. Constrains the independent variables to sum to a constant.
- Stepwise. Variables are entered or removed from the model one at a time.

The following alternatives are available for stepwise entry and removal:
- Backward. Begins with all candidate variables in the model. At each step, SYSTAT
  removes the variable with the largest Remove value.
- Forward. Begins with no variables in the model. At each step, SYSTAT adds the
  variable with the smallest Enter value.
- Automatic. For Backward, at each step SYSTAT automatically removes a variable
  from your model. For Forward, SYSTAT automatically adds a variable to the
  model at each step.
- Interactive. At each step in the model building, you select the variable to enter or
  remove from the model.

You can also control the criteria used to enter and remove variables from the model:
- Enter. Enters a variable into the model if its alpha value is less than the specified
  value. Enter a value between 0 and 1.
- Remove. Removes a variable from the model if its alpha value is greater than the
  specified value. Enter a value between 0 and 1.
- Force. Forces the first n variables listed in your model to remain in the equation.
- FEnter. F-to-enter limit. Variables with F greater than the specified value are
  entered into the model if Tolerance permits.
- FRemove. F-to-remove limit. Variables with F less than the specified value are
  removed from the model.
- Max step. Maximum number of steps.


Using Commands
First, specify your data with USE filename. Continue with:
REGRESS
MODEL var = CONSTANT + var1 + var2 + ... / N=n
SAVE filename / COEF MODEL RESID DATA PARTIAL
ESTIMATE / TOL=n
(use START instead of ESTIMATE for stepwise model building)
START / FORWARD BACKWARD TOL=n ENTER=p REMOVE=p ,
FENTER=n FREMOVE=n FORCE=n
STEP / AUTO ENTER=p REMOVE=p FENTER=n FREMOVE=n
STOP

For hypothesis testing commands, see Chapter 16.

Usage Considerations
Types of data. Input can be the usual cases-by-variables data file or a covariance,
correlation, or sum of squares and cross-products matrix. Using matrix input requires
specification of the sample size which generated the matrix.
Print options. Using PRINT = MEDIUM, the output includes eigenvalues of X'X,
condition indices, and variance proportions. PRINT = LONG adds the correlation matrix
of the regression coefficients to this output.
Quick Graphs. SYSTAT plots the residuals against the predicted values.
Saving files. You can save the results of the analysis (predicted values, residuals, and
diagnostics that identify unusual cases) for further use in examining assumptions.
BY groups. Separate regressions result for each level of any BY variables.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. REGRESS uses the FREQ variable to duplicate cases. This inflates
the degrees of freedom to be the sum of the number of frequencies.
Case weights. REGRESS weights cases using the WEIGHT variable for rectangular
data. You can perform cross-validation if the weight variable is binary and coded 0 or
1. SYSTAT computes predicted values for cases with zero weight even though they are
not used to estimate the regression parameters.


Examples
Example 1
Simple Linear Regression
In this example, we explore the relation between gross domestic product per capita
(GDP_CAP) and spending on the military (MIL) for 57 countries that report this
information to the United Nations. We want to determine whether a measure of the
financial well-being of a country is useful for predicting its military expenditures. Our
model is:

$$\mathrm{mil} = \beta_0 + \beta_1 \, \mathrm{gdp\_cap} + \varepsilon$$


Initially, we plot the dependent variable against the independent variable. Such a plot
may reveal outlying cases or suggest a transformation before applying linear
regression. The input is:
USE ourworld
PLOT MIL*GDP_CAP / SMOOTH=LOWESS TENSION=0.500 ,
YLABEL="Military Spending" ,
SYMBOL=4 SIZE=1.500 LABEL=NAME$ ,
CSIZE=2.000

The scatterplot follows:


To obtain the scatterplot, we created a new variable, NAME$, that had missing values for
all countries except Libya and Iraq. We then used the new variable to label plot points.
Iraq and Libya stand apart from the other countries: they spend considerably more
for the military than countries with similar GDP_CAP values. The smoother indicates
that the relationship between the two variables is fairly linear. Distressing, however, is
the fact that many points clump in the lower left corner. Many data analysts would
want to study the data after log-transforming both variables. We do this in another
example, but now we estimate the coefficients for the data as recorded.
To fit a simple linear regression model to the data, the input is:
REGRESS
USE ourworld
MODEL mil = CONSTANT + gdp_cap
ESTIMATE

The output is:


1 case(s) deleted due to missing data.

Dep Var: MIL   N: 56   Multiple R: 0.646   Squared multiple R: 0.417
Adjusted squared multiple R: 0.407   Standard error of estimate: 136.154

Effect      Coefficient   Std Error   Std Coef   Tolerance        t   P(2 Tail)
CONSTANT         41.857      24.838      0.0          .       1.685       0.098
GDP_CAP           0.019       0.003      0.646      1.000     6.220       0.000

Effect      Coefficient   Lower < 95%>      Upper
CONSTANT         41.857         -7.940     91.654
GDP_CAP           0.019          0.013      0.025

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square   F-ratio       P
Regression       717100.891    1    717100.891    38.683   0.000
Residual        1001045.288   54     18537.876
-------------------------------------------------------------------------------

*** WARNING ***
Case 22 is an outlier   (Studentized Residual = 6.956)
Case 30 is an outlier   (Studentized Residual = 4.348)

Durbin-Watson D Statistic       2.046
First Order Autocorrelation    -0.032


SYSTAT reports that data are missing for one case. In the next line, it reports that 56
cases are used (N = 56). In the regression calculations, SYSTAT uses only the cases
that have complete data for the variables in the model. However, when only the
dependent variable is missing, SYSTAT computes a predicted value, its standard error,
and a leverage diagnostic for the case. In this sample, Afghanistan did not report
military spending.
When there is only one independent variable, Multiple R (0.646) is the simple
correlation between MIL and GDP_CAP. Squared multiple R (0.417) is the square of
this value, and it is the proportion of the total variation in the military expenditures
accounted for by GDP_CAP (GDP_CAP explains 41.7% of the variability of MIL).
Use Sum-of-Squares in the analysis of variance table to compute it:
717100.891 / (717100.891 + 1001045.288)

Adjusted squared multiple R is of interest for models with more than one independent
variable. Standard error of estimate (136.154) is the square root of the residual mean
square (18537.876) in the ANOVA table.
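
The arithmetic behind these summary statistics can be checked from the analysis of variance table alone. The following Python sketch is not SYSTAT code; the variable names are illustrative, and the four input numbers are copied from the output above. The results should agree with the reported Squared multiple R, Multiple R, F-ratio, and Standard error of estimate to about three decimals.

```python
import math

# Values copied from the analysis of variance table in the output above.
ss_regression = 717100.891
ss_residual = 1001045.288
df_regression = 1
df_residual = 54

# Squared multiple R: share of the total sum of squares explained by the model.
r_squared = ss_regression / (ss_regression + ss_residual)

# With one predictor, Multiple R is the absolute value of the simple correlation.
multiple_r = math.sqrt(r_squared)

# Mean squares, F-ratio, and standard error of estimate.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual
std_error_of_estimate = math.sqrt(ms_residual)

print(f"Squared multiple R:         {r_squared:.3f}")
print(f"Multiple R:                 {multiple_r:.3f}")
print(f"F-ratio:                    {f_ratio:.3f}")
print(f"Standard error of estimate: {std_error_of_estimate:.3f}")
```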
The estimates of the regression coefficients are 41.857 and 0.019, so the equation is:
mil = 41.857 + 0.019 * gdp_cap

The standard errors (Std Error) of the estimated coefficients are in the next column and
the standardized coefficients (Std Coef) follow. The latter are called beta weights by
some social scientists. Tolerance is not relevant when there is only one predictor.

Next are t statistics (t). The first (1.685) tests the significance of the difference of
the constant from 0 and the second (6.220) tests the significance of the slope, which is
equivalent to testing the significance of the correlation between military spending and
GDP_CAP.
F-ratio in the analysis of variance table is used to test the hypothesis that the slope
is 0 (or, for multiple regression, that all slopes are 0). The F is large when the
independent variable(s) helps to explain the variation in the dependent variable. Here,
there is a significant linear relation between military spending and GDP_CAP. Thus,
we reject the hypothesis that the slope of the regression line is zero (F-ratio = 38.683,
p value (P) < 0.0005).
It appears from the results above that GDP_CAP is useful for predicting spending
on the military; that is, countries that are financially sound tend to spend more on the
military than poorer nations. These numbers, however, do not provide the complete
picture. Notice that SYSTAT warns us that two countries (Iraq and Libya) with
unusual values could be distorting the results. We recommend that you consider
transforming the data and that you save the residuals and other diagnostic statistics.

Example 2
Transformations
The data in the scatterplot in the simple linear regression example are not well suited
for linear regression, as the heavy concentration of points in the lower left corner of the
graph shows. Here are the same data plotted in log units:
REGRESS
USE ourworld
PLOT MIL*GDP_CAP / SMOOTH=LOWESS TENSION=0.500,
XLABEL="GDP per capita",
XLOG=10 YLABEL="Military Spending" YLOG=10,
SYMBOL=4,2,3,
SIZE=1.250 LABEL=COUNTRY$ CSIZE=1.450


The scatterplot is:

Except possibly for Iraq and Libya, the configuration of these points is better for linear
modeling than that for the untransformed data.
We now transform both the y and x variables and refit the model. The input is:
REGRESS
USE ourworld
LET log_mil = L10(mil)
LET log_gdp = L10(gdp_cap)
MODEL log_mil = CONSTANT + log_gdp
ESTIMATE

The output follows:


1 case(s) deleted due to missing data.

Dep Var: LOG_MIL   N: 56   Multiple R: 0.857   Squared multiple R: 0.734
Adjusted squared multiple R: 0.729   Standard error of estimate: 0.346

Effect      Coefficient   Std Error   Std Coef   Tolerance         t   P(2 Tail)
CONSTANT         -1.308       0.257      0.0          .       -5.091       0.000
LOG_GDP           0.909       0.075      0.857      1.000     12.201       0.000

Effect      Coefficient   Lower < 95%>      Upper
CONSTANT         -1.308         -1.822     -0.793
LOG_GDP           0.909          0.760      1.058

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square   F-ratio       P
Regression           17.868    1        17.868   148.876   0.000
Residual              6.481   54         0.120

*** WARNING ***
Case 22 is an outlier   (Studentized Residual = 4.004)

Durbin-Watson D Statistic       1.810
First Order Autocorrelation     0.070

The Squared multiple R for the variables in log units is 0.734 (versus 0.417 for the
untransformed values). That is, we have gone from explaining 41.7% of the variability
of military spending to 73.4% by using the log transformations. The F-ratio is now
148.876; it was 38.683. Notice that we now have only one outlier (Iraq).

The Calculator
But what is the estimated model now?
log_mil = -1.308 + 0.909 * log_gdp

However, many people don't think in log units. Let's transform this equation
(exponentiate each side of the equation):

$$10^{\mathrm{log\_mil}} = 10^{-1.308 + 0.909 \cdot \mathrm{log\_gdp}}$$
$$\mathrm{mil} = 10^{-1.308 + 0.909 \log_{10}(\mathrm{gdp\_cap})}$$
$$\mathrm{mil} = 10^{-1.308} \cdot 10^{0.909 \log_{10}(\mathrm{gdp\_cap})}$$
$$\mathrm{mil} = 0.049 \cdot (\mathrm{gdp\_cap})^{0.909}$$

We used the calculator to compute 0.049. Type:


CALC 10^-1.308

and SYSTAT returns 0.049.
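
The same back-transformation can be sketched outside SYSTAT. The Python below is only an illustration: the helper names `predict_mil` and `predict_mil_via_logs` are hypothetical, and the two coefficients are the rounded estimates from the log-log output above. It computes the multiplier on the original scale and shows that predicting in log units and then exponentiating gives the same value as the power-law form.

```python
import math

# Rounded estimates copied from the log-log regression output above.
intercept = -1.308
slope = 0.909

# Multiplicative constant on the original scale (the CALC 10^-1.308 step).
scale = 10 ** intercept


def predict_mil(gdp_cap):
    """Prediction on the original scale: scale * gdp_cap ** slope."""
    return scale * gdp_cap ** slope


def predict_mil_via_logs(gdp_cap):
    """Same prediction, computed in log10 units and then exponentiated."""
    return 10 ** (intercept + slope * math.log10(gdp_cap))


print(f"10^-1.308 = {scale:.3f}")
print(predict_mil(1000.0), predict_mil_via_logs(1000.0))
```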

Example 3
Residuals and Diagnostics for Simple Linear Regression
In this example, we continue with the transformations example and save the residuals
and diagnostics along with the data. Using the saved statistics, we create stem-and-leaf
plots of the residuals and Studentized residuals. In addition, let's plot the Studentized
residuals (to identify outliers in the y space) against leverage (to identify outliers in the
x space) and use Cook's distance measure to scale the size of each plot symbol. In a
second plot, we display the corresponding country names. The input is:
REGRESS
USE ourworld
LET log_mil = L10(mil)
LET log_gdp = L10(gdp_cap)
MODEL log_mil = CONSTANT + log_gdp
SAVE myresult / DATA RESID
ESTIMATE
USE myresult
STATS
STEM residual student
PLOT STUDENT*LEVERAGE / SYMBOL=4,2,3 SIZE=cook
PLOT student*leverage / LABEL=country$ SYMBOL=4,2,3


The output is:


Stem and Leaf Plot of variable: RESIDUAL, N = 56
Minimum:       -0.644
Lower hinge:   -0.246
Median:        -0.031
Upper hinge:    0.203
Maximum:        1.216

    -6    42
    -5    6
    -4    42
    -3    554000
    -2 H  65531
    -1    9876433
    -0 M  98433200
     0    222379
     1    1558
     2 H  009
     3    0113369
     4    27
     5    1
     6    7
     7
    * * * Outside Values * * *
    12    1
    1 cases with missing values excluded from plot.

Stem and Leaf Plot of variable: STUDENT, N = 56
Minimum:       -1.923
Lower hinge:   -0.719
Median:        -0.091
Upper hinge:    0.591
Maximum:        4.004

    -1    986
    -1    32000
    -0 H  88877766555
    -0 M  443322111000
     0 M  000022344
     0 H  555889999
     1    0223
     1    5
     2    3
    * * * Outside Values * * *
     4    0
    1 cases with missing values excluded from plot.

In the stem-and-leaf plots, Iraq's residual is 1.216 and is identified as an Outside Value.
The value of its Studentized residual is 4.004, which is very extreme for the t
distribution.
The case with the most influence on the estimates of the regression coefficients
stands out at the top left (that is, it has the largest plot symbol). From the second plot,
we identify this country as Iraq. Its value of Cook's distance measure is large because
its Studentized residual is extreme. On the other hand, Ethiopia (furthest to the right),
the case with the next most influence, has a large value of Cook's distance because its
value of leverage is large. Gambia has the third largest Cook value, and Libya, the
fourth.

Deleting an Outlier
Residual plots identify Iraq as the case with the greatest influence on the estimated
coefficients. Let's remove this case from the analysis and check SYSTAT's warnings.
The input is:
REGRESS
USE ourworld
LET log_mil = L10(mil)
LET log_gdp = L10(gdp_cap)
SELECT mil < 700
MODEL log_mil = CONSTANT + log_gdp
ESTIMATE
SELECT

The output follows:


Dep Var: LOG_MIL   N: 55   Multiple R: 0.886   Squared multiple R: 0.785
Adjusted squared multiple R: 0.781   Standard error of estimate: 0.306

Effect      Coefficient   Std Error   Std Coef   Tolerance         t   P(2 Tail)
CONSTANT         -1.353       0.227      0.0          .       -5.949       0.000
LOG_GDP           0.916       0.066      0.886      1.000     13.896       0.000

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square   F-ratio       P
Regression           18.129    1        18.129   193.107   0.000
Residual              4.976   53         0.094
-------------------------------------------------------------------------------

Durbin-Watson D Statistic       1.763
First Order Autocorrelation     0.086

Now there are no warnings about outliers.


Printing Residuals and Diagnostics


Let's look at some of the values in the MYRESULT file. We use the country name as
the ID variable for the listing. The input is:
USE myresult
IDVAR = country$
FORMAT 10 3
LIST cook leverage student mil gdp_cap

The output is:


* Case ID *       COOK   LEVERAGE    STUDENT        MIL     GDP_CAP
Ireland          0.013      0.032     -0.891     95.833    8970.885
Austria          0.023      0.043     -1.011    127.237   13500.299
Belgium          0.000      0.044     -0.001    283.939   13724.502
Denmark          0.000      0.045     -0.119    269.608   14363.064
(etc.)
Libya            0.056      0.022      2.348    640.513    4738.055
Somalia          0.009      0.072      0.473      8.846     201.798
Afghanistan       .         0.075       .          .        189.128
(etc.)
The value of MIL for Afghanistan is missing, so Cook's distance measure and
Studentized residuals are not available (periods are inserted for these values in the
listing).

Example 4
Multiple Linear Regression
In this example, we build a multiple regression model to predict total employment
using values of six independent variables. The data were originally used by Longley
(1967) to test the robustness of least-squares packages to multicollinearity and other
sources of ill-conditioning. SYSTAT can print the estimates of the regression
coefficients with more correct digits than the solution provided by Longley himself
if you adjust the number of decimal places. By default, the first three digits after the
decimal are displayed. After the output is displayed, you can use General Linear Model
to test hypotheses involving linear combinations of regression coefficients.


The input is:


REGRESS
USE longley
PRINT = LONG
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
ESTIMATE

The output follows:


Eigenvalues of unit scaled X'X
               1         2         3         4         5          6           7
           6.861     0.082     0.046     0.011     0.000      0.000       0.000

Condition indices
               1         2         3         4         5          6           7
           1.000     9.142    12.256    25.337   230.424   1048.080   43275.046

Variance proportions
               1       2       3       4       5       6       7
CONSTANT   0.000   0.000   0.000   0.000   0.000   0.000   1.000
DEFLATOR   0.000   0.000   0.000   0.000   0.457   0.505   0.038
GNP        0.000   0.000   0.000   0.001   0.016   0.328   0.655
UNEMPLOY   0.000   0.014   0.001   0.065   0.006   0.225   0.689
ARMFORCE   0.000   0.092   0.064   0.427   0.115   0.000   0.302
POPULATN   0.000   0.000   0.000   0.000   0.010   0.831   0.160
TIME       0.000   0.000   0.000   0.000   0.000   0.000   1.000

Dep Var: TOTAL   N: 16   Multiple R: 0.998   Squared multiple R: 0.995
Adjusted squared multiple R: 0.992   Standard error of estimate: 304.854

Effect       Coefficient     Std Error   Std Coef   Tolerance        t   P(2 Tail)
CONSTANT    -3482258.635    890420.384      0.0          .      -3.911       0.004
DEFLATOR          15.062        84.915      0.046      0.007     0.177       0.863
GNP               -0.036         0.033     -1.014      0.001    -1.070       0.313
UNEMPLOY          -2.020         0.488     -0.538      0.030    -4.136       0.003
ARMFORCE          -1.033         0.214     -0.205      0.279    -4.822       0.001
POPULATN          -0.051         0.226     -0.101      0.003    -0.226       0.826
TIME            1829.151       455.478      2.480      0.001     4.016       0.003

Effect       Coefficient    Lower < 95%>          Upper
CONSTANT    -3482258.635    -5496529.488   -1467987.781
DEFLATOR          15.062        -177.029        207.153
GNP               -0.036          -0.112          0.040
UNEMPLOY          -2.020          -3.125         -0.915
ARMFORCE          -1.033          -1.518         -0.549
POPULATN          -0.051          -0.563          0.460
TIME            1829.151         798.788       2859.515

Correlation matrix of regression coefficients
            CONSTANT   DEFLATOR        GNP   UNEMPLOY   ARMFORCE
CONSTANT       1.000
DEFLATOR      -0.205      1.000
GNP            0.816     -0.649      1.000
UNEMPLOY       0.836     -0.555      0.946      1.000
ARMFORCE       0.550     -0.349      0.469      0.619      1.000
POPULATN      -0.411      0.659     -0.833     -0.758     -0.189
TIME          -1.000      0.186     -0.802     -0.824     -0.549

            POPULATN       TIME
POPULATN       1.000
TIME           0.388      1.000

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square   F-ratio       P
Regression      1.84172E+08    6   3.06954E+07   330.285   0.000
Residual         836424.056    9     92936.006
-------------------------------------------------------------------------------

Durbin-Watson D Statistic       2.559
First Order Autocorrelation    -0.348

SYSTAT computes the eigenvalues by scaling the columns of the X matrix so that the
diagonal elements of X'X are 1's and then factoring the X'X matrix. In this example,
most of the eigenvalues of X'X are nearly 0, showing that the predictor variables
comprise a relatively redundant set.
Condition indices are the square roots of the ratios of the largest eigenvalue to each
successive eigenvalue. A condition index greater than 15 indicates a possible problem,
and an index greater than 30 suggests a serious problem with collinearity (Belsley,
Kuh, and Welsch, 1980). The condition indices in the Longley example show a
tremendous collinearity problem.
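
The definition of a condition index can be illustrated with a short Python sketch (not SYSTAT code; the function name `collinearity_flag` is hypothetical). It uses only the first four printed eigenvalues, because the last three are shown as 0.000 at three decimals and cannot be divided into. Since the printed eigenvalues are rounded, the computed indices only approximate the values SYSTAT reports from full-precision eigenvalues.

```python
import math

# First four eigenvalues of the unit-scaled X'X, as printed (rounded to 3 decimals).
# The remaining three print as 0.000 and are omitted to avoid dividing by zero.
eigenvalues = [6.861, 0.082, 0.046, 0.011]

# Condition index: square root of (largest eigenvalue / each eigenvalue).
largest = max(eigenvalues)
condition_indices = [math.sqrt(largest / ev) for ev in eigenvalues]


def collinearity_flag(index):
    """Apply the rule of thumb quoted in the text (thresholds 15 and 30)."""
    if index > 30:
        return "serious"
    if index > 15:
        return "possible"
    return "none"


for ev, ci in zip(eigenvalues, condition_indices):
    print(f"eigenvalue {ev:.3f} -> condition index {ci:8.3f} ({collinearity_flag(ci)})")
```

Applied to the indices SYSTAT prints for the Longley data, the same thresholds flag the fourth component (25.337) as a possible problem and the last three (230.424 and above) as serious.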
Variance proportions are the proportions of the variance of the estimates accounted
for by each principal component associated with each of the above eigenvalues. You
should begin to worry about collinearity when a component associated with a high
condition index contributes substantially to the variance of two or more variables. This
is certainly the case with the last component of the Longley data. TIME, GNP, and
UNEMPLOY load highly on this component. See Belsley, Kuh, and Welsch (1980) for
more information about these diagnostics.

Adjusted squared multiple R is 0.992. The formula for this statistic is:

$$\text{adj. sq. multiple } R = R^2 - \frac{(p - 1)}{(n - p)} \, (1 - R^2)$$

where n is the number of cases and p is the number of predictors, including the
constant.
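
As a quick check of the formula, the following Python sketch (the function name `adjusted_r_squared` is illustrative, not a SYSTAT command) plugs in the Longley values: squared multiple R of 0.995, n = 16 cases, and p = 7 parameters (six predictors plus the constant). Because the input is the rounded squared multiple R, the result should match the reported 0.992 only to about three decimals.

```python
def adjusted_r_squared(r_squared, n_cases, n_params):
    """Adjusted squared multiple R; n_params counts the constant."""
    return r_squared - (n_params - 1) / (n_cases - n_params) * (1.0 - r_squared)


# Longley example: rounded squared multiple R, 16 cases, 6 predictors + constant.
longley_adjusted = adjusted_r_squared(0.995, 16, 7)
print(f"Adjusted squared multiple R: {longley_adjusted:.3f}")
```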
Notice the extremely small tolerances in the output. Tolerance is 1 minus the squared
multiple correlation between a predictor and the remaining predictors in the model.
These tolerances signal that the predictor variables are highly intercorrelated, a
worrisome situation. This multicollinearity can inflate the standard errors of the
coefficients, thereby attenuating the associated F statistics, and can threaten
computational accuracy.
Finally, SYSTAT produces the Correlation matrix of regression coefficients. In the
Longley data, these estimates are highly correlated, further indicating that there are too
many correlated predictors in the equation to provide stable estimates.

Scatterplot Matrix
Examining a scatterplot matrix of the variables in the model is often a beneficial first
step in any multiple regression analysis. Nonlinear relationships and correlated
predictors, both of which cause problems for multiple linear regression, can be
uncovered before fitting the model. The input is:
USE longley
SPLOM DEFLATOR GNP UNEMPLOY ARMFORCE POPULATN TIME TOTAL / HALF ,
DENSITY=HIST

The plot follows:

[Figure: lower-half scatterplot matrix of DEFLATOR, GNP, UNEMPLOY, ARMFORCE, POPULATN, TIME, and TOTAL, with histograms on the diagonal]

Notice the severely nonlinear distributions of ARMFORCE with the other variables, as
well as the near perfect correlations among several of the predictors. There is also a
sharp discontinuity between post-war and 1950s behavior on ARMFORCE.

Example 5
Automatic Stepwise Regression
Following is an example of forward automatic stepping using the LONGLEY data. The
input is:
REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
START / FORWARD
STEP / AUTO
STOP


The output is:


Step # 0   R = 0.000   R-Square = 0.000

Effect          Coefficient   Std Error   Std Coef      Tol.   df         F       P

In
___
1 Constant

Out             Part. Corr.
___
2 DEFLATOR            0.971           .          .   1.00000    1   230.089   0.000
3 GNP                 0.984           .          .   1.00000    1   415.103   0.000
4 UNEMPLOY            0.502           .          .   1.00000    1     4.729   0.047
5 ARMFORCE            0.457           .          .   1.00000    1     3.702   0.075
6 POPULATN            0.960           .          .   1.00000    1   166.296   0.000
7 TIME                0.971           .          .   1.00000    1   233.704   0.000
-------------------------------------------------------------------------------

Step # 1   R = 0.984   R-Square = 0.967
Term entered: GNP

Effect          Coefficient   Std Error   Std Coef      Tol.   df         F       P

In
___
1 Constant
3 GNP                 0.035       0.002      0.984   1.00000    1   415.103   0.000

Out             Part. Corr.
___
2 DEFLATOR           -0.187           .          .   0.01675    1     0.473   0.504
4 UNEMPLOY           -0.638           .          .   0.63487    1     8.925   0.010
5 ARMFORCE            0.113           .          .   0.80069    1     0.167   0.689
6 POPULATN           -0.598           .          .   0.01774    1     7.254   0.018
7 TIME               -0.432           .          .   0.00943    1     2.979   0.108
-------------------------------------------------------------------------------

Step # 2   R = 0.990   R-Square = 0.981
Term entered: UNEMPLOY

Effect          Coefficient   Std Error   Std Coef      Tol.   df         F       P

In
___
1 Constant
3 GNP                 0.038       0.002      1.071   0.63487    1   489.314   0.000
4 UNEMPLOY           -0.544       0.182     -0.145   0.63487    1     8.925   0.010

Out             Part. Corr.
___
2 DEFLATOR           -0.073           .          .   0.01603    1     0.064   0.805
5 ARMFORCE           -0.479           .          .   0.48571    1     3.580   0.083
6 POPULATN           -0.164           .          .   0.00563    1     0.334   0.574
7 TIME                0.308           .          .   0.00239    1     1.259   0.284
-------------------------------------------------------------------------------

Step # 3   R = 0.993   R-Square = 0.985
Term entered: ARMFORCE

Effect          Coefficient   Std Error   Std Coef      Tol.   df         F       P

In
___
1 Constant
3 GNP                 0.041       0.002      1.154   0.31838    1   341.684   0.000
4 UNEMPLOY           -0.797       0.213     -0.212   0.38512    1    13.942   0.003
5 ARMFORCE           -0.483       0.255     -0.096   0.48571    1     3.580   0.083

Out             Part. Corr.
___
2 DEFLATOR            0.163           .          .   0.01318    1     0.299   0.596
6 POPULATN           -0.376           .          .   0.00509    1     1.813   0.205
7 TIME                0.830           .          .   0.00157    1    24.314   0.000
-------------------------------------------------------------------------------

Step # 4   R = 0.998   R-Square = 0.995
Term entered: TIME

Effect          Coefficient   Std Error   Std Coef      Tol.   df         F       P

In
___
1 Constant
3 GNP                -0.040       0.016     -1.137   0.00194    1     5.953   0.033
4 UNEMPLOY           -2.088       0.290     -0.556   0.07088    1    51.870   0.000
5 ARMFORCE           -1.015       0.184     -0.201   0.31831    1    30.496   0.000
7 TIME             1887.410     382.766      2.559   0.00157    1    24.314   0.000

Out             Part. Corr.
___
2 DEFLATOR            0.143           .          .   0.01305    1     0.208   0.658
6 POPULATN           -0.150           .          .   0.00443    1     0.230   0.642
-------------------------------------------------------------------------------

Dep Var: TOTAL   N: 16   Multiple R: 0.998   Squared multiple R: 0.995
Adjusted squared multiple R: 0.994   Standard error of estimate: 279.396

Effect       Coefficient     Std Error   Std Coef   Tolerance        t   P(2 Tail)
CONSTANT    -3598729.374    740632.644      0.0          .      -4.859       0.001
GNP               -0.040         0.016     -1.137      0.002    -2.440       0.033
UNEMPLOY          -2.088         0.290     -0.556      0.071    -7.202       0.000
ARMFORCE          -1.015         0.184     -0.201      0.318    -5.522       0.000
TIME            1887.410       382.766      2.559      0.002     4.931       0.000

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square   F-ratio       P
Regression      1.84150E+08    4   4.60375E+07   589.757   0.000
Residual         858680.406   11     78061.855
-------------------------------------------------------------------------------


The steps proceed as follows:


- At step 0, no variables are in the model. GNP has the largest simple correlation and
  F, so SYSTAT enters it at step 1. Note at this step that the partial correlation, Part.
  Corr., is the simple correlation of each predictor with TOTAL.
- With GNP in the equation, UNEMPLOY is now the best candidate.
- The F for ARMFORCE is 3.58 when GNP and UNEMPLOY are included in the
  model.
- SYSTAT finishes by entering TIME.

In four steps, SYSTAT entered four predictors. None was removed, resulting in a final
equation with a constant and four predictors. For this final model, SYSTAT uses all
cases with complete data for GNP, UNEMPLOY, ARMFORCE, and TIME. Thus, when
some values in the sample are missing, the sample size may be larger here than for the
last step in the stepwise process (there, cases are omitted if any value is missing among
the six candidate variables). If you don't want to stop here, you could move more
variables in (or out) using interactive stepping.
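
The entry decisions above can be replayed from the printed F and P values. The Python sketch below is an illustration rather than SYSTAT's algorithm: it assumes the rule "enter the excluded variable with the largest F, provided its P is below alpha-to-enter", and it assumes an alpha-to-enter of 0.15, the value shown in the stepping header of the interactive example. The function name `next_to_enter` is hypothetical, and the (F, P) pairs are copied from the Out lists in the output above.

```python
ALPHA_TO_ENTER = 0.15  # assumed; taken from the Alpha-to-Enter value in the stepping header

# (F, P) for each excluded candidate at steps 0 through 4, copied from the output.
out_lists = [
    {"DEFLATOR": (230.089, 0.000), "GNP": (415.103, 0.000), "UNEMPLOY": (4.729, 0.047),
     "ARMFORCE": (3.702, 0.075), "POPULATN": (166.296, 0.000), "TIME": (233.704, 0.000)},
    {"DEFLATOR": (0.473, 0.504), "UNEMPLOY": (8.925, 0.010), "ARMFORCE": (0.167, 0.689),
     "POPULATN": (7.254, 0.018), "TIME": (2.979, 0.108)},
    {"DEFLATOR": (0.064, 0.805), "ARMFORCE": (3.580, 0.083), "POPULATN": (0.334, 0.574),
     "TIME": (1.259, 0.284)},
    {"DEFLATOR": (0.299, 0.596), "POPULATN": (1.813, 0.205), "TIME": (24.314, 0.000)},
    {"DEFLATOR": (0.208, 0.658), "POPULATN": (0.230, 0.642)},
]


def next_to_enter(candidates, alpha=ALPHA_TO_ENTER):
    """Return the candidate with the largest F if its P is below alpha, else None."""
    name, (f_value, p_value) = max(candidates.items(), key=lambda item: item[1][0])
    return name if p_value < alpha else None


entered = []
for candidates in out_lists:
    choice = next_to_enter(candidates)
    if choice is None:
        break  # no remaining candidate meets the entry criterion
    entered.append(choice)

print(entered)
```

Under these assumptions the replay enters GNP, UNEMPLOY, ARMFORCE, and TIME and then stops, because neither DEFLATOR nor POPULATN has a P value below 0.15 at the last step.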

Example 6
Interactive Stepwise Regression
Interactive stepping helps you to explore model building in more detail. With data that
are as highly intercorrelated as the LONGLEY data, interactive stepping reveals the
dangers of thinking that the automated result is the only acceptable subset model. In
this example, we use interactive stepping to explore the LONGLEY data further. That
is, after specifying a model that includes all of the candidate variables available, we
request backward stepping by selecting Stepwise, Backward, and Interactive in the
Regression Options dialog box. After reviewing the results at each step, we use Step
to move a variable in (or out) of the model. When finished, we select Stop for the final
model. To begin interactive stepping, the input is:
REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
START / BACK


The output is:


Step # 0   R = 0.998   R-Square = 0.995

Effect          Coefficient   Std Error   Std Coef      Tol.   df         F       P

In
___
1 Constant
2 DEFLATOR           15.062      84.915      0.046   0.00738    1     0.031   0.863
3 GNP                -0.036       0.033     -1.014   0.00056    1     1.144   0.313
4 UNEMPLOY           -2.020       0.488     -0.538   0.02975    1    17.110   0.003
5 ARMFORCE           -1.033       0.214     -0.205   0.27863    1    23.252   0.001
6 POPULATN           -0.051       0.226     -0.101   0.00251    1     0.051   0.826
7 TIME             1829.151     455.478      2.480   0.00132    1    16.127   0.003

Out             Part. Corr.
___
none
-------------------------------------------------------------------------------

We begin with all variables in the model. We remove DEFLATOR because it has an
unusually low tolerance and F value.
Type:
STEP deflator

The output is:


Dependent Variable TOTAL
Minimum tolerance for entry into model = 0.000000
Backward stepwise with Alpha-to-Enter=0.150 and Alpha-to-Remove=0.150
Step # 1 R = 0.998 R-Square = 0.995
Term removed: DEFLATOR

Effect          Coefficient  Std Error  Std Coef     Tol.  df        F      P

In
___
1 Constant
3 GNP                -0.032      0.024    -0.905  0.00097   1    1.744  0.216
4 UNEMPLOY           -1.972      0.386    -0.525  0.04299   1   26.090  0.000
5 ARMFORCE           -1.020      0.191    -0.202  0.31723   1   28.564  0.000
6 POPULATN           -0.078      0.162    -0.154  0.00443   1    0.230  0.642
7 TIME             1814.101    425.283     2.459  0.00136   1   18.196  0.002

Out             Part. Corr.
___
2 DEFLATOR            0.059          .         .  0.00738   1    0.031  0.863
-------------------------------------------------------------------------------


POPULATN has the lowest F statistic and, again, a low tolerance.
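The choice made at this step can be reproduced with a few lines of code. The following is a minimal sketch in Python (not SYSTAT syntax): it copies the F values from the Step # 1 panel into a dictionary and picks the term with the smallest F. The function name `weakest_term` is hypothetical, and the sketch deliberately ignores tolerance, which the text also considers.

```python
# F values for the terms still in the model, copied from the Step # 1 panel.
step1_f = {
    "GNP": 1.744,
    "UNEMPLOY": 26.090,
    "ARMFORCE": 28.564,
    "POPULATN": 0.230,
    "TIME": 18.196,
}


def weakest_term(f_values):
    """Return the name of the term with the smallest F statistic."""
    return min(f_values, key=f_values.get)


candidate = weakest_term(step1_f)
print(candidate)
```

In interactive stepping the analyst, not the program, makes this decision, so a rule like this is only a starting point for judgment.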


Type:
STEP populatn

The output is:


Step # 2 R = 0.998 R-Square = 0.995
Term removed: POPULATN

Effect          Coefficient  Std Error  Std Coef     Tol.  df        F      P

In
___
1 Constant
3 GNP                -0.040      0.016    -1.137  0.00194   1    5.953  0.033
4 UNEMPLOY           -2.088      0.290    -0.556  0.07088   1   51.870  0.000
5 ARMFORCE           -1.015      0.184    -0.201  0.31831   1   30.496  0.000
7 TIME             1887.410    382.766     2.559  0.00157   1   24.314  0.000

Out             Part. Corr.
___
2 DEFLATOR            0.143          .         .  0.01305   1    0.208  0.658
6 POPULATN           -0.150          .         .  0.00443   1    0.230  0.642
-------------------------------------------------------------------------------

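GNP and TIME both have low tolerance values. They could be highly correlated with
one another, so we will take each out and examine the behavior of the other when we do.
The first STEP time removes TIME, the second STEP time puts it back, and STEP gnp
then removes GNP.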
Type:
STEP time
STEP time
STEP gnp

The output is:


Step # 3 R = 0.993 R-Square = 0.985
Term removed: TIME

Effect          Coefficient  Std Error  Std Coef     Tol.  df        F      P

In
___
1 Constant
3 GNP                 0.041      0.002     1.154  0.31838   1  341.684  0.000
4 UNEMPLOY           -0.797      0.213    -0.212  0.38512   1   13.942  0.003
5 ARMFORCE           -0.483      0.255    -0.096  0.48571   1    3.580  0.083

Out             Part. Corr.
___
2 DEFLATOR            0.163          .         .  0.01318   1    0.299  0.596
6 POPULATN           -0.376          .         .  0.00509   1    1.813  0.205
7 TIME                0.830          .         .  0.00157   1   24.314  0.000
-------------------------------------------------------------------------------


Step # 4 R = 0.998 R-Square = 0.995
Term entered: TIME

Effect          Coefficient  Std Error  Std Coef     Tol.  df        F      P

In
___
1 Constant
3 GNP                -0.040      0.016    -1.137  0.00194   1    5.953  0.033
4 UNEMPLOY           -2.088      0.290    -0.556  0.07088   1   51.870  0.000
5 ARMFORCE           -1.015      0.184    -0.201  0.31831   1   30.496  0.000
7 TIME             1887.410    382.766     2.559  0.00157   1   24.314  0.000

Out             Part. Corr.
___
2 DEFLATOR            0.143          .         .  0.01305   1    0.208  0.658
6 POPULATN           -0.150          .         .  0.00443   1    0.230  0.642
-------------------------------------------------------------------------------
Step # 5 R = 0.996 R-Square = 0.993
Term removed: GNP

Effect          Coefficient  Std Error  Std Coef     Tol.  df        F      P

In
___
1 Constant
4 UNEMPLOY           -1.470      0.167    -0.391  0.30139   1   77.320  0.000
5 ARMFORCE           -0.772      0.184    -0.153  0.44978   1   17.671  0.001
7 TIME              956.380     35.525     1.297  0.25701   1  724.765  0.000

Out             Part. Corr.
___
2 DEFLATOR           -0.031          .         .  0.01385   1    0.011  0.920
3 GNP                -0.593          .         .  0.00194   1    5.953  0.033
6 POPULATN           -0.505          .         .  0.00889   1    3.768  0.078
-------------------------------------------------------------------------------

We are comfortable with the tolerance values in both models with three variables. With
TIME in the model, the smallest F is 17.671, and with GNP in the model, the smallest
F is 3.580. Furthermore, with TIME, the squared multiple correlation is 0.993, and with
GNP, it is 0.985. Let's stop the stepping and view more information about the last
model.
Type:
STOP

The output is:


Dep Var: TOTAL   N: 16   Multiple R: 0.996   Squared multiple R: 0.993

Adjusted squared multiple R: 0.991   Standard error of estimate: 332.084

Effect       Coefficient    Std Error   Std Coef  Tolerance        t  P(2 Tail)
CONSTANT    -1797221.112    68641.553      0.0         .     -26.183      0.000
UNEMPLOY          -1.470        0.167     -0.391      0.301   -8.793      0.000
ARMFORCE          -0.772        0.184     -0.153      0.450   -4.204      0.001
TIME             956.380       35.525      1.297      0.257   26.921      0.000

Effect       Coefficient         Lower    < 95%>         Upper
CONSTANT    -1797221.112  -1946778.208            -1647664.016
UNEMPLOY          -1.470        -1.834                  -1.106
ARMFORCE          -0.772        -1.173                  -0.372
TIME             956.380       878.978                1033.782

Analysis of Variance

Source      Sum-of-Squares   df   Mean-Square   F-ratio       P
Regression     1.83685E+08    3   6.12285E+07   555.209   0.000
Residual       1323360.743   12    110280.062
-------------------------------------------------------------------------------

Our final model includes only UNEMPLOY, ARMFORCE, and TIME. Notice that its
multiple correlation (0.996) is only slightly smaller than that for the automated
stepping (0.998). Following are the commands we used:
REGRESS
USE longley
MODEL total=constant + deflator + gnp + unemploy +,
armforce + populatn + time
START / BACK
STEP deflator
STEP populatn
STEP time
STEP time
STEP gnp
STOP

Example 7
Testing whether a Single Coefficient Equals Zero
Most regression programs print tests of significance for each coefficient in an equation.
SYSTAT has a powerful additional feature: post hoc tests of regression coefficients.
To demonstrate these tests, we use the LONGLEY data and examine whether the
DEFLATOR coefficient differs significantly from 0. The input is:
REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
ESTIMATE / TOL=.00001
HYPOTHESIS
EFFECT = deflator
TEST


The output is:


Dep Var: TOTAL   N: 16   Multiple R: 0.998   Squared multiple R: 0.995

Adjusted squared multiple R: 0.992   Standard error of estimate: 304.854

Effect       Coefficient    Std Error   Std Coef  Tolerance        t  P(2 Tail)
CONSTANT    -3482258.635   890420.384      0.0         .      -3.911      0.004
DEFLATOR          15.062       84.915      0.046      0.007    0.177      0.863
GNP               -0.036        0.033     -1.014      0.001   -1.070      0.313
UNEMPLOY          -2.020        0.488     -0.538      0.030   -4.136      0.003
ARMFORCE          -1.033        0.214     -0.205      0.279   -4.822      0.001
POPULATN          -0.051        0.226     -0.101      0.003   -0.226      0.826
TIME            1829.151      455.478      2.480      0.001    4.016      0.003

Analysis of Variance

Source      Sum-of-Squares   df   Mean-Square   F-ratio       P
Regression     1.84172E+08    6   3.06954E+07   330.285   0.000
Residual        836424.056    9     92936.006
-------------------------------------------------------------------------------
Test for effect called: DEFLATOR

Test of Hypothesis

Source               SS   df          MS       F       P
Hypothesis     2923.976    1    2923.976   0.031   0.863
Error        836424.056    9   92936.006
-------------------------------------------------------------------------------

Notice that the error sum of squares (836424.056) is the same as the residual
sum of squares at the bottom of the ANOVA table. The probability level (0.863) is the
same as the two-tailed probability printed for DEFLATOR in the coefficient table. This
probability level (> 0.05) indicates that the regression coefficient for DEFLATOR does
not differ significantly from 0.
You can test all of the coefficients in the equation this way, individually, or choose
All to generate separate hypothesis tests for each predictor, or type:
HYPOTHESIS
ALL
TEST
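
The agreement between the single-coefficient hypothesis test and the coefficient table can be checked by hand. The sketch below is Python, not SYSTAT syntax, and the function name `hypothesis_f` is hypothetical. It forms F as the hypothesis mean square over the error mean square and compares it with the square of the DEFLATOR t statistic (coefficient divided by standard error), using the values printed in the output above.

```python
def hypothesis_f(ss_hyp, df_hyp, ss_err, df_err):
    """F ratio: hypothesis mean square divided by error mean square."""
    ms_hyp = ss_hyp / df_hyp
    ms_err = ss_err / df_err
    return ms_hyp / ms_err


# Sums of squares and degrees of freedom from the Test of Hypothesis panel.
f_deflator = hypothesis_f(2923.976, 1, 836424.056, 9)

# Coefficient and standard error for DEFLATOR from the coefficient table.
t_deflator = 15.062 / 84.915

# For a one-degree-of-freedom hypothesis, F equals t squared (up to rounding).
print(round(f_deflator, 3), round(t_deflator ** 2, 3))
```

Because the printed values are rounded to three decimals, the two quantities agree only approximately.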


Example 8
Testing whether Multiple Coefficients Equal Zero
You may wonder why you need to bother with testing when the regression output gives
you hypothesis test results. Try the following hypothesis test:
REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
ESTIMATE / TOL=.00001
HYPOTHESIS
EFFECT = deflator & gnp
TEST

The hypothesis output is:


Test for effect called: DEFLATOR and GNP

A Matrix
           1       2       3       4       5
   1     0.0   1.000     0.0     0.0     0.0
   2     0.0     0.0   1.000     0.0     0.0

           6       7
   1     0.0     0.0
   2     0.0     0.0

Test of Hypothesis

Source               SS   df          MS       F       P
Hypothesis   149295.592    2   74647.796   0.803   0.478
Error        836424.056    9   92936.006
-------------------------------------------------------------------------------

Here, the error sum of squares is the same as that for the model, but the hypothesis sum
of squares is different. We just tested the hypothesis that the DEFLATOR and GNP
coefficients simultaneously are 0.
The A matrix printed above the test specifies the hypothesis that we tested. It has
two degrees of freedom (see the F statistic) because the A matrix has two rows, one
for each coefficient. If you know some matrix algebra, you can see that the matrix
product AB using this A matrix and B as a column matrix of regression coefficients
picks up only two coefficients: DEFLATOR and GNP. Notice that our hypothesis had
the following matrix equation: AB = 0, where 0 is a null matrix.
If you don't know matrix algebra, don't worry; the ampersand method is equivalent.
You can ignore the A matrix in the output.
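
For readers who want to see the product AB concretely, here is a small Python sketch (not SYSTAT syntax). It multiplies the A matrix from the output by the column of coefficient estimates from the Example 7 table, in the order CONSTANT, DEFLATOR, GNP, UNEMPLOY, ARMFORCE, POPULATN, TIME. The helper name `mat_vec` is hypothetical.

```python
# Estimated coefficients from the full model, in model order.
b = [-3482258.635, 15.062, -0.036, -2.020, -1.033, -0.051, 1829.151]

# A matrix: row 1 selects DEFLATOR, row 2 selects GNP.
a = [
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
]


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(c * v for c, v in zip(row, vector)) for row in matrix]


ab = mat_vec(a, b)
print(ab)
```

Each row of A yields one linear combination of the coefficients, which is why the hypothesis has as many degrees of freedom as A has rows.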


Two Coefficients with an A Matrix


If you are experienced with matrix algebra, however, you can specify your own matrix
by using AMATRIX. When typing the matrix, be sure to separate cells with spaces and
press Enter between rows. The following simultaneously tests that DEFLATOR = 0 and
GNP = 0:
HYPOTHESIS
AMATRIX [0 1 0 0 0 0 0;
0 0 1 0 0 0 0]
TEST

You get the same output as above.


Why bother with AMATRIX when you can use EFFECT? Because in the A matrix,
you can use any numbers, not just 0s and 1s. Here is a bizarre matrix:
1.0 3.0 0.5 64.3 3.0 2.0 0.0

You may not want to test this kind of hypothesis on the LONGLEY data, but there are
important applications in the analysis of variance where you might.

Example 9
Testing Nonzero Null Hypotheses
You can test nonzero null hypotheses with a D matrix, often in combination with
CONTRAST or AMATRIX. Here, we test whether the DEFLATOR coefficient
differs significantly from 30:
REGRESS
USE longley
MODEL total = CONSTANT + deflator + gnp + unemploy +,
armforce + populatn + time
ESTIMATE / TOL=.00001
HYPOTHESIS
AMATRIX [0 1 0 0 0 0 0]
DMATRIX [30]
TEST


The output is:


Hypothesis.

A Matrix
           1       2       3       4       5
         0.0   1.000     0.0     0.0     0.0

           6       7
         0.0     0.0

Null hypothesis value for D
      30.000

Test of Hypothesis

Source               SS   df          MS       F       P
Hypothesis     2876.128    1    2876.128   0.031   0.864
Error        836424.056    9   92936.006

The test of whether DEFLATOR differs from 30 can be specified more
concisely using SPECIFY:
HYPOTHESIS
SPECIFY deflator=30
TEST
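
The F statistic for a nonzero null value on a single coefficient can be approximated from the coefficient table alone. The Python sketch below (not SYSTAT syntax; the function name `f_for_null_value` is hypothetical) subtracts the null value from the estimate, divides by the standard error, and squares the result. It uses the rounded DEFLATOR estimate and standard error printed in Example 7, so agreement with the output is to rounding only.

```python
def f_for_null_value(estimate, std_error, null_value):
    """Squared t statistic for H0: coefficient == null_value (1 df)."""
    t = (estimate - null_value) / std_error
    return t * t


# DEFLATOR estimate and standard error from the Example 7 coefficient table.
f_30 = f_for_null_value(15.062, 84.915, 30.0)
print(round(f_30, 3))
```

With a standard error as large as 84.915, a coefficient of 15.062 is about as compatible with 30 as it is with 0, which is why the two tests give nearly the same F.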

Example 10
Regression with Ecological or Grouped Data
If you have aggregated data, weight the regression by a count variable. This variable
should represent the counts of observations (n) contributing to the ith case. If n is not
an integer, SYSTAT truncates it to an integer before using it as a weight. The regression
results are identical to those produced if you had typed in each case.
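
The equivalence between a frequency-weighted fit and a fit to the fully expanded data can be illustrated with a short Python sketch (not SYSTAT syntax). The data values, the count variable, and the function name `fit_line` are all hypothetical; the sketch covers only a single predictor and only the coefficient estimates. Counts are truncated to integers, following the description above.

```python
def fit_line(x, y, freq=None):
    """Least-squares intercept and slope, optionally weighting cases by a count."""
    if freq is None:
        freq = [1] * len(x)
    # Truncate non-integer counts to integers, as the text describes.
    w = [int(f) for f in freq]
    n = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / n
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / n
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical grouped file: one row per (x, y) pattern plus a count.
gx, gy, count = [1, 2, 3], [5.0, 4.0, 2.5], [2, 3.9, 1]

# The same data typed in case by case (the count 3.9 is truncated to 3).
ex = [1, 1, 2, 2, 2, 3]
ey = [5.0, 5.0, 4.0, 4.0, 4.0, 2.5]

grouped = fit_line(gx, gy, count)
expanded = fit_line(ex, ey)
print(grouped, expanded)
```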
We use, for this example, an ecological or grouped data file, PLANTS. The input is:
REGRESS
USE plants
FREQ=count
MODEL co2 = CONSTANT + species
ESTIMATE

The output is:


Dep Var: CO2   N: 76   Multiple R: 0.757   Squared multiple R: 0.573

Adjusted squared multiple R: 0.567   Standard error of estimate: 0.729

Effect       Coefficient    Std Error   Std Coef  Tolerance        t  P(2 Tail)
CONSTANT          13.738        0.204      0.0         .      67.273      0.000
SPECIES           -0.466        0.047     -0.757      1.000   -9.961      0.000

Effect       Coefficient         Lower    < 95%>         Upper
CONSTANT          13.738        13.331                  14.144
SPECIES           -0.466        -0.559                  -0.372

Analysis of Variance

Source      Sum-of-Squares   df   Mean-Square   F-ratio       P
Regression          52.660    1        52.660    99.223   0.000
Residual            39.274   74         0.531
-------------------------------------------------------------------------------

Example 11
Regression without the Constant
To regress without the constant (intercept) term, or through the origin, remove the
constant from the list of independent variables. REGRESS adjusts accordingly. The
input is:
REGRESS
MODEL dependent = var1 + var2
ESTIMATE

Some users are puzzled when they see a model without a constant having a higher
multiple correlation than a model that includes a constant. How can a regression with
fewer parameters predict better than another? It doesn't. The total sum of squares
must be redefined for a regression model with zero intercept. It is no longer centered
about the mean of the dependent variable. Other definitions of sums of squares can lead
to strange results, such as negative multiple correlations. If your constant is actually
near 0, then including or excluding the constant makes little difference in the output.
Kvålseth (1985) discusses the issues involved in summary statistics for zero-intercept
regression models. The definition of R² used in SYSTAT is Kvålseth's formula 7. This
was chosen because it retains its PRE (percentage reduction of error) interpretation and
is guaranteed to be in the (0,1) interval.
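
The effect of the choice of total sum of squares can be seen in a small Python sketch (not SYSTAT syntax). The data are hypothetical, chosen so that the true intercept is far from 0. The sketch fits a line through the origin and computes R² two ways: with the uncentered total sum of squares that the paragraph above describes, and with the usual centered total. It is an illustration of the idea under those assumptions, not a reproduction of SYSTAT's internal computation.

```python
# Hypothetical data: y is nearly flat around 10, so the intercept is not near 0.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 9.0, 11.0, 10.0]

# Least-squares slope for a model with no constant: sum(xy) / sum(x^2).
slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Residual sum of squares for the zero-intercept fit.
sse = sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))

mean_y = sum(y) / len(y)
ss_total_uncentered = sum(yi * yi for yi in y)
ss_total_centered = sum((yi - mean_y) ** 2 for yi in y)

r2_uncentered = 1 - sse / ss_total_uncentered
r2_centered = 1 - sse / ss_total_centered
print(round(r2_uncentered, 3), round(r2_centered, 3))
```

Here the uncentered version stays between 0 and 1, while the centered version is negative, which is the kind of strange result the text warns about.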
How, then, do you test the significance of a constant in a regression model? Include
a constant in the model as usual and look at its test of significance.
If you have a zero-intercept model where it is appropriate to compute a coefficient
of determination and other summary statistics about the centered data, use General
Linear Model and select Mixture model. This option provides Kvålseth's formula 1 for
R² and uses centered total sum of squares for other summary statistics.


References
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980). Regression diagnostics: Identifying
influential data and sources of collinearity. New York: John Wiley & Sons, Inc.
Flack, V. F. and Chang, P. C. (1987). Frequency of selecting noise variables in subset
regression analysis: A simulation study. The American Statistician, 41, 84-86.
Freedman, D. A. (1983). A note on screening regression equations. The American
Statistician, 37, 152-155.
Hocking, R. R. (1983). Developments in linear regression methodology: 1959-82.
Technometrics, 25, 219-230.
Lovell, M. C. (1983). Data Mining. The Review of Economics and Statistics, 65, 1-12.
Rencher, A. C. and Pun, F. C. (1980). Inflation of R-squared in best subset regression.
Technometrics, 22, 49-54.
Velleman, P. F. and Welsch, R. E. (1981). Efficient computing of regression diagnostics.
The American Statistician, 35, 234-242.
Weisberg, S. (1985). Applied linear regression. New York: John Wiley & Sons, Inc.
Wilkinson, L. (1979). Tests of significance in stepwise regression. Psychological Bulletin,
86, 168-174.
Wilkinson, L. and Dallal, G. E. (1982). Tests of significance in forward selection
regression with an F-to-enter stopping rule. Technometrics, 24, 25-28.

Chapter 15
Linear Models II: Analysis of Variance
Leland Wilkinson and Mark Coward

SYSTAT handles a wide variety of balanced and unbalanced analysis of variance
designs. The Analysis of Variance (ANOVA) procedure includes all interactions in the
model and tests them automatically; it also provides analysis of covariance and
repeated measures designs. After you have estimated your ANOVA model, it is easy
to test post hoc pairwise differences in means or to test any contrast across cell means,
including simple effects.
For models with fixed and random effects, you can define error terms for specific
hypotheses. You can also do stepwise ANOVA (that is, Type I sums of squares).
Categorical variables are entered or deleted in blocks, and you can examine
interactively or automatically all combinations of interactions and main effects.
The General Linear Model (GLM) procedure is used for randomized block
designs, incomplete block designs, fractional factorials, Latin square designs, and
analysis of covariance with one or more covariates. GLM also includes repeated
measures, split plot, and crossover designs. It includes both univariate and
multivariate approaches to repeated measures designs.
Moreover, GLM also features the means model for missing cells designs. Widely
favored for this purpose by statisticians (Searle, 1987; Hocking, 1985; Milliken and
Johnson, 1984), the means model allows tests of hypotheses in missing cells designs
(using what are often called Type IV sums of squares). Furthermore, the means model
allows direct tests of simple hypotheses (for example, within levels of other factors).
Finally, the means model allows easier use of population weights to reflect
differences in subclass sizes.
For both ANOVA and GLM, group sizes can be unequal for combinations of
grouping factors; but for repeated measures designs, each subject must have complete
data. You can use numeric or character values to code grouping variables.
You can store results of the analysis (predicted values and residuals) for further
study and graphical display. In ANCOVA, you can save adjusted cell means.


Analysis of Variance in SYSTAT


ANOVA: Estimate Model
To obtain an analysis of variance, from the menus choose:
Statistics
Analysis of Variance (ANOVA)
Estimate Model

Dependent. The variable(s) you want to examine. The dependent variable(s) should be
continuous and numeric (for example, INCOME). For MANOVA (multivariate
analysis of variance), select two or more dependent variables.
Factor. One or more categorical variables (grouping variables) that split your cases into
two or more groups.
Missing values. Includes a separate category for cases with a missing value for the
variable(s) identified with Factor.
Covariates. A covariate is a quantitative independent variable that adds unwanted
variability to the dependent variable. An analysis of covariance (ANCOVA) adjusts or
removes the variability in the dependent variable due to the covariate (for example,
variability in cholesterol level might be removed by using AGE as a covariate).


Post hoc Tests. Post hoc tests determine which pairs of means differ significantly. The
following alternatives are available:
- Bonferroni. Multiple comparison test based on Student's t statistic. Adjusts the
  observed significance level for the fact that multiple comparisons are made.
- Tukey. Uses the Studentized range statistic to make all pairwise comparisons
  between groups and sets the experimentwise error rate to the error rate for the
  collection of all pairwise comparisons. When testing a large number of pairs of
  means, Tukey is more powerful than Bonferroni. For a small number of pairs,
  Bonferroni is more powerful.
- LSD. Least significant difference pairwise multiple comparison test. Equivalent to
  multiple t tests between all pairs of groups. The disadvantage of this test is that no
  attempt is made to adjust the observed significance level for multiple comparisons.
- Scheffé. The significance level of Scheffé's test is designed to allow all possible
  linear combinations of group means to be tested, not just the pairwise comparisons
  available in this feature. The result is that Scheffé's test is more conservative than
  other tests, meaning that a larger difference between means is required for
  significance.
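
As an illustration of what adjusting the observed significance level means, the following Python sketch (not SYSTAT syntax) applies the standard Bonferroni adjustment: each unadjusted probability is multiplied by the number of comparisons and capped at 1. The unadjusted probabilities are hypothetical, and the sketch assumes the textbook form of the adjustment rather than documenting SYSTAT's internal computation.

```python
def bonferroni_adjust(p_values):
    """Multiply each p value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical unadjusted probabilities for three pairwise comparisons.
raw_p = [0.02, 0.5, 0.001]
adjusted_p = bonferroni_adjust(raw_p)
print(adjusted_p)
```

With three groups there are three pairwise comparisons, so a comparison needs an unadjusted probability below roughly 0.05 / 3 to remain significant at the 0.05 level after adjustment.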
Save file. You can save residuals and other data to a new data file. The following
alternatives are available:
- Residuals. Saves predicted values, residuals, Studentized residuals, leverages,
  Cook's D, and the standard error of predicted values. Only the predicted values and
  residuals are appropriate for ANOVA.
- Residuals/Data. Saves the statistics given by Residuals plus all of the variables in
  the working data file, including any transformed data values.
- Adjusted. Saves adjusted cell means from analysis of covariance.
- Adjusted/Data. Saves adjusted cell means plus all of the variables in the working
  data file, including any transformed data values.
- Model. Saves statistics given in Residuals and the variables used in the model.
- Coefficients. Saves estimates of the regression coefficients.


ANOVA: Hypothesis Test


Contrasts are used to test relationships among cell means. The Post hoc Tests on the
ANOVA dialog box are the simplest form because they compare two means at a
time. Use Specify or Contrast to define contrasts involving two or more means. For
example, contrast the average responses for two treatment groups against that for a
control group, or test whether average income increases linearly across cells ordered by
education (dropouts, high school graduates, college graduates). The coefficients for
the means of the first contrast might be (1, 1, -2) for a contrast of 1 * Treatment A plus
1 * Treatment B minus 2 * Control. The coefficients for the second contrast would be
(-1, 0, 1).
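
A contrast is simply a weighted sum of cell means whose weights sum to zero. The Python sketch below (not SYSTAT syntax) evaluates the treatment-versus-control contrast with coefficients (1, 1, -2). The three cell means and the function name `contrast_value` are hypothetical.

```python
def contrast_value(coefs, means):
    """Weighted sum of cell means for a contrast whose weights sum to zero."""
    if len(coefs) != len(means):
        raise ValueError("need one coefficient per cell mean")
    if abs(sum(coefs)) > 1e-12:
        raise ValueError("contrast coefficients must sum to zero")
    return sum(c * m for c, m in zip(coefs, means))


# Hypothetical cell means for Treatment A, Treatment B, and Control.
means = (12.0, 14.0, 10.0)
value = contrast_value((1, 1, -2), means)
print(value)
```

A value near 0 would indicate that the average of the two treatment means is close to the control mean; the hypothesis test asks whether the observed value differs from 0 by more than sampling error.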
To define contrasts among the cell means, from the menus choose:
Statistics
Analysis of Variance (ANOVA)
Hypothesis Test

An ANOVA model must be estimated before any hypothesis tests can be performed.

Contrasts can be defined across the categories of a grouping factor or across the levels
of a repeated measure.
- Effects. Specify the factor (that is, grouping variable) to which the contrast applies.
  Selecting All yields a separate test of the effect of each factor in the ANOVA model,
  as well as tests of all interactions between those factors.
- Within. Use when specifying a contrast across the levels of a repeated measures
  factor. Enter the name assigned to the set of repeated measures.


Specify
To specify hypothesis test coefficients, click Specify in the ANOVA Hypothesis Test
dialog box.

To specify coefficients for a hypothesis test, use cell identifiers. Common hypothesis
tests include contrasts across marginal means or tests of simple effects. For a two-way
factorial ANOVA design with DISEASE (four categories) and DRUG (three
categories), you could contrast the marginal mean for the first level of drug against the
third level by specifying:
DRUG[1] = DRUG[3]

Note that square brackets enclose the value of the category (for example, for
GENDER$, specify GENDER$[male]). For the simple contrast of the first and third
levels of DRUG for the second disease only:
DRUG[1] DISEASE[2] = DRUG[3] DISEASE[2]

The syntax also allows statements like:


-3*DRUG[1] - 1*DRUG[2] + 1*DRUG[3] + 3*DRUG[4]

You have two error term options for hypothesis tests:


- Pooled. Uses the error term from the current model.
- Separate. Generates a separate variances error term.


Contrast
To specify contrasts, click Contrast in the ANOVA Hypothesis Test dialog box.

Contrast generates a contrast for a grouping factor or a repeated measures factor.
SYSTAT offers six types of contrasts.
- Custom. Enter your own custom coefficients. If your factor has, say, four ordered
  categories (or levels), you can specify your own coefficients, such as -3 -1 1 3, by
  typing these values in the Custom text box.
- Difference. Compare each level with its adjacent level.
- Polynomial. Generate orthogonal polynomial contrasts (to test linear, quadratic,
  cubic trends across ordered categories or levels).
- Order. Enter 1 for linear, 2 for quadratic, and so on.
- Metric. Use Metric when the ordered categories are not evenly spaced. For
  example, when repeated measures are collected at weeks 2, 4, and 8, enter 2,4,8 as
  the metric.
- Sum. In a repeated measures ANOVA, total the values for each subject.
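
To show how a metric changes polynomial contrasts, the following Python sketch (not SYSTAT syntax) builds unnormalized linear and quadratic contrasts from the level spacings: the linear contrast is the centered metric, and the quadratic contrast is the centered squared metric with its linear component removed. The function name `polynomial_contrasts` is hypothetical, and SYSTAT may scale its coefficients differently, so only the pattern and the orthogonality should be compared.

```python
from fractions import Fraction


def polynomial_contrasts(metric):
    """Return (linear, quadratic) orthogonal contrasts for the given spacings."""
    x = [Fraction(v) for v in metric]
    k = len(x)
    mean_x = sum(x) / k
    # Linear contrast: spacings centered at their mean.
    linear = [v - mean_x for v in x]
    # Quadratic contrast: centered squares, made orthogonal to the linear one.
    squares = [v * v for v in x]
    mean_sq = sum(squares) / k
    centered_sq = [v - mean_sq for v in squares]
    slope = sum(c * l for c, l in zip(centered_sq, linear)) / sum(l * l for l in linear)
    quadratic = [c - slope * l for c, l in zip(centered_sq, linear)]
    return linear, quadratic


even_lin, even_quad = polynomial_contrasts([1, 2, 3])
week_lin, week_quad = polynomial_contrasts([2, 4, 8])
print(even_lin, even_quad)
print(week_lin, week_quad)
```

For evenly spaced levels the result is proportional to the familiar (-1, 0, 1) and (1, -2, 1); for weeks 2, 4, and 8 the coefficients shift to reflect the longer gap before the last measurement.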

Repeated Measures
In a repeated measures design, the same variable is measured several times for each
subject (case). A paired-comparison t test is the simplest form of a repeated
measures design (for example, each subject has a before and after measure).


SYSTAT derives values from your repeated measures and uses them in analysis of
variance computations to test changes across the repeated measures (within subjects)
as well as differences between groups of subjects (between subjects). Tests of the
within-subjects values are called polynomial tests of order 1, 2, ..., up to k, where k is
one less than the number of repeated measures. The first polynomial is used to test
linear changes: do the repeated responses increase (or decrease) around a line with a
significant slope? The second polynomial tests whether the responses fall along a
quadratic curve, and so on.
To obtain a repeated measures analysis of variance, from the menus choose:
Statistics
Analysis of Variance (ANOVA)
Estimate Model

and click Repeated.

The following options are available:

Perform repeated measures analysis. Treats the dependent variables as a set of repeated
measures.

Optionally, you can assign a name for each set of repeated measures, specify the
number of levels, and specify the metric for unevenly spaced repeated measures.
- Name. Name that identifies each set of repeated measures.
- Levels. Number of repeated measures in the set. For example, if you have three
  dependent variables that represent measurements at different times, the number of
  levels is 3.
- Metric. Metric that indicates the spacing between unevenly spaced measurements.
  For example, if measurements were taken at the third, fifth, and ninth weeks, the
  metric would be 3, 5, 9.

Using Commands
ANOVA
USE filename
CATEGORY / MISS
DEPEND / REPEAT NAMES
BONF or TUKEY or LSD or SCHEFFE
SAVE filename / ADJUST, MODEL, RESID, DATA
ESTIMATE

To use ANOVA for analysis of covariance, insert COVARIATE before ESTIMATE.


After estimating a model, use HYPOTHESIS to test its parameters. Begin each test with
HYPOTHESIS and end with TEST.
HYPOTHESIS
EFFECT or WITHIN
ERROR
POST / LSD or TUKEY or BONF or SCHEFFE
POOLED or SEPARATE
or CONTRAST / DIFFERENCE or POLYNOMIAL or SUM or ORDER
or METRIC
or SPECIFY / POOLED or SEPARATE
AMATRIX
CMATRIX
TEST

Usage Considerations
Types of data. ANOVA requires a rectangular data file.
Print options. If PRINT=SHORT, output includes an ANOVA table. The MEDIUM length
adds least-squares means to the output. LONG adds estimates of the coefficients.
Quick Graphs. ANOVA plots the group means against the groups.
Saving files. ANOVA can save predicted values, residuals, Studentized residuals,
leverages, Cook's D, standard error of predicted values, adjusted cell means, and
estimates of the coefficients.
BY groups. ANOVA performs separate analyses for each level of any BY variables.


Bootstrapping. Bootstrapping is available in this procedure.


Case frequencies. You can use a FREQUENCY variable to duplicate cases.
Case weights. ANOVA uses a WEIGHT variable, if present, to duplicate cases.

Examples
Example 1
One-Way ANOVA
How does equipment influence typing performance? This example uses a one-way
design to compare average typing speed for three groups of typists. Fourteen beginning
typists were randomly assigned to three types of machines and given speed tests.
Following are their typing speeds in words per minute:
Electric   Plain old   Word processor
   52          52            67
   47          43            73
   51          47            70
   49          44            75
   53                        64

The data are stored in the SYSTAT data file named TYPING. The average speeds for
the typists in the three groups are 50.4, 46.5, and 69.8 words per minute, respectively.
To test the hypothesis that the three samples have the same population average speed,
the input is:
USE typing
ANOVA
CATEGORY equipmnt$
DEPEND speed
ESTIMATE

The output follows:


Dep Var: SPEED   N: 14   Multiple R: 0.95   Squared multiple R: 0.91

Analysis of Variance

Source      Sum-of-Squares   df   Mean-Square   F-ratio      P
EQUIPMNT$          1469.36    2        734.68     53.52   0.00
Error               151.00   11         13.73


For the dependent variable SPEED, SYSTAT reads 14 cases. The multiple correlation
(Multiple R) for SPEED with the two design variables for EQUIPMNT$ is 0.952. The
square of this correlation (Squared multiple R) is 0.907. The grouping structure
explains 90.7% of the variability of SPEED.
The layout of the ANOVA table is standard in elementary texts; you will find
formulas and definitions there. F-ratio is the Mean-Square for EQUIPMNT$ divided
by the Mean-Square for Error. The distribution of the F ratio is sensitive to the
assumption of equal population group variances. The p value is the probability of
exceeding the F ratio when the group means are equal. The p value printed here is
0.000, so it is less than 0.0005. If the population means are equal, it would be very
unusual to find sample means that differ as much as these; you could expect such a
large F ratio fewer than five times out of 10,000.
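
The entries of the ANOVA table can be reproduced directly from the fourteen typing speeds listed above. The following Python sketch (not SYSTAT syntax; the function name `one_way_anova` is hypothetical) computes the between-groups and error sums of squares, their degrees of freedom, and the F ratio. It does not compute the p value, which requires the F distribution.

```python
# Typing speeds in words per minute, from the table in this example.
groups = {
    "electric": [52, 47, 51, 49, 53],
    "plain old": [52, 43, 47, 44],
    "word process": [67, 73, 70, 75, 64],
}


def one_way_anova(groups):
    """Return (ss_between, df_between, ss_error, df_error, f_ratio)."""
    values = [v for g in groups.values() for v in g]
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-groups: squared distance of each group mean from the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Error: squared distance of each value from its own group mean.
    ss_error = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    df_between, df_error = k - 1, n - k
    f_ratio = (ss_between / df_between) / (ss_error / df_error)
    return ss_between, df_between, ss_error, df_error, f_ratio


ss_b, df_b, ss_e, df_e, f_ratio = one_way_anova(groups)
print(round(ss_b, 2), df_b, round(ss_e, 2), df_e, round(f_ratio, 2))
```

The rounded results match the Sum-of-Squares, df, and F-ratio columns of the output above.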
The Quick Graph illustrates this finding. Although the typists using electric and
plain old typewriters have similar average speeds (50.4 and 46.5, respectively), the
word processor group has a much higher average speed.

Pairwise Mean Comparisons


An analysis of variance indicates whether (at least) one of the groups differs from the
others. However, you cannot determine which group(s) differ based on ANOVA
results. To examine specific group differences, use post hoc tests.


In this example, we use the Bonferroni method for the typing speed data used in the
one-way ANOVA example. As an aid in interpretation, we order the equipment
categories from least to most advanced. The input is:
USE typing
ORDER equipmnt$ / SORT='plain old', 'electric', 'word process'
ANOVA
CATEGORY equipmnt$
DEPEND speed / BONF
ESTIMATE

SYSTAT assigns a number to each of the three groups and uses those numbers in the
output panels that follow:
COL/
ROW EQUIPMNT$
  1 plain old
  2 electric
  3 word process
Using least squares means.
Post Hoc test of SPEED
------------------------------------------------------------------------
Using model MSE of 13.727 with 11 df.
Matrix of pairwise mean differences:

            1        2        3
   1      0.0
   2     3.90      0.0
   3    23.30    19.40      0.0

Bonferroni Adjustment.
Matrix of pairwise comparison probabilities:

            1        2        3
   1     1.00
   2     0.43     1.00
   3     0.00     0.00     1.00

In the first column, you can read differences in average typing speed for the group
using plain old typewriters. In the second row, you see that they average 3.9 words per
minute fewer than those using electric typewriters; but in the third row, you see that
they average 23.3 words per minute fewer than the group using word processors. To see
whether these differences are significant, look at the probabilities in the corresponding
locations at the bottom of the table.
The probability associated with 3.9 is 0.43, so you are unable to detect a difference
in performance between the electric and plain old groups. The probabilities in the third
row are both 0.00, indicating that the word processor group averages significantly
more words per minute than the electric and plain old groups.


Example 2
ANOVA Assumptions and Contrasts
An important assumption in analysis of variance is that the population variances are
equal; that is, that the groups have approximately the same spread. When variances
differ markedly, a transformation may remedy the problem. For example, sometimes it
helps to take the square root of each value of the outcome variable (or log transform
each value) and use the transformed value in the analysis.
In this example, we use a subset of the cases from the SURVEY2 data file to address
the question, "For males, does average income vary by education?" We focus on those
who:
- Did not graduate from high school (HS dropout)
- Graduated from high school (HS grad)
- Attended some college (Some college)
- Graduated from college (College grad)
- Have an M.A. or Ph.D. (Degree +)

For each male subject (case) in the SURVEY2 data file, use the variables INCOME and
EDUC$. The means, standard deviations, and sample sizes for the five groups are
shown below:

        HS dropout   HS grad   Some college   College grad   Degree +
mean       $13,389   $21,231        $29,294        $30,937    $38,214
sd          10,639    13,176         16,465         16,894     18,230
n               18        39             17             16         14

Visually, as you move across the groups, you see that average income increases. But
considering the variability within each group, you might wonder if the differences are
significant. Also, there is a relationship between the means and standard deviations: as
the means increase, so do the standard deviations. They should be independent. If you
take the square root of each income value, there is less variability among the standard
deviations, and the relation between the means and standard deviations is weaker:
        HS dropout   HS grad   Some college   College grad   Degree +
mean         3.371     4.423          5.190          5.305      6.007
sd           1.465     1.310          1.583          1.725      1.516
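As a rough check on the claim that the square root weakens the link between group means and standard deviations, the following Python sketch works only from the two summary tables above. The helper names (`pearson`, `spread_ratio`) are illustrative and not part of SYSTAT; with only five groups, the correlations are descriptive, not a formal test.

```python
# Group summaries copied from the two tables above (raw and square-root scales).
raw_mean = [13389, 21231, 29294, 30937, 38214]
raw_sd = [10639, 13176, 16465, 16894, 18230]
sqrt_mean = [3.371, 4.423, 5.190, 5.305, 6.007]
sqrt_sd = [1.465, 1.310, 1.583, 1.725, 1.516]


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spread_ratio(sds):
    """Largest group standard deviation divided by the smallest."""
    return max(sds) / min(sds)


raw_ratio = spread_ratio(raw_sd)
sqrt_ratio = spread_ratio(sqrt_sd)
raw_r = pearson(raw_mean, raw_sd)
sqrt_r = pearson(sqrt_mean, sqrt_sd)

print(f"largest/smallest sd: raw {raw_ratio:.2f}, sqrt {sqrt_ratio:.2f}")
print(f"mean-sd correlation: raw {raw_r:.2f}, sqrt {sqrt_r:.2f}")
```

Both numbers drop on the square-root scale, which is the pattern the text describes.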

A bar chart for the data will show the effect of the transformation. The input is:
USE survey2
SELECT sex$ = 'Male'
LABEL educatn / 1,2='HS dropout', 3='HS grad',
                4='Some college', 5='College grad',
                6,7='Degree +'
CATEGORY educatn
BEGIN
BAR income * educatn / SERROR FILL=.5 LOC=-3IN,0IN
BAR income * educatn / SERROR FILL=.35 YPOW=.5,
LOC=3IN,0IN
END

The charts follow:


[Figure: two bar charts of mean INCOME by EDUCATN with standard error bars. Left: original scale. Right: square-root scale (YPOW=.5).]


In the chart on the left, you can see a relation between the height of the bars (means)
and the length of the error bars (standard errors). The smaller means have shorter error
bars than the larger means. After transformation, there is less difference in length
among the error bars. The transformation aids in eliminating the dependency between
the group and the standard deviation.
To test for differences among the means:
ANOVA
LET sqrt_inc = SQR(income)
DEPEND sqrt_inc
ESTIMATE

The output is:


Dep Var: SQRT_INC   N: 104   Multiple R: 0.49   Squared multiple R: 0.24

Analysis of Variance
Source     Sum-of-Squares    df   Mean-Square   F-ratio      P
EDUCATN             68.62     4         17.16      7.85   0.00
Error              216.26    99          2.18

The ANOVA table using the transformed income as the dependent variable suggests a
significant difference among the five means (p < 0.0005).
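The entries in the ANOVA table are tied together by simple arithmetic. The sketch below recomputes the mean squares, the F-ratio, and the squared multiple R from the sums of squares and degrees of freedom printed above. The function name `anova_summary` is hypothetical, and the R-squared shortcut assumes a one-way design, where the effect and error sums of squares add up to the total.

```python
def anova_summary(ss_effect, df_effect, ss_error, df_error):
    """Recompute the derived columns of a one-way ANOVA table."""
    ms_effect = ss_effect / df_effect   # mean square for the effect
    ms_error = ss_error / df_error      # mean square error (MSE)
    return {
        "ms_effect": ms_effect,
        "ms_error": ms_error,
        "f_ratio": ms_effect / ms_error,
        # In a one-way design, SS(total) = SS(effect) + SS(error).
        "r_squared": ss_effect / (ss_effect + ss_error),
    }


# Values from the EDUCATN table above: 5 groups, N = 104.
summary = anova_summary(ss_effect=68.62, df_effect=4, ss_error=216.26, df_error=99)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The results agree with the printed table up to rounding of the inputs.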

Tukey Pairwise Mean Comparisons


Which means differ? This example uses the Tukey method to identify significant
differences in pairs of means. Hopefully, you reach the same conclusions using either
the Tukey or Bonferroni methods. However, when the number of comparisons is very
large, the Tukey procedure may be more sensitive in detecting differences; when the
number of comparisons is small, Bonferroni may be more sensitive. The input is:
ANOVA
LET sqrt_inc = SQR(income)
DEPEND sqrt_inc / TUKEY
ESTIMATE

The output follows:


COL/
ROW EDUCATN
  1 HS dropout
  2 HS grad
  3 Some college
  4 College grad
  5 Degree +
Using least squares means.
Post Hoc test of SQRT_INC
-------------------------------------------------------------------------------
Using model MSE of 2.184 with 99 df.
Matrix of pairwise mean differences:
            1        2        3        4        5
  1       0.0
  2     1.052      0.0
  3     1.819    0.767      0.0
  4     1.935    0.883    0.116      0.0
  5     2.636    1.584    0.817    0.701      0.0
Tukey HSD Multiple Comparisons.
Matrix of pairwise comparison probabilities:
            1        2        3        4        5
  1     1.000
  2     0.100    1.000
  3     0.004    0.387    1.000
  4     0.002    0.268    0.999    1.000
  5     0.000    0.007    0.545    0.694    1.000

The layout of the output panels for the Tukey method is the same as that for the
Bonferroni method. Look first at the probabilities at the bottom of the table. Four of the
probabilities indicate significant differences (they are less than 0.05). In the first
column, rows 3 through 5, the average income for high school dropouts differs from those with
some college (p = 0.004), from college graduates (p = 0.002), and also from those with
advanced degrees (p < 0.0005). The fifth row shows that the difference between those
with advanced degrees and the high school graduates is significant (p = 0.007).
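Reading a lower-triangular probability matrix by eye is error-prone, so here is a small sketch that lists the pairs falling below 0.05. The dictionary simply transcribes the Tukey probabilities from the output above, keyed by (row, column); the variable names are illustrative only.

```python
labels = {1: "HS dropout", 2: "HS grad", 3: "Some college",
          4: "College grad", 5: "Degree +"}

# Lower triangle of the Tukey HSD probability matrix, keyed by (row, column).
tukey_p = {
    (2, 1): 0.100, (3, 1): 0.004, (4, 1): 0.002, (5, 1): 0.000,
    (3, 2): 0.387, (4, 2): 0.268, (5, 2): 0.007,
    (4, 3): 0.999, (5, 3): 0.545,
    (5, 4): 0.694,
}

alpha = 0.05
significant = sorted(pair for pair, p in tukey_p.items() if p < alpha)

for row, col in significant:
    print(f"{labels[col]} vs {labels[row]}: p = {tukey_p[(row, col)]:.3f}")
```

With five groups there are ten pairwise comparisons, and four of them fall below 0.05.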

Contrasts
In this example, the five groups are ordered by their level of education, so you use these
coefficients to test linear and quadratic contrasts:
Linear       -2  -1   0   1   2
Quadratic     2  -1  -2  -1   2

Then you ask, "Is there a linear increase in average income across the five ordered
levels of education? A quadratic change?" The input follows:
HYPOTHESIS
NOTE 'Test of linear contrast',
'across ordered group means'
EFFECT = educatn
CONTRAST [-2 -1 0 1 2]
TEST
HYPOTHESIS
NOTE 'Test of quadratic contrast',
'across ordered group means'
EFFECT = educatn
CONTRAST [2 -1 -2 -1 2]
TEST
SELECT

The resulting output is:


Test of linear contrast
across ordered group means
Test for effect called:      EDUCATN

A Matrix
            1        2        3        4        5
          0.0    -4.00    -3.00    -2.00    -1.00

Test of Hypothesis
Source            SS    df        MS        F       P
Hypothesis     63.54     1     63.54    29.09    0.00
Error         216.26    99      2.18
-------------------------------------------------------------------------------
Test of quadratic contrast
across ordered group means
Test for effect called:      EDUCATN

A Matrix
            1        2        3        4        5
          0.0      0.0    -3.00    -4.00    -3.00

Test of Hypothesis
Source            SS    df        MS        F       P
Hypothesis      2.20     1      2.20     1.01    0.32
Error         216.26    99      2.18

The F statistic for testing the linear contrast is 29.089 (p value < 0.0005); for testing
the quadratic contrast, it is 1.008 (p value = 0.32). Thus, you can report that there is a
highly significant linear increase in average income across the five levels of education
and that you have not found a quadratic component in this increase.
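Two properties make the linear and quadratic tests easy to interpret: each coefficient vector sums to zero, and the two vectors are orthogonal to each other. The sketch below checks both for the standard polynomial coefficients for five equally spaced levels, and recomputes each F statistic as the hypothesis sum of squares (1 df) divided by the model mean-square error. The helper names are hypothetical; the sums of squares come from the output above.

```python
# Standard orthogonal polynomial coefficients for five equally spaced levels.
linear = [-2, -1, 0, 1, 2]
quadratic = [2, -1, -2, -1, 2]


def dot(u, v):
    """Inner product of two coefficient vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrast_f(ss_hypothesis, ss_error, df_error):
    """F for a single-degree-of-freedom contrast: SS(hypothesis) / MSE."""
    return ss_hypothesis / (ss_error / df_error)


f_linear = contrast_f(63.54, 216.26, 99)
f_quadratic = contrast_f(2.20, 216.26, 99)

print("sums:", sum(linear), sum(quadratic))
print("linear . quadratic =", dot(linear, quadratic))
print(f"F linear = {f_linear:.2f}, F quadratic = {f_quadratic:.2f}")
```

Small differences from the printed F values reflect rounding in the displayed sums of squares.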

Example 3
Two-Way ANOVA
Consider the following two-way analysis of variance design from Afifi and Azen
(1972), cited in Kutner (1974), and reprinted in BMDP manuals. The dependent
variable, SYSINCR, is the change in systolic blood pressure after administering one of
four different drugs to patients with one of three different diseases. Patients were
assigned randomly to one of the possible drugs. The data are stored in the SYSTAT file
AFIFI.
To obtain a least-squares two-way analysis of variance:
USE afifi
ANOVA
CATEGORY drug disease
DEPEND sysincr
SAVE myresids / RESID DATA
ESTIMATE

Because this is a factorial design, ANOVA automatically generates an interaction term
(DRUG * DISEASE). The output follows:

Dep Var: SYSINCR   N: 58   Multiple R: 0.675   Squared multiple R: 0.456

Analysis of Variance
Source          Sum-of-Squares    df   Mean-Square   F-ratio       P
DRUG                  2997.472     3       999.157     9.046   0.000
DISEASE                415.873     2       207.937     1.883   0.164
DRUG*DISEASE           707.266     6       117.878     1.067   0.396
Error                 5080.817    46       110.453

[Figure: Least Squares Means plots of SYSINCR (vertical axis) against DRUG$ (horizontal axis).]

In two-way ANOVA, begin by examining the interaction. If the interaction is significant,
you must condition your conclusions about a given factor's effects on the level of the
other factor. The DRUG * DISEASE interaction is not significant (p = 0.396), so shift
your focus to the main effects.
The DRUG effect is significant (p < 0.0005), but the DISEASE effect is not (p = 0.164).
Thus, at least one of the drugs differs from the others with respect to blood pressure
change, but blood pressure change does not vary significantly across diseases.
For each factor, SYSTAT produces a plot of the average value of the dependent
variable for each level of the factor. For the DRUG plot, drugs 1 and 2 yield similar
average blood pressure changes. However, the average blood pressure changes for
drugs 3 and 4 are much lower. ANOVA tests whether the differences illustrated
in this plot are significant.
For the DISEASE plot, we see a gradual decrease in blood pressure change across
the three diseases. However, this effect is not significant; there is not enough variation
among these means to overcome the variation due to individual differences.
In addition to the plot for each factor, SYSTAT also produces plots of the average
blood pressure change at each level of DRUG for each level of DISEASE. Use these plots
to illustrate interaction effects. Although the interaction effect is not significant in this
example, we can still examine these plots.
In general, we see a decline in blood pressure change across drugs. (Keep in mind
that the drugs are only artificially ordered. We could reorder the drugs, and although
the ANOVA results wouldn't change, the plots would differ.) The similarity of the
plots illustrates the nonsignificant interaction.
A close correspondence exists between the factor plots and the interaction plots. The
means plotted in the factor plot for DISEASE correspond to the weighted average of
the four points in each of the interaction plots. Similarly, each mean plotted in the
DRUG factor plot corresponds to the weighted average of the three corresponding
points across interaction plots. Consequently, the significant DRUG effect can be seen
in the differing means in each interaction plot. Can you see the nonsignificant
DISEASE effect in the interaction plots?

Least-Squares ANOVA
If you have an orthogonal design (equal number of cases in every cell), you will find
that the ANOVA table is the same one you get with any standard program. SYSTAT
can handle non-orthogonal designs, however (as in the present example). To
understand the sources for sums of squares, you must know something about least-squares ANOVA.
As with one-way ANOVA, your specifying factor levels causes SYSTAT to create
dummy variables out of the classifying input variable. SYSTAT creates one fewer
dummy variables than categories specified.
Coding of the dummy variables is the classic analysis of variance parameterization,
in which the sum of effects estimated for a classifying variable is 0 (Scheffé, 1959). In
our example, DRUG has four categories; therefore, SYSTAT creates three dummy
variables with the following scores for subjects at each level:
 1   0   0   for DRUG = 1 subjects
 0   1   0   for DRUG = 2 subjects
 0   0   1   for DRUG = 3 subjects
-1  -1  -1   for DRUG = 4 subjects

Because DISEASE has three categories, SYSTAT creates two dummy variables to be
coded as follows:
 1   0   for DISEASE = 1 subjects
 0   1   for DISEASE = 2 subjects
-1  -1   for DISEASE = 3 subjects

Now, because there are no continuous predictors in the model (unlike the analysis of
covariance), you have a complete design matrix of dummy variables as follows (DRUG
is labeled with an a, DISEASE with a b, and the grand mean with an m):
Treatment   Mean       DRUG        DISEASE              Interaction
  A   B       m     a1  a2  a3     b1  b2     a1b1 a1b2 a2b1 a2b2 a3b1 a3b2
  1   1       1      1   0   0      1   0        1    0    0    0    0    0
  1   2       1      1   0   0      0   1        0    1    0    0    0    0
  1   3       1      1   0   0     -1  -1       -1   -1    0    0    0    0
  2   1       1      0   1   0      1   0        0    0    1    0    0    0
  2   2       1      0   1   0      0   1        0    0    0    1    0    0
  2   3       1      0   1   0     -1  -1        0    0   -1   -1    0    0
  3   1       1      0   0   1      1   0        0    0    0    0    1    0
  3   2       1      0   0   1      0   1        0    0    0    0    0    1
  3   3       1      0   0   1     -1  -1        0    0    0    0   -1   -1
  4   1       1     -1  -1  -1      1   0       -1    0   -1    0   -1    0
  4   2       1     -1  -1  -1      0   1        0   -1    0   -1    0   -1
  4   3       1     -1  -1  -1     -1  -1        1    1    1    1    1    1
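The design matrix follows mechanically from the sum-to-zero coding described above: each main-effect block uses 1, 0, and -1 codes, and each interaction column is the product of one DRUG column and one DISEASE column. The sketch below builds the twelve rows for the 4 x 3 layout. The function names are hypothetical; this illustrates the coding and is not SYSTAT's internal routine.

```python
def effect_codes(level, n_levels):
    """Sum-to-zero codes for a 1-based level: level k gets a 1 in column k,
    and the last level gets -1 in every column."""
    if level == n_levels:
        return [-1] * (n_levels - 1)
    return [1 if col == level else 0 for col in range(1, n_levels)]


def design_row(a_level, b_level, n_a=4, n_b=3):
    """One design-matrix row: m, a1..a3, b1..b2, then a1b1, a1b2, ..., a3b2."""
    a = effect_codes(a_level, n_a)
    b = effect_codes(b_level, n_b)
    interaction = [ai * bj for ai in a for bj in b]
    return [1] + a + b + interaction


# All twelve DRUG x DISEASE cells, in the same order as the table above.
design = [design_row(a, b) for a in range(1, 5) for b in range(1, 4)]

for row in design:
    print(" ".join(f"{value:2d}" for value in row))
```

Because the cells are listed once each, every column except the constant sums to zero, which is the restriction the parameterization imposes.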

This example is used to explain how SYSTAT gets an error term for the ANOVA table.
Because it is a least-squares program, the error term is taken from the residual sum of
squares in the regression onto the above dummy variables. For non-orthogonal designs,
this choice is identical to that produced by BMDP2V and SPSS GLM with Type III sums
of squares. These, in general, will be the hypotheses you want to test on unbalanced
experimental data. You can construct other types of sums of squares by using an A matrix
or by running your ANOVA model using the Stepwise options in GLM. Consult the
references if you do not already know what these sums of squares mean.

Post Hoc Tests


It is evident that only the main effect for DRUG is significant; therefore, you might
want to test some contrasts on the DRUG effects. A simple way would be to use the
Bonferroni method to test all pairwise comparisons of marginal drug means. However,
to compare three or more means, you must specify the particular contrast of interest.
Here, we compare the first and third drugs, the first and fourth drugs, and the first two
drugs with the last two drugs. The input is:
HYPOTHESIS
EFFECT = drug
CONTRAST [1 0 -1 0]
TEST
HYPOTHESIS
EFFECT = drug
CONTRAST [1 0 0 -1]
TEST
HYPOTHESIS
EFFECT = drug
CONTRAST [1 1 -1 -1]
TEST

You need four numbers in each contrast because DRUG has four levels. You cannot use
CONTRAST to specify coefficients for interaction terms. It creates an A matrix only for
main effects. Following are the results of the above hypothesis tests:
Test for effect called:      DRUG

A Matrix
            1        2        3        4        5
          0.0    1.000      0.0   -1.000      0.0
            6        7        8        9       10
          0.0      0.0      0.0      0.0      0.0
           11       12
          0.0      0.0

Test of Hypothesis
Source              SS    df         MS        F       P
Hypothesis    1697.545     1   1697.545   15.369   0.000
Error         5080.817    46    110.453
-------------------------------------------------------------------------------
Test for effect called:      DRUG

A Matrix
            1        2        3        4        5
          0.0    2.000    1.000    1.000      0.0
            6        7        8        9       10
          0.0      0.0      0.0      0.0      0.0
           11       12
          0.0      0.0

Test of Hypothesis
Source              SS    df         MS        F       P
Hypothesis    1178.892     1   1178.892   10.673   0.002
Error         5080.817    46    110.453
-------------------------------------------------------------------------------
Test for effect called:      DRUG

A Matrix
            1        2        3        4        5
          0.0    2.000    2.000      0.0      0.0
            6        7        8        9       10
          0.0      0.0      0.0      0.0      0.0
           11       12
          0.0      0.0

Test of Hypothesis
Source              SS    df         MS        F       P
Hypothesis    2982.934     1   2982.934   27.006   0.000
Error         5080.817    46    110.453

Notice the A matrices in the output. SYSTAT automatically takes into account the
degree of freedom lost in the design coding. Also, notice that you do not need to
normalize contrasts or rows of the A matrix to unit vector length, as in some ANOVA
programs. If you use (2 0 -2 0) or (0.707 0 -0.707 0) instead of (1 0 -1 0), you get the
same sum of squares.
For the comparison of the first and third drugs, the F statistic is 15.369 (p value
< 0.0005), indicating that these two drugs differ. Looking at the Quick Graphs
produced earlier, we see that the change in blood pressure was much smaller for the
third drug.
Notice that in the A matrix created by the contrast of the first and fourth drugs, you
get (2 1 1) in place of the three design variables corresponding to the appropriate
columns of the A matrix. Because you selected the reduced form for coding of design
variables in which sums of effects are 0, you have the following restriction for the
DRUG effects:

$$\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 0$$

where each value is the effect for that level of DRUG. This means that

$$\alpha_4 = -(\alpha_1 + \alpha_2 + \alpha_3)$$

and the contrast DRUG(1) - DRUG(4) is equivalent to

$$\alpha_1 - [-(\alpha_1 + \alpha_2 + \alpha_3)] = 0$$

which is

$$2\alpha_1 + \alpha_2 + \alpha_3$$

For the final contrast, SYSTAT transforms the (1 1 -1 -1) specification into contrast
coefficients of (2 2 0) for the dummy coded variables. The p value (< 0.0005) indicates
that the first two drugs differ from the last two drugs.
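The substitution above generalizes: under sum-to-zero coding, a contrast on k level effects becomes a contrast on the first k - 1 effects by subtracting the last coefficient from each of the others. The sketch below applies that rule to the three DRUG contrasts and reproduces the coefficients in columns 2 through 4 of each A matrix. The function name `reduce_contrast` is hypothetical.

```python
def reduce_contrast(coefficients):
    """Rewrite a contrast on k level effects in terms of the first k - 1 effects,
    using effect_k = -(effect_1 + ... + effect_{k-1})."""
    last = coefficients[-1]
    return [c - last for c in coefficients[:-1]]


drug_contrasts = {
    "drug 1 vs drug 3": [1, 0, -1, 0],
    "drug 1 vs drug 4": [1, 0, 0, -1],
    "drugs 1,2 vs drugs 3,4": [1, 1, -1, -1],
}

for name, coefficients in drug_contrasts.items():
    print(f"{name}: {coefficients} -> {reduce_contrast(coefficients)}")
```

The same rule accounts for the A matrices of the linear and quadratic EDUCATN contrasts in Example 2.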

Simple Effects
You can do simple contrasts between drugs within levels of disease (although the lack
of a significant DRUG * DISEASE interaction does not justify it). To show how it is
done, consider a contrast between the first and third levels of DRUG for the first
DISEASE only. You must specify the contrast in terms of the cell means. Use the
terminology:
MEAN (DRUG index, DISEASE index) = M{i,j}

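You want to contrast cell means M{1,1} and M{3,1}. These are composed of:

$$M\{1,1\} = \mu + \alpha_1 + \beta_1 + \alpha\beta_{11}$$
$$M\{3,1\} = \mu + \alpha_3 + \beta_1 + \alpha\beta_{31}$$

Therefore the difference between the two means is:

$$M\{1,1\} - M\{3,1\} = \alpha_1 - \alpha_3 + \alpha\beta_{11} - \alpha\beta_{31}$$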

Now, if you consider the coding of the variables, you can construct an A matrix that
picks up the appropriate columns of the design matrix. Here are the column labels of
the design matrix (a means DRUG and b means DISEASE) to serve as a column ruler
over the A matrix specified in the hypothesis.

m  a1  a2  a3  b1  b2  a1b1  a1b2  a2b1  a2b2  a3b1  a3b2

The corresponding input is:
HYPOTHESIS
AMATRIX [0 1 0 -1 0 0 1 0 0 0 -1 0]
TEST

The output follows:


Hypothesis.
A Matrix
            1        2        3        4        5
          0.0    1.000      0.0   -1.000      0.0
            6        7        8        9       10
          0.0    1.000      0.0      0.0      0.0
           11       12
       -1.000      0.0

Test of Hypothesis
Source              SS    df         MS        F       P
Hypothesis     338.000     1    338.000    3.060   0.087
Error         5080.817    46    110.453

After you understand how SYSTAT codes design variables and how the model
sentence orders them, you can take any standard ANOVA text like Winer (1971) or
Scheffé (1959) and construct an A matrix for any linear contrast.

Contrasting Marginal and Cell Means


Now look at how to contrast cell means directly without being concerned about how
they are coded. Test the first level of DRUG against the third (contrasting the marginal
means) with the following input:
HYPOTHESIS
SPECIFY drug[1] = drug[3]
TEST

To contrast the first against the fourth:


HYPOTHESIS
SPECIFY drug[1] = drug[4]
TEST

Finally, here is the simple contrast of the first and third levels of DRUG for the first
DISEASE only:
HYPOTHESIS
SPECIFY drug[1] disease[1] = drug[3] disease[1]
TEST

Screening Results
Let's examine the AFIFI data in more detail. To analyze the residuals to examine the
ANOVA assumptions, first plot the residuals against estimated values (cell means) to
check for homogeneity of variance. Use the Studentized residuals to reference them
against a t distribution. In addition, stem-and-leaf plots of the residuals and boxplots of
the dependent variable aid in identifying outliers. The input is:
USE afifi
ANOVA
CATEGORY drug disease
DEPEND sysincr
SAVE myresids / RESID DATA
ESTIMATE
DENSITY sysincr * drug / BOX
USE myresids
PLOT student*estimate / SYM=1 FILL=1
STATISTICS
STEM student

Stem and Leaf Plot of variable: STUDENT, N = 58
Minimum:        -2.647
Lower hinge:    -0.761
Median:          0.101
Upper hinge:     0.698
Maximum:         1.552

 -2   6
 -2
 -1   987666
 -1   410
 -0 H 9877765
 -0   4322220000
  0 M 001222333444
  0 H 55666888
  1   011133444
  1   55

The plots suggest the presence of an outlier. The smallest value in the stem-and-leaf
plot seems to be out of line. A t statistic value of -2.647 corresponds to p < 0.01, and
you would not expect a value this small to show up in a sample of only 58 independent
values. In the scatterplot, the point corresponding to this value appears at the bottom
and badly skews the data in its cell (which happens to be DRUG1, DISEASE3). The
outlier in the first group clearly stands out in the boxplot, too.
To see the effect of this outlier, delete the observation with the outlying Studentized
residual. Then, run the analysis again. Following is the ANOVA output for the revised
data:
Dep Var: SYSINCR   N: 57   Multiple R: .710   Squared Multiple R: .503

Analysis of Variance
Source          Sum-of-Squares    DF   Mean-Square   F-Ratio       P
DRUG                  3344.064     3      1114.688    11.410   0.000
DISEASE                232.826     2       116.413     1.192   0.313
DRUG*DISEASE           676.865     6       112.811     1.155   0.347
Error                 4396.367    45        97.697

The differences are not substantial. Nevertheless, notice that the DISEASE effect is
substantially attenuated when only one case out of 58 is deleted. Daniel (1960) gives
an example in which one outlying case alters the fundamental conclusions of a
designed experiment. The F test is robust to certain violations of assumptions, but
factorial ANOVA is not robust against outliers. You should routinely do these plots for
ANOVA.

Example 4
Single-Degree-of-Freedom Designs
The data in the REACT file involve yields of a chemical reaction under various
combinations of four binary factors (A, B, C, and D). Two reactions were observed
under each combination of experimental factors, so the number of cases per cell is two.
To analyze these data in a four-way ANOVA:
USE react
ANOVA
CATEGORY a, b, c, d
DEPEND yield
ESTIMATE

You can see the advantage of ANOVA over GLM when you have several factors; you
have to select only the main effects. With GLM, you have to specify the interactions
and identify which variables are categorical (that is, A, B, C, and D). The following
example is the full model using GLM:
MODEL yield = CONSTANT + a + b + c + d +,
a*b + a*c + a*d + b*c + b*d + c*d +,
a*b*c + a*b*d + a*c*d + b*c*d +,
a*b*c*d

The ANOVA output follows:


Dep Var: YIELD   N: 32   Multiple R: 0.755   Squared multiple R: 0.570

Analysis of Variance
Source     Sum-of-Squares    df   Mean-Square   F-ratio       P
A              369800.000     1    369800.000     4.651   0.047
B                1458.000     1      1458.000     0.018   0.894
C                5565.125     1      5565.125     0.070   0.795
D              172578.125     1    172578.125     2.170   0.160
A*B             87153.125     1     87153.125     1.096   0.311
A*C            137288.000     1    137288.000     1.727   0.207
A*D            328860.500     1    328860.500     4.136   0.059
B*C             61952.000     1     61952.000     0.779   0.390
B*D              3200.000     1      3200.000     0.040   0.844
C*D              3160.125     1      3160.125     0.040   0.844
A*B*C           81810.125     1     81810.125     1.029   0.326
A*B*D            4753.125     1      4753.125     0.060   0.810
A*C*D          415872.000     1    415872.000     5.230   0.036
B*C*D               4.500     1         4.500     0.000   0.994
A*B*C*D         15051.125     1     15051.125     0.189   0.669
Error         1272247.000    16     79515.437

The output shows a significant main effect for the first factor (A) plus one significant
interaction (A*C*D).

Assessing Normality
Let's look at the study more closely. Because this is a single-degree-of-freedom study
(a 2^n factorial), each effect estimate is normally distributed if the usual assumptions for
the experiment are valid. All of the effects estimates, except the constant, have zero
mean and common variance (because dummy variables were used in their
computation). Thus, you can compare them to a normal distribution. SYSTAT
remembers your last selections, so the input is:
SAVE effects / COEF
ESTIMATE

This reestimates the model and saves the regression coefficients (effects). The file has
one case with 16 variables (CONSTANT plus 15 effects). The effects are labeled X(1),
X(2), and so on because they are related to the dummy variables, not the original
variables A, B, C, and D. Let's transpose this file into a new file containing only the 15
effects and create a probability plot of the effects. The input is:
USE effects
DROP constant
TRANSPOSE
SELECT case > 1
PPLOT col(1) / FILL=1 SYMBOL=1,
XLABEL='Estimates of Effects'

The resulting plot is:

These effects are indistinguishable from a random normal variable. They plot almost
on a straight line. What does it mean for the study and for the significant F tests?

It's time to reveal that the data were produced by a random number generator.
• If you are doing a factorial analysis of variance, the p values you see on the output
  are not adjusted for the number of factors. If you do a three-way design, look at
  seven tests (excluding the constant). For a four-way design, examine 15 tests. Out
  of 15 F tests on random data, expect to find at least one test approaching
  significance. You have two significant and one almost significant, which is not far
  out of line. The probabilities for each separate F test need to be corrected for the
  experimentwise error rate. Some authors devote entire chapters to fine distinctions
  between multiple comparison procedures and then illustrate them within a
  multifactorial design not corrected for the experimentwise error rate just
  demonstrated. Remember that a factorial design is a multiple comparison. If you
  have a single-degree-of-freedom study, use the procedure you used to draw a
  probability plot of the effects. Any effect that is really significant will become
  obvious.
• If you have a factorial study with more degrees of freedom on some factors, use the
  Bonferroni critical value for deciding which effects are significant. It guarantees
  that the Type I error rate for the study will be no greater than the level you choose.
  In the above example, this value is 0.05 / 15 (that is, 0.003).
• Multiple F tests based on a common denominator (mean-square error in this
  example) are correlated. This complicates the problem further. In general, the
  greater the discrepancy between numerator and denominator degrees of freedom
  and the smaller the denominator degrees of freedom, the greater the dependence of
  the tests. The Bonferroni tests are best in this situation, although Feingold and
  Korsog (1986) offer some useful alternatives.
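The Bonferroni arithmetic in the points above can be sketched in a few lines of Python, using the 15 p values from the four-way ANOVA table. The familywise figure assumes the 15 tests are independent, which the last point notes is not strictly true here, so treat it as a rough guide only. Variable names are illustrative.

```python
# P values from the four-way ANOVA table for the REACT data.
p_values = {
    "A": 0.047, "B": 0.894, "C": 0.795, "D": 0.160,
    "A*B": 0.311, "A*C": 0.207, "A*D": 0.059,
    "B*C": 0.390, "B*D": 0.844, "C*D": 0.844,
    "A*B*C": 0.326, "A*B*D": 0.810, "A*C*D": 0.036,
    "B*C*D": 0.994, "A*B*C*D": 0.669,
}

alpha = 0.05
bonferroni_alpha = alpha / len(p_values)   # per-test critical value

nominal = [name for name, p in p_values.items() if p < alpha]
adjusted = [name for name, p in p_values.items() if p < bonferroni_alpha]

# Chance of at least one false positive in 15 tests at 0.05,
# ASSUMING independent tests (an idealization for this design).
familywise = 1 - (1 - alpha) ** len(p_values)

print(f"Bonferroni critical value: {bonferroni_alpha:.4f}")
print("significant at 0.05:", nominal)
print("significant after Bonferroni:", adjusted)
print(f"familywise error rate if independent: {familywise:.3f}")
```

No effect survives the Bonferroni criterion, consistent with data that came from a random number generator.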

Example 5
Mixed Models
Mixed models involve combinations of fixed and random factors in an ANOVA. Fixed
factors are assumed to be composed of an exhaustive set of categories (for example,
males and females), while random factors have category levels that are assumed to
have been randomly sampled from a larger population of categories (for example,
classrooms or word stems). Because of the mixing of fixed and random components,
expected mean squares for certain effects are different from those for fully fixed or
fully random designs. SYSTAT can handle mixed models because you can specify
error terms for specific hypotheses.

For example, let's analyze the AFIFI data with a mixed model instead of a fully
fixed factorial. Here, you are interested in the four drugs as wide-spectrum disease
killers. Because each drug is now thought to be effective against diseases in general,
you have sampled three random diseases to assess the drugs. This implies that
DISEASE is a random factor and DRUG remains a fixed factor. In this case, the error
term for DRUG is the DRUG * DISEASE interaction. To begin, run the same analysis
we performed in the two-way example to get the ANOVA table. To test for the DRUG
effect, specify drug * disease as the error term in a hypothesis test. The input is:
USE afifi
ANOVA
CATEGORY drug, disease
DEPEND sysincr
ESTIMATE
HYPOTHESIS
EFFECT = drug
ERROR = drug*disease
TEST

The output is:


Dep Var: SYSINCR   N: 58   Multiple R: 0.675   Squared multiple R: 0.456

Analysis of Variance
Source          Sum-of-Squares    df   Mean-Square   F-ratio       P
DRUG                  2997.472     3       999.157     9.046   0.000
DISEASE                415.873     2       207.937     1.883   0.164
DRUG*DISEASE           707.266     6       117.878     1.067   0.396
Error                 5080.817    46       110.453

Test for effect called:      DRUG

Test of Hypothesis
Source              SS    df         MS        F       P
Hypothesis    2997.472     3    999.157    8.476   0.014
Error          707.266     6    117.878

Notice that the SS, df, and MS for the error term in the hypothesis test correspond to the
values for the interaction in the ANOVA table.
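The only thing that changes between the fixed-effects test and the mixed-model test of DRUG is the denominator of the F ratio. The sketch below recomputes both from the mean squares in the output above; `f_ratio` is a hypothetical helper, not a SYSTAT function.

```python
def f_ratio(ms_effect, ms_error):
    """F statistic as the ratio of an effect mean square to an error mean square."""
    return ms_effect / ms_error


ms_drug = 999.157
ms_residual = 110.453      # within-cell error, 46 df
ms_interaction = 117.878   # DRUG*DISEASE, 6 df

f_fixed = f_ratio(ms_drug, ms_residual)      # DISEASE treated as fixed
f_mixed = f_ratio(ms_drug, ms_interaction)   # DISEASE treated as random

print(f"fixed-effects F(3, 46) = {f_fixed:.3f}")
print(f"mixed-model   F(3, 6)  = {f_mixed:.3f}")
```

The F values are similar, but the mixed-model test has only 6 error degrees of freedom, which is why its p value (0.014) is so much larger.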

Example 6
Separate Variance Hypothesis Tests
The data in the MJ20 data file are from Milliken and Johnson (1984). They are the
results of a paired-associate learning task. GROUP describes the type of drug
administered; LEARNING is the amount of material learned during testing. First we
perform Levene's test (Levene, 1960) to determine if the variances are equal across
cells. The input is:
USE mj20
ANOVA
SAVE mjresids / RESID DATA
DEPEND learning
CATEGORY group
ESTIMATE
USE mjresids
LET residual = ABS(residual)
CATEGORY group
DEPEND residual
ESTIMATE
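The two-step procedure above (save residuals, then run a one-way ANOVA on their absolute values) can be sketched in plain Python. The function `levene_f` is a hypothetical helper that centers each group at its mean, as the saved residuals do; the small data sets are made up for illustration and are not the MJ20 values.

```python
def levene_f(groups):
    """F ratio from a one-way ANOVA on absolute deviations from each group's mean."""
    # Step 1: absolute residuals around each group mean.
    abs_dev = [[abs(x - sum(g) / len(g)) for x in g] for g in groups]

    # Step 2: ordinary one-way ANOVA on those absolute residuals.
    n_total = sum(len(g) for g in abs_dev)
    k = len(abs_dev)
    grand = sum(sum(g) for g in abs_dev) / n_total
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in abs_dev)
    ss_within = sum((d - sum(g) / len(g)) ** 2 for g in abs_dev for d in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical data: equal spreads, then clearly unequal spreads.
same_spread = [[1, 2, 3], [4, 5, 6]]
different_spread = [[9, 10, 11], [5, 10, 15]]

print(f"equal spreads:   F = {levene_f(same_spread):.3f}")
print(f"unequal spreads: F = {levene_f(different_spread):.3f}")
```

A large F indicates that the typical size of the residuals differs across groups, which is the signal that the equal-variance assumption is in doubt.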

Following is the ANOVA table of the absolute residuals:


Dep Var: RESIDUAL   N: 29   Multiple R: 0.675   Squared multiple R: 0.455

Analysis of Variance
Source     Sum-of-Squares    df   Mean-Square   F-ratio       P
GROUP              30.603     3        10.201     6.966   0.001
Error              36.608    25         1.464

Notice that the F is significant, indicating that the separate variances test is advisable.
Let's do several single-degree-of-freedom tests, following Milliken and Johnson. The
first is for comparing all drugs against the control; the second tests the hypothesis that
groups 2 and 3 together are not significantly different from group 4. The input is:
USE mj20
ANOVA
CATEGORY group
DEPEND learning
ESTIMATE
HYPOTHESIS
SPECIFY 3*group[1] = group[2] +group[3] + group[4] / SEPARATE
TEST
HYPOTHESIS
SPECIFY 2*group[4] = group[2] +group[3] / SEPARATE
TEST

Following is the output. The ANOVA table has been omitted because it is not valid
when variances are unequal.
Using separate variances estimate for error term.
Hypothesis.
A Matrix
            1        2        3        4
          0.0   -4.000      0.0      0.0
Null hypothesis value for D
          0.0
Test of Hypothesis
Source         SS        df        MS        F       P
Hypoth    242.720         1   242.720   18.115   0.004
Error      95.085     7.096    13.399
-------------------------------------------------------------------------------
Using separate variances estimate for error term.
Hypothesis.
A Matrix
            1        2        3        4
          0.0    2.000    3.000    3.000
Null hypothesis value for D
          0.0
Test of Hypothesis
Source         SS        df        MS        F       P
Hypoth     65.634         1    65.634   17.819   0.001
Error      61.852    16.792     3.683

Example 7
Analysis of Covariance
Winer (1971) uses the COVAR data file for an analysis of covariance in which X is the
covariate and TREAT is the treatment. Cases do not need to be ordered by the grouping
variable TREAT.
Before analyzing the data with an analysis of covariance model, be sure there is no
significant interaction between the covariate and the treatment. The assumption of no
interaction is often called the homogeneity of slopes assumption because it is
tantamount to saying that the slope of the regression line of the dependent variable onto
the covariate should be the same in all cells of the design.

Parallelism is easy to test with a preliminary model. Use GLM to estimate this
model with the interaction between treatment (TREAT) and covariate (X) in the model.
The input is:
USE covar
GLM
CATEGORY treat
MODEL y = CONSTANT + treat + x + treat*x
ESTIMATE

The output follows:


Dep Var: Y   N: 21   Multiple R: 0.921   Squared multiple R: 0.849

Analysis of Variance
Source     Sum-of-Squares    df   Mean-Square   F-ratio       P
TREAT               6.693     2         3.346     5.210   0.019
X                  15.672     1        15.672    24.399   0.000
TREAT*X             0.667     2         0.334     0.519   0.605
Error               9.635    15         0.642

The probability value for the treatment by covariate interaction is 0.605, so the
assumption of homogeneity of slopes is plausible.
Now, fit the usual analysis of covariance model by specifying:
USE covar
ANOVA
PRINT=MEDIUM
CATEGORY treat
DEPEND y
COVARIATE x
ESTIMATE

For incomplete factorials and similar designs, you still must specify a model (using
GLM) to do analysis of covariance.
The output follows:
Dep Var: Y   N: 21   Multiple R: 0.916   Squared multiple R: 0.839

Analysis of Variance
Source     Sum-of-Squares    df   Mean-Square   F-ratio       P
TREAT              16.932     2         8.466    13.970   0.000
X                  16.555     1        16.555    27.319   0.000
Error              10.302    17         0.606

-------------------------------------------------------------------------------

Adjusted least squares means.
              Adj. LS Mean       SE     N
TREAT =1             4.888    0.307     7
TREAT =2             7.076    0.309     7
TREAT =3             6.750    0.294     7

The treatment adjusted for the covariate is significant. There is a significant difference
among the three treatment groups. Also, notice that the coefficient for the covariate is
significant (F = 27.319, p < 0.0005). If it were not, the analysis of covariance could be
taking away a degree of freedom without reducing mean-square error enough to help you.
SYSTAT computes the adjusted cell means the same way it computes estimates
when saving residuals. Model terms (main effects and interactions) that do not contain
categorical variables (covariates) are incorporated into the equation by adding the
product of the coefficient and the mean of the term for computing estimates. The grand
mean (CONSTANT) is included in computing the estimates.

Example 8
One-Way Repeated Measures
In this example, six rats were weighed at the end of each of five weeks. A plot of each
rat's weight over the duration of the experiment follows:

[Figure: line plot of Measure (vertical axis, 0 to 12) against Trial (horizontal axis, WEIGHT(1) through WEIGHT(5)), one line per rat.]

ANOVA is the simplest way to analyze this one-way model. Because we have no
categorical variable(s), SYSTAT generates only the constant (grand mean) in the
model. To obtain individual single-degree-of-freedom orthogonal polynomials, the
input is:
USE rats
ANOVA
DEPEND weight(1 .. 5) / REPEAT NAME=Time
PRINT MEDIUM
ESTIMATE

The output follows:


Number of cases processed: 6
Dependent variable means
WEIGHT(1)
2.500

WEIGHT(2)
5.833

WEIGHT(3)
7.167

WEIGHT(4)
8.000

WEIGHT(5)
8.333

------------------------------------------------------------------------------Univariate and Multivariate Repeated Measures Analysis


Within Subjects
--------------Source
Time
Error

SS
134.467
41.933

df

MS

4
20

33.617
2.097

F
16.033

G-G

H-F

0.000

0.004

0.002

Greenhouse-Geisser Epsilon:
0.3420
Huynh-Feldt Epsilon
:
0.4273
-------------------------------------------------------------------------------
Single Degree of Freedom Polynomial Contrasts
---------------------------------------------
Polynomial Test of Order 1 (Linear)
Source            SS     df         MS          F          P
Time         114.817      1    114.817     38.572      0.002
Error         14.883      5      2.977

Polynomial Test of Order 2 (Quadratic)
Source            SS     df         MS          F          P
Time          18.107      1     18.107      7.061      0.045
Error         12.821      5      2.564

Polynomial Test of Order 3 (Cubic)
Source            SS     df         MS          F          P
Time           1.350      1      1.350      0.678      0.448
Error          9.950      5      1.990

Polynomial Test of Order 4
Source            SS     df         MS          F          P
Time           0.193      1      0.193      0.225      0.655
Error          4.279      5      0.856

-------------------------------------------------------------------------------
Multivariate Repeated Measures Analysis

Test of: Time               Hypoth. df   Error df          F          P
Wilks Lambda=      0.011             4          2     43.007      0.023
Pillai Trace =     0.989             4          2     43.007      0.023
H-L Trace    =    86.014             4          2     43.007      0.023

The Huynh-Feldt p value (0.002) does not differ from the p value for the F statistic to
any significant degree. Compound symmetry appears to be satisfied and weight
changes significantly over the five trials.
The polynomial tests indicate that most of the trials effect can be accounted for by
a linear trend across time. In fact, the sum of squares for TIME is 134.467, and the sum
of squares for the linear trend is almost as large (114.817). Thus, the linear polynomial
accounts for roughly 85% of the change across the repeated measures.
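
The partition described above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python (used here only for illustration; it is not part of SYSTAT) that takes the sums of squares exactly as printed in the output and computes the share of the Time effect carried by each polynomial order. The variable names are hypothetical.

```python
# Sums of squares copied from the printed polynomial contrast tables.
poly_ss = {"linear": 114.817, "quadratic": 18.107, "cubic": 1.350, "order 4": 0.193}
time_ss = 134.467  # within-subjects sum of squares for Time

# The single-degree-of-freedom contrasts partition the Time sum of squares.
total = sum(poly_ss.values())

# Share of the Time effect carried by each polynomial order.
shares = {name: ss / time_ss for name, ss in poly_ss.items()}

for name, share in shares.items():
    print(f"{name:10s} {share:6.1%}")
```

The four contrast sums of squares add up to the Time sum of squares, which is why the linear share can be read as a proportion of the total change across trials.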

Unevenly Spaced Polynomials


Sometimes the underlying metric of the profiles is not evenly spaced. Let's assume that
the fifth weight was measured after the tenth week instead of the fifth. In that case, the
default polynomials have to be adjusted for the uneven spacing. These adjustments do
not affect the overall repeated measures tests of each effect (univariate or multivariate),
but they partition the sums of squares differently for the single-degree-of-freedom
tests. The input is:
USE rats
ANOVA
DEPEND weight(1 .. 5) / REPEAT=5(1 2 3 4 10) NAME=Time
PRINT MEDIUM
ESTIMATE

Alternatively, you could request a hypothesis test, specifying the metric for the
polynomials:
HYPOTHESIS
WITHIN='Time'
CONTRAST / POLYNOMIAL METRIC=1,2,3,4,10
TEST


The last point has been spread out further to the right. The output follows:
Univariate and Multivariate Repeated Measures Analysis

Within Subjects
---------------
Source            SS     df         MS          F          P        G-G        H-F
Time         134.467      4     33.617     16.033      0.000      0.004      0.002
Error         41.933     20      2.097

Greenhouse-Geisser Epsilon:    0.3420
Huynh-Feldt Epsilon       :    0.4273
-------------------------------------------------------------------------------
Single Degree of Freedom Polynomial Contrasts
---------------------------------------------
Polynomial Test of Order 1 (Linear)
Source            SS     df         MS          F          P
Time          67.213      1     67.213     23.959      0.004
Error         14.027      5      2.805

Polynomial Test of Order 2 (Quadratic)
Source            SS     df         MS          F          P
Time          62.283      1     62.283    107.867      0.000
Error          2.887      5      0.577

(We omit the cubic and quartic polynomial output.)


-------------------------------------------------------------------------------
Multivariate Repeated Measures Analysis

Test of: Time               Hypoth. df   Error df          F          P
Wilks Lambda=      0.011             4          2     43.007      0.023
Pillai Trace =     0.989             4          2     43.007      0.023
H-L Trace    =    86.014             4          2     43.007      0.023

The significance tests for the linear and quadratic trends differ from those for the
evenly spaced polynomials. Before, the linear trend was strongest; now, the quadratic
polynomial has the most significant results (F = 107.9, p < 0.0005).
You may have noticed that although the univariate F tests for the polynomials are
different, the multivariate test is unchanged. The latter measures variation across all
components. The ANOVA table for the combined components is not affected by the
metric of the polynomials.
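
To see how the metric changes the partition, the sketch below builds orthogonal polynomial contrasts for an arbitrary metric and applies them to the printed trial means. It is written in Python for illustration and is not SYSTAT's own algorithm: the contrasts come from a plain Gram-Schmidt orthogonalization, and the hypothesis sum of squares is taken as n(c'm)^2 / (c'c), a formula assumed here because it agrees with the printed values to within the rounding of the means. The function names are hypothetical.

```python
def orthogonal_polynomials(metric, max_order):
    """Return contrast vectors for orders 1..max_order, built by
    Gram-Schmidt orthogonalization of the powers of `metric`."""
    basis = [[1.0] * len(metric)]  # order 0: the constant term
    for order in range(1, max_order + 1):
        v = [float(x) ** order for x in metric]
        for b in basis:
            coef = sum(vi * bi for vi, bi in zip(v, b)) / sum(bi * bi for bi in b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        basis.append(v)
    return basis[1:]


def contrast_ss(contrast, means, n):
    """Hypothesis sum of squares for a normalized contrast on the trial means."""
    dot = sum(c * m for c, m in zip(contrast, means))
    return n * dot ** 2 / sum(c * c for c in contrast)


means = [2.500, 5.833, 7.167, 8.000, 8.333]  # printed dependent variable means
n_rats = 6

even = orthogonal_polynomials([1, 2, 3, 4, 5], 2)
uneven = orthogonal_polynomials([1, 2, 3, 4, 10], 2)

for label, contrasts in (("even", even), ("uneven", uneven)):
    linear, quadratic = (contrast_ss(c, means, n_rats) for c in contrasts)
    print(f"{label:7s} linear SS = {linear:8.3f}   quadratic SS = {quadratic:8.3f}")
```

With the evenly spaced metric, most of the sum of squares falls on the linear contrast. Moving the fifth point to 10 shifts a large part of it to the quadratic contrast, while the overall Time sum of squares stays the same.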


Difference Contrasts
If you do not want to use polynomials, you can specify a C matrix that contrasts
adjacent weeks. After estimating the model, input the following:
HYPOTHESIS
WITHIN='Time'
CONTRAST / DIFFERENCE
TEST

The output is:


Hypothesis.
C Matrix
                1          2          3          4          5
    1       1.000     -1.000      0.0        0.0        0.0
    2       0.0        1.000     -1.000      0.0        0.0
    3       0.0        0.0        1.000     -1.000      0.0
    4       0.0        0.0        0.0        1.000     -1.000

Univariate F Tests
Effect            SS     df         MS          F          P
1             66.667      1     66.667     17.241      0.009
Error         19.333      5      3.867
2             10.667      1     10.667     40.000      0.001
Error          1.333      5      0.267
3              4.167      1      4.167      0.566      0.486
Error         36.833      5      7.367
4              0.667      1      0.667      2.500      0.175
Error          1.333      5      0.267

Multivariate Test Statistics

Wilks Lambda =                0.011
F-Statistic =                43.007    df = 4, 2    Prob = 0.023

Pillai Trace =                0.989
F-Statistic =                43.007    df = 4, 2    Prob = 0.023

Hotelling-Lawley Trace =     86.014
F-Statistic =                43.007    df = 4, 2    Prob = 0.023

Notice the C matrix that this command generates. In this case, each of the univariate F
tests covers the significance of the difference between the adjacent weeks indexed by
the C matrix. For example, F = 17.241 shows that the first and second weeks differ
significantly. The third and fourth weeks do not differ (F = 0.566). Unlike polynomials,
these contrasts are not orthogonal.
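
The following Python sketch (for illustration only, not SYSTAT code) rebuilds the difference C matrix, checks that adjacent rows are not orthogonal, and recomputes the hypothesis sums of squares from the printed means. The formula n(c'm)^2, applied to the rows as printed without normalization, is an assumption inferred from the output; it matches the printed values to within the rounding of the means. The error sums of squares require the raw data and are not reproduced.

```python
means = [2.500, 5.833, 7.167, 8.000, 8.333]  # printed dependent variable means
n_rats = 6

# Rows of the difference C matrix: each row contrasts one week with the next.
k = len(means)
c_matrix = [[1.0 if j == i else -1.0 if j == i + 1 else 0.0 for j in range(k)]
            for i in range(k - 1)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothesis SS for each row; the rows are used as printed (not normalized).
hypothesis_ss = [n_rats * dot(row, means) ** 2 for row in c_matrix]

# Adjacent rows share a week, so their dot product is nonzero.
adjacent_dots = [dot(c_matrix[i], c_matrix[i + 1]) for i in range(k - 2)]

print([round(ss, 3) for ss in hypothesis_ss])
print(adjacent_dots)
```

Because the rows overlap, the four hypothesis sums of squares do not add up to the Time sum of squares, unlike the polynomial contrasts.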


Summing Effects
To sum across weeks:
HYPOTHESIS
WITHIN='Time'
CONTRAST / SUM
TEST

The output is:


Hypothesis.
C Matrix
                1          2          3          4          5
            1.000      1.000      1.000      1.000      1.000

Test of Hypothesis
Source             SS     df         MS          F          P
Hypothesis   6080.167      1   6080.167    295.632      0.000
Error         102.833      5     20.567

In this example, you are testing whether the overall weight (across weeks) significantly
differs from 0. Naturally, the F value is significant. Notice the C matrix that is
generated. It is simply a set of 1s that, in the equation BC' = 0, sum all the coefficients
in B. In a group-by-trials design, this C matrix is useful for pooling trials and analyzing
group effects.

Custom Contrasts
To test any arbitrary contrast effects between dependent variables, you can use the C
matrix, which has the same form (without a column for the CONSTANT) as the A matrix.
The following commands test a linear trend across the five trials:
HYPOTHESIS
CMATRIX [-2 -1 0 1 2]
TEST

The output is:


Hypothesis.
C Matrix
                1          2          3          4          5
           -2.000     -1.000      0.0        1.000      2.000

Test of Hypothesis
Source             SS     df         MS          F          P
Hypothesis   1148.167      1   1148.167     38.572      0.002
Error         148.833      5     29.767
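
The F statistic here (38.572) is the same as the one for the linear polynomial test in the first analysis, even though the sums of squares are ten times larger. The short Python sketch below, which uses only numbers printed in the two outputs, illustrates why: the custom contrast is not normalized, so both the hypothesis and the error sums of squares are multiplied by the squared length of the contrast, and the ratio is unchanged. The variable names are hypothetical.

```python
# Printed results for the linear polynomial test (normalized contrast).
poly_ss_h, poly_ss_e, df_e = 114.817, 14.883, 5
# Printed results for the custom contrast [-2 -1 0 1 2].
custom_ss_h, custom_ss_e = 1148.167, 148.833

contrast = [-2, -1, 0, 1, 2]
scale = sum(c * c for c in contrast)  # squared length of the contrast

f_poly = poly_ss_h / (poly_ss_e / df_e)
f_custom = custom_ss_h / (custom_ss_e / df_e)

print(f"scale factor        = {scale}")
print(f"custom SS / scale   = {custom_ss_h / scale:.3f}")
print(f"F (polynomial)      = {f_poly:.3f}")
print(f"F (custom contrast) = {f_custom:.3f}")
```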


Example 9
Repeated Measures ANOVA for One Grouping Factor and
One Within Factor with Ordered Levels
The following example uses estimates of population for 1983, 1986, and 1990 and
projections for 2020 for 57 countries from the OURWORLD data file. The data are log
transformed before analysis. Here you compare trends in population growth for
European and Islamic countries. The variable GROUP$ contains codes for these
groups plus a third code for New World countries (we exclude these countries from this
analysis). To create a bar chart of the data after using YLOG to log transform them:
USE ourworld
SELECT group$ <> 'NewWorld'
BAR pop_1983 .. pop_2020 / REPEAT OVERLAY YLOG,
    GROUP=group$ SERROR FILL=.35,.8

[Figure: bar chart of Measure (log scale, 1.0 to 100.0) against Trial (POP_1983, POP_1986, POP_1990, POP_2020), with separate bars for GROUP$ = Islamic and Europe.]

To perform a repeated measures analysis:


USE ourworld
ANOVA
SELECT group$ <> 'NewWorld'
CATEGORY group$
LET(pop_1983, pop_1986, pop_1990, pop_2020) = L10(@)
DEPEND pop_1983 pop_1986 pop_1990 pop_2020 / REPEAT=4 NAME=Time
ESTIMATE


The output follows:


Single Degree of Freedom Polynomial Contrasts
---------------------------------------------
Polynomial Test of Order 1 (Linear)
Source               SS     df         MS          F          P
Time              0.675      1      0.675    370.761      0.000
Time*GROUP$       0.583      1      0.583    320.488      0.000
Error             0.062     34      0.002

Polynomial Test of Order 2 (Quadratic)
Source               SS     df         MS          F          P
Time              0.132      1      0.132     92.246      0.000
Time*GROUP$       0.128      1      0.128     89.095      0.000
Error             0.049     34      0.001

Polynomial Test of Order 3 (Cubic)
Source               SS     df         MS          F          P
Time              0.028      1      0.028     96.008      0.000
Time*GROUP$       0.027      1      0.027     94.828      0.000
Error             0.010     34      0.000

-------------------------------------------------------------------------------
Multivariate Repeated Measures Analysis

Test of: Time               Hypoth. df   Error df          F          P
Wilks Lambda=      0.063             3         32    157.665      0.000
Pillai Trace =     0.937             3         32    157.665      0.000
H-L Trace    =    14.781             3         32    157.665      0.000

Test of: Time*GROUP$        Hypoth. df   Error df          F          P
Wilks Lambda=      0.076             3         32    130.336      0.000
Pillai Trace =     0.924             3         32    130.336      0.000
H-L Trace    =    12.219             3         32    130.336      0.000

The within-subjects results indicate highly significant linear, quadratic, and cubic
changes across time. The pattern of change across time for the two groups also differs
significantly (that is, the TIME * GROUP$ interactions are highly significant for all
three tests).
Notice that there is a larger gap in time between 1990 and 2020 than between the
other values. Let's incorporate real time in the analysis with the following
specification:
DEPEND pop_1983 pop_1986 pop_1990 pop_2020 / REPEAT=4(83,86,90,120),
NAME=TIME
ESTIMATE


The results for the orthogonal polynomials are shown below:


Single Degree of Freedom Polynomial Contrasts
---------------------------------------------
Polynomial Test of Order 1 (Linear)
Source               SS     df         MS          F          P
TIME              0.831      1      0.831    317.273      0.000
TIME*GROUP$       0.737      1      0.737    281.304      0.000
Error             0.089     34      0.003

Polynomial Test of Order 2 (Quadratic)
Source               SS     df         MS          F          P
TIME              0.003      1      0.003      4.402      0.043
TIME*GROUP$       0.001      1      0.001      1.562      0.220
Error             0.025     34      0.001

Polynomial Test of Order 3 (Cubic)
Source               SS     df         MS          F          P
TIME              0.000      1      0.000      1.653      0.207
TIME*GROUP$       0.000      1      0.000      1.733      0.197
Error             0.006     34      0.000

When the values for POP_2020 are positioned on a real time line, the tests for the
quadratic and cubic polynomials are no longer highly significant; only the quadratic
TIME effect remains marginally significant (F = 4.402, p = 0.043). The test for the linear
TIME * GROUP$ interaction, however, remains highly significant, indicating that the
slope across time for the Islamic group is significantly steeper than that for the
European countries.

Example 10
Repeated Measures ANOVA for Two Grouping Factors and
One Within Factor
Repeated measures enables you to handle grouping factors automatically. The
following example is from Winer (1971). There are two grouping factors (ANXIETY
and TENSION) and one trials factor in the file REPEAT1. Following is a dot display of
the average responses across trials for each of the four combinations of ANXIETY and
TENSION.

[Figure: four dot-and-line panels, one for each ANXIETY, TENSION combination (1,1; 1,2; 2,1; 2,2), each plotting Measure against Trial, TRIAL(1) through TRIAL(4).]
The input is:


USE repeat1
ANOVA
DOT trial(1..4) / Group=anxiety,tension, LINE,REPEAT,SERROR
CATEGORY anxiety tension
DEPEND trial(1 .. 4) / REPEAT NAME=Trial
PRINT MEDIUM
ESTIMATE

The model also includes an interaction between the grouping factors (ANXIETY *
TENSION). The output follows:
Univariate and Multivariate Repeated Measures Analysis

Between Subjects
----------------
Source                   SS     df         MS          F          P
ANXIETY              10.083      1     10.083      0.978      0.352
TENSION               8.333      1      8.333      0.808      0.395
ANXIETY*TENSION      80.083      1     80.083      7.766      0.024
Error                82.500      8     10.313

Within Subjects
---------------
Source                         SS     df         MS          F          P        G-G        H-F
Trial                     991.500      3    330.500    152.051      0.000      0.000      0.000
Trial*ANXIETY               8.417      3      2.806      1.291      0.300      0.300      0.301
Trial*TENSION              12.167      3      4.056      1.866      0.162      0.197      0.169
Trial*ANXIETY*TENSION      12.750      3      4.250      1.955      0.148      0.185      0.155
Error                      52.167     24      2.174

Greenhouse-Geisser Epsilon:    0.5361
Huynh-Feldt Epsilon       :    0.9023
-------------------------------------------------------------------------------
Single Degree of Freedom Polynomial Contrasts
---------------------------------------------
Polynomial Test of Order 1 (Linear)
Source                         SS     df         MS          F          P
Trial                     984.150      1    984.150    247.845      0.000
Trial*ANXIETY               1.667      1      1.667      0.420      0.535
Trial*TENSION              10.417      1     10.417      2.623      0.144
Trial*ANXIETY*TENSION       9.600      1      9.600      2.418      0.159
Error                      31.767      8      3.971

Polynomial Test of Order 2 (Quadratic)
Source                         SS     df         MS          F          P
Trial                       6.750      1      6.750      3.411      0.102
Trial*ANXIETY               3.000      1      3.000      1.516      0.253
Trial*TENSION               0.083      1      0.083      0.042      0.843
Trial*ANXIETY*TENSION       0.333      1      0.333      0.168      0.692
Error                      15.833      8      1.979

Polynomial Test of Order 3 (Cubic)
Source                         SS     df         MS          F          P
Trial                       0.600      1      0.600      1.051      0.335
Trial*ANXIETY               3.750      1      3.750      6.569      0.033
Trial*TENSION               1.667      1      1.667      2.920      0.126
Trial*ANXIETY*TENSION       2.817      1      2.817      4.934      0.057
Error                       4.567      8      0.571
-------------------------------------------------------------------------------

Multivariate Repeated Measures Analysis

Test of: Trial                           Hypoth. df   Error df          F          P
Wilks Lambda=      0.015                          3          6    127.686      0.000
Pillai Trace =     0.985                          3          6    127.686      0.000
H-L Trace    =    63.843                          3          6    127.686      0.000

Test of: Trial*ANXIETY                   Hypoth. df   Error df          F          P
Wilks Lambda=      0.244                          3          6      6.183      0.029
Pillai Trace =     0.756                          3          6      6.183      0.029
H-L Trace    =     3.091                          3          6      6.183      0.029

Test of: Trial*TENSION                   Hypoth. df   Error df          F          P
Wilks Lambda=      0.361                          3          6      3.546      0.088
Pillai Trace =     0.639                          3          6      3.546      0.088
H-L Trace    =     1.773                          3          6      3.546      0.088

Test of: Trial*ANXIETY*TENSION           Hypoth. df   Error df          F          P
Wilks Lambda=      0.328                          3          6      4.099      0.067
Pillai Trace =     0.672                          3          6      4.099      0.067
H-L Trace    =     2.050                          3          6      4.099      0.067

In the within-subjects table, you see that the trial effect is highly significant (F = 152.1, p <
0.0005). Below that table, you see that the linear trend across trials (Polynomial Order 1)
is highly significant (F = 247.8, p < 0.0005). The hypothesis sums of squares for the linear,
quadratic, and cubic polynomials sum to the total hypothesis sum of squares for trials (that
is, 984.15 + 6.75 + 0.60 = 991.5). Because the sum of squares for the linear trend (984.15)
is so close to that total, the linear trend accounts for more than 99%
of the variability across the four trials. The assumption of compound symmetry is not
required for the test of linear trend, so you can report that there is a highly significant
linear decrease across the four trials (F = 247.8, p < 0.0005).

Example 11
Repeated Measures ANOVA for Two Trial Factors
Repeated measures enables you to handle several trials factors, so we include an
example with two trial factors. It is an experiment from Winer (1971), which has one
grouping factor (NOISE) and two trials factors (PERIODS and DIALS). The trials
factors must be sorted into a set of dependent variables (one for each pairing of the two
factors' levels). It is useful to label the levels with a convenient mnemonic. The file is
set up with variables P1D1 through P3D3. Variable P1D2 indicates a score in the
PERIODS = 1, DIALS = 2 cell. The data are in the file REPEAT2.


The input is:


USE repeat2
ANOVA
CATEGORY noise
DEPEND p1d1 .. p3d3 / REPEAT=3,3 NAMES=period,dial
PRINT MEDIUM
ESTIMATE

Notice that REPEAT specifies that the two trials factors have three levels each. ANOVA
assumes the subscript of the first factor will vary slowest in the ordering of the
dependent variables. If you have two repeated factors (DAY with four levels and AMPM
with two levels), you should select eight dependent variables and type Repeat=4,2. The
repeated measures are selected in the following order:
DAY1_AM  DAY1_PM  DAY2_AM  DAY2_PM  DAY3_AM  DAY3_PM  DAY4_AM  DAY4_PM
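
The ordering rule (first factor slowest, last factor fastest) can be sketched in Python as a Cartesian product of the factor levels. This is only an illustration of the ordering; the function name and the way the variable names are joined are hypothetical and not part of SYSTAT.

```python
from itertools import product


def repeated_measures_order(*factors):
    """List the level combinations in the order the dependent variables
    should be given: first factor slowest, last factor fastest."""
    return list(product(*factors))


days = ["DAY1", "DAY2", "DAY3", "DAY4"]   # first factor, four levels
halves = ["AM", "PM"]                      # second factor, two levels

names = ["_".join(combo) for combo in repeated_measures_order(days, halves)]
print(names)
```

Listing the variables in any other order would attach the wrong factor levels to each dependent variable.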

From this indexing, it generates the proper main effects and interactions. When more
than one trial factor is present, ANOVA lists each dependent variable and the
associated level on each factor. The output follows:
Dependent variable means
       P1D1       P1D2       P1D3       P2D1       P2D2
     48.000     52.000     63.000     37.167     42.167

       P2D3       P3D1       P3D2       P3D3
     54.167     27.000     32.500     42.500

-------------------------------------------------------------------------------
Univariate and Multivariate Repeated Measures Analysis

Between Subjects
----------------
Source            SS     df         MS          F          P
NOISE        468.167      1    468.167      0.752      0.435
Error       2491.111      4    622.778

Within Subjects
---------------
Source                     SS     df         MS          F          P        G-G        H-F
period               3722.333      2   1861.167     63.389      0.000      0.000      0.000
period*NOISE          333.000      2    166.500      5.671      0.029      0.057      0.029
Error                 234.889      8     29.361

Greenhouse-Geisser Epsilon:    0.6476
Huynh-Feldt Epsilon       :    1.0000

dial                 2370.333      2   1185.167     89.823      0.000      0.000      0.000
dial*NOISE             50.333      2     25.167      1.907      0.210      0.215      0.210
Error                 105.556      8     13.194

Greenhouse-Geisser Epsilon:    0.9171
Huynh-Feldt Epsilon       :    1.0000

period*dial            10.667      4      2.667      0.336      0.850      0.729      0.850
period*dial*NOISE      11.333      4      2.833      0.357      0.836      0.716      0.836
Error                 127.111     16      7.944

Greenhouse-Geisser Epsilon:    0.5134
Huynh-Feldt Epsilon       :    1.0000
-------------------------------------------------------------------------------
Single Degree of Freedom Polynomial Contrasts
---------------------------------------------
Polynomial Test of Order 1 (Linear)
Source                     SS     df         MS          F          P
period               3721.000      1   3721.000     73.441      0.001
period*NOISE          225.000      1    225.000      4.441      0.103
Error                 202.667      4     50.667

dial                 2256.250      1   2256.250    241.741      0.000
dial*NOISE              6.250      1      6.250      0.670      0.459
Error                  37.333      4      9.333

period*dial             0.375      1      0.375      0.045      0.842
period*dial*NOISE       1.042      1      1.042      0.125      0.742
Error                  33.333      4      8.333

Polynomial Test of Order 2 (Quadratic)
Source                     SS     df         MS          F          P
period                  1.333      1      1.333      0.166      0.705
period*NOISE          108.000      1    108.000     13.407      0.022
Error                  32.222      4      8.056

dial                  114.083      1    114.083      6.689      0.061
dial*NOISE             44.083      1     44.083      2.585      0.183
Error                  68.222      4     17.056

period*dial             3.125      1      3.125      0.815      0.418
period*dial*NOISE       0.125      1      0.125      0.033      0.865
Error                  15.333      4      3.833

Polynomial Test of Order 3 (Cubic)
Source                     SS     df         MS          F          P
period*dial             6.125      1      6.125      0.750      0.435
period*dial*NOISE       3.125      1      3.125      0.383      0.570
Error                  32.667      4      8.167

Polynomial Test of Order 4
Source                     SS     df         MS          F          P
period*dial             1.042      1      1.042      0.091      0.778
period*dial*NOISE       7.042      1      7.042      0.615      0.477
Error                  45.778      4     11.444
-------------------------------------------------------------------------------

Multivariate Repeated Measures Analysis

Test of: period                     Hypoth. df   Error df          F          P
Wilks Lambda=        0.051                   2          3     28.145      0.011
Pillai Trace =       0.949                   2          3     28.145      0.011
H-L Trace    =      18.764                   2          3     28.145      0.011

Test of: period*NOISE               Hypoth. df   Error df          F          P
Wilks Lambda=        0.156                   2          3      8.111      0.062
Pillai Trace =       0.844                   2          3      8.111      0.062
H-L Trace    =       5.407                   2          3      8.111      0.062

Test of: dial                       Hypoth. df   Error df          F          P
Wilks Lambda=        0.016                   2          3     91.456      0.002
Pillai Trace =       0.984                   2          3     91.456      0.002
H-L Trace    =      60.971                   2          3     91.456      0.002

Test of: dial*NOISE                 Hypoth. df   Error df          F          P
Wilks Lambda=        0.565                   2          3      1.155      0.425
Pillai Trace =       0.435                   2          3      1.155      0.425
H-L Trace    =       0.770                   2          3      1.155      0.425

Test of: period*dial                Hypoth. df   Error df          F          P
Wilks Lambda=        0.001                   4          1    331.445      0.041
Pillai Trace =       0.999                   4          1    331.445      0.041
H-L Trace    =    1325.780                   4          1    331.445      0.041

Test of: period*dial*NOISE          Hypoth. df   Error df          F          P
Wilks Lambda=        0.000                   4          1    581.875      0.031
Pillai Trace =       1.000                   4          1    581.875      0.031
H-L Trace    =    2327.500                   4          1    581.875      0.031

Using GLM, the input is:


GLM
USE repeat2
CATEGORY noise
MODEL p1d1 .. p3d3 = CONSTANT + noise / REPEAT=3,3,
NAMES=period,dial
PRINT MEDIUM
ESTIMATE

Example 12
Repeated Measures Analysis of Covariance
To do repeated measures analysis of covariance, where the covariate varies within
subjects, you would have to set up your model like a split plot with a different record
for each measurement.
This example is from Winer (1971). This design has two trials (DAY1 and DAY2),
one covariate (AGE), and one grouping factor (SEX). The data are in the file WINER.


The input follows:


USE winer
ANOVA
CATEGORY sex
DEPEND day(1 .. 2) / REPEAT NAME=day
COVARIATE age
ESTIMATE

The output is:


Dependent variable means
     DAY(1)     DAY(2)
     16.500     11.875

-------------------------------------------------------------------------------
Univariate Repeated Measures Analysis

Between Subjects
----------------
Source            SS     df         MS          F          P
SEX           44.492      1     44.492      3.629      0.115
AGE          166.577      1    166.577     13.587      0.014
Error         61.298      5     12.260

Within Subjects
---------------
Source            SS     df         MS          F          P        G-G        H-F
day           22.366      1     22.366     17.899      0.008          .          .
day*SEX        0.494      1      0.494      0.395      0.557          .          .
day*AGE        0.127      1      0.127      0.102      0.763          .          .
Error          6.248      5      1.250

Greenhouse-Geisser Epsilon:    .
Huynh-Feldt Epsilon       :    .

The F statistics for the covariate and its interactions, namely AGE (13.587) and
DAY * AGE (0.102), are not ordinarily published; however, they help you
understand the adjustment made by the covariate.
This analysis did not test the homogeneity of slopes assumption. If you want to test
the homogeneity of slopes assumption, run the following model in GLM first:
MODEL day(1 .. 2) = CONSTANT + sex + age + sex*age / REPEAT

Then check to see if the SEX * AGE interaction is significant.


To use GLM:
GLM
USE winer
CATEGORY sex
MODEL day(1 .. 2) = CONSTANT + sex + age / REPEAT NAME=day
ESTIMATE

Example 13
Multivariate Analysis of Variance
The data in the file MANOVA comprise a hypothetical experiment on rats assigned
randomly to one of three drugs. Weight loss in grams was observed for the first and
second weeks of the experiment. The data were analyzed in Morrison (1976) with a
two-way multivariate analysis of variance (a two-way MANOVA).
You can use ANOVA to set up the MANOVA model for complete factorials:
USE manova
ANOVA
CATEGORY sex, drug
DEPEND week(1 .. 2)
ESTIMATE

Notice that the only difference between an ANOVA and MANOVA model is that the
latter has more than one dependent variable. The output includes:
Dependent variable means
    WEEK(1)    WEEK(2)
      9.750      8.667

Estimates of effects  B = (X'X)^-1 X'Y

                       WEEK(1)    WEEK(2)
CONSTANT                 9.750      8.667
SEX      1               0.167      0.167
DRUG     1              -2.750     -1.417
DRUG     2              -2.250     -0.167
SEX      1
DRUG     1              -0.667     -1.167
SEX      1
DRUG     2              -0.417     -0.417


Notice that each column of the B matrix is now assigned to a separate dependent
variable. It is as if we had done two runs of an ANOVA. The numbers in the matrix are
the analysis of variance effects estimates.
You can also use GLM to set up the MANOVA model. With this approach, the
design does not have to be a complete factorial. With commands:
GLM
USE manova
CATEGORY sex, drug
MODEL week(1 .. 2) = CONSTANT + sex + drug + sex*drug
ESTIMATE

Testing Hypotheses
With more than one dependent variable, you do not get a single ANOVA table; instead,
each hypothesis is tested separately. Here are three hypotheses. Extended output for the
second hypothesis is used to illustrate the detailed output.
HYPOTHESIS
EFFECT = sex
TEST
PRINT = LONG
HYPOTHESIS
EFFECT = drug
TEST
PRINT = SHORT
HYPOTHESIS
EFFECT = sex*drug
TEST

Following are the collected results:


Test for effect called:   SEX

Univariate F Tests
Effect            SS     df         MS          F          P
WEEK(1)        0.667      1      0.667      0.127      0.726
Error         94.500     18      5.250
WEEK(2)        0.667      1      0.667      0.105      0.749
Error        114.000     18      6.333

Multivariate Test Statistics

Wilks Lambda =                0.993
F-Statistic =                 0.064    df = 2, 17    Prob = 0.938

Pillai Trace =                0.007
F-Statistic =                 0.064    df = 2, 17    Prob = 0.938

Hotelling-Lawley Trace =      0.008
F-Statistic =                 0.064    df = 2, 17    Prob = 0.938
-------------------------------------------------------------------------------

Test for effect called:   DRUG

Null hypothesis contrast AB
               WEEK(1)    WEEK(2)
    1           -2.750     -1.417
    2           -2.250     -0.167

Inverse contrast A(X'X)^-1 A'
                     1          2
    1            0.083
    2           -0.042      0.083

Hypothesis sum of product matrix  H = B'A'(A(X'X)^-1 A')^-1 AB
               WEEK(1)    WEEK(2)
WEEK(1)        301.000
WEEK(2)         97.500     36.333

Error sum of product matrix  G = E'E
               WEEK(1)    WEEK(2)
WEEK(1)         94.500
WEEK(2)         76.500    114.000

Univariate F Tests
Effect            SS     df         MS          F          P
WEEK(1)      301.000      2    150.500     28.667      0.000
Error         94.500     18      5.250
WEEK(2)       36.333      2     18.167      2.868      0.083
Error        114.000     18      6.333

Multivariate Test Statistics

Wilks Lambda =                0.169
F-Statistic =                12.199    df = 4, 34    Prob = 0.000

Pillai Trace =                0.880
F-Statistic =                 7.077    df = 4, 36    Prob = 0.000

Hotelling-Lawley Trace =      4.640
F-Statistic =                18.558    df = 4, 32    Prob = 0.000

THETA = 0.821    S = 2, M = -0.5, N = 7.5    Prob = 0.000

Test of Residual Roots

Roots 1 through 2
Chi-Square Statistic =       36.491    df = 4

Roots 2 through 2
Chi-Square Statistic =        1.262    df = 1

Canonical Correlations
                     1          2
                 0.906      0.244

Dependent variable canonical coefficients standardized
by conditional (within groups) standard deviations
                     1          2
WEEK(1)          1.437     -0.352
WEEK(2)         -0.821      1.231

Canonical loadings (correlations between conditional
dependent variables and dependent canonical factors)
                     1          2
WEEK(1)          0.832      0.555
WEEK(2)          0.238      0.971

-------------------------------------------------------------------------------
Test for effect called:   SEX*DRUG

Univariate F Tests
Effect            SS     df         MS          F          P
WEEK(1)       14.333      2      7.167      1.365      0.281
Error         94.500     18      5.250
WEEK(2)       32.333      2     16.167      2.553      0.106
Error        114.000     18      6.333

Multivariate Test Statistics

Wilks Lambda =                0.774
F-Statistic =                 1.159    df = 4, 34    Prob = 0.346

Pillai Trace =                0.227
F-Statistic =                 1.152    df = 4, 36    Prob = 0.348

Hotelling-Lawley Trace =      0.290
F-Statistic =                 1.159    df = 4, 32    Prob = 0.347

THETA = 0.221    S = 2, M = -0.5, N = 7.5    Prob = 0.295

Matrix formulas (which can be somewhat long) make explicit the hypothesis being tested.
For MANOVA, hypotheses are tested with sums-of-squares and cross-products
matrices. Before printing the multivariate tests, however, SYSTAT prints the univariate
tests. Each of these F statistics is constructed in the same way as in the ANOVA model.
The sums of squares for hypothesis and error are taken from the diagonals of the
respective sum of product matrices. The univariate F test for the WEEK(1) DRUG
effect, for example, is computed from 301.0/2 over 94.5/18 (that is, 150.500/5.250 =
28.667), or hypothesis mean square divided by error mean square.
The next statistics printed are for the multivariate hypothesis. Wilks' lambda
(likelihood-ratio criterion) varies between 0 and 1. Schatzoff (1966) has tables for its
percentage points. The following F statistic is Rao's approximate (sometimes exact) F
statistic corresponding to the likelihood-ratio criterion (see Rao, 1973). Pillai's trace
and its F approximation are taken from Pillai (1960). The Hotelling-Lawley trace and
its F approximation are documented in Morrison (1976). The last statistic is the largest
root criterion for Roy's union-intersection test (see Morrison, 1976). Charts of the
percentage points of this statistic, found in Morrison and other multivariate texts, are
taken from Heck (1960).
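
As a check on how these criteria relate to the printed matrices, the sketch below recomputes them for the DRUG hypothesis from the hypothesis matrix H and the error matrix G. It is written in Python for illustration and uses the standard textbook definitions (Wilks' lambda = |G| / |H + G|, Pillai's trace = trace[H(H + G)^-1], Hotelling-Lawley trace = trace[HG^-1], and THETA as the largest eigenvalue of H(H + G)^-1). That SYSTAT computes them in exactly this way is an assumption, although the results agree with the printed values to three decimals. The helper names are hypothetical.

```python
import math

# Symmetric 2 x 2 matrices from the printed DRUG hypothesis output.
H = [[301.000, 97.500], [97.500, 36.333]]   # hypothesis sums of products
G = [[94.500, 76.500], [76.500, 114.000]]   # error sums of products


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def trace_a_binv(a, b):
    """trace(A B^-1) for symmetric 2 x 2 matrices A and B."""
    return (a[0][0] * b[1][1] - 2 * a[0][1] * b[0][1] + a[1][1] * b[0][0]) / det2(b)


T = [[H[i][j] + G[i][j] for j in range(2)] for i in range(2)]  # H + G

wilks = det2(G) / det2(T)
pillai = trace_a_binv(H, T)
hotelling_lawley = trace_a_binv(H, G)

# Eigenvalues of H G^-1 from its trace and determinant.
tr, dt = hotelling_lawley, det2(H) / det2(G)
root = math.sqrt(tr * tr - 4 * dt)
eigenvalues = [(tr + root) / 2, (tr - root) / 2]

theta = eigenvalues[0] / (1 + eigenvalues[0])   # largest root criterion
canonical_r = [math.sqrt(ev / (1 + ev)) for ev in eigenvalues]

print(f"Wilks = {wilks:.3f}, Pillai = {pillai:.3f}, H-L = {hotelling_lawley:.3f}")
print(f"THETA = {theta:.3f}, canonical correlations = "
      f"{canonical_r[0]:.3f}, {canonical_r[1]:.3f}")
```

The canonical correlations printed in the extended output are the square roots of the same eigenvalue ratios, so THETA equals the square of the first canonical correlation.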
The probability value printed for THETA is not an approximation. It is what you find
in the charts. In the first hypothesis, all the multivariate statistics have the same value
for the F approximation because the approximation is exact when there are only two
groups (see Hotelling's T^2 in Morrison, 1976). In these cases, THETA is not printed
because it has the same probability value as the F statistic.
Because we requested extended output for the second hypothesis, we get additional
material.

Bartlett's Residual Root (Eigenvalue) Test

The chi-square statistics follow Bartlett (1947). The probability value for the first
chi-square statistic should correspond to that for the approximate multivariate F statistic in
large samples. In small samples, they might be discrepant, in which case you should
generally trust the F statistic more. The subsequent chi-square statistics are
recomputed, leaving out the first and later roots until the last root is tested. These are
sequential tests and should be treated with caution, but they can be used to decide how
many dimensions (roots and canonical correlations) are significant. The number of
significant roots corresponds to the number of significant p values in this ordered list.
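
The sequential tests can be sketched from the printed canonical correlations. The Python code below is an illustration under stated assumptions, not SYSTAT's documented formula: it uses Bartlett's chi-square approximation with multiplier N - 1 - (p + q + 1)/2, where N = 24 cases is inferred from the 18 error degrees of freedom plus the 6 cell parameters, p = 2 dependent variables, and q = 2 hypothesis degrees of freedom for DRUG. With these values the results agree with the printed chi-square statistics up to the rounding of the canonical correlations.

```python
import math

canonical_r = [0.906, 0.244]   # printed canonical correlations (rounded)
n_cases = 24                   # assumption: error df 18 plus 6 estimated cell parameters
p = 2                          # dependent variables: WEEK(1), WEEK(2)
q = 2                          # hypothesis degrees of freedom for DRUG

multiplier = n_cases - 1 - (p + q + 1) / 2


def residual_root_test(k):
    """Chi-square and df for the roots that remain after removing the first k."""
    log_lambda = sum(math.log(1 - r * r) for r in canonical_r[k:])
    return -multiplier * log_lambda, (p - k) * (q - k)


for k in range(len(canonical_r)):
    chi_square, df = residual_root_test(k)
    print(f"Roots {k + 1} through {len(canonical_r)}: "
          f"chi-square = {chi_square:.3f}, df = {df}")
```

Each step drops the largest remaining root, so the second statistic tests only what is left after the first dimension has been accounted for.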

Canonical Coefficients
Dimensions with insignificant chi-square statistics in the prior tests should be ignored
in general. Corresponding to each canonical correlation is a canonical variate, whose
coefficients have been standardized by the within-groups standard deviations (the
default). Standardization by the sample standard deviation is generally used for
canonical correlation analysis or multivariate regression when groups are not present
to introduce covariation among variates. You can standardize these variates by the total
(sample) standard deviations with:
STANDARDIZE = TOTAL

inserted prior to TEST. Continue with the other test specifications described earlier.
Finally, the canonical loadings are printed. These are correlations and, thus, provide
information different from the canonical coefficients. In particular, you can identify
suppressor variables in the multivariate system by looking for differences in sign
between the coefficients and the loadings (which is the case with these data). See Bock
(1975) and Wilkinson (1975, 1977) for an interpretation of these variates.
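
The sign comparison can be done mechanically. The Python sketch below (for illustration only) takes the canonical coefficients and loadings printed for the DRUG hypothesis and lists every variable whose coefficient and loading have opposite signs on a given canonical variate. Whether a flagged variable is a suppressor in any substantive sense still requires the interpretation discussed in the references above.

```python
# Printed canonical coefficients and loadings for the DRUG hypothesis,
# keyed by dependent variable; list positions are canonical variates 1 and 2.
coefficients = {"WEEK(1)": [1.437, -0.352], "WEEK(2)": [-0.821, 1.231]}
loadings = {"WEEK(1)": [0.832, 0.555], "WEEK(2)": [0.238, 0.971]}

# A negative product means the coefficient and the loading differ in sign.
sign_flips = [
    (name, variate + 1)
    for variate in range(2)
    for name in coefficients
    if coefficients[name][variate] * loadings[name][variate] < 0
]
print(sign_flips)
```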

Computation
Algorithms
Centered sums of squares and cross-products are accumulated using provisional
algorithms. Linear systems, including those involved in hypothesis testing, are solved
by using forward and reverse sweeping (Dempster, 1969). Eigensystems are solved
with Householder tridiagonalization and implicit QL iterations. For further
information, see Wilkinson and Reinsch (1971) or Chambers (1977).

References
Afifi, A. A. and Azen, S. P. (1972). Statistical analysis: A computer-oriented approach.
New York: Academic Press.
Afifi, A. A. and Clark, V. (1984). Computer-aided multivariate analysis. Belmont, Calif.:
Lifetime Learning Publications.
Bartlett, M. S. (1947). Multivariate analysis. Journal of the Royal Statistical Society, Series
B, 9, 176–197.
Bock, R. D. (1975). Multivariate statistical methods in behavioral research. New York:
McGraw-Hill.
Cochran, W. G. and Cox, G. M. (1957). Experimental designs, 2nd ed. New York: John
Wiley & Sons, Inc.
Daniel, C. (1960). Locating outliers in factorial experiments. Technometrics, 2, 149–156.
Feingold, M. and Korsog, P. E. (1986). The correlation and dependence between two F
statistics with the same denominator. The American Statistician, 40, 218–220.
Heck, D. L. (1960). Charts of some upper percentage points of the distribution of the largest
characteristic root. Annals of Mathematical Statistics, 31, 625–642.
Hocking, R. R. (1985). The analysis of linear models. Monterey, Calif.: Brooks/Cole.
John, P. W. M. (1971). Statistical design and analysis of experiments. New York:
MacMillan, Inc.
Kutner, M. H. (1974). Hypothesis testing in linear models (Eisenhart Model I). The
American Statistician, 28, 98–100.
Levene, H. (1960). Robust tests for equality of variance. I. Olkin, ed., Contributions to
Probability and Statistics. Palo Alto, Calif.: Stanford University Press, 278–292.
Miller, R. (1985). Multiple comparisons. Kotz, S. and Johnson, N. L., eds., Encyclopedia
of Statistical Sciences, vol. 5. New York: John Wiley & Sons, Inc., 679–689.
Milliken, G. A. and Johnson, D. E. (1984). Analysis of messy data, Vol. 1: Designed
Experiments. New York: Van Nostrand Reinhold Company.
Morrison, D. F. (1976). Multivariate statistical methods. New York: McGraw-Hill.
Neter, J., Wasserman, W., and Kutner, M. (1985). Applied linear statistical models, 2nd ed.
Homewood, Ill.: Richard D. Irwin, Inc.
Pillai, K. C. S. (1960). Statistical tables for tests of multivariate hypotheses. Manila: The
Statistical Center, University of the Philippines.
Rao, C. R. (1973). Linear statistical inference and its applications, 2nd ed. New York: John
Wiley & Sons, Inc.
Schatzoff, M. (1966). Exact distributions of Wilks' likelihood ratio criterion. Biometrika,
53, 347–358.
Scheffé, H. (1959). The analysis of variance. New York: John Wiley & Sons, Inc.
Searle, S. R. (1971). Linear models. New York: John Wiley & Sons, Inc.
Searle, S. R. (1987). Linear models for unbalanced data. New York: John Wiley & Sons,
Inc.
Speed, F. M. and Hocking, R. R. (1976). The use of the R( )-notation with unbalanced data.
The American Statistician, 30, 30–33.
Speed, F. M., Hocking, R. R., and Hackney, O. P. (1978). Methods of analysis of linear
models with unbalanced data. Journal of the American Statistical Association, 73,
105–112.
Wilkinson, L. (1975). Response variable hypotheses in the multivariate analysis of
variance. Psychological Bulletin, 82, 408–412.
Wilkinson, L. (1977). Confirmatory rotation of MANOVA canonical variates. Multivariate
Behavioral Research, 12, 487–494.
Winer, B. J. (1971). Statistical principles in experimental design, 2nd ed. New York:
McGraw-Hill.

Chapter 16
Linear Models III: General Linear Models
Leland Wilkinson and Mark Coward

General Linear Model (GLM) can estimate and test any univariate or multivariate
general linear model, including those for multiple regression, analysis of variance or
covariance, and other procedures such as discriminant analysis and principal
components. With the general linear model, you can explore randomized block
designs, incomplete block designs, fractional factorial designs, Latin square designs,
split plot designs, crossover designs, nesting, and more. The model is:
Y = XB + e

where Y is a vector or matrix of dependent variables, X is a vector or matrix of
independent variables, B is a vector or matrix of regression coefficients, and e is a
vector or matrix of random errors. See Searle (1971), Winer (1971), Neter,
Wasserman, and Kutner (1985), or Cohen and Cohen (1983) for details.
In multivariate models, Y is a matrix of continuous measures. The X matrix can be
either continuous or categorical dummy variables, according to the type of model. For
discriminant analysis, X is a matrix of dummy variables, as in analysis of variance.
For principal components analysis, X is a constant (a single column of 1s). For
canonical correlation, X is usually a matrix of continuous right-hand variables (and Y
is the matrix of left-hand variables).
For some multivariate models, it may be easier to use ANOVA, which can handle
models with multiple dependent variables and zero, one, or more categorical
independent variables (with zero independent variables, only the constant is present).
ANOVA automatically generates interaction terms for the design factors.
After the parameters of a model have been estimated, they can be tested by any
general linear hypothesis of the following form:


ABC = D

where A is a matrix of linear weights on coefficients across the independent variables
(the rows of B), C is a matrix of linear weights on the coefficients across dependent
variables (the columns of B), B is the matrix of regression coefficients or effects, and
D is a null hypothesis matrix (usually a null matrix).
For the multivariate models described in this chapter, the C matrix is an identity
matrix, and the D matrix is null. The A matrix can have several different forms, but
these are all submatrices of an identity matrix and are easily formed.

General Linear Models in SYSTAT


Model Estimation (in GLM)
To specify a general linear model using GLM, from the menus choose:
Statistics
General Linear Model (GLM)
Estimate Model

You can specify any multivariate linear model with General Linear Model. You must
select the variables to include in the desired model.


Dependent(s). The variable(s) you want to examine. The dependent variable(s) should
be continuous numeric variables (for example, income).
Independent(s). Select one or more continuous or categorical variables (grouping
variables). Independent variables that are not denoted as categorical are considered
covariates. Unlike ANOVA, GLM does not automatically include and test all
interactions. With GLM, you have to build your model. If you want interactions or
nested variables in your model, you need to build these components.
Model. The following model options allow you to include a constant in your model, do
a means model, specify the sample size, and weight cell means:
• Include constant. The constant is an optional parameter. Deselect Include constant
  to obtain a model through the origin. When in doubt, include the constant.
• Means. Specifies a fully factorial design using means coding.
• Cases. When your data file is a symmetric matrix, specify the sample size that
  generated the matrix.
• Weight. Weights cell means by the cell counts before averaging.

In addition, you can save residuals and other data to a new data file. The following
alternatives are available:
• Residuals. Saves predicted values, residuals, Studentized residuals, and the
  standard error of predicted values.
• Residuals/Data. Saves the statistics given by Residuals, plus all the variables in the
  working data file, including any transformed data values.
• Adjusted. Saves adjusted cell means from analysis of covariance.
• Adjusted/Data. Saves adjusted cell means plus all the variables in the working data
  file, including any transformed data values.
• Partial. Saves partial residuals.
• Partial/Data. Saves partial residuals plus all the variables in the working data file,
  including any transformed data values.
• Model. Saves statistics given in Residuals and the variables used in the model.
• Coefficients. Saves the estimates of the regression coefficients.


Categorical Variables
You can specify numeric or character-valued categorical (grouping) variables that
define cells. You want to categorize an independent variable when it has several
categories such as education levels, which could be divided into the following
categories: less than high school, some high school, finished high school, some
college, finished bachelor's degree, finished master's degree, and finished doctorate.
On the other hand, a variable such as age in years would not be categorical unless age
were broken up into categories such as under 21, 21–65, and over 65.
To specify categorical variables, click the Categories button in the General Linear
Model dialog box.

Types of Categories. You can elect to use one of two different coding methods:
• Effect. Produces parameter estimates that are differences from group means.
• Dummy. Produces dummy codes for the design variables instead of effect codes.

Effect coding is the classic analysis of variance parameterization, in which the sum
of effects estimated for a classifying variable is 0. With either coding, a categorical
variable with k categories produces k − 1 dummy variables.
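The two coding schemes can be illustrated outside SYSTAT with a short Python sketch. The function names are hypothetical, and the sketch assumes that the last category is the one coded as all −1 (effect coding) or all 0 (dummy coding); it is not SYSTAT's own implementation.

```python
def effect_codes(k):
    """Return k rows of k-1 effect codes; the last category is coded -1 throughout."""
    rows = [[1 if j == i else 0 for j in range(k - 1)] for i in range(k - 1)]
    rows.append([-1] * (k - 1))
    return rows


def dummy_codes(k):
    """Return k rows of k-1 dummy (0/1) codes; the last category is the all-zero reference."""
    rows = [[1 if j == i else 0 for j in range(k - 1)] for i in range(k - 1)]
    rows.append([0] * (k - 1))
    return rows


# Show both codings for a factor with four categories.
for name, rows in (("effect", effect_codes(4)), ("dummy", dummy_codes(4))):
    print(name)
    for level, row in enumerate(rows, start=1):
        print(f"  level {level}: {row}")
```

Each column of the effect codes sums to 0 across the categories, which is what makes the estimated effects sum to 0.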


Repeated Measures
In a repeated measures design, the same variable is measured several times for each
subject (case). A paired-comparison t test is the simplest form of a repeated
measures design (for example, each subject has a before and after measure).
SYSTAT derives values from your repeated measures and uses them in general
linear model computations to test changes across the repeated measures (within
subjects) as well as differences between groups of subjects (between subjects). Tests
of the within-subjects values are called Polynomial Test Of Order 1, 2,..., up to k,
where k is one less than the number of repeated measures. The first polynomial is used
to test linear changes: Do the repeated responses increase (or decrease) around a line
with a significant slope? The second polynomial tests if the responses fall along a
quadratic curve, etc.
To open the Repeated Measures dialog box, click Repeated in the General Linear
Model dialog box.

If you select Perform repeated measures analysis, SYSTAT treats the dependent
variables as a set of repeated measures. Optionally, you can assign a name for each set
of repeated measures, specify the number of levels, and specify the metric for unevenly
spaced repeated measures.
Name. Name that identifies each set of repeated measures.
Levels. Number of repeated measures in the set. For example, if you have three
dependent variables that represent measurements at different times, the number of
levels is 3.
Metric. Metric that indicates the spacing between unevenly spaced measurements. For
example, if measurements were taken at the third, fifth, and ninth weeks, the metric
would be 3, 5, 9.
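To show what the metric changes, the following Python sketch builds orthogonal polynomial contrast weights for arbitrary spacing by Gram-Schmidt orthogonalization of the powers of the metric values. The function name is hypothetical, and SYSTAT may scale its coefficients differently; the sketch only illustrates the idea.

```python
import math


def orthogonal_polynomials(metric, max_order):
    """Orthonormal polynomial contrast vectors of order 1..max_order for the given spacing."""
    n = len(metric)
    basis = [[1.0 / math.sqrt(n)] * n]  # order 0: the normalized constant
    contrasts = []
    for order in range(1, max_order + 1):
        v = [float(x) ** order for x in metric]
        for b in basis:  # remove the projection on every lower-order vector
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        v = [vi / norm for vi in v]
        basis.append(v)
        contrasts.append(v)
    return contrasts


# Measurements taken at the third, fifth, and ninth weeks.
for order, weights in enumerate(orthogonal_polynomials([3, 5, 9], 2), start=1):
    print(order, [round(w, 3) for w in weights])
```

With evenly spaced values such as 1, 2, 3, 4, the same function reproduces the familiar linear and quadratic weights, proportional to (−3, −1, 1, 3) and (1, −1, −1, 1).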


General Linear Model Options


General Linear Model Options allows you to specify a tolerance level, select complete
or stepwise entry, and specify entry and removal criteria.
To open the Options dialog box, click Options in the General Linear Model dialog box.

The following options can be specified:


Tolerance. Prevents the entry of a variable that is highly correlated with the
independent variables already included in the model. Enter a value between 0 and 1.
Typical values are 0.01 or 0.001. The higher the value (closer to 1), the lower the
correlation required to exclude a variable.
Estimation. Controls the method used to enter and remove variables from the equation.
• Complete. All independent variables are entered in a single step.
• Mixture model. Constrains the independent variables to sum to a constant.
• Stepwise. Variables are entered into or removed from the model, one at a time.

Stepwise Options. The following alternatives are available for stepwise entry and
removal:
• Backward. Begins with all candidate variables in the model. At each step, SYSTAT
  removes the variable with the largest Remove value.
• Forward. Begins with no variables in the model. At each step, SYSTAT adds the
  variable with the smallest Enter value.
• Automatic. For Backward, at each step, SYSTAT automatically removes a variable
  from your model. For Forward, at each step, SYSTAT automatically adds a
  variable to the model.
• Interactive. At each step in the model building, you select the variable to enter into
  or remove from the model.


You can also control the criteria used to enter and remove variables from the model:
• Enter. Enters a variable into the model if its alpha value is less than the specified
  value. Enter a value between 0 and 1 (for example, 0.025).
• Remove. Removes a variable from the model if its alpha value is greater than the
  specified value. Enter a value between 0 and 1 (for example, 0.025).
• Force. Forces the first n variables listed in your model to remain in the equation.
• FEnter. F-to-enter limit. Variables with F greater than the specified value are
  entered into the model if Tolerance permits.
• FRemove. F-to-remove limit. Variables with F less than the specified value are
  removed from the model.
• Max step. Maximum number of steps.

Pairwise Comparisons
Once you determine that your groups are different, you may want to compare pairs of
groups to determine which pairs differ.
To open the Pairwise Comparisons dialog box, from the menus choose:
Statistics
General Linear Model (GLM)
Pairwise Comparisons


Groups. You must specify the variable that defines the groups.
Test. General Linear Model provides several post hoc tests to compare levels of this
variable.
• Bonferroni. Multiple comparison test based on Student's t statistic. Adjusts the
  observed significance level for the fact that multiple comparisons are made.
• Tukey. Uses the Studentized range statistic to make all pairwise comparisons
  between groups and sets the experimentwise error rate to the error rate for the
  collection of all pairwise comparisons. When testing a large number of pairs of
  means, Tukey is more powerful than Bonferroni. For a small number of pairs,
  Bonferroni is more powerful.
• Dunnett. The Dunnett test is available only with one-way designs. Dunnett
  compares a set of treatments against a single control mean that you specify. You
  can choose a two-sided or one-sided test. To test that the mean at any level (except
  the control category) of the experimental groups is not equal to that of the control
  category, select 2-sided. To test if the mean at any level of the experimental groups
  is smaller (or larger) than that of the control category, select 1-sided.
• Fisher's LSD. Least significant difference pairwise multiple comparison test.
  Equivalent to multiple t tests between all pairs of groups. The disadvantage of this
  test is that no attempt is made to adjust the observed significance level for multiple
  comparisons.
• Scheffé. The significance level of Scheffé's test is designed to allow all possible
  linear combinations of group means to be tested, not just the pairwise comparisons
  available in this feature. The result is that Scheffé's test is more conservative than
  other tests, meaning that a larger difference between means is required for
  significance.


Error Term. You can either use the mean square error specified by the model or you can
enter the mean square error.
• Model MSE. Uses the mean square error from the general linear model that you ran.
• MSE and df. You can specify your own mean square error term and degrees of
  freedom for mixed models with random factors, split-plot designs, and crossover
  designs with carry-over effects.

Hypothesis Tests
Contrasts are used to test relationships among cell means. The post hoc tests in GLM
Pairwise Comparisons are the simplest form because they compare two means at a
time. However, general contrasts can involve any number of means in the analysis.
To test hypotheses, from the menus choose:
Statistics
General Linear Model (GLM)
Hypothesis Test

Contrasts can be defined across the categories of a grouping factor or across the levels
of a repeated measure.


Effects. Specify the factor (grouping variable) to which the contrast applies. For
principal components, specify the grouping variable for within-groups components (if
any). For canonical correlation, select All to test all of the effects in the model.
Within. Use when specifying a contrast across the levels of a repeated measures factor.
Enter the name assigned to the set of repeated measures in the Repeated Measures
subdialog box.
Error Term. You can specify which error term to use for the hypothesis tests.
• Model MSE. Uses the mean square error from the general linear model that you ran.
• MSE and df. You can specify your own mean square error and degrees of freedom
  if you know them from a previous model.
• Between Subject(s) Effect(s). Select this option to use main effect error terms or
  interaction error terms in all tests. Specify interactions using an asterisk between
  variables.
Priors. Prior probabilities for discriminant analysis. Type a value for each group,
separated by spaces. These probabilities should add to 1. For example, if you have
three groups, priors might be 0.5, 0.3, and 0.2.
Standardize. You can standardize canonical coefficients using the total sample or a
within-groups covariance matrix.
• Within groups is usually used in discriminant analysis to make comparisons easier
  when measures are on different scales.
• Sample is used in canonical correlation.

Rotate. Specify the number of components to rotate.


Factor. In a factor analysis with grouping variables, factor the Hypothesis (between-groups)
matrix or the Error (within-groups) matrix. This allows you to compute
principal components on the hypothesis or error matrix separately, offering a direct
way to compute principal components on residuals of any linear model you wish to fit.
You can specify the matrix type as Correlations, SSCP, or Covariance.
Save scores and results. You can save the results to a SYSTAT data file. Exactly what
is saved depends on the analysis. When you save scores and results, extended output
is automatically produced. This enables you to see more detailed output when
computing these statistics.


Specify
To specify contrasts for between-subjects effects, click Specify in the Hypothesis Test
dialog box.

You can use GLM's cell means language to define contrasts across the levels of a
grouping variable in a multivariate model. For example, for a two-way factorial
ANOVA design with DISEASE (four categories) and DRUG (three categories), you
could contrast the marginal mean for the first level of drug against the third level by
specifying:
DRUG[1] = DRUG[3]

Note that square brackets enclose the value of the category (for example, for
GENDER$, specify GENDER$[male]). For the simple contrast of the first and third
levels of DRUG for the second disease only, specify:
DRUG[1] DISEASE[2] = DRUG[3] DISEASE[2]

The syntax also allows statements like:


-3*DRUG[1] - 1*DRUG[2] + 1*DRUG[3] + 3*DRUG[4]

In addition, you can specify the error term to use for the contrasts.
Pooled. Uses the error term from the current model.
Separate. Generates a separate variances error term.


Contrast
Contrast generates a contrast for a grouping factor or a repeated measures factor. To
open the Contrast dialog box, click Contrast in the Hypothesis Test dialog box.

SYSTAT offers several types of contrasts:


Custom. Enter your own custom coefficients. For example, if your factor has four
ordered categories (or levels), you can specify your own coefficients, such as -3 -1 1
3, by typing these values in the Custom text box.
Difference. Compares each level with its adjacent level.
Polynomial. Generates orthogonal polynomial contrasts (to test linear, quadratic, or
cubic trends across ordered categories or levels).
• Order. Enter 1 for linear, 2 for quadratic, etc.
• Metric. Use Metric when the ordered categories are not evenly spaced. For
  example, when repeated measures are collected at weeks 2, 4, and 8, enter 2,4,8 as
  the metric.
Sum. In a repeated measures ANOVA, totals the values for each subject.


A Matrix, C Matrix, and D Matrix


A matrix, C matrix, and D matrix are available for hypothesis testing in multivariate
models. You can test parameters of the multivariate model estimated or factor the
quadratic form of your model into orthogonal components. Linear hypotheses have the
form:
ABC = D

These matrices (A, C, and D) may be specified in several alternative ways; if they are
not specified, they have default values. To specify an A matrix, click A matrix in the
Hypothesis Test dialog box.

A is a matrix of linear weights contrasting the coefficient estimates (the rows of B).
The A matrix has as many columns as there are regression coefficients (including the
constant) in your model. The number of rows in A determines how many degrees of
freedom your hypothesis involves. The A matrix can have several different forms, but
these are all submatrices of an identity matrix and are easily formed using Hypothesis
Test.
To specify a C matrix, click C matrix in the Hypothesis Test dialog box.


Post hoc Tests for Repeated Measures


After performing an analysis of variance, we have only an F statistic, which tells us
that the means are not all equal; we still do not know exactly which means are
significantly different from which others. Post hoc tests should be used only
when the "omnibus" ANOVA finds a significant effect. If the F value for a factor
turns out to be nonsignificant, you cannot go further with the analysis. This protects
the post hoc tests from being used too liberally.
The main problem that designers of post hoc tests try to deal with is alpha inflation.
This refers to the fact that the more tests you conduct at alpha = 0.05, the more likely
you are to claim a significant result when you should not. The overall
chance of a type I error in a particular experiment is referred to as the
"experimentwise error rate" (or familywise error rate).
Bonferroni Correction. If you want to keep the experimentwise error rate at a
specified level (for example, alpha = 0.05), a simple way of doing this is to divide the
acceptable alpha level by the number of comparisons you intend to make. That is, for any
one comparison to be considered significant, the obtained p-value would have to be less
than alpha / (number of comparisons). Select this option if you would like to perform a
Bonferroni correction.
Sidak Correction. The same experimentwise error rate is kept under control by
using the formula
sidak_alpha = 1 - (1 - alpha)**(1/c)
where c is the number of paired comparisons.
Select this option if you would like to perform a Sidak correction.
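The two corrections can be compared numerically with a short Python sketch. It assumes a nominal alpha of 0.05 and six paired comparisons (the number of pairs among four repeated measures); the function names are hypothetical and the code is only an illustration of the formulas above, not part of SYSTAT.

```python
def bonferroni_alpha(alpha, c):
    """Per-comparison alpha under the Bonferroni correction."""
    return alpha / c


def sidak_alpha(alpha, c):
    """Per-comparison alpha under the Sidak correction."""
    return 1 - (1 - alpha) ** (1 / c)


alpha = 0.05
c = 6  # assumed: all pairs among four repeated measures

# Chance of at least one type I error over c independent uncorrected tests.
uncorrected = 1 - (1 - alpha) ** c

print(f"uncorrected experimentwise rate: {uncorrected:.4f}")
print(f"Bonferroni per-comparison alpha: {bonferroni_alpha(alpha, c):.6f}")
print(f"Sidak per-comparison alpha:      {sidak_alpha(alpha, c):.6f}")
```

The Sidak threshold is always slightly larger than the Bonferroni threshold, so Sidak is marginally less conservative.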
Factor Name. This is the name given to the set of repeated measures in GLM.


The C matrix is used to test hypotheses for repeated measures analysis of variance
designs and models with multiple dependent variables. C has as many columns as there
are dependent variables. For most multivariate models, C is an identity matrix.
To specify a D matrix, click D matrix in the Hypothesis Test dialog box.

D is a null hypothesis matrix (usually a null matrix). The D matrix, if you use it, must
have the same number of rows as A. For univariate multiple regression, D has only one
column. For multivariate models (multiple dependent variables), the D matrix has one
column for each dependent variable.
A matrix and D matrix are often used to test hypotheses in regression. Linear
hypotheses in regression have the form Aβ = D, where A is the matrix of linear weights
on coefficients across the independent variables (the rows of β), β is the matrix of
regression coefficients, and D is a null hypothesis matrix (usually a null matrix). The
A and D matrices can be specified in several alternative ways, and if they are not
specified, they have default values.


Using Commands
Select the data with USE filename and continue with:
GLM
MODEL varlist1 = CONSTANT + varlist2 + var1*var2 + ,
var3(var4) / REPEAT=m,n,... REPEAT=m(x1,x2,...),
n(y1,y2,...) NAMES=name1,name2,... , MEANS,
WEIGHT N=n
CATEGORY grpvarlist / MISS EFFECT or DUMMY
SAVE filename / COEF MODEL RESID DATA PARTIAL ADJUSTED
ESTIMATE / MIX TOL=n

For stepwise model building, use START in place of ESTIMATE:


START / FORWARD or BACKWARD TOL=n ENTER=p REMOVE=p ,
FENTER=n FREMOVE=n FORCE=n MAXSTEP=n
STEP no argument or var or index / AUTO ENTER=p,
REMOVE=p FENTER=n FREMOVE=n
STOP

To perform hypothesis tests:


HYPOTHESIS
EFFECT varlist, var1&var2,...
WITHIN name
CONTRAST [matrix] / DIFFERENCE or POLYNOMIAL or SUM
ORDER=n METRIC=m,n,...
SPECIFY hypothesis lang / POOLED or SEPARATE
AMATRIX [matrix]
CMATRIX [matrix]
DMATRIX [matrix]
ALL
POST varlist / LSD or TUKEY or BONF=n or SCHEFFE or,
DUNNETT ONE or TWO CONTROL=levelname,
POOLED or SEPARATE
ROTATE=n
TYPE=CORR or COVAR or SSCP
STAND = TOTAL or WITHIN
FACTOR = HYPOTHESIS or ERROR
ERROR varlist or var1&var2 or value(df) or matrix
PRIORS m n p
TEST

Usage Considerations
Types of data. Normally, you analyze raw cases-by-variables data with General Linear
Model. You can, however, use a symmetric matrix data file (for example, a covariance
matrix saved in a file from Correlations) as input. If you use a matrix as input, you must
specify a value for Cases when estimating the model (under Group in the General
Linear Model dialog box) to specify the sample size of the data file that generated the
matrix. The number you specify must be an integer greater than 2.
Be sure to include the dependent as well as independent variables in your matrix.
SYSTAT picks out the dependent variable you name in your model.
SYSTAT uses the sample size to calculate degrees of freedom in hypothesis tests.
SYSTAT also determines the type of matrix (SSCP, covariance, and so on) and adjusts
appropriately. With a correlation matrix, the raw and standardized coefficients are the
same. You cannot include a constant when using SSCP, covariance, or
correlation matrices: because these matrices are centered, the constant term has
already been removed.
The triangular matrix input facility is useful for meta-analysis of published data
and missing value computations; however, you should heed the following warnings:
First, if you input correlation matrices from textbooks or articles, you may not get the
same regression coefficients as those printed in the source. Because of round-off error,
printed and raw data can lead to different results. Second, if you use pairwise deletion
with Correlations, the degrees of freedom for hypotheses will not be appropriate. You
may not even be able to estimate the regression coefficients because of singularities.
In general, correlation matrices containing missing data produce coefficient
estimates and hypothesis tests that are optimistic. You can correct for this by
specifying a sample size smaller than the number of actual observations (preferably set
it equal to the smallest number of cases used for any pair of variables), but this is a
guess that you can refine only by doing Monte Carlo simulations. There is no simple
solution. Beware, especially, of multivariate regressions (MANOVA and others) with
missing data on the dependent variables. You can usually compute coefficients, but
hypothesis testing produces results that are suspect.
Print options. General Linear Model produces extended output if you set the output length
to LONG or if you select Save scores and results in the Hypothesis Test dialog box.
For model estimation, extended output adds the following: total sum of product
matrix, residual (or pooled within groups) sum of product matrix, residual (or pooled
within groups) covariance matrix, and the residual (or pooled within groups)
correlation matrix.
For hypothesis testing, extended output adds A, C, and D matrices, the matrix of
contrasts, and the inverse of the cross products of contrasts, hypothesis and error sum
of product matrices, tests of residual roots, canonical correlations, coefficients, and
loadings.


Quick Graphs. If no variables are categorical, GLM produces Quick Graphs of residuals
versus predicted values. For categorical predictors, GLM produces graphs of the least
squares means for the levels of the categorical variable(s).
Saving files. Several sets of output can be saved to a file. The actual contents of the
saved file depend on the analysis. Files may include estimated regression coefficients,
model variables, residuals, predicted values, diagnostic statistics, canonical variable
scores, and posterior probabilities (among other statistics).
BY groups. Each level of any BY variables yields a separate analysis.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. GLM uses the FREQUENCY variable, if present, to duplicate cases.
Case weights. GLM uses the values of any WEIGHT variables to weight each case.

Examples
Example 1
One-Way ANOVA
The following data, KENTON, are from Neter, Wasserman, and Kutner (1985). The
data comprise unit sales of a cereal product under different types of package designs.
Ten stores were selected as experimental units. Each store was randomly assigned to
sell one of the package designs (each design was sold at two or three stores).
PACKAGE   SALES
   1        12
   1        18
   2        14
   2        12
   2        13
   3        19
   3        17
   3        21
   4        24
   4        30


Numbers are used to code the four types of package designs; alternatively, you could
have used words. Neter, Wasserman, and Kutner report that cartoons are part of
designs 1 and 3 but not designs 2 and 4; designs 1 and 2 have three colors; and designs
3 and 4 have five colors. Thus, string codes for PACKAGE$ might have been "Cart 3",
"NoCart 3", "Cart 5", and "NoCart 5". Notice that the data do not need to be ordered
by PACKAGE as shown here. The input for a one-way analysis of variance is:
USE kenton
GLM
CATEGORY package
MODEL sales=CONSTANT + package
GRAPH NONE
ESTIMATE

The output follows:


Categorical values encountered during processing are:
PACKAGE (4 levels)
        1,        2,        3,        4

Dep Var: SALES   N: 10   Multiple R: 0.921   Squared multiple R: 0.849

                        Analysis of Variance
Source      Sum-of-Squares    df    Mean-Square    F-ratio        P
PACKAGE            258.000     3         86.000     11.217    0.007
Error               46.000     6          7.667

This is the standard analysis of variance table. The F ratio (11.217) appears significant,
so you could conclude that the package designs differ significantly in their effects on
sales, provided the assumptions are valid.
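The sums of squares and the F ratio in this table can be checked by hand. The following Python sketch, which is independent of SYSTAT and uses only the KENTON values listed above, recomputes them; the p-value is omitted because the Python standard library has no F distribution.

```python
from collections import defaultdict

package = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4]
sales = [12, 18, 14, 12, 13, 19, 17, 21, 24, 30]

# Collect the sales figures for each package design.
groups = defaultdict(list)
for p, s in zip(package, sales):
    groups[p].append(s)

grand_mean = sum(sales) / len(sales)
cell_means = {p: sum(v) / len(v) for p, v in groups.items()}

# Between-groups (PACKAGE) and within-groups (Error) sums of squares.
ss_package = sum(len(v) * (cell_means[p] - grand_mean) ** 2 for p, v in groups.items())
ss_error = sum((y - cell_means[p]) ** 2 for p, v in groups.items() for y in v)

df_package = len(groups) - 1
df_error = len(sales) - len(groups)
ms_package = ss_package / df_package
ms_error = ss_error / df_error
f_ratio = ms_package / ms_error
r_squared = ss_package / (ss_package + ss_error)

print(f"PACKAGE  SS={ss_package:.3f}  df={df_package}  MS={ms_package:.3f}  F={f_ratio:.3f}")
print(f"Error    SS={ss_error:.3f}  df={df_error}  MS={ms_error:.3f}")
print(f"Squared multiple R: {r_squared:.3f}")
```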

Pairwise Multiple Comparisons


SYSTAT offers five methods for comparing pairs of means: Bonferroni, Tukey-Kramer HSD,
Scheffé, Fisher's LSD, and Dunnett's test.
The Dunnett test is available only with one-way designs. Dunnett requires the value
of a control group against which comparisons are made. By default, two-sided tests are
computed. One-sided Dunnett tests are also available. Incidentally, for Dunnett's tests
on experimental data, you should use the one-sided option unless you cannot predict
from theory whether your experimental groups will have higher or lower means than
the control.


Comparisons for the pairwise methods are made across all pairs of least-squares
group means for the design term that is specified. For a multiway design, marginal cell
means are computed for the effects specified before the comparisons are made.
To determine significant differences, simply look for pairs with probabilities below
your critical value (for example, 0.05 or 0.01). All multiple comparison methods
handle unbalanced designs correctly.
After you estimate your ANOVA model, it is easy to do post hoc tests. To do a
Tukey HSD test, first estimate the model, then specify these commands:
HYPOTHESIS
POST package / TUKEY
TEST

The output follows:


COL/
ROW PACKAGE
  1  1
  2  2
  3  3
  4  4
Using least squares means.
Post Hoc test of SALES
Using model MSE of 7.667 with 6 df.
Matrix of pairwise mean differences:
               1         2         3         4
  1          0.0
  2       -2.000       0.0
  3        4.000     6.000       0.0
  4       12.000    14.000     8.000       0.0

Tukey HSD Multiple Comparisons.
Matrix of pairwise comparison probabilities:
               1         2         3         4
  1        1.000
  2        0.856     1.000
  3        0.452     0.130     1.000
  4        0.019     0.006     0.071     1.000

Results show that sales for the fourth package design (five colors and no cartoons) are
significantly larger than those for packages 1 and 2. None of the other pairs differ
significantly.
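The matrix of mean differences can be reproduced from the cell means of the KENTON data. The Python sketch below also computes the Tukey-Kramer Studentized range statistic for each pair, q = |difference| / sqrt((MSE/2)(1/n_i + 1/n_j)). The probabilities themselves require the Studentized range distribution, which is not in the Python standard library, so they are not recomputed here; the function name is hypothetical.

```python
from itertools import combinations
from math import sqrt

cell_means = {1: 15.0, 2: 13.0, 3: 19.0, 4: 27.0}  # mean SALES per package
cell_counts = {1: 2, 2: 3, 3: 3, 4: 2}             # stores per package
mse = 46.0 / 6                                     # model MSE (7.667) with 6 df


def tukey_kramer_q(i, j):
    """Return the mean difference (level j minus level i) and its Tukey-Kramer q statistic."""
    diff = cell_means[j] - cell_means[i]
    se = sqrt(mse / 2 * (1 / cell_counts[i] + 1 / cell_counts[j]))
    return diff, abs(diff) / se


for i, j in combinations(sorted(cell_means), 2):
    diff, q = tukey_kramer_q(i, j)
    print(f"{j} vs {i}: difference = {diff:7.3f}, q = {q:6.3f}")
```

The largest q values belong to the comparisons of package 4 with packages 1 and 2, which are the two pairs flagged as significant above.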


Contrasts
This example uses two contrasts:
• We compare the first and third packages using coefficients of (1, 0, -1, 0).
• We compare the average performance of the first three packages with the last,
  using coefficients of (1, 1, 1, -3).


The input is:
HYPOTHESIS
EFFECT = package
CONTRAST [1 0 -1 0]
TEST
HYPOTHESIS
EFFECT = package
CONTRAST [1 1 1 -3]
TEST

For each hypothesis, we specify one contrast, so the test has one degree of freedom;
therefore, the contrast matrix has one row of numbers. These numbers are the same
ones you see in ANOVA textbooks, although ANOVA offers one advantage: you do
not have to standardize them so that their sum of squares is 1. The output follows:
Test for effect called:     PACKAGE

A Matrix
               1         2         3         4
             0.0     1.000       0.0    -1.000

Test of Hypothesis
Source            SS    df         MS         F         P
Hypothesis    19.200     1     19.200     2.504     0.165
Error         46.000     6      7.667

-------------------------------------------------------------------------------
Test for effect called:     PACKAGE

A Matrix
               1         2         3         4
             0.0     4.000     4.000     4.000

Test of Hypothesis
Source            SS    df         MS         F         P
Hypothesis   204.000     1    204.000    26.609     0.002
Error         46.000     6      7.667

For the first contrast, the F statistic (2.504) is not significant, so you cannot conclude
that the impact of the first and third package designs on sales is significantly different.


Incidentally, the A matrix contains the contrast. The first column (0) corresponds to the
constant in the model, and the remaining three columns (1 0 -1) correspond to the
dummy variables for PACKAGE.
The last package design is significantly different from the other three taken as a
group. Notice that the A matrix looks much different this time. Because the effects sum
to 0, the last effect is minus the sum of the other three; that is, letting αᵢ denote the
effect for level i of package,
α₁ + α₂ + α₃ + α₄ = 0

so
α₄ = −(α₁ + α₂ + α₃)

and the contrast is
α₁ + α₂ + α₃ − 3α₄

which is
α₁ + α₂ + α₃ − 3(−α₁ − α₂ − α₃)

which simplifies to
4*α₁ + 4*α₂ + 4*α₃

Remember, SYSTAT does all this work automatically.
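As a check on the two hypothesis tables, the sum of squares for a single-degree-of-freedom contrast on cell means can be computed directly as (Σ cᵢ ȳᵢ)² / Σ (cᵢ² / nᵢ). The Python sketch below applies this to the KENTON cell means; it is a hand calculation for illustration, with a hypothetical function name, and not the algorithm SYSTAT uses internally.

```python
cell_means = [15.0, 13.0, 19.0, 27.0]  # mean SALES for packages 1 to 4
cell_counts = [2, 3, 3, 2]             # stores per package
mse = 46.0 / 6                         # error mean square (7.667) with 6 df


def contrast_test(coefs):
    """Return the hypothesis sum of squares and F ratio for one contrast on the cell means."""
    estimate = sum(c * m for c, m in zip(coefs, cell_means))
    ss = estimate ** 2 / sum(c * c / n for c, n in zip(coefs, cell_counts))
    return ss, ss / mse


for coefs in ([1, 0, -1, 0], [1, 1, 1, -3]):
    ss, f = contrast_test(coefs)
    print(f"contrast {coefs}: SS = {ss:.3f}, F = {f:.3f}")
```

Multiplying all coefficients of a contrast by a constant leaves both the sum of squares and F unchanged, which is why the weights do not have to be standardized.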

Orthogonal Polynomials
Constructing orthogonal polynomials for between-group factors is useful when the
levels of a factor are ordered. To construct orthogonal polynomials for your between-groups factors:
HYPOTHESIS
EFFECT = package
CONTRAST / POLYNOMIAL ORDER=2
TEST


The output is:


Test for effect called:      PACKAGE

A Matrix
               1          2          3          4
             0.0        0.0     -1.000     -1.000

Test of Hypothesis
     Source          SS    df         MS          F          P
 Hypothesis      60.000     1     60.000      7.826      0.031
      Error      46.000     6      7.667

Make sure that the levels of the factor (after they are sorted by the procedure
numerically or alphabetically) are ordered meaningfully on a latent dimension. If you
need a specific order, use LABEL or ORDER; otherwise, the results will not make sense.
In the example, the significant quadratic effect is the result of the fourth package
having a much larger sales volume than the other three.
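For readers who want to see what such coefficients look like, here is a hedged Python sketch that builds orthonormal polynomial contrasts by Gram-Schmidt orthogonalization of the powers of the level values. It assumes equally spaced levels 1 to 4 and is not a description of SYSTAT's internal algorithm; the function name is illustrative.

```python
import math

def orthogonal_polynomials(levels, max_order):
    """Orthonormal polynomial contrasts of orders 1..max_order over the given levels."""
    k = len(levels)
    basis = [[1.0 / math.sqrt(k)] * k]        # order 0: normalized constant
    for order in range(1, max_order + 1):
        vec = [float(x) ** order for x in levels]
        for b in basis:                       # remove all lower-order components
            proj = sum(v * bi for v, bi in zip(vec, b))
            vec = [v - proj * bi for v, bi in zip(vec, b)]
        norm = math.sqrt(sum(v * v for v in vec))
        basis.append([v / norm for v in vec])
    return basis[1:]                          # drop the constant

linear, quadratic = orthogonal_polynomials([1, 2, 3, 4], 2)
print([round(c, 3) for c in quadratic])       # [0.5, -0.5, -0.5, 0.5]
```

Rewriting the quadratic coefficients (0.5, -0.5, -0.5, 0.5) in terms of the first three effects (subtract the last coefficient from each of the others) gives 0, -1, -1, which matches the A matrix in the output.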

Effect and Dummy Coding


The effects in a least-squares analysis of variance are associated with a set of dummy
variables that SYSTAT generates automatically. Ordinarily, you do not have to
concern yourself with these dummy variables; however, if you want to see them, you
can save them into a SYSTAT file. The input is:
USE kenton
GLM
CATEGORY package
MODEL sales=CONSTANT + package
GRAPH NONE
SAVE mycodes / MODEL
ESTIMATE
USE mycodes
FORMAT 12,0
LIST SALES x(1..3)


The listing of the dummy variables follows:

Case Number    SALES    X(1)    X(2)    X(3)
      1           12       1       0       0
      2           18       1       0       0
      3           14       0       1       0
      4           12       0       1       0
      5           13       0       1       0
      6           19       0       0       1
      7           17       0       0       1
      8           21       0       0       1
      9           24      -1      -1      -1
     10           30      -1      -1      -1

The variables X(1), X(2), and X(3) are the effects coding dummy variables generated
by the procedure. All cases in the first cell are associated with dummy values 1 0 0;
those in the second cell with 0 1 0; the third, 0 0 1; and the fourth, -1 -1 -1. Other
least-squares programs use different methods to code dummy variables. The coding used by
SYSTAT is most widely used and guarantees that the effects sum to 0.
If you had used dummy coding, these dummy variables would be saved:
SALES    X(1)    X(2)    X(3)
   12       1       0       0
   18       1       0       0
   14       0       1       0
   12       0       1       0
   13       0       1       0
   19       0       0       1
   17       0       0       1
   21       0       0       1
   24       0       0       0
   30       0       0       0

This coding yields parameter estimates that are the differences between the mean for
each group and the mean of the last group.
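The difference between the two coding schemes is easiest to see side by side. The following Python sketch is illustrative only (the function `code_level` and its `scheme` argument are assumptions, not part of SYSTAT); it returns the k - 1 design values for one case given its factor level.

```python
def code_level(level, n_levels, scheme="effect"):
    """Return the n_levels - 1 design values for a 1-based factor level."""
    if not 1 <= level <= n_levels:
        raise ValueError("level out of range")
    if level == n_levels:
        # The last level is -1 on every column under effects coding
        # and 0 on every column under dummy (reference-cell) coding.
        fill = -1 if scheme == "effect" else 0
        return [fill] * (n_levels - 1)
    return [1 if j == level else 0 for j in range(1, n_levels)]

print(code_level(2, 4))                   # [0, 1, 0]
print(code_level(4, 4))                   # [-1, -1, -1]
print(code_level(4, 4, scheme="dummy"))   # [0, 0, 0]
```

The two schemes differ only in how the last level is represented, which is why the effects-coded estimates are deviations from the mean of the cell means while the dummy-coded estimates are differences from the last group.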


Example 2
Randomized Block Designs
A randomized block design is like a factorial design without an interaction term. The
following example is from Neter, Wasserman, and Kutner (1985). Five blocks of
judges were given the task of analyzing three treatments. Judges are stratified within
blocks, so the interaction of blocks and treatments cannot be analyzed. These data are
in the file BLOCK. The input is:
USE block
GLM
CATEGORY block, treat
MODEL judgment = CONSTANT + block + treat
ESTIMATE

You must use GLM instead of ANOVA because you do not want the BLOCK*TREAT
interaction in the model. The output is:
Dep Var: JUDGMENT   N: 15   Multiple R: 0.970   Squared multiple R: 0.940

Analysis of Variance
Source     Sum-of-Squares    df    Mean-Square    F-ratio        P
BLOCK             171.333     4         42.833     14.358    0.001
TREAT             202.800     2        101.400     33.989    0.000
Error              23.867     8          2.983

Example 3
Incomplete Block Designs
Randomized blocks can be used in factorial designs. Here is an example from John
(1971). The data (in the file JOHN) involve an experiment with three treatment factors
(A, B, and C) plus a blocking variable with eight levels. Notice that data were collected
on 32 of the possible 64 experimental situations.

BLOCK   A   B   C     Y        BLOCK   A   B   C     Y
  1     1   1   1   101          5     1   1   1    87
  1     2   1   2   373          5     2   1   2   324
  1     1   2   2   398          5     1   2   1   279
  1     2   2   1   291          5     2   2   2   471
  2     1   1   2   312          6     1   1   2   323
  2     2   1   1   106          6     2   1   1   128
  2     1   2   1   265          6     1   2   2   423
  2     2   2   2   450          6     2   2   1   334
  3     1   1   1   106          7     1   1   1   131
  3     2   2   1   306          7     2   1   1   103
  3     1   1   2   324          7     1   2   2   445
  3     2   2   2   449          7     2   2   2   437
  4     1   2   1   272          8     1   1   2   324
  4     2   1   1    89          8     2   1   2   361
  4     1   2   2   407          8     1   2   1   302
  4     2   1   2   338          8     2   2   1   272

The input is:


USE john
GLM
CATEGORY block, a, b, c
MODEL y = CONSTANT + block + a + b + c +,
a*b + a*c + b*c + a*b*c
ESTIMATE

The output follows:


Dep Var: Y   N: 32   Multiple R: 0.994   Squared multiple R: 0.988

Analysis of Variance
Source     Sum-of-Squares    df    Mean-Square    F-ratio        P
BLOCK            2638.469     7        376.924      1.182    0.364
A                3465.281     1       3465.281     10.862    0.004
B              161170.031     1     161170.031    505.209    0.000
C              278817.781     1     278817.781    873.992    0.000
A*B                28.167     1         28.167      0.088    0.770
A*C              1802.667     1       1802.667      5.651    0.029
B*C             11528.167     1      11528.167     36.137    0.000
A*B*C              45.375     1         45.375      0.142    0.711
Error            5423.281    17        319.017


Example 4
Fractional Factorial Designs
Sometimes a factorial design involves so many combinations of treatments that certain
cells must be left empty to save experimental resources. At other times, a complete
randomized factorial study is designed, but loss of subjects leaves one or more cells
completely missing. These models are similar to incomplete block designs because not
all effects in the full model can be estimated. Usually, certain interactions must be left
out of the model.
The following example uses some experimental data that contain values in only 8
out of 16 possible cells. Each cell contains two cases. The pattern of nonmissing cells
makes it possible to estimate only the main effects plus three two-way interactions. The
data are in the file FRACTION.
A   B   C   D     Y
1   1   1   1     7
1   1   1   1     3
2   2   1   1     1
2   2   1   1     2
2   1   2   1    12
2   1   2   1    13
1   2   2   1    14
1   2   2   1    15
2   1   1   2     8
2   1   1   2     6
1   2   1   2    12
1   2   1   2    10
1   1   2   2     6
1   1   2   2     4
2   2   2   2     6
2   2   2   2     7

The input follows:


USE fraction
GLM
CATEGORY a, b, c, d
MODEL y = CONSTANT + a + b + c + d + a*b + a*c + b*c
ESTIMATE


We must use GLM instead of ANOVA to omit the higher-way interactions that ANOVA
automatically generates. The output is:
Dep Var: Y   N: 16   Multiple R: 0.972   Squared multiple R: 0.944

Analysis of Variance
Source     Sum-of-Squares    df    Mean-Square    F-ratio        P
A                  16.000     1         16.000      8.000    0.022
B                   4.000     1          4.000      2.000    0.195
C                  49.000     1         49.000     24.500    0.001
D                   4.000     1          4.000      2.000    0.195
A*B               182.250     1        182.250     91.125    0.000
A*C                12.250     1         12.250      6.125    0.038
B*C                 2.250     1          2.250      1.125    0.320
Error              16.000     8          2.000

When missing cells turn up by chance rather than by design, you may not know which
interactions to eliminate. When you attempt to fit the full model, SYSTAT informs you
that the design is singular. In that case, you may need to try several models before
finding an estimable one. It is usually best to begin by leaving out the highest-order
interaction (A*B*C*D in this example). Continue with subset models until you get an
ANOVA table.
Looking for an estimable model is not the same as analyzing the data with stepwise
regression because you are not looking at p values. After you find an estimable model,
stop and settle with the statistics printed in the ANOVA table.

Example 5
Nested Designs
Nested designs resemble factorial designs with certain cells missing (incomplete
factorials). This is because one factor is nested under another, so that not all
combinations of the two factors are observed. For example, in an educational study,
classrooms are usually nested under schools because it is impossible to have the same
classroom existing at two different schools (except as antimatter). The following
example (in which teachers are nested within schools) is from Neter, Wasserman, and
Kutner (1985). The data (learning scores) look like this:
            TEACHER1    TEACHER2
SCHOOL1      25  29      14  11
SCHOOL2      11   6      22  18
SCHOOL3      17  20       5   2

In the study, there are actually six teachers, not just two; thus, the design really looks
like this:
            TEACHER1   TEACHER2   TEACHER3   TEACHER4   TEACHER5   TEACHER6
SCHOOL1      25  29     14  11
SCHOOL2                            11   6     22  18
SCHOOL3                                                  17  20      5   2

The data are set up in the file SCHOOLS.

TEACHER   SCHOOL   LEARNING
   1         1         25
   1         1         29
   2         1         14
   2         1         11
   3         2         11
   3         2          6
   4         2         22
   4         2         18
   5         3         17
   5         3         20
   6         3          5
   6         3          2


The input is:


USE schools
GLM
CATEGORY teacher, school
MODEL learning = CONSTANT + school + teacher(school)
ESTIMATE

The output follows:


Dep Var: LEARNING   N: 12   Multiple R: 0.972   Squared multiple R: 0.945

Analysis of Variance
Source              Sum-of-Squares    df    Mean-Square    F-ratio        P
SCHOOL                     156.500     2         78.250     11.179    0.009
TEACHER(SCHOOL)            567.500     3        189.167     27.024    0.001
Error                       42.000     6          7.000

Your data can use any codes for TEACHER, including a separate code for every teacher
in the study, as long as each different teacher within a given school has a different code.
GLM will use the nesting specified in the MODEL statement to determine the pattern of
nesting. You can, for example, allow teachers in different schools to share codes.
This example is a balanced nested design. Unbalanced designs (unequal number of
cases per cell) are handled automatically in SYSTAT because the estimation method
is least squares.
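Because this particular design is balanced, the sums of squares in the table can be reproduced from cell, school, and grand means. The Python sketch below is a hand check under that balance assumption, not the general least-squares method GLM uses; it takes the twelve rows exactly as listed for the SCHOOLS file.

```python
from collections import defaultdict

# (teacher, school, learning) rows as listed for the SCHOOLS file.
rows = [(1, 1, 25), (1, 1, 29), (2, 1, 14), (2, 1, 11),
        (3, 2, 11), (3, 2, 6), (4, 2, 22), (4, 2, 18),
        (5, 3, 17), (5, 3, 20), (6, 3, 5), (6, 3, 2)]

def mean(values):
    return sum(values) / len(values)

by_school = defaultdict(list)
by_teacher = defaultdict(list)
for teacher, school, y in rows:
    by_school[school].append(y)
    by_teacher[(school, teacher)].append(y)

grand = mean([y for _, _, y in rows])
# Schools: deviations of school means from the grand mean.
ss_school = sum(len(v) * (mean(v) - grand) ** 2 for v in by_school.values())
# Teachers within schools: deviations of teacher means from their school mean.
ss_teacher = sum(len(v) * (mean(v) - mean(by_school[s])) ** 2
                 for (s, _), v in by_teacher.items())
# Error: deviations of scores from their teacher mean.
ss_error = sum((y - mean(v)) ** 2 for v in by_teacher.values() for y in v)

ms_error = ss_error / 6
print(ss_school, ss_teacher, ss_error)
print(round(ss_school / 2 / ms_error, 3), round(ss_teacher / 3 / ms_error, 3))
```

The degrees of freedom used here (2 for schools, 3 for teachers within schools, 6 for error) follow from three schools, six teachers, and twelve cases.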

Example 6
Split Plot Designs
The split plot design is closely related to the nested design. In the split plot, however,
plots are often considered a random factor; therefore, you have to construct different
error terms to test different effects. The following example involves two treatments: A
(between plots) and B (within plots). The numbers in the cells are the YIELD of the
crop within plots.
               A1                A2
         PLOT1   PLOT2     PLOT3   PLOT4
B1         0       3         4       5
B2         0       1         2       4
B3         5       5         7       6
B4         3       4         8       6

Here are the data from the PLOTS data file in the form needed by SYSTAT:
PLOT   A   B   YIELD
  1    1   1     0
  1    1   2     0
  1    1   3     5
  1    1   4     3
  2    1   1     3
  2    1   2     1
  2    1   3     5
  2    1   4     4
  3    2   1     4
  3    2   2     2
  3    2   3     7
  3    2   4     8
  4    2   1     5
  4    2   2     4
  4    2   3     6
  4    2   4     6

To analyze this design, you need two different error terms. For the between-plots
effects (A), you need plots within A. For the within-plots effects (B and A*B), you
need B by plots within A.
First, fit the saturated model with all the effects and then specify different error
terms as needed. The input is:
USE plots
GLM
CATEGORY plot, a, b
MODEL yield = CONSTANT + a + b + a*b + plot(a) + b*plot(a)
ESTIMATE

The output follows:


Dep Var: YIELD   N: 16   Multiple R: 1.000   Squared multiple R: 1.000

Analysis of Variance
Source        Sum-of-Squares    df    Mean-Square    F-ratio        P
A                     27.563     1         27.563          .        .
B                     42.688     3         14.229          .        .
A*B                    2.188     3          0.729          .        .
PLOT(A)                3.125     2          1.562          .        .
B*PLOT(A)              7.375     6          1.229          .        .
Error                    0.0


You do not get a full ANOVA table because the model is perfectly fit. The coefficient
of determination (Squared multiple R) is 1. Now you have to use some of the effects
as error terms.

Between-Plots Effects
Let's test for between-plots effects, namely A. The input is:
HYPOTHESIS
EFFECT = a
ERROR = plot(a)
TEST

The output is:


Test for effect called:      A

Test of Hypothesis
     Source          SS    df         MS          F          P
 Hypothesis      27.563     1     27.563     17.640      0.052
      Error       3.125     2      1.562

The between-plots effect is not significant (p = 0.052).

Within-Plots Effects
To do the within-plots effects (B and A*B), the input is:
HYPOTHESIS
EFFECT = b
ERROR = b*plot(a)
TEST
HYPOTHESIS
EFFECT = a*b
ERROR = b*plot(a)
TEST

The output follows:


Test for effect called:      B

Test of Hypothesis
     Source          SS    df         MS          F          P
 Hypothesis      42.687     3     14.229     11.576      0.007
      Error       7.375     6      1.229

-------------------------------------------------------------------------------
Test for effect called:      A*B

Test of Hypothesis
     Source          SS    df         MS          F          P
 Hypothesis       2.188     3      0.729      0.593      0.642
      Error       7.375     6      1.229

Here, we find a significant effect due to factor B (p = 0.007), but the interaction is not
significant (p = 0.642).
This analysis is the same as that for a repeated measures design with subjects as
PLOT, groups as A, and trials as B. Because this method becomes unwieldy for a large
number of plots (subjects), SYSTAT offers a more compact method for repeated
measures analysis as an alternative.
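Each of the tests in this example is just a ratio of two mean squares taken from the saturated-model table, with the error term chosen by the analyst. The following Python sketch illustrates the arithmetic only (the helper `f_ratio` is hypothetical); the sums of squares and degrees of freedom are copied from the output above.

```python
def f_ratio(ss_effect, df_effect, ss_error, df_error):
    """F statistic for an effect tested against a user-chosen error term."""
    return (ss_effect / df_effect) / (ss_error / df_error)

# Between-plots effect A is tested against PLOT(A).
f_a = f_ratio(27.563, 1, 3.125, 2)
# Within-plots effects B and A*B are tested against B*PLOT(A).
f_b = f_ratio(42.688, 3, 7.375, 6)
f_ab = f_ratio(2.188, 3, 7.375, 6)

print(round(f_a, 3), round(f_b, 3), round(f_ab, 3))
```

The p values reported by SYSTAT come from the F distribution with the corresponding pairs of degrees of freedom (1 and 2 for A; 3 and 6 for B and A*B); this sketch does not compute them.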

Example 7
Latin Square Designs
A Latin square design imposes a pattern on treatments in a factorial design to save
experimental effort or reduce within cell error. As in the nested design, not all
combinations of the square and other treatments are measured, so the model lacks
certain interaction terms between squares and treatments. GLM can analyze these
designs easily if an extra variable denoting the square is included in the file. The
following fixed effects example is from Neter, Wasserman, and Kutner (1985). The
SQUARE variable is represented in the cells of the design. For simplicity, the
dependent variable, RESPONSE, has been left out.
          day1   day2   day3   day4   day5
week1       D      C      A      B      E
week2       C      B      E      A      D
week3       A      D      B      E      C
week4       E      A      C      D      B
week5       B      E      D      C      A

You would set up the data as shown below (the LATIN file).
DAY   WEEK   SQUARE   RESPONSE
 1      1      D         18
 1      2      C         17
 1      3      A         14
 1      4      E         21
 1      5      B         17
 2      1      C         13
 2      2      B         34
 2      3      D         21
 2      4      A         16
 2      5      E         15
 3      1      A          7
 3      2      E         29
 3      3      B         32
 3      4      C         27
 3      5      D         13
 4      1      B         17
 4      2      A         13
 4      3      E         24
 4      4      D         31
 4      5      C         25
 5      1      E         21
 5      2      D         26
 5      3      C         26
 5      4      B         31
 5      5      A          7

To do the analysis, the input is:


USE latin
GLM
CATEGORY day, week, square
MODEL response = CONSTANT + day + week + square
ESTIMATE


The output follows:


Dep Var: RESPONSE   N: 25   Multiple R: 0.931   Squared multiple R: 0.867

Analysis of Variance
Source     Sum-of-Squares    df    Mean-Square    F-ratio        P
DAY                82.000     4         20.500      1.306    0.323
WEEK              477.200     4        119.300      7.599    0.003
SQUARE            664.400     4        166.100     10.580    0.001
Error             188.400    12         15.700

Example 8
Crossover and Changeover Designs
In crossover designs, an experiment is divided into periods, and the treatment of a
subject changes from one period to the next. Changeover studies often use designs
similar to a Latin square. A problem with these designs is that there may be a residual
or carry-over effect of a treatment into the following period. This can be minimized by
extending the interval between experimental periods; however, this is not always
feasible. Fortunately, there are methods to assess the magnitude of any carry-over
effects that may be present.
Two-period crossover designs can be analyzed as repeated-measures designs. More
complicated crossover designs can also be analyzed by SYSTAT, and carry-over
effects can be assessed. Cochran and Cox (1957) present a study of milk production by
cows under three different feed schedules: A (roughage), B (limited grain), and C (full
grain). The design of the study has the form of two (3 × 3) Latin squares:

             Latin square 1         Latin square 2
                  COW                    COW
Period       I     II    III        IV     V     VI
   1         A     B     C          A      B     C
   2         B     C     A          C      A     B
   3         C     A     B          B      C     A

The data are set up in the WILLIAMS data file as follows:


COW   SQUARE   PERIOD   FEED   CARRY   RESIDUAL   MILK
 1      1        1        1      1        0         38
 1      1        2        2      1        1         25
 1      1        3        3      2        2         15
 2      1        1        2      1        0        109
 2      1        2        3      2        2         86
 2      1        3        1      2        3         39
 3      1        1        3      1        0        124
 3      1        2        1      2        3         72
 3      1        3        2      1        1         27
 4      2        1        1      1        0         86
 4      2        2        3      1        1         76
 4      2        3        2      2        3         46
 5      2        1        2      1        0         75
 5      2        2        1      2        2         35
 5      2        3        3      1        1         34
 6      2        1        3      1        0        101
 6      2        2        2      2        3         63
 6      2        3        1      2        2          1

PERIOD is nested within each Latin square (the periods for cows in one square are
unrelated to the periods in the other). The variable RESIDUAL indicates the treatment
of the preceding period. For the first period for each cow, there is no preceding period.
The input is:
USE williams
GLM
CATEGORY cow, period, square, residual, carry, feed
MODEL milk = CONSTANT + cow + feed +,
period(square) + residual(carry)
ESTIMATE


The output follows:


Dep Var: MILK   N: 18   Multiple R: 0.995   Squared multiple R: 0.990

Analysis of Variance
Source              Sum-of-Squares    df    Mean-Square    F-ratio        P
COW                       3835.950     5        767.190     15.402    0.010
FEED                      2854.550     2       1427.275     28.653    0.004
PERIOD(SQUARE)            3873.950     4        968.488     19.443    0.007
RESIDUAL(CARRY)            616.194     2        308.097      6.185    0.060
Error                      199.250     4         49.813

There is a significant effect of feed on milk production and a nonsignificant residual or
carry-over effect in this instance.

Type I Sums-of-Squares Analysis


To replicate the Cochran and Cox Type I sums-of-squares analysis, you must fit a new
model to get their sums of squares. The following commands test the COW effect.
Notice that the Error specification uses the mean square error (MSE) from the previous
analysis. It also contains the error degrees of freedom (4) from the previous model.
USE williams
GLM
CATEGORY cow
MODEL milk = CONSTANT + cow
ESTIMATE
HYPOTHESIS
EFFECT = cow
ERROR = 49.813(4)
TEST

The output follows:


Dep Var: MILK   N: 18   Multiple R: 0.533   Squared multiple R: 0.284

Analysis of Variance
Source     Sum-of-Squares    df    Mean-Square    F-ratio        P
COW              5781.111     5       1156.222      0.952    0.484
Error           14581.333    12       1215.111

-------------------------------------------------------------------------------
Test for effect called:      COW

Test of Hypothesis
     Source          SS    df         MS          F          P
 Hypothesis    5781.111     5   1156.222     23.211      0.005
      Error     199.252     4     49.813

The remaining term, PERIOD, requires a different model. PERIOD is nested within
SQUARE.
USE williams
GLM
CATEGORY period square
MODEL milk = CONSTANT + period(square)
ESTIMATE
HYPOTHESIS
EFFECT = period(square)
ERROR = 49.813(4)
TEST

The resulting output is:


Dep Var: MILK   N: 18   Multiple R: 0.751   Squared multiple R: 0.564

Analysis of Variance
Source             Sum-of-Squares    df    Mean-Square    F-ratio        P
PERIOD(SQUARE)          11489.111     4       2872.278      4.208    0.021
Error                    8873.333    13        682.564

-------------------------------------------------------------------------------
Test for effect called:      PERIOD(SQUARE)

Test of Hypothesis
     Source          SS    df         MS          F          P
 Hypothesis   11489.111     4   2872.278     57.661      0.001
      Error     199.252     4     49.813

Example 9
Missing Cells Designs (the Means Model)
When cells are completely missing in a factorial design, parameterizing a model can
be difficult. The full model cannot be estimated. GLM offers a means model
parameterization so that missing cell parameters can be dropped automatically from
the model, and hypotheses for main effects and interactions can be tested by specifying
cells directly. Examine Searle (1987), Hocking (1985), or Milliken and Johnson (1984)
for more information in this area.


Widely favored for this purpose by statisticians (Searle, 1987; Hocking, 1985;
Milliken and Johnson, 1984), the means model allows:
• Tests of hypotheses in missing cells designs (using Type IV sums of squares)
• Tests of simple hypotheses (for example, within levels of other factors)
• The use of population weights to reflect differences in subclass sizes

Effects coding is the default for GLM. Alternatively, means models code predictors as
cell means rather than effects, which differ from a grand mean. The constant is omitted,
and the predictors are 1 for a case belonging to a given cell and 0 for all others. When
cells are missing, GLM automatically excludes null columns and estimates the
submodel.
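As a rough illustration of this coding, the Python sketch below builds one indicator column per observed cell, so a cell with no cases never receives a column. It is a conceptual sketch rather than SYSTAT's implementation; the function `means_design` and the six example cases are hypothetical.

```python
def means_design(cells):
    """Build a means-model design: one 0/1 column per cell that actually occurs."""
    observed = sorted(set(cells))          # empty cells never appear here
    matrix = [[1 if cell == col else 0 for col in observed] for cell in cells]
    return observed, matrix

# Hypothetical cases from a 2 x 3 layout in which cell (a=2, b=3) is empty.
cases = [(1, 1), (1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]
columns, X = means_design(cases)

print(columns)        # five observed cells, no column for (2, 3)
for row in X:
    print(row)
```

Each row contains a single 1, so the coefficient for a column is simply the mean of the cases in that cell, and no constant is needed.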
The categorical variables are specified in the MODEL statement differently for a
means model than for an effects model. Here are some examples:
MODEL y = a*b / MEANS
MODEL y = group*age*school$ / MEANS

The first two models generate fully factorial designs (A by B and GROUP by AGE by
SCHOOL$). Notice that they omit the constant and main effects parameters because the
means model does not include effects or a grand mean. Nevertheless, the number of
parameters is the same in the two models. The following are the effects model and the
means model, respectively, for a 2 × 3 design (two levels of A and three levels of B):
MODEL y = CONSTANT + A + B + A*B

A   B    CONSTANT    a1    b1    b2    a1b1    a1b2
1   1        1        1     1     0      1       0
1   2        1        1     0     1      0       1
1   3        1        1    -1    -1     -1      -1
2   1        1       -1     1     0     -1       0
2   2        1       -1     0     1      0      -1
2   3        1       -1    -1    -1      1       1

MODEL y = A*B / MEANS

A   B    a1b1    a1b2    a1b3    a2b1    a2b2    a2b3
1   1      1       0       0       0       0       0
1   2      0       1       0       0       0       0
1   3      0       0       1       0       0       0
2   1      0       0       0       1       0       0
2   2      0       0       0       0       1       0
2   3      0       0       0       0       0       1

Means and effects models can be blended for incomplete factorials and other designs.
All crossed terms (for example, A*B) will be coded with means design variables
(provided the MEANS option is present), and the remaining terms will be coded as
effects. The constant must be omitted, even in these cases, because it is collinear with
the means design variables. All covariates and effects that are coded factors must
precede the crossed factors in the MODEL statement.
Here is an example, assuming A has four levels, B has two, and C has three. In this
design, there are 24 possible cells, but only 12 are nonmissing. The treatment
combinations are partially balanced across the levels of B and C.
MODEL y = A + B*C / MEANS

A   B   C     a1   a2   a3    b1c1   b1c2   b1c3   b2c1   b2c2   b2c3
1   1   1      1    0    0      1      0      0      0      0      0
3   1   1      0    0    1      1      0      0      0      0      0
2   1   2      0    1    0      0      1      0      0      0      0
4   1   2     -1   -1   -1      0      1      0      0      0      0
1   1   3      1    0    0      0      0      1      0      0      0
4   1   3     -1   -1   -1      0      0      1      0      0      0
2   2   1      0    1    0      0      0      0      1      0      0
3   2   1      0    0    1      0      0      0      1      0      0
2   2   2      0    1    0      0      0      0      0      1      0
4   2   2     -1   -1   -1      0      0      0      0      1      0
1   2   3      1    0    0      0      0      0      0      0      1
3   2   3      0    0    1      0      0      0      0      0      1

Nutritional Knowledge Survey


The following example, which uses the data file MJ202, is from Milliken and Johnson
(1984). The data are from a home economics survey experiment. DIFF is the change
in test scores between pre-test and post-test on a nutritional knowledge questionnaire.
GROUP classifies whether or not a subject received food stamps. AGE designates four
age groups, and RACE$ was their term for designating Whites, Blacks, and Hispanics.
                   Group 0                      Group 1
RACE$    AGE:   1     2     3     4          1     2     3     4
  W             1     3     6                9    10    13    15
  B                   2     4     7          8          11    14
  H                         5                           12

Empty cells denote age/race combinations for which no data were collected. Numbers
within cells refer to cell designations in the Fisher LSD pairwise mean comparisons at
the end of this example.
First, fit the model. The input is:
USE mj202
GLM
CATEGORY group age race$
MODEL diff = group*age*race$ / MEANS
ESTIMATE

The output follows:


Means Model
Dep Var: DIFF   N: 107   Multiple R: 0.538   Squared multiple R: 0.289

***WARNING***
Missing cells encountered. Tests of factors will not appear.
Ho: All means equal.
Unweighted Means Model

Analysis of Variance
Source     Sum-of-Squares    df    Mean-Square    F-ratio        P
Model            1068.546    14         76.325      2.672    0.003
Error            2627.472    92         28.559

We need to test the GROUP main effect. The following notation is equivalent to
Milliken and Johnson's. Because of the missing cells, the GROUP effect must be
computed over means that are balanced across the other factors.
In the drawing at the beginning of this example, notice that this specification
contrasts all the numbered cells in group 0 (except 2) with all the numbered cells in
group 1 (except 8 and 15). The input is:
HYPOTHESIS
NOTE GROUP MAIN EFFECT
SPECIFY ,
group[0] age[1] race$[W] + group[0] age[2] race$[W] +,
group[0] age[3] race$[B] + group[0] age[3] race$[H] +,
group[0] age[3] race$[W] + group[0] age[4] race$[B] =,
group[1] age[1] race$[W] + group[1] age[2] race$[W] +,
group[1] age[3] race$[B] + group[1] age[3] race$[H] +,
group[1] age[3] race$[W] + group[1] age[4] race$[B]
TEST

The output is:

Hypothesis.
A Matrix
               1          2          3          4          5
          -1.000        0.0     -1.000     -1.000     -1.000

               6          7          8          9         10
          -1.000     -1.000        0.0      1.000      1.000

              11         12         13         14         15
           1.000      1.000      1.000      1.000        0.0

Null hypothesis value for D
             0.0

Test of Hypothesis
     Source          SS    df         MS          F          P
 Hypothesis      75.738     1     75.738      2.652      0.107
      Error    2627.472    92     28.559

The computations for the AGE main effect are similar to those for the GROUP main
effect:
HYPOTHESIS
NOTE AGE MAIN EFFECT
SPECIFY ,
GROUP[1] AGE[1] RACE$[B] + GROUP[1] AGE[1] RACE$[W] =,
GROUP[1] AGE[4] RACE$[B] + GROUP[1] AGE[4] RACE$[W];,
GROUP[0] AGE[2] RACE$[B] + GROUP[1] AGE[2] RACE$[W] =,
GROUP[0] AGE[4] RACE$[B] + GROUP[1] AGE[4] RACE$[W];,
GROUP[0] AGE[3] RACE$[B] + GROUP[1] AGE[3] RACE$[B] +,
GROUP[1] AGE[3] RACE$[W] =,
GROUP[0] AGE[4] RACE$[B] + GROUP[1] AGE[4] RACE$[B] +,
GROUP[1] AGE[4] RACE$[W]
TEST

The output follows:


Hypothesis.
A Matrix
               1          2          3          4          5
  1          0.0        0.0        0.0        0.0        0.0
  2          0.0     -1.000        0.0        0.0        0.0
  3          0.0        0.0        0.0     -1.000        0.0

               6          7          8          9         10
  1          0.0        0.0     -1.000     -1.000        0.0
  2          0.0      1.000        0.0        0.0     -1.000
  3          0.0      1.000        0.0        0.0        0.0

              11         12         13         14         15
  1          0.0        0.0        0.0      1.000      1.000
  2          0.0        0.0        0.0        0.0      1.000
  3       -1.000        0.0     -1.000      1.000      1.000

D Matrix
  1          0.0
  2          0.0
  3          0.0

Test of Hypothesis
     Source          SS    df         MS          F          P
 Hypothesis      41.526     3     13.842      0.485      0.694
      Error    2627.472    92     28.559

The GROUP by AGE interaction requires more complex balancing than the main
effects. It is derived from a subset of the means in the following specified combination.
Again, check Milliken and Johnson to see the correspondence.

The input is:

HYPOTHESIS
NOTE GROUP BY AGE INTERACTION
SPECIFY ,
group[0] age[1] race$[W] - group[0] age[3] race$[W] -,
group[1] age[1] race$[W] + group[1] age[3] race$[W] +,
group[0] age[3] race$[B] - group[0] age[4] race$[B] -,
group[1] age[3] race$[B] + group[1] age[4] race$[B]=0.0;,
group[0] age[2] race$[W] - group[0] age[3] race$[W] -,
group[1] age[2] race$[W] + group[1] age[3] race$[W] +,
group[0] age[3] race$[B] - group[0] age[4] race$[B] -,
group[1] age[3] race$[B] + group[1] age[4] race$[B]=0.0;,
group[0] age[3] race$[B] - group[0] age[4] race$[B] -,
group[1] age[3] race$[B] + group[1] age[4] race$[B]=0.0
TEST

The output is:


Hypothesis.
A Matrix
               1          2          3          4          5
  1       -1.000        0.0        0.0     -1.000        0.0
  2          0.0        0.0     -1.000     -1.000        0.0
  3          0.0        0.0        0.0     -1.000        0.0

               6          7          8          9         10
  1        1.000      1.000        0.0      1.000        0.0
  2        1.000      1.000        0.0        0.0      1.000
  3          0.0      1.000        0.0        0.0        0.0

              11         12         13         14         15
  1        1.000        0.0     -1.000     -1.000        0.0
  2        1.000        0.0     -1.000     -1.000        0.0
  3        1.000        0.0        0.0     -1.000        0.0

D Matrix
  1          0.0
  2          0.0
  3          0.0

Test of Hypothesis
     Source          SS    df         MS          F          P
 Hypothesis      91.576     3     30.525      1.069      0.366
      Error    2627.472    92     28.559

The following commands are needed to produce the rest of Milliken and Johnson's
results. The remaining output is not listed.
HYPOTHESIS
NOTE RACE$ MAIN EFFECT
SPECIFY ,
group[0] age[2] race$[B] + group[0] age[3] race$[B] +,
group[1] age[1] race$[B] + group[1] age[3] race$[B] +,
group[1] age[4] race$[B] =,
group[0] age[2] race$[W] + group[0] age[3] race$[W] +,
group[1] age[1] race$[W] + group[1] age[3] race$[W] +,
group[1] age[4] race$[W];,
group[0] age[3] race$[H] + group[1] age[3] race$[H] =,
group[0] age[3] race$[W] + group[1] age[3] race$[W]
TEST
HYPOTHESIS
NOTE GROUP*RACE$
SPECIFY ,
group[0] age[3] race$[B] - group[0] age[3] race$[W] -,
group[1] age[3] race$[B] + group[1] age[3] race$[W]=0.0;,
group[0] age[3] race$[H] - group[0] age[3] race$[W] -,
group[1] age[3] race$[H] + group[1] age[3] race$[W]=0.0
TEST
HYPOTHESIS
NOTE 'AGE*RACE$'
SPECIFY ,
group[1] age[1] race$[B] - group[1] age[1] race$[W] -,
group[1] age[4] race$[B] + group[1] age[4] race$[W]=0.0;,
group[0] age[2] race$[B] - group[0] age[2] race$[W] -,
group[0] age[3] race$[B] + group[0] age[3] race$[W]=0.0;,
group[1] age[3] race$[B] - group[1] age[3] race$[W] -,
group[1] age[4] race$[B] + group[1] age[4] race$[W]=0.0
TEST

Finally, Milliken and Johnson do pairwise comparisons:


HYPOTHESIS
POST group*age*race$ / LSD
TEST


The following is the matrix of comparisons printed by GLM. The matrix of mean
differences has been omitted.
COL/
ROW   GROUP   AGE   RACE$
  1     0      1      W
  2     0      2      B
  3     0      2      W
  4     0      3      B
  5     0      3      H
  6     0      3      W
  7     0      4      B
  8     1      1      B
  9     1      1      W
 10     1      2      W
 11     1      3      B
 12     1      3      H
 13     1      3      W
 14     1      4      B
 15     1      4      W

Using unweighted means.
Post Hoc test of DIFF
Using model MSE of 28.559 with 92 df.

Fisher's Least-Significant-Difference Test.
Matrix of pairwise comparison probabilities:

            1        2        3        4        5
  1     1.000
  2     0.662    1.000
  3     0.638    0.974    1.000
  4     0.725    0.323    0.295    1.000
  5     0.324    0.455    0.461    0.161    1.000
  6     0.521    0.827    0.850    0.167    0.497
  7     0.706    0.901    0.912    0.527    0.703
  8     0.197    0.274    0.277    0.082    0.780
  9     0.563    0.778    0.791    0.342    0.709
 10     0.049    0.046    0.042    0.004    0.575
 11     0.018    0.016    0.015    0.002    0.283
 12     0.706    0.901    0.912    0.527    0.703
 13     0.018    0.007    0.005    0.000    0.456
 14     0.914    0.690    0.676    0.908    0.403
 15     0.090    0.096    0.090    0.008    0.783

            6        7        8        9       10
  6     1.000
  7     0.971    1.000
  8     0.292    0.543    1.000
  9     0.860    0.939    0.514    1.000
 10     0.026    0.392    0.836    0.303    1.000
 11     0.010    0.213    0.451    0.134    0.425
 12     0.971    1.000    0.543    0.939    0.392
 13     0.000    0.321    0.717    0.210    0.798
 14     0.610    0.692    0.288    0.594    0.168
 15     0.059    0.516    0.930    0.447    0.619

           11       12       13       14       15
 11     1.000
 12     0.213    1.000
 13     0.466    0.321    1.000
 14     0.082    0.692    0.124    1.000
 15     0.219    0.516    0.344    0.238    1.000

Within group 0 (cells 1-7), there are no significant pairwise differences in average test
score changes. The same is true within group 1 (cells 8-15).


Example 10
Covariance Alternatives to Repeated Measures
Analysis of covariance offers an alternative to repeated measures in a pre-post design.
You can use the pre-test as a covariate in predicting the post-test. This example shows
how to do a two-group, pre-post design:
GLM
USE filename
CATEGORY group
MODEL post = CONSTANT + group + pre
ESTIMATE

When using this design, be sure to check the homogeneity of slopes assumption. Use
the following commands to check that the interaction term, GROUP*PRE, is not
significant:
GLM
USE filename
CATEGORY group
MODEL post = CONSTANT + group + pre + group*pre
ESTIMATE

Example 11
Weighting Means
Sometimes you want to weight the cell means when you test hypotheses in ANOVA.
Suppose you have an experiment in which a few rats died before its completion. You
do not want the hypotheses tested to depend upon the differences in cell sizes (which
are presumably random). Here is an example from Morrison (1976). The data
(MOTHERS) are hypothetical profiles on three scales of mothers in each of four
socioeconomic classes.
Morrison analyzes these data with the multivariate profile model for repeated
measures. Because the hypothesis of parallel profiles across classes is not rejected, you
can test whether the profiles are level. That is, do the scales differ when we pool the
classes together?
Pooling unequal classes can be done by weighting each according to sample size or
averaging the means of the subclasses. First, let's look at the model and test the
hypothesis of equality of scale parameters without weighting the cell means.


The input is:


USE mothers
GLM
CATEGORY class
MODEL scale(1 .. 3) = CONSTANT + class
ESTIMATE
HYPOTHESIS
EFFECT = CONSTANT
CMATRIX [1 -1 0; 0 1 -1]
TEST

The output is:


Dependent variable means
                   SCALE(1)    SCALE(2)    SCALE(3)
                     14.524      15.619      15.857

Estimates of effects   B = (X'X)^-1 X'Y

                   SCALE(1)    SCALE(2)    SCALE(3)
CONSTANT             13.700      14.550      14.988
CLASS      1          4.300       5.450       4.763
CLASS      2          0.100       0.650      -0.787
CLASS      3         -0.700      -0.550       0.012

Test for effect called:      CONSTANT

C Matrix
               1          2          3
  1        1.000     -1.000        0.0
  2          0.0      1.000     -1.000

Univariate F Tests
 Effect          SS    df         MS          F          P
      1      14.012     1     14.012      4.652      0.046
  Error      51.200    17      3.012
      2       3.712     1      3.712      1.026      0.325
  Error      61.500    17      3.618

Multivariate Test Statistics
          Wilks' Lambda =    0.564
            F-Statistic =    6.191    df =  2, 16    Prob = 0.010
           Pillai Trace =    0.436
            F-Statistic =    6.191    df =  2, 16    Prob = 0.010
 Hotelling-Lawley Trace =    0.774
            F-Statistic =    6.191    df =  2, 16    Prob = 0.010

Notice that the dependent variable means differ from the CONSTANT. The
CONSTANT in this case is a mean of the cell means rather than the mean of all the
cases.

Weighting by the Sample Size


If you believe (as Morrison does) that the differences in cell sizes reflect population
subclass proportions, then you need to weight the cell means to get a grand mean; for
example:
$8\mu_1 + 5\mu_2 + 4\mu_3 + 4\mu_4$

Expressed in terms of our analysis of variance parameterization, this is:

$8(\mu + \alpha_1) + 5(\mu + \alpha_2) + 4(\mu + \alpha_3) + 4(\mu + \alpha_4)$

Because the sum of effects is 0 for a classification and because you do not have an
independent estimate of CLASS4, this expression is equivalent to

$8(\mu + \alpha_1) + 5(\mu + \alpha_2) + 4(\mu + \alpha_3) + 4(\mu - \alpha_1 - \alpha_2 - \alpha_3)$

which works out to

$21\mu + 4\alpha_1 + 1\alpha_2 + 0\alpha_3$

Use AMATRIX to test this hypothesis.


HYPOTHESIS
AMATRIX [21 4 1 0]
CMATRIX [1 -1 0; 0 1 -1]
TEST
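The A matrix row (21, 4, 1, 0) can be checked against the earlier output. The Python sketch below is only a hand check: it assumes the cell sizes 8, 5, 4, and 4 from the weighting expression and takes the CONSTANT and the three CLASS estimates for SCALE(1) from the unweighted analysis, in the order printed there.

```python
sizes = [8, 5, 4, 4]
constant = 13.700
effects = [4.300, 0.100, -0.700]       # CLASS estimates for SCALE(1), assumed levels 1-3

# Effects sum to 0, so the fourth effect is minus the sum of the first three.
cell_means = [constant + e for e in effects] + [constant - sum(effects)]

unweighted = sum(cell_means) / len(cell_means)
weighted = sum(n * m for n, m in zip(sizes, cell_means)) / sum(sizes)

# The same weighted mean obtained directly from the A matrix row.
a_row = [21, 4, 1, 0]
via_a = sum(a * b for a, b in zip(a_row, [constant] + effects)) / sum(sizes)

print(round(unweighted, 3), round(weighted, 3), round(via_a, 3))
```

The unweighted value equals the CONSTANT (13.700), while the size-weighted value equals the dependent variable mean reported for SCALE(1) (14.524), which is the quantity the weighted hypothesis is built on.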

The output is:


Hypothesis.

A Matrix

                    1           2           3           4
               21.000       4.000       1.000         0.0

C Matrix

                    1           2           3
     1          1.000      -1.000         0.0
     2            0.0       1.000      -1.000

Univariate F Tests

Effect              SS       df          MS           F           P

     1          25.190        1      25.190       8.364       0.010
 Error          51.200       17       3.012

     2           1.190        1       1.190       0.329       0.574
 Error          61.500       17       3.618

Multivariate Test Statistics

           Wilks' Lambda =      0.501
             F-Statistic =      7.959    df = 2, 16    Prob = 0.004

            Pillai Trace =      0.499
             F-Statistic =      7.959    df = 2, 16    Prob = 0.004

  Hotelling-Lawley Trace =      0.995
             F-Statistic =      7.959    df = 2, 16    Prob = 0.004

This is the multivariate F statistic that Morrison gets. For these data, we prefer the
weighted means analysis because these differences in cell frequencies probably reflect
population base rates. They are not random.
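
As an arithmetic check on the sample-size weighting, the following Python sketch applies the AMATRIX weights (21, 4, 1, 0) to the effect estimates printed earlier in this example and divides by the total sample size of 21. The function name and the layout of the lists are illustrative assumptions and are not part of SYSTAT. The results should agree with the printed dependent variable means to within the rounding of the displayed estimates.

```python
# Effect estimates copied from the "Estimates of effects" output.
# Rows: CONSTANT, CLASS 1, CLASS 2, CLASS 3; columns: SCALE(1), SCALE(2), SCALE(3).
B = [
    [13.700, 14.550, 14.988],
    [4.300, 5.450, 4.763],
    [0.100, 0.650, -0.787],
    [-0.700, -0.550, 0.012],
]

# Weights from AMATRIX [21 4 1 0]; the first weight is the total sample size.
A = [21, 4, 1, 0]


def weighted_means(a, b):
    """Return a'B divided by the weight on the constant (hypothetical helper)."""
    n_total = a[0]
    n_cols = len(b[0])
    return [sum(a[i] * b[i][j] for i in range(len(a))) / n_total
            for j in range(n_cols)]


means = weighted_means(A, B)
print([round(m, 3) for m in means])
```

The unweighted CONSTANT (13.700, 14.550, 14.988) is the mean of the four cell means, whereas the weighted combination recovers the mean of all 21 cases.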

Example 12
Hotelling's T-Square
You can use General Linear Model to calculate Hotelling's T-square statistic.

One-Sample Test
For example, to get a one-sample test for the variables X and Y, select both X and Y as
dependent variables.
GLM
USE filename
MODEL x, y = CONSTANT
ESTIMATE

The F test for CONSTANT is the statistic you want. It is the same as Hotelling's T^2
for the hypothesis that the population means for X and Y are 0.
You can also test against the hypothesis that the means of X and Y have particular
nonzero values (for example, 10 and 15) by using:
HYPOTHESIS
DMATRIX [10 15]
TEST
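
For readers who want to see the arithmetic behind the one-sample test, here is a minimal Python sketch for two variables. It is not SYSTAT's implementation; the function name and the small data set are hypothetical. It computes T^2 = n d'S^-1 d, where d is the difference between the sample mean vector and the hypothesized means and S is the sample covariance matrix, and then converts T^2 to an F statistic with p and n - p degrees of freedom using F = (n - p) T^2 / (p (n - 1)).

```python
def hotelling_t2_two_vars(xs, ys, mu0=(0.0, 0.0)):
    """One-sample Hotelling T-square for two variables (illustrative sketch)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Sample variances and covariance (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = mx - mu0[0], my - mu0[1]
    # Quadratic form d' S^-1 d, written out for the 2 x 2 case.
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    t2 = n * quad
    p = 2
    f = (n - p) / (p * (n - 1)) * t2
    return t2, f, (p, n - p)


# Hypothetical data, used only to exercise the function.
x = [9.0, 11.0, 10.0, 12.0, 8.0]
y = [15.0, 14.0, 16.0, 17.0, 13.0]

t2_zero, f_zero, df = hotelling_t2_two_vars(x, y)                 # means = 0
t2_alt, f_alt, _ = hotelling_t2_two_vars(x, y, mu0=(10.0, 15.0))  # like DMATRIX [10 15]
print(t2_zero, f_zero, df, t2_alt)
```

Passing `mu0=(10.0, 15.0)` plays the role of the DMATRIX statement above: the hypothesized means are subtracted before the statistic is formed.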


Two-Sample Test
For a two-sample test, you must provide a categorical independent variable that
represents the two groups. The input is:
GLM
CATEGORY group
MODEL x,y = CONSTANT + group
ESTIMATE

Example 13
Discriminant Analysis
This example uses the IRIS data file. Fisher used these data to illustrate his
discriminant function. To define the model:
USE iris
GLM
CATEGORY species
MODEL sepallen sepalwid petallen petalwid = CONSTANT +,
      species
ESTIMATE
HYPOTHESIS
EFFECT = species
SAVE canon
TEST

SYSTAT saves the canonical scores associated with the hypothesis. The scores are
stored in subscripted variables named FACTOR. Because the effects involve a
categorical variable, the Mahalanobis distances (named DISTANCE) and posterior
probabilities (named PROB) are saved in the same file. These distances are computed
in the discriminant space itself. The closer a case is to a particular group's location in
that space, the more likely it is that it belongs to that group. The probability of group
membership is computed from these distances. A variable named PREDICT that
contains the predicted group membership is also added to the file.
The output follows:
Dependent variable means

               SEPALLEN    SEPALWID    PETALLEN    PETALWID
                  5.843       3.057       3.758       1.199

Estimates of effects  B = (X'X)^-1 X'Y

               SEPALLEN    SEPALWID    PETALLEN    PETALWID
CONSTANT          5.843       3.057       3.758       1.199
SPECIES    1     -0.837       0.371      -2.296      -0.953
SPECIES    2      0.093      -0.287       0.502       0.127

-------------------------------------------------------------------------------
Test for effect called:      SPECIES

Null hypothesis contrast AB

               SEPALLEN    SEPALWID    PETALLEN    PETALWID
     1           -0.837       0.371      -2.296      -0.953
     2            0.093      -0.287       0.502       0.127

Inverse contrast A(X'X)^-1 A'

                      1           2
     1            0.013
     2           -0.007       0.013

Hypothesis sum of product matrix  H = B'A'(A(X'X)^-1 A')^-1 AB

               SEPALLEN    SEPALWID    PETALLEN    PETALWID
SEPALLEN         63.212
SEPALWID        -19.953      11.345
PETALLEN        165.248     -57.240     437.103
PETALWID         71.279     -22.933     186.774      80.413

Error sum of product matrix  G = E'E

               SEPALLEN    SEPALWID    PETALLEN    PETALWID
SEPALLEN         38.956
SEPALWID         13.630      16.962
PETALLEN         24.625       8.121      27.223
PETALWID          5.645       4.808       6.272       6.157
Univariate F Tests

Effect              SS       df          MS           F           P

SEPALLEN        63.212        2      31.606     119.265       0.000
 Error          38.956      147       0.265

SEPALWID        11.345        2       5.672      49.160       0.000
 Error          16.962      147       0.115

PETALLEN       437.103        2     218.551    1180.161       0.000
 Error          27.223      147       0.185

PETALWID        80.413        2      40.207     960.007       0.000
 Error           6.157      147       0.042


Multivariate Test Statistics

           Wilks' Lambda =      0.023
             F-Statistic =    199.145    df = 8, 288    Prob = 0.000

            Pillai Trace =      1.192
             F-Statistic =     53.466    df = 8, 290    Prob = 0.000

  Hotelling-Lawley Trace =     32.477
             F-Statistic =    580.532    df = 8, 286    Prob = 0.000

                   THETA =      0.970    S = 2, M = 0.5, N = 71.0    Prob = 0.0

Test of Residual Roots

Roots 1 through 2
    Chi-Square Statistic =    546.115    df = 8

Roots 2 through 2
    Chi-Square Statistic =     36.530    df = 3

Canonical Correlations

                      1           2
                  0.985       0.471
Dependent variable canonical coefficients standardized
by conditional (within groups) standard deviations

                      1           2
SEPALLEN          0.427       0.012
SEPALWID          0.521       0.735
PETALLEN         -0.947      -0.401
PETALWID         -0.575       0.581

Canonical loadings (correlations between conditional
dependent variables and dependent canonical factors)

                      1           2
SEPALLEN         -0.223       0.311
SEPALWID          0.119       0.864
PETALLEN         -0.706       0.168
PETALWID         -0.633       0.737

Group classification function coefficients

                      1           2           3
SEPALLEN         23.544      15.698      12.446
SEPALWID         23.588       7.073       3.685
PETALLEN        -16.431       5.211      12.767
PETALWID        -17.398       6.434      21.079

Group classification constants

                      1           2           3
                -86.308     -72.853    -104.368

Canonical scores have been saved.

The multivariate tests are all significant. The dependent variable canonical coefficients
are used to produce discriminant scores. These coefficients are standardized by the
within-groups standard deviations so you can compare their magnitude across
variables with different scales. Because they are not raw coefficients, there is no need
for a constant. The scores produced by these coefficients have an overall zero mean and
a unit standard deviation within groups.


The group classification coefficients and constants comprise the Fisher discriminant
functions for classifying the raw data. You can apply these coefficients to new data and
assign each case to the group with the largest function value for that case.
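
The following Python sketch illustrates how the classification functions might be applied to a new case. The coefficients and constants are copied from the output above; the function names and the two sets of flower measurements are hypothetical and serve only as an illustration. Each group's function value is its constant plus the sum of coefficient-times-measurement products, and the case is assigned to the group with the largest value.

```python
# Group classification function coefficients from the output above,
# in the order SEPALLEN, SEPALWID, PETALLEN, PETALWID.
COEF = {
    1: (23.544, 23.588, -16.431, -17.398),
    2: (15.698, 7.073, 5.211, 6.434),
    3: (12.446, 3.685, 12.767, 21.079),
}

# Group classification constants from the output above.
CONST = {1: -86.308, 2: -72.853, 3: -104.368}


def classification_scores(case):
    """Return the function value for each group (hypothetical helper)."""
    return {g: CONST[g] + sum(c * v for c, v in zip(COEF[g], case))
            for g in COEF}


def classify(case):
    """Assign the case to the group with the largest function value."""
    scores = classification_scores(case)
    return max(scores, key=scores.get)


# Illustrative measurements (sepal length, sepal width, petal length, petal width).
small_flower = (5.1, 3.5, 1.4, 0.2)
large_flower = (6.3, 3.3, 6.0, 2.5)
print(classify(small_flower), classify(large_flower))
```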

Studying Saved Results


The CANON file that was just saved contains the canonical variable scores
(FACTOR(1) and FACTOR(2)), the Mahalanobis distances to each group centroid
(DISTANCE(1), DISTANCE(2), and DISTANCE(3)), the posterior probability for each
case being assigned to each group (PROB(1), PROB(2), and PROB(3)), the predicted
group membership (PREDICT), and the original group assignment (GROUP).
To produce a classification table of the group assignment against the predicted
group membership and a plot of the second canonical variable against the first, the
input is:
USE canon
XTAB
PRINT NONE/ FREQ CHISQ
TABULATE GROUP * PREDICT
PLOT FACTOR(2)*FACTOR(1) / OVERLAY GROUP=GROUP COLOR=2,1,3 ,
FILL=1,1,1 SYMBOL=4,8,5

The output follows:


Frequencies
GROUP (rows) by PREDICT (columns)

                 1        2        3      Total
          +----------------------------+
     1    |     50        0        0   |     50
     2    |      0       48        2   |     50
     3    |      0        1       49   |     50
          +----------------------------+
 Total          50       49       51        150

Test statistic               Value         df       Prob
Pearson Chi-square         282.593      4.000      0.000

[Figure: scatterplot of FACTOR(2) against FACTOR(1), with cases marked by GROUP (1, 2, 3).]

However, it is much easier to use the Discriminant Analysis procedure.
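
The Pearson chi-square reported for the classification table can be reproduced by hand. The Python sketch below is an illustration only (the function name is an assumption, not a SYSTAT routine); it uses the frequencies of GROUP by PREDICT from the table above and the usual formula, the sum over cells of (observed - expected)^2 / expected, with expected counts formed from the row and column totals.

```python
def pearson_chi_square(table):
    """Pearson chi-square and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# GROUP (rows) by PREDICT (columns), copied from the classification table.
freq = [
    [50, 0, 0],
    [0, 48, 2],
    [0, 1, 49],
]

chi2, df = pearson_chi_square(freq)
correct = sum(freq[i][i] for i in range(3))
print(round(chi2, 3), df, correct / 150)
```

The diagonal of the table holds the correctly classified cases: 147 of the 150 flowers.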

Prior Probabilities
In this example, there were equal numbers of flowers in each group. Sometimes the
probability of finding a case in each group is not the same across groups. To adjust the
prior probabilities for this example, specify 0.5, 0.3, and 0.2 as the priors:
PRIORS 0.5 0.3 0.2

General Linear Model uses the probabilities you specify to compute the posterior
probabilities that are saved in the file under the variable PROB. Be sure to specify a
probability for each level of the grouping variable. The probabilities should add up to 1.
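
To illustrate how priors can enter a posterior probability, here is a hedged Python sketch. It assumes the usual normal-theory rule with a common within-groups covariance matrix, in which the posterior for a group is proportional to its prior times exp(-D^2 / 2), where D^2 is the squared Mahalanobis distance to that group. The manual does not state SYSTAT's exact formula or whether the saved DISTANCE values are squared, so treat both the formula and the function name as assumptions. The distances in the example are made up.

```python
import math


def posterior_probabilities(squared_distances, priors):
    """Posterior group probabilities under an assumed normal-theory rule."""
    weights = [p * math.exp(-0.5 * d2)
               for p, d2 in zip(priors, squared_distances)]
    total = sum(weights)
    return [w / total for w in weights]


priors = [0.5, 0.3, 0.2]

# A case equally far from all three groups: the posteriors equal the priors.
equal_case = posterior_probabilities([4.0, 4.0, 4.0], priors)

# A hypothetical case that is much closer to the third group.
near_third = posterior_probabilities([9.0, 6.0, 1.0], priors)
print(equal_case, near_third)
```

With equal distances the data carry no information about group membership, so the priors pass through unchanged; otherwise they shift the posterior toward the groups that are more common a priori.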

Example 14
Principal Components Analysis (Within Groups)
General Linear Model allows you to partial out effects based on grouping variables and
to factor residual correlations. If between-group variation is significant, the
within-group structure can differ substantially from the total structure (ignoring the
grouping variable). However, if you are just computing principal components on a single
sample (no grouping variable), you can obtain more detailed output using the Factor
Analysis procedure.


The following data (USSTATES) comprise death rates by cause from nine census
divisions of the country for that year. The divisions are in the column labeled DIV, and
the U.S. Post Office two-letter state abbreviations follow DIV. Other variables include
ACCIDENT, CARDIO, CANCER, PULMONAR, PNEU_FLU, DIABETES, LIVER,
STATE$, FSTROKE, MSTROKE.
The variation in death rates between divisions in these data is substantial. Here is a
grouped box plot of the second variable, CARDIO, by division. The other variables
show similar regional differences.
[Figure: grouped box plot of CARDIO (vertical axis, 100 to 500) by DIVISION$ (horizontal axis, the nine census divisions).]

If you analyze these data ignoring DIVISION$, the correlations among death rates
would be due substantially to between-division differences. You might want to
examine the pooled within-region correlations to see if the structure is different when
divisional differences are statistically controlled. Accordingly, you will factor the
residual correlation matrix after regressing medical variables onto an index variable
denoting the census regions. The input is:
USE usstates
GLM
CATEGORY division
MODEL accident cardio cancer pulmonar pneu_flu,
diabetes liver fstroke mstroke = CONSTANT + division
ESTIMATE
HYPOTHESIS
EFFECT = division
FACTOR = ERROR
TYPE   = CORR
ROTATE = 2
TEST


The hypothesis commands compute the principal components on the error (residual)
correlation matrix and rotate the first two components to a varimax criterion. For other
rotations, use the Factor Analysis procedure.
The FACTOR options can be used with any hypothesis. Ordinarily, when you test a
hypothesis, the matrix product INV(G)*H is factored and the latent roots of this matrix
are used to construct the multivariate test statistic. However, you can indicate which
matrix, the hypothesis (H) matrix or the error (G) matrix, is to be factored. By
computing principal components on the hypothesis or error matrix separately, FACTOR
offers a direct way to compute principal components on residuals of any linear model
you wish to fit. You can use any A, C, and/or D matrices in the hypothesis you are
factoring, or you can use any of the other commands that create these matrices.
The hypothesis output follows:
Factoring Error Matrix

                    1           2           3           4           5
     1          1.000
     2          0.280       1.000
     3          0.188       0.844       1.000
     4          0.307       0.676       0.711       1.000
     5          0.113       0.448       0.297       0.396       1.000
     6          0.297       0.419       0.526       0.296      -0.123
     7         -0.005       0.251       0.389       0.252      -0.138
     8          0.402      -0.202      -0.379      -0.190      -0.110
     9          0.495      -0.119      -0.246      -0.127      -0.071

                    6           7           8           9
     6          1.000
     7         -0.025       1.000
     8         -0.151      -0.225       1.000
     9         -0.076      -0.203       0.947       1.000

Latent roots

                    1           2           3           4           5
                3.341       2.245       1.204       0.999       0.475

                    6           7           8           9
                0.364       0.222       0.119       0.033

Loadings

                    1           2           3           4           5
     1          0.191       0.798       0.128      -0.018      -0.536
     2          0.870       0.259      -0.097       0.019       0.219
     3          0.934       0.097       0.112       0.028       0.183
     4          0.802       0.247      -0.135       0.120      -0.071
     5          0.417       0.146      -0.842      -0.010      -0.042
     6          0.512       0.218       0.528      -0.580       0.068
     7          0.391      -0.175       0.400       0.777      -0.044
     8         -0.518       0.795       0.003       0.155       0.226
     9         -0.418       0.860       0.025       0.138       0.204

                    6           7           8           9
     1          0.106      -0.100      -0.019      -0.015
     2          0.145      -0.254       0.177       0.028
     3          0.039      -0.066      -0.251      -0.058
     4         -0.499       0.085       0.044       0.015
     5          0.216       0.220      -0.005      -0.002
     6          0.093       0.241       0.063       0.010
     7          0.154       0.159       0.046       0.009
     8         -0.041       0.056       0.081      -0.119
     9          0.005       0.035      -0.101       0.117

Rotated loadings on first 2 principal components

                    1           2
     1          0.457       0.682
     2          0.906      -0.060
     3          0.909      -0.234
     4          0.838      -0.047
     5          0.441      -0.008
     6          0.556       0.027
     7          0.305      -0.300
     8         -0.209       0.925
     9         -0.093       0.951

Sorted rotated loadings on first 2 principal components
(loadings less than .25 made 0.)

                    1           2
     1          0.909         0.0
     2          0.906         0.0
     3          0.838         0.0
     4          0.556         0.0
     5            0.0       0.951
     6            0.0       0.925
     7          0.457       0.682
     8          0.305      -0.300
     9          0.441         0.0

Notice the sorted, rotated loadings. When interpreting these values, do not relate the
row numbers (1 through 9) to the variables. Instead, find the corresponding loading in
the Rotated Loadings table. The ordering of the rotated loadings corresponds to the
order of the model variables.
The first component rotates to a dimension defined by CANCER, CARDIO,
PULMONAR, and DIABETES; the second, by a dimension defined by MSTROKE and
FSTROKE (male and female stroke rates). ACCIDENT also loads on the second factor
but is not independent of the first. LIVER does not load highly on either factor.
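
Because the rows of the rotated loadings table follow the order of the variables in the MODEL statement, the interpretation above can be checked by attaching names to the rows. The Python sketch below does this with the rotated loadings copied from the output; the helper name and the cutoff of 0.5 for a "high" loading are assumptions made for the illustration, not SYSTAT conventions.

```python
# Variable names in the order they appear in the MODEL statement.
VARIABLES = ["ACCIDENT", "CARDIO", "CANCER", "PULMONAR", "PNEU_FLU",
             "DIABETES", "LIVER", "FSTROKE", "MSTROKE"]

# Rotated loadings on the first two principal components, rows 1 through 9.
ROTATED = [
    (0.457, 0.682), (0.906, -0.060), (0.909, -0.234),
    (0.838, -0.047), (0.441, -0.008), (0.556, 0.027),
    (0.305, -0.300), (-0.209, 0.925), (-0.093, 0.951),
]


def dominant_variables(component, cutoff=0.5):
    """Names whose absolute loading on a component is at least the cutoff."""
    pairs = [(name, loads[component])
             for name, loads in zip(VARIABLES, ROTATED)
             if abs(loads[component]) >= cutoff]
    pairs.sort(key=lambda pair: -abs(pair[1]))
    return [name for name, _ in pairs]


first = dominant_variables(0)
second = dominant_variables(1)
print(first)
print(second)
```

With this cutoff, LIVER (0.305 and -0.300) appears on neither list, which matches the remark that it does not load highly on either factor.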


Example 15
Canonical Correlation Analysis
Suppose you have 10 dependent variables, MMPI(1) to MMPI(10), and 3 independent
variables, RATER(1) to RATER(3). Enter the following commands to obtain the
canonical correlations and dependent canonical coefficients:
USE datafile
GLM
MODEL mmpi(1 .. 10) = CONSTANT + rater(1) + rater(2) + rater(3)
ESTIMATE
PRINT=LONG
HYPOTHESIS
STANDARDIZE
EFFECT=rater(1) & rater(2) & rater(3)
TEST

The canonical correlations are displayed; if you want, you can rotate the dependent
canonical coefficients by using the Rotate option.
To obtain the coefficients for the independent variables, run GLM again with the model
reversed:
MODEL rater(1 .. 3) = CONSTANT + mmpi(1) + mmpi(2),
+ mmpi(3) + mmpi(4) + mmpi(5),
+ mmpi(6) + mmpi(7) + mmpi(8),
+ mmpi(9) + mmpi(10)
ESTIMATE
HYPOTHESIS
STANDARDIZE = TOTAL
EFFECT = mmpi(1) & mmpi(2) & mmpi(3) & mmpi(4) &,
mmpi(5) & mmpi(6) & mmpi(7) & mmpi(8) &,
mmpi(9) & mmpi(10)
TEST

Example 16
Mixture Models
Mixture models decompose the effects of mixtures of variables on a dependent
variable. They differ from ordinary regression models because the independent
variables sum to a constant value. The regression model, therefore, does not include a
constant, and the regression and error sums of squares have one less degree of freedom.
Marquardt and Snee (1974) and Diamond (1981) discuss these models and their
estimation.


Here is an example using the PUNCH data file from Cornell (1985). The study
involved effects of various mixtures of watermelon, pineapple, and orange juice on
taste ratings by judges of a fruit punch. The input is:
USE punch
GLM
MODEL taste = watrmeln + pineappl + orange + ,
watrmeln*pineappl + watrmeln*orange + ,
pineappl*orange
ESTIMATE / MIX

The output follows:


Dep Var: TASTE   N: 18   Multiple R: 0.969   Squared multiple R: 0.939

Adjusted squared multiple R: 0.913   Standard error of estimate: 0.232

Effect         Coefficient    Std Error     Std Coef    Tolerance        t    P(2 Tail)

WATRMELN             4.600        0.134        3.001        0.667   34.322        0.000
PINEAPPL             6.333        0.134        4.131        0.667   47.255        0.000
ORANGE               7.100        0.134        4.631        0.667   52.975        0.000
WATRMELN
*PINEAPPL            2.400        0.657        0.320        0.667    3.655        0.003
WATRMELN
*ORANGE              1.267        0.657        0.169        0.667    1.929        0.078
PINEAPPL
*ORANGE             -2.200        0.657       -0.293        0.667   -3.351        0.006

Analysis of Variance

Source        Sum-of-Squares     df    Mean-Square      F-ratio          P

Regression             9.929      5          1.986       36.852      0.000
Residual               0.647     12          0.054

Not using a mixture model produces a much larger R (0.999) and an F value of
2083.371, both of which are inappropriate for these data. Notice that the Regression
Sum-of-Squares has five degrees of freedom instead of six as in the usual zero-intercept
regression model. We have lost one degree of freedom because the predictors sum to 1.
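
The summary statistics in the mixture output follow directly from the analysis of variance table once the regression is given five degrees of freedom. The Python sketch below recomputes them from the printed sums of squares; the function name is hypothetical, and small differences from the printed F-ratio are expected because the sums of squares are displayed to only three decimals.

```python
import math


def anova_summary(ss_reg, ss_res, df_reg, df_res):
    """Recompute fit statistics from an analysis of variance table."""
    ss_total = ss_reg + ss_res
    df_total = df_reg + df_res
    r2 = ss_reg / ss_total
    adjusted_r2 = 1.0 - (1.0 - r2) * df_total / df_res
    f_ratio = (ss_reg / df_reg) / (ss_res / df_res)
    std_error = math.sqrt(ss_res / df_res)
    return r2, adjusted_r2, f_ratio, std_error


# Values from the PUNCH output: regression df is 5 (not 6) because the
# three juice proportions sum to 1.
r2, adjusted_r2, f_ratio, std_error = anova_summary(9.929, 0.647, 5, 12)
print(round(r2, 3), round(adjusted_r2, 3), round(std_error, 3))
```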

Example 17
Partial Correlations
Partial correlations are easy to compute with General Linear Model. The partial
correlation of two variables (a and b) controlling for the effects of a third (c) is the
correlation between the residuals of each (a and b) after each has been regressed on the
third (c). You can therefore use General Linear Model to compute an entire matrix of
partial correlations.


For example, to compute the matrix of partial correlations for Y1, Y2, Y3, Y4, and
Y5, controlling for the effects of X, select Y1 through Y5 as dependent variables and X
as the independent variable. The input follows:
GLM
MODEL y(1 .. 5) = CONSTANT + x
PRINT=LONG
ESTIMATE

Look for the Residual Correlation Matrix in the output; it is the matrix of partial
correlations among the y's given x. If you want to compute partial correlations for
several x's, just select them (also) as independent variables.
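
The residual definition of a partial correlation can be verified on a small scale. This Python sketch, which uses made-up data and hypothetical helper names, regresses a and b on c, correlates the two sets of residuals, and compares the result with the familiar closed-form expression (r_ab - r_ac r_bc) / sqrt((1 - r_ac^2)(1 - r_bc^2)).

```python
import math


def mean(values):
    return sum(values) / len(values)


def corr(u, v):
    """Pearson correlation between two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def residuals(y, x):
    """Residuals from the least-squares regression of y on x with a constant."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def partial_corr(a, b, c):
    """Correlation of a and b after each has been regressed on c."""
    return corr(residuals(a, c), residuals(b, c))


# Hypothetical data.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
c = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]

from_residuals = partial_corr(a, b, c)
r_ab, r_ac, r_bc = corr(a, b), corr(a, c), corr(b, c)
from_formula = (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))
print(from_residuals, from_formula)
```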

Computation
Algorithms
Centered sums of squares and cross products are accumulated using provisional
algorithms. Linear systems, including those involved in hypothesis testing, are solved
by using forward and reverse sweeping (Dempster, 1969). Eigensystems are solved
with Householder tridiagonalization and implicit QL iterations. For further
information, see Wilkinson and Reinsch (1971) or Chambers (1977).

References
Chambers, J.M. (1977). Computational methods for data analysis. New York: John
Wiley & Sons, Inc.
Cochran, W. G. and Cox, G. M. (1957). Experimental designs, 2nd ed. New York: John
Wiley & Sons, Inc.
Cohen, J. and Cohen, P. (1983). Applied multiple regression/correlation analysis for the
behavioral sciences. 2nd ed. Hillsdale, N.J.: Lawrence Erlbaum.
Cornell, J.A. (1985). Mixture experiments. In Kotz, S., and Johnson, N.L. (Eds.),
Encyclopedia of statistical sciences, vol. 5, 569-579. New York: John Wiley & Sons,
Inc.
Dempster, A.P. (1969). Elements of continuous multivariate analysis. San Francisco:
Addison-Wesley.
Diamond, W.J. (1981). Practical experiment designs for engineers and scientists.
Belmont, CA: Lifetime Learning Publications.


Hocking, R. R. (1985). The analysis of linear models. Monterey, Calif.: Brooks/Cole.


John, P.W.M. (1971). Statistical design and analysis of experiments. New York:
MacMillan, Inc.
Linn, R. L., Centra, J. A., and Tucker, L. (1975). Between, within, and total group factor
analyses of student ratings of instruction. Multivariate Behavioral Research, 10,
277-288.
Milliken, G. A. and Johnson, D. E. (1984). Analysis of messy data, Vol. 1: Designed
Experiments. New York: Van Nostrand Reinhold Company.
Morrison, D. F. (1976). Multivariate statistical methods. New York: McGraw-Hill.
Neter, J., Wasserman, W., and Kutner, M. (1985). Applied linear statistical models. 2nd
ed. Homewood, Illinois: Richard D. Irwin, Inc.
Marquardt, D.W., and Snee, R.D. (1974). Test Statistics for Mixture Models.
Technometrics, 16, 533-537.
Searle, S. R. (1971). Linear models. New York: John Wiley & Sons, Inc.
Searle, S. R. (1987). Linear models for unbalanced data. New York: John Wiley &
Sons, Inc.
Wilkinson, J.H. and Reinsch, C. (Eds.). (1971). Linear Algebra, Vol. 2, Handbook for
automatic computation. New York: Springer-Verlag.
Winer, B. J. (1971). Statistical principles in experimental design. 2nd ed. New York:
McGraw-Hill.

Chapter 17
Logistic Regression
Dan Steinberg and Phillip Colla

LOGIT performs multiple logistic regression, conditional logistic regression, the
econometric discrete choice model, general linear (Wald) hypothesis testing, score
tests, odds ratios and confidence intervals, forward, backward and interactive
stepwise regression, Pregibon regression diagnostics, prediction success and
classification tables, independent variable derivatives and elasticities, model-based
simulation of response curves, deciles of risk tables, options to specify start values
and to separate data into learning and test samples, quasi-maximum likelihood
standard errors, control of significance levels for confidence interval calculations,
zero/one dependent variable coding, choice of reference group in automatic dummy
variable generation, and integrated plotting tools.
Many of the results generated by modeling, testing, or diagnostic procedures can
be saved to SYSTAT data files for subsequent graphing and display with the graphics
routines. LOGIT and PROBIT are aliases to the categorical multivariate general
modeling module called CMGLH, just as ANOVA, GLM, and REGRESSION are aliases
to the multivariate general linear module called MGLH.

Statistical Background
The LOGIT module is SYSTAT's comprehensive program for logistic regression
analysis and provides tools for model building, model evaluation, prediction,
simulation, hypothesis testing, and regression diagnostics. The program is designed
to be easy for the novice and can produce the results most analysts need with just three
simple commands. In addition, many advanced features are also included for
sophisticated research projects. Beginners can skip over any unfamiliar concepts and
gradually increase their mastery of logistic regression by working through the tools
incorporated here.

LOGIT will estimate binary (Cox, 1970), multinomial (Anderson, 1972), conditional
logistic regression models (Breslow and Day, 1980), and the discrete choice model
(Luce, 1959; McFadden, 1973). The LOGIT framework is designed for analyzing the
determinants of a categorical dependent variable. Typically, the dependent variable is
binary and coded as 0 or 1; however, it may be multinomial and coded as an integer
ranging from 1 to k or 0 to k - 1.
Studies you can conduct with LOGIT include bioassay, epidemiology of disease
(cohort or case-control), clinical trials, market research, transportation research (mode
of travel), psychometric studies, and voter-choice analysis. The LOGIT module can also
be used to analyze ranked choice information once the data have been suitably
transformed (Beggs, Cardell, and Hausman, 1981).
This chapter contains a brief introduction to logistic regression and a description of
the commands and features of the module. If you are unfamiliar with logistic
regression, the textbook by Hosmer and Lemeshow (1989) is an excellent place to
begin; Breslow and Day (1980) provide an introduction in the context of case-control
studies; Train (1986) and Ben-Akiva and Lerman (1985) introduce the discrete-choice
model for econometrics; Wrigley (1985) discusses the model for geographers; and
Hoffman and Duncan (1988) review discrete choice in a demographic-sociological
context. Valuable surveys appear in Amemiya (1981), McFadden (1984, 1982, 1976),
and Maddala (1983).

Binary Logit
Although logistic regression may be applied to any categorical dependent variable, it
is most frequently seen in the analysis of binary data, in which the dependent variable
takes on only two values. Examples include survival beyond five years in a clinical
trial, presence or absence of disease, responding to a specified dose of a toxin, voting
for a political candidate, and participating in the labor force. The figure below
compares the ordinary least-squares linear model to the basic binary logit model on the
same data. Notice some features of the linear model in the upper panel of the figure:
- The linear model predicts values of y from minus to plus infinity. If the prediction
  is intended to be for probabilities, this model is clearly inappropriate.
- The linear model does not pass through the means of x for either value of the
  response. More generally, it does not appear to approach the data values very well.
  We shouldn't blame the linear model for this; it is doing its job as a regression
  estimator by shrinking back toward the mean of y for all x values (0.5). The linear
  model is simply not designed to come near the data.


The lower panel illustrates a logistic model. By contrast, it is designed to fit binary
data, either when y is assumed to represent a probability distribution or when it is
taken simply as a binary measure we are attempting to predict.
Despite the difference in their graphical appearance, the linear and logit models are
only slight variants of one another. Assuming the possibility of more than one predictor
(x) variable, the linear model is:

\[ y = Xb + e \]
where y is a vector of observations, X is a matrix of predictor scores, and e is a vector
of errors.
The logit model is:

\[ y = \frac{\exp(Xb + e)}{1 + \exp(Xb + e)} \]
where the exponential function is applied to the vector argument. Rearranging terms,
we have:

\[ \frac{y}{1 - y} = \exp(Xb + e) \]
and logging both sides of the equation, we have:

\[ \log\left[\frac{y}{1 - y}\right] = Xb + e = b_0 + \sum_j b_j X_{ij} + e_i \quad \text{for all } i = 1, \ldots, n \]

This last expression is one source of the term "logit." The model is linear in the logs.
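
The pair of transformations in this derivation can be sketched in a few lines of Python. The function names are illustrative assumptions; `inverse_logit` is the logistic form exp(z) / (1 + exp(z)) used above, and `logit` is the log of y / (1 - y). Applying one after the other returns the starting value, which is the sense in which the model is linear on the logit scale.

```python
import math


def inverse_logit(z):
    """Map a linear predictor z to a value strictly between 0 and 1."""
    return math.exp(z) / (1.0 + math.exp(z))


def logit(y):
    """Log of the odds y / (1 - y), defined for 0 < y < 1."""
    return math.log(y / (1.0 - y))


# A linear predictor of 0 corresponds to a probability of one half.
half = inverse_logit(0.0)

# Round trip: logit undoes inverse_logit for a few illustrative values.
round_trip = [logit(inverse_logit(z)) for z in (-3.0, -0.5, 0.0, 1.2, 4.0)]
print(half, round_trip)
```

Unlike the linear model, the fitted values can never leave the interval from 0 to 1, however extreme the linear predictor.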


Multinomial Logit
Multinomial logit is a logistic regression model having a dependent variable with
more than two levels (Agresti, 1990; Santer and Duffy, 1989; Nerlove and Press,
1973). Examples of such dependent variables include political preference (Democrat,
Republican, Independent), health status (healthy, moderately impaired, seriously
impaired), smoking status (current smoker, former smoker, never smoked), and job
classification (executive, manager, technical staff, clerical, other). Outside of the
difference in the number of levels of the dependent variable, the multinomial logit is
very similar to the binary logit, and most of the standard tools of interpretation,
analysis, and model selection can be applied. In fact, the polytomous unordered logit
we discuss here is essentially a combination of several binary logits estimated
simultaneously (Begg and Gray, 1984). We use the term polytomous to differentiate
this model from the conditional logistic regression and discrete choice models
discussed below.
There are important differences between binary and multinomial models. Chiefly,
the multinomial output is more complicated than that of the binary model, and care
must be taken in the interpretation of the results. Fortunately, LOGIT provides some
new tools that make the task of interpretation much easier. There is also a difference in
dependent variable coding. The binary logit dependent variable is normally coded 0 or
1, whereas the multinomial dependent variable can be coded 1, 2, ..., k (that is, it
starts at 1 rather than 0) or 0, 1, 2, ..., k - 1.

Conditional Logit
The conditional logistic regression model has become a major analytical tool in
epidemiology since the work of Prentice and Breslow (1978), Breslow et al. (1978),
Prentice and Pyke (1979), and the extended treatment of case-control studies in
Breslow and Day (1980). A mathematically similar model with the same name was
introduced independently and from a rather different perspective by McFadden (1973)
in econometrics. The models have since seen widespread use in the considerably
different contexts of biomedical research and social science, with parallel literatures on
sampling, estimation techniques, and statistical results. In epidemiology, conditional
logit is used to estimate relative risks in matched sample case-control studies (Breslow,
1982), whereas in econometrics a similar likelihood function is used to model
consumer choices as a function of the attributes of alternatives. We begin this section
with a treatment of the biomedical use of the conditional logistic model. A separate
section on the discrete choice model covers the econometric version and contains
certain fine points that may be of interest to all readers. A discussion of parallels in the
two literatures appears in Steinberg (1991).
In the traditional conditional logistic regression model, you are trying to measure
the risk of disease corresponding to different levels of exposure to risk factors. The data
have been collected in the form of matched sets of cases and controls, where the cases
have the disease, the controls do not, and the sets are matched on background variables
such as age, sex, marital status, education, residential location, and possibly other
health indicators. The matching variables combine to form strata over which relative
risks are to be estimated; thus, for example, a small group of persons of a given age,
marital status, and health history will form a single stratum. The matching variables
can also be thought of as proxies for a larger set of unobserved background variables
that are assumed to be constant within strata. The logit for the jth individual in the ith
stratum can be written as:

\[ \mathrm{logit}(p_{ij}) = a_i + b_j X_{ij} \]
where X_ij is the vector of exposure variables and a_i is a parameter dedicated to the
stratum. Since case-control studies will frequently have a large number of small
matched sets, the a_i are nuisance parameters that can cause problems in estimation
(Cox and Hinkley, 1974). In the example discussed below, there are 63 matched sets,
each consisting of one case and four controls, with information on seven exposure
variables for every subject.
The problem with estimating an unconditional model for these data is that we would
need to include 63 - 1 = 62 dummy variables for the strata. This would leave us with
possibly 70 parameters being estimated for a data set with only 315 observations.
Furthermore, increasing the sample size will not help because an additional stratum
parameter would have to be estimated for each additional matched set in the study
sample. By working with the appropriate conditional likelihood, however, the nuisance
parameters can be eliminated, simplifying estimation and protecting against potential
biases that may arise in the unconditional model (Cox, 1975; Chamberlain, 1980). The
conditional model requires estimation only of the relative risk parameters of interest.
LOGIT allows the estimation of models for matched sample case-control studies
with one case and any number of controls per set. Thus, matched pair studies, as well
as studies with varying numbers of controls per case, are easily handled. However, not
all commands discussed so far are available for conditional logistic regression.


Discrete Choice Logit


Econometricians and psychometricians have developed a version of logit frequently
called the discrete choice model, or McFadden's conditional logit model
(McFadden, 1973, 1976, 1982, 1984; Hensher and Johnson, 1981; Ben-Akiva and
Lerman, 1985; Train, 1986; Luce, 1959). This multinomial model differs from the
standard polytomous logit in the interpretation of the coefficients, the number of
parameters estimated, the syntax of the model sentence, and options for data layout.
The discrete choice framework is designed specifically to model an individual's
choices in response to the characteristics of the choices. Characteristics of choices are
attributes such as price, travel time, horsepower, or calories; they are features of the
alternatives that an individual might choose from. By contrast, characteristics of the
chooser, such as age, education, income, and marital status, are attributes of a person.
The classic application of the discrete choice model has been to the choice of travel
mode to work (Domencich and McFadden, 1975). Suppose a person has three
alternatives: private auto, car pool, and commuter train. The individual is assumed to
have a utility function representing the desirability of each option, with the utility of
an alternative depending solely on its own characteristics. With travel time and travel
cost as key characteristics determining mode choice, the utility of each option could be
written as:

\[ U_i = b_1 T_i + b_2 C_i + e_i \]
where i = 1, 2, 3 represents private auto, car pool, and train, respectively. In this
random utility model, the utility U_i of the ith alternative is determined by the travel
time T_i, the cost C_i of that alternative, and a random error term, e_i. Utility of an
alternative is assumed not to be influenced by the travel times or costs of other
alternatives available, although choice will be determined by the attributes of all
available alternatives. In addition to the alternative characteristics, utility is sometimes
also determined by an alternative specific constant.
The choice model specifies that an individual will choose the alternative with the
highest utility as determined by the equation above. Because of the random
component, we are reduced to making statements concerning the probability that a
given choice is made. If the error terms are distributed as i.i.d. extreme value, it can be
shown that the probability of the ith alternative being chosen is given by the familiar
logit formula

\[ \mathrm{Prob}(U_i > U_j \text{ for all } j \neq i) = \frac{\exp(X_i b)}{\sum_j \exp(X_j b)} \]
Suppose that for the first few cases our data are as follows:

Subject  Choice  Auto(1)  Auto(2)  Pool(1)  Pool(2)  Train(1)  Train(2)  Sex     Age
   1        1       20      3.50      35      2.00      65       1.10    Male     27
   2        3       45      6.00      65      3.00      65       1.00    Female   35
   3        1       15      1.00      30      0.50      60       1.00    Male     22
   4        2       60      5.50      70      2.00      90       2.00    Male     45
   5        3       30      4.25      40      1.75      55       1.50    Male     52

The third record has a person who chooses to go to work by private auto (choice = 1);
when he drives, it takes 15 minutes to get to work and costs one dollar. Had he
carpooled instead, it would have taken 30 minutes to get to work and cost 50 cents. The
train would have taken an hour and cost one dollar. For this case, the utility of each
option is given by
U(private auto)= b1*15 + b2*1.00 + error13
U(car pool) = b1*30 + b2* 0.50 + error23
U(train) = b1*60 + b2*1.00 + error33

The error term has two subscripts, one pertaining to the alternative and the other
pertaining to the individual. The error is individual-specific and is assumed to be
independent of any other error or variable in the data set. The parameters b1 and b2
are common utility weights applicable to all individuals in the sample. In this example,
these are the only parameters, and their number does not depend on the number of
alternatives individuals can choose from. If a person also had the option of walking to
work, we would expand the model to include this alternative with
U (walking) = b1*70 + b2*0.00 + error43

and we would still be dealing with only the two regression coefficients b1 and b2.
This highlights a major difference between the discrete choice and standard
polytomous logit models. In polytomous logit, the number of parameters grows with
the number of alternatives; if the value of NCAT is increased from 3 to 4, a whole new
vector of parameters is estimated. By contrast, in the discrete choice model without a
constant, increasing the number of alternatives does not increase the number of
discrete choice parameters estimated.
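
The choice probabilities for the third subject can be sketched in a few lines of Python. This is only an illustration of the logit formula, not SYSTAT code: the function name choice_probabilities and the utility weights b1 = -0.05 and b2 = -0.40 are hypothetical values chosen for the example, not estimates from any data set.

```python
import math

def choice_probabilities(attributes, weights):
    """Conditional logit: P(i) = exp(x_i . b) / sum_j exp(x_j . b)."""
    utilities = [sum(w * x for w, x in zip(weights, row)) for row in attributes]
    top = max(utilities)  # subtract the maximum for numerical stability
    expu = [math.exp(u - top) for u in utilities]
    total = sum(expu)
    return [e / total for e in expu]

# Hypothetical utility weights (NOT estimates from the manual):
# b1 per minute of travel time, b2 per dollar of cost.
b = (-0.05, -0.40)

# Subject 3: (time in minutes, cost in dollars) for auto, car pool, train.
subject3 = [(15, 1.00), (30, 0.50), (60, 1.00)]
p3 = choice_probabilities(subject3, b)

# Adding walking (70 minutes, no cost) adds an alternative, not a parameter.
p4 = choice_probabilities(subject3 + [(70, 0.00)], b)

print([round(p, 3) for p in p3])
print([round(p, 3) for p in p4])
```

The same two weights serve both the three-alternative and the four-alternative choice set, which is the point made above about the parameter count.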

Finally, we need to look at the optional constant. Optional is emphasized because it
is perfectly legitimate to estimate without a constant, and, in certain circumstances, it
is even necessary to do so. If we were to add a constant to the travel mode model, we
would obtain the following utility equations:

U_i = b_{0i} + b_1 T_i + b_2 C_i + e_i

where i = 1, 2, 3 represents private auto, car pool, and train, respectively. The
constant here, b_{0i}, is alternative-specific, with a separate one estimated for each
alternative: b_{01} corresponds to private auto; b_{02}, to car pooling; and b_{03}, to train. As in
polytomous logit, the constant pertaining to the reference group is normalized to 0 and
is not estimated.
An alternative specific CONSTANT is entered into a discrete choice model to capture
unmeasured desirability of an alternative. Thus, the first constant could reflect the
convenience and comfort of having your own car (or in some cities the inconvenience
of having to find a parking space), and the second might reflect the inflexibility of
schedule associated with shared vehicles. With NCAT=3, the third constant will be
normalized to 0.

Stepwise Logit
Automatic model selection can be extremely useful for analyzing data with a large
number of covariates for which there is little or no guidance from previous research.
For these situations, LOGIT supports stepwise regression, allowing forward, backward,
mixed, and interactive covariate selection, with full control over forcing, selection
criteria, and candidate variables (including interactions). The procedure is based on
Peduzzi, Holford, and Hardy (1980).
Stepwise regression results in a model that cannot be readily evaluated using
conventional significance criteria in hypothesis tests, but the model may prove useful
for prediction. We strongly suggest that you separate the sample into learning and test
sets for assessment of predictive accuracy before fitting a model to the full data set. See
the cautionary discussion and references in Chapter 14.

Logistic Regression in SYSTAT


Estimate Model Main Dialog Box
Logistic regression analysis provides tools for model building, model evaluation,
prediction, simulation, hypothesis testing, and regression diagnostics.
Many of the results generated by modeling, testing, or diagnostic procedures can be
saved to SYSTAT data files for subsequent graphing and display. New data handling
features for the discrete choice model allow tremendous savings in disk space when
choice attributes are constant, and in some models, performance is greatly improved.
The Logit Estimate Model dialog box is shown below.

• Dependent. Select the variable you want to examine. The dependent variable should
  be a categorical numeric variable.
• Independent(s). Select one or more continuous or categorical variables. To add an
  interaction to your model, use the Cross button. For example, to add the term
  SEX*EDUCATION, add SEX to the Independent list and then add EDUCATION by
  clicking Cross.
• Conditional(s). Select conditional variables. To add interactive conditional
  variables to your model, use the Cross button. For example, to add the term
  SEX*EDUCATION, add SEX to the Conditional list and then add EDUCATION by
  clicking Cross.
• Include constant. The constant is an optional parameter. Deselect Include constant
  to obtain a model through the origin. When in doubt, include the constant.
• Prediction table. Produces a prediction-of-success table, which summarizes the
  classificatory power of the model.
• Quasi maximum likelihood. Specifies that the covariance matrix will be
  quasi-maximum likelihood adjusted after the first iteration. If this matrix is
  calculated, it will be used during subsequent hypothesis testing and will affect
  t ratios for estimated parameters.
• Save file. Saves specified statistics in filename.SYD.

Click the Options button to go to the Categories, Discrete Choice, and Estimation
Options dialog boxes.

Categories
You must specify numeric or string grouping variables that define cells. Specify all
categorical variables for which logistic regression analysis should generate design
variables.

Categorical Variable(s). Categorize an independent variable when it has several
categories; for example, education levels, which could be divided into the following
categories: less than high school, some high school, finished high school, some
college, finished bachelor's degree, finished master's degree, and finished doctorate.
On the other hand, a variable such as age in years would not be categorical unless age
were broken up into categories such as under 21, 21-65, and over 65.
Effect. Produces parameter estimates that are differences from group means.

Dummy. Produces dummy codes for the design variables instead of effect codes.
Coding of dummy variables is the classic analysis of variance parameterization, in
which the sum of effects estimated for a classifying variable is 0. If your categorical
variable has k categories, k - 1 dummy variables are created.

Discrete Choice
The discrete choice framework is designed specifically to model an individual's
choices in response to the characteristics of the choices. Characteristics of choices are
attributes such as price, travel time, horsepower, or calories; they are features of the
alternatives that an individual might choose from. You can define set names for groups
of variables, and create, edit, or delete variables.

Set Name. Specifies conditional variables. Enter a set name and then you can add and
cross variables. To create a new set, click New. Repeat this process until you have
defined all of your sets. You can edit existing sets by highlighting the name of the set
in the Set Name drop-down list. To delete a set, select the set in the drop-down list and
click Delete. When you click Continue, SYSTAT will check that each set name has a
definition. If a set name exists but no variables were assigned to it, the set is discarded
and the set name will not be in the drop-down list when you return to this dialog box.
Alternatives for discrete choice. Specify an alternative for discrete choice.
Characteristics of choice are features of the alternatives that an individual might
choose between. This setting is needed only when the number of alternatives in a choice
model varies per subject.
Number of categories. Specify the number of categories or alternatives the variable has.
This is needed only for the by-choice data layout where the values of the dependent
variable are not explicitly coded. This is only enabled when the Alternatives for discrete
choice field is not empty.

Options
The Logit Options dialog box allows you to specify convergence and a tolerance level,
select complete or stepwise entry, and specify entry and removal criteria.

Converge. Specifies the largest relative change in any coordinate before iterations
terminate.
Tolerance. Prevents the entry of a variable that is highly correlated with the
independent variables already included in the model. Enter a value between 0 and 1.
Typical values are 0.01 or 0.001. The higher the value (closer to 1), the lower the
correlation required to exclude a variable.
Estimation. Controls the method used to enter and remove variables from the equation.
• Complete. All independent variables are entered in a single step.
• Stepwise. Allows forward, backward, mixed, and interactive covariate selection,
  with full control over forcing, selection criteria, and candidates, including
  interactions. It results in a model that can be useful for prediction.
Stepwise Options. The following alternatives are available for stepwise entry and
removal:
• Backward. Begins with all candidate variables in the model. At each step, SYSTAT
  removes the variable with the largest Remove value.
• Forward. Begins with no variables in the model. At each step, SYSTAT adds the
  variable with the smallest Enter value.
• Automatic. For Backward, SYSTAT automatically removes a variable from your
  model at each step. For Forward, SYSTAT automatically adds a variable to the
  model at each step.
• Interactive. Allows you to use your own judgment in selecting variables for
  addition or deletion.
Probability. You can also control the criteria used to enter variables into and remove
variables from the model:
• Enter. Enters a variable into the model if its alpha value is less than the specified
  value. Enter a value between 0 and 1 (for example, 0.025).
• Remove. Removes a variable from the model if its alpha value is greater than the
  specified value. Enter a value between 0 and 1 (for example, 0.025).


Force. Forces the first n variables listed in your model to remain in the equation.
Max step. Specifies the maximum number of steps.

Deciles of Risk
After you successfully estimate your model using logistic regression, you can calculate
deciles of risk. This will help you make sure that your model fits the data and that the
results are not unduly influenced by a handful of unusual observations. In using the
deciles of risk table, please note that the goodness-of-fit statistics will depend on the
grouping rule specified.

Two grouping rules are available:
• Based on probability values. Probability is reallocated across the possible values of
  the dependent variable as the independent variable changes. It provides a global
  view of covariate effects that is not easily seen when considering each binary
  submodel separately. In fact, the overall effect of a covariate on the probability of
  an outcome can be of the opposite sign of its coefficient estimate in the
  corresponding submodel. This is because the submodel concerns only two of the
  outcomes, whereas the derivative table considers all outcomes at once.
• Based on equal counts per bin. Allocates approximately equal numbers of
  observations to each cell. Enter the number of cells or bins in the Number of bins
  text box.

Quantiles
After estimating your model, you can calculate quantiles for any single predictor in the
model. Quantiles of unadjusted data can be useful in assessing the suitability of a
functional form when you are interested in the unconditional distribution of the failure
times.

Covariate(s). The Covariate(s) list contains all of the variables specified in the
Independent list in the main Logit dialog box. You can set any of the covariates to a
fixed value by selecting the variable in the Covariates list and entering a value in the
Value text box. This constraint appears as variable name = value in the Fixed Value
Settings list after you click Add. The quantiles for the desired variable correspond to a
model in which the covariates are fixed at these values. Any covariates not fixed to a
value are assigned the value of 0.
Quantile Value Variable. By default, the first variable in the Independent variable list in
the main dialog box is shown in this field. You can change this to any variable from
the list. This variable name is then issued as the argument for the QNTL command.

Simulation
SYSTAT allows you to generate and save predicted probabilities and odds ratios, using
the last model estimated to evaluate a set of logits. The logits are calculated from a
combination of fixed covariate values that you specify in this dialog box.

Covariate(s). The Covariate(s) list contains all of the variables specified in the
Independent list on the main Logit dialog box. Select a covariate, enter a fixed value
for the covariate in the Value text box, and click Add.
Value. Enter the value over which the parameters of the simulation are to vary.
Fixed value settings. This box lists the fixed values on the covariates from which the
logits are calculated.

When you click OK, SYSTAT prompts you to specify a file to which the simulation
results will be saved.

Hypothesis
After you successfully estimate your model using logistic regression, you can perform
post hoc analyses.

Enter the hypotheses that you would like to test. All the hypotheses that you list will
be tested jointly in a single test. To test each restriction individually, you will have to
revisit this dialog box each time. To reference dummies generated from categorical
covariates, use square brackets, as in:

RACE[1] = 0

You can reproduce the Wald version of the t ratio by testing whether a coefficient is 0:

AGE = 0

If you don't specify a sub-vector, the first is assumed; thus, the constraint above is
equivalent to:

AGE{1} = 0

Using Commands
After selecting a file with USE filename, continue with:
LOGIT
   CATEGORY grpvarlist / MISS EFFECT DUMMY
   NCAT=n
   ALT var
   SET parameter=condvarlist
   MODEL depvar = CONSTANT + indvarexp
         depvar = condvarlist;polyvarlist
   ESTIMATE / PREDICT TOLERANCE=d CONVERGE=d QML MEANS CLASS
              DERIVATIVE=INDIVIDUAL or AVERAGE
or
   START / BACKWARD FORWARD ENTER=d REMOVE=d FORCE=n
           MAXSTEP=n
   STEP var or + or - / AUTO
   (sequence of STEPs)
   STOP
   SAVE
   DC / SMART=n P=p1,p2,...
   QNTL var / covar=d covar=d
   SIMULATE var1=d1, var2=d2, ... / DO var1=d1,d2,d3, var2=d1,d2,d3
   HYPOTHESIS
   CONSTRAIN argument
   TEST

Usage Considerations
Types of data. LOGIT uses rectangular data only. The dependent variable is
automatically taken to be categorical. To change the order of the categories, use the
ORDER statement. For example,
ORDER CLASS / SORT=DESCENDING
LOGIT can also handle categorical predictor variables. Use the CATEGORY statement
to create them, and use the EFFECT or DUMMY options of CATEGORY to determine
the coding method. Use the ORDER command to change the order of the categories.

Print options. For PRINT=SHORT, the output gives N, the type of association, parameter
estimates, and associated tests. PRINT=LONG gives, in addition to the above results, a
correlation matrix of the parameter estimates.
Quick Graphs. LOGIT produces no Quick Graphs. Use the saved files from ESTIMATE
or DC to produce diagnostic plots and fitted curves. See the examples.
Saving files. LOGIT saves simulation results, quantiles, or residuals and estimated
values.
BY groups. LOGIT analyzes data by groups.
Bootstrapping. Bootstrapping is not available in this procedure.
Case frequencies. LOGIT uses the FREQ variable, if present, to weight cases. This
inflates the total degrees of freedom to be the sum of the frequencies. Using
a FREQ variable does not require more memory, however. Cases whose value on the
FREQ variable is less than or equal to 0 are deleted from the analysis. The FREQ
variable may take non-integer values. When the FREQ command is in effect, separate
unweighted and weighted case counts are printed.
Weighting can be used to compensate for sampling schemes that stratify on the
covariates, giving results that more accurately reflect the population. Weighting is also
useful for market share predictions from samples stratified on the outcome variable in
discrete choice models. Such samples are known as choice-based in the econometric
literature (Manski and Lerman, 1977; Manski and McFadden, 1980; Coslett, 1980) and
are common in matched-sample case-control studies where the cases are usually oversampled, and in market research studies where persons who choose rare alternatives
are sampled separately.
Case weights. LOGIT does not allow case weighting.

Examples
The following examples begin with the simple binary logit model and proceed to more
complex multinomial and discrete choice logit models. Along the way, we will
examine diagnostics and other options used for applications in various fields.

Example 1
Binary Logit
To illustrate the use of binary logistic regression, we take this example from Hosmer
and Lemeshow's book Applied Logistic Regression, referred to below as H&L.
Hosmer and Lemeshow consider data on low infant birth weight (LOW) as a function
of several risk factors. These include the mother's age (AGE), mother's weight during
last menstrual period (LWT), race (RACE = 1: white, RACE = 2: black, RACE = 3:
other), smoking status during pregnancy (SMOKE), history of premature labor (PTL),
hypertension (HT), uterine irritability (UI), and number of physician visits during first
trimester (FTV). The dependent variable is coded 1 for birth weights less than 2500
grams and coded 0 otherwise. These variables have previously been identified as
associated with low birth weight in the obstetrical literature.
The first model considered is the simple regression of LOW on a constant and LWD,
a dummy variable coded 1 if LWT is less than 110 pounds and coded 0 otherwise. (See
H&L, Table 3.17.) LWD and LWT are similar variable names. Be sure to note which is
being used in the models that follow.
The input is:
USE HOSLEM
LOGIT
MODEL LOW=CONSTANT+LWD
ESTIMATE

The output begins with a listing of the dependent variable and the sample split between
0 (reference) and 1 (response) for the dependent variable. A brief iteration history
follows, showing the progress of the procedure to convergence. Finally, the parameter
estimates, standard errors, standardized coefficients (popularly called t ratios), p
values, and the log-likelihood are presented.

Variables in the SYSTAT Rectangular file are:
   ID       LOW      AGE      LWT      RACE     SMOKE
   PTL      HT       UI       FTV      BWT      RACE1
   CASEID   PTD      LWD

Categorical values encountered during processing are:
LOW (2 levels)
   0, 1

Binary LOGIT Analysis.

Dependent variable: LOW
Input records:         189
Records for analysis:  189

Sample split
Category choices
   REF        59
   RESP      130
   Total :   189

L-L at iteration 1 is   -131.005
L-L at iteration 2 is   -113.231
L-L at iteration 3 is   -113.121
L-L at iteration 4 is   -113.121
Log Likelihood:         -113.121

   Parameter      Estimate     S.E.    t-ratio   p-value
 1 CONSTANT         -1.054    0.188     -5.594     0.000
 2 LWD               1.054    0.362      2.914     0.004

                                 95.0 % bounds
   Parameter    Odds Ratio      Upper      Lower
 2 LWD               2.868      5.826      1.412

Log Likelihood of constants only model = LL(0) =  -117.336
2*[LL(N)-LL(0)] =    8.431 with 1 df Chi-sq p-value = 0.004
McFadden's Rho-Squared =    0.036

Coefficients
We can evaluate these results much like a linear regression. The coefficient on LWD is
large relative to its standard error (t ratio = 2.91) and so appears to be an important
predictor of low birth weight. The interpretation of the coefficient is quite different
from ordinary regression, however. The logit coefficient tells how much the logit
increases for a unit increase in the independent variable, but the probability of a 0 or 1
outcome is a nonlinear function of the logit.

Odds Ratio
The odds-ratio table provides a more intuitively meaningful quantity for each
coefficient. The odds of the response are given by p / (1 - p), where p is the
probability of response, and the odds ratio is the multiplicative factor by which the
odds change when the independent variable increases by one unit. In the first model,
being a low-weight mother increases the odds of a low birth weight baby by a
multiplicative factor of 2.87, with lower and upper confidence bounds of 1.41 and 5.83,
respectively. Since the lower bound is greater than 1, the variable appears to represent
a genuine risk factor. See Kleinbaum, Kupper, and Chambliss (1982) for a discussion.
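
The odds ratio and its bounds can be checked by hand from the printed estimate and standard error. The Python sketch below assumes a Wald-type interval, exp(estimate ± 1.96 × S.E.); that assumption is ours, although it agrees with the printed bounds to within the rounding of the inputs.

```python
import math

# LWD coefficient and standard error as printed in the Example 1 output.
estimate, se = 1.054, 0.362
z = 1.96  # normal quantile for a 95% interval (assumed Wald interval)

odds_ratio = math.exp(estimate)
lower = math.exp(estimate - z * se)
upper = math.exp(estimate + z * se)

print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))
```

Because the inputs are rounded to three decimals, the last digit can differ slightly from the values LOGIT prints.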

Example 2
Binary Logit with Multiple Predictors
The binary logit example contains only a constant and a single dummy variable. We
consider the addition of the continuous variable AGE to the model.
The input is:
USE HOSLEM
LOGIT
MODEL LOW=CONSTANT+LWD+AGE
ESTIMATE / MEANS

The output follows:


Variables in the SYSTAT Rectangular file are:
   ID       LOW      AGE      LWT      RACE     SMOKE
   PTL      HT       UI       FTV      BWT      RACE1
   CASEID   PTD      LWD

Categorical values encountered during processing are:
LOW (2 levels)
   0, 1

Binary LOGIT Analysis.

Dependent variable: LOW
Input records:         189
Records for analysis:  189

Sample split
Category choices
   REF        59
   RESP      130
   Total :   189

Independent variable MEANS

   PARAMETER             0         -1    OVERALL
 1 CONSTANT          1.000      1.000      1.000
 2 LWD               0.356      0.162      0.222
 3 AGE              22.305     23.662     23.238

L-L at iteration 1 is   -131.005
L-L at iteration 2 is   -112.322
L-L at iteration 3 is   -112.144
L-L at iteration 4 is   -112.143
L-L at iteration 5 is   -112.143
Log Likelihood:         -112.143

   Parameter      Estimate     S.E.    t-ratio   p-value
 1 CONSTANT         -0.027    0.762     -0.035     0.972
 2 LWD               1.010    0.364      2.773     0.006
 3 AGE              -0.044    0.032     -1.373     0.170

                                 95.0 % bounds
   Parameter    Odds Ratio      Upper      Lower
 2 LWD               2.746      5.607      1.345
 3 AGE               0.957      1.019      0.898

Log Likelihood of constants only model = LL(0) =  -117.336
2*[LL(N)-LL(0)] =   10.385 with 2 df Chi-sq p-value = 0.006
McFadden's Rho-Squared =    0.044

We see the means of the independent variables overall and by value of the dependent
variable. In this sample, there is a substantial difference between the mean LWD across
birth weight groups but an apparently small AGE difference.
AGE is clearly not significant by conventional standards if we look at the
coefficient/standard-error ratio. The confidence interval for the odds ratio (0.898,
1.019) includes 1.00, indicating no effect in relative risk, when adjusting for LWD.
Before concluding that AGE does not belong in the model, H&L consider the
interaction of AGE and LWD.
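
The coefficient/standard-error ratios and their p values can be reproduced approximately from the printed table. The sketch below assumes the p values are two-sided tail areas of the standard normal distribution; the helper name wald is ours, and the inputs are the rounded values from the output, so the last digit may differ.

```python
import math

def wald(estimate, se):
    """Return the t ratio and a two-sided normal p value (assumed reference distribution)."""
    t = estimate / se
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p

# Estimates and standard errors as printed in the Example 2 output.
printed = {"CONSTANT": (-0.027, 0.762), "LWD": (1.010, 0.364), "AGE": (-0.044, 0.032)}
results = {name: wald(est, se) for name, (est, se) in printed.items()}

for name, (t, p) in results.items():
    print(f"{name:9s} t = {t:7.3f}   p = {p:.3f}")
```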

Example 3
Binary Logit with Interactions
In this example, we fit a model consisting of a constant, a dummy variable, a
continuous variable, and an interaction. Note that it is not necessary to create a new
interaction variable; this is done for us automatically by writing the interaction on the
MODEL statement. Let's also add a prediction table for this model.
Following is the input:
USE HOSLEM
LOGIT
MODEL LOW=CONSTANT+LWD+AGE+LWD*AGE
ESTIMATE / PREDICTION
SAVE SIM319/SINGLE,SAVE ODDS RATIOS FOR H&L TABLE 3.19
SIMULATE CONSTANT=0,AGE=0,LWD=1 / DO LWD*AGE =15,45,5
USE SIM319
LIST

The output follows:


Variables in the SYSTAT Rectangular file are:
   ID       LOW      AGE      LWT      RACE     SMOKE
   PTL      HT       UI       FTV      BWT      RACE1
   CASEID   PTD      LWD

Categorical values encountered during processing are:
LOW (2 levels)
   0, 1

Binary LOGIT Analysis.

Dependent variable: LOW
Input records:         189
Records for analysis:  189

Sample split
Category choices
   REF        59
   RESP      130
   Total :   189

L-L at iteration 1 is   -131.005
L-L at iteration 2 is   -110.937
L-L at iteration 3 is   -110.573
L-L at iteration 4 is   -110.570
L-L at iteration 5 is   -110.570
Log Likelihood:         -110.570

   Parameter      Estimate     S.E.    t-ratio   p-value
 1 CONSTANT          0.774    0.910      0.851     0.395
 2 LWD              -1.944    1.725     -1.127     0.260
 3 AGE              -0.080    0.040     -2.008     0.045
 4 AGE*LWD           0.132    0.076      1.746     0.081

                                 95.0 % bounds
   Parameter    Odds Ratio      Upper      Lower
 2 LWD               0.143      4.206      0.005
 3 AGE               0.924      0.998      0.854
 4 AGE*LWD           1.141      1.324      0.984

Log Likelihood of constants only model = LL(0) =  -117.336
2*[LL(N)-LL(0)] =   13.532 with 3 df Chi-sq p-value = 0.004
McFadden's Rho-Squared =    0.058

Model Prediction Success Table

Actual            Predicted Choice        Actual
Choice           Response   Reference      Total

Response           21.280      37.720     59.000
Reference          37.720      92.280    130.000

Pred. Tot.         59.000     130.000    189.000
Correct             0.361       0.710
Success Ind.        0.049       0.022
Tot. Correct        0.601

Sensitivity:        0.361    Specificity:       0.710
False Reference:    0.639    False Response:    0.290

Simulation Vector
   Fixed Parameter     Value
 1 CONSTANT              0.0
 2 LWD                 1.000
 3 AGE                   0.0

   Loop Parameter    Minimum    Maximum  Increment
 4 AGE*LWD            15.000     45.000      5.000

SYSTAT save file created.
7 records written to SYSTAT save file.

Case number  LOGIT  SELOGIT   PROB  PLOWER  PUPPER    ODDS  ODDSL    ODDSU  LOOP(1)
     1        0.04     0.66   0.51    0.22    0.79    1.04   0.28     3.79    15.00
     2        0.70     0.40   0.67    0.48    0.82    2.01   0.91     4.44    20.00
     3        1.36     0.42   0.80    0.63    0.90    3.90   1.71     8.88    25.00
     4        2.02     0.69   0.88    0.66    0.97    7.55   1.95    29.19    30.00
     5        2.68     1.03   0.94    0.66    0.99   14.63   1.94   110.26    35.00
     6        3.34     1.39   0.97    0.65    1.00   28.33   1.85   432.77    40.00
     7        4.00     1.76   0.98    0.64    1.00   54.86   1.75  1724.15    45.00

Likelihood-Ratio Statistic
At this point, it would be useful to assess the model as a whole. One method of model
evaluation is to consider the likelihood-ratio statistic. This statistic tests the hypothesis
that all coefficients except the constant are 0, much like the F test reported below linear
regressions. The likelihood-ratio statistic (LR for short) of 13.532 is chi-squared with
three degrees of freedom and a p value of 0.004. The degrees of freedom are equal to
the number of covariates in the model, not including the constant. McFadden's
rho-squared is a transformation of the LR statistic intended to mimic an R-squared. It is
always between 0 and 1, and a higher rho-squared corresponds to more significant
results. Rho-squared tends to be much lower than R-squared though, and a low number
does not necessarily imply a poor fit. Values between 0.20 and 0.40 are considered
very satisfactory (Hensher and Johnson, 1981).
Models can also be assessed relative to one another. A likelihood-ratio test is
formally conducted by computing twice the difference in log-likelihoods for any pair
of nested models. Commonly called the G statistic, it has degrees of freedom equal to
the difference in the number of parameters estimated in the two models. Comparing the
current model with the model without the interaction, we have

G = 2 * (112.14338 - 110.56997) = 3.14684

with one degree of freedom, which has a p value of 0.076. This result corresponds to
the bottom row of H&L's Table 3.17. The conclusion of the test is that the interaction
approaches significance.
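
These quantities are easy to recompute from the printed log-likelihoods. The Python sketch below assumes the usual definition of McFadden's rho-squared, 1 - LL(N)/LL(0), which agrees with the printed value, and uses the closed form of the chi-squared upper tail for one degree of freedom. The variable names are ours.

```python
import math

ll_null = -117.336    # constants-only model, LL(0)
ll_main = -112.14338  # CONSTANT + LWD + AGE
ll_full = -110.56997  # adds the AGE*LWD interaction

# Overall likelihood-ratio statistic and McFadden's rho-squared for the full model.
lr = 2 * (ll_full - ll_null)
rho_squared = 1 - ll_full / ll_null

# G statistic comparing the nested models (1 df), with its chi-squared p value.
g = 2 * (ll_full - ll_main)
p_g = math.erfc(math.sqrt(g / 2))  # upper tail of chi-squared with 1 df

print(round(lr, 3), round(rho_squared, 3), round(g, 3), round(p_g, 3))
```

Recomputing G from the rounded log-likelihoods can differ from the printed value in the last decimal place.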

Prediction Success Table


The output also includes a prediction success table, which summarizes the
classificatory power of the model. The rows of the table show how observations from
each level of the dependent variable are allocated to predicted outcomes. Reading
across the first (Response) row we see that of the 59 cases of low birth weight, 21.28
are correctly predicted and 37.72 are incorrectly predicted. The second row shows that
of the 130 not-LOW cases, 37.72 are incorrectly predicted and 92.28 are correctly
predicted.
By default, the prediction success table sums predicted probabilities into each cell;
thus, each observation contributes a fractional amount to both the Response and
Reference cells in the appropriate row. Column sums give predicted totals for each
outcome, and row sums give observed totals. These sums will always be equal for
models with a constant.
The table also includes additional analytic results. The Correct row is the proportion
successfully predicted, defined as the diagonal table entry divided by the column total,
and Tot. Correct is the ratio of the sum of the diagonal elements in the table to the total
number of observations. In the Response column, 21.28 are correctly predicted out of
a column total of 59, giving a correct rate of 0.3607. Overall, 21.28 + 92.28 out of a
total of 189 are correct, giving a total correct rate of 0.6009.
Success Ind. is the gain that this model shows over a purely random model that
assigned the same probability of LOW to every observation in the data. The model
produces a gain of 0.0485 over the random model for responses and 0.0220 for
reference cases. Based on these results, we would not think too highly of this model.
In the biostatistical literature, another terminology is used for these quantities. The
Correct quantity is also known as sensitivity for the Response group and specificity
for the Reference group. The False Reference rate is the fraction of those predicted to
respond that actually did not respond, while the False Response rate is the fraction of
those predicted to not respond that actually responded.
We prefer the prediction success terminology because it is applicable to the
multinomial case as well.
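
The summary rows of the prediction success table follow directly from its four cells. The Python sketch below recomputes them; it assumes the success index is the Correct rate minus the observed share of that outcome, a reading that reproduces the printed 0.049 and 0.022. The dictionary layout and names are ours.

```python
# Cell entries from the prediction success table (rows = actual, columns = predicted).
table = {
    "Response":  {"Response": 21.28, "Reference": 37.72},
    "Reference": {"Response": 37.72, "Reference": 92.28},
}
outcomes = ["Response", "Reference"]
n = sum(table[a][p] for a in outcomes for p in outcomes)

summary = {}
for k in outcomes:
    actual_total = sum(table[k].values())            # row sum (observed total)
    pred_total = sum(table[a][k] for a in outcomes)  # column sum (predicted total)
    correct = table[k][k] / pred_total               # diagonal entry / column total
    summary[k] = {
        "correct": correct,
        "success_index": correct - actual_total / n,  # assumed: gain over the observed share
        "false_rate": 1 - correct,                    # share of the column that is misallocated
    }

total_correct = sum(table[k][k] for k in outcomes) / n

for k in outcomes:
    print(k, {name: round(value, 3) for name, value in summary[k].items()})
print("Tot. Correct", round(total_correct, 3))
```

In the biostatistical terminology, the false_rate of the Response column is the False Reference rate and that of the Reference column is the False Response rate.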

Simulation
To understand the implications of the interaction, we need to explore how the relative
risk of low birth weight varies over the typical child-bearing years. This changing
relative risk is evaluated by computing the logit difference for base and comparison
groups. The logit for the base group, mothers with LWD = 0, is written as L(0); the logit
for the comparison group, mothers with LWD = 1, is L(1). Thus,

L(0) = CONSTANT          + B2*AGE
L(1) = CONSTANT + B1*LWD + B2*AGE + B3*LWD*AGE
     = CONSTANT + B1     + B2*AGE + B3*AGE

since, for L(1), LWD = 1. The logit difference is

L(1) - L(0) = B1 + B3*LWD*AGE

which is the coefficient on LWD plus the interaction multiplied by its coefficient. The
difference L(1) - L(0) evaluated for a mother of a given age is a measure of the log relative
risk due to LWD being 1. This can be calculated simply for several ages, and converted
to odds ratios with upper and lower confidence bounds, using the SIMULATE
command.
SIMULATE calculates the predicted logit, predicted probability, odds ratio, upper
and lower bounds, and the standard error of the logit for any specified values of the
covariates. In the above command, the constant and age are set to 0, because these
coefficients do not appear in the logit difference. LWD is set to 1, and the interaction
is allowed to vary from 15 to 45 in increments of five years. The only printed output
produced by this command is a summary report.
SIMULATE does not print results when a DO LOOP is specified because of the
potentially large volume of output it can generate. To view the results, use the
commands:
USE SIM319
LIST

The results give the effect of low maternal weight (LWD) on low birth weight as a
function of age, where LOOP(1) is the value of AGE * LWD (which is just AGE) and
ODDSU and ODDSL are upper and lower bounds of the odds ratio. We see that the
effect of LWD goes up dramatically with age, although the confidence interval
becomes quite large beyond age 30. The results presented here are calculated internally
within LOGIT and thus differ slightly from those reported in H&L, who use printed
output with fewer decimal places of precision to obtain their results.
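
The LOGIT and ODDS columns of the saved file can be approximated by hand from the printed coefficients. The Python sketch below evaluates the logit difference for LWD = 1, where LWD*AGE reduces to AGE. It uses the rounded estimates -1.944 and 0.132, so it matches the listing closely at the younger ages and drifts slightly at the older ones. The confidence bounds are not reproduced because they require the covariance of the two estimates, which is not printed.

```python
import math

# Rounded estimates from the Example 3 output: B1 (LWD) and B3 (AGE*LWD).
b1, b3 = -1.944, 0.132

rows = []
for age in range(15, 50, 5):   # 15, 20, ..., 45, as in the DO loop
    logit_diff = b1 + b3 * age  # L(1) - L(0) with LWD = 1
    rows.append((age, logit_diff, math.exp(logit_diff)))

for age, logit_diff, odds in rows:
    print(f"AGE {age:2d}   logit difference {logit_diff:5.2f}   odds ratio {odds:6.2f}")
```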

Example 4
Deciles of Risk and Model Diagnostics
Before turning to more detailed model diagnostics, we fit H&L's final model. As a
result of experimenting with more variables and a large number of interactions, H&L
arrive at the model used here. The input is:
USE HOSLEM
LOGIT
CATEGORY RACE / DUMMY
MODEL LOW=CONSTANT+AGE+RACE+SMOKE+HT+UI+LWD+PTD+ ,
AGE*LWD+SMOKE*LWD
ESTIMATE
SAVE RESID
DC / P=0.06850,0.09360,0.15320,0.20630,0.27810,0.33140,
0.42300,0.49124,0.61146
USE RESID
PPLOT PEARSON / SIZE=VARIANCE
PLOT DELPSTAT*PROB/SIZE=DELBETA(1)

The categorical variable RACE is specified to have three levels. By default LOGIT uses
the highest category as the reference group, although this can be changed. The model
includes all of the main variables except FTV, with LWT and PTL transformed into
dummy variable variants LWD and PTD, and two interactions. To reproduce the results
of Table 5.1 of H&L, we specify a particular set of cut points for the deciles of risk
table. Some of the results are:
Variables in the SYSTAT Rectangular file are:
 ID        LOW       AGE       LWT       RACE      SMOKE
 PTL       HT        UI        FTV       BWT       RACE1
 CASEID    PTD       LWD

Categorical values encountered during processing are:
RACE (3 levels)
        1,        2,        3
LOW (2 levels)
        0,        1

Binary LOGIT Analysis.
Dependent variable: LOW
Input records:            189
Records for analysis:     189
Sample split
  Category    choices
  REF              59
  RESP            130
  Total   :       189

L-L at iteration 1 is   -131.005
L-L at iteration 2 is    -98.066
L-L at iteration 3 is    -96.096
L-L at iteration 4 is    -96.006
L-L at iteration 5 is    -96.006
L-L at iteration 6 is    -96.006
Log Likelihood:          -96.006

    Parameter      Estimate      S.E.   t-ratio   p-value
  1 CONSTANT          0.248     1.068     0.232     0.816
  2 AGE              -0.084     0.046    -1.843     0.065
  3 RACE_1           -0.760     0.464    -1.637     0.102
  4 RACE_2            0.323     0.532     0.608     0.543
  5 SMOKE             1.153     0.458     2.515     0.012
  6 HT                1.359     0.661     2.055     0.040
  7 UI                0.728     0.479     1.519     0.129
  8 LWD              -1.730     1.868    -0.926     0.354
  9 PTD               1.232     0.471     2.613     0.009
 10 AGE*LWD           0.147     0.083     1.779     0.075
 11 SMOKE*LWD        -1.407     0.819    -1.719     0.086

                                 95.0 % bounds
    Parameter    Odds Ratio     Upper     Lower
  2 AGE               0.919     1.005     0.841
  3 RACE_1            0.468     1.162     0.188
  4 RACE_2            1.382     3.920     0.487
  5 SMOKE             3.168     7.781     1.290
  6 HT                3.893    14.235     1.065
  7 UI                2.071     5.301     0.809
  8 LWD               0.177     6.902     0.005
  9 PTD               3.427     8.632     1.360
 10 AGE*LWD           1.159     1.363     0.985
 11 SMOKE*LWD         0.245     1.218     0.049

Log Likelihood of constants only model = LL(0) =  -117.336
2*[LL(N)-LL(0)] =   42.660 with 10 df Chi-sq p-value = 0.000
McFadden's Rho-Squared =    0.182

Deciles of Risk
Records processed: 189
Sum of weights =  189.000

                    Statistic   p-value        df
Hosmer-Lemeshow*        5.231     0.733     8.000
Pearson               183.443     0.374   178.000
Deviance              192.012     0.224   178.000
* Large influence of one or more deciles may affect statistic.

Category       0.069     0.094     0.153     0.206     0.278
Resp Obs       0.0       1.000     4.000     2.000     6.000
     Exp       0.854     1.641     2.252     3.646     5.017
Ref  Obs      18.000    19.000    14.000    18.000    14.000
     Exp      17.146    18.359    15.748    16.354    14.983
Avg Prob       0.047     0.082     0.125     0.182     0.251

Category       0.331     0.423     0.491     0.611     1.000
Resp Obs       6.000     6.000    10.000     9.000    15.000
     Exp       5.566     6.816     8.570    10.517    14.122
Ref  Obs      12.000    12.000     9.000    10.000     4.000
     Exp      12.434    11.184    10.430     8.483     4.878
Avg Prob       0.309     0.379     0.451     0.554     0.743

SYSTAT save file created.
189 records written to %1 save file.

[Figure: normal probability plot of the Pearson residuals. Horizontal axis: PEARSON; vertical axis: Expected Value for Normal Distribution; plotting symbols sized by VARIANCE.]

Deciles of Risk
How well does a model fit the data? Are the results unduly influenced by a handful of
unusual observations? These are some of the questions we try to answer with our
model assessment tools. Besides the prediction success table and likelihood-ratio tests
(see the Binary Logit with Interactions example), the model assessment methods in
LOGIT include the Pearson chi-square, deviance and Hosmer-Lemeshow statistics, the
deciles of risk table, and a collection of residual, leverage, and influence quantities.
Most of these are produced by the DC command, which is invoked after estimating a
model.


The table in this example is generated by partitioning the sample into 10 groups
based on the predicted probability of the observations. The row labeled Category gives
the end points of the cells defining a group. Thus, the first group consists of all
observations with predicted probability between 0 and 0.069, the second group covers
the interval 0.069 to 0.094, and the last group contains observations with predicted
probability greater than 0.611.
The cell end points can be specified explicitly as we did or generated automatically
by LOGIT. Cells will be equally spaced if the DC command is given without any
arguments, and LOGIT will allocate approximately equal numbers of observations to
each cell when the SMART option is given, as:
DC / SMART = 10

which requests 10 cells. Within each cell, we are given a breakdown of the observed
and expected 0's (Ref) and 1's (Resp) calculated as in the prediction success table.
Expected 1's are just the sum of the predicted probabilities of 1 in the cell. In the table,
it is apparent that observed totals are close to expected totals everywhere, indicating a
fairly good fit. This conclusion is borne out by the Hosmer-Lemeshow statistic of 5.23,
which is approximately chi-squared with eight degrees of freedom. H&L discuss the
degrees of freedom calculation.
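
The Hosmer-Lemeshow statistic can be recomputed from the printed deciles of risk table as a check. The sketch below assumes the usual form of the statistic, the sum of (observed - expected)^2 / expected over the Resp and Ref counts of every cell, and takes the degrees of freedom as the number of cells minus 2. The function names are illustrative, and the chi-squared tail formula used here is the closed form that holds only for even degrees of freedom.

```python
import math

# Deciles of risk table for this example:
# (observed Resp, expected Resp, observed Ref, expected Ref) per cell.
cells = [
    (0.0, 0.854, 18.0, 17.146),
    (1.0, 1.641, 19.0, 18.359),
    (4.0, 2.252, 14.0, 15.748),
    (2.0, 3.646, 18.0, 16.354),
    (6.0, 5.017, 14.0, 14.983),
    (6.0, 5.566, 12.0, 12.434),
    (6.0, 6.816, 12.0, 11.184),
    (10.0, 8.570, 9.0, 10.430),
    (9.0, 10.517, 10.0, 8.483),
    (15.0, 14.122, 4.0, 4.878),
]

def hosmer_lemeshow(cells):
    """Sum of (observed - expected)^2 / expected over both outcomes in every cell."""
    stat = 0.0
    for resp_obs, resp_exp, ref_obs, ref_exp in cells:
        stat += (resp_obs - resp_exp) ** 2 / resp_exp
        stat += (ref_obs - ref_exp) ** 2 / ref_exp
    return stat

def chi2_sf_even_df(x, df):
    """Upper-tail chi-squared probability; closed form valid for even df only."""
    k = df // 2
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

hl = hosmer_lemeshow(cells)
p_value = chi2_sf_even_df(hl, df=len(cells) - 2)
print(f"Hosmer-Lemeshow = {hl:.3f}, p = {p_value:.3f}")
```

Because the table entries are rounded to three decimals, the recomputed value agrees with the printed 5.231 and 0.733 only to about two decimal places.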
In using the deciles of risk table, it should be noted that the goodness-of-fit statistics
will depend on the grouping rule specified and that not all statistics programs will
apply the same rules. For example, some programs assign all tied probabilities to the
same cell, which can result in very unequal cell counts. LOGIT gives the user a high
degree of control over the grouping, allowing you to choose among several methods.
The table also provides the Pearson chi-square and the sum of squared deviance
residuals, assuming that each observation has a unique covariate pattern.

Regression Diagnostics
If the DC command is preceded by a SAVE command, a SYSTAT data file containing
regression diagnostics will be created (Pregibon, 1981; Cook and Weisberg, 1984).
The SAVE file contains these variables:

ACTUAL         Value of Dependent Variable
PREDICT        Class Assignment (1 or 0)
PROB           Predicted probability
LEVERAGE(1)    Diagonal element of Pregibon hat matrix
LEVERAGE(2)    Component of LEVERAGE(1)
PEARSON        Pearson Residual for observation
VARIANCE       Variance of Pearson Residual
STANDARD       Standardized Pearson Residual
DEVIANCE       Deviance Residual
DELDSTAT       Change in Deviance chi-square
DELPSTAT       Change in Pearson chi-square
DELBETA(1)     Standardized Change in Beta
DELBETA(2)     Standardized Change in Beta
DELBETA(3)     Standardized Change in Beta

LEVERAGE(1) is a measure of the influence of an observation on the model fit and is
H&L's h. DELBETA(1) is a measure of the change in the coefficient vector due to the
observation and is their Δβ (delta beta), DELPSTAT is based on the squared residual
and is their Δχ² (delta chi-square), and DELDSTAT is the change in deviance and is
their ΔD (delta D). As in linear regression, the diagnostics are intended to identify
outliers and influential observations. Plots of PEARSON, DEVIANCE, LEVERAGE(1),
DELDSTAT, and DELPSTAT against the CASE will highlight unusual data points. H&L
suggest plotting Δχ², ΔD, and Δβ against PROB and against h.
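
To make the relationships among these quantities concrete, the sketch below computes them for a single case from its outcome, fitted probability, and leverage. The formulas are the textbook Pregibon-style forms as commonly presented for ungrouped binary data. Whether LOGIT computes the saved variables in exactly this way is an assumption, and the function and key names are hypothetical rather than SYSTAT names.

```python
import math

def logit_diagnostics(y, p, h):
    """Case-level diagnostics for a binary outcome y (0 or 1), fitted
    probability p, and leverage h (diagonal of the Pregibon hat matrix)."""
    pearson = (y - p) / math.sqrt(p * (1.0 - p))
    # Deviance residual: signed square root of the case's deviance contribution.
    contrib = -2.0 * math.log(p if y == 1 else 1.0 - p)
    deviance = math.copysign(math.sqrt(contrib), y - p)
    return {
        "pearson": pearson,
        "deviance": deviance,
        "standardized": pearson / math.sqrt(1.0 - h),
        "delta_chisq": pearson ** 2 / (1.0 - h),                    # delta chi-square
        "delta_dev": deviance ** 2 + pearson ** 2 * h / (1.0 - h),  # delta D
        "delta_beta": pearson ** 2 * h / (1.0 - h) ** 2,            # delta beta
    }

# Hypothetical case: observed response 1, fitted probability 0.8, leverage 0.1.
d = logit_diagnostics(y=1, p=0.8, h=0.1)
for name, value in d.items():
    print(f"{name:13s} {value:8.4f}")
```

All three change statistics grow as the leverage approaches 1, which is why plots against both PROB and h are informative.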
There is an important difference between our calculation of these measures and
those produced by H&L. In LOGIT, the above quantities are computed separately for
each observation, with no account taken of covariate grouping; whereas, in H&L,
grouping is taken into account. To obtain the grouped variants of these statistics,
several SYSTAT programming steps are involved. For further discussion and
interpretation of diagnostic graphs, see H&L's Chapter 5. We include the probability
plot of the residuals from our model, with the variance of the residuals used to size the
plotting characters.
We also display an example of the graph on the cover of H&L. The original cover
was plotted using SYSTAT Version 5 for the Macintosh. There are slight differences
between the two plots because of the scales and number of iterations in the model
fitting, but the examples are basically the same. H&L is an extremely valuable resource
for learning about graphical aids to diagnosing logistic models.


Example 5
Quantiles
In bioassay, it is common to estimate the dosage required to kill 50% of a target
population. For example, a toxicity experiment might establish the concentration of
nicotine sulphate required to kill 50% of a group of common fruit flies (Hubert, 1984).
More generally, the goal is to identify the level of a stimulus required to induce a 50%
response rate, where the response is any binary outcome variable and the stimulus is a
continuous covariate. In bioassay, stimuli include drugs, toxins, hormones, and
insecticides; the responses include death, weight gain, bacterial growth, and color
change, but the concepts are equally applicable to other sciences.
To obtain the LD50 in LOGIT, simply issue the QNTL command. However, don't
make the mistake of spelling quantile as QU, which means QUIT in SYSTAT. QNTL
will produce not only the LD50 but also a number of other quantiles as well, with upper
and lower bounds when they exist. Consider the following data from Williams (1986):
RESPONSE LDOSE COUNT

CASE
CASE
CASE
CASE
CASE
CASE
CASE
CASE
CASE

Here, RESPONSE is the dependent variable, LDOSE is the logarithm of the dose
(stimulus), and COUNT is the number of subjects with that response. The model
estimated is:
USE WILL
FREQ=COUNT
LOGIT
MODEL RESPONSE=CONSTANT+LDOSE
ESTIMATE
QNTL


Following is the output:


Variables in the SYSTAT Rectangular file are:
 RESPONSE  LDOSE     COUNT

Case frequencies determined by value of variable COUNT.

Categorical values encountered during processing are:
RESPONSE (2 levels)
        0,        1

Binary LOGIT Analysis.
Dependent variable: RESPONSE
Analysis is weighted by COUNT
Sum of weights =   25.000
Input records:              9
Records for analysis:       9
Sample split
                         Weighted
  Category      Count       Count
  REF               5      15.000
  RESP              4      10.000
  Total   :         9      25.000

L-L at iteration 1 is    -17.329
L-L at iteration 2 is    -13.277
L-L at iteration 3 is    -13.114
L-L at iteration 4 is    -13.112
L-L at iteration 5 is    -13.112
Log Likelihood:          -13.112

    Parameter      Estimate      S.E.   t-ratio   p-value
  1 CONSTANT          0.564     0.496     1.138     0.255
  2 LDOSE             0.919     0.394     2.334     0.020

                                 95.0 % bounds
    Parameter    Odds Ratio     Upper     Lower
  2 LDOSE             2.507     5.425     1.159

Log Likelihood of constants only model = LL(0) =   -16.825
2*[LL(N)-LL(0)] =    7.427 with 1 df Chi-sq p-value = 0.006
McFadden's Rho-Squared =    0.221

Evaluation Vector
  1 CONSTANT          1.000
  2 LDOSE             VALUE

Quantile Table
Probability     LOGIT     LDOSE     Upper     Lower
      0.999     6.907     6.900    44.788     3.518
      0.995     5.293     5.145    33.873     2.536
      0.990     4.595     4.385    29.157     2.105
      0.975     3.664     3.372    22.875     1.519
      0.950     2.944     2.590    18.042     1.050
      0.900     2.197     1.777    13.053     0.530
      0.750     1.099     0.582     5.928    -0.445
      0.667     0.695     0.142     3.551    -1.047
      0.500     0.0      -0.613     0.746    -3.364
      0.333    -0.695    -1.369    -0.347    -7.392
      0.250    -1.099    -1.809    -0.731    -9.987
      0.100    -2.197    -3.004    -1.552   -17.266
      0.050    -2.944    -3.817    -2.046   -22.281
      0.025    -3.664    -4.599    -2.503   -27.126
      0.010    -4.595    -5.612    -3.081   -33.416
      0.005    -5.293    -6.372    -3.508   -38.136
      0.001    -6.907    -8.127    -4.486   -49.055


This table includes LD (probability) values between 0.001 and 0.999. The median
lethal LDOSE (log-dose) is -0.613 with upper and lower bounds of 0.746 and -3.364
for the default 95% confidence interval, corresponding to a dose of 0.542 with limits
2.11 and 0.0346.
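
The LDOSE column of the quantile table can be approximated directly from the two coefficient estimates: set the fitted logit equal to the logit of the target probability and solve for the covariate. The sketch below uses the rounded estimates from the output, so it matches the table only to about two decimals. The function name is illustrative, and the Fieller bounds are not reproduced because they need the coefficient covariance, which is not printed.

```python
import math

def ld_quantile(prob, constant, slope):
    """Covariate value at which the fitted response probability equals prob."""
    logit = math.log(prob / (1.0 - prob))
    return (logit - constant) / slope

CONSTANT, LDOSE_COEF = 0.564, 0.919   # rounded estimates from the output

ld50 = ld_quantile(0.5, CONSTANT, LDOSE_COEF)
ld90 = ld_quantile(0.9, CONSTANT, LDOSE_COEF)
print(f"LD50 on the log scale: {ld50:.3f}")
print(f"LD90 on the log scale: {ld90:.3f}")
print(f"LD50 as a dose:        {math.exp(ld50):.3f}")
```

The conversion to a dose assumes that LDOSE is a natural logarithm, which is consistent with the dose values quoted in the text.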

Indeterminate Confidence Intervals


Quantile confidence intervals are calculated using Fieller bounds (Finney, 1978),
which can easily include positive or negative infinity for steep dose-response
relationships. In the output, these are represented by the SYSTAT missing value. If this
happens, an alternative suggested by Williams (1986) is to calculate confidence
bounds using likelihood-ratio (LR) tests. See Cox and Oakes (1984) for a likelihood
profile example. Williams observes that the LR bounds seem to be invariably smaller
than the Fieller bounds even for well-behaved large-sample problems.
With SYSTAT BASIC, the search for the LR bounds can be conducted easily.
However, if you are not familiar with LR testing of this type, please refer to Cox and
Oakes (1984) and Williams (1986) for further explanation, because our account here
is necessarily brief.
We first estimate the model of RESPONSE on LDOSE reported above, which will be
the unrestricted model in the series of tests. The key statistic is the final log-likelihood
of -13.112. We then need to search for restricted models that force the LD50 to other
values and that yield log-likelihoods no worse than -13.112 - 1.92 = -15.032. A
difference in log-likelihoods of 1.92 marks a 95% confidence interval because 2 * 1.92
= 3.84 is the 0.95 cutoff of the chi-squared distribution with one degree of freedom.
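
The target log-likelihood and the coverage implied by the 1.92 rule can be checked with a few lines of arithmetic. This sketch relies on the identity that a chi-squared variable with one degree of freedom is a squared standard normal, so its distribution function is erf(sqrt(x/2)). The variable and function names are illustrative.

```python
import math

def chi2_cdf_1df(x):
    """CDF of the chi-squared distribution with one degree of freedom."""
    return math.erf(math.sqrt(x / 2.0))

unrestricted_ll = -13.112   # final log-likelihood of the unrestricted model
half_cutoff = 1.92          # half of the 0.95 chi-squared cutoff, 3.84

target_ll = unrestricted_ll - half_cutoff
coverage = chi2_cdf_1df(2.0 * half_cutoff)
print(f"target log-likelihood = {target_ll:.3f}, coverage = {coverage:.4f}")
```

A restricted model whose log-likelihood equals the target sits on the boundary of the 95% likelihood-ratio interval.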
A restricted model is estimated by using a new independent variable and fitting a
model without a constant. The new independent variable is equal to the original minus
the value of the hypothesized LD50 bound. Values of the bounds will be selected by
trial and error. Thus, to test an LD50 value of 0.4895, we could type:
LOGIT
LET LDOSEB=LDOSE-.4895
MODEL RESPONSE=LDOSEB
ESTIMATE
LET LDOSEB=LDOSE+2.634
MODEL RESPONSE=LDOSEB
ESTIMATE

SYSTAT BASIC is used to create the new variable LDOSEB on the fly, and the new
model is then estimated without a constant. The only important part of the results from
a restricted model is the final log-likelihood. It should be close to -15.032 if we have
found the boundary of the confidence interval. We won't show the results of these
estimations except to say that the lower bound was found to be -2.634 and is tested
using the second LET statement. Note that the value of the bound is subtracted from the
original independent variable, resulting in the subtraction of a negative number. While
the process of looking for a bound that will yield a log-likelihood of -15.032 for these
data is one of trial and error, it should not take long with the interactive program.
Several other examples are provided in Williams (1986). We were able to reproduce
most of his confidence interval results, but for several models his reported LD50 values
seem to be incorrect.

Quantiles and Logistic Regression


The calculation of LD values has traditionally been conducted in the context of simple
regressions containing a single predictor variable. LOGIT extends the notion to multiple
regression by allowing you to select one variable for LD calculations while holding the
values of the other variables constant at prespecified values. Thus,
USE HOSLEM
CATEGORY RACE
MODEL LOW = CONSTANT + AGE + RACE + SMOKE + HT +,
UI + LWD + PTD
ESTIMATE
QNTL AGE / CONSTANT=1, RACE[1]=1, SMOKE=1, PTD=1,
LWD=1, HT=1, UI=1

will produce the quantiles for AGE with the other variables set as specified. The Fieller
bounds are calculated, adjusting for all other parameters estimated.

Example 6
Multinomial Logit
We will illustrate multinomial modeling with an example, emphasizing what is new in
this context. If you have not already read the example on binary logit, this is a good
time to do so. The data used here have been extracted from the National Longitudinal
Survey of Young Men, 1979. Information on 200 individuals is supplied on school
enrollment status (NOTENR = 1 if not enrolled, 0 otherwise), log10 of wage (LW), age,
highest completed grade (EDUC), mother's education (MED), father's education
(FED), an index of reading material available in the home (CULTURE = 1 for least, 3
for most), mean income of persons in father's occupation in 1960 (FOMY), an IQ
measure, a race dummy (BLACK = 0 for white), a region dummy (SOUTH = 0 for
non-South), and the number of siblings (NSIBS).
We estimate a model to analyze the CULTURE variable, predicting its value with
several demographic characteristics. In this example, we ignore the fact that the
dependent variable is ordinal and treat it as a nominal variable. (See Agresti, 1990, for
a discussion of the distinction.)
USE NLS
FORMAT=4
PRINT=LONG
LOGIT
MODEL CULTURE=CONSTANT+MED+FOMY
ESTIMATE / MEANS,PREDICT,CLASS,DERIVATIVE=INDIVIDUAL
PRINT

These commands look just like our binary logit analyses with the exception of the
DERIVATIVE and CLASS options, which we will discuss below. The resulting output is:
Categorical values encountered during processing are:
CULTURE (3 levels)
        1,        2,        3

Multinomial LOGIT Analysis.
Dependent variable: CULTURE
Input records:            200
Records for analysis:     200
Sample split
  Category    choices
  1                12
  2                49
  3               139
  Total   :       200

Independent variable MEANS
    PARAMETER             1           2           3     OVERALL
  1 CONSTANT         1.0000      1.0000      1.0000      1.0000
  2 MED              8.7500     10.1837     11.4460     10.9750
  3 FOMY          4551.5000   5368.8571   6116.1367   5839.1750

L-L at iteration 1 is   -219.7225
L-L at iteration 2 is   -145.2936
L-L at iteration 3 is   -138.9952
L-L at iteration 4 is   -137.8612
L-L at iteration 5 is   -137.7851
L-L at iteration 6 is   -137.7846
L-L at iteration 7 is   -137.7846
Log Likelihood:         -137.7846

    Parameter      Estimate      S.E.   t-ratio   p-value
Choice Group: 1
  1 CONSTANT         5.0638    1.6964    2.9850    0.0028
  2 MED             -0.4228    0.1423   -2.9711    0.0030
  3 FOMY            -0.0006    0.0002   -2.6034    0.0092
Choice Group: 2
  1 CONSTANT         2.5435    0.9834    2.5864    0.0097
  2 MED             -0.1917    0.0768   -2.4956    0.0126
  3 FOMY            -0.0003    0.0001   -2.1884    0.0286

                                 95.0 % bounds
    Parameter    Odds Ratio     Upper     Lower
Choice Group: 1
  2 MED              0.6552    0.8660    0.4958
  3 FOMY             0.9994    0.9998    0.9989
Choice Group: 2
  2 MED              0.8255    0.9597    0.7101
  3 FOMY             0.9997    1.0000    0.9995

Log Likelihood of constants only model = LL(0) =  -153.2535
2*[LL(N)-LL(0)] =  30.9379 with 4 df Chi-sq p-value = 0.0000
McFadden's Rho-Squared =   0.1009

Wald tests on effects across all choices
                       Wald    Chi-Sq
    Effect        Statistic    Signif        df
  1 CONSTANT        12.0028    0.0025    2.0000
  2 MED             12.1407    0.0023    2.0000
  3 FOMY             9.4575    0.0088    2.0000

Covariance Matrix
            1         2         3         4         5         6
  1    2.8777
  2   -0.1746    0.0202
  3   -0.0002   -0.0000    0.0000
  4    0.5097   -0.0282   -0.0000    0.9670
  5   -0.0274    0.0027   -0.0000   -0.0541    0.0059
  6   -0.0000   -0.0000    0.0000   -0.0001   -0.0000    0.0000

Correlation Matrix
            1         2         3         4         5         6
  1    1.0000   -0.7234   -0.6151    0.3055   -0.2100   -0.1659
  2   -0.7234    1.0000   -0.0633   -0.2017    0.2462   -0.0149
  3   -0.6151   -0.0633    1.0000   -0.1515   -0.0148    0.2284
  4    0.3055   -0.2017   -0.1515    1.0000   -0.7164   -0.5544
  5   -0.2100    0.2462   -0.0148   -0.7164    1.0000   -0.1570
  6   -0.1659   -0.0149    0.2284   -0.5544   -0.1570    1.0000

Individual variable derivatives averaged over all observations
    PARAMETER             1         2         3
  1 CONSTANT         0.2033    0.3441   -0.5474
  2 MED             -0.0174   -0.0251    0.0425
  3 FOMY            -0.0000   -0.0000    0.0001

Model Prediction Success Table

Actual                 Predicted Choice              Actual
Choice               1           2           3        Total
1               1.8761      4.0901      6.0338      12.0000
2               3.6373     13.8826     31.4801      49.0000
3               6.4865     31.0273    101.4862     139.0000

Pred. Tot.     12.0000     49.0000    139.0000     200.0000
Correct         0.1563      0.2833      0.7301
Success Ind.    0.0963      0.0383      0.0351
Tot. Correct    0.5862

Model Classification Table

Actual                 Predicted Choice              Actual
Choice               1           2           3        Total
1               1.0000      3.0000      8.0000      12.0000
2               0.0         4.0000     45.0000      49.0000
3               1.0000      5.0000    133.0000     139.0000

Pred. Tot.      2.0000     12.0000    186.0000     200.0000
Correct         0.0833      0.0816      0.9568
Success Ind.    0.0233     -0.1634      0.2618
Tot. Correct    0.6900

The output begins with a report on the number of records read and retained for analysis.
This is followed by a frequency table of the dependent variable; both weighted and
unweighted counts would be provided if the FREQ option had been used. The means
table provides means of the independent variables by value of the dependent variable.
We observe that the highest educational and income values are associated with the
most reading material in the home. Next, an abbreviated history of the optimization
process lists the log-likelihood at each iteration, and finally, the estimation results are
printed.
Note that the regression results consist of two sets of estimates, labeled Choice
Group 1 and Choice Group 2. It is this multiplicity of parameter estimates that
differentiates multinomial from binary logit. If there had been five categories in the
dependent variable, there would have been four sets of estimates, and so on. This
volume of output provides the challenge to understanding the results.
The results are a little more intelligible when you realize that we have really
estimated a series of binary logits simultaneously. The first submodel consists of the
two dependent variable categories 1 and 3, and the second consists of categories 2 and
3. These submodels always include the highest level of the dependent variable as the
reference class and one other level as the response class. If NCAT had been set to 25,
the 24 submodels would be categories 1 and 25, categories 2 and 25, through categories
24 and 25. We then obtain the odds ratios for the two submodels separately, comparing
dependent variable levels 1 against 3 and 2 against 3. This table shows that levels 1 and
2 are less likely as MED and FOMY increase, as the odds ratio is less than 1.
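
The odds ratios and their 95% bounds can be reproduced from the coefficient table by exponentiation. The sketch below assumes Wald-type bounds, exp(b ± z*SE) with z the 97.5% normal quantile, which reproduces the printed values for MED in Choice Group 1 to about three decimals. The function name is illustrative.

```python
import math

def odds_ratio_with_bounds(estimate, se, z=1.959964):
    """Odds ratio exp(b) with bounds exp(b +/- z*SE); z is the 97.5% normal quantile."""
    return (math.exp(estimate),
            math.exp(estimate + z * se),
            math.exp(estimate - z * se))

# MED in Choice Group 1: estimate and standard error from the output.
ratio, upper, lower = odds_ratio_with_bounds(-0.4228, 0.1423)
print(f"odds ratio={ratio:.4f} upper={upper:.4f} lower={lower:.4f}")
```

An odds ratio below 1 with an upper bound also below 1 indicates that category 1 becomes less likely relative to category 3 as MED increases.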

Wald Test Table


The coefficient/standard-error ratios (t ratios) reported next to each coefficient are a
guide to the significance of an individual parameter. But when the number of
categories is greater than two, each variable corresponds to more than one parameter.
The Wald test table automatically conducts the hypothesis test of dropping all
parameters associated with a variable, and the degrees of freedom indicates how many
parameters were involved. Because each variable in this example generates two
coefficients, the Wald tests have two degrees of freedom each. Given the high
individual t ratios, it is not surprising that every variable is also significant overall. The
PRINT = LONG option also produces the parameter covariance and correlation
matrices.
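
As a check on the Wald table, the statistic for MED can be recomputed from the two MED coefficients (parameters 2 and 5 in the covariance matrix) as the quadratic form b'V⁻¹b. The sketch below uses the rounded printed values, so it agrees with the reported 12.1407 only approximately. The helper name is hypothetical, and the p-value line uses the fact that the chi-squared upper tail with two degrees of freedom is exp(-x/2).

```python
import math

def wald_2df(b, v):
    """Wald statistic b' V^-1 b for two coefficients with 2x2 covariance v."""
    (v11, v12), (_, v22) = v
    det = v11 * v22 - v12 * v12
    return (v22 * b[0] ** 2 - 2.0 * v12 * b[0] * b[1] + v11 * b[1] ** 2) / det

# MED coefficients in Choice Groups 1 and 2, with variances and covariance
# read from the printed (rounded) covariance matrix.
b_med = (-0.4228, -0.1917)
v_med = ((0.0202, 0.0027), (0.0027, 0.0059))

wald = wald_2df(b_med, v_med)
p_value = math.exp(-wald / 2.0)   # chi-squared upper tail, exact for 2 df
print(f"Wald = {wald:.3f}, p = {p_value:.4f}")
```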

Derivative Tables
In a multinomial context, we will want to know how the probabilities of each of the
outcomes will change in response to a change in the covariate values. This information
is provided in the derivative table, which tells us, for example, that when MED
increases by one unit, the probability of category 3 goes up by 0.042, and categories 1
and 2 go down by 0.017 and 0.025, respectively. To assess properly the effect of
father's income, the variable should be rescaled to hundreds or thousands of dollars (or
the FORMAT increased) because the effect of an increase of one dollar is very small.
The sum of the entries in each row is always 0 because an increase in probability in one
category must come about by a compensating decrease in other categories. There is no
useful interpretation of the CONSTANT row.
In general, the table shows how probability is reallocated across the possible values
of the dependent variable as the independent variable changes. It thus provides a global
view of covariate effects that is not easily seen when considering each binary submodel
separately. In fact, the overall effect of a covariate on the probability of an outcome can
be of the opposite sign of its coefficient estimate in the corresponding submodel. This
is because the submodel concerns only two of the outcomes, whereas the derivative
table considers all outcomes at once.
This table was generated by evaluating the derivatives separately for each individual
observation in the data set and then computing the mean; this is the theoretically
correct way to obtain the results. A quick alternative is to evaluate the derivatives once
at the sample average of the covariates. This method saves time (but at the possible cost
of accuracy) and is requested with the option DERIVATIVE=AVERAGE.
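
The quick alternative, evaluating the derivatives once at the sample means, can be sketched as follows. For a multinomial logit the derivative of the probability of category j with respect to a covariate is P_j times the difference between the category's coefficient and the probability-weighted average of the coefficients over all categories. The code uses the rounded estimates from the output (the FOMY coefficients in particular carry only one significant digit), so the numbers are rough and are not expected to match the table of individually averaged derivatives. The function names are illustrative.

```python
import math

def multinomial_probs(betas, x):
    """Choice probabilities; the last category is the reference (all-zero coefficients)."""
    scores = [sum(b * xi for b, xi in zip(beta, x)) for beta in betas] + [0.0]
    top = max(scores)
    expo = [math.exp(s - top) for s in scores]
    total = sum(expo)
    return [e / total for e in expo]

def derivatives(betas, x, k):
    """dP_j/dx_k for every category j: P_j * (beta_jk - sum_m P_m * beta_mk)."""
    probs = multinomial_probs(betas, x)
    coefs = [beta[k] for beta in betas] + [0.0]
    mean_coef = sum(p * c for p, c in zip(probs, coefs))
    return [p * (c - mean_coef) for p, c in zip(probs, coefs)]

# Rounded estimates (CONSTANT, MED, FOMY) for Choice Groups 1 and 2.
betas = [(5.0638, -0.4228, -0.0006), (2.5435, -0.1917, -0.0003)]
x_mean = (1.0, 10.9750, 5839.1750)   # OVERALL column of the means table

probs = multinomial_probs(betas, x_mean)
d_med = derivatives(betas, x_mean, k=1)
print("probabilities:", [round(p, 4) for p in probs])
print("dP/dMED:      ", [round(d, 4) for d in d_med])
```

The derivatives sum to zero across categories, as the text notes, because the probabilities always sum to 1.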

Prediction Success
The PREDICT option instructs LOGIT to produce the prediction success table, which we
have already seen in the binary logit. (See Hensher and Johnson, 1981; McFadden,
1979.) The table will break down the distribution of predicted outcomes by actual
choice, with diagonals representing correct predictions and off-diagonals representing
incorrect predictions. For the multinomial model, the table will have dimensions NCAT
by NCAT with additional marginal results. For our example model, the core table is 3
by 3.
Each row of the table takes all cases having a specific value of the dependent
variable and shows how the model allocates those cases across the possible outcomes.
Thus in row 1, the 12 cases that actually had CULTURE = 1 were distributed by the
predictive model as 1.88 to CULTURE = 1, 4.09 to CULTURE = 2, and 6.03 to
CULTURE = 3. These numbers are obtained by summing the predicted probability of
being in each category across all of the cases with CULTURE actually equal to 1. A
similar allocation is provided for every value of the dependent variable.
The prediction success table is also bordered by additional information: row totals
are observed sums, and column totals are predicted sums and will be equal for any
model containing a constant. The Correct row gives the ratio of the number correctly
predicted in a column to the column total. Thus, among cases for which CULTURE =
1, the fraction correct is 1.8761 / 12 = 0.1563; for CULTURE = 3, the ratio is
101.4862 / 139 = 0.7301. The total correct gives the fraction correctly predicted
overall and is computed as the sum of the correctly predicted counts in each column
(the diagonal cells) divided by the table total. This is
(1.8761 + 13.8826 + 101.4862) / 200 = 0.5862.
The success index measures the gain that the model exhibits in number correctly
predicted in each column over a purely random model (a model with just a constant).
A purely random model would assign the same probabilities of the three outcomes to
each case, as illustrated below:
Random Probability Model                  Success Index =
Predicted Sample Fraction                 CORRECT - Random Predicted

PROB (CULTURE=1) =  12/200 = 0.0600       0.1563 - 0.0600 = 0.0963
PROB (CULTURE=2) =  49/200 = 0.2450       0.2833 - 0.2450 = 0.0383
PROB (CULTURE=3) = 139/200 = 0.6950       0.7301 - 0.6950 = 0.0351


Thus, the smaller the success index in each column, the poorer the performance of the
model; in fact, the index can even be negative.
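
The border statistics can be recomputed from the core 3 by 3 table. The sketch below follows the definitions given in the text: Correct is the diagonal cell divided by the category total, the success index subtracts the sample fraction that a constant-only model would predict, and the total correct is the diagonal sum over the table total. The function name is illustrative.

```python
# Prediction success table from the output: rows are actual choices,
# columns are predicted choices.
table = [
    [1.8761, 4.0901, 6.0338],
    [3.6373, 13.8826, 31.4801],
    [6.4865, 31.0273, 101.4862],
]

def success_summary(table):
    """Return the Correct row, the Success Index row, and Tot. Correct."""
    n = sum(sum(row) for row in table)
    actual_totals = [sum(row) for row in table]
    correct = [table[j][j] / actual_totals[j] for j in range(len(table))]
    # A constant-only model predicts each category at its sample fraction.
    random_share = [t / n for t in actual_totals]
    success_index = [c - r for c, r in zip(correct, random_share)]
    total_correct = sum(table[j][j] for j in range(len(table))) / n
    return correct, success_index, total_correct

correct, success_index, total_correct = success_summary(table)
print("Correct:      ", [round(c, 4) for c in correct])
print("Success Ind.: ", [round(s, 4) for s in success_index])
print("Tot. Correct: ", round(total_correct, 4))
```

The same function applied to the classification table cells gives that table's border statistics, including the negative success index for category 2.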
Normally, one prediction success table is produced for each model estimated.
However, if the data have been separated into learning and test subsamples with BY, a
separate prediction success table will be produced for each portion of the data. This can
provide a clear picture of the strengths and weaknesses of the model when applied to
fresh data.

Classification Tables
Classification tables are similar to prediction success tables except that predicted
choices instead of predicted probabilities are added into the table. Predicted choice is
the choice with the highest probability. Mathematically, the classification table is a
prediction success table with the predicted probabilities changed, setting the highest
probability of each case to 1 and the other probabilities to 0.
In the absence of fractional case weighting, each cell of the main table will contain
an integer instead of a real number. All other quantities are computed as they would be
for the prediction success table. In our judgment, the classification table is not as good
a diagnostic tool as the prediction success table. The option is included primarily for
the binary logit to provide comparability with results reported in the literature.
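
The conversion from predicted probabilities to a classification table can be sketched in a few lines: each case is assigned to its most probable category and counted in the cell for its actual category. The cases below are hypothetical and are not taken from the NLS file. How LOGIT breaks ties is not stated in the text, so the first-maximum rule used here is an assumption.

```python
def classification_table(actual, probs):
    """Cross-tabulate actual choice (1-based) against the most probable choice."""
    ncat = len(probs[0])
    table = [[0] * ncat for _ in range(ncat)]
    for a, p in zip(actual, probs):
        # Index of the largest probability; ties go to the first maximum.
        predicted = max(range(ncat), key=lambda j: p[j])
        table[a - 1][predicted] += 1
    return table

# Hypothetical cases: actual category and predicted probabilities.
actual = [1, 2, 3, 3]
probs = [
    [0.50, 0.30, 0.20],
    [0.10, 0.30, 0.60],
    [0.05, 0.25, 0.70],
    [0.20, 0.45, 0.35],
]
table = classification_table(actual, probs)
for row in table:
    print(row)
```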

Example 7
Conditional Logistic Regression
Data must be organized in a specific way for the conditional logistic model;
fortunately, this organization is natural for matched sample case-control studies. First,
matched samples must be grouped together; all subjects from a given stratum must be
contiguous. It is thus advisable to provide each set with a unique stratum number to
facilitate the sorting and tracking of records. Second, the dependent variable gives the
relative position of the case within a matched set. Thus, the dependent variable will be
an integer between 1 and NCAT, and if the case is first in each stratum, then the
dependent variable will be equal to 1 for every record in the data set.
To illustrate how to set up conditional logit models, we use data discussed at length
by Breslow and Day (1980) on cases of endometrial cancer in a retirement community
near Los Angeles. The data are reproduced in their Appendix III and are identified in
SYSTAT as MACK.SYD.


The data set includes the dependent variable CANCER, the exposure variables AGE,
GALL (gall bladder disease), HYP (hypertension), OBESE, ESTROGEN, DOSE, DUR
(duration of conjugated estrogen exposure), NON (other drugs), some transformations
of these variables, and a set identification number. The data are organized by sets, with
the case coming first, followed by four controls, and so on, for a total of 315
observations (63 * (4 + 1)).
To estimate a model of the relative risks of gall bladder disease, estrogen use, and
their interaction, you may proceed as follows:
USE MACK
PRINT LONG
LOGIT
MODEL DEPVAR=GALL+EST+GALL*EST ;
ALT=SETSIZE
NCAT=5
ESTIMATE

There are three key points to notice about this sequence of commands. First, the NCAT
command is required to let LOGIT know how many subjects there are in a matched set.
Unlike the unconditional binary LOGIT, a unit of information in matched samples will
typically span more than one line of data, and NCAT will establish the minimum size
of each matched set. If each set contains the same number of subjects, the NCAT
command completely describes the data organization. If there were a varying number
of controls per set, the size of each set would be signaled with the ALT command, as in
ALT = SETSIZE

Here, SETSIZE is a variable containing the total number of subjects (number of
controls plus 1) per set. Each set could have its own value.
The second point is that the matched set conditional logit never contains a constant;
the constant is eliminated along with all other variables that do not vary among
members of a matched set. The third point is the appearance of the semicolon at the
end of the model. This is required to distinguish the conditional from the unconditional
model.
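
The data layout maps directly onto the conditional likelihood. Each matched set contributes the probability that, among its members, the one listed first is the case; a constant would cancel from the numerator and denominator, which is why none is estimated. The sketch below is a minimal version of that log-likelihood under the assumption that the case is always the first row of its set. The exposure rows are hypothetical and are not the MACK data.

```python
import math

def conditional_loglik(matched_sets, beta):
    """Log-likelihood for matched sets whose first member is the case.

    Each set contributes log( exp(x_case.b) / sum_j exp(x_j.b) ).
    """
    total = 0.0
    for members in matched_sets:
        scores = [sum(b * x for b, x in zip(beta, row)) for row in members]
        top = max(scores)
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[0] - log_denominator
    return total

# Hypothetical exposure data: 63 sets, one case followed by four controls,
# each row holding (GALL, EST, GALL*EST).
one_set = [(1, 1, 1), (0, 1, 0), (0, 0, 0), (1, 0, 0), (0, 1, 0)]
sets = [one_set] * 63

start_ll = conditional_loglik(sets, beta=(0.0, 0.0, 0.0))
print(f"log-likelihood at zero coefficients: {start_ll:.4f}")
```

With all coefficients at zero every member of a five-person set is equally likely to be the case, so the value is 63 * log(1/5) regardless of the covariates. This is the same quantity reported at iteration 1 for the 63 sets of five in this example.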


After you specify the commands, the output produced includes:


Variables in the SYSTAT Rectangular file are:
 CANCER    GALL      HYP       OBESE     EST       DOS
 DURATION  NON       REC       DEPVAR    GROUP     OB
 DOSGRP    DUR       DURGRP    CEST      SETSIZ

Conditional LOGIT, data organized by matched set.

Categorical values encountered during processing are:
DEPVAR (1 levels)
        1

Conditional LOGIT Analysis.
Dependent variable: DEPVAR
Number of alternatives: SETSIZ
Input records:               315
Matched sets for analysis:    63

L-L at iteration 1 is   -101.3946
L-L at iteration 2 is    -79.0552
L-L at iteration 3 is    -76.8868
L-L at iteration 4 is    -76.7326
L-L at iteration 5 is    -76.7306
L-L at iteration 6 is    -76.7306
Log Likelihood:          -76.7306

    Parameter      Estimate      S.E.   t-ratio   p-value
  1 GALL             2.8943    0.8831    3.2777    0.0010
  2 EST              2.7001    0.6118    4.4137    0.0000
  3 GALL*EST        -2.0527    0.9950   -2.0631    0.0391

                                 95.0 % bounds
    Parameter    Odds Ratio     Upper     Lower
  1 GALL            18.0717  102.0127    3.2014
  2 EST             14.8818   49.3621    4.4866
  3 GALL*EST         0.1284    0.9025    0.0183

Log Likelihood of constants only model = LL(0) =    0.0000
McFadden's Rho-Squared = 4.56944E+15

Covariance Matrix
            1         2         3
  1    0.7798
  2    0.3398    0.3743
  3   -0.7836   -0.3667    0.9900

Correlation Matrix
            1         2         3
  1    1.0000    0.6290   -0.8918
  2    0.6290    1.0000   -0.6024
  3   -0.8918   -0.6024    1.0000


The output begins with a report on the number of SYSTAT records read and the
number of matched sets kept for analysis. The remaining output parallels the results
produced by the unconditional logit model. The parameters estimated are coefficients
of a linear logit, the relative risks are derived by exponentiation, and the interpretation
of the model is unchanged. Model selection will proceed as it would in linear
regression; you might experiment with logarithmic transformations of the data, explore
quadratic and higher-order polynomials in the risk factors, and look for interactions.
Examples of such explorations appear in Breslow and Day (1980).

Example 8
Discrete Choice Models
The CHOICE data set contains hypothetical data motivated by McFadden (1979). The
CHOICE variable represents which of the three transportation alternatives (AUTO,
POOL, TRAIN) each subject prefers. The first subscripted variable in each choice
category represents TIME and the second, COST. Finally, SEX$ represents the gender
of the chooser, and AGE, the age.
A basic discrete choice model is estimated with:
USE CHOICE
LOGIT
SET TIME = AUTO(1),POOL(1),TRAIN(1)
SET COST = AUTO(2),POOL(2),TRAIN(2)
MODEL CHOICE=TIME+COST
ESTIMATE

There are two new features of this program. First, the word TIME is not a SYSTAT
variable name; rather, it is a label we chose to remind us of time spent commuting.
Second, the group of names in the SET statement are valid SYSTAT variables
corresponding, in order, to the three modes of transportation. Although there are three
variable names in the SET statement, only one attribute is being measured.
Following is the output:
Categorical values encountered during processing are:
CHOICE (3 levels)
        1,        2,        3
Categorical variables are effects coded with the highest value as reference.
Conditional LOGIT Analysis.
Dependent variable: CHOICE
Input records:              29
Records for analysis:       29
Sample split
Category choices
  1                         15
  2                          6
  3                          8
Total :                     29
L-L at iteration 1 is      -31.860
L-L at iteration 2 is      -31.142
L-L at iteration 3 is      -31.141
L-L at iteration 4 is      -31.141
Log Likelihood:            -31.141
  Parameter       Estimate       S.E.    t-ratio    p-value
  1 TIME            -0.020      0.017     -1.169      0.243
  2 COST            -0.088      0.145     -0.611      0.541
                                           95.0 % bounds
  Parameter     Odds Ratio      Upper      Lower
  1 TIME             0.980      1.014      0.947
  2 COST             0.915      1.216      0.689
Log Likelihood of constants only model = LL(0) =    -29.645
McFadden's Rho-Squared =     -0.050

Covariance Matrix
              1          2
  1       0.000
  2       0.001      0.021

Correlation Matrix
              1          2
  1       1.000      0.384
  2       0.384      1.000

The output begins with a frequency distribution of the dependent variable and a brief
iteration history and prints standard regression results for the parameters estimated.
A key difference between a conditional variable clause and a standard SYSTAT
polytomous variable is that each clause corresponds to only one estimated parameter
regardless of the value of NCAT, while each free-standing polytomous variable
generates NCAT - 1 parameters. The difference is best seen in a model that mixes both
types of variables (see Hoffman and Duncan, 1988, or Steinberg, 1987, for further
discussion).

Mixed Parameters
The following is an example of mixing polytomous and conditional variables:
USE CHOICE
LOGIT
CATEGORY SEX$
SET TIME = AUTO(1),POOL(1),TRAIN(1)
SET COST = AUTO(2),POOL(2),TRAIN(2)
MODEL CHOICE=TIME+COST+SEX$+AGE
ESTIMATE

The hybrid model generates a single coefficient each for TIME and COST and two sets
of parameters for the polytomous variables.
The resulting output is:
Categorical values encountered during processing are:
SEX$ (2 levels)
   Female, Male
CHOICE (3 levels)
        1,        2,        3
Conditional LOGIT Analysis.
Dependent variable: CHOICE
Input records:              29
Records for analysis:       29
Sample split
Category choices
  1                         15
  2                          6
  3                          8
Total :                     29
L-L at iteration 1 is      -31.860
L-L at iteration 2 is      -28.495
L-L at iteration 3 is      -28.477
L-L at iteration 4 is      -28.477
L-L at iteration 5 is      -28.477
Log Likelihood:            -28.477
  Parameter         Estimate       S.E.    t-ratio    p-value
  1 TIME              -0.018      0.020     -0.887      0.375
  2 COST              -0.351      0.217     -1.615      0.106
  Choice Group: 1
  3 SEX$_Female        0.328      0.509      0.645      0.519
  4 AGE                0.026      0.014      1.850      0.064
  Choice Group: 2
  3 SEX$_Female        0.024      0.598      0.040      0.968
  4 AGE               -0.008      0.016     -0.500      0.617
                                             95.0 % bounds
  Parameter       Odds Ratio      Upper      Lower
  1 TIME               0.982      1.022      0.945
  2 COST               0.704      1.078      0.460
  Choice Group: 1
  4 AGE                1.026      1.054      0.998
  Choice Group: 2
  4 AGE                0.992      1.024      0.961
Log Likelihood of constants only model = LL(0) =    -29.645
2*[LL(N)-LL(0)] =      2.335 with 4 df Chi-sq p-value = 0.674
McFadden's Rho-Squared =      0.039

Wald tests on effects across all choices
                          Wald     Chi-Sq
  Effect             Statistic     Signif         df
  3 SEX$_Female          0.551      0.759      2.000
  4 AGE                  4.475      0.107      2.000

Covariance Matrix
              1          2          3          4          5          6
  1       0.000
  2       0.001      0.047
  3       0.002      0.009      0.259
  4      -0.000     -0.001      0.002      0.000
  5       0.002     -0.018      0.165      0.002      0.358
  6      -0.000      0.001      0.002      0.000      0.003      0.000

Correlation Matrix
              1          2          3          4          5          6
  1       1.000      0.180      0.150     -0.076      0.146     -0.266
  2       0.180      1.000      0.084     -0.499     -0.140      0.310
  3       0.150      0.084      1.000      0.230      0.543      0.193
  4      -0.076     -0.499      0.230      1.000      0.281      0.265
  5       0.146     -0.140      0.543      0.281      1.000      0.323
  6      -0.266      0.310      0.193      0.265      0.323      1.000

Varying Alternatives
For some discrete choice problems, the number of alternatives available varies across
choosers. For example, health researchers studying hospital choice pooled data from
several cities in which each city had a different number of hospitals in the choice set
(Luft et al., 1988). Transportation research may pool data from locations having train
service with locations without trains. Carson, Hanemann, and Steinberg (1990) pool
responses from two contingent valuation survey questions having differing numbers of
alternatives. To let LOGIT know about this, there are two ways of proceeding. The most
flexible is to organize the data by choice. With the standard data layout, use the ALT
command, as in
ALT=NCHOICES

where NCHOICES is a SYSTAT variable containing the number of alternatives
available to the chooser. If the value of the ALT variable is less than NCAT for an
observation, LOGIT will use only the first NCHOICES variables in each conditional
variable clause in the analysis.
With the standard data layout, the ALT command is useful only if the choices not
available to some cases all appear at the end of the choice list. Organizing data by
choice is much more manageable. One final note on varying numbers of alternatives:
if the ALT command is used in the standard data layout, the model may not contain a
constant or any polytomous variables; the model must be composed only of conditional
variable clauses. We do not show an example here because the by-choice layout is
better suited to data with varying choice alternatives.

Interactions
A common practice in discrete choice models is to enter characteristics of choosers as
interactions with attributes of the alternatives in conditional variable clauses. When
dealing with large sets of alternatives, such as automobile purchase choices or hospital
choices, where the model may contain up to 60 different alternatives, adding
polytomous variables can quickly produce unmanageable estimation problems, even
for mainframes. In the transportation literature, it has become commonplace to
introduce demographic variables as interactions with, or other functions of, the discrete
choice variables. Thus, instead of, or in addition to, the COST group of variables,
AUTO(2), POOL(2), TRAIN(2), you might see the ratio of cost to income. These ratios
would be created with LET transformations and then added in another SET list for use
as a conditional variable in the MODEL statement. Interactions can also be introduced
this way. By confining demographic variables to appear only as interactions with
choice variables, the number of parameters estimated can be kept quite small.
Thus, an investigator might prefer
USE CHOICE
LOGIT
SET TIME = AUTO(1),POOL(1),TRAIN(1)
SET TIMEAGE=AUTO(1)*AGE,POOL(1)*AGE,TRAIN(1)*AGE
SET COST = AUTO(2),POOL(2),TRAIN(2)
MODEL CHOICE=TIME+TIMEAGE+COST
ESTIMATE

as a way of entering demographics. The advantage to using only conditional clauses is
clear when dealing with a large value of NCAT as the number of additional parameters
estimated is minimized. The model above yields:
Categorical values encountered during processing are:
CHOICE (3 levels)
        1,        2,        3
Conditional LOGIT Analysis.
Dependent variable: CHOICE
Input records:              29
Records for analysis:       29
Sample split
Category choices
  1                         15
  2                          6
  3                          8
Total :                     29
L-L at iteration 1 is      -31.860
L-L at iteration 2 is      -28.021
L-L at iteration 3 is      -27.866
L-L at iteration 4 is      -27.864
L-L at iteration 5 is      -27.864
Log Likelihood:            -27.864
  Parameter       Estimate       S.E.    t-ratio    p-value
  1 TIME            -0.148      0.062     -2.382      0.017
  2 TIMEAGE          0.003      0.001      2.193      0.028
  3 COST             0.007      0.155      0.043      0.966
                                           95.0 % bounds
  Parameter     Odds Ratio      Upper      Lower
  1 TIME             0.863      0.974      0.764
  2 TIMEAGE          1.003      1.006      1.000
  3 COST             1.007      1.365      0.742
Log Likelihood of constants only model = LL(0) =    -29.645
2*[LL(N)-LL(0)] =      3.561 with 1 df Chi-sq p-value = 0.059
McFadden's Rho-Squared =      0.060

Covariance Matrix
              1          2          3
  1       0.004
  2      -0.000      0.000
  3      -0.001      0.000      0.024

Correlation Matrix
              1          2          3
  1       1.000     -0.936     -0.110
  2      -0.936      1.000      0.273
  3      -0.110      0.273      1.000

Constants
The models estimated here deliberately did not include a constant because the constant
is treated as a polytomous variable in LOGIT. To obtain an alternative specific constant,
enter the following model statement:
USE CHOICE
LOGIT
SET TIME = AUTO(1),POOL(1),TRAIN(1)
SET COST = AUTO(2),POOL(2),TRAIN(2)
MODEL CHOICE=CONSTANT+TIME+COST
ESTIMATE

Two CONSTANT parameters would be estimated. For the discrete choice model with
the type of data layout of this example, there is no need to specify the NCAT value
because LOGIT determines this automatically by the number of variables between the
brackets. If the model statement is inconsistent in the number of variables within
brackets across conditional variable clauses, an error message will be generated.
Following is the output:
Categorical values encountered during processing are:
CHOICE (3 levels)
        1,        2,        3
Conditional LOGIT Analysis.
Dependent variable: CHOICE
Input records:              29
Records for analysis:       29
Sample split
Category choices
  1                         15
  2                          6
  3                          8
Total :                     29
L-L at iteration 1 is      -31.860
L-L at iteration 2 is      -25.808
L-L at iteration 3 is      -25.779
L-L at iteration 4 is      -25.779
L-L at iteration 5 is      -25.779
Log Likelihood:            -25.779
  Parameter       Estimate       S.E.    t-ratio    p-value
  1 TIME            -0.012      0.020     -0.575      0.565
  2 COST            -0.567      0.222     -2.550      0.011
  3 CONSTANT         1.510      0.608      2.482      0.013
  3 CONSTANT        -0.865      0.675     -1.282      0.200
                                           95.0 % bounds
  Parameter     Odds Ratio      Upper      Lower
  1 TIME             0.988      1.029      0.950
  2 COST             0.567      0.877      0.367
Log Likelihood of constants only model = LL(0) =    -29.645
2*[LL(N)-LL(0)] =      7.732 with 2 df Chi-sq p-value = 0.021
McFadden's Rho-Squared =      0.130

Wald tests on effects across all choices
                       Wald     Chi-Sq
  Effect          Statistic     Signif         df
  3 CONSTANT          8.630      0.013      2.000

Covariance Matrix
              1          2          3          4
  1       0.000
  2       0.001      0.049
  3      -0.001     -0.082      0.370
  4      -0.005      0.056      0.046      0.455

Correlation Matrix
              1          2          3          4
  1       1.000      0.130     -0.053     -0.350
  2       0.130      1.000     -0.606      0.372
  3      -0.053     -0.606      1.000      0.113
  4      -0.350      0.372      0.113      1.000

Example 9
By-Choice Data Format
In the standard data layout, there is one data record per case that contains information
on every alternative open to a chooser. With a large number of alternatives, this can
quickly lead to an excessive number of variables. A convenient alternative is to
organize data by choice; with this data layout, there is one record per alternative and
as many as NCAT records per case. The data set CHOICE2 organizes the CHOICE data
of the Discrete Choice Models example in this way. If you analyze the differences
between the two data sets, you will see that they are similar to those between the
split-plot and multivariate layouts for the repeated measures design (see Analysis of
Variance). To set up the same problem in a by-choice layout, input the following:
USE CHOICE2
LOGIT
NCAT=3
ALT=NCHOICES
MODEL CHOICE=TIME+COST ;
ESTIMATE

The by-choice format requires that the dependent variable appear with the same value
on each record pertaining to the case. An ALT variable (here NCHOICES) indicating
the number of records for this case must also appear on each record. The by-choice
organization results in fewer variables on the data set, with the savings increasing with
the number of alternatives. However, there is some redundancy in that certain data
values are repeated on each record. The best reason for using a by-choice format is to
handle varying numbers of alternatives per case. In this situation, there is no need to
shuffle data values or to be concerned with choice order.
With the by-choice data format, the NCAT statement is required; it is the only way
for LOGIT to know the number of alternatives to expect per case. For varying numbers
of alternatives per case, the ALT statement is also required, although we use it here with
the same number of alternatives.
USE CHOICE2
LOGIT
CATEGORY SEX$
NCAT=3
ALT=NCHOICES
MODEL CHOICE=TIME+COST ; AGE+SEX$
ESTIMATE

Because the number of alternatives (ALT) is the same for each case in this example, the
output is the same as the Mixed Parameters example.
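
To make the difference between the two layouts concrete, the following sketch expands one standard-layout record into by-choice records. It is only an illustration: the function name, the dictionary structure, and the data values are hypothetical and are not taken from the CHOICE or CHOICE2 files. It does follow the two rules stated above, namely that the dependent variable and the ALT variable are repeated on every record for the case.

```python
def to_by_choice(record, modes=("AUTO", "POOL", "TRAIN")):
    """Expand one standard-layout record into one row per alternative."""
    rows = []
    for mode in modes:
        time, cost = record[mode]         # (TIME, COST) attributes of this alternative
        rows.append({
            "CHOICE": record["CHOICE"],   # same value repeated on every row of the case
            "NCHOICES": len(modes),       # ALT variable, also repeated on every row
            "TIME": time,
            "COST": cost,
            "AGE": record["AGE"],         # chooser characteristics are repeated too
            "SEX$": record["SEX$"],
        })
    return rows

# Hypothetical case: one record holding attributes of all three alternatives
wide = {"CHOICE": 2, "AGE": 35, "SEX$": "Female",
        "AUTO": (20.0, 3.5), "POOL": (30.0, 2.0), "TRAIN": (45.0, 1.5)}
long_rows = to_by_choice(wide)
for row in long_rows:
    print(row)
```

The standard layout needs two variables per alternative, while the by-choice layout needs only TIME and COST however many alternatives there are; the price is the repeated CHOICE, NCHOICES, AGE, and SEX$ values.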

Weighting Choice-Based Samples


For estimation of the slope coefficients of the discrete choice model, weighting is not
required even in choice-based samples. For predictive purposes, however, weighting
is necessary to forecast aggregate shares, and it is also necessary for consistent
estimation of the alternative specific dummies (Manski and Lerman, 1977).
The appropriate weighting procedure for choice-based sample logit estimation
requires that the sum of the weights equal the actual number of observations retained
in the estimation sample. For choice-based samples, the weight for any observation
choosing the jth option is $W_j = S_j / s_j$, where $S_j$ is the population share choosing the
jth option and $s_j$ is the choice-based sample share choosing the jth option.
As an example, suppose theatergoers make up 10% of the population and we have
a choice-based sample consisting of 100 theatergoers ($Y = 1$) and 100
non-theatergoers ($Y = 0$). Although theatergoers make up only 10% of the population,
they are heavily oversampled and make up 50% of the study sample. Using the above
formula, the correct weights would be

$W_0 = 0.9 / 0.5 = 1.8$

$W_1 = 0.1 / 0.5 = 0.2$

and the sum of the weights would be $100 \times 1.8 + 100 \times 0.2 = 200$, as required. To
handle such samples, LOGIT permits non-integer weights and does not truncate them
to integers.
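
The weight calculation can be sketched in a few lines. The function name and the dictionary inputs are hypothetical; the numbers are those of the theatergoer example.

```python
def choice_based_weights(pop_shares, sample_counts):
    """Weight for option j = population share / sample share of option j."""
    n = sum(sample_counts.values())
    return {j: pop_shares[j] / (sample_counts[j] / n) for j in pop_shares}

counts = {0: 100, 1: 100}    # 100 non-theatergoers (Y = 0), 100 theatergoers (Y = 1)
shares = {0: 0.9, 1: 0.1}    # population shares
weights = choice_based_weights(shares, counts)

# The weights must sum to the number of observations in the estimation sample
total = sum(weights[j] * counts[j] for j in counts)
print(weights, total)
```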

Example 10
Stepwise Regression
LOGIT offers forward and backward stepwise logistic regression with single stepping
as an option. The simplest way to initiate stepwise regression is to substitute START
for ESTIMATE following a MODEL statement and then proceed with stepping with the
STEP command, just as in GLM or Regression.

An upward step consists of three components. First, the current model is estimated
to convergence. The procedure is exactly the same as regular estimation. Second, score
statistics for each additional effect are computed, adjusted for variables already in the
model. The joint significance of all additional effects together is also computed.
Finally, the effect with the smallest significance level for its score statistic is identified.
If this significance level is below the ENTER option (0.05 by default), the effect is
added to the model.
A downward step also consists of three computational segments. First, the model is
estimated to convergence. Then Wald statistics are computed for each effect in the
model. Finally, the effect with the largest p value for its Wald test statistic is identified.
If this significance level is above the REMOVE criterion (by default 0.10), the effect is
removed from the model.
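
The selection rules for the two kinds of step can be sketched as follows. This covers only the final decision in each step, not the estimation or the score and Wald computations, and the function names are hypothetical. The score p values in the usage lines are the Step 0 values from the output of this example; the Wald p values passed to the downward step are made up for illustration.

```python
def upward_step(score_pvalues, enter=0.05):
    """Return the effect to add, or None if none meets the ENTER criterion."""
    if not score_pvalues:
        return None
    effect = min(score_pvalues, key=score_pvalues.get)   # smallest significance level
    return effect if score_pvalues[effect] < enter else None

def downward_step(wald_pvalues, remove=0.10, forced=()):
    """Return the effect to drop, or None if none exceeds the REMOVE criterion."""
    candidates = {e: p for e, p in wald_pvalues.items() if e not in forced}
    if not candidates:
        return None
    effect = max(candidates, key=candidates.get)         # largest p value
    return effect if candidates[effect] > remove else None

# Score test p values at Step 0 of the stepwise output
step0 = {"PTL": 0.007, "LWT": 0.020, "HT": 0.036, "RACE": 0.082,
         "SMOKE": 0.026, "UI": 0.020, "AGE": 0.102, "FTV": 0.387}
added = upward_step(step0, enter=0.15)

# Hypothetical Wald p values; CONSTANT is forced and is never a candidate
dropped = downward_step({"CONSTANT": 0.48, "PTL": 0.14, "HT": 0.01},
                        remove=0.20, forced=("CONSTANT",))
print(added, dropped)
```

With ENTER=.15 the first effect added is PTL, which agrees with Step 1 of the output. Keeping ENTER below REMOVE, as in both sets of criteria shown in this example, leaves a gap between the two thresholds that makes cycling less likely.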
If you require certain effects to remain in the model regardless of the outcome of the
Wald test, force them into the model by listing them first on the model and using the
FORCE option of START. It is important to set the ENTER and REMOVE criteria
carefully because it is possible to have a variable cycle in and out of a model
repeatedly. The defaults are:
START / ENTER = .05, REMOVE = .10

although Hosmer and Lemeshow use


START / ENTER =.15, REMOVE =.20

in the example we reproduce below.

Hosmer and Lemeshow use stepwise regression in their search for a model of low
birth weight discussed in the Binary Logit section. We conduct a similar analysis
with:
USE HOSLEM
LOGIT
CATEGORY RACE
MODEL LOW=CONSTANT+PTL+LWT+HT+RACE+SMOKE+UI+AGE+FTV
START / ENTER=.15,REMOVE=.20
STEP / AUTO

Following is the output:


Variables in the SYSTAT Rectangular file are:
    ID          LOW         AGE         LWT         RACE        SMOKE
    PTL         HT          UI          FTV         BWT         RACE1
    CASEID      PTD         LWD

Stepping parameters:
Significance to include    =      0.150
Significance to remove     =      0.200
Number of effects to force =      1
Maximum number of steps    =     10
Direction : Up and Down

Categorical values encountered during processing are:
RACE (3 levels)
        1,        2,        3
LOW (2 levels)
        0,        1
Categorical variables are effects coded with the highest value as reference.
Binary Stepwise LOGIT Analysis.
Dependent variable: LOW
Input records:             189
Records for analysis:      189
Sample split
Category choices
  REF                       59
  RESP                     130
Total :                    189

Step   0
Log Likelihood:           -117.336
  Parameter       Estimate       S.E.    t-ratio    p-value
  1 CONSTANT        -0.790      0.157     -5.033      0.000
Score tests on effects not in model
                     Score     Chi-Sq
  Effect         Statistic     Signif         df
  2 PTL              7.267      0.007      1.000
  3 LWT              5.438      0.020      1.000
  4 HT               4.388      0.036      1.000
  5 RACE             5.005      0.082      2.000
  6 SMOKE            4.924      0.026      1.000
  7 UI               5.401      0.020      1.000
  8 AGE              2.674      0.102      1.000
  9 FTV              0.749      0.387      1.000
  Joint Score       30.959      0.000      9.000

Step   1
Log Likelihood:           -113.946
  Parameter       Estimate       S.E.    t-ratio    p-value
  1 CONSTANT        -0.964      0.175     -5.511      0.000
  2 PTL              0.802      0.317      2.528      0.011
Score tests on effects not in model
                     Score     Chi-Sq
  Effect         Statistic     Signif         df
  3 LWT              4.113      0.043      1.000
  4 HT               4.722      0.030      1.000
  5 RACE             5.359      0.069      2.000
  6 SMOKE            3.164      0.075      1.000
  7 UI               3.161      0.075      1.000
  8 AGE              3.478      0.062      1.000
  9 FTV              0.577      0.448      1.000
  Joint Score       24.772      0.002      8.000

Step   2
Log Likelihood:           -111.792
  Parameter       Estimate       S.E.    t-ratio    p-value
  1 CONSTANT        -1.062      0.184     -5.764      0.000
  2 PTL              0.823      0.318      2.585      0.010
  3 HT               1.272      0.616      2.066      0.039
Score tests on effects not in model
                     Score     Chi-Sq
  Effect         Statistic     Signif         df
  4 LWT              6.900      0.009      1.000
  5 RACE             4.882      0.087      2.000
  6 SMOKE            3.117      0.078      1.000
  7 UI               4.225      0.040      1.000
  8 AGE              3.448      0.063      1.000
  9 FTV              0.370      0.543      1.000
  Joint Score       20.658      0.004      7.000

Step   3
Log Likelihood:           -107.982
  Parameter       Estimate       S.E.    t-ratio    p-value
  1 CONSTANT         1.093      0.841      1.299      0.194
  2 PTL              0.726      0.328      2.213      0.027
  3 HT               1.856      0.705      2.633      0.008
  4 LWT             -0.017      0.007     -2.560      0.010
Score tests on effects not in model
                     Score     Chi-Sq
  Effect         Statistic     Signif         df
  5 RACE             5.266      0.072      2.000
  6 SMOKE            2.857      0.091      1.000
  7 UI               3.081      0.079      1.000
  8 AGE              1.895      0.169      1.000
  9 FTV              0.118      0.732      1.000
  Joint Score       14.395      0.026      6.000

Step   4
Log Likelihood:           -105.425
  Parameter       Estimate       S.E.    t-ratio    p-value
  1 CONSTANT         1.405      0.900      1.560      0.119
  2 PTL              0.746      0.328      2.278      0.023
  3 HT               1.805      0.714      2.530      0.011
  4 LWT             -0.018      0.007     -2.607      0.009
  5 RACE_1          -0.518      0.237     -2.190      0.029
  6 RACE_2           0.569      0.318      1.787      0.074
Score tests on effects not in model
                     Score     Chi-Sq
  Effect         Statistic     Signif         df
  6 SMOKE            5.936      0.015      1.000
  7 UI               3.265      0.071      1.000
  8 AGE              1.019      0.313      1.000
  9 FTV              0.025      0.873      1.000
  Joint Score        9.505      0.050      4.000

Step   5
Log Likelihood:           -102.449
  Parameter       Estimate       S.E.    t-ratio    p-value
  1 CONSTANT         0.851      0.913      0.933      0.351
  2 PTL              0.602      0.335      1.797      0.072
  3 HT               1.745      0.695      2.511      0.012
  4 LWT             -0.017      0.007     -2.418      0.016
  5 RACE_1          -0.734      0.263     -2.790      0.005
  6 RACE_2           0.557      0.324      1.720      0.085
  7 SMOKE            0.946      0.395      2.396      0.017
Score tests on effects not in model
                     Score     Chi-Sq
  Effect         Statistic     Signif         df
  7 UI               3.034      0.082      1.000
  8 AGE              0.781      0.377      1.000
  9 FTV              0.014      0.904      1.000
  Joint Score        3.711      0.294      3.000

Step   6
Log Likelihood:           -100.993
  Parameter       Estimate       S.E.    t-ratio    p-value
  1 CONSTANT         0.654      0.921      0.710      0.477
  2 PTL              0.503      0.341      1.475      0.140
  3 HT               1.855      0.695      2.669      0.008
  4 LWT             -0.016      0.007     -2.320      0.020
  5 RACE_1          -0.741      0.265     -2.797      0.005
  6 RACE_2           0.585      0.323      1.811      0.070
  7 SMOKE            0.939      0.399      2.354      0.019
  8 UI               0.786      0.456      1.721      0.085
Score tests on effects not in model
                     Score     Chi-Sq
  Effect         Statistic     Signif         df
  8 AGE              0.553      0.457      1.000
  9 FTV              0.056      0.813      1.000
  Joint Score        0.696      0.706      2.000

Log Likelihood:           -100.993
  Parameter       Estimate       S.E.    t-ratio    p-value
  1 CONSTANT         0.654      0.921      0.710      0.477
  2 PTL              0.503      0.341      1.475      0.140
  3 HT               1.855      0.695      2.669      0.008
  4 LWT             -0.016      0.007     -2.320      0.020
  5 RACE_1          -0.741      0.265     -2.797      0.005
  6 RACE_2           0.585      0.323      1.811      0.070
  7 SMOKE            0.939      0.399      2.354      0.019
  8 UI               0.786      0.456      1.721      0.085
                                           95.0 % bounds
  Parameter     Odds Ratio      Upper      Lower
  2 PTL              1.654      3.229      0.847
  3 HT               6.392     24.964      1.637
  4 LWT              0.984      0.998      0.971
  5 RACE_1           0.477      0.801      0.284
  6 RACE_2           1.795      3.379      0.953
  7 SMOKE            2.557      5.586      1.170
  8 UI               2.194      5.367      0.897
Log Likelihood of constants only model = LL(0) =   -117.336
2*[LL(N)-LL(0)] =     32.686 with 7 df Chi-sq p-value = 0.000
McFadden's Rho-Squared =      0.139

Not all logistic regression programs compute the variable addition statistics in the same
way, so minor differences in output are possible. Our results listed in the Chi-Square
Significance column of the first step, for example, correspond to H&L's first row in
their Table 4.15; the two sets of results are very similar but not identical. While our
method yields the same final model as H&L, the order in which variables are entered
is not the same because intermediate p values differ slightly. Once a final model is
arrived at, it is re-estimated to give true maximum likelihood estimates.

Example 11
Hypothesis Testing
Two types of hypothesis tests are easily conducted in LOGIT: the likelihood ratio (LR)
test and the Wald test. The tests are discussed in numerous statistics books, sometimes
under varying names. Accounts can be found in Maddala's text (1988), Cox and
Hinkley (1974), Rao (1973), Engel (1984), and Breslow and Day (1980). Here we
provide some elementary examples.

Likelihood-Ratio Test
The likelihood-ratio test is conducted by fitting two nested models (the restricted and
the unrestricted) and comparing the log-likelihoods at convergence. Typically, the
unrestricted model contains a proposed set of variables, and the restricted model omits
a selected subset, although other restrictions are possible. The test statistic is twice the
difference of the log-likelihoods and is chi-squared with degrees of freedom equal to
the number of restrictions imposed. When the restrictions consist of excluding
variables, the degrees of freedom are equal to the number of parameters set to 0.
If a model contains a constant, LOGIT automatically calculates a likelihood-ratio test
of the null hypothesis that all coefficients except the constant are 0. It appears on a line
that looks like:
2*[LL(N)-LL(0)] = 26.586 with 5 df, Chi-sq p-value = 0.00007

This example line states that twice the difference between the log-likelihood of the
estimated model and that of the constants only model is 26.586, which is a chi-squared
deviate on five degrees of freedom. The p value indicates that the null hypothesis
would be rejected.
To illustrate use of the LR test, consider a model estimated on the low birth weight
data (see the Binary Logit example). Assuming CATEGORY=RACE, compare the
following model
MODEL LOW = CONSTANT + LWD + AGE + RACE + PTD

with
MODEL LOW = CONSTANT + LWD + AGE

The null hypothesis is that the categorical variable RACE, which contributes two
parameters to the model, and PTD are jointly 0. The model log-likelihoods are -104.043
and -112.143, and twice the difference (16.20) is chi-squared with three degrees of
freedom under the null hypothesis. This value can also be more conveniently
calculated by taking the difference of the LR test statistics reported below the
parameter estimates and the difference in the degrees of freedom. The unrestricted
model above has G = 26.587 with five degrees of freedom, and the restricted model
has G = 10.385 with two degrees of freedom. The difference between the G values
is 16.20, and the difference between degrees of freedom is 3.
Although LOGIT will not automatically calculate LR statistics across separate
models, the p value of the result can be obtained with the command:
CALC 1-XCF(16.2,3)
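
For readers who want to check the result of the CALC command outside SYSTAT, here is a minimal sketch using only the Python standard library. The helper chi2_sf is hypothetical; it evaluates the upper-tail chi-squared probability for integer degrees of freedom with a standard finite recurrence, and the log-likelihood values are those quoted in this section.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        k, q = 2, math.exp(-half)                 # closed form for 2 df
    else:
        k, q = 1, math.erfc(math.sqrt(half))      # closed form for 1 df
    while k < df:
        # Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

ll_unrestricted = -104.043    # CONSTANT + LWD + AGE + RACE + PTD
ll_restricted = -112.143      # CONSTANT + LWD + AGE
lr = 2.0 * (ll_unrestricted - ll_restricted)
p_value = chi2_sf(lr, 3)      # three restrictions: RACE (two parameters) and PTD
print(f"LR = {lr:.2f}, p = {p_value:.5f}")
```

The same helper applied to the example line (26.586 on five degrees of freedom) gives a p value of about 0.00007, in agreement with the printed line.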

Wald Test
The Wald test is the best known inferential procedure in applied statistics. To conduct
a Wald test, we first estimate a model and then pose a linear constraint on the
parameters estimated. The statistic is based on the constraint and the appropriate
elements of the covariance matrix of the parameter vector. A test of whether a single
parameter is 0 is conducted as a Wald test by dividing the squared coefficient by its
variance and referring the result to a chi-squared distribution on one degree of freedom.
Thus, each t ratio is itself the square root of a simple Wald test. Following is an
example:
USE HOSLEM
LOGIT
CATEGORY RACE
MODEL LOW=CONSTANT+LWD+AGE+RACE+PTD
ESTIMATE
HYPOTHESIS
CONSTRAIN PTD=0
CONSTRAIN RACE[1]=0
CONSTRAIN RACE[2]=0
TEST

Following is the output (minus the estimation stage):


Entering hypothesis procedure.
Linear Restriction System
                               Parameter
  EQN        1        2        3        4        5        6      RHS        Q
    1      0.0      0.0      0.0      0.0      0.0    1.000      0.0    1.515
    2      0.0      0.0      0.0    1.000      0.0      0.0      0.0   -0.442
    3      0.0      0.0      0.0      0.0    1.000      0.0      0.0    0.464
General linear Wald test results
ChiSq Statistic    =    15.104
ChiSq p-value      =     0.002
Degrees of freedom =         3

Note that this statistic of 15.104 is close to the LR statistic of 16.2 obtained for the same
hypothesis in the previous section. Although there are three separate CONSTRAIN lines
in the HYPOTHESIS paragraph above, they are tested jointly in a single test. To test
each restriction individually, place a TEST after each CONSTRAIN. The restrictions
being tested are each entered with separate CONSTRAIN commands. These can include
any linear algebraic expression without parentheses involving the parameters. If
interactions were present on the MODEL statement, they can also appear on the
CONSTRAIN statement. To reference dummies generated from categorical covariates,
use square brackets, as in the example for RACE. This constraint refers to the
coefficient labeled RACE1 in the output.
More elaborate tests can be posed in this framework. For example,
CONSTRAIN 7*LWD - 4.3*AGE + 1.5*RACE[2] = -5

or
CONSTRAIN AGE + LWD = 1
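
The arithmetic behind a single linear constraint can be sketched as below: the squared deviation of the constrained combination from its right-hand side, divided by the variance of that combination. The function name is hypothetical. The coefficients and covariance matrix are taken from the conditional logit output for the matched-set example earlier in this chapter (GALL, EST, GALL*EST), and the second constraint, GALL + GALL*EST = 0, is chosen only for illustration.

```python
import math

def wald_linear(c, b, v, rhs=0.0):
    """Wald chi-squared (1 df) for the single linear constraint c'b = rhs."""
    diff = sum(ci * bi for ci, bi in zip(c, b)) - rhs
    n = len(c)
    var = sum(c[i] * v[i][j] * c[j] for i in range(n) for j in range(n))
    return diff * diff / var

b = [2.8943, 2.7001, -2.0527]            # GALL, EST, GALL*EST
v = [[0.7798, 0.3398, -0.7836],          # covariance matrix of the estimates
     [0.3398, 0.3743, -0.3667],
     [-0.7836, -0.3667, 0.9900]]

w_single = wald_linear([0, 0, 1], b, v)  # GALL*EST = 0
t_equiv = math.sqrt(w_single)            # absolute value of the printed t ratio
w_sum = wald_linear([1, 0, 1], b, v)     # GALL + GALL*EST = 0 (illustrative)
print(f"{w_single:.3f} {t_equiv:.3f} {w_sum:.3f}")
```

The square root of the first statistic reproduces the magnitude of the t ratio for GALL*EST (2.0631) up to rounding, which is the sense in which each t ratio is the square root of a simple Wald test. A joint test of several constraints needs the matrix form of the same calculation and is not shown here.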

For multinomial models, the architecture is a little different. To reference a variable
that appears in more than one parameter vector, follow it with curly braces around
the number corresponding to the Choice Group. For example,
CONSTRAIN CONSTANT{1} - CONSTANT{2} = 0
CONSTRAIN AGE{1} - AGE{2} = 0

Comparisons between Tests


The Wald and likelihood-ratio tests are classical testing methods in statistics. The
properties of the tests are based on asymptotic theory, and in the limit, as sample sizes
tend to infinity, the tests give identical results. In small samples, there will be
differences between results and conclusions, as has been emphasized by Hauck and
Donner (1977). Given a choice, which test should be used?
Most statisticians favor the LR test over the Wald for three reasons. First, the
likelihood is the fundamental measure on which model fitting is based. Cox and Oakes
(1984) illustrate this preference when they use the likelihood profile to determine
confidence intervals for a parameter in a survival model. Second, Monte Carlo studies
suggest that the LR test is more reliable in small samples. Finally, a nonlinear
constraint can be imposed on the parameter estimates and simply tested by estimating
restricted and unrestricted models. See the Quantiles example for an illustration
involving LD50 values. Also, you can use the FUNPAR option in NONLIN to do the
same thing.
Why bother with the Wald test, then? One reason is simplicity and computational
cost. The LR test requires estimation of two models to final convergence for a single
test, and each additional test requires another full estimation. By contrast, any number
of Wald tests can be run on the basis of one estimated model, and they do not require
an additional pass through the data.

Example 12
Quasi-Maximum Likelihood
When a model to be estimated by maximum likelihood is misspecified, White (1982)
has shown that the standard methods for obtaining the variance-covariance matrix are
incorrect. In particular, standard errors derived from the inverse matrix of second
derivatives and all hypothesis tests based on this matrix are unreliable. Since
misspecification may be the rule rather than the exception, is there any safe way to
proceed with inference? White offers an alternative variance-covariance matrix that
simplifies (asymptotically) to the inverse Hessian when the model is not misspecified
and is correct when the model is misspecified. Calling the procedure of estimating a
misspecified model quasi-maximum likelihood estimation (QMLE), the proper QML
matrix is defined as
$Q = H^{-1} G H^{-1}$

where $H^{-1}$ is the covariance matrix at convergence and $G$ is the cumulated outer
product of the gradient vectors.
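
To show what the two ingredients look like in practice, here is a small self-contained sketch for a binary logit with a constant and one covariate. It is not LOGIT's implementation: the data are made up, the function names are hypothetical, and the fit uses plain Newton steps. H is accumulated as the negative of the second derivative matrix of the log-likelihood, so that its inverse is the usual covariance matrix, and G is the sum over cases of the outer product of each case's gradient.

```python
import math

def prob(b0, b1, x):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def moments(xs, ys, b0, b1):
    """Return the gradient, the matrix H, and the outer-product matrix G."""
    grad = [0.0, 0.0]
    h = [[0.0, 0.0], [0.0, 0.0]]
    g = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in zip(xs, ys):
        p = prob(b0, b1, x)
        z = (1.0, x)                              # constant and covariate
        for a in range(2):
            grad[a] += (y - p) * z[a]
            for c in range(2):
                h[a][c] += p * (1.0 - p) * z[a] * z[c]
                g[a][c] += (y - p) ** 2 * z[a] * z[c]
    return grad, h, g

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def matmul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def fit(xs, ys, iters=25):
    """Newton iterations from a zero start (adequate for this tiny data set)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        grad, h, _ = moments(xs, ys, b0, b1)
        hinv = inv2(h)
        b0 += hinv[0][0] * grad[0] + hinv[0][1] * grad[1]
        b1 += hinv[1][0] * grad[0] + hinv[1][1] * grad[1]
    return b0, b1

XS = [1, 2, 3, 4, 5, 6, 7, 8]                     # made-up covariate values
YS = [0, 0, 1, 0, 1, 0, 1, 1]                     # made-up binary responses
b0, b1 = fit(XS, YS)
grad, h, g = moments(XS, YS, b0, b1)
h_inv = inv2(h)                                   # standard covariance matrix
q = matmul2(matmul2(h_inv, g), h_inv)             # QML covariance matrix
print([math.sqrt(h_inv[i][i]) for i in range(2)])  # standard S.E.
print([math.sqrt(q[i][i]) for i in range(2)])      # QML S.E.
```

The coefficients are the same under both matrices; only the standard errors, and everything derived from them, change.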

White shows that for a misspecified model, the LR test is not asymptotically
chi-squared, and the Wald and likelihood-ratio tests are not asymptotically equivalent,
even when the QML matrix is used for Wald tests.
The best course of action appears to be to use only the QML version of the Wald test
when misspecification is a serious possibility. If the QML covariance matrix is
requested with the ESTIMATE command, a second set of parameter statistics will be
printed, reflecting the new standard errors, t ratios and p values; the coefficients are
unchanged. The QML covariance matrix will replace the standard covariance matrix
during subsequent hypothesis testing with the HYPOTHESIS command. Following is
an example:
USE NLS
LOGIT
MODEL CULTURE=CONSTANT+IQ
ESTIMATE / QML

Following is the output:


Categorical values encountered during processing are:
CULTURE (3 levels)
        1,        2,        3
Multinomial LOGIT Analysis.
Dependent variable: CULTURE
Input records:             200
Records for analysis:      200
Sample split
Category choices
  1                         12
  2                         49
  3                        139
Total :                    200
L-L at iteration 1 is     -219.722
L-L at iteration 2 is     -148.554
L-L at iteration 3 is     -144.158
L-L at iteration 4 is     -143.799
L-L at iteration 5 is     -143.793
L-L at iteration 6 is     -143.793
Log Likelihood:           -143.793
  Parameter       Estimate       S.E.    t-ratio    p-value
  Choice Group: 1
  1 CONSTANT         4.252      2.107      2.018      0.044
  2 IQ              -0.065      0.021     -3.052      0.002
  Choice Group: 2
  1 CONSTANT         3.287      1.275      2.579      0.010
  2 IQ              -0.041      0.012     -3.372      0.001
                                           95.0 % bounds
  Parameter     Odds Ratio      Upper      Lower
  Choice Group: 1
  2 IQ               0.937      0.977      0.898
  Choice Group: 2
  2 IQ               0.960      0.983      0.937
Log Likelihood of constants only model = LL(0) =   -153.254
2*[LL(N)-LL(0)] =     18.921 with 2 df Chi-sq p-value = 0.000
McFadden's Rho-Squared =      0.062

Covariance matrix QML adjusted.
Log Likelihood:           -143.793
  Parameter       Estimate       S.E.    t-ratio    p-value
  Choice Group: 1
  1 CONSTANT         4.252      2.252      1.888      0.059
  2 IQ              -0.065      0.023     -2.860      0.004
  Choice Group: 2
  1 CONSTANT         3.287      1.188      2.767      0.006
  2 IQ              -0.041      0.011     -3.682      0.000
                                           95.0 % bounds
  Parameter     Odds Ratio      Upper      Lower
  Choice Group: 1
  2 IQ               0.937      0.980      0.896
  Choice Group: 2
  2 IQ               0.960      0.981      0.939
Log Likelihood of constants only model = LL(0) =   -153.254
2*[LL(N)-LL(0)] =     18.921 with 2 df Chi-sq p-value = 0.000
McFadden's Rho-Squared =      0.062

Note the changes in the standard errors, t ratios, p values, odds ratio bounds, Wald test
p values, and covariance matrix.

Computation
All calculations are in double precision.

Algorithms
LOGIT uses Gauss-Newton methods for maximizing the likelihood. By default, two
tolerance criteria must be satisfied: the maximum value for relative coefficient changes
must fall below 0.001, and the Euclidean norm of the relative parameter change vector
must also fall below 0.001. By default, LOGIT uses the second derivative matrix to
update the parameter vector. In discrete choice models, it may be preferable to use a
first derivative approximation to the Hessian instead. This option, popularized by
Berndt, Hall, Hall, and Hausman (1974), will be noted if it is used by the program.
BHHH uses the summed outer products of the gradient vector in place of the Hessian
matrix and generally will converge much more slowly than the default method.
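
The iteration described above can be sketched for a binary logit with a constant and one covariate. This is an illustration only, not LOGIT's implementation: the function names, the toy data, the way the "relative" change is scaled, and the step-halving safeguard are all assumptions. Setting use_bhhh=True swaps the second derivative matrix for the summed outer products of the per-case gradients.

```python
import math

def log_lik(beta, xs, ys):
    """Binary logit log-likelihood for a constant plus one covariate."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = beta[0] + beta[1] * x
        total += y * eta - math.log1p(math.exp(eta))
    return total

def fit_logit(xs, ys, use_bhhh=False, tol=0.001, max_iter=200, max_halvings=10):
    beta = [0.0, 0.0]
    for iteration in range(1, max_iter + 1):
        g = [0.0, 0.0]                      # gradient of the log-likelihood
        a = [[0.0, 0.0], [0.0, 0.0]]        # (approximate) information matrix
        for x, y in zip(xs, ys):
            row = (1.0, x)
            p = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))
            # BHHH weights each case by its squared residual; the default
            # uses the exact second derivative weight p * (1 - p).
            w = (y - p) ** 2 if use_bhhh else p * (1.0 - p)
            for r in range(2):
                g[r] += (y - p) * row[r]
                for c in range(2):
                    a[r][c] += w * row[r] * row[c]
        det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
        step = [(a[1][1] * g[0] - a[0][1] * g[1]) / det,
                (a[0][0] * g[1] - a[1][0] * g[0]) / det]
        old_ll = log_lik(beta, xs, ys)
        for _ in range(max_halvings + 1):   # assumed safeguard: halve bad steps
            trial = [beta[0] + step[0], beta[1] + step[1]]
            if log_lik(trial, xs, ys) >= old_ll:
                break
            step = [0.5 * s for s in step]
        # Assumed definition of "relative" change: |step| / |new coefficient|.
        rel = [abs(s) / max(abs(b), 1e-8) for s, b in zip(step, trial)]
        beta = trial
        if max(rel) < tol and math.sqrt(sum(r * r for r in rel)) < tol:
            return beta, iteration
    return beta, max_iter

xs = [1, 2, 3, 4, 5, 6, 7, 8]               # hypothetical data
ys = [0, 0, 1, 0, 1, 0, 1, 1]
beta_newton, iters_newton = fit_logit(xs, ys)
beta_bhhh, iters_bhhh = fit_logit(xs, ys, use_bhhh=True)
print(beta_newton, iters_newton)
print(beta_bhhh, iters_bhhh)
```

On data like these, the BHHH variant typically needs more iterations to meet the same two criteria, in line with the remark about slower convergence.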

Missing Data
Cases with missing data on any variables included in a model are deleted.


Basic Formulas
For the binary logistic regression model, the dependent variable for the ith case is $Y_i$,
taking on values of 0 (nonresponse) and 1 (response), and the probability of response
is a function of the covariate vector $x_i$ and the unknown coefficient vector $\beta$. We write
this probability as:

$$\mathrm{Prob}(Y_i = 1 \mid x_i) = \frac{e^{x_i \beta}}{1 + e^{x_i \beta}}$$

and abbreviate it as $P_i$. The log-likelihood for the sample is given by

$$LL(\beta) = \sum_{i=1}^{n} \left[ Y_i \log P_i + (1 - Y_i) \log (1 - P_i) \right]$$

For the polytomous multinomial logit, the integer-valued dependent variable ranges
from 1 to $k$, and the probability that the ith case has $Y = m$, where $1 \le m \le k$, is:

$$\mathrm{Prob}(Y_i = m \mid x_i) = \frac{e^{x_i \beta_m}}{\sum_{j=1}^{k} e^{x_i \beta_j}}$$

In this model, $k$ is fixed for all cases, there is a single covariate vector $x_i$, and $k$
parameter vectors $\beta_j$ are estimated. This last equation is identified by normalizing $\beta_k$ to 0.
McFadden's discrete choice model represents a distinct variant of the logit model
based on Luce's (1959) probabilistic choice model. Each subject is observed to make
a choice from a set $C_i$ consisting of $J_i$ elements. Each element is characterized by a
separate covariate vector of attributes $Z_k$. The dependent variable $Y_i$ ranges from 1 to
$J_i$, with $J_i$ possibly varying across subjects, and the probability that $Y_i = k$, where
$1 \le k \le J_i$, is a function of the attribute vectors $Z_1, Z_2, \ldots, Z_{J_i}$ and the parameter vector $\beta$.
The probability that the ith subject chooses element $m$ from his choice set is:

$$\mathrm{Prob}(Y_i = m \mid Z) = \frac{e^{Z_m \beta}}{\sum_{j \in C_i} e^{Z_j \beta}}$$


Heuristically, this equation differs from the previous one in the components that vary
with alternative outcomes of the dependent variable. In the polytomous logit, the
coefficients are alternative-specific and the covariate vector is constant; in the discrete
choice model, while the attribute vector is alternative-specific, the coefficients are
constant. The models also differ in that the range of the dependent variable can be case-specific in the discrete choice model, while it is constant for all cases in the polytomous
model.
The polytomous logit can be recast as a discrete choice model in which each
covariate x is entered as an interaction with an alternative-specific dummy, and the
number of alternatives is constant for all cases. This reparameterization is used for the
mixed polytomous discrete choice model.
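
The recasting can be checked numerically. In this sketch the coefficients are the IQ estimates printed in the output earlier in this chapter, with the third choice group assumed to be the one normalized to 0; the IQ value of 100 and all function names are hypothetical. Each covariate is entered once per non-reference alternative, multiplied by that alternative's dummy, and the discrete choice form then reproduces the polytomous probabilities.

```python
import math

def softmax(scores):
    """Turn linear scores into choice probabilities (numerically stable)."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def polytomous_probs(x, betas):
    """One shared covariate vector x, one coefficient vector per outcome."""
    return softmax([sum(a * b for a, b in zip(x, beta)) for beta in betas])

def discrete_choice_probs(z_rows, beta):
    """One attribute vector per alternative, one shared coefficient vector."""
    return softmax([sum(a * b for a, b in zip(z, beta)) for z in z_rows])

x = [1.0, 100.0]                              # constant and a hypothetical IQ of 100
betas = [[4.252, -0.065], [3.287, -0.041], [0.0, 0.0]]
p_poly = polytomous_probs(x, betas)

# Recast: covariates interacted with alternative-specific dummies.
stacked_beta = betas[0] + betas[1]            # [const1, iq1, const2, iq2]
z_rows = [
    x + [0.0, 0.0],                           # alternative 1
    [0.0, 0.0] + x,                           # alternative 2
    [0.0, 0.0, 0.0, 0.0],                     # alternative 3 (reference)
]
p_choice = discrete_choice_probs(z_rows, stacked_beta)

print([round(p, 3) for p in p_poly])
print([round(p, 3) for p in p_choice])
```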

Regression Diagnostics Formulas


The SAVE command issued before the deciles of risk command (DC) produces a
SYSTAT save file with a number of diagnostic quantities computed for each case in
the input data set. Computations are always conducted on the assumption that each
covariate pattern is unique. The following formulas are based on the binary dependent
variable $y_j$, which is either 0 or 1, and fitted probabilities $\hat{P}_j$, obtained from the basic
logistic equation.
LEVERAGE(1) is the diagonal element of Pregibon's (1981) hat matrix, with
formulas given by Hosmer and Lemeshow (1989) as their equations (5.7) and (5.8). It
is defined as $b_j v_j$, where

$$b_j = x_j' (X'VX)^{-1} x_j$$

and $x_j$ is the covariate vector for the jth case, $X$ is the data matrix for the sample
including a constant, and $V$ is a diagonal matrix with general element $\hat{P}_j (1 - \hat{P}_j)$,
where $\hat{P}_j$ is the fitted probability for the jth case. $b_j$ is our LEVERAGE(2).

$$v_j = \hat{P}_j (1 - \hat{P}_j)$$

Thus LEVERAGE(1) is given by

$$h_j = v_j b_j$$

The PEARSON residual is

$$r_j = \frac{y_j - \hat{P}_j}{\sqrt{\hat{P}_j (1 - \hat{P}_j)}}$$

The VARIANCE of the residual is

$$v_j (1 - h_j)$$

and the standardized residual STANDARD is

$$rs_j = \frac{r_j}{\sqrt{1 - h_j}}$$

The DEVIANCE residual is defined as

$$d_j = \sqrt{-2 \ln \hat{P}_j}$$

for $y_j = 1$ and

$$d_j = -\sqrt{-2 \ln (1 - \hat{P}_j)}$$

otherwise.
DELDSTAT is the change in deviance and is

$$\Delta D_j = \frac{d_j^2}{1 - h_j}$$

DELPSTAT is the change in Pearson chi-square:

$$\Delta \chi_j^2 = rs_j^2$$

The final three saved quantities are measures of the overall change in the estimated
parameter vector $\beta$.

$$\mathrm{DELBETA}(1) = \frac{rs_j^2 \, h_j}{1 - h_j}$$


is a measure proposed by Pregibon, and


DELBETA ( 2 )

= rs j hj ( 1 hj )

DELBETA ( 3 )

= rs j hj ( 1 hj )
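
A minimal sketch of the case diagnostics for a binary logit with a constant and one covariate, using the standard Hosmer and Lemeshow forms of leverage, Pearson, standardized and deviance residuals, DELDSTAT, DELPSTAT, and DELBETA(1). The data, the coefficient values, and the function name are hypothetical, and DELBETA(2) and DELBETA(3) are not computed. It does not reproduce the SAVE file layout.

```python
import math

def logit_diagnostics(xs, ys, beta):
    """Per-case diagnostics for a logit with a constant and one covariate."""
    ps = [1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x))) for x in xs]
    vs = [p * (1.0 - p) for p in ps]
    # X'VX for the design matrix with rows (1, x), and its 2 x 2 inverse.
    a00 = sum(vs)
    a01 = sum(v * x for v, x in zip(vs, xs))
    a11 = sum(v * x * x for v, x in zip(vs, xs))
    det = a00 * a11 - a01 * a01
    i00, i01, i11 = a11 / det, -a01 / det, a00 / det
    cases = []
    for x, y, p, v in zip(xs, ys, ps, vs):
        b = i00 + 2.0 * i01 * x + i11 * x * x     # LEVERAGE(2): x'(X'VX)^-1 x
        h = v * b                                  # LEVERAGE(1)
        r = (y - p) / math.sqrt(v)                 # PEARSON
        rs = r / math.sqrt(1.0 - h)                # STANDARD
        if y == 1:                                 # DEVIANCE
            d = math.sqrt(-2.0 * math.log(p))
        else:
            d = -math.sqrt(-2.0 * math.log(1.0 - p))
        cases.append({
            "leverage1": h, "leverage2": b, "pearson": r, "variance": v * (1.0 - h),
            "standard": rs, "deviance": d,
            "deldstat": d * d / (1.0 - h),
            "delpstat": rs * rs,
            "delbeta1": rs * rs * h / (1.0 - h),
        })
    return cases

xs = [1, 2, 3, 4, 5, 6, 7, 8]                      # hypothetical data
ys = [0, 0, 1, 0, 1, 0, 1, 1]
beta = [-2.673, 0.594]                             # hypothetical fitted coefficients
diagnostics = logit_diagnostics(xs, ys, beta)
for case in diagnostics:
    print({name: round(value, 3) for name, value in case.items()})
```

A quick sanity check on any such computation is that the LEVERAGE(1) values sum to the number of estimated parameters (here, 2).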

References
Agresti, A. (1990). Categorical data analysis. New York: John Wiley & Sons, Inc.
Albert, A. and Anderson, J. A. (1984). On the existence of maximum likelihood estimates
in logistic regression models. Biometrika, 71, 1–10.
Amemiya, T. (1981). Qualitative response models: A survey. Journal of Economic
Literature, 1483–1536.
Begg, C. B. and Gray, R. (1984). Calculation of polychotomous logistic regression
parameters using individualized regressions. Biometrika, 71, 11–18.
Beggs, S., Cardell, N. S., and Hausman, J. A. (1981). Assessing the potential demand for
electric cars. Journal of Econometrics, 16, 1–19.
Ben-Akiva, M. and Lerman, S. (1985). Discrete choice analysis. Cambridge, Mass.: MIT
Press.
Berndt, E. K., Hall, B. K., Hall, R. E., and Hausman, J. A. (1974). Estimation and inference
in non-linear structural models. Annals of Economic and Social Measurement, 3,
653–665.
Breslow, N. (1982). Covariance adjustment of relative-risk estimates in matched studies.
Biometrics, 38, 661–672.
Breslow, N. and Day, N. E. (1980). Statistical methods in cancer research, vol. II: The
design and analysis of cohort studies. Lyon: IARC.
Breslow, N., Day, N. E., Halvorsen, K. T., Prentice, R. L., and Sabai, C. (1978). Estimation
of multiple relative risk functions in matched case-control studies. American Journal of
Epidemiology, 108, 299–307.
Carson, R., Hanemann, M., and Steinberg, S. (1990). A discrete choice contingent
valuation estimate of the value of Kenai king salmon. Journal of Behavioral Economics,
19, 53–68.
Chamberlain, G. (1980). Analysis of covariance with qualitative data. Review of Economic
Studies, 47, 225–238.
Cook, D. R. and Weisberg, S. (1984). Residuals and influence in regression. New York:
Chapman and Hall.

Coslett, S. R. (1980). Efficient estimation of discrete choice models. In C. Manski and D.
McFadden, Eds., Structural Analysis of Discrete Data with Econometric Applications.
Cambridge, Mass.: MIT Press.
Cox, D. R. (1975). Partial likelihood. Biometrika, 62, 269–276.
Cox, D. R. and Hinkley, D. V. (1974). Theoretical statistics. London: Chapman and Hall.
Cox, D. R. and Oakes, D. (1984). Analysis of survival data. New York: Chapman and Hall.
Domencich, T. and McFadden, D. (1975). Urban travel demand: A behavioral analysis.
Amsterdam: North-Holland.
Engle, R. F. (1984). Wald, likelihood ratio and Lagrange multiplier tests in econometrics.
In Z. Griliches and M. Intriligator, Eds., Handbook of Econometrics. New York: North-Holland.
Finney, D. J. (1978). Statistical method in biological assay. London: Charles Griffin.
Hauck, W. W. (1980). A note on confidence bands for the logistic response curve.
American Statistician, 37, 158–160.
Hauck, W. W. and Donner, A. (1977). Wald's test as applied to hypotheses in logit
analysis. Journal of the American Statistical Association, 72, 851–853.
Hensher, D. and Johnson, L. W. (1981). Applied discrete choice modelling. London:
Croom Helm.
Hoffman, S. and Duncan, G. (1988). Multinomial and conditional logit discrete choice
models in demography. Demography, 25, 415–428.
Hosmer, D. W. and Lemeshow, S. (1989). Applied logistic regression. New York: John
Wiley & Sons, Inc.
Hubert, J. J. (1984). Bioassay, 2nd ed. Dubuque, Iowa: Kendall-Hunt.
Kalbfleisch, J. and Prentice, R. (1980). The statistical analysis of failure time data. New
York: John Wiley & Sons, Inc.
Kleinbaum, D., Kupper, L., and Chambliss, L. (1982). Logistic regression analysis of
epidemiologic data: Theory and practice. Communications in Statistics: Theory and
Methods, 11, 485–547.
Luce, D. R. (1959). Individual choice behavior: A theoretical analysis. New York: John
Wiley & Sons, Inc.
Luft, H., Garnick, D., Peltzman, D., Phibbs, C., Lichtenberg, E., and McPhee, S. (1988).
The sensitivity of conditional choice models for hospital care to estimation technique.
Draft, Institute for Health Policy Studies. San Francisco: University of California.
Maddala, G. S. (1983). Limited-dependent and qualitative variables in econometrics.
Cambridge University Press.
Maddala, G. S. (1988). Introduction to econometrics. New York: MacMillan.
McFadden, D. (1982). Qualitative response models. In W. Hildebrand (ed.), Advances in
Econometrics. Cambridge University Press.
McFadden, D. (1973). Conditional logit analysis of qualitative choice behavior. In P.
Zarembka (ed.), Frontiers in Econometrics. New York: Academic Press.
McFadden, D. (1976). Quantal choice analysis: A survey. Annals of Economic and Social
Measurement, 5, 363–390.
McFadden, D. (1979). Quantitative methods for analyzing travel behavior of individuals:
Some recent developments. In D. A. Hensher and P. R. Stopher (eds.), Behavioral
Travel Modelling. London: Croom Helm.
McFadden, D. (1984). Econometric analysis of qualitative response models. In Z. Griliches
and M. D. Intriligator (eds.), Handbook of Econometrics, Volume III. Elsevier Science
Publishers BV.
Manski, C. and Lerman, S. (1977). The estimation of choice probabilities from choice
based samples. Econometrica, 8, 1977–1988.
Manski, C. and McFadden, D. (1980). Alternative estimators and sample designs for
discrete choice analysis. In C. Manski and D. McFadden (eds.), Structural Analysis of
Discrete Data with Econometric Applications. Cambridge, Mass.: MIT Press.
Manski, C. and McFadden, D., eds. (1981). Structural analysis of discrete data with
econometric applications. Cambridge, Mass.: MIT Press.
Nerlove, M. and Press, S. J. (1973). Univariate and multivariate loglinear and logistic
models. Rand Report No R-1306EDA/NIH.
Peduzzi, P. N., Holford, T. R., and Hardy, R. J. (1980). A stepwise variable selection
procedure for nonlinear regression models. Biometrics, 36, 511–516.
Pregibon, D. (1981). Logistic regression diagnostics. Annals of Statistics, 9, 705–724.
Prentice, R. and Breslow, N. (1978). Retrospective studies and failure time models.
Biometrika, 65, 153–158.
Prentice, R. and Pyke, R. (1979). Logistic disease incidence models and case-control
studies. Biometrika, 66, 403–412.
Rao, C. R. (1973). Linear statistical inference and its applications, 2nd ed. New York: John
Wiley & Sons, Inc.
Santer, T. J. and Duffy, D. E. (1989). The statistical analysis of discrete data. New York:
Springer-Verlag.
Steinberg, D. (1991). The common structure of discrete choice and conditional logistic
regression models. Unpublished paper. Department of Economics, San Diego State
University.
Steinberg, D. (1987). Interpretation and diagnostics of the multinomial and binary logistic
regression using PROC MLOGIT. SAS Users Group International, Proceedings of the
Twelfth Annual Conference, 1071–1073, Cary, N.C.: SAS Institute Inc.
Steinberg, D. and Cardell, N. S. (1987). Logistic regression on pooled choice based
samples and samples missing the dependent variable. Proceedings of the Social
Statistics Section. Alexandria, Va.: American Statistical Association, 158–160.
Train, K. (1986). Qualitative choice analysis. Cambridge, Mass.: MIT Press.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica,
50, 1–25.
Williams, D. A. (1986). Interval estimation of the median lethal dose. Biometrics, 42,
641–645.
Wrigley, N. (1985). Categorical data analysis for geographers and environmental
scientists. New York: Longman.

Chapter 18
Loglinear Models
Laszlo Engelman

Loglinear models are useful for analyzing relationships among the factors of a
multiway frequency table. The loglinear procedure computes maximum likelihood
estimates of the parameters of a loglinear model by using the Newton-Raphson
method. For each user-specified model, a test of fit of the model is provided, along
with observed and expected cell frequencies, estimates of the loglinear parameters
(lambdas), standard errors of the estimates, the ratio of each lambda to its standard
error, and multiplicative effects (EXP(λ)).
For each cell, you can request its contribution to the Pearson chi-square or the
likelihood-ratio chi-square. Deviates, standardized deviates, Freeman-Tukey
deviates, and likelihood-ratio deviates are available to characterize departures of the
observed values from expected values.
When searching for the best model, you can request tests after removing each first-order effect or interaction term one at a time, individually or hierarchically (when a
lower-order effect is removed, so are its respective interaction terms). The models do
not need to be hierarchical.
A model can explain the frequencies well in most cells, but poorly in a few.
LOGLIN uses Freeman-Tukey deviates to identify the most divergent cell, fit a model
without it, and continue in a stepwise manner identifying other outlier cells that depart
from your model.
You can specify cells that contain structural zeros (cells that are empty naturally or
by design, not by sampling), and fit a model to the subset of cells that remain. A test
of fit for such a model is often called a test of quasi-independence.


Statistical Background
Researchers fit loglinear models to the cell frequencies of a multiway table in order to
describe relationships among the categorical variables that form the table. A loglinear
model expresses the logarithm of the expected cell frequency as a linear function
of certain parameters in a manner similar to that of analysis of variance.
To introduce loglinear models, recall how to calculate expected values for the
Pearson chi-square statistic. The expected value for a cell in row i and column j is:

$$\frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{total table count}}$$

Let's ignore the denominator, because it's the same for every cell. Write:

$$R_i \times C_j$$

(Part of each expected value comes from the row it's in and part from the column it's
in.) Now take the log:

$$\ln (R_i \times C_j) = \ln R_i + \ln C_j$$

and let:

$$\ln R_i = A_i$$
$$\ln C_j = B_j$$

and write:

$$A_i + B_j$$

This expected value is computed under the null hypothesis of independence (that is,
there is no interaction between the table factors). If this hypothesis is rejected, you
would need more information than $A_i$ and $B_j$. In fact, the usual chi-square test can be
expressed as a test that the interaction term is needed in a model that estimates the log
of the cell frequencies. We write this model as:

$$\ln F_{ij} = \theta + A_i + B_j + AB_{ij}$$

or more commonly as:

$$\ln F_{ij} = \theta + \lambda_i^{A} + \lambda_j^{B} + \lambda_{ij}^{AB}$$

where $\theta$ is an overall mean effect and the parameters $\lambda$ sum to zero over the levels of
the row factors and the column factors. For a particular cell in a three-way table (a cell
in the ith row, jth column, and kth level of the third factor) we write:

$$\ln F_{ijk} = \theta + \lambda_i^{A} + \lambda_j^{B} + \lambda_k^{C} + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{jk}^{BC} + \lambda_{ijk}^{ABC}$$

The order of the effect is the number of indices in the subscript.
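
The additive structure under independence can be verified on a small table. The counts below are hypothetical; the sketch computes the usual expected values and confirms that each log expected value is a row term plus a column term (less the log of the table total, the denominator that was set aside above).

```python
import math

table = [[20, 30, 10],       # hypothetical 2 x 3 table of counts
         [40, 15, 35]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Expected values under independence: (row total * column total) / table total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Additive form on the log scale: A_i + B_j, minus the constant ln(n).
A = [math.log(r) for r in row_totals]
B = [math.log(c) for c in col_totals]
log_expected = [[a + b - math.log(n) for b in B] for a in A]

for i, row in enumerate(expected):
    for j, value in enumerate(row):
        print(i, j, round(value, 3), round(log_expected[i][j], 4))
```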


Notation in publications for loglinear model parameters varies. Grant Blank
summarizes:
SYSTAT

FATHER + SON + FATHER SON

Agresti (1984)

log mij= + iF+ jS+ ijFS

Fienberg (1980)

log mij= + 1(i)+ 2(j)+ 12(ij)

Goodman (1978)

ij= + iA+ jB+ ijAB

Haberman (1978)

log mij= + iA+ jB+ ijAB

Knoke and Burke (1980)

Gij= + iF+ jS+ ijFS

or, in multiplicative form,


Goodman (1971)

Fij= riArjBrijAB where ij= log(Fij), = log , iA= log(riA), etc.

So, a loglinear model expresses the logarithm of the expected cell frequency as a linear
function of certain parameters in a manner similar to that of analysis of variance. An
important distinction between ANOVA and loglinear modeling is that in the latter, the
focus is on the need for interaction terms; while in ANOVA, testing for main effects is
the primary interest. Look back at the loglinear model for the two-way table: the usual
chi-square tests the need for the ABij interaction, not for A alone or B alone.
The loglinear model for a three-way table is saturated because it contains all
possible terms or effects. Various smaller models can be formed by including only
selected combinations of effects (or equivalently testing that certain effects are 0). An
important goal in loglinear modeling is parsimony, that is, to see how few effects are
needed to estimate the cell frequencies. You usually don't want to test that the main
effect of a factor is 0 because this is the same as testing that the total frequencies are
equal for all levels of the factor. For example, a test that the main effect for SURVIVE$
(alive, dead) is 0 simply tests whether the total number of survivors equals the number
of nonsurvivors. If no interaction terms are included and the test is not significant (that
is, the model fits), you can report that the table factors are independent. When there are
more than two second-order effects, the test of an interaction is conditional on the other
interactions and may not have a simple interpretation.

Fitting a Loglinear Model


To fit a loglinear model:
• First, screen for an appropriate model to test.
• Test the model, and if significant, compare its results with those for models with
one or more additional terms. If not significant, compare results with models with fewer
terms.
• For the model you select as best, examine fitted values and residuals, looking for
cells (or layers within the table) with large differences between observed and
expected (fitted) cell counts.
How do you determine which effects or terms to include in your loglinear model?
Ideally, by using your knowledge of the subject matter of your study, you have a
specific model in mind; that is, you want to make statements regarding the
independence of certain table factors. Otherwise, you may want to screen for effects.
The likelihood-ratio chi-square is additive under partitioning for nested models.
Two models are nested if all the effects of the first are a subset of the second. The
likelihood ratio chi-square is additive because the statistic for the second model can be
subtracted from that for the first. The difference provides a test of the additional
effects; that is, the difference in the two statistics has an asymptotic chi-square
distribution with degrees of freedom equal to the difference between those for the two
model chi-squares (or the difference between the number of effects in the two models).
This property does not hold for the Pearson chi-square. The additive property for the
likelihood ratio chi-square is useful for screening effects to include in a model.
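
The additive property can be illustrated with two nested models that have closed-form fitted values. The 2 x 2 x 2 counts are hypothetical, and the two models (A + B + C, and A + B + C + A*B) were chosen only because their expected values can be written down directly; the closed-form p value used here applies to a chi-square with 1 degree of freedom only.

```python
import math

# Hypothetical 2 x 2 x 2 table of counts, indexed by (A, B, C) levels.
counts = {
    (0, 0, 0): 25, (0, 0, 1): 15, (0, 1, 0): 10, (0, 1, 1): 20,
    (1, 0, 0): 12, (1, 0, 1): 18, (1, 1, 0): 30, (1, 1, 1): 22,
}
n = sum(counts.values())

def margin(axes):
    """Sum the counts over every factor not listed in axes."""
    totals = {}
    for cell, count in counts.items():
        key = tuple(cell[a] for a in axes)
        totals[key] = totals.get(key, 0) + count
    return totals

def g2(expected):
    """Likelihood-ratio chi-square: 2 * sum(observed * ln(observed / expected))."""
    return 2.0 * sum(o * math.log(o / expected[c]) for c, o in counts.items())

a, b, c, ab = margin([0]), margin([1]), margin([2]), margin([0, 1])

# First (smaller) model, A + B + C: 8 cells - 4 parameters = 4 df.
fit_small = {(i, j, k): a[(i,)] * b[(j,)] * c[(k,)] / n ** 2 for (i, j, k) in counts}
# Second (larger) model, A + B + C + A*B: 8 cells - 5 parameters = 3 df.
fit_large = {(i, j, k): ab[(i, j)] * c[(k,)] / n for (i, j, k) in counts}

g2_small, g2_large = g2(fit_small), g2(fit_large)
difference = g2_small - g2_large        # tests the added A*B term
df_difference = 4 - 3
p_value = math.erfc(math.sqrt(difference / 2.0))   # upper tail, 1 df only

print(round(g2_small, 4), round(g2_large, 4))
print(round(difference, 4), df_difference, round(p_value, 5))
```

Here the difference equals the likelihood-ratio chi-square for independence in the A by B marginal table, which is the partition the text describes.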
If you are doing exploratory research and lack firm knowledge about which effects
to include, some statisticians suggest a strategy of starting with a large model and, step
by step, identifying effects to delete. (You compare each smaller model nested within
the larger one as described above.) But we caution you about multiple testing. If you
test many models in a search for your ideal model, remember that the p value
associated with a specific test is valid when you execute one and only one test. That is,
use p values as relative measures when you test several models.


Loglinear Models in SYSTAT


Loglinear Model Main Dialog Box
To open the Loglinear Model dialog box, from the menus choose:
Statistics
Tables
Loglinear Model
Estimate Model...

The following must be specified:


Model Terms. Build the model components (main effects and interactions) by adding
terms to the Model Terms text box. All variables should be categorical (either
numerical or character). Click Cross to add interactions.
Define Table. The variables that define the frequency table. Variables that are used in
the model terms must be included in the frequency table.

The following optional computational controls can also be specified:


• Convergence. The parameter convergence criterion.
• L Convergence. The log-likelihood convergence criterion.
• Tolerance. The tolerance limit.
• Iterations. The maximum number of iterations.
• Halvings. The maximum number of step halvings.
• Delta. The constant value added to the observed frequency in each cell.

You can save two sets of statistics to a file:


• Estimates. Saves, for each cell in the table, the observed and expected frequencies
and their differences, standardized and Freeman-Tukey deviates, the contribution
to the Pearson and likelihood-ratio chi-square statistics, the contribution to the log-likelihood, and the cell indices.
• Lambdas. Saves, for each level of each term in the model, the estimate of lambda,
the standard error of lambda, the ratio of lambda to its standard error, the
multiplicative effect (EXP(λ)), and the indices of the table of factors.

Loglinear Model Statistics


Loglinear models offer statistics for hypothesis testing, parameter estimation, and
individual cell examination.

The following statistics are available:


• Chi-square. Displays Pearson and likelihood-ratio chi-square statistics for lack of
fit.
• Ratio. Displays lambda divided by standard error of lambda. For large samples, this
ratio can be interpreted as a standard normal deviate (z score).
• Maximized likelihood value. The log of the model's maximum likelihood value.
• Multiplicative effects. Multiplicative parameters, EXP(lambda). Large values
indicate an increased probability for that combination of indices.
• Term. One at a time, LOGLIN removes each first-order effect and each interaction
term from the model. For each smaller model, LOGLIN provides a likelihood-ratio
chi-square for testing the fit of the model and the difference in the chi-square
statistics between the smaller model and the full model.
• Hterm. Tests each term by removing it and its higher order interactions from the
model. These tests are similar to those in Term except that only hierarchical models
are tested: if a lower-order effect is removed, so are the higher-order effects that
include it.
To examine the parameters, you can request the coefficients of the design variables,
the covariance matrix of the parameters, the correlation matrix of the parameters, and
the additive effect of each level for each term (lambda).
In addition, for each cell you can choose to display the observed frequency, the
expected frequency, the standardized deviate, the standard error of lambda, the
observed minus the expected frequency, the likelihood-ratio deviate, the
Freeman-Tukey deviate, the contribution to Pearson chi-square, and the contribution
to the model's log-likelihood.
Finally, you can select the number of cells to identify as outlandish. The first cell
has the largest Freeman-Tukey deviate (these deviates are similar to z scores when the
data are from a Poisson distribution). It is treated as a structural zero, the model is fit
to the remaining cells, and the cell with the largest Freeman-Tukey deviate is
identified. This process continues step by step, each time including one more cell as a
structural zero and refitting the model.
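
The first step of this search, ranking cells by Freeman-Tukey deviate, can be sketched as follows. The deviate is written in its usual textbook form, sqrt(observed) + sqrt(observed + 1) - sqrt(4 * expected + 1), which is an assumption about what LOGLIN computes. The three (observed, expected) pairs are taken from the Example 1 tables later in this chapter purely as sample inputs, so the values need not match LOGLIN's own report (that example also adds DELTA to each cell). Refitting with the flagged cell as a structural zero is not shown.

```python
import math

def freeman_tukey(observed, expected):
    """Freeman-Tukey deviate in its usual textbook form (assumed here)."""
    return math.sqrt(observed) + math.sqrt(observed + 1.0) - math.sqrt(4.0 * expected + 1.0)

# (observed, expected) pairs used as sample inputs only.
cells = {
    "Tokyo / Under 50 / Dead / MinBengn": (7.0, 15.928),
    "Tokyo / Under 50 / Alive / MinBengn": (68.0, 56.953),
    "Glamorgn / Under 50 / Dead / MinMalig": (16.0, 9.303),
}

deviates = {name: freeman_tukey(obs, exp) for name, (obs, exp) in cells.items()}

# The most divergent cell has the largest deviate in absolute value.
most_outlandish = max(deviates, key=lambda name: abs(deviates[name]))

for name, value in deviates.items():
    print(name, round(value, 3))
print("most divergent:", most_outlandish)
```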

Structural Zeros
A cell is declared to be a structural zero when the probability is zero that there are
counts in the cell. Notice that such zero frequencies do not arise because of small
samples but because the cells are empty naturally (a male hysterectomy patient) or by
design (the diagonal of a two-way table comparing fathers' (rows) and sons' (columns)
occupations is not of interest when studying changes or mobility). A model can then
be fit to the subset of cells that remain. A test of fit for such a model is often called a
test of quasi-independence.
To specify structural zeros, click Zero in the Loglinear Model dialog box.

The following can be specified:


No structural zeros. No cells are treated as structural zeros.
Make all empty cells structural zeros. Treats all empty cells with zero frequency as
structural zeros.
Define custom structural zeros. Specifies one or more cells for treatment as structural
zeros. List the index (n1, n2, ...) of each factor in the order in which the factor appears
in the table. If you want to select a layer or level of a factor, use 0s for the other factors
when specifying the indices. For example, in a table with four factors (TUMOR$ being
the fourth factor), to declare the third level of TUMOR$ as structural zeros, use 0 0 0 3.
Alternatively, you can replace the 0s with blanks or periods (. . . 3).

When fitting a model, LOGLIN excludes cells identified as structural zeros, and then,
as in a regression analysis with zero weight cases, it can compute expected values,
deviates, and so on, for all cells including the structural zero cells.
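
The index convention for custom structural zeros (a 0 standing for every level of that factor) can be mimicked with a small helper. The function name is hypothetical, and the level counts assume a four-factor table with 3, 3, 2, and 4 levels, as in the CANCER example later in this chapter.

```python
from itertools import product

def expand_zero_spec(spec, levels):
    """List the cells selected by an index specification.

    spec   -- one index per factor, in table order; 0 means "all levels"
    levels -- number of levels of each factor, in the same order
    """
    ranges = [range(1, n + 1) if s == 0 else [s] for s, n in zip(spec, levels)]
    return list(product(*ranges))

levels = (3, 3, 2, 4)                      # assumed: CENTER$, AGE, SURVIVE$, TUMOR$

# "0 0 0 3": the whole third level of the fourth factor.
layer = expand_zero_spec((0, 0, 0, 3), levels)
# A fully specified index selects a single cell.
single = expand_zero_spec((1, 2, 1, 4), levels)

print(len(layer), layer[:3])
print(single)
```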
You might consider identifying cells as structural zeros when:
• It is meaningful to the study at hand to exclude some cells, for example, the
diagonal of a two-way table crossing the occupations of fathers and sons.
• You want to determine whether an interaction term is necessary only because there
are one or two aberrant cells. That is, after you select the best model, fit a second
model with fewer effects and identify the outlier cells (the most outlandish cells)
for the smaller model. Then refit the best model declaring the outlier cells to be
structural zeros. If the additional interactions are no longer necessary, you might
report the smaller model, adding a sentence describing how the unusual cell(s)
depart from the model.

Frequency Tables (Tabulate)


If you want only a frequency table and no analysis, from the menus choose:
Statistics
Tables
Loglinear Model
Tabulate...

Simply specify the table factors in the same order in which you want to view them from
left to right. In other words, the last variable selected defines the columns of the table
and cross-classifications of all preceding variables define the rows.
Although you can also form multiway tables using Crosstabs, tables for loglinear
models are more compact and easy to read. Crosstabs forms a series of two-way tables
stratified by all combinations of the other table factors. Loglinear models create one
table, with the rows defined by factor combinations. However, loglinear model tables
do not display marginal totals, whereas Crosstabs tables do.


Using Commands
First, specify your data with USE filename. Continue with:
LOGLIN
  FREQ var
  TABULATE var1*var2*...
  MODEL variables defining table = terms of model
  ZERO CELL n1, n2, ...
  SAVE filename / ESTIMATES or LAMBDAS
  PRINT SHORT or MEDIUM or LONG or NONE ,
        / OBSFREQ CHISQ RATIO MLE EXPECT STAND ELAMBDA ,
          TERM HTERM COVA CORR LAMBDA SELAMBDA DEVIATES ,
          LRDEV FTDEV PEARSON LOGLIKE CELLS=n
  ESTIMATE / DELTA=n LCONV=n CONV=n TOL=n ITER=n HALF=n

Usage Considerations
Types of data. LOGLIN uses a cases-by-variables rectangular file or data recorded as
frequencies with cell indices.
Print options. You can control what report panels appear in the output by globally
setting output length to SHORT, MEDIUM, or LONG. You can also use the PRINT
command in LOGLIN to request reports individually. You can specify individual panels
by specifying the particular option.
Short output panels include the observed frequency for each cell, the Pearson and
likelihood-ratio chi-square statistics, lambdas divided by their standard errors, the log
of the model's maximized likelihood value, and a report of the three most outlandish
cells.
Medium results include all of the above, plus the following: the expected frequency
for each cell (current model), standardized deviations, multiplicative effects, a test of
each term by removing it from the model, a test of each term by removing it and its
higher-order interactions from the model, and the five most outlandish cells.
Long results add the following: coefficients of design variables, the covariance
matrix of the parameters, the correlation matrix of the parameters, the additive effect
of each level for each term, the standard errors of the lambdas, the observed minus the
expected frequency for each cell, the contribution to the Pearson chi-square from each
cell, the likelihood-ratio deviate for each cell, the Freeman-Tukey deviate for each cell,
the contribution to the model's log-likelihood from each cell, and the 10 most
outlandish cells.
As a PRINT option, you can also specify CELLS=n, where n is the number of
outlandish cells to identify.


Quick Graphs. LOGLIN produces no Quick Graphs.


Saving files. For each level of a term included in your model, you can save the estimate
of lambda, the standard error of lambda, the ratio of lambda to its standard error, the
multiplicative effect, and the marginal indices of the effect. Alternatively, for each cell,
you can save the observed and expected frequencies, its deviates (listed above), the
Pearson and likelihood-ratio chi-square, the contributions to the log-likelihood, and the
cell indices.
BY groups. LOGLIN analyzes each level of any BY variables separately.
Bootstrapping. Bootstrapping is available in this procedure.
Case frequencies. LOGLIN uses the FREQ variable, if present, to duplicate cases.
Case weights. WEIGHT variables have no effect in LOGLIN.

Examples
Example 1
Loglinear Modeling of a Four-Way Table
In this example, you use the Morrison breast cancer data stored in the CANCER data
file (Bishop, et al. (1975)) and treat the data as a four-way frequency table:
CENTER$     Center or city where the data were collected
SURVIVE$    Survival: dead or alive
AGE         Age groups of under 50, 50 to 69, and 70 or over
TUMOR$      Tumor diagnosis (called INFLAPP by some researchers) with levels:
              Minimal inflammation and benign
              Greater inflammation and benign
              Minimal inflammation and malignant
              Greater inflammation and malignant

The CANCER data include one record for each of the 72 cells formed by the four table
factors. Each record includes a variable, NUMBER, that has the number of women in
the cell plus numeric or character value codes to identify the levels of the four factors
that define the cell.


For the first model of the CANCER data, you include three two-way interactions.
The input is:
USE cancer
LOGLIN
   FREQ = number
   LABEL age / 50='Under 50', 60='50 to 69', 70='70 & Over'
   ORDER center$ survive$ tumor$ / SORT=NONE
   MODEL center$*age*survive$*tumor$ = center$ + age,
                                     + survive$ + tumor$,
                                     + age*center$,
                                     + survive$*center$,
                                     + tumor$*center$
   PRINT SHORT / EXPECT LAMBDAS
   ESTIMATE / DELTA=0.5

The MODEL statement has two parts: table factors and terms (effects to fit). Table
factors appear to the left of the equals sign and terms are on the right. The layout of the
table is determined by the order in which the variables are specified; for example,
specify TUMOR$ last so its levels determine the columns.
The LABEL statement assigns category names to the numeric codes for AGE. If the
statement is omitted, the data values label the categories. By default, SYSTAT orders
string variables alphabetically, so we specify SORT = NONE to list the categories for
the other factors as they first appear in the data file.
We specify DELTA = 0.5 to add 0.5 to each cell frequency. This option is common
in multiway table procedures as an aid when some cell sizes are sparse. It is of little
use in practice and is used here only to make the results compare with those reported
elsewhere.
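
As a rough check on the size of this model, its degrees of freedom can be counted by hand. The sketch assumes the usual sum-to-zero parameterization, in which each term contributes the product of (levels - 1) over its factors, and no structural zeros; the variable names are illustrative.

```python
from math import prod

# Number of levels of each table factor in the CANCER data.
levels = {"center$": 3, "age": 3, "survive$": 2, "tumor$": 4}

# Terms on the right-hand side of the MODEL statement.
terms = [
    ("center$",), ("age",), ("survive$",), ("tumor$",),
    ("age", "center$"), ("survive$", "center$"), ("tumor$", "center$"),
]

def free_parameters(term):
    """Independent parameters contributed by one term."""
    return prod(levels[factor] - 1 for factor in term)

n_cells = prod(levels.values())
n_parameters = 1 + sum(free_parameters(term) for term in terms)   # 1 for the overall mean
degrees_of_freedom = n_cells - n_parameters

print(n_cells, n_parameters, degrees_of_freedom)
```

The count agrees with the number of cells and the df reported with the chi-square statistics in the output.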


The output is:


Case frequencies determined by value of variable NUMBER.
Number of cells (product of levels):       72
Total count:                              764

Observed Frequencies
====================
CENTER$   AGE       SURVIVE$ | TUMOR$
                             |  MinMalig  MinBengn  MaxMalig  MaxBengn
---------+---------+---------+-------------------------------------------------
Tokyo     Under 50  Dead     |     9.000     7.000     4.000     3.000
                    Alive    |    26.000    68.000    25.000     9.000
                             +
          50 to 69  Dead     |     9.000     9.000    11.000     2.000
                    Alive    |    20.000    46.000    18.000     5.000
                             +
          70 & Over Dead     |     2.000     3.000     1.000     0.0
                    Alive    |     1.000     6.000     5.000     1.000
---------+---------+---------+-------------------------------------------------
Boston    Under 50  Dead     |     6.000     7.000     6.000     0.0
                    Alive    |    11.000    24.000     4.000     0.0
                             +
          50 to 69  Dead     |     8.000    20.000     3.000     2.000
                    Alive    |    18.000    58.000    10.000     3.000
                             +
          70 & Over Dead     |     9.000    18.000     3.000     0.0
                    Alive    |    15.000    26.000     1.000     1.000
---------+---------+---------+-------------------------------------------------
Glamorgn  Under 50  Dead     |    16.000     7.000     3.000     0.0
                    Alive    |    16.000    20.000     8.000     1.000
                             +
          50 to 69  Dead     |    14.000    12.000     3.000     0.0
                    Alive    |    27.000    39.000    10.000     4.000
                             +
          70 & Over Dead     |     3.000     7.000     3.000     0.0
                    Alive    |    12.000    11.000     4.000     1.000
-----------------------------+-------------------------------------------------

Pearson ChiSquare      57.5272    df    51    Probability    0.24635
LR ChiSquare           55.8327    df    51    Probability    0.29814
Raftery's BIC        -282.7342
Dissimilarity           9.9530

Expected Values
===============
CENTER$   AGE       SURVIVE$ | TUMOR$
                             |  MinMalig  MinBengn  MaxMalig  MaxBengn
---------+---------+---------+----------------------------------------
Tokyo     Under 50  Dead     |     7.852    15.928     7.515     2.580
                    Alive    |    28.076    56.953    26.872     9.225
                             +
          50 to 69  Dead     |     6.281    12.742     6.012     2.064
                    Alive    |    22.460    45.563    21.498     7.380
                             +
          70 & Over Dead     |     1.165     2.363     1.115     0.383
                    Alive    |     4.166     8.451     3.988     1.369
---------+---------+---------+----------------------------------------
Boston    Under 50  Dead     |     5.439    12.120     2.331     0.699
                    Alive    |    10.939    24.378     4.688     1.406
                             +
          50 to 69  Dead     |    11.052    24.631     4.737     1.421
                    Alive    |    22.231    49.542     9.527     2.858
                             +
          70 & Over Dead     |     6.754    15.052     2.895     0.868
                    Alive    |    13.585    30.276     5.822     1.747
---------+---------+---------+----------------------------------------
Glamorgn  Under 50  Dead     |     9.303    10.121     3.476     0.920
                    Alive    |    19.989    21.746     7.468     1.977
                             +
          50 to 69  Dead     |    14.017    15.249     5.237     1.386
                    Alive    |    30.117    32.764    11.252     2.979
                             +
          70 & Over Dead     |     5.582     6.073     2.086     0.552
                    Alive    |    11.993    13.048     4.481     1.186
-----------------------------+----------------------------------------

Log-Linear Effects (Lambda)
===========================
THETA
-------------
     1.826
-------------

CENTER$
     Tokyo     Boston   Glamorgn
-------------------------------------
     0.049      0.001     -0.050
-------------------------------------

AGE
  Under 50   50 to 69  70 & Over
-------------------------------------
     0.145      0.444     -0.589
-------------------------------------

SURVIVE$
      Dead      Alive
-------------------------
    -0.456      0.456
-------------------------

TUMOR$
  MinMalig   MinBengn   MaxMalig   MaxBengn
-------------------------------------------------
     0.480      1.011     -0.145     -1.346
-------------------------------------------------

CENTER$  | AGE
         |  Under 50   50 to 69  70 & Over
---------+------------------------------------
Tokyo    |     0.565      0.043     -0.609
Boston   |    -0.454     -0.043      0.497
Glamorgn |    -0.111     -0.000      0.112
---------+------------------------------------

CENTER$  | SURVIVE$
         |      Dead      Alive
---------+------------------------
Tokyo    |    -0.181      0.181
Boston   |     0.107     -0.107
Glamorgn |     0.074     -0.074
---------+------------------------

CENTER$  | TUMOR$
         |  MinMalig   MinBengn   MaxMalig   MaxBengn
---------+------------------------------------------------
Tokyo    |    -0.368     -0.191      0.214      0.345
Boston   |     0.044      0.315     -0.178     -0.181
Glamorgn |     0.323     -0.123     -0.036     -0.164
---------+------------------------------------------------

Lambda / SE(Lambda)
===================
THETA
-------------
     1.826
-------------

CENTER$
     Tokyo     Boston   Glamorgn
-------------------------------------
     0.596      0.014     -0.586
-------------------------------------

AGE
  Under 50   50 to 69  70 & Over
-------------------------------------
     2.627      8.633     -8.649
-------------------------------------

SURVIVE$
      Dead      Alive
-------------------------
   -11.548     11.548
-------------------------

TUMOR$
  MinMalig   MinBengn   MaxMalig   MaxBengn
-------------------------------------------------
     6.775     15.730     -1.718    -10.150
-------------------------------------------------

CENTER$  | AGE
         |  Under 50   50 to 69  70 & Over
---------+------------------------------------
Tokyo    |     7.348      0.576     -5.648
Boston   |    -5.755     -0.618      5.757
Glamorgn |    -1.418     -0.003      1.194
---------+------------------------------------

CENTER$  | SURVIVE$
         |      Dead      Alive
---------+------------------------
Tokyo    |    -3.207      3.207
Boston   |     1.959     -1.959
Glamorgn |     1.304     -1.304
---------+------------------------

CENTER$  | TUMOR$
         |  MinMalig   MinBengn   MaxMalig   MaxBengn
---------+------------------------------------------------
Tokyo    |    -3.862     -2.292      2.012      2.121
Boston   |     0.425      3.385     -1.400     -0.910
Glamorgn |     3.199     -1.287     -0.289     -0.827
---------+------------------------------------------------

Model ln(MLE): -160.563

The 3 most outlandish cells (based on FTD, stepwise):
======================================================
                                           CENTER$
                                           | AGE
                                           | | SURVIVE$
                                           | | | TUMOR$
  ln(MLE)  LR_ChiSq  p-value  Frequency    | | | |
---------  --------  -------  ---------    - - - -
 -154.685    11.755    0.001          7    1 1 1 2
 -150.685     8.001    0.005          1    2 3 2 3
 -145.024    11.321    0.001         16    3 1 1 1

Initially, SYSTAT produces a frequency table for the data. We entered cases for 72
cells. The total frequency count across these cells is 764; that is, there are 764 women
in the sample. Notice that the order of the factors is the same order we specified in the
MODEL statement. The last variable (TUMOR$) defines the columns; the remaining
variables define the rows.
The test of fit is not significant for either the Pearson chi-square or the likelihood-ratio
test, indicating that your model with its three two-way interactions does not
disagree with the observed frequencies. The MODEL statement describes an association
between study center and age, survival, and tumor status. However, at each center, the
other three factors are independent. Because the overall goal is parsimony, we could
explore whether any of the interactions can be dropped.
Raftery's BIC (Bayesian Information Criterion) adjusts the chi-square for both the
complexity of the model (measured by degrees of freedom) and the size of the sample.
It is the likelihood-ratio chi-square minus the product of the degrees of freedom for the
current model and the natural log of the sample size. If BIC is negative, you can
conclude that the model is preferable to the saturated model. When comparing
alternative models, select the model with the lowest BIC value.
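
The BIC definition can be checked by hand against the fit statistics reported for this model. The following Python sketch is not SYSTAT code; the variable names are chosen for illustration, and the input values are copied from the output for the first CANCER model.

```python
import math

# Values copied from the SYSTAT output for the first CANCER model.
lr_chi_square = 55.8327   # likelihood-ratio chi-square
df = 51                   # degrees of freedom of the model
n = 764                   # total count (sample size)

# BIC = LR chi-square - df * ln(N), as described in the text.
bic = lr_chi_square - df * math.log(n)

print(f"BIC = {bic:.4f}")
```

The result agrees with the value of -282.7342 printed by SYSTAT to the precision shown in the output.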
The index of dissimilarity can be interpreted as the percentage of cases that need to
be relocated in order to make the observed and expected counts equal. For these data,
you would have to move about 9.95% of the cases to make the expected frequencies fit.

The expected frequencies are obtained by fitting the loglinear model to the observed
frequencies. Compare these values with the observed frequencies. Values for
corresponding cells will be similar if the model fits well.
After the expected values, SYSTAT lists the parameter estimates for the model you
requested. Usually, it is of more interest to examine these estimates divided by their
standard errors. Here, however, we display them in order to relate them to the expected
values. For example, the observed frequency for the cell in the upper left corner (Tokyo,
Under 50, Dead, MinMalig) is 9. To find the expected frequency under your model, you
add the estimates (from each panel, select the term that corresponds to your cell):
theta       1.826
CENTER$     0.049
AGE         0.145
SURVIVE$   -0.456
TUMOR$      0.480
C*A         0.565
C*S        -0.181
C*T        -0.368

and then use SYSTAT's calculator to sum the estimates:

CALC 1.826 + 0.049 + 0.145 - 0.456 + 0.480 + 0.565 - 0.181 - 0.368

and SYSTAT responds 2.06. Take the antilog of this value:

CALC EXP(2.06)

and SYSTAT responds 7.846. In the panel of expected values, this number is printed
as 7.852 (in its calculations, SYSTAT uses more digits following the decimal point).
Thus, for this cell, the sample includes 9 women (observed frequency) and the model
predicts 7.85 women (expected frequency).
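
The same arithmetic can be reproduced outside SYSTAT. The Python sketch below is only an illustration (the dictionary keys are descriptive labels, not SYSTAT names); the estimates are copied from the Log-Linear Effects (Lambda) panels for the cell Tokyo, Under 50, Dead, MinMalig.

```python
import math

# Parameter estimates for the cell (Tokyo, Under 50, Dead, MinMalig),
# copied from the Log-Linear Effects (Lambda) panels.
lambdas = {
    "theta": 1.826,
    "CENTER$ = Tokyo": 0.049,
    "AGE = Under 50": 0.145,
    "SURVIVE$ = Dead": -0.456,
    "TUMOR$ = MinMalig": 0.480,
    "CENTER$ * AGE": 0.565,
    "CENTER$ * SURVIVE$": -0.181,
    "CENTER$ * TUMOR$": -0.368,
}

log_expected = sum(lambdas.values())   # log of the expected frequency
expected = math.exp(log_expected)      # antilog gives the expected frequency

print(f"sum of estimates = {log_expected:.3f}")
print(f"expected frequency = {expected:.3f}")
```

Because the estimates are rounded to three decimals, the result is close to, but not identical with, the 7.852 printed in the Expected Values panel.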
The ratio of the parameter estimates to their asymptotic standard errors is part of the
default output. Examine these values to better understand the relationships among the
table factors. Because, for large samples, this ratio can be interpreted as a standard
normal deviate (z score), you can use it to indicate significant parameters (for
example, for an interaction term, significant positive or negative associations). In the
CENTER$ by AGE panel, the ratio for young women from Tokyo is very large (7.348),
implying a significant positive association, and that for older Tokyo women is
extremely negative (-5.648). The reverse is true for the women from Boston. If you use
the Column Percent option in XTAB to print column percentages for CENTER$ by
AGE, you will see that among the women under 50, more than 50% are from Tokyo
(52.1), while only 23% are from Boston. In the 70 and over age group, 14% are from
Tokyo and 55% are from Boston.

The Alive estimate for Tokyo shows a strong positive association (3.207) with
survival in Tokyo. The relationship in Boston is negative (-1.959). In this study, the
overall survival rate is 72.5%. In Tokyo, 79.3% of the women survived, while in
Boston, 67.6% survived. There is a negative association for having a malignant tumor
with minimal inflammation in Tokyo (-3.862). The same relationship is strongly
positive in Glamorgan (3.199).
Cells that depart from the current model are identified as outlandish in a stepwise
manner. The first cell has the largest Freeman-Tukey deviate (these deviates are
similar to z scores when the data are from a Poisson distribution). It is treated as a
structural zero, the model is fit to the remaining cells, and the cell with the largest
Freeman-Tukey deviate is identified. This process continues step by step, each time
including one more cell as a structural zero and refitting the model.
For the current model, the observed count in the cell corresponding to the youngest
nonsurvivors from Tokyo with benign tumors and minimal inflammation (Tokyo,
Under 50, Dead, MinBengn) differs the most from its expected value. There are 7
women in the cell and the expected value is 15.9 women. The next most unusual cell
is 2,3,2,3 (Boston, 70 & Over, Alive, MaxMalig), and so on.

Medium Output
We continue the previous analysis, repeating the same model, but changing the PRINT
(output length) setting to request medium-length results:
USE cancer
LOGLIN
FREQ = number
LABEL age / 50='Under 50', 60='50 to 69', 70='70 & Over'
ORDER center$ survive$ tumor$ / SORT=NONE
MODEL center$*age*survive$*tumor$ = age # center$,
+ survive$ # center$,
+ tumor$ # center$
PRINT MEDIUM
ESTIMATE / DELTA=0.5

Notice that we use shortcut notation to specify the model. The output includes:
Standardized Deviates = (Obs-Exp)/sqrt(Exp)
===========================================
CENTER$   AGE       SURVIVE$ | TUMOR$
                             |  MinMalig  MinBengn  MaxMalig  MaxBengn
---------+---------+---------+----------------------------------------
Tokyo     Under 50  Dead     |     0.410    -2.237    -1.282     0.262
                    Alive    |    -0.392     1.464    -0.361    -0.074
                             +
          50 to 69  Dead     |     1.085    -1.048     2.034    -0.044
                    Alive    |    -0.519     0.065    -0.754    -0.876
                             +
          70 & Over Dead     |     0.774     0.414    -0.109    -0.619
                    Alive    |    -1.551    -0.843     0.507    -0.315
---------+---------+---------+----------------------------------------
Boston    Under 50  Dead     |     0.241    -1.471     2.403    -0.836
                    Alive    |     0.018    -0.077    -0.318    -1.186
                             +
          50 to 69  Dead     |    -0.918    -0.933    -0.798     0.486
                    Alive    |    -0.897     1.202     0.153     0.084
                             +
          70 & Over Dead     |     0.864     0.760     0.062    -0.932
                    Alive    |     0.384    -0.777    -1.999    -0.565
---------+---------+---------+----------------------------------------
Glamorgn  Under 50  Dead     |     2.196    -0.981    -0.255    -0.959
                    Alive    |    -0.892    -0.374     0.195    -0.695
                             +
          50 to 69  Dead     |    -0.004    -0.832    -0.977    -1.177
                    Alive    |    -0.568     1.089    -0.373     0.592
                             +
          70 & Over Dead     |    -1.093     0.376     0.633    -0.743
                    Alive    |     0.002    -0.567    -0.227    -0.171
-----------------------------+----------------------------------------

Multiplicative Effects = exp(Lambda)
====================================
THETA
-------------
     6.209
-------------

AGE
  Under 50   50 to 69  70 & Over
-------------------------------------
     1.156      1.559      0.555
-------------------------------------

CENTER$
     Tokyo     Boston   Glamorgn
-------------------------------------
     1.050      1.001      0.951
-------------------------------------

SURVIVE$
      Dead      Alive
-------------------------
     0.634      1.578
-------------------------

TUMOR$
  MinMalig   MinBengn   MaxMalig   MaxBengn
-------------------------------------------------
     1.616      2.748      0.865      0.260
-------------------------------------------------

CENTER$  | AGE
         |  Under 50   50 to 69  70 & Over
---------+------------------------------------
Tokyo    |     1.760      1.044      0.544
Boston   |     0.635      0.958      1.644
Glamorgn |     0.895      1.000      1.118
---------+------------------------------------

CENTER$  | SURVIVE$
         |      Dead      Alive
---------+------------------------
Tokyo    |     0.835      1.198
Boston   |     1.113      0.899
Glamorgn |     1.077      0.929
---------+------------------------

CENTER$  | TUMOR$
         |  MinMalig   MinBengn   MaxMalig   MaxBengn
---------+------------------------------------------------
Tokyo    |     0.692      0.826      1.238      1.412
Boston   |     1.045      1.370      0.837      0.834
Glamorgn |     1.382      0.884      0.965      0.849
---------+------------------------------------------------

Model ln(MLE): -160.563
                       The model without the term     Removal of term from model
Term tested               ln(MLE)   Chi-Sq   df  p-value     Chi-Sq   df  p-value
--------------          ---------  -------  ---  -------    -------  ---  -------
AGE                      -216.120   166.95   53   0.0000     111.11    2   0.0000
CENTER$                  -160.799    56.31   53   0.3523       0.47    2   0.7894
SURVIVE$                 -234.265   203.24   52   0.0000     147.41    1   0.0000
TUMOR$                   -344.471   423.65   54   0.0000     367.82    3   0.0000
CENTER$ * AGE            -196.672   128.05   55   0.0000      72.22    4   0.0000
CENTER$ * SURVIVE$       -166.007    66.72   53   0.0975      10.89    2   0.0043
CENTER$ * TUMOR$         -178.267    91.24   57   0.0027      35.41    6   0.0000

Term tested            The model without the term     Removal of term from model
hierarchically            ln(MLE)   Chi-Sq   df  p-value     Chi-Sq   df  p-value
--------------          ---------  -------  ---  -------    -------  ---  -------
AGE                      -246.779   228.26   57   0.0000     172.43    6   0.0000
CENTER$                  -224.289   183.29   65   0.0000     127.45   14   0.0000
SURVIVE$                 -242.434   219.57   54   0.0000     163.74    3   0.0000
TUMOR$                   -363.341   461.39   60   0.0000     405.56    9   0.0000

The 5 most outlandish cells (based on FTD, stepwise):
======================================================
                                           CENTER$
                                           | AGE
                                           | | SURVIVE$
                                           | | | TUMOR$
  ln(MLE)  LR_ChiSq  p-value  Frequency    | | | |
---------  --------  -------  ---------    - - - -
 -154.685    11.755    0.001          7    1 1 1 2
 -150.685     8.001    0.005          1    2 3 2 3
 -145.024    11.321    0.001         16    3 1 1 1
 -140.740     8.569    0.003          6    2 1 1 3
 -136.662     8.157    0.004         11    1 2 1 3

The goodness-of-fit tests provide an overall indication of how close the expected
values are to the cell counts. Just as you study residuals for each case in multiple
regression, you can use deviates to compare the observed and expected values for each
cell. A standardized deviate is the square root of each cell's contribution to the Pearson
chi-square statistic; that is, (the observed frequency minus the expected frequency)
divided by the square root of the expected frequency. These values are similar to z
scores. For the second cell in the first row, the expected value under your model is
considerably larger than the observed count (its deviate is -2.237, the observed count
is 7, and the expected count is 15.9). Previously, this cell was identified as the most
outlandish cell using Freeman-Tukey deviates.
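
As a check on the definition, the standardized deviate for this cell can be recomputed from the printed observed and expected values. This Python sketch is an illustration only, not part of SYSTAT; the variable names are arbitrary.

```python
import math

# Cell (Tokyo, Under 50, Dead, MinBengn): values copied from the
# Observed Frequencies and Expected Values panels.
observed = 7.0
expected = 15.928

# Standardized deviate = (Obs - Exp) / sqrt(Exp)
deviate = (observed - expected) / math.sqrt(expected)

# Squaring the deviate gives this cell's contribution to the Pearson chi-square.
contribution = deviate ** 2

print(f"standardized deviate = {deviate:.3f}")
print(f"contribution to Pearson chi-square = {contribution:.3f}")
```

The deviate agrees with the -2.237 shown in the Standardized Deviates panel.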
Note that LOGLIN produces five types of deviates or residuals: standardized, the
observed minus the expected frequency, the likelihood-ratio deviate, the Freeman-Tukey
deviate, and the Pearson deviate.
Estimates of the multiplicative parameters equal exp(lambda). Look for values that
depart markedly from 1.0. Very large values indicate an increased probability for that
combination of indices and, conversely, a value considerably less than 1.0 indicates an
unlikely combination. A test of the hypothesis that a multiplicative parameter equals
1.0 is the same as that for lambda equal to 0; so use the values of (lambda)/S.E. to test
the values in this panel. For the CENTER$ by AGE interaction, the most likely
combination is women under 50 from Tokyo (1.76); the least likely combination is
women 70 and over from Tokyo (0.544).
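
The relationship between the two panels can be illustrated by exponentiating the lambda estimates for the Tokyo row of the CENTER$ by AGE interaction. The Python sketch below is not SYSTAT code, and the names are chosen for illustration. Because it starts from lambdas rounded to three decimals, the last digit may differ slightly from the Multiplicative Effects panel.

```python
import math

# Lambda estimates from the CENTER$ by AGE panel (Tokyo row).
lambda_tokyo = {"Under 50": 0.565, "50 to 69": 0.043, "70 & Over": -0.609}

# Multiplicative effect = exp(lambda)
multiplicative = {age: math.exp(lam) for age, lam in lambda_tokyo.items()}

for age, effect in multiplicative.items():
    print(f"{age:10s} {effect:.3f}")
```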
After listing the multiplicative effects, SYSTAT tests reduced models by removing
each first-order effect and each interaction from the model one at a time. For each
smaller model, LOGLIN provides:
- A likelihood-ratio chi-square for testing the fit of the model
- The difference in the chi-square statistics between the smaller model and the full
  model
The likelihood-ratio chi-square for the full model is 55.833. For a model that omits
AGE, the likelihood-ratio chi-square is 166.95. This smaller model does not fit the
observed frequencies (p value < 0.00005). To determine whether the removal of this
term results in a significant decrease in the fit, look at the difference in the statistics:
166.95 - 55.833 = 111.117, p value < 0.00005. The fit worsens significantly when AGE
is removed from the model.
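
The difference test for AGE can be reproduced with a few lines of arithmetic. This Python sketch is an illustration rather than SYSTAT's own computation; it relies on the fact that, for 2 degrees of freedom, the upper-tail probability of the chi-square distribution has the closed form exp(-x/2).

```python
import math

lr_full = 55.833          # LR chi-square, full model (df = 51)
lr_without_age = 166.95   # LR chi-square, model without AGE (df = 53)

change = lr_without_age - lr_full   # change in the LR chi-square
df_change = 53 - 51                 # change in degrees of freedom

# Closed form for the chi-square upper tail, valid for df = 2 only.
assert df_change == 2
p_value = math.exp(-change / 2)

print(f"change = {change:.3f} on {df_change} df")
print(f"p value below 0.00005: {p_value < 0.00005}")
```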
From the second line in this panel, it appears that a model without the first-order
term for CENTER$ fits (p value = 0.3523). However, removing any of the two-way
interactions involving CENTER$ significantly decreases the model fit.
The hierarchical tests are similar to the preceding tests except that only hierarchical
models are tested; if a lower-order effect is removed, so are the higher-order effects
that include it. For example, in the second line, when CENTER$ is removed, the three
interactions with CENTER$ are also removed. The reduction in the fit is significant
(p < 0.00005). Although removing the first-order effect of CENTER$ does not
significantly alter the fit, removing the higher-order effects involving CENTER$
decreases the fit substantially.

Example 2
Screening Effects
In this example, you pretend that no models have been fit to the CANCER data (that is,
you have not seen the other examples). As a place to start, first fit a model with all
second-order interactions, finding that it fits. Then fit models nested within the first by
using results from the HTERM (terms tested hierarchically) panel to guide your
selection of terms to be removed.
Here's a summary of your instructions: you study the output generated from the first
MODEL and ESTIMATE statements and decide to remove AGE by TUMOR$. After
seeing the results for this smaller model, you decide to remove AGE by SURVIVE$, too.
USE cancer
LOGLIN
FREQ = number
PRINT NONE / CHI HTERM
MODEL center$*age*survive$*tumor$ = tumor$..center$^2
ESTIMATE / DELTA=0.5
MODEL center$*age*survive$*tumor$ = tumor$..center$^2,
- age*tumor$
ESTIMATE / DELTA=0.5
MODEL center$*age*survive$*tumor$ = tumor$..center$^2,
- age*tumor$,
- age*survive$
ESTIMATE / DELTA=0.5
MODEL center$*age*survive$*tumor$ = tumor$..center$^2,
- age*tumor$,
- age*survive$,
- tumor$*survive$
ESTIMATE / DELTA=0.5

The output follows:


All two-way interactions

Pearson ChiSquare      40.1650   df  40   Probability  0.46294
LR ChiSquare           39.9208   df  40   Probability  0.47378
Raftery's BIC        -225.6219
Dissimilarity           7.6426

Term tested            The model without the term     Removal of term from model
hierarchically            ln(MLE)   Chi-Sq   df  p-value     Chi-Sq   df  p-value
--------------          ---------  -------  ---  -------    -------  ---  -------
TUMOR$                   -361.233   457.17   58   0.0000     417.25   18   0.0000
SURVIVE$                 -241.675   218.06   48   0.0000     178.14    8   0.0000
AGE                      -241.668   218.04   54   0.0000     178.12   14   0.0000
CENTER$                  -213.996   162.70   54   0.0000     122.78   14   0.0000
SURVIVE$ * TUMOR$        -157.695    50.10   43   0.2125      10.18    3   0.0171
AGE * TUMOR$             -153.343    41.39   46   0.6654       1.47    6   0.9613
AGE * SURVIVE$           -154.693    44.09   42   0.3831       4.17    2   0.1241
CENTER$ * TUMOR$         -169.724    74.15   46   0.0053      34.23    6   0.0000
CENTER$ * SURVIVE$       -156.501    47.71   42   0.2518       7.79    2   0.0204
CENTER$ * AGE            -186.011   106.73   44   0.0000      66.81    4   0.0000

Remove AGE * TUMOR$

Pearson ChiSquare      41.8276   df  46   Probability  0.64757
LR ChiSquare           41.3934   df  46   Probability  0.66536
Raftery's BIC        -263.9807
Dissimilarity           7.8682

Term tested            The model without the term     Removal of term from model
hierarchically            ln(MLE)   Chi-Sq   df  p-value     Chi-Sq   df  p-value
--------------          ---------  -------  ---  -------    -------  ---  -------
TUMOR$                   -361.233   457.17   58   0.0000     415.78   12   0.0000
SURVIVE$                 -242.434   219.57   54   0.0000     178.18    8   0.0000
AGE                      -241.668   218.04   54   0.0000     176.65    8   0.0000
CENTER$                  -215.687   166.08   60   0.0000     124.69   14   0.0000
SURVIVE$ * TUMOR$        -158.454    51.61   49   0.3719      10.22    3   0.0168
AGE * SURVIVE$           -155.452    45.61   48   0.5713       4.22    2   0.1214
CENTER$ * TUMOR$         -171.415    77.54   52   0.0124      36.14    6   0.0000
CENTER$ * SURVIVE$       -157.291    49.29   48   0.4214       7.90    2   0.0193
CENTER$ * AGE            -187.702   110.11   50   0.0000      68.72    4   0.0000

Remove AGE * TUMOR$ and AGE * SURVIVE$

Pearson ChiSquare      45.3579   df  48   Probability  0.58174
LR ChiSquare           45.6113   df  48   Probability  0.57126
Raftery's BIC        -273.0400
Dissimilarity           8.4720

Term tested            The model without the term     Removal of term from model
hierarchically            ln(MLE)   Chi-Sq   df  p-value     Chi-Sq   df  p-value
--------------          ---------  -------  ---  -------    -------  ---  -------
TUMOR$                   -363.341   461.39   60   0.0000     415.78   12   0.0000
SURVIVE$                 -242.434   219.57   54   0.0000     173.96    6   0.0000
AGE                      -241.668   218.04   54   0.0000     172.43    6   0.0000
CENTER$                  -219.546   173.80   62   0.0000     128.19   14   0.0000
SURVIVE$ * TUMOR$        -160.563    55.83   51   0.2981      10.22    3   0.0168
CENTER$ * TUMOR$         -173.524    81.75   54   0.0087      36.14    6   0.0000
CENTER$ * SURVIVE$       -161.264    57.23   50   0.2245      11.62    2   0.0030
CENTER$ * AGE            -191.561   117.83   52   0.0000      72.22    4   0.0000

Remove AGE * TUMOR$, AGE * SURVIVE$, and TUMOR$ * SURVIVE$

Pearson ChiSquare      57.5272   df  51   Probability  0.24635
LR ChiSquare           55.8327   df  51   Probability  0.29814
Raftery's BIC        -282.7342
Dissimilarity           9.9530

Term tested            The model without the term     Removal of term from model
hierarchically            ln(MLE)   Chi-Sq   df  p-value     Chi-Sq   df  p-value
--------------          ---------  -------  ---  -------    -------  ---  -------
TUMOR$                   -363.341   461.39   60   0.0000     405.56    9   0.0000
SURVIVE$                 -242.434   219.57   54   0.0000     163.74    3   0.0000
AGE                      -246.779   228.26   57   0.0000     172.43    6   0.0000
CENTER$                  -224.289   183.29   65   0.0000     127.45   14   0.0000
CENTER$ * TUMOR$         -178.267    91.24   57   0.0027      35.41    6   0.0000
CENTER$ * SURVIVE$       -166.007    66.72   53   0.0975      10.89    2   0.0043
CENTER$ * AGE            -196.672   128.05   55   0.0000      72.22    4   0.0000

The likelihood-ratio chi-square for the model that includes all two-way interactions is
39.9 (p value = 0.4738). If the AGE by TUMOR$ interaction is removed, the chi-square
for the smaller model is 41.39 (p value = 0.6654). Does the removal of this interaction
cause a significant change? No, chi-square = 1.47 (p value = 0.9613). This chi-square
is computed as 41.39 minus 39.92 with 46 minus 40 degrees of freedom. The removal
of this interaction results in the least change, so you remove it first. Notice also that the
estimate of the maximized likelihood function is largest when this second-order effect
is removed (-153.343).
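
The screening step can be sketched as follows. This is an illustrative Python calculation, not SYSTAT code: the helper `chi2_upper_tail_even_df` is a hypothetical name, and it uses a closed form for the chi-square upper tail that holds only for even degrees of freedom. For that reason SURVIVE$ * TUMOR$ is left out of the comparison. The chi-square changes and the degrees of freedom of each reduced model are copied from the HTERM panel for the model with all two-way interactions (40 df).

```python
import math

def chi2_upper_tail_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires even, positive df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

DF_FULL = 40  # df of the model with all two-way interactions

# term: (change in LR chi-square, df of the model without the term)
removal = {
    "AGE * TUMOR$": (1.47, 46),
    "AGE * SURVIVE$": (4.17, 42),
    "CENTER$ * TUMOR$": (34.23, 46),
    "CENTER$ * SURVIVE$": (7.79, 42),
    "CENTER$ * AGE": (66.81, 44),
}

p_values = {term: chi2_upper_tail_even_df(chi, df_without - DF_FULL)
            for term, (chi, df_without) in removal.items()}

# The term whose removal changes the fit least has the largest p value.
first_to_drop = max(p_values, key=p_values.get)

for term, p in p_values.items():
    print(f"{term:20s} p = {p:.4f}")
print("remove first:", first_to_drop)
```

The p values agree with the panel to about three decimals; small differences come from using chi-square changes rounded to two decimals.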
The model chi-square for the second model is the same as that given for the first
model with AGE * TUMOR$ removed (41.3934). Here, if AGE by SURVIVE$ is
removed, the new model fits (p value = 0.5713) and the change between the model
minus one interaction and that minus two interactions is not significant (p value =
0.1214).
If SURVIVE$ by TUMOR$ is removed from the current model with four
interactions, the new model fits (p value = 0.2981). The change in fit is not significant
at the 0.01 level (p value = 0.0168). Should we remove any other terms? Looking at the
HTERM panel for the model with three interactions, you see that a model without
CENTER$ by SURVIVE$ has a marginal fit (p value = 0.0975) and the chi-square for the
difference is significant (p value = 0.0043). Although the goal is parsimony and
technically a model with only two interactions does fit, you opt for the model that also
includes CENTER$ by SURVIVE$ because it is a significant improvement over the very
smallest model.

Example 3
Structural Zeros
This example identifies outliers and then declares them to be structural zeros. You
wonder if any of the interactions in the model fit in the example on loglinear modeling
for a four-way table are necessary only because of a few unusual cells. To identify the
unusual cells, first pull back from your ideal model and fit a model with main effects
only, asking for the four most unusual cells. (Why four cells? Because 5% of 72 cells
is 3.6 or roughly 4.)
USE cancer
LOGLIN
FREQ = number
ORDER center$ survive$ tumor$ / SORT=NONE
MODEL center$*age*survive$*tumor$ = tumor$ .. center$
PRINT / CELLS=4
ESTIMATE / DELTA=0.5

Of course this model doesn't fit, but following are selections from the output:
Pearson ChiSquare     181.3892   df  63   Probability  0.00000
LR ChiSquare          174.3458   df  63   Probability  0.00000
Raftery's BIC        -243.8839
Dissimilarity          19.3853

The 4 most outlandish cells (based on FTD, stepwise):
======================================================
                                           CENTER$
                                           | AGE
                                           | | SURVIVE$
                                           | | | TUMOR$
  ln(MLE)  LR_ChiSq  p-value  Frequency    | | | |
---------  --------  -------  ---------    - - - -
 -203.261    33.118    0.000         68    1 1 2 2
 -195.262    15.997    0.000          1    1 3 2 1
 -183.471    23.582    0.000         25    1 1 2 3
 -176.345    14.253    0.000          6    1 3 2 2

Next, fit your ideal model, identifying these four cells as structural zeros and also
requesting PRINT / HTERM to test the need for each interaction term.

Defining Four Cells As Structural Zeros


Continuing from the analysis of main effects only, now specify your original model
with its three second-order effects:
MODEL center$*age*survive$*tumor$ = ,
(age + survive$ + tumor$) # center$
ZERO CELL=1 1 2 2
CELL=1 3 2 1
CELL=1 1 2 3
CELL=1 3 2 2
PRINT / HTERMS
ESTIMATE / DELTA=0.5

Following are selections from the output. Notice that asterisks mark the structural zero
cells.
Number of cells (product of levels):      72
Number of structural zero cells:           4
Total count:                             664

Observed Frequencies
====================
CENTER$   AGE       SURVIVE$ | TUMOR$
                             |  MinMalig  MinBengn  MaxMalig  MaxBengn
---------+---------+---------+----------------------------------------
Tokyo     Under 50  Dead     |     9.000     7.000     4.000     3.000
                    Alive    |    26.000   *68.000   *25.000     9.000
                             +
          50 to 69  Dead     |     9.000     9.000    11.000     2.000
                    Alive    |    20.000    46.000    18.000     5.000
                             +
          70 & Over Dead     |     2.000     3.000     1.000       0.0
                    Alive    |    *1.000    *6.000     5.000     1.000
---------+---------+---------+----------------------------------------
Boston    Under 50  Dead     |     6.000     7.000     6.000       0.0
                    Alive    |    11.000    24.000     4.000       0.0
                             +
          50 to 69  Dead     |     8.000    20.000     3.000     2.000
                    Alive    |    18.000    58.000    10.000     3.000
                             +
          70 & Over Dead     |     9.000    18.000     3.000       0.0
                    Alive    |    15.000    26.000     1.000     1.000
---------+---------+---------+----------------------------------------
Glamorgn  Under 50  Dead     |    16.000     7.000     3.000       0.0
                    Alive    |    16.000    20.000     8.000     1.000
                             +
          50 to 69  Dead     |    14.000    12.000     3.000       0.0
                    Alive    |    27.000    39.000    10.000     4.000
                             +
          70 & Over Dead     |     3.000     7.000     3.000       0.0
                    Alive    |    12.000    11.000     4.000     1.000
-----------------------------+----------------------------------------
* indicates structural zero cells

Pearson ChiSquare      46.8417   df  47   Probability  0.47906
LR ChiSquare           44.8815   df  47   Probability  0.56072
Raftery's BIC        -260.5378
Dissimilarity          10.1680

Term tested            The model without the term     Removal of term from model
hierarchically            ln(MLE)   Chi-Sq   df  p-value     Chi-Sq   df  p-value
--------------          ---------  -------  ---  -------    -------  ---  -------
AGE                      -190.460   132.87   53   0.0000      87.98    6   0.0000
SURVIVE$                 -206.152   164.25   50   0.0000     119.37    3   0.0000
TUMOR$                   -326.389   404.72   56   0.0000     359.84    9   0.0000
CENTER$                  -177.829   107.60   61   0.0002      62.72   14   0.0000
CENTER$ * AGE            -158.900    69.75   51   0.0416      24.86    4   0.0001
CENTER$ * SURVIVE$       -149.166    50.28   49   0.4226       5.40    2   0.0674
CENTER$ * TUMOR$         -162.289    76.52   53   0.0189      31.64    6   0.0000

The model has a nonsignificant test of fit and so does a model without the CENTER$
by SURVIVE$ interaction (p value = 0.4226).

Eliminating Only the Young Women


Two of the extreme cells are from the youngest age group. What happens to the
CENTER$ by SURVIVE$ effect if only these cells are defined as structural zeros?
HTERM remains in effect.
MODEL center$*age*survive$*tumor$ =,
(age + survive$ + tumor$) # center$
ZERO CELL=1 1 2 2 CELL=1 1 2 3
ESTIMATE / DELTA=0.5

The output follows:


Number of cells (product of levels):      72
Number of structural zero cells:           2
Total count:                             671

Pearson ChiSquare      50.2610   df  49   Probability  0.42326
LR ChiSquare           49.1153   df  49   Probability  0.46850
Raftery's BIC        -269.8144
Dissimilarity          10.6372

Term tested            The model without the term     Removal of term from model
hierarchically            ln(MLE)   Chi-Sq   df  p-value     Chi-Sq   df  p-value
--------------          ---------  -------  ---  -------    -------  ---  -------
AGE                      -221.256   188.37   55   0.0000     139.25    6   0.0000
SURVIVE$                 -210.369   166.60   52   0.0000     117.48    3   0.0000
TUMOR$                   -331.132   408.12   58   0.0000     359.01    9   0.0000
CENTER$                  -192.179   130.22   63   0.0000      81.10   14   0.0000
CENTER$ * AGE            -172.356    90.57   53   0.0010      41.45    4   0.0000
CENTER$ * SURVIVE$       -153.888    53.63   51   0.3737       4.52    2   0.1045
CENTER$ * TUMOR$         -169.047    83.95   55   0.0072      34.84    6   0.0000

When the two cells for the young women from Tokyo are excluded from the model
estimation, the CENTER$ by SURVIVE$ effect is not needed (p value = 0.3737).

Eliminating the Older Women


Here you define the two cells for the Tokyo women from the oldest age group as
structural zeros.
MODEL center$*age*survive$*tumor$ =,
(age + survive$ + tumor$) # center$
ZERO CELL=1 3 2 1 CELL=1 3 2 2
ESTIMATE / DELTA=0.5

The output is:


Number of cells (product of levels):      72
Number of structural zero cells:           2
Total count:                             757

Pearson ChiSquare      53.4348   df  49   Probability  0.30782
LR ChiSquare           50.9824   df  49   Probability  0.39558
Raftery's BIC        -273.8564
Dissimilarity           9.4583

Term tested            The model without the term     Removal of term from model
hierarchically            ln(MLE)   Chi-Sq   df  p-value     Chi-Sq   df  p-value
--------------          ---------  -------  ---  -------    -------  ---  -------
AGE                      -203.305   147.41   55   0.0000      96.42    6   0.0000
SURVIVE$                 -238.968   218.73   52   0.0000     167.75    3   0.0000
TUMOR$                   -358.521   457.84   58   0.0000     406.86    9   0.0000
CENTER$                  -209.549   159.89   63   0.0000     108.91   14   0.0000
CENTER$ * AGE            -177.799    96.39   53   0.0003      45.41    4   0.0000
CENTER$ * SURVIVE$       -161.382    63.56   51   0.1114      12.58    2   0.0019
CENTER$ * TUMOR$         -171.123    83.04   55   0.0086      32.06    6   0.0000

When the two cells for the women from the older age group are treated as structural
zeros, the case for removing the CENTER$ by SURVIVE$ effect is much weaker than
when the cells for the younger women are structural zeros. Here, the inclusion of the
effect results in a significant improvement in the fit of the model (p value = 0.0019).

Conclusion
The structural zero feature allowed you to quickly focus on 2 of the 72 cells in your
multiway table: the survivors under 50 from Tokyo, especially those with benign
tumors with minimal inflammation. The overall survival rate for the 764 women is
72.5%, that for Tokyo is 79.3%, and that for the most unusual cell is 90.67%. Half of
the Tokyo women under age 50 have MinBengn tumors (75 out of 151) and almost
10% of the 764 women (spread across 72 cells) are concentrated here. Possibly the
protocol for study entry (including definition of a tumor) was executed differently at
this center than at the others.
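
The percentages quoted for the unusual Tokyo cells follow directly from the observed counts. The short Python sketch below (illustrative only; the variable names are not SYSTAT names) reproduces them from the Observed Frequencies table.

```python
# Counts copied from the Observed Frequencies table.
alive_minbengn = 68     # Tokyo, Under 50, Alive, MinBengn
dead_minbengn = 7       # Tokyo, Under 50, Dead, MinBengn
tokyo_under_50 = 151    # all Tokyo women under 50
total_women = 764       # total count for the table

minbengn_group = alive_minbengn + dead_minbengn

survival_rate = 100 * alive_minbengn / minbengn_group
share_of_tokyo_young = 100 * minbengn_group / tokyo_under_50
share_of_sample = 100 * minbengn_group / total_women

print(f"survival rate in the unusual cell: {survival_rate:.2f}%")
print(f"share of Tokyo women under 50:     {share_of_tokyo_young:.1f}%")
print(f"share of all 764 women:            {share_of_sample:.1f}%")
```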

Example 4
Tables without Analyses
If you want only a frequency table and no analysis, use TABULATE. Simply specify the
table factors in the same order in which you want to view them from left to right. In
other words, the last variable defines the columns of the table, and the
cross-classifications of the preceding variables define the rows.
For this example, we use data in the CANCER file. Here we use LOGLIN to display
counts for a 3 by 3 by 2 by 4 table (72 cells) in two dozen lines. The input is:
USE cancer
LOGLIN
FREQ = number
LABEL age / 50='Under 50', 60='59 to 69', 70='70 & Over'
ORDER center$ / SORT=NONE
ORDER tumor$ / SORT='MinBengn', 'MaxBengn', 'MinMalig',
                    'MaxMalig'
TABULATE age * center$ * survive$ * tumor$

The resulting table is:


Number of cells (product of levels):      72
Total count:                             764

Observed Frequencies
====================
AGE       CENTER$   SURVIVE$ |  TUMOR$
                             |  MinBengn  MaxBengn  MinMalig  MaxMalig
---------+---------+---------+----------------------------------------
Under 50  Tokyo     Alive    |    68.000     9.000    26.000    25.000
                    Dead     |     7.000     3.000     9.000     4.000
                             +
          Boston    Alive    |    24.000     0.0      11.000     4.000
                    Dead     |     7.000     0.0       6.000     6.000
                             +
          Glamorgn  Alive    |    20.000     1.000    16.000     8.000
                    Dead     |     7.000     0.0      16.000     3.000
---------+---------+---------+----------------------------------------
59 to 69  Tokyo     Alive    |    46.000     5.000    20.000    18.000
                    Dead     |     9.000     2.000     9.000    11.000
                             +
          Boston    Alive    |    58.000     3.000    18.000    10.000
                    Dead     |    20.000     2.000     8.000     3.000
                             +
          Glamorgn  Alive    |    39.000     4.000    27.000    10.000
                    Dead     |    12.000     0.0      14.000     3.000
---------+---------+---------+----------------------------------------
70 & Over Tokyo     Alive    |     6.000     1.000     1.000     5.000
                    Dead     |     3.000     0.0       2.000     1.000
                             +
          Boston    Alive    |    26.000     1.000    15.000     1.000
                    Dead     |    18.000     0.0       9.000     3.000
                             +
          Glamorgn  Alive    |    11.000     1.000    12.000     4.000
                    Dead     |     7.000     0.0       3.000     3.000
-----------------------------+----------------------------------------
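
The layout rule for TABULATE described in this example (the last factor runs across the columns, and the earlier factors are crossed down the rows) can be sketched outside SYSTAT. The following Python is an illustration of that rule only, not a description of how SYSTAT builds the table; the function tabulate and the lowercase field names are hypothetical, and only the Under 50 / Tokyo counts from the table are used.

```python
def tabulate(cells, factors):
    """Lay out counts with the last factor as columns, earlier factors as rows.

    cells   -- iterable of (levels, count); levels maps factor name -> level
    factors -- factor names in left-to-right viewing order
    """
    *row_factors, col_factor = factors
    columns = []   # column levels, in order of first appearance
    rows = {}      # row key (tuple of levels) -> {column level: count}
    for levels, count in cells:
        col = levels[col_factor]
        if col not in columns:
            columns.append(col)
        key = tuple(levels[f] for f in row_factors)
        by_col = rows.setdefault(key, {})
        by_col[col] = by_col.get(col, 0) + count
    table = {key: [by_col.get(c, 0) for c in columns]
             for key, by_col in rows.items()}
    return columns, table

tumors = ["MinBengn", "MaxBengn", "MinMalig", "MaxMalig"]
cells = []
for status, freqs in (("Alive", [68, 9, 26, 25]), ("Dead", [7, 3, 9, 4])):
    for tumor, n in zip(tumors, freqs):
        cells.append(({"age": "Under 50", "center": "Tokyo",
                       "survive": status, "tumor": tumor}, n))

columns, table = tabulate(cells, ["age", "center", "survive", "tumor"])
print(columns)
for key, freqs in table.items():
    print(key, freqs)
```

Reordering the factor list, for example putting "survive" last, would move the survival status to the columns and the tumor type to the rows.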


Computation
Algorithms
Loglinear modeling implements the algorithms of Haberman (1973).

References
Agresti, A. (1984). Analysis of ordinal categorical data. New York: Wiley-Interscience.
Agresti, A. (1990). Categorical data analysis. New York: Wiley-Interscience.
Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. (1975). Discrete multivariate
analysis: Theory and practice. Cambridge, Mass.: MIT Press.
Fienberg, S. E. (1980). The analysis of cross-classified categorical data, 2nd ed.
Cambridge, Mass.: MIT Press.
Goodman, L. A. (1978). Analyzing qualitative/categorical data: Loglinear models and
latent structure analysis. Cambridge, Mass.: Abt Books.
Haberman, S. J. (1973). Loglinear fit for contingency tables, algorithm AS 51. Applied
Statistics, 21, 218-224.
Haberman, S. J. (1978). Analysis of qualitative data, Vol. 1: Introductory topics. New
York: Academic Press.
Knoke, D. and Burke, P. S. (1980). Loglinear models. Newbury Park: Sage.
Morrison, D. F. (1976). Multivariate statistical methods. New York: McGraw-Hill.

Index

A matrix, I-499
accelerated failure time distribution, II-538
ACF plots, II-642
additive trees, I-62, I-68
AID, I-35, I-37
Akaike Information Criterion, II-288
alpha level, II-314, II-315, II-321
alternative hypothesis, I-13, II-312
analysis of covariance, I-431
examples, I-462, I-478
model, I-432
analysis of variance, I-210, I-487
algorithms, I-485
ANOVA command, I-438
assumptions, I-388
between-group differences, I-394
bootstrapping, I-438
compared to loglinear modeling, I-619
compared to regression trees, I-35
contrasts, I-390, I-434, I-435, I-436
data format, I-438
examples, I-439, I-442, I-447, I-457, I-459, I-461,
I-462, I-464, I-470, I-472, I-475, I-478, I-480
factorial, I-387
hypothesis tests, I-386, I-434, I-435, I-436
interactions, I-387
model, I-432
multivariate, I-393, I-396
overview, I-431
post hoc tests, I-389, I-432
power analysis, II-311, II-318, II-319, II-351, II-353, II-374, II-378
Quick Graphs, I-438
repeated measures, I-393, I-436
residuals, I-432
two-way ANOVA, II-319, II-353, II-378
unbalanced designs, I-391
unequal variances, I-388

usage, I-438
within-subject differences, I-394
Anderberg dichotomy coefficients, I-120, I-126
angle tolerance, II-496
anisotropy, II-500, II-508
geometric, II-500
zonal, II-500
A-optimality, I-246
ARIMA models, II-627, II-637, II-650
algorithms, II-678
ARMA models, II-632
autocorrelation plots, I-374, II-630, II-633, II-642
Automatic Interaction Detection, I-35
autoregressive models, II-630
axial designs, I-242

backward elimination, I-379


bandwidth, II-460, II-465, II-472, II-496
optimal values, II-466
relationship with kernel function, II-466
BASIC, II-536
basic statistics
See descriptive statistics, I-211
beta level, II-314, II-315
between-group differences
in analysis of variance, I-394
bias, I-379
binary logit, I-550
compared to multinomial logit, I-552
binary trees, I-33
biplots, II-298, II-299
Bisquare procedure, II-162
biweight kernel, II-463, II-472, II-473
Bonferroni inequality, I-37


Bonferroni test, I-127, I-389, I-432, I-493, II-591


bootstrap, I-19, I-20
algorithms, I-28
bootstrap-t method, I-19
command, I-20
data format, I-20
examples, I-21, I-24, I-25, I-26
missing data, I-28
naive bootstrap, I-19
overview, I-17
Quick Graphs, I-20
usage, I-20
box plot, I-210
Box-Behnken designs, I-238, I-264
Box-Hunter designs, I-235, I-256
Bray-Curtis measure, I-119, I-126

C matrix, I-499
candidate sets
for optimal designs, I-245
canonical correlation analysis, I-487
bootstrapping, II-417
data format, II-416, II-405
examples, II-418, II-409, II-425
interactions, II-417
model, II-415
nominal scales, II-417
overview, II-407
partialled variables, II-415
Quick Graphs, II-417
rotation, II-416
usage, II-417
canonical correlations, I-286
canonical rotation, II-299
categorical data, II-199
categorical predictors, I-35
Cauchy kernel, II-463, II-472, II-473
CCF plots, II-643
central composite designs, I-238, I-269
central limit theorem, II-588
centroid designs, I-241
CHAID, I-36, I-37

chi-square, I-160
Chi-square test for independence, I-148
circle model
in perceptual mapping, II-297
city-block distance, I-126
classical analysis, II-604
classification functions, I-280
classification trees, I-36
algorithms, I-51
basic tree model, I-32
bootstrapping, I-44
commands, I-43
compared to discriminant analysis, I-36, I-39
data format, I-44
displays, I-41
examples, I-45, I-47, I-49
loss functions, I-38, I-41
missing data, I-51
mobiles, I-31
model, I-41
overview, I-31
pruning, I-37
Quick Graphs, I-44
saving files, I-44
stopping criteria, I-37, I-43
usage, I-44
cluster analysis
additive trees, I-68
algorithms, I-84
bootstrapping, I-70
commands, I-69
data types, I-70
distances, I-66
examples, I-71, I-75, I-78, I-79, I-81, I-82
exclusive clusters, I-54
hierarchical clustering, I-64
k-means clustering, I-67
missing values, I-84
overlapping clusters, I-54
overview, I-53
Quick Graphs, I-70
saving files, I-70
usage, I-70
clustered data, II-47
Cochran's test of linear trend, I-168


coefficient of alienation, II-123, II-143


coefficient of determination
See multiple correlation
coefficient of variation, I-211
Cohen's kappa, I-164, I-168
communalities, I-332
compound symmetry, I-394
conditional logistic regression model, I-552
conditional logit model, I-554
confidence curves, II-156
confidence intervals, I-11, I-211
path analysis, II-285
conjoint analysis
additive tables, I-88
algorithms, I-112
bootstrapping, I-95
commands, I-95
compared to logistic regression, I-92
data format, I-95
examples, I-96, I-100, I-103, I-107
missing data, I-113
model, I-93
multiplicative tables, I-89
overview, I-87
Quick Graphs, I-95
saving files, I-95
usage, I-95
constraints
in mixture designs, I-242
contingency coefficient, I-165, I-168
contour plots, II-506
contrast coefficients, I-393
contrasts
in analysis of variance, I-390
convex hulls, II-505
Cook's distance, I-375
Cook-Weisberg graphical confidence curves, II-156
coordinate exchange method, I-245, I-270
correlation matrix, II-12
correlations, I-55, I-115
algorithms, I-145
binary data, I-126
bootstrapping, I-129

canonical, II-407
commands, I-128
continuous data, I-125
data format, I-129
dissimilarity measures, I-126
distance measures, I-126
examples, I-129, I-132, I-134, I-135, I-137, I-140, I-143, I-145
missing values, I-124, I-146, II-12
options, I-127
power analysis, II-311, II-317, II-336, II-338
Quick Graphs, I-129
rank-order data, I-126
saving files, I-129
set, II-407
usage, I-129
correlograms, II-509
correspondence analysis, II-294, II-298
algorithms, I-156
bootstrapping, I-150
commands, I-150
data format, I-150
examples, I-151, I-153
missing data, I-149, I-156
model, I-149
multiple correspondence analysis, I-149
overview, I-147
Quick Graphs, I-150
simple correspondence analysis, I-149
usage, I-150
covariance matrix, I-125, II-12
covariance paths
path analysis, II-236
covariograms, II-495
Cramer's V, I-165
critical level, I-13
Cronbach's alpha, II-604, II-605
See descriptive statistics, I-214
cross-correlation plots, II-643
crossover designs, I-487
crosstabulation
bootstrapping, I-172
commands, I-171
data format, I-172


examples, I-173, I-175, I-177, I-178, I-179, I-181, I-186, I-188, I-189, I-192, I-194, I-196, I-197, I-199, I-200
multiway, I-170
one-way, I-158, I-160, I-166
overview, I-157
Quick Graphs, I-172
standardizing tables, I-159
two-way, I-158, I-161, I-167, I-168
usage, I-172
cross-validation, I-37, I-280, I-380

D matrix, I-499
D SUB-A (da), II-433
dates, II-536
degrees-of-freedom, II-586
dendrograms, I-57, I-70
dependence paths
path analysis, II-235
descriptive statistics, I-1
basic statistics, I-211, I-212
bootstrapping, I-215
commands, I-215
Cronbach's alpha, I-214
data format, I-215
overview, I-205
Quick Graphs, I-215
stem-and-leaf plots, I-213
usage, I-215
design of experiments, I-92, I-250, I-251
axial designs, I-242
bootstrapping, I-252
Box-Behnken designs, I-238
central composite designs, I-238
centroid designs, I-241
commands, I-252
examples, I-253, I-254, I-256, I-258, I-260, I-263, I-264, I-265, I-266, I-269, I-270
factorial designs, I-231, I-232
lattice designs, I-241
mixture designs, I-232, I-239
optimal designs, I-232, I-244
overview, I-227
Quick Graphs, I-252

response surface designs, I-232, I-236


screening designs, I-242
usage, I-252
determinant criterion
See D-optimality
dichotomy coefficients
Anderberg, I-126
Jaccard, I-126
positive matching, I-126
simple matching, I-126
Tanimoto, I-126
difference contrasts, I-498
difficulty, II-621
discrete choice model, I-554
compared to polytomous logit, I-555
discrete gaussian convolution, II-470
discriminant analysis, I-487
bootstrapping, I-288
commands, I-287
compared to classification trees, I-36
data format, I-288
estimation, I-284
examples, I-288, I-293, I-298, I-306, I-313, I-315, I-321

linear discriminant function, I-280


linear discriminant model, I-276
model, I-283
multiple groups, I-282
options, I-284
overview, I-275
prior probabilities, I-282
Quick Graphs, I-288
statistics, I-286
stepwise estimation, I-284
usage, I-288
discrimination parameter, II-621
dissimilarities
direct, II-121
indirect, II-121
distance measures, I-55, I-115
distances
nearest-neighbor, II-503
distance-weighted least squares (DWLS) smoother, II-470

dit plots, I-15


D-optimality, I-246
dot histogram plots, I-15
D-PRIME (d'), II-444
dummy codes, I-490
Duncan's test, I-390
Dunnett test, I-493
Dunn-Sidak test, I-127, II-603

ECVI, II-287
edge effects, II-517
effect size
in power analysis, II-315
effects codes, I-383, I-490
efficiency, I-244
eigenvalues, I-286
ellipse model
in perceptual mapping, II-298
EM algorithm, I-362
EM estimation, II-8
for correlations, I-127, II-12
for covariances, II-12
for SSCP matrix, II-12
endogenous variables
path analysis, II-236
Epanechnikov kernel, II-475, II-484, II-485
equamax rotation, I-333, I-337
Euclidean distances, II-121
exogenous variables
path analysis, II-236
expected cross-validation index, II-287
exponential distribution, II-550
exponential model, II-510, II-520
exponential smoothing, II-650
external unfolding, II-296

F distribution
noncentrality parameter, II-327, II-356
factor analysis, I-331, II-294
algorithms, I-362

bootstrapping, I-339
commands, I-339
compared to principal components analysis, I-334
convergence, I-335
correlations vs covariances, I-331
data format, I-339
eigenvalues, I-335
eigenvectors, I-338
examples, I-341, I-344, I-348, I-350, I-353, I-356
iterated principal axis, I-335
loadings, I-338
maximum likelihood, I-335
missing values, I-362
number of factors, I-335
overview, I-327
principal components, I-335
Quick Graphs, I-339
residuals, I-338
rotation, I-333, I-337
save, I-338
scores, I-338
usage, I-339
factor loadings, II-616
factorial analysis of variance, I-387
factorial designs, I-231, I-232
analysis of, I-235
examples, I-253
fractional factorials, I-234
full factorial designs, I-234
Fedorov method, I-245
Fieller bounds, I-581
filters, II-652
Fisher's LSD, I-389, I-493
Fisher's exact test, I-164, I-168
Fisher's linear discriminant function, II-294
fixed variance
path analysis, II-238
fixed-bandwidth method
compared to KNN method, II-479
for smoothing, II-477, II-479, II-484
Fletcher-Powell minimization, II-632
forward selection, I-379
Fourier analysis, II-651, II-664
fractional factorial designs, I-487


Box-Hunter designs, I-235


examples, I-254, I-256, I-258, I-260, I-263
homogeneous fractional designs, I-235
Latin square designs, I-235
mixed-level fractional designs, I-235
Plackett-Burman designs, I-235
Taguchi designs, I-235
Freeman-Tukey deviates, I-622
frequencies, I-20, I-44, I-95, I-129, I-150, I-172, I-215, I-288, I-339, I-403, I-438, I-501, I-565, I-626, II-15, II-64, II-127, II-165, II-206, II-223,
II-249, II-301, II-359, II-389, II-417, II-437, II-475, II-515, II-548, II-592, II-609, II-653, II-685
frequency tables, See crosstabulation
Friedman test, II-202

gamma coefficients, I-126


Gaussian kernel, II-463, II-472, II-473
Gaussian model, II-498, II-508
Gauss-Newton method, II-155, II-156
general linear models
algorithms, I-546
bootstrapping, I-501
categorical variables, I-490
commands, I-501
contrasts, I-495, I-497, I-498, I-499
data format, I-501
examples, I-503, I-510, I-512, I-513, I-515, I-518, I-520, I-523, I-532, I-535, I-536, I-540, I-544, I-545

hypothesis tests, I-495


mixture model, I-492
model estimation, I-488
overview, I-487
post hoc tests, I-493
Quick Graphs, I-501
repeated measures, I-491
residuals, I-488
stepwise regression, I-492
usage, I-501
generalized least squares, II-247, II-683
generalized variance, II-410
geostatistical models, II-494, II-495
Gini index, I-38, I-41

GLM
See general linear models
global criterion
See G-optimality
Goodman-Kruskal gamma, I-126, I-165, I-168
Goodman-Kruskal lambda, I-168
G-optimality, I-246
Graeco-Latin square designs, I-235
Greenhouse-Geisser statistic, I-395
Guttman mu2 monotonicity coefficients, I-119, I-126
Guttman's coefficient of alienation, II-123
Guttman's loss function, II-143
Guttman-Rulon coefficient, II-605

Hadi outlier detection, I-123, I-127


Hampel procedure, II-162
Hanning weights, II-626
hazard function
heterogeneity, II-541
heteroskedasticity, II-682
heteroskedasticity-consistent standard errors, II-683
hierarchical clustering, I-56, I-64
hierarchical linear models
See mixed regression
hinge, I-207
histograms
nearest-neighbor, II-515
hole model, II-499, II-508
Holt's method, II-638
Huber procedure, II-162
Huynh-Feldt statistic, I-395
hyper-Graeco-Latin square designs, I-235
hypothesis
alternative, I-13
null, I-13
testing, I-12, I-371


ID3, I-37
incomplete block designs, I-487
independence, I-161
in loglinear models, I-618
INDSCAL model, II-119
inertia, I-148
inferential statistics, I-7, II-312
instrumental variables, II-681
internal-consistency, II-605
interquartile range, I-207
interval censored data, II-534
inverse-distance smoother, II-470
isotropic, II-495
item-response analysis
See test item analysis
item-test correlations, II-604

Jaccard dichotomy coefficients, I-120, I-126


jackknife, I-18, I-20
jackknifed classification matrix, I-280

k nearest-neighbors method
compared to fixed-bandwidth method, II-467
for smoothing, II-465, II-472
Kendall's tau-b coefficients, I-165, I-168
Kendall's tau-b coefficients, I-126
kernel functions, II-460, II-462
biweight, II-463, II-472, II-473
Cauchy, II-463, II-472, II-473
Epanechnikov, II-463, II-472, II-473
Gaussian, II-463, II-472, II-473
plotting, II-463
relationship with bandwidth, II-466
tricube, II-463, II-472, II-473
triweight, II-463, II-472, II-473
uniform, II-463, II-472
k-exchange method, I-245
k-means clustering, I-60, I-67
Kolmogorov-Smirnov test, II-200

KR20, II-605
kriging, II-506
ordinary, II-501, II-512
simple, II-501, II-512
trend components, II-501
universal, II-502
Kruskal's loss function, II-142
Kruskal's STRESS, II-123
Kruskal-Wallis test, II-198, II-199
Kukoc statistic 7
Kulczynski measure, I-126
kurtosis, I-211

lags
number of lags, II-496
latent trait model, II-604, II-606
Latin square designs, I-235, I-258, I-487
lattice, II-220
lattice designs, I-241
Lawley-Hotelling trace, I-286
least absolute deviations, II-154
Levene test, I-388
leverage, I-376
likelihood ratio chi-square, I-164, I-620
compared to Pearson chi-square, I-620
likelihood-ratio chi-square, I-168, I-622
Lilliefors test, II-217
linear contrasts, I-390
linear discriminant function, I-280
linear discriminant model, I-276
linear models
analysis of variance, I-431
general linear models, I-487
hierarchical, II-47
linear regression, I-399
linear regression, I-11, I-371, II-330
bootstrapping, I-403
commands, I-403
data format, I-403
estimation, I-401


examples, I-404, I-407, I-410, I-413, I-417, I-420, I-424, I-426, I-427, I-428, I-429
model, I-400
overview, I-399
Quick Graphs, I-403
residuals, I-373, I-400
stepwise, I-379, I-401
tolerance, I-401
usage, I-403
using correlation matrix as input, I-381
using covariance matrix as input, I-381
using SSCP matrix as input, I-381
listwise deletion, I-362, II-3
Little MCAR test, II-1, II-11, II-12, II-31
loadings, I-330, I-331
LOESS smoothing, II-470, II-472, II-476, II-477, II-480, II-489
logistic item-response analysis, II-620
one-parameter model, II-606
two-parameter model, II-606
logistic regression
algorithms, I-609
bootstrapping, I-565
categorical predictors, I-558
compared to conjoint analysis, I-92
compared to linear model, I-550
conditional variables, I-557
confidence intervals, I-581
convergence, I-560
data format, I-565
deciles of risk, I-561
discrete choice, I-559
dummy coding, I-558
effect coding, I-558
estimation, I-560
examples, I-566, I-568, I-569, I-574, I-579, I-582, I-591, I-598, I-600, I-604, I-607
missing data, I-609
model, I-557
options, I-560
overview, I-549
post hoc tests, I-563
prediction table, I-557
print options, I-565
quantiles, I-562, I-582
Quick Graphs, I-565

simulation, I-563
stepwise estimation, I-560
tolerance, I-560
usage, I-565
weights, I-565
logit model, I-551
loglinear modeling
bootstrapping, I-626
commands, I-626
compared to analysis of variance, I-619
compared to Crosstabs, I-625
convergence, I-621
data format, I-626
examples, I-627, I-638, I-641, I-645
frequency tables, I-625
model, I-621
overview, I-617
parameters, I-622
Quick Graphs, I-626
saturated models, I-619
statistics, I-622
structural zeros, I-623
usage, I-626
log-logistic distribution, II-538
log-normal distribution, II-538
longitudinal data, II-47
loss functions, I-38, II-151
multidimensional scaling, II-142
LOWESS smoothing, II-627
low-pass filter, II-640
LSD test, I-432, I-493

madograms, II-509
Mahalanobis distances, I-276, I-286
Mann-Whitney test, II-198, II-199
MANOVA
See analysis of variance
Mantel-Haenszel test, I-170
MAR, II-9
Marquardt method, II-159
Marron & Nolan canonical kernel width, II-466, II-472


mass, I-148
matrix displays, I-57
maximum likelihood estimates, II-152
maximum likelihood factor analysis, I-334
maximum Wishart likelihood, II-247
MCAR, II-9
MCAR test, II-1, II-12, II-31
McFadden's conditional logit model, I-554
McNemar's test, I-164, I-168
MDPREF, II-298, II-299
MDS
See multidimensional scaling
mean, I-3, I-207, I-211
mean smoothing, II-468, II-472, II-474
means coding, I-384
median, I-4, I-206, I-211
median smoothing, II-468
meta-analysis, I-382
MGLH
See general linear models
midrange, I-207
minimum spanning trees, II-503
Minkowski metric, II-123
MIS function, II-20
missing value analysis
algorithms, II-46
bootstrapping, II-15
casewise pattern table, II-20
data format, II-15
EM algorithm, II-8, II-12, II-25, II-33, II-42
examples, II-15, II-20, II-25, II-33, II-42
listwise deletion, II-3, II-25, II-33
MISSING command, II-14
missing value patterns, II-15
model, II-12
outliers, II-12
overview, II-1
pairwise deletion, II-3, II-25, II-33
pattern variables, II-2, II-42
Quick Graphs, II-15
randomness, II-9
regression imputation, II-6, II-12, II-25, II-42
saving estimates, II-12, II-15

unconditional mean imputation, II-4


usage, II-15
mixed regression
algorithms, II-117
bootstrapping, II-64
commands, II-64
data format, II-64
examples, II-65, II-74, II-84, II-104
overview, II-47
Quick Graphs, II-64
usage, II-64
mixture designs, I-232, I-239
analysis of, I-243
axial designs, I-242
centroid designs, I-241
constraints, I-242
examples, I-265, I-266
lattice designs, I-241
Scheffé model, I-243
screening designs, I-242
simplex, I-241
models, I-10
estimation, I-10
mosaic plots, II-506
moving average, II-465, II-625, II-631
moving-averages smoother, II-470
mu2 monotonicity coefficients, I-126
multidimensional scaling, II-294
algorithms, II-142
assumptions, II-120
bootstrapping, II-127
commands, II-127
configuration, II-123, II-126
confirmatory, II-126
convergence, II-123
data format, II-127
dissimilarities, II-121
distance metric, II-123
examples, II-128, II-130, II-132, II-136, II-140
Guttman method, II-143
individual differences, II-119
Kruskal method, II-142
log function, II-123
loss functions, II-123
matrix shape, II-123


metric, II-123
missing values, II-144
nonmetric, II-123
overview, II-119
power function, II-123
Quick Graphs, II-127
residuals, II-123
Shepard diagrams, II-123, II-127
usage, II-127
multilevel models
See mixed regression
multinomial logit, I-552
compared to binary logit, I-552
multiple correlation, I-372
multiple correspondence analysis, I-148
multiple regression, I-376
multivariate analysis of variance, I-396
mutually exclusive, I-160

Nadaraya-Watson smoother, II-470


nesting, I-487
Newman-Keuls test, I-390
Newton-Raphson method, I-617
nodes, I-33
nominal data, II-199
noncentral F distribution, II-327, II-356
noncentrality parameters, II-327
nonlinear models, II-147
algorithms, II-194
bootstrapping, II-165
commands, II-165
computation, II-159, II-194
convergence, II-159
data format, II-165
estimation, II-155
examples, II-166, II-169, II-172, II-174, II-177, II-179, II-180, II-182, II-185, II-190, II-192, II-193
functions of parameters, II-161
loss functions, II-151, II-156, II-163, II-164
missing data, II-195
model, II-156
parameter bounds, II-159
problems in, II-155

Quick Graphs, II-165


recalculation of parameters, II-160
robust estimation, II-162
starting values, II-159
usage, II-165
nonmetric unfolding model, II-119
nonparametric statistics
algorithms, II-217
bootstrapping, II-206
commands, II-201, II-203, II-205
data format, II-206
examples, II-206, II-208, II-209, II-211, II-212, II-214, II-216
Friedman test, II-202
independent samples tests, II-199, II-200
Kolmogorov-Smirnov test, II-200, II-203
Kruskal-Wallis test, II-199
Mann-Whitney test, II-199
one-sample tests, II-205
overview, II-197
Quick Graphs, II-206
related variables tests, II-201, II-202
sign test, II-201
usage, II-206
Wald-Wolfowitz runs test, II-205
Wilcoxon signed-rank test, II-202
Wilcoxon test, II-199
normal distribution, I-207
NPAR model, II-432
nugget, II-499
null hypothesis, I-12, II-312

oblimin rotation, I-333, I-337


observational studies, I-229
Occam's razor, I-91
odds ratio, I-168
omni-directional variograms, II-496
optimal designs, I-232, I-244
analysis of, I-246
A-optimality, I-246
candidate sets, I-245
coordinate exchange method, I-245, I-270
D-optimality, I-246


efficiency criteria, I-246


Fedorov method, I-245
G-optimality, I-246
k-exchange method, I-245
model, I-247
optimality criteria, I-246
optimality, I-244
ORDER, II-537
ordinal data, II-198
ordinary least squares, II-247
orthomax rotation, I-333, I-337

PACF plots, II-642


pairwise deletion, I-362, II-3
pairwise mean comparisons, I-389
parameters, I-10
parametric modeling, II-538
partial autocorrelation plots, II-632, II-633, II-642
partialing
in set correlation, II-411
partially ordered scalogram analysis with coordinates
algorithms, II-232
bootstrapping, II-223
commands, II-222
convergence, II-222
data format, II-223
displays, II-221
examples, II-224, II-225, II-228
missing data, II-232
model, II-222
overview, II-219
Quick Graphs, II-223
usage, II-223
path analysis
algorithms, II-284
bootstrapping, II-249
commands, II-249
confidence intervals, II-247, II-285
covariance paths, II-236
covariance relationships, II-245
data format, II-249
dependence paths, II-235

dependence relationships, II-243


endogenous variables, II-236
estimation, II-247
examples, II-250, II-255, II-268, II-274
exogenous variables, II-236
fixed parameters, II-243, II-245
fixed variance, II-238
free parameters, II-243, II-245
latent variables, II-247
manifest variables, II-247
measures of fit, II-285
model, II-241, II-243
overview, II-233
path diagrams, II-233
Quick Graphs, II-249
starting values, II-247
usage, II-249
variance paths, II-236
path diagrams, II-233
Pearson chi-square, I-161, I-166, I-168, I-618, I-622
compared to likelihood ratio chi-square, I-620
Pearson correlation, I-117, I-123, I-125
perceptual mapping
algorithms, II-308
bootstrapping, II-301
commands, II-301
data format, II-301
examples, II-302, II-303, II-304, II-306
methods, II-299
missing data, II-308
model, II-299
overview, II-293
Quick Graphs, II-301
usage, II-301
periodograms, II-639
permutation tests, I-160
phi coefficient, I-38, I-41, I-165, I-168
Pillai trace, I-286
Plackett-Burman designs, I-235, I-263
point processes, II-494, II-502
polynomial contrasts, I-390, I-393, I-498
polynomial smoothing, II-468, II-472, II-474
pooled variances, II-588
populations, I-7


POSET, II-219
positive matching dichotomy coefficients, I-120
power, II-314
power analysis
analysis of variance, II-311
bootstrapping, II-359
commands, II-358
correlation coefficients, II-317, II-336, II-338
correlations, II-311
data format, II-359
examples, II-360, II-363, II-366, II-369, II-374, II-378

generic, II-327, II-356, II-374


one sample t-test, II-318, II-345
one sample z test, II-341
one-way ANOVA, II-318, II-351, II-374
overview, II-311
paired t-test, II-318, II-346, II-363
power curves, II-359
proportions, II-311, II-317, II-332, II-333, II-360
Quick Graphs, II-359
randomized block designs, II-311
t-tests, II-311, II-318, II-345, II-346, II-348, II-369
two sample t-test, II-318, II-348, II-369
two sample z test, II-342
two-way ANOVA, II-319, II-353, II-378
usage, II-359
z tests, II-311, II-341, II-342
power curves, II-359
overlaying curves, II-363
response surfaces, II-363
Power model, II-498, II-508
preference curves, II-296
preference mapping, II-294
PREFMAP, II-299
principal components analysis, I-327, I-328, I-487
coefficients, I-330
compared to factor analysis, I-334
compared to linear regression, I-329
loadings, I-330
prior probabilities, I-282
probability plots, I-15, I-373
probit analysis
algorithms, II-392
bootstrapping, II-389

categorical variables, II-387


commands, II-388
data format, II-389
dummy coding, II-387
effect coding, II-387
examples, II-389, II-391
interpretation, II-386
missing data, II-392
model, II-385, II-386
overview, II-385
Quick Graphs, II-389
saving files, II-389
usage, II-389
Procrustes rotations, II-298, II-299
proportional hazards models, II-539
proportions
power analysis, II-311, II-317, II-332, II-333, II-360
p-value, II-312

QSK coefficient, I-126


quadrat counts, II-493, II-505, II-506
quadratic contrasts, I-390
quantile plots, II-540
quantitative symmetric dissimilarity coefficient, I-119

quartimax rotation, I-333, I-337


quasi-independence, I-623
Quasi-Newton method, II-155, II-156

random coefficient models


See mixed regression
random effects
in mixed regression, II-47
random fields, II-494
random samples, I-8
random variables, I-370
random walk, II-631
randomized block designs, I-487, II-330
power analysis, II-311


range, I-207, I-211, II-499


rank-order coefficients, I-126
Rasch model, II-606
receiver operating characteristic curves
See signal detection analysis
regression
linear, I-11, I-399
logistic, I-549
two-stage least squares, II-681
rank, II-395
ridge, II-401
regression trees, I-35
algorithms, I-51
basic tree model, I-32
bootstrapping, I-44
commands, I-43
compared to analysis of variance, I-35
compared to stepwise regression, I-36
data format, I-44
displays, I-41
examples, I-45, I-47, I-49
loss functions, I-38, I-41
missing data, I-51
mobiles, I-31
model, I-41
overview, I-31
pruning, I-37
Quick Graphs, I-44
saving files, I-44
stopping criteria, I-37, I-43
usage, I-44
reliability, II-605, II-607
repeated measures, I-393, I-491
assumptions, I-394
response surface designs, I-232, I-236
analysis of, I-239
Box-Behnken designs, I-238
central composite designs, I-238
examples, I-264, I-269
rotatability, I-237, I-238
response surfaces, I-92, II-156
right censored data, II-534
RMSEA, II-287
robust smoothing, II-468, II-472, II-474
robustness, II-199
ROC curves, II-431, II-432, II-437

root mean square error of approximation, II-286


rotatability
in response surface designs, I-237
rotatable designs
in response surface designs, I-238
rotation, I-333
running median smoothers, II-626
running-means smoother, II-470

Sakitt D, II-433
sample size, II-315, II-323
samples, I-8
sampling
See bootstrap
saturated models
loglinear modeling, I-619
scalogram
See partially ordered scalogram analysis with coordinates
scatterplot matrix, I-117
Scheffé model
in mixture designs, I-243
Scheffé test, I-389, I-432, I-493
screening designs, I-242
SD-RATIO, II-433
seasonal decomposition, II-637
second-order stationarity, II-495
semi-variograms, II-496, II-509
set correlations, II-407
assumptions, II-408
measures of association, II-409
missing data, II-429
partialing, II-408
See canonical correlation analysis
Shepard diagrams, II-123, II-127
Shepard's smoother, II-470
sign test, II-201
signal detection analysis
algorithms, II-456
bootstrapping, II-437
chi-square model, II-433


commands, II-436
convergence, II-433
data format, II-437
examples, II-433, II-434, II-447, II-450, II-453, II-454

exponential model, II-433


gamma model, II-433
logistic model, II-433
missing data, II-457
nonparametric model, II-433
normal model, II-433
overview, II-431
Poisson model, II-433
Quick Graphs, II-437
ROC curves, II-431, II-437
usage, II-437
variables, II-433
sill, II-499
similarity measures, I-115
simple matching dichotomy coefficients, I-120
simplex, I-241
Simplex method, II-155, II-156
simulation, II-502
singular value decomposition, I-147, II-298, II-308
skewness, I-209, I-211
positive, I-4
slope, I-376
smoothing, II-472, II-624
bandwidth, II-460, II-465, II-472
biweight kernel, II-463, II-472, II-473
bootstrapping, II-475, II-477
Cauchy kernel, II-463, II-472, II-473
commands, II-474
confidence intervals, II-477
data format, II-475
discontinuities, II-470
discrete gaussian convolution, II-470
distance-weighted least squares (DWLS), II-470
Epanechnikov kernel, II-463, II-472, II-473
examples, II-476, II-477, II-480, II-489
fixed-bandwidth method, II-465, II-472
Gaussian kernel, II-463, II-472, II-473
grid points, II-471, II-472, II-489
inverse-distance, II-470
k nearest-neighbors method, II-465

kernel functions, II-460, II-462, II-472, II-473


LOESS smoothing, II-470, II-472, II-476, II-477,
II-480, II-489
Marron & Nolan canonical kernel width, II-466,
II-472

mean smoothing, II-468, II-472, II-474


median smoothing, II-468
methods, II-460, II-468, II-474
model, II-472
moving-averages, II-470
Nadaraya-Watson, II-470
nonparametric vs. parametric, II-460
overview, II-459
polynomial smoothing, II-468, II-472, II-474
Quick Graphs, II-475
residuals, II-471, II-475
robust smoothing, II-468, II-472, II-474
running-means, II-470
saving results, II-472, II-475, II-476
Shepard's smoother, II-470
step, II-470
tied values, II-471
tricube kernel, II-463, II-472, II-473
trimmed mean smoothing, II-472, II-474
triweight kernel, II-463, II-472, II-473
uniform kernel, II-463, II-472
usage, II-475
window normalization, II-466, II-472
Somers' d coefficients, I-165, I-168
sorting, I-5
Sosa statistic, 21, 66, 98
spaghetti plot, II-84
spatial statistics, II-493
algorithms, II-530
azimuth, II-509
bootstrapping, II-515
commands, II-513
data, II-515
dip, II-509
examples, II-515, II-522, II-523, II-529
grid, II-511
kriging, II-501, II-506, II-512
lags, II-509
missing data, II-530
model, II-493
models, II-508

nested models, II-500
nesting structures, II-508
nugget, II-508
nugget effect, II-499
plots, II-506
point statistics, II-506
Quick Graphs, II-515
sill, II-499, II-508
simulation, II-502, II-506
trends, II-506
variograms, II-496, II-506, II-509
Spearman coefficients, I-119, I-126, I-165
Spearman-Brown coefficient, II-605
specificities, I-332
spectral models, II-624
spherical model, II-497, II-508
split plot designs, I-487
split-half reliabilities, II-607
SSCP matrix, II-12
standard deviation, I-3, I-207, I-211
standard error of estimate, I-371
standard error of kurtosis, I-211
standard error of skewness, I-211
standard error of the mean, I-11, I-211
standardization, I-55
standardized alpha, II-605
standardized deviates, I-147, I-622
standardized values, I-6
stationarity, II-495, II-633
statistics
defined, I-1
descriptive, I-1
inferential, I-7
See descriptive statistics
stem-and-leaf plots, I-3, I-206
See descriptive statistics, I-213
step smoother, II-470
stepwise regression, I-379, I-392, I-556
stochastic processes, II-494
stress, II-122, II-142
structural equation models
See path analysis
Stuart's tau-c coefficients, I-165, I-168
studentized residuals, I-373
subpopulations, I-209
subsampling, I-18
sum of cross-products matrix, I-125
sums of squares
type I, I-391, I-396
type II, I-409
type III, I-392, I-397
type IV, I-397
surface plots, II-506
survival analysis
algorithms, II-572
bootstrapping, II-548
censoring, II-534, II-541, II-576
centering, II-573
coding variables, II-541
commands, II-547
convergence, II-578
Cox regression, II-545
data format, II-548
estimation, II-543
examples, II-549, II-552, II-553, II-557, II-560, II-562, II-567, II-569
exponential model, II-545
graphs, II-545
logistic model, II-545
log-likelihood, II-574
log-normal model, II-545
missing data, II-573
model, II-541
models, II-575
overview, II-533
parameters, II-573
plots, II-535, II-577
proportional hazards models, II-576
Quick Graphs, II-548
singular Hessian, II-575
stepwise, II-578
stepwise estimation, II-543
tables, II-545
time varying covariates, II-546
usage, II-548
variances, II-579
Weibull model, II-545
symmetric matrix, I-117

t distributions, II-584
compared to normal distributions, II-586
t tests
assumptions, II-588
Bonferroni adjustment, II-591
bootstrapping, II-592
commands, II-592
confidence intervals, II-591
data format, II-592
degrees of freedom, II-586
Dunn-Sidak adjustment, II-591
examples, II-593, II-595, II-597, II-599, II-601
one-sample, II-318, II-345, II-586, II-590
overview, II-583
paired, II-318, II-346, II-363, II-587, II-590
power analysis, II-311, II-318, II-345, II-346, II-348
Quick Graphs, II-592
separate variances, II-588
two-sample, II-318, II-348, II-369, II-587, II-589
usage, II-592
Taguchi designs, I-235, I-260
Tanimoto dichotomy coefficients, I-120, I-126
tau-b coefficients, I-126, I-168
tau-c coefficients, I-168
test item analysis
algorithms, II-620
bootstrapping, II-609
classical analysis, II-604, II-605, II-607, II-620
commands, II-609
data format, II-609
examples, II-613, II-614, II-617
logistic item-response analysis, II-606, II-608, II-620
missing data, II-621
overview, II-603
Quick Graphs, II-609
reliabilities, II-607
scoring items, II-607, II-608
statistics, II-609
usage, II-609
tetrachoric correlation, I-120, I-121, I-126
theory of signal detectability (TSD), II-431
time domain models, II-624
time series, II-623
algorithms, II-678
ARIMA models, II-627, II-650
bootstrapping, II-653
clear series, II-645
commands, II-644, II-646, II-649, II-650, II-651, II-653
data format, II-653
examples, II-654, II-655, II-656, II-658, II-661, II-662, II-663, II-665, II-666, II-670, II-676
forecasts, II-648
Fourier transformations, II-652
missing values, II-623
moving average, II-625, II-646
overview, II-623
plot labels, II-641
plots, II-640, II-641, II-642, II-643
Quick Graphs, II-653
running means, II-626, II-646
running medians, II-626, II-646
seasonal adjustments, II-637, II-649
smoothing, II-624, II-646, II-647, II-648
stationarity, II-633
transformations, II-644, II-645
trends, II-648
usage, II-653
tolerance, I-380
T-plots, II-640
trace criterion
See A-optimality
transformations, I-209
tree clustering methods, I-37
tree diagrams, I-57
triangle inequality, II-120
tricube kernel, II-463, II-472, II-473
trimmed mean smoothing, II-472, II-474
triweight kernel, II-463, II-472, II-473
Tukey pairwise comparisons test, I-389, I-432, I-493
Tukey's jackknife, I-18
twoing, I-38
two-stage least squares
algorithms, II-692
bootstrapping, II-685
commands, II-685
data format, II-685

estimation, II-681
examples, II-686, II-688, II-691
heteroskedasticity-consistent standard errors, II-683
lagged variables, II-683
missing data, II-692
model, II-683
overview, II-681
Quick Graphs, II-685
usage, II-685
Type I error, II-314
type I sums of squares, I-391, I-396
Type II error, II-314
type II sums of squares, I-397
type III sums of squares, I-392, I-397
type IV sums of squares, I-397

unbalanced designs
in analysis of variance, I-391
uncertainty coefficient, I-168
unfolding models, II-295
uniform kernel, II-463, II-472

variance, I-211
of estimates, I-237
variance component models
See mixed regression
variance of prediction, I-238
variance paths
path analysis, II-236
varimax rotation, I-333, I-337
variograms, II-496, II-506, II-515
model, II-497
vector model
in perceptual mapping, II-297
Voronoi polygons, II-493, II-504, II-506

Wald-Wolfowitz runs test, II-205
wave model, II-499
Weibull distribution, II-538
weight, II-515
weighted running smoothing, II-626
weights, I-20, I-44, I-95, I-129, I-150, I-172, I-215, I-288, I-339, I-403, I-438, I-501, I-565, I-626, II-15, II-64, II-127, II-165, II-206, II-223, II-249, II-301, II-389, II-417, II-437, II-475, II-548, II-592, II-609, II-653, II-685
Wilcoxon signed-rank test, II-202
Wilcoxon test, II-199
Wilks' lambda, I-280, I-286
Wilks' trace, I-286
Winters' three-parameter model, II-638
within-subjects differences
in analysis of variance, I-394

Yates' correction, I-164, I-168
y-intercept, I-376
Young's S-STRESS, II-123
Yule's Q, I-165, I-168
Yule's Y, I-165, I-168

z tests
one-sample, II-341
power analysis, II-311, II-341, II-342
two-sample, II-342
