19 Multivariate Analysis and Statistical Significance

INTRODUCTION
MULTIVARIATE ANALYSIS
    Regression Analysis
    Path Analysis
    Other Multivariate Techniques
STATISTICAL INFERENCE
    Tests of Statistical Significance
    The Misuse of Tests of Significance
SUMMARY
KEY TERMS
EXERCISES
SUGGESTED READINGS
REFERENCES
INTRODUCTION
[Figure 19.1. The equation for a straight line.]

Y = a + bX.
Figure 19.1 is an illustration of the interpreta-
tion of the constants a and b in this equation.
We find that a is the value Y takes when X is
equal to zero. It is referred to as the Y-
intercept because it is the value of Y at the
point where the straight line crosses the Y-axis.
The constant b is equal to the slope of this
line. If we move an arbitrary distance along
the line described by this equation, recording
the amount that Y has changed (call it ΔY)
and the amount that X has changed (call it ΔX), and
then divide the change in Y by the change in X,
the result is the slope of the line (i.e., b = ΔY/ΔX).
In this example from elementary algebra,
the Y values refer to points along the straight
line defined by the equation Y = a + bX. This
formula
does not account for any Y values that do not
fall
on this line. For any arbitrary value of X, we
can find the corresponding Y value that
satisfies the equation by locating the Y value
on the straight line that falls directly over the
specified X value (i.e., we would determine
that point at which a line constructed
perpendicular to the X-axis from the specified
X value intersects the straight line given by the
equation Y = a + bX).
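To make the algebra concrete, here is a minimal Python sketch with invented values for a and b. It evaluates Y = a + bX, confirms that a is the value of Y at X = 0, and recovers the slope b as ΔY/ΔX between two arbitrary points on the line:

# Invented constants for illustration: intercept a and slope b.
a, b = 2.0, 0.5

def line(x):
    # The Y value on the straight line Y = a + bX for a given X.
    return a + b * x

print(line(0))                  # 2.0: the Y-intercept a

# Move an arbitrary distance along the line and divide the change
# in Y by the change in X; the result is the slope.
x1, x2 = 3.0, 10.0
delta_y = line(x2) - line(x1)   # change in Y
delta_x = x2 - x1               # change in X
print(delta_y / delta_x)        # 0.5: the slope b

Any pair of distinct X values gives the same result, because the line is straight.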
[Figure 19.2. Fitting a least squares regression line to a set of points in a scattergram.]

Simple regression is a procedure for fitting a straight line to a set of points in a scattergram, as shown in Figure 19.2. The fitted line is described by the equation Yˆ = a + bX, where Yˆ ("Y-hat") is the value of Y that the equation predicts for
a given respondent's X value. The discrepancy between the actual Y value and the estimated Yˆ value represents prediction error. When the Y values tend to cluster very close to the regression line, the Yˆ and Y values will be very similar, and the error in prediction will be low. However, when the Y values tend to deviate markedly from the regression line, the Y and Yˆ values will be quite different, and the error in prediction will be high.

Multiple regression is an extension of simple regression: Instead of one predictor, we include two or more predictors in a single regression equation. When there are four predictors, the equation is as follows:

Yˆ = a + b1X1 + b2X2 + b3X3 + b4X4.

The b values in multiple regression are referred to as partial-regression coefficients.1 These coefficients give the change in the dependent variable (in whatever units the dependent variable is measured) that we would estimate for a one-unit change in the specified predictor (in whatever units the predictor is measured).

The MULTIPLE CORRELATION COEFFICIENT (R) is used to summarize the accuracy of our prediction equation.2 Recall that the difference between Y and Yˆ represents error in our prediction. If we have selected a set of predictors that yield accurate estimates of Y, then the difference between Y and Yˆ values will be small, and the multiple correlation will be high. If, however, we have selected a set of predictors that yields poor estimates of Y, then the difference between Yˆ and Y values will tend to be larger, and the multiple correlation will be low. The multiple correlation ranges from .00 (when the independent variables in no way help to predict Y) to 1.00 (when the independent variables predict Y with complete accuracy). The multiple-correlation coefficient squared (R2) gives the proportion of the variance in the dependent variable that is accounted for by the set of predictors included in the regression equation. If R = .50, then R2 = .25; we would conclude that the predictors being considered account for 25 percent of the variance in the dependent variable.

Let us assume that our goal is to predict the grade-point average for 1,000 seniors who have just graduated from college. Suppose we decide to use the following four predictors: high

1 The b values are referred to as partial-regression coefficients because they are estimates of the change in the dependent variable for a one-unit change in the specified predictor after we statistically control for the effects of the other predictors in the equation.
2 R is equal to the Pearson correlation between the Y and Yˆ values. The subscripted version of the multiple-correlation coefficient is designated symbolically as "R1.2345," where the subscript 1 refers to the dependent variable X1 and the subscripts 2, 3, 4, and 5 refer to the predictors X2, X3, X4, and X5. There will be as many numbers following the period in the subscript as there are predictors. The notation system has been changed here so that the dependent variable referred to in the text as Y is referred to here as X1. For the subscripted multiple-correlation coefficient, as for several other multivariate statistics (e.g., the partial-correlation coefficient), the notation is simpler if we refer to our variables as X1, X2, X3, and so on, rather than as Y, X1, and X2.
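To see these quantities computed end to end, here is a minimal Python sketch. It fits a four-predictor equation to synthetic data with NumPy (the variables are random stand-ins, not the GPA example's actual predictors, and the coefficient values are arbitrary) and computes R directly from its definition as the Pearson correlation between the Y and Yˆ values:

import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins for four predictors and a dependent variable.
X = rng.normal(size=(n, 4))
Y = 1.0 + X @ np.array([0.8, -0.5, 0.3, 0.1]) + rng.normal(size=n)

# Least-squares fit of Yhat = a + b1*X1 + b2*X2 + b3*X3 + b4*X4.
design = np.column_stack([np.ones(n), X])   # column of 1s carries the intercept a
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
a, b = coef[0], coef[1:]                    # intercept and partial-regression coefficients
Y_hat = design @ coef

# R is the Pearson correlation between Y and Yhat; R^2 is the share of
# the variance in the dependent variable accounted for by the predictors.
R = np.corrcoef(Y, Y_hat)[0, 1]
print(f"a = {a:.2f}, b = {np.round(b, 2)}, R = {R:.2f}, R^2 = {R**2:.2f}")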
Xˆ3 = a3 + b1X1 + b2X2    (3)
Xˆ4 = a4 + b1X1    (4)
reject the null hypothesis. There is a distinction between concluding that the difference in means found in our sample could result from chance (sampling error) alone if the population means were exactly equal, and concluding that the population means are in fact identical.
A test of significance is a procedure for deciding how likely it is that the relationship we have found in the sample is due to sampling error when there is no relationship between the variables in the population. It cannot be used to prove that there actually is a relationship in the population. In addition, it cannot prove that there actually is no relationship in the population. A test of significance can be used only to indicate how likely we would be to obtain the relationship we find in the sample if there were no relationship in the population.
The t-TEST is a test of significance that we use
when we are interested in comparing the means
for two samples or two categories of the same
sample. The null hypothesis for a t-test is that the
population means are equal (and therefore, in the
example above, that the mean incomes for the
Democrats and Republicans are equal). When we
want to compare the means for more than two
groups in the same test of significance, we use the
F-TEST. The null hypothesis for the F-test is that
the means for all the groups being compared are equal.
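As a sketch of how these two tests look in practice, the following Python fragment runs both with SciPy's standard routines on made-up income samples; the group labels, means, and sample sizes are purely illustrative:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Made-up income samples for two groups (e.g., Democrats and Republicans).
dem_income = rng.normal(52_000, 8_000, size=120)
rep_income = rng.normal(54_000, 8_000, size=110)

# t-test: the null hypothesis is that the two population means are equal.
t_stat, p_value = stats.ttest_ind(dem_income, rep_income)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# F-test (one-way ANOVA): the null hypothesis is that the means of all
# the groups being compared are equal. Add a third made-up group.
ind_income = rng.normal(53_000, 8_000, size=90)
f_stat, p_value = stats.f_oneway(dem_income, rep_income, ind_income)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")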
In such a situation, our sample (N = 50) is the population, and so it is inappropriate to compute a test of statistical significance. If the correlation between these variables is .30, then .30 is the correlation in the population, and it is meaningless to compute a test of significance of the null hypothesis that the correlation in the population is zero. In short, if we already have the population, there is no need to make the inferences that tests of significance are designed to help us make.

Researchers sometimes use a test of significance to generalize beyond the population from which the sample was drawn. Suppose we have a simple random sample of the seniors at a college, and we find that the Jews in the sample are more likely to support gun control than the Catholics. If this difference turns out to be statistically significant, we can generalize to all the seniors at the college. It might seem plausible that a similar trend would hold for seniors at other colleges, for all college students, or for the adult population in general. However, we have no grounds for making such a generalization based on the data we have.

Statistical significance is often confused with SUBSTANTIVE SIGNIFICANCE, that is, whether some piece of data is important to us. It is common for researchers to suggest that a finding is important because it is statistically significant. While it is generally reasonable to discount findings that are not statistically significant, statistical significance per se does not make a relationship important. We can often find statistically significant relationships between variables that are causally unrelated, variables that are alternative measures of the same thing (for instance, the relationship between age and year of birth), and variables that are sociologically uninteresting (such as the relationship between weight and waist measurement). There can also be relationships based on very large samples that are statistically significant but so weak as to be substantively unimportant. A correlation can be very close to zero (even .01) and still be statistically significant if the sample size is large enough.

Another misuse of significance tests is illustrated by researchers who compute a very large number of tests and then prepare research reports based only on the statistically significant relationships. Such researchers are capitalizing on sampling error and may prepare an entire report around a set of correlations that could not be replicated. Suppose a researcher computes 1,000 correlation coefficients and tests each for statistical significance. Even if all of these correlations in the population are zero, we would expect on the basis of sampling error that 50 (or 5 of every 100) of these correlations would be significant at the .05 level. Thus, if the researcher looks through the 1,000 correlations and bases the report on the 50 or so that are statistically significant, the findings reported run a risk of being highly unreliable. Most of these correlations will have resulted from sampling error alone. For this reason, it will not be possible to replicate the findings of the study.

There are different tests of significance for various types of data. Some tests are appropriate for nominal-level data, some are appropriate for ordinal-level data, and still others are appropriate for interval- and ratio-level data. A common error is to use a test of significance appropriate for interval-level data when the data are only ordinal level. A typical example of this error is the computation of a test of significance for a Pearson correlation between two ordinal-level variables.
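The arithmetic behind the multiple-testing trap described above (about 5 of every 100 truly null correlations crossing the .05 threshold) is easy to verify by simulation. A minimal sketch, assuming nothing beyond NumPy and SciPy:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_obs, n_tests = 100, 1000
significant = 0

for _ in range(n_tests):
    # Two variables that are unrelated in the population (true r = 0).
    x = rng.normal(size=n_obs)
    y = rng.normal(size=n_obs)
    r, p = stats.pearsonr(x, y)
    if p < 0.05:
        significant += 1

# Roughly 50 of the 1,000 tests come out "significant" at the .05 level
# even though every population correlation is zero.
print(f"{significant} of {n_tests} correlations significant at .05")

Across runs the count hovers around 50, which is precisely the trap: a report built only from those "significant" correlations could not be replicated.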
SUMMARY
Regression is one of the most commonly used statistical procedures in social research. In simple regression analysis, we consider only two variables, one dependent variable and one independent variable (the predictor). The regression line is the line through the set of data points being considered that minimizes the sum of the square of the deviations from the line. For any other line through this same set of points, the sum is greater.
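The least-squares property just stated is easy to check numerically: fit the line, then verify that perturbing the intercept or slope always increases the sum of squared deviations. A minimal sketch with invented data points:

import numpy as np

# A handful of invented data points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Least-squares estimates of the slope (b) and intercept (a).
b, a = np.polyfit(x, y, 1)

def sum_sq(a_, b_):
    # Sum of squared deviations of the points from the line Y = a_ + b_*X.
    return float(np.sum((y - (a_ + b_ * x)) ** 2))

best = sum_sq(a, b)
# Any other line through the same set of points yields a greater sum.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert sum_sq(a + da, b + db) > best
print(f"minimum sum of squared deviations = {best:.3f}")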
Multiple regression, an extension of simple regression to include two or more predictors, is the most widely used form of regression analysis. The coefficients that result when we do multiple regression are called partial-regression coefficients, or unstandardized partial-regression coefficients. These specify the number of units of difference in the dependent variable we would