Hotelling 1936
1. The Correlation of Vectors. The Most Predictable Criterion and the Tetrad
Difference. Concepts of correlation and regression may be applied not only to
ordinary one-dimensional variates but also to variates of two or more dimensions.
Marksmen side by side firing simultaneous shots at targets, so that the deviations
are in part due to independent individual errors and in part to common causes
such as wind, provide a familiar introduction to the theory of correlation; but only
the correlation of the horizontal components is ordinarily discussed, whereas the
complex consisting of horizontal and vertical deviations may be even more interest-
ing. The wind at two places may be compared, using both components of the
velocity in each place. A fluctuating vector is thus matched at each moment with
another fluctuating vector. The study of individual differences in mental and
physical traits calls for a detailed study of the relations between sets of correlated
variates. For example the scores on a number of mental tests may be compared
with physical measurements on the same persons. The questions then arise of
determining the number and nature of the independent relations of mind and body
shown by these data to exist, and of extracting from the multiplicity of correlations
in the system suitable characterizations of these independent relations. As another
* Presented before the American Mathematical Society and the Institute of Mathematical Statistics
at Ann Arbor, September 12, 1935.
Biometrika XXVIII
322 Relations between Two Sets of Variates
example, the inheritance of intelligence in rats might be studied by applying not
one but n different mental tests to N mothers and to a daughter of each. Then
n(2n − 1) correlation coefficients could be determined, taking each of the mother-
daughter pairs as one of the N cases. From these it would be possible to obtain
a clearer knowledge as to just what components of mental ability are inherited
than could be obtained from any single test.
Much attention has been given to the effects of the crops of various agricultural
commodities on their respective prices, with a view to obtaining demand curves.
The standard errors associated with such attempts, when calculated, have usually
been found quite excessive. One reason for this unfortunate outcome has been the
was determined exactly by Wilks under the hypothesis that the distribution is
normal, with no population correlation between any variate in one set and any in
the other. Wilks also found distributions of analogous functions of three or more
sets, and of other related statistics.
The statistic (1.1) is invariant under internal linear transformations of either
set, as will be proved in Section 4. Another example of such a statistic is provided
by the maximum multiple correlation with either set of a linear function of the
other set, which has been the subject of a brief study§. This problem of finding,
not only a best predictor among the linear functions of one set, but at the same
time the function of the other set which it predicts most accurately, will be solved
in Section 3 in a more symmetrical manner. When the influence of these two
linear functions is eliminated by partial correlation, the process may be repeated
with the residuals. In this way we may obtain a sequence of pairs of variates, and
of correlations between them, which in the aggregate will fully characterize the
invariant relations between the sets, in so far as these can be represented by
correlation coefficients. They will be called canonical variates and canonical
correlations. Every invariant under general linear internal transformations, such
for example as z, will be seen to be a function of the canonical correlations.
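In modern matrix notation the canonical correlations described here can be computed by whitening each set internally and taking singular values of the cross-covariance. The sketch below is mine (NumPy, simulated data), not Hotelling's own procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))                                    # first set: s = 3
Y = X @ rng.standard_normal((3, 4)) + rng.standard_normal((500, 4))  # second set: t = 4

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
S11 = Xc.T @ Xc / len(X)
S22 = Yc.T @ Yc / len(X)
S12 = Xc.T @ Yc / len(X)

# Whiten each set internally; the canonical correlations are then the
# singular values of the whitened cross-covariance matrix.
W1 = np.linalg.inv(np.linalg.cholesky(S11))
W2 = np.linalg.inv(np.linalg.cholesky(S22))
rho = np.linalg.svd(W1 @ S12 @ W2.T, compute_uv=False)  # descending order
```

Because any internal linear transformation of a set is absorbed by the whitening step, the values in rho are invariant in exactly the sense claimed for the canonical correlations.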
* Annals of Mathematical Statistics, Vol. II, pp. 360–378, August, 1931.
† L. G. M. Baas-Becking, Henrietta van de Sande Bakhuyzen, and Harold Hotelling, "The Physical
State of Protoplasm" in Verhandelingen der Koninklijke Akademie van Wetenschappen te Amsterdam,
Second Section, Vol. V (1928).
‡ "Certain Generalizations in the Analysis of Variance" in Biometrika, Vol. XXIV, pp. 471–494,
November, 1932.
§ Harold Hotelling, "The Most Predictable Criterion" in Journal of Educational Psychology,
Vol. XXVI, pp. 139–142, February, 1935.
Observations of the values taken in N cases by the components of two vectors
constitute two matrices, each of N columns. If each vector has s components, then
each matrix has s rows. In this case we may consider the correlation coefficient
between the s-rowed determinants in one matrix and the corresponding deter-
minants in the other. Since a linear transformation of the variates in either set
effects a linear transformation of the rows of the matrix of observations, which
merely multiplies all these determinants by the same constant, it is evident that
the correlation coefficient thus calculated is invariant in absolute value. We shall
call it the vector correlation or vector correlation coefficient, and denote it by q.
When s = 2, if we call the variates of one set x_1, x_2, and those of the other x_3, x_4,
and r_{ij} the correlation of x_i with x_j, then it is easy to deduce with the help of the
(2) The assumption is made implicitly that the distribution of the tetrad is
normal, though this cannot possibly be the case, since the range is finite *.
(3) Since the standard error formulae involve unknown population values, these
are in practice replaced by sample values. No limit is known for the errors com-
mitted in this way.
Now it is evident that to test whether the population value of the tetrad is
zero—the only value of interest—is the same thing as to test the vanishing of any
multiple of the tetrad by a finite non-vanishing quantity. Wishart† considered the
tetrad of covariances, which is simply the product of the tetrad of correlations by
the four standard deviations. For this function he found exact values of the mean
then the covariances of the new variates are expressed in terms of those of the old
by the equations
σ′_{αβ} = Σ_{γδ} c_{αγ} c_{βδ} σ_{γδ} (3.1),
obtained by substituting the equations above directly in the definition
σ_{αβ} = E x_α x_β.
The problem is thus that of the invariants of a pair of quadratic forms
Σ_{αβ} σ_{αβ} a_α a_β,   Σ_{ij} σ_{ij} b_i b_j
in two separate sets of variables, and of a bilinear form
Σ_{αi} σ_{αi} a_α b_i
in both sets, under real linear non-singular transformations of the two sets
separately.
HAROLD HOTELLING 327
Sample covariances are also transformed by the formula (3.1). The ensuing
analysis might therefore equally well be carried out for a sample instead of for
the population. Correlations might be used instead of covariances, either for the
sample or for the population, by introducing appropriate factors, or by assuming
the standard deviations to be unity.
We shall assume that there is no fixed linear relation among the variates, so
that the determinant of their covariances or correlations is not zero. This implies
that there is no fixed linear relation among any subset of them; consequently
every principal minor of the determinant of s + t rows is different from zero.
If we consider a function u = Σ_α a_α x_α of the variates in the first set and a function
v = Σ_i b_i x_i of those in the second, each of unit variance, the conditions that their
correlation R be stationary are
Σ_i σ_{αi} b_i − λ Σ_β σ_{αβ} a_β = 0 (α = 1, 2, …, s) (3.4),
Σ_α σ_{αi} a_α − μ Σ_j σ_{ij} b_j = 0 (i = s + 1, …, s + t) (3.5).
Here λ and μ are Lagrange multipliers. Their interpretation will be evident upon
multiplying (3.4) by a_α and summing with respect to α, then multiplying (3.5) by
b_i and summing with respect to i. With (3.2) and (3.3), this process gives
λ = μ = R.
The s + t homogeneous linear equations (3.4) and (3.5) in the s + t unknowns
a_α and b_i will determine variates u and v making R a maximum, a minimum, or
otherwise stationary, if their determinant vanishes. Since λ = μ, this condition is
| −λσ_{αβ}   σ_{αi} |
| σ_{iα}   −λσ_{ij} | = 0 (3.6).
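Condition (3.6) is a generalized eigenvalue problem, and its roots can be checked numerically to occur in pairs ±ρ, the positive roots being the canonical correlations. The sketch below is mine (NumPy, simulated data), not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
s, t = 2, 3
Z = rng.standard_normal((400, s + t))
Z[:, s:s + 2] += Z[:, :2]                 # correlate the two sets
S = np.cov(Z, rowvar=False)
S11, S12, S22 = S[:s, :s], S[:s, s:], S[s:, s:]

# The determinantal condition asks for det of
# [[-lam*S11, S12], [S21, -lam*S22]] to vanish, i.e. phi v = lam * psi v
# with the block matrices below.
phi = np.block([[np.zeros((s, s)), S12], [S12.T, np.zeros((t, t))]])
psi = np.block([[S11, np.zeros((s, t))], [np.zeros((t, s)), S22]])
lam = np.sort(np.linalg.eigvals(np.linalg.solve(psi, phi)).real)
```

The spectrum is symmetric about zero, reflecting the remark below that each canonical pair u, v yields a second solution with correlation of opposite sign.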
This symmetrical determinant is the discriminant of a quadratic form φ − λψ,
where
φ = 2 Σ_{αi} σ_{αi} a_α b_i,   ψ = Σ_{αβ} σ_{αβ} a_α a_β + Σ_{ij} σ_{ij} b_i b_j.
Here ψ is positive definite because it is the sum of two positive definite quadratic
forms. Consequently* all the roots of (3.6) are real. Moreover the elementary
divisors are all of the first degree†. This means that the matrix of the determinant
in (3.6) is reducible, by transformations which do not affect either its rank or its
linear factors, to a matrix having zeros everywhere except in the principal diagonal,
while the elements in this diagonal are polynomials
E_1(λ), E_2(λ), …, E_{s+t}(λ),
with either set of a disposable linear function of the other set. If u, v are canonical
variates corresponding to ρ_γ, then the pair u, −v is associated with the root −ρ_γ.
If a pair of canonical variates corresponding to a root ρ_γ is
u_γ = Σ_α a_{αγ} x_α,   v_γ = Σ_i b_{iγ} x_i (3.7),
they satisfy
Σ_i σ_{αi} b_{iγ} = ρ_γ Σ_β σ_{αβ} a_{βγ} (3.8),
Σ_α σ_{αi} a_{αγ} = ρ_γ Σ_j σ_{ij} b_{jγ} (3.9).
Let
u_δ = Σ_α a_{αδ} x_α,   v_δ = Σ_i b_{iδ} x_i (3.10),
be canonical variates associated with a canonical correlation ρ_δ. Among the four
variates (3.7) and (3.10) there are six correlations. Apart from ρ_γ and ρ_δ these are
E u_γ u_δ, E v_γ v_δ, E u_γ v_δ and E u_δ v_γ.
We shall prove that these last four are all zero. Multiply (3.8) by a_{αδ} and sum with
respect to α. The result, with the help of (3.11), may be written
If ρ_γ² ≠ ρ_δ², the last two equations show that E u_γ u_δ = E v_γ v_δ = 0. Hence, by (3.12)
and (3.13), E u_γ v_δ and E u_δ v_γ vanish. Thus all the correlations among canonical
variates are zero except those between the canonical variates associated with the
same canonical correlation.
If ρ is a root of multiplicity m, it is possible by well-known processes to obtain
m solutions of the linear equations such that, if
a_{αγ}, b_{iγ}   and   a_{αδ}, b_{iδ}
are any two of these solutions, they will satisfy the orthogonality condition
Σ_α a_{αγ} a_{αδ} + Σ_i b_{iγ} b_{iδ} = 0 (3.15).
There is no loss of generality in supposing that each of the original variates was
uncorrelated with the others in the same set and had unit variance. In this case
(3.15) is equivalent to
E u_γ u_δ + E v_γ v_δ = 0,
where u_γ, v_γ, u_δ, v_δ are given by (3.7) and (3.10). For this case of equal roots we
have also from (3.14),
ρ (E u_γ u_δ − E v_γ v_δ) = 0.
If ρ ≠ 0, the last two equations show that E u_γ u_δ = E v_γ v_δ = 0, and then from (3.12)
and (3.13) we have that E u_γ v_δ = E u_δ v_γ = 0. These correlations also vanish if ρ = 0,
for then the right-hand members of (3.8) and (3.9) vanish, leaving two distinct sets
of equations in disjunct sets of unknowns. The solutions may therefore be chosen
so that the two sums in (3.15) vanish separately.
of correlations between the two sets is invariant under non-singular linear trans-
formations of either set. Transformation to canonical variates reduces this
matrix to
ρ_1  0  …  0  0  …  0
0  ρ_2  …  0  0  …  0
⋮
0  0  …  ρ_s  0  …  0
Suppose now that new variates x_1′, …, x_s′ are defined in terms of the old
variates in the first set by the s equations
x_α′ = Σ_β c_{αβ} x_β (α = 1, 2, …, s).
The new covariances are then expressed in terms of the old by (3.1). The
determinant of these new covariances, which we shall denote by A′, may by (3.1) and
the multiplication theorem of determinants be expressed as the product of three
determinants, of which two equal the determinant c = |c_{αβ}| of the coefficients of
the transformation, while the third is A. If the variates of the second set are
subjected to a transformation of determinant d, the determinants of covariances
among the new variates analogous to those defined above are readily seen in this
way to equal
A′ = c²A,   B′ = d²B,   C′ = c²d²C,   D′ = c²d²D (4.1).
it is evident that q may be positive for some samples of a particular set of variates,
and negative for other samples. It may sometimes be advantageous, as in testing
whether two samples arose from the same population, to retain the sign of q for
each sample, since this provides evidence in addition to that given by the absolute
value of q. But unless otherwise stated we shall always regard q as the positive
root of q². Likewise, Q, √z and √Z will denote the positive roots unless otherwise
specifically indicated in each case. A transformation of either set will reverse the
sign of the algebraic expression (4.4) if the determinant of the transformation is
negative. This will be true of a simple interchange of two variates; for example,
x_1′ = x_2, x_2′ = x_1 has the determinant −1. On the other hand, the sign is conserved
if the determinant of the transformation is positive. Such considerations apply
whenever s = t.
Since the vector correlation and alienation coefficients are invariants, they
may be computed on the assumption that the variates are canonical. In this case
A = B = 1, and D is given by (3.16). To obtain C we replace the first s 1's in
the principal diagonal of (3.16) by 0's. It then follows that
This confirms that the value of Q² given in (4.2) is positive. In this way the vector
correlation and alienation coefficients are expressible in terms of the canonical
correlations by the equations
Q = ±ρ_1 ρ_2 ⋯ ρ_s,   Z = (1 − ρ_1²)(1 − ρ_2²) ⋯ (1 − ρ_s²) (4.5),
respectively of x_1, x_2, …, x_s obtained from x_{s+1}, …, x_{s+t} by least squares, and let the
regression equations be
ξ_α = Σ_i β_{αi} x_i (4.7).
The appropriateness of Q as a generalization of the correlation coefficient, and of √Z
as a generalization of the alienation coefficient, will be apparent from the following
theorem:
The ratio of the generalized variance of ξ_1, …, ξ_s to that of x_1, …, x_s is Q².
The ratio of the generalized variance of x_1 − ξ_1, x_2 − ξ_2, …, x_s − ξ_s to that of
x_1, …, x_s is Z.
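Both parts of the theorem are determinant identities and easy to verify numerically. In this sketch (my notation, simulated data) Q² and Z are formed from the canonical correlations as in (4.5) and compared with the two generalized-variance ratios:

```python
import numpy as np

rng = np.random.default_rng(2)
s, t, n = 2, 2, 5000
X = rng.standard_normal((n, s))
Y = 0.6 * X + 0.8 * rng.standard_normal((n, t))   # s = t = 2 for simplicity

S11 = np.cov(X, rowvar=False)
S22 = np.cov(Y, rowvar=False)
S12 = np.cov(X.T, Y.T)[:s, s:]

# Canonical correlations via whitening.
L = np.linalg.cholesky(S11)
M = np.linalg.cholesky(S22)
rho = np.linalg.svd(np.linalg.inv(L) @ S12 @ np.linalg.inv(M).T, compute_uv=False)

Q2 = np.prod(rho ** 2)            # vector correlation squared
Zc = np.prod(1 - rho ** 2)        # vector alienation coefficient

# Least-squares predictions xi of the first set from the second.
B = S12 @ np.linalg.inv(S22)      # regression coefficients
S_xi = B @ S22 @ B.T              # covariance of the predictions
S_res = S11 - S_xi                # covariance of the residuals
ratio_pred = np.linalg.det(S_xi) / np.linalg.det(S11)
ratio_res = np.linalg.det(S_res) / np.linalg.det(S11)
```

The two ratios agree with Q² and Z exactly, not merely in the limit, because the same sample covariances enter both sides of each identity.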
This theorem is expressed in terms of the population, but an exactly parallel
(1/N) Σ_u (x_{Lu} − x̄_L)(x_{Mu} − x̄_M) (5.2),
where x̄_L and x̄_M are the sample means. To simplify the later work, we introduce
the pseudo-observations, x_{iu}′, defined in terms of the observations by the equations
* For a proof of approach to normality for a general class of statistics including those with which
we deal, cf. Doob, op. cit.
where the quantities c_{fu}, independent of i and therefore the same for all the
variates x_i, are the coefficients of an orthogonal transformation, such that
Σ_u c_{fu} c_{gu} = δ_{fg},   c_{Nu} = 1/√N,
where δ_{fg} is the Kronecker delta, equal to unity if f = g, but to zero if f ≠ g. The
coefficients c_{fu} may be chosen in an infinite variety of ways consistently with these
requirements, but will be held fixed throughout the discussion. Since linear
E x_{Lf}′ x_{Mg}′ = δ_{fg} σ_{LM} (5.7).
The equations (5.3) may, on account of their orthogonality, be solved in the form
x_{iu} = Σ_f c_{fu} x_{if}′.
Therefore, by (5.5),
Σ_u x_{Lu} x_{Mu} = Σ_f x_{Lf}′ x_{Mf}′.
Substituting this result and (5.8) in (5.2), we find that the final term of the sum
cancels out. Introducing therefore the symbol S for summation from 1 to N − 1
with respect to the second subscript, and putting also
n = N − 1 (5.9),
we have the compact result
(1/n) S x_{Lf}′ x_{Mf}′ (5.10).
Since the pseudo-observations are normally distributed with the covariances (5'7)
and zero means, they have exactly the same distribution as the observations in
a random sample of n from the original population. The equivalence of the mean
product (5.10) with the sample covariance (5.2) establishes the important principle
that the distribution of covariances in a sample of n + 1 is exactly the same as the
distribution of mean products in a sample of n, if the parent population is normally
distributed about zero means. Use of this principle will considerably simplify the
discussions of sampling.
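One explicit choice of the orthogonal transformation is a Helmert matrix, whose last row is constant; with it the principle can be checked directly (the construction and variable names here are mine, offered as an illustration):

```python
import numpy as np

N = 6
# Orthogonal N x N matrix whose last row is the constant 1/sqrt(N);
# rows 1..N-1 are the classical Helmert contrasts.
C = np.zeros((N, N))
for f in range(1, N):
    C[f - 1, :f] = 1.0
    C[f - 1, f] = -f
    C[f - 1] /= np.sqrt(f * (f + 1))
C[N - 1] = 1.0 / np.sqrt(N)

x = np.array([3.1, 0.2, -1.4, 2.2, 0.5, 1.9])
y = np.array([1.0, -0.7, 0.3, 2.5, -1.1, 0.8])
xp, yp = C @ x, C @ y        # pseudo-observations

n = N - 1
# The last pseudo-observation carries the mean; the first n of them
# reproduce the sum of products about the sample means.
```

The transformed values behave, for covariance purposes, like a sample of n = N − 1 from the original population, which is exactly the simplification used in the sequel.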
An important extension of this consideration lies in the use of deviations, not
merely from.sample means, but from regression equations based on other variates.
In such cases the number of degrees of freedom n to be used is the difference
The moments of the distribution are the derivatives of the characteristic function,
evaluated for t_1 = t_2 = … = 0. From the fourth derivative with respect to t_1, t_2, t_3
and t_4, it is easy to show in this way that
(5.16),
where dσ_{AB} denotes the deviation of the sample covariance from σ_{AB}, and the
summations are over all values of A and B from 1 to s + t. Then differentiating (5.15) we have
Σ_{αβ} (2σ_{αβ} a_α da_β + a_α a_β dσ_{αβ}) = 0,   Σ_{ij} (2σ_{ij} b_i db_j + b_i b_j dσ_{ij}) = 0 (5.17).
Let us now suppose that the variates are in the population canonical. This
assumption does not entail any loss of generality as regards ρ_1, since ρ_1 is an
invariant under transformations of the variates of either set. Since a_α is the
coefficient of x_α in the expression for one of the canonical variates, which we take
to be u_1, we have in the population a_1 = 1, a_2 = a_3 = … = a_s = 0. In the same way,
b_{s+1} = 1, b_{s+2} = … = b_{s+t} = 0.
Also, since the covariances among canonical variates are the elements of the
determinant in (3.16), we have
σ_{αβ} = δ_{αβ},   σ_{ij} = δ_{ij},   σ_{αi} = ρ_α δ_{α, i−s},
the Kronecker deltas being equal to unity if the two subscripts are equal, and
otherwise vanishing. When these special values of the a's, b's and σ's are substi-
tuted in (5.17) most of the terms drop out, leaving the simple equations
the resulting variates have a distribution which, as n increases, approaches the normal
distribution of p independent variates of zero means and unit standard deviations.
For small samples there will be ambiguities as to which root of the determinantal
equation for the sample is to be regarded as approximating a particular canonical
correlation of the population. As n increases, the sample roots will separately
cluster more and more definitely about individual population roots.
If a canonical correlation py is zero, and if s = t, the foregoing result is
applicable with the qualification that sample values r_γ approximating ρ_γ must not
all be taken positive, but must be assigned positive and negative values with equal
probabilities. Alternatively, if we insist on taking all the sample canonical correla-
E dQ dZ = −QZ Σ (1 − ρ²) (5.28).
For the case s = 2 these formulae reduce with the help of (4.5) to
1 − Z + Q²
If in the s equations (3.4) we regard λa_1, λa_2, …, λa_s as the unknowns, we may
solve for them in terms of the b's by the methods appropriate for solving normal
equations. Indeed, the matrix of the coefficients of the unknowns is symmetrical;
and in the solving process it is only necessary to carry along, instead of a single
column of right-hand members, t columns, from which the coefficients of b_{s+1}, …, b_{s+t}
in the expressions for a_1, …, a_s are to be determined. The entries initially placed
* Harold Hotelling, "Analysis of a Complex of Statistical Variables into Principal Components" in
Journal of Educational Psychology, Vol. XXIV, pp. 417–441 and 498–520 (September and October,
1933), Section 4.
in these columns are of course the covariances between the two sets. Let the
solution of these equations consist of the s expressions
λa_α = Σ_i g_{αi} b_i (α = 1, 2, …, s) (6.3).
In exactly the same way the t equations (3.5), with μ replaced by λ, may be solved
for λb_{s+1}, …, λb_{s+t} in the form
λb_i = Σ_α h_{iα} a_α (i = s + 1, …, s + t) (6.4).
then the true coefficients of the first pair of canonical variates are ma_1′, …, ma_s′,
mb_{s+1}′, …, mb_{s+t}′.
In the iterative process, if a_1, …, a_s represent trial values at any stage, those
at the next stage will be proportional to
a_α′ = Σ_β k_{αβ} a_β (6.8).
Another application of the process gives
a_α″ = Σ_β k_{αβ} a_β′,
whence, substituting, we have
a_α″ = Σ_β k*_{αβ} a_β,
provided we put
k*_{αβ} = Σ_γ k_{αγ} k_{γβ}.
The last equation is equivalent to the statement that the matrix K* of the
coefficients k*_{αβ} is the square of the matrix K of the k_{αβ}. It follows therefore that
one application of the iterative process by means of the squared matrix is exactly
equivalent to two successive applications with the original matrix. This means
that if at the beginning we square the matrix only half the number of steps will
subsequently be required for a given degree of accuracy.
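The doubling argument is easy to demonstrate on any matrix with a dominant root; the example below (an arbitrary symmetric matrix, my notation, not data from the paper) shows one multiplication by the squared-squared matrix doing the work of four ordinary iterations:

```python
import numpy as np

rng = np.random.default_rng(3)
K = rng.random((4, 4))
K = K @ K.T                    # an arbitrary symmetric matrix with a dominant root

a = np.ones(4)                 # trial vector
K2 = K @ K                     # square once: K^2
K4 = K2 @ K2                   # square again: K^4

one_step = K4 @ a                         # one iteration with K^4 ...
four_steps = K @ (K @ (K @ (K @ a)))      # ... equals four with K itself
```

Iterating with K^(2^m) therefore reaches a given accuracy in a fraction of the multiplications, at the cost of the m initial matrix squarings.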
The number of steps required may again be cut in half if we square K², for
with the resulting matrix K⁴ one iteration is exactly equivalent to four with the
original matrix. Squaring again we obtain K⁸, with which one iteration is
equivalent to eight, and so on. This method of accelerating convergence is also
applicable to the calculation of principal components*. It embodies the root-
squaring idea of Graeffe's method. The process may fail only if the determinantal equation
| k_{αβ} − ω δ_{αβ} | = 0
has multiple roots. If we assume that the roots ω_1, ω_2, …, ω_s are all simple, and
regard a_1, …, a_s as the homogeneous coordinates of a point in s − 1 dimensions
which is moved by the collineation (6.8) into a point (a_1′, …, a_s′), we know† that
there exists in this space a transformed system of coordinates such that the
collineation is represented in terms of them by
a_γ′ = ω_γ a_γ (γ = 1, 2, …, s).
Continuation of this process means, if ω_1 is the root of greatest absolute value, that
* Another method of accelerated iterative calculation of principal components is given by T. L.
Kelley in Essential Traits of Mental Life, Cambridge, Mass., 1935. A method similar to that given above
is applied to principal components by the author in Psychometrika, Vol. I, No. 1 (1936).
† Bôcher, p. 298.
the ratio of the first transformed coordinates to any of the others increases in
geometric progression. Consequently the moving point approaches as a limit
the invariant point corresponding to this greatest root. Therefore the ratios of the
trial values of a_1, …, a_s will approach those among the coefficients in the expression
for the canonical variate corresponding to the greatest canonical correlation. Thus
the iterative process is seen to converge, just as in the determination of principal
components.
After the greatest canonical correlation and the corresponding canonical variates
are determined, it is possible to construct a new matrix of covariances of deviations
from these canonical variates. When the iterative process is applied to this new
matrix, the second largest canonical correlation and the corresponding canonical
From the first three rows we obtain the set of normal equations indicated by
1.0   .7   .1   .5   .4   .2   2.9
      1.0   .1   .4   .3   .5   3.0
            1.0   .2   .2   .4   2.0
Here the second and third rows are understood to be filled out with unwritten
terms in such a way as to make the matrix consisting of the first three columns
symmetric. The entries in the last column are the sums of those written or
understood in the respective rows preceding them. By linear operations on the
rows, equivalent to solving the equations, they are reduced to
1   .423   .362   −.316   1.470
1   .089   .031    .685   1.804
1   .149   .161    .362   1.671
This is the numerical equivalent of (6.3). Hence g_{αi} is the element in the αth row
and ith column of the matrix
      .423   .362   −.316
G =   .089   .031    .685
      .149   .161    .362
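The reduction can be checked by solving the normal equations directly. Assuming the correlations read from the rows above, the computed matrix agrees with G to about three decimal places:

```python
import numpy as np

# Correlations among the first set (left-hand block) and between the
# two sets (right-hand block), as read from the rows above.
S11 = np.array([[1.0, 0.7, 0.1],
                [0.7, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
S12 = np.array([[0.5, 0.4, 0.2],
                [0.4, 0.3, 0.5],
                [0.2, 0.2, 0.4]])

G = np.linalg.solve(S11, S12)    # the numerical equivalent of (6.3)
```

The small discrepancies in the third decimal are the rounding of the hand computation, not errors in the reduction.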
From the last three columns of the given matrix of correlations we obtain likewise
the normal equations indicated by
1.0   .8   .6   .5   .4   .2   3.5
      1.0   .7   .4   .3   .2   3.4
where S stands for summation from 1 to n, the number of degrees of freedom, and
where L and M stand for an arbitrary pair, equal or unequal, of the subscripts
1, 2, ...,s + t.
The sequences of variates which we shall consider may be defined as follows.
First, let x±=xx. Then let xj (a = 2, 3, ..., s) be the difference between tva and
a least-square estimate of x_α in terms of x_1, …, x_{α−1}, all divided by such a constant
that the variance of x_α′ is unity. To define the other sequence, let x_{s+1}′ be a linear
function of x_{s+1}, …, x_{s+t} having maximum correlation with x_1′; and let x_{s+β}′
(β = 2, …, s) be a linear function uncorrelated with x_{s+1}′, …, x_{s+β−1}′, and having
maximum correlation with x_β′. All these are to have unit variances. For a sample
we may set aside as infinitely improbable the possibility that any of these new
variates should be indeterminate. Putting R_β for the correlation of x_β′ with x_{s+β}′,
we shall find that
q = R_1 R_2 ⋯ R_s (7.2).
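In modern language the R_β are the cosines of the principal angles between the two sample subspaces, so (7.2) says that q is the product of those cosines. A sketch of that computation (my notation, simulated data):

```python
import numpy as np

rng = np.random.default_rng(4)
s, t, n = 2, 3, 40
X = rng.standard_normal((n, s))
Y = 0.5 * np.hstack([X, rng.standard_normal((n, 1))]) + rng.standard_normal((n, t))
X = X - X.mean(0)
Y = Y - Y.mean(0)

# Orthonormal bases for the two sample subspaces (the flat spaces of this section).
Qx, _ = np.linalg.qr(X)
Qy, _ = np.linalg.qr(Y)

# Cosines of the principal angles = sample canonical correlations.
cosines = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
q = np.prod(cosines)             # q = R_1 R_2 ... R_s as in (7.2)
```

The same q is obtained from the determinant formulation, which is the invariance the geometric argument below exploits.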
The process will be more perspicuous in geometrical than in algebraic language
because of the simplicity of the geometry associated with samples from normal
0 0 0 .. . 1 ... 0
None of these transformations affects the value of q, and we have, adapting the
definition (4.2) to this case, by replacing the covariances by the functions (7.1) of
x_{11}′, …, x_{s+t,n}′ and of the elements of (7.4), and then multiplying each row of each
determinant by n,
A'B'
where now A′ = B′ = 1, while
C' = ATS).
= 1,2,...,*) (7-6).
Upon expanding (7.5) with respect to the first s rows and columns we find, with
the help of Section 2,
q² = (−1)^s C′ (7.7).
Now any line perpendicular to OX_1 and OX_2 is perpendicular to all the lines in
the plane of these two, in particular to OX_2′. Hence P_{t−2}, which consists of lines
perpendicular to OX_1 and OX_2, is perpendicular to OX_2′. In like manner, P_{t−3} is
perpendicular to OX_1′, OX_2′ and OX_3′; and in general P_{t−β} is perpendicular to
OX_1′, OX_2′, …, OX_β′.
Since P_{t−β} lies entirely within P_t, the coordinates of any point U in P_{t−β} will
be linearly dependent on the rows of (7.4), and so of the form
u_1, u_2, …, u_t, 0, 0, …, 0 (7.8).
The orthogonality of OU to OX_1′, …, OX_β′ means that
Σ_h u_h x_{αh}′ = 0 (α = 1, 2, …, β) (7.9).
Now let θ_{β+1} denote the angle that OX_{β+1}′ makes with P_{t−β}; that is, θ_{β+1} is the
minimum angle of OX_{β+1}′ with a line OU such that the coordinates of U are of the
form (7.8) and satisfy (7.9). Without loss of generality we may also take U at unit
distance from the origin, so that
Σ_h u_h² = 1 (7.10).
Since Σ_h x_{β+1,h}′² = 1 by (7.3), we then have
cos θ_{β+1} = Σ_h u_h x_{β+1,h}′ (7.11).
To determine the minimum angle we therefore differentiate with respect to u_1, …, u_t
the expression
Σ_h u_h x_{β+1,h}′ − ½γ Σ_h u_h² − λ_1 Σ_h u_h x_{1h}′ − … − λ_β Σ_h u_h x_{βh}′,
where γ, λ_1, …, λ_β are Lagrange multipliers. This gives
λ_1 x_{1h}′ + … + λ_β x_{βh}′ = x_{β+1,h}′ − γ u_h (h = 1, …, t) (7.12).
Multiply (7.12) by u_h and sum with respect to h. The left member disappears
by (7.9), and from (7.10) and (7.11) we have
γ = cos θ_{β+1} (7.13).
Upon multiplying (7.12) by x_{αh}′, summing with respect to h, and using (7.9),
we have, for α = 1, 2, …, β,
λ_1 S x_{1h}′ x_{αh}′ + … + λ_β S x_{βh}′ x_{αh}′ = S x_{αh}′ x_{β+1,h}′ (7.14).
= 0 (7.15).
Multiply the last row of this determinant by x_{β+1,h}′ and sum with respect to h
from 1 to t. The last element, with the help of (7.11) and (7.13), reduces to
S x_{β+1,h}′² − γ²,
and so, from (7.6), we have γ² as in (7.16). Hence, from (7.13), cos θ_{β+1} is given
by (7.17) for β = 1, 2, …, s − 1.
The angle θ_1 defined in Section 7 between the line OX_1′ determined by the
sample values of the first variate and the flat space P_t determined by those of the
second set has the property that R_1 = cos θ_1 is the multiple correlation of x_1 with
the second set of variates. The population value of this multiple correlation is ρ_1.
We assume all the variates subject to random sampling. In this case R_1 will have
the "A" distribution discovered by R. A. Fisher*. In our notation, with samples
of n + 1 from which the means have been eliminated, or in samples of n + k from
which k degrees of freedom have been removed by least-squares elimination of
other variates, the distribution of R_1 is
d(f)
n-t-2 -n+t+1
1 a 1 2
x J [(l-i?)(i?-g*)] (R ) . F Q , | , | , Kfi*)d(#»)...(8-3),
where the subscript is dropped from the variable of integration. Now
* "The General Sampling Distribution of the Multiple Correlation Coefficient" in Proceedings of
the Royal Society, Vol. CXXI A (1928), p. 654.
Making this substitution and changing the variable of integration to
we have therefore
(W
" 2 ) ! . _.. (1 - v? (1 - f)»-<-i (f)~ d (f)
r
2»-'(<2)!r»( y)
n-t-2 -n+t+1
(8.4).
2 2
0 (lfy-tqdqi'ixilx)] [ 1 ( 1 ? V ]
However the conditions of sampling under which q is likely to be used are such
that (8.4) appears to be the more important form, and we shall give no further
consideration to (8.5).
By extension of the reasoning above the distribution of q may be found for
larger values of s, provided all but one of the parameters ρ_1, …, ρ_s vanish. Thus
for s = 3 we have, from (7.2),
q = R_1 R_2 R_3 = q′R_3,
where q′ has the distribution (8.3), i.e. (8.4), while R_3 has the distribution obtained
from (8.2) by replacing R_2 by R_3, n by n − 1, and t by t − 1. Combining this with
(8.4) in the same manner that (8.2) was combined with (8.1) to produce (8.3), the
new distribution of q is obtained. This process may be repeated to obtain the
distribution for values of s as great as desired; but it must of course be remembered
that s ≤ t.
It is tempting to try to obtain the general distribution of q, without our
assumption that all but one of the quantities ρ_1, …, ρ_s are zero, by treating these
as population values of the multiple correlations whose distributions are used
successively in finding the distribution of q as above. However this suggested
procedure appears to be incorrect. If ρ_2 ≠ 0, the globular cluster
formed by the projected X_2 points will have a centre which is not on P_{t−1}, where
it should be if the multiple correlation distribution were to be applied.
9. Moments of q. The Distribution for Large Samples. We shall derive the
even moments of the distribution of Section 8, assuming that v does not take
either of the extreme values 0 and 1. The latter is the case in which q becomes
a partial correlation coefficient, the theory of which is well understood. The case
v = 0, corresponding to complete independence, is a very simple one, concerning
which all information desired may be obtained from Section 11. The moments
will be obtained by processes involving repeated interchanges of order of the
processes of integration, differentiation, and summation of series. It will be
observed that the uniform convergence and continuity required to justify these
interchanges exist, provided v is definitely between 0 and 1 without taking either
of these values.
The odd moments of q about zero, which require no consideration unless s = t,
vanish in this case when only one of the canonical correlations is different from
zero.
t-2 , n-t-3
we obtain
(iii)rCi
In this we make the substitution
ri-+rl » ,
1
"~ ' l rl - + r - l
1
I 2 /i _\t—IJ_ /'U•9^ •
fl/ I i — *./ Cr* • . * . . . • • • • • • I t / —/>
0
upon reversing the order of integration and summation we have
• (1 - v)
xJ o (l-s)*-^S v'dx.
- (l-O
n-t
Jo
\2/>
The sum is now a binomial expansion and
n-t
(9.4).
in which the series can easily be calculated to any required accuracy. For large
samples the convergence is extremely rapid. A series of powers of n⁻¹ may also
be obtained by expanding each term of the hypergeometric series:
+ … (9.6).
This form brings out the manner in which, for large samples, μ_2 varies with v.
r(^ ,+k+r-l
dx.
X afv
d? *
The integral equals
J+fc-i «J-l i_ ru n »i n
dx=
whence
Jo
(HJ
-- -
5 +
/ (I-,) 2
n n
This may be made even more explicit by performing the differentiation with the
help of Leibnitz' theorem; and Euler's transformation may be applied to each of
the hypergeometric functions to give a rapidly convergent series. In this way we
obtain
k\
r
(1 − v)^r F(k, k, ½n + 2k − r, v) (9.8).
The expressions obtained from this by substituting particular values of k are
different in form from those obtained directly from (9.4), but are reducible to them
with the help of the Gauss relations between "neighbouring" hypergeometric
functions.
From (9.7) it is easy to see that μ_0 = 1, as it should; this checks a long chain
of deductions.
The asymptotic value of μ_{2k} for large values of n will now be investigated.
In the expression for the kth derivative obtained from (9.4) by Leibnitz' theorem,
the term of highest order in n is
/I \2
f1 "-*
x (1 — x)r~1(l — vx) dx
Jo
Hence, for s = t = 2, the distribution of q approaches the normal form, with variance
1/n, as is seen either from (9.9) or from (9.4), and mean value zero.
For t > 2, the distribution of q does not behave in this way. As in the case of
multiple correlation, it is then confined to positive values. An approximation to
the distribution is however suggested by the foregoing asymptotic values of the
moments. These are in fact the moments of the χ² distribution with t − 1 degrees
of freedom, if we put
The approximate distributions thus obtained may tentatively be used for testing
the significance of q in large samples when v has a value not too close to zero. For
small samples and small values of v, the methods of the next section are appropriate.
10. The Distribution for Small Samples. Form of the Frequency Curve. The
distribution of Section 8 may in certain cases be expressed in elementary forms.
If n − t is even, the Euler transformation
F(a, b, c, x) = (1 − x)^{c−a−b} F(c − a, c − b, c, x) (10·1)
may be applied to the hypergeometric function in the distribution to give a
terminating series, and the integration can then be carried out for each term, the
integrand in (8·3) being a rational function of R, or involving (if t is odd) a single
quadratic surd.
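The Euler transformation F(a, b, c, x) = (1 − x)^{c−a−b} F(c − a, c − b, c, x) is easy to check numerically. A minimal sketch in Python, summing the Gauss series directly (valid for |x| < 1; the parameter values are arbitrary):

```python
def hyp2f1(a, b, c, x, terms=400):
    """Gauss hypergeometric series F(a, b; c; x) by direct summation (|x| < 1)."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        # ratio of successive series terms
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * x
    return total

a, b, c, x = 1.5, 2.0, 4.0, 0.3
lhs = hyp2f1(a, b, c, x)
rhs = (1 - x) ** (c - a - b) * hyp2f1(c - a, c - b, c, x)
assert abs(lhs - rhs) < 1e-10   # the two sides agree
print(lhs)
```

When c − a and c − b are small the transformed series terminates or converges much faster than the original, which is exactly the use made of it in the text.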
Consider for example the simple case t = 2, n = 4. Since t = s, it is more con-
venient to work with the distribution of q than with that of q². [The development
leading to (10·4) is illegible in this copy.]
A factor ½ must be applied to this expression in the case t = 2 if we then distinguish
which may be used as a recurrence relation for computing the successive cₖ's as
soon as we have determined two values whose indices differ by unity.
From the identity of Gauss (ibid., p. 227)
[identity illegible in this copy]
It is easy to show by means of the recurrence relation that the limit of vₖ as m
increases is qᵏ, and that the remaining terms of vₖ constitute a polynomial in q
having the factor (1 − q)².
For a test of significance of q it is necessary to integrate the distribution from
an arbitrary value to unity. For this purpose it is convenient to put p = 1 — q. In
terms of p,
The series is uniformly convergent and may be integrated term by term, thus
providing a test of significance for the tetrad difference. It is not however very
convenient for computation unless n and v are small. For large values of n, the
method of the preceding section may be used: the standard error often gives a
satisfactory test of significance, even when used with the crude inequality of
Tchebycheff, which takes no account of the nature of the particular distribution.
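The crude inequality referred to can be sketched as follows (Python; the statistic, its mean and its standard error are placeholders, not values from the text):

```python
def chebyshev_p_bound(stat, mean, stderr):
    """Tchebycheff's inequality: P(|X - mean| >= k*stderr) <= 1/k**2.
    A distribution-free bound, usable as a conservative significance test
    when only the standard error of a statistic is known."""
    k = abs(stat - mean) / stderr
    return min(1.0, 1.0 / k ** 2) if k > 0 else 1.0

# A statistic 5 standard errors from its mean cannot have a tail
# probability above 1/25, whatever the underlying distribution.
print(chebyshev_p_bound(5.0, 0.0, 1.0))  # 0.04
```

The bound is very loose compared with the exact normal tail, which is why the text calls it crude; but it requires no knowledge of the particular distribution.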
Light is thrown on the form of the frequency curves by the expansions we have
just obtained. The case t = 2 stands out as of a special character, different from the
rest; this will be true in general where s = t. This special character is related to
the fact that positive and negative sample values of q are distinguishable only if
s = t. In other cases, just as in that of the multiple correlation (i.e. that of q when
s = 1), the values must be taken as positive, and q² is in some respects a more
natural variate to use.
The vₖ, and therefore the convergent series in (10·4) and (10·9), and also the
derivatives of the vₖ and of the series, take definite finite values both for q = 0 and
for q = 1. From (10·4) it is therefore evident that the frequency curve for q has, for
q = 1, contact with the axis of order m − 2. For q = 0 the ordinate of the curve is
zero for t ≥ 3, but has a finite value for t = 2.
The derivative with respect to q of the integral in (8·3) has, if v < 1, a finite
negative value for q = 0 as well as for every positive value of q. The ordinate of the
distribution curve for t=2 will therefore have these properties. This curve must
be symmetrical about q = 0. Hence it is not flat-topped, but has a corner above
the origin.
But if v = 1 the distribution of q for s = t = 2 does not have such a discontinuity
in the middle. For in this case linear functions of the variates in the two sets exist
which are perfectly correlated with each other, and are thus for our purposes identical.
Taking these as x₁ and x₃, (12) shows that q is in every sample the partial corre-
lation of the remaining two variates. Hence when v = 1 the distribution becomes
identically that of the partial correlation coefficient. According to R. A. Fisher's
work*, this is the same as the distribution of the simple correlation coefficient,
with the sample number reduced by unity, a distribution having continuous
derivatives of all orders throughout its range.
11. Tests for Complete Independence. If s = 2 and both canonical correlations
for positive values of q. Thus q has in this case the same distribution as the square
of the multiple correlation coefficient in samples of n (= N − 1) from an uncorrelated
normal population, with t — 1 variates.
The question whether complete independence exists between two sets of variates
for which we have sample correlations may be investigated by computing q and
determining from (11·1) whether the probability of so great a value of q is negli-
gible. This requires the integral of (11·1), which is easy to compute for any
moderate value of t. For large values of t it may be obtained from the Tables of
the Incomplete Beta Function†. For t = 2 the probability of a greater value of |q|
if complete independence really exists is simply
where N is the number in the sample. In this way a very simple test for complete
independence may be applied.
But this is not by any means the only possible test of complete independence
between two sets. Indeed, the distribution of the vector alienation coefficient
(Section 4),
z = D/(D₁D₂),
has been found by Wilks under this same hypothesis of complete independence
and normality‡. This distribution, which was obtained by means of its moments,
reduces for the case s = 2 which we are now studying to
* "The Distribution of the Partial Correlation Coefficient" in Metron, Vol. m. (1924), pp. 829—332.
t Biometrika Office, 1984. t Wilks, op. cit.
shown in Figure 1. The best agreement with the hypothesis of complete independence
is shown by a sample for which z = 1 and q = 0, and which therefore corresponds to
Fig. 1.
The same conclusion is given even greater definiteness by the z test (11·6), from
which we have P = ·0001.
The comparison of arithmetical with memory tests in Section 6 was for the
values s = 2, t = 3. In this case we find from q that P = 86 × 10⁻¹⁰, while the test
for complete independence by means of z gives P′ = 10 × 10⁻¹⁴. Thus z gives a more
sensitive test, and a more conclusive demonstration, of complete independence in
both these cases than does q. The underlying reason for this is the considerable
inequality between the two canonical correlations in each case.
One practical consideration in favour of the q test is that q is somewhat easier
to calculate than z. The chief ground for distinction between them is however their
sensitiveness to different types of deviations from complete independence.
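For s = t = 2 both statistics can be computed directly from the 4 × 4 correlation matrix of the sample. A sketch in Python with no libraries; the determinantal formulas q = |R₁₂|/√(|R₁₁||R₂₂|) and z = D/(D₁D₂) are used as assumptions consistent with the definitions of the vector correlation and alienation coefficients referred to in the text, and the correlation matrix is invented for illustration:

```python
def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def det(m):
    # determinant by Gaussian elimination with partial pivoting
    m = [row[:] for row in m]; n = len(m); d = 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        if p != i:
            m[i], m[p] = m[p], m[i]; d = -d
        d *= m[i][i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n):
                m[r][c] -= f * m[i][c]
    return d

# a positive-definite correlation matrix: sets {1, 2} and {3, 4}
R = [[1.0, 0.3, 0.5, 0.2],
     [0.3, 1.0, 0.1, 0.4],
     [0.5, 0.1, 1.0, 0.3],
     [0.2, 0.4, 0.3, 1.0]]
R11 = [[R[0][0], R[0][1]], [R[1][0], R[1][1]]]   # within first set
R22 = [[R[2][2], R[2][3]], [R[3][2], R[3][3]]]   # within second set
R12 = [[R[0][2], R[0][3]], [R[1][2], R[1][3]]]   # between sets

D, D1, D2 = det(R), det2(R11), det2(R22)
q = det2(R12) / (D1 * D2) ** 0.5   # vector correlation coefficient
z = D / (D1 * D2)                  # vector alienation coefficient
print(q, z)                        # q ≈ 0.198, z ≈ 0.631
```

By the Schur-complement factorization of D, z equals the product of (1 − ρᵢ²) over the canonical correlations ρᵢ, and q² the product of the ρᵢ², which is why the two tests respond differently when the canonical correlations are unequal.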
y₁, y₂, …, yₙ
is specified by the alternants
p_ij = x_i y_j − x_j y_i (12·1),
which are analogous to the Plücker coordinates of a line. Indeed, the planes
through a point in n-space are in one-to-one correspondence with the lines in which
they meet an (n − 1)-space not containing the point.
* " O n the Independence of It Sets of Normally Distributed Statistical Variables" in Econometriea,
Vol. m. (1985), pp. 309—326.
The relations connecting these alternants, apart from the obvious relations
p_ij = − p_ji (12·2),
are obtainable from the fact that, on account of identical rows,
| x_i  x_j  x_k  x_l |
| y_i  y_j  y_k  y_l |
| x_i  x_j  x_k  x_l |
| y_i  y_j  y_k  y_l | = 0.
Applying a Laplace expansion to the first two rows of this determinant we obtain
p_ij p_kl − p_ik p_jl + p_il p_jk = 0 (12·3)
for which the first subscript is less than the second. This condition on the
quantities (12·4) shows that the number of independent alternants is 2n − 4, which
is the number of degrees of freedom of the plane.
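The antisymmetry and the quadratic relation among the alternants p_ij = x_i y_j − x_j y_i are easy to verify numerically; a small Python sketch with vectors chosen at random:

```python
import random

random.seed(1)
n = 6
x = [random.uniform(-1, 1) for _ in range(n)]
y = [random.uniform(-1, 1) for _ in range(n)]

def p(i, j):
    """Alternant p_ij = x_i*y_j - x_j*y_i (0-based indices here)."""
    return x[i] * y[j] - x[j] * y[i]

# antisymmetry: p_ij = -p_ji
assert all(abs(p(i, j) + p(j, i)) < 1e-12
           for i in range(n) for j in range(n))

# the quadratic (Pluecker) relation for several choices of four indices
for i, j, k, l in [(0, 1, 2, 3), (1, 2, 4, 5), (0, 2, 3, 5)]:
    assert abs(p(i, j) * p(k, l) - p(i, k) * p(j, l) + p(i, l) * p(j, k)) < 1e-12

print("alternant relations verified")
```

The quadratic relation holds identically in the x's and y's, which is what the Laplace expansion of the doubled determinant shows.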
x₁  x₂  x₃  …  xₙ
y₁  y₂  y₃  …  yₙ
1   0   0   …  0
0   1   0   …  0   (12·7).
The determinant D of the correlations among the four variates is the deter-
minant of sums of squares and products, since each sum of squares is unity.
Hence, by Section 2, D is the sum of the squares of the four-rowed determinants
in the matrix (12·7). But all these determinants are zero except those containing
the first and second columns. The determinant consisting of the first, second,
ith and jth columns equals x_i y_j − x_j y_i. Defining this as p_ij, we obtain the result
D = Σ′ p_ij² (12·8),
where Σ′ denotes summation from 3 to n with respect to i, and from i + 1 to n with
respect to j. It is further evident from (12·7) that the determinants of correlations
within the sets are
D₁ = 1 − r₁₂², D₂ = 1 − r₃₄² (12·9).
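The identity D = Σ′ p_ij² can be checked numerically: the determinant of the sums of squares and products of the four rows equals, by the Cauchy–Binet theorem, the sum of the squared alternants p_ij = x_i y_j − x_j y_i over 3 ≤ i < j ≤ n. A sketch in Python; the unit-sum-of-squares normalization is imposed directly, and zero means are not enforced, so this illustrates only the algebraic identity:

```python
import random

random.seed(2)
n = 7

def unit(v):
    s = sum(t * t for t in v) ** 0.5
    return [t / s for t in v]

x = unit([random.uniform(-1, 1) for _ in range(n)])
y = unit([random.uniform(-1, 1) for _ in range(n)])
e1 = [1.0] + [0.0] * (n - 1)
e2 = [0.0, 1.0] + [0.0] * (n - 2)
rows = [x, y, e1, e2]

# 4x4 determinant of sums of squares and products (Gram determinant)
G = [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]

def det(m):
    m = [r[:] for r in m]; d = 1.0
    for i in range(len(m)):
        piv = max(range(i, len(m)), key=lambda r: abs(m[r][i]))
        if piv != i:
            m[i], m[piv] = m[piv], m[i]; d = -d
        d *= m[i][i]
        for r in range(i + 1, len(m)):
            f = m[r][i] / m[i][i]
            for c in range(i, len(m)):
                m[r][c] -= f * m[i][c]
    return d

D = det(G)
# sum of squared alternants over 3 <= i < j <= n (0-based: 2 <= i < j < n)
S = sum((x[i] * y[j] - x[j] * y[i]) ** 2
        for i in range(2, n) for j in range(i + 1, n))
print(abs(D - S) < 1e-12)  # True
```

Only the four-rowed minors containing the first two columns survive, because the unit rows vanish elsewhere, which is exactly the argument in the text.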
The equation (3·6) for the canonical correlations of the sample is
| λ      λr₁₂   r₁₃    r₁₄  |
| λr₁₂   λ      r₂₃    r₂₄  |
| r₁₃    r₂₃    λ      λr₃₄ |
| r₁₄    r₂₄    λr₃₄   λ    | = 0 (12·10).
The coefficient of λ⁴ in this equation is D₁D₂. The term independent of λ is
the square of the sum of the products of the determinants in the first two rows
of (12·7) by the corresponding determinants in the last two rows. The latter
determinants, however, are all zero but the first; hence the constant term in the
equation is p₁₂². The coefficient of λ² may be obtained by putting λ = 1 in the left
member of (12·10) and subtracting the coefficient of λ⁴ and the constant term.
This coefficient is therefore equal to
[Equation (13·6), largely illegible: the joint density element in r, z and q for n = 4, with factor 1/π.]
This result and the following theorem will be used in Section 15 in extending the
distribution to a general value of n.
14. Theorem on Circularly Distributed Variates. The sum or difference of two
variates distributed independently and with uniform density over a particular range
is known to have a distribution represented by an isosceles triangle whose base has
double the breadth of the original range. If however each value of the sum or
difference is reduced with the original range as modulus—that is, is replaced by
the remainder after dividing by the range—the resulting distribution is exactly
the original one, with uniform density over the same range. This is a special case
of the following rather remarkable
THEOREM: If any number of variates are distributed independently and with
uniform density from 0 to a, then any linear function of these variates with integral
coefficients, when reduced modulo a, is likewise distributed with uniform density from
0 to a. Any number of such functions, if algebraically independent, are also inde-
pendent in the probability sense.
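A quick Monte-Carlo check of the theorem (Python; the range a and the integer coefficients 3 and −5 are arbitrary choices):

```python
import random

random.seed(3)
a = 2.0              # the common range (0, a)
N = 200_000
counts = [0] * 10
for _ in range(N):
    x = random.uniform(0, a)
    y = random.uniform(0, a)
    w = (3 * x - 5 * y) % a   # integer-coefficient linear function, reduced mod a
    counts[min(int(10 * w / a), 9)] += 1

# each tenth of (0, a) should receive close to N/10 of the mass
expected = N / 10
assert all(abs(c - expected) < 0.05 * expected for c in counts)
print(counts)
```

The 5 % tolerance is generous: with 20 000 expected per cell the sampling standard deviation is about 134, so a uniform w passes comfortably, while an unreduced sum (whose density is triangular) fails badly.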
The truth of this theorem becomes evident when we regard each set of values
of the variates as a point in a space having the metrical properties of a hypercube
of as many dimensions as there are variates, but with a topological nature deter-
mined by making each pair of diametrically opposite faces of the hypercube
correspond to a single region of the space. The space is thus a closed manifold
generalizing a torus in its topology, but not contained in a euclidean space, because
of its metrical nature. For two variates this representing space would be approxi-
mated by a torus obtained by revolving a very small circle about a very distant
line in its plane. Another representation in this case would be by means of the
squares of side a into which a plane is divided by two sets of parallel lines, all
points occupying a particular position within their respective squares being regarded
as identical. If we call the variates, or coordinates, x₁, x₂, …, x_m, the linear
functions
y_i = Σ_{j=1}^{m} a_ij x_j (i = 1, 2, …, k) (14·1),
in which the coefficients a_ij are positive or negative integers or zero (but are not
all zero for any value of i), are constants over loci which, on the representation on
a plane or flat space of m dimensions, are parallel lines, or hyperplanes of m − 1
dimensions. In the space itself, in which only one point corresponds to each set of
which satisfy Σ x_i² = 1 identically, then the element of (n − 1)-dimensional area for
the point may be written
[expression illegible],
where g is a determinant of n − 1 rows, in which the element in the ith row and jth
column is
Σ_k (∂x_k/∂θ_i)(∂x_k/∂θ_j) (15·2).
All these quantities are readily seen from (15·1) to vanish except those for which
i = j. The successive diagonal elements of g are
sin²θ₂, 1, cos²θ₂, cos²θ₂ sin²θ₃, ….
[Equations (15·4)–(15·10), largely illegible: the coordinates expressed in products of sines and cosines of the angles φᵢ, culminating in expressions (15·9) and (15·10) involving r and 1 − r².]
To simplify the notation we shall replace θ₂ and φ₂ simply by θ and φ respectively.
We shall also put
ω = θ₁ − φ₁.
Since only the sines and cosines of ω will enter into our discussion, we may
regard ω as reduced modulo 2π. Now θ₁ and φ₁ vary independently and, as is seen
from (15·3) and (15·5), with uniform density from 0 to 2π. Hence their difference
ω must, by the theorem of the last section, have a distribution of uniform density
from 0 to 2π. Moreover, since θ₁ and φ₁ have been seen to be independent of θ,
φ and λ, it follows that ω is likewise independent of them; indeed, ω, θ, φ and λ
constitute a completely independent set. The distributions of θ and φ are
determined by integrating (15·3) and (15·5) between constant limits with respect
to all the variates appearing in them except θ and φ respectively. Combining
with (15·7) the result of this integration and the uniformity of the distribution of ω,
we have that the element of probability is of the form
[expression illegible] (15·11),
where k_n depends only on n. The limits for θ and φ are 0 and π/2; for λ they are 0
and π; for ω they are 0 and 2π.
In the new notation, (15·9) and (15·10) become
r = cos ω sin θ sin φ + cos λ cos θ cos φ (15·12),
q = sin²ω sin²θ sin²φ / (1 − r²), z = sin²λ cos²θ cos²φ / (1 − r²) (15·13).
We next consider a transformation to the variates q, z, r and ω. Without troubling
to compute the Jacobian J of this transformation, we need observe only that it is
independent of n, since the functional relations (15·12) and (15·13) do not involve n.
Substituting in (15·11) from the second of the equations (15·13) we find that the
distribution is of the form
k_n ψ z^{(n−3)/2} (1 − r²)^{(n−3)/2} dq dz dr dω,
where ψ does not involve n. Upon integrating this with respect to ω between
certain limits depending on q, z and r, but not on n, we have the distribution
k_n Ψ z^{(n−3)/2} (1 − r²)^{(n−3)/2} dq dz dr,
which for n = 4 must reduce to (13·6). Comparing with (13·6) we have k₄Ψ = 2/(πz).
Inasmuch as the distribution of r is known to be
[Equation (15·15), partially illegible: a density element proportional to (n − 2)(n − 3)(r₁² − r₂²)(1 − r₁²)^{…}(1 − r₂²)^{…} dr₁ dr₂.]
16. Further Problems. The foregoing treatment of sampling distributions is
obviously incomplete. It would be desirable to have exact distributions, both of
sample canonical correlations and of various functions of them, for cases in which
the canonical correlations in the population have arbitrary values. The coefficients
obtained for the canonical variates have sampling distributions which remain to be
determined. Furthermore, various possible comparisons among different samples
remain to be investigated; for example, there is the problem of testing the signifi-
cance of the difference between vector correlations obtained from different samples.
A generalization of the problem of relations between two sets of variates invariant
under internal linear transformations is that of invariants under such transformations
of three or more sets of variates. A beginning of this theory has been made by
Wilks in the work previously alluded to; the z we have used is only a special case
of a statistic of his, which in general is defined with reference to any number of sets
of variates as a fraction, whose numerator is the determinant of the correlations
among all the variates, and whose denominator is the product of the determinants
of correlations within sets. It is obvious also that the invariants we have discussed,
taken between every two of the sets, are invariants of such a system. An additional
set of invariants will be the roots of the equation in λ resembling (3·6), obtained
from the determinant of all the correlations or covariances by multiplying those
between variates in the same set by −λ. It is easy to prove with the help of the
theory of λ-matrices that the roots and coefficients of this equation are actually
invariants.
If all the costs and profits of the mixing or manufacturing operation are regarded
as prices of constituents, the value of one set of commodities will equal that of the
other, so that
Σ p_i x_i = Σ p_i′ x_i′ (16·2),
where the p_i are the prices of the original commodities and the p_i′ are those of the
products. If we regard (16·1) as a linear transformation of the quantities, there
will be a corresponding linear transformation of the prices, whose coefficients may
be determined in terms of the c_ij by substituting (16·1) and
p_i′ = Σ_k d_ik p_k
in (16·2) and then equating coefficients of like terms. This process shows that
Σ_i c_ij d_ik = δ_jk, = 1 if j = k, = 0 if j ≠ k.
These equations fully determine the d_ik as functions of the c_ij. The relation is such
that the transformation of prices is contragredient to that of quantities.
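The contragredience can be verified on a small example (Python with exact fractions; the mixing coefficients c_ij and the price and quantity vectors are arbitrary):

```python
from fractions import Fraction as F

# mixing of two commodities into two products: x' = C x
C = [[F(2), F(1)],
     [F(1), F(3)]]          # c_ij, assumed nonsingular

detC = C[0][0] * C[1][1] - C[0][1] * C[1][0]
Cinv = [[ C[1][1] / detC, -C[0][1] / detC],
        [-C[1][0] / detC,  C[0][0] / detC]]
# contragredient price transformation: D = (C^{-1})^T, i.e. p'_i = sum_k d_ik p_k
Dm = [[Cinv[0][0], Cinv[1][0]],
      [Cinv[0][1], Cinv[1][1]]]

# check sum_i c_ij d_ik = delta_jk
for j in range(2):
    for k in range(2):
        s = sum(C[i][j] * Dm[i][k] for i in range(2))
        assert s == (1 if j == k else 0)

# value invariance: p . x == p' . x'
x = [F(4), F(5)]
p = [F(7), F(2)]
xp = [sum(C[i][j] * x[j] for j in range(2)) for i in range(2)]
pp = [sum(Dm[i][k] * p[k] for k in range(2)) for i in range(2)]
assert sum(a * b for a, b in zip(p, x)) == sum(a * b for a, b in zip(pp, xp))
print("contragredience verified")
```

Taking D as the inverse transpose of C is exactly what the orthogonality condition forces, and exact rational arithmetic makes the two value sums agree identically rather than to rounding error.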
An important class of relations between prices and quantities of a group of
commodities would be the class of relations invariant under mixings of the kind
described above. The canonical correlations and their functions, which are the main
subject of this paper, are such invariants. But on account of the restriction that
linear transformations of one set of variates shall be contragredient to those of the
other, there will be additional invariants for this case, which remain to be in-
vestigated.
Other important problems are connected with the case of equal canonical
correlations, which had to be excluded in deriving the approximate standard errors
in Section 5. If two or more canonical correlations in the population are equal, it
appears that the distribution of the corresponding sample values does not approach
the multivariate normal form. This case is of much practical importance, owing to
the practice of devising tests designed to measure the same character with equal
accuracy. The psychologists' use of "reliability coefficients," and of "correlations
corrected for attenuation" has been recognized as unsatisfactory. One symptom of
trouble is that the formula for correlations corrected for attenuation sometimes
gives values greater than unity. A satisfactory treatment of this difficulty should
be possible with the help of the distribution function, when found, of sample