A HIGH DIMENSIONAL TWO SAMPLE SIGNIFICANCE TEST

0. Summary. The classical multivariate 2 sample significance test based on Hotelling's T^2 is undefined when the number k of variables exceeds the number of within sample degrees of freedom available for estimation of variances and covariances. Addition of an a priori Euclidean metric to the affine k-space assumed by the classical method leads to an alternative approach to the same problem. A test statistic F which is the ratio of 2 mean square distances is proposed and 3 methods of attaching a significance level to F are described. The third method is considered in detail and leads to a "non-exact" significance test where the null hypothesis distribution of F depends, in approximation, on a single unknown parameter r for which an estimate must be substituted. Approximate distribution theory leads to 2 independent estimates of r based on nearly sufficient statistics and these may be combined to yield a single estimate. A test of F nominally at the 5% level but based on an estimate of r rather than r itself has a true significance level which is a function of r. This function is investigated and shown to be quite near 5%. The sensitivity of the test to a parameter measuring statistical distance between population means is discussed and it is shown that arbitrarily small differences in each individual variable can result in a detectable overall difference provided the number of variables (or, more precisely, r) can be made sufficiently large. This sensitivity discussion has stated implications for the a priori choice of metric in k-space. Finally a geometrical description of the case of large r is presented.

1. Introduction. The statistical problem here treated is that of significance testing for the difference of the means of 2 k-variate populations which may be assumed to have the same structure of variances and covariances, the test being based on a sample from each population with sample sizes denoted by n1 and n2.
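The breakdown of the classical test noted in the Summary can be checked numerically. The following is a minimal sketch, not part of the original paper: it assumes NumPy, simulated standard normal data, and illustrative values n1 = 5, n2 = 6, k = 20, and the helper name `pooled_covariance` is hypothetical. It shows that the pooled within sample covariance matrix has rank at most n1 + n2 - 2, so that it cannot be inverted, as T^2 requires, once k exceeds that number.

```python
import numpy as np


def pooled_covariance(x1, x2):
    """Pooled within-sample covariance on n1 + n2 - 2 degrees of freedom.

    x1 and x2 are (n1, k) and (n2, k) arrays, one row per individual.
    """
    n1, n2 = x1.shape[0], x2.shape[0]
    d1 = x1 - x1.mean(axis=0)  # deviations from the first sample mean
    d2 = x2 - x2.mean(axis=0)  # deviations from the second sample mean
    return (d1.T @ d1 + d2.T @ d2) / (n1 + n2 - 2)


# Illustrative (assumed) sizes: few individuals, many variables.
rng = np.random.default_rng(0)
n1, n2, k = 5, 6, 20
x1 = rng.normal(size=(n1, k))
x2 = rng.normal(size=(n2, k))

s = pooled_covariance(x1, x2)
dof = n1 + n2 - 2
rank = np.linalg.matrix_rank(s)

# The k x k matrix s is singular whenever rank < k, so s^{-1} (and hence
# the classical T^2 statistic) does not exist in this setting.
print(f"k = {k}, within-sample degrees of freedom = {dof}, rank of S = {rank}")
```

The rank bound holds for any data, since the two blocks of deviations contribute at most n1 - 1 and n2 - 1 linearly independent directions respectively.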
It is intended to provide a method applicable to data where the number k of characteristics measured on each individual is large but where the number of individuals measured may be quite small. The usual method of classical multivariate statistics encounters a mathematical barrier and becomes inapplicable when k > n1 + n2 - 2, but certainly the need has arisen in applied statistical work for techniques handling small samples of highly described individuals. The classical method has 2 equivalent formulations in terms of the T^2 statistic