Paper information - A HIGH DIMENSIONAL TWO SAMPLE SIGNIFICANCE TEST

A HIGH DIMENSIONAL TWO SAMPLE SIGNIFICANCE TEST

0. Summary. The classical multivariate 2 sample significance test based on Hotelling's T^2 is undefined when the number k of variables exceeds the number of within sample degrees of freedom available for estimation of variances and covariances. Addition of an a priori Euclidean metric to the affine k-space assumed by the classical method leads to an alternative approach to the same problem. A test statistic F which is the ratio of 2 mean square distances is proposed and 3 methods of attaching a significance level to F are described. The third method is considered in detail and leads to a "non-exact" significance test where the null hypothesis distribution of F depends, in approximation, on a single unknown parameter r for which an estimate must be substituted. Approximate distribution theory leads to 2 independent estimates of r based on nearly sufficient statistics and these may be combined to yield a single estimate. A test of F nominally at the 5% level but based on an estimate of r rather than r itself has a true significance level which is a function of r. This function is investigated and shown to be quite near 5%. The sensitivity of the test to a parameter measuring statistical distance between population means is discussed and it is shown that arbitrarily small differences in each individual variable can result in a detectable overall difference provided the number of variables (or, more precisely, r) can be made sufficiently large. This sensitivity discussion has stated implications for the a priori choice of metric in k-space. Finally a geometrical description of the case of large r is presented.

1. Introduction. The statistical problem here treated is that of significance testing for the difference of the means of 2 k-variate populations which may be assumed to have the same structure of variances and covariances, the test being based on a sample from each population with sample sizes denoted by n1 and n2. It is intended to provide a method applicable to data where the number k of characteristics measured on each individual is large but where the number of individuals measured may be quite small. The usual method of classical multivariate statistics encounters a mathematical barrier and becomes inapplicable when k > n1 + n2 - 2, but certainly the need has arisen in applied statistical work for techniques handling small samples of highly described individuals. The classical method has 2 equivalent formulations in terms of the T^2 statistic
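As a rough illustration of the idea in the Summary, the sketch below computes a ratio of two mean square Euclidean distances for a case with k > n1 + n2 - 2, where T^2 is unavailable. The excerpt does not give the formula for F, so the normalization used here is an assumption: the between-sample squared distance is scaled by n1*n2/(n1 + n2), and the pooled within-sample sum of squares is divided by n1 + n2 - 2. The function names (mean_vector, within_ss, distance_ratio_f) are hypothetical, and the sketch does not attempt the estimation of r or the significance level.

```python
import random


def mean_vector(sample):
    """Column-wise mean of a list of equal-length observation vectors."""
    n = len(sample)
    return [sum(col) / n for col in zip(*sample)]


def within_ss(sample):
    """Sum of squared Euclidean distances from each observation to its sample mean."""
    m = mean_vector(sample)
    return sum((x - mj) ** 2 for row in sample for x, mj in zip(row, m))


def distance_ratio_f(sample1, sample2):
    """Between-sample over within-sample mean square distance.

    ASSUMPTION: this normalization is an illustrative guess at the form of F;
    the source text only says F is a ratio of 2 mean square distances.
    """
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = mean_vector(sample1), mean_vector(sample2)
    # Squared distance between the two sample means, scaled so that it is
    # comparable to a single within-sample squared deviation.
    between = (n1 * n2 / (n1 + n2)) * sum((a - b) ** 2 for a, b in zip(m1, m2))
    # Pooled within-sample squared distance per degree of freedom.
    within = (within_ss(sample1) + within_ss(sample2)) / (n1 + n2 - 2)
    return between / within


if __name__ == "__main__":
    rng = random.Random(0)
    k, n1, n2 = 50, 4, 5  # k = 50 exceeds n1 + n2 - 2 = 7
    s1 = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n1)]
    s2 = [[rng.gauss(0.3, 1.0) for _ in range(k)] for _ in range(n2)]
    print("F =", distance_ratio_f(s1, s2))
```

The ratio needs only squared distances, never a matrix inverse, so it remains defined however large k is relative to the sample sizes. That is the practical consequence of adding a Euclidean metric to k-space.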

A. Dempster
