The Use of Shrinkage Estimators in Linear Discriminant Analysis

Probably the single most common discriminant algorithm in use today is the linear algorithm. Unfortunately, this algorithm has been shown to frequently behave poorly in high dimensions relative to other algorithms, even on suitable Gaussian data. This is because the algorithm uses sample estimates of the means and covariance matrix, which are of poor quality in high dimensions. It seems reasonable that if these unbiased estimates were replaced by estimates which are more stable in high dimensions, the resulting modified linear algorithm should be an improvement. This paper studies the use of a shrinkage estimate for the covariance matrix in the linear algorithm. We chose the linear algorithm not because we particularly advocate its use, but because its simple structure makes it easier to ascertain the effects of using shrinkage estimates. A simulation study assuming two underlying Gaussian populations with a common covariance matrix found the shrinkage algorithm to significantly outperform the standard linear algorithm in most cases. Several different means, covariance matrices, and shrinkage rules were studied. A nonparametric algorithm, which had previously been shown to usually outperform the linear algorithm in high dimensions, was included in the simulation study for comparison.
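To make the idea concrete, the following is a minimal sketch of a two-class linear discriminant in which the pooled sample covariance is shrunk toward a scaled identity before inversion. The paper studies several shrinkage rules; the particular rule used here, S_lam = (1 - lam) S + lam (tr(S)/p) I with a fixed lam, and all names in the code are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def shrinkage_lda(X1, X2, lam=0.2):
    """Linear discriminant using a shrunken pooled covariance estimate.

    X1, X2: (n_i, p) training samples from the two Gaussian populations.
    lam:    shrinkage weight in [0, 1]; lam = 0 recovers the standard
            linear algorithm (illustrative fixed value, not a studied rule).
    Returns a classifier mapping a p-vector to class label 1 or 2.
    """
    n1, p = X1.shape
    n2, _ = X2.shape
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled (unbiased) sample covariance of the two samples.
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    # Shrink toward a scaled identity; this stabilizes the inverse
    # when p is large relative to n1 + n2.
    S_lam = (1 - lam) * S + lam * (np.trace(S) / p) * np.eye(p)
    w = np.linalg.solve(S_lam, m1 - m2)   # discriminant direction
    c = w @ (m1 + m2) / 2                 # midpoint threshold
    return lambda x: np.where(x @ w > c, 1, 2)
```

In high dimensions the sample covariance is ill-conditioned (or singular when p exceeds the pooled sample size), so solving against the shrunken matrix `S_lam` rather than `S` is what keeps the discriminant direction stable.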
