A New Approach for Detecting Multivariate Outliers

ABSTRACT This article proposes a new procedure, named Max-Eigen difference (MED), for identifying outliers in multivariate data sets. Theoretical aspects of the procedure are briefly discussed. The proposed procedure is compared with the Mahalanobis distance (MD) and the robust distance (RD) in two examples, which indicate that MED works better than MD and is comparable with RD. Finally, the procedure is applied in constructing a quadratic discriminant analysis used to predict splice sites in DNA sequences. Results on rice and human genome data sets show that the robustified discriminant provides higher prediction accuracy than the usual discriminant method.
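For orientation, the classical MD baseline mentioned in the abstract can be sketched as follows. This is not the MED procedure, whose definition is not given in this section; it is a minimal illustration of the standard squared Mahalanobis distance computed from the sample mean and sample covariance. The function names and the caller-supplied `cutoff` are assumptions made for illustration (a chi-square quantile with p degrees of freedom is a common choice of cutoff).

```python
def mean_and_cov(rows):
    """Sample mean vector and sample covariance matrix (divisor n - 1)."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(p)] for i in range(p)]
    return mean, cov


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def squared_mahalanobis(rows):
    """Squared MD of each row: (x - mean)^T S^{-1} (x - mean)."""
    mean, cov = mean_and_cov(rows)
    out = []
    for r in rows:
        d = [r[j] - mean[j] for j in range(len(mean))]
        z = solve(cov, d)  # z = S^{-1} d without forming the inverse
        out.append(sum(di * zi for di, zi in zip(d, z)))
    return out


def flag_outliers(rows, cutoff):
    """Indices whose squared MD exceeds the caller-supplied cutoff."""
    return [i for i, d2 in enumerate(squared_mahalanobis(rows)) if d2 > cutoff]


# Hypothetical toy data: five points near a line plus one point far off it.
data = [(1.0, 2.0), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1), (3.0, 9.0)]
print(squared_mahalanobis(data))
print(flag_outliers(data, cutoff=3.0))
```

Because the mean and covariance here are estimated from all observations, outliers inflate the very estimates used to detect them. This is the usual motivation for robust alternatives such as RD.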
