Improvement on enhanced Monte-Carlo outlier detection method

Abstract: A highly predictive multivariate calibration model depends on the samples in its training set. In this study, we introduce an outlier detection method and an improved version of it with a shorter run time. The improved Monte Carlo outlier detection (IMCOD) method first establishes cross-prediction models to determine the normal samples, which are then used to analyze the distribution of prediction errors for all dubious samples together. Four real datasets were used to illustrate and validate the performance of IMCOD. After sample selection for the training set of a near-infrared (NIR) dataset of soy flour samples, the root mean square error of prediction (RMSEP) of the partial least squares (PLS) model decreased from 1.4811 to 0.7650. The method thus helps to establish good models for QSAR and NIR datasets.
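The general idea of Monte Carlo cross-prediction can be illustrated with a minimal sketch. It is not the IMCOD algorithm itself, whose details the abstract does not give. As assumptions made only for illustration, a one-variable least-squares line stands in for the PLS model, the data are synthetic with one planted outlier, and the function names, the 70% training fraction, the 500 resampled models, and the median/MAD cutoff are all hypothetical choices.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (a simple stand-in for a PLS model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mc_prediction_errors(xs, ys, n_models=500, train_fraction=0.7, seed=0):
    """Collect each sample's prediction errors over many random train/left-out splits."""
    rng = random.Random(seed)
    n = len(xs)
    n_train = max(2, int(train_fraction * n))
    errors = [[] for _ in range(n)]
    for _ in range(n_models):
        train = set(rng.sample(range(n), n_train))
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in range(n):
            if i not in train:  # only left-out samples are predicted
                errors[i].append(ys[i] - (a + b * xs[i]))
    return errors


def flag_dubious(errors, k=3.0):
    """Flag samples whose mean prediction error is far from the bulk (assumed MAD rule)."""
    mean_abs = [abs(statistics.fmean(e)) if e else 0.0 for e in errors]
    med = statistics.median(mean_abs)
    mad = statistics.median([abs(m - med) for m in mean_abs])
    cutoff = med + k * 1.4826 * mad
    return [i for i, m in enumerate(mean_abs) if m > cutoff], mean_abs


# Synthetic data: y = 1 + 2x plus small deterministic noise, one planted outlier.
xs = [float(i) for i in range(20)]
ys = [1.0 + 2.0 * x + 0.1 * math.sin(i) for i, x in enumerate(xs)]
ys[7] += 15.0

errors = mc_prediction_errors(xs, ys)
dubious, mean_abs = flag_dubious(errors)
keep = [i for i in range(len(xs)) if i not in dubious]

# Compare RMSEP on a clean test set before and after dropping flagged samples.
x_test = [0.5 + i for i in range(19)]
y_test = [1.0 + 2.0 * x for x in x_test]
a0, b0 = fit_line(xs, ys)
a1, b1 = fit_line([xs[i] for i in keep], [ys[i] for i in keep])
rmsep_all = rmsep(y_test, [a0 + b0 * x for x in x_test])
rmsep_kept = rmsep(y_test, [a1 + b1 * x for x in x_test])
print(dubious, round(rmsep_all, 3), round(rmsep_kept, 3))
```

In this toy setting the planted sample has by far the largest mean prediction error, and refitting without the flagged samples lowers the RMSEP on the clean test set. This mirrors the kind of improvement reported for the soy flour data, but it does not reproduce those figures.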
