Model population analysis in chemometrics

Abstract: Model population analysis (MPA) is a general framework for designing new types of chemometrics algorithms, and it has attracted increasing interest in the chemometrics community in recent years. The goal of MPA is to extract statistical information from a population of models in order to better understand chemical data. Its two key elements are random sampling and statistical analysis. The core idea of MPA is quite general, with potential applications in fields such as chemoinformatics, biostatistics and bioinformatics. In this article, we review the development of MPA in chemometrics. We first present the key elements of MPA. We then discuss its applications in chemometrics, including variable selection, model evaluation, outlier detection and applicability domain definition. Finally, we outline potential application areas of MPA for future research.
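The two key elements named above, random sampling and statistical analysis, can be illustrated with a minimal sketch. It is not taken from the article: the function name `mpa_coefficient_stability`, its parameters, and the choice of ordinary least squares (swapped in for the PLS models more common in chemometrics) are all assumptions made for illustration. The sketch draws random sub-datasets without replacement, fits one sub-model per draw, and then summarizes the resulting population of regression coefficients by a mean-to-standard-deviation ratio.

```python
import numpy as np


def mpa_coefficient_stability(X, y, n_models=200, sample_ratio=0.8, seed=0):
    """Hypothetical MPA-style sketch: sample sub-datasets, fit sub-models,
    then analyse the population of coefficients statistically.

    Assumes X and y are already centred, so no intercept is fitted.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_vars = X.shape
    n_sub = int(sample_ratio * n_samples)

    # Step 1 (random sampling): one row subset and one sub-model per draw.
    coefs = np.empty((n_models, n_vars))
    for i in range(n_models):
        idx = rng.choice(n_samples, size=n_sub, replace=False)
        # Ordinary least squares stands in for PLS here (an assumption).
        coefs[i] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]

    # Step 2 (statistical analysis): mean / std of each coefficient
    # across the population of sub-models.
    return coefs.mean(axis=0) / coefs.std(axis=0, ddof=1)


if __name__ == "__main__":
    # Synthetic data: only the first two of five variables carry signal.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 5))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=60)
    print(np.round(mpa_coefficient_stability(X, y), 1))
```

Variables whose coefficient keeps the same sign and magnitude across sub-models receive a large absolute ratio, whereas variables whose coefficient fluctuates around zero receive a small one. Other outputs of the sub-models, such as prediction errors, could be analysed in the same way.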
