Model-population analysis and its applications in chemical and biological modeling

Abstract: Model-population analysis (MPA) was recently proposed as a general framework for designing new types of chemometrics and bioinformatics algorithms, and it has found promising applications in chemistry and biology. The goal of MPA is to extract useful information from complex analytical systems, leading to better understanding and better modeling of chemical and biological data. To give an overall picture of MPA, we first review its key elements. We then describe the theory and applications of selected methods that address two fundamental aspects of chemical and biological modeling: outlier detection and variable selection. We highlight the key principles these methods share and pinpoint the critical differences between them.
