Variable selection for noisy data applied in proteomics

This paper proposes a variable selection method for proteomics. The goal is to select, from a set of proteins, those (called biomarkers) that discriminate between two groups of individuals (healthy and pathological). To this end, data are available for a cohort of individuals: for each individual, the biological state and the measured concentrations of a list of proteins. The proposed approach is based on a Bayesian hierarchical model of the dependencies between biological and instrumental variables. The optimal selection function minimizes the Bayesian risk; that is, the selected set of variables maximizes the posterior probability. The two main contributions are that (1) we do not impose ad hoc relationships between the variables, such as a logistic regression model, and (2) we account for instrumental variability through measurement noise. We are therefore dealing with indirect observations of a mixture of distributions, which results in intricate probability distributions. A closed-form expression of the posterior distributions cannot be derived. We thus discuss several approximations and study the robustness to the noise level. Finally, the method is evaluated on both simulated and clinical data.
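
The link between minimizing the Bayesian risk and maximizing the posterior probability can be made explicit with a short derivation. The sketch below assumes a 0-1 loss on the selected set, which the abstract does not state. The symbols $\delta$ (binary indicator vector of the selected proteins), $y$ (the observed data) and $\varphi$ (the selection function) are notation introduced here for illustration and are not taken from the paper.

```latex
% Assumed notation (not from the paper):
%   \delta \in \{0,1\}^P : indicator of which of the P proteins are biomarkers
%   y                    : observed data (biological states and noisy measurements)
%   \varphi              : selection function mapping y to a candidate \delta
\begin{align*}
L(\delta, \delta') &= 1 - \mathbb{1}[\delta = \delta']
    && \text{(assumed 0-1 loss)} \\
R(\varphi) &= \mathbb{E}_{\delta, y}\big[ L(\delta, \varphi(y)) \big]
    = 1 - \int \Pr\big(\delta = \varphi(y) \mid y\big)\, p(y)\, \mathrm{d}y \\
\hat{\varphi}(y) &= \arg\max_{\delta \in \{0,1\}^P} \Pr(\delta \mid y)
\end{align*}
```

Since the integrand can be maximized separately for each $y$, the risk-minimizing rule picks, for every data set, the subset with the highest posterior probability. Under a different loss the optimal rule would generally differ.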
