Peakbin Selection in Mass Spectrometry Data Using a Consensus Approach with Estimation of Distribution Algorithms

Progress is continuously being made in the quest for stable biomarkers linked to complex diseases. Mass spectrometers are among the instruments used to tackle this problem, but the data profiles they produce are noisy and unstable. In these profiles, biomarkers are detected as signal regions (peaks) where control and disease samples behave differently. Mass spectrometry (MS) data generally contain a limited number of samples described by a large number of features. In this work, we present a novel class of evolutionary algorithms, estimation of distribution algorithms (EDAs), as an efficient peak selector in the MS domain. There is a trade-off between the reliability of the detected biomarkers and the low number of samples available for analysis. For this reason, we introduce a consensus approach, built upon the classical EDA scheme, that improves the stability and robustness of the final set of relevant peaks. An entire data workflow is designed to yield unbiased results. Four publicly available MS data sets (two MALDI-TOF and two SELDI-TOF) are analyzed. The results are compared to those of the original works, and a new plot (the peak frequential plot) for graphically inspecting the relevant peaks is introduced. A complete online supplementary page, available at http://www.sc.ehu.es/ccwbayes/members/ruben/ms, includes extended information and results, as well as Matlab scripts and references.
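
The idea of a consensus built on top of an EDA can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which EDA, classifier, or resampling scheme is used, so everything below is an assumption chosen for brevity. The sketch uses the univariate marginal distribution algorithm (UMDA, one of the simplest EDAs), a nearest-centroid fitness with a hypothetical size penalty of 0.01 per selected peak, and a stratified bootstrap between runs. The function names (`fitness`, `umda_select`, `consensus_peaks`) and all parameter values are illustrative only.

```python
import numpy as np

def fitness(mask, X, y):
    """Nearest-centroid accuracy on (X, y) minus a small subset-size penalty.
    Assumed fitness for illustration; the source does not specify one."""
    if not mask.any():
        return 0.0
    Xs = X[:, mask]
    c0 = Xs[y == 0].mean(axis=0)
    c1 = Xs[y == 1].mean(axis=0)
    closer_to_1 = np.linalg.norm(Xs - c1, axis=1) < np.linalg.norm(Xs - c0, axis=1)
    return (closer_to_1.astype(int) == y).mean() - 0.01 * mask.sum()

def umda_select(X, y, rng, pop_size=50, n_best=25, n_gen=20):
    """One UMDA run: returns the best binary peak mask found."""
    n_feat = X.shape[1]
    p = np.full(n_feat, 0.5)  # independent selection probability per peak
    best_mask, best_fit = np.zeros(n_feat, dtype=bool), -np.inf
    for _ in range(n_gen):
        # Sample a population of candidate peak subsets from the current model.
        pop = rng.random((pop_size, n_feat)) < p
        fits = np.array([fitness(ind, X, y) for ind in pop])
        order = np.argsort(fits)[::-1]
        if fits[order[0]] > best_fit:
            best_fit, best_mask = fits[order[0]], pop[order[0]].copy()
        # Re-estimate the marginals from the best individuals; clipping keeps
        # every peak reachable in later generations.
        p = np.clip(pop[order[:n_best]].mean(axis=0), 0.05, 0.95)
    return best_mask

def consensus_peaks(X, y, n_runs=10, threshold=0.7, seed=0):
    """Repeat the EDA on stratified bootstrap samples and keep the peaks
    selected in at least `threshold` of the runs."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_runs):
        # Resample within each class so both classes are always present.
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=int((y == c).sum()), replace=True)
            for c in (0, 1)
        ])
        counts += umda_select(X[idx], y[idx], rng)
    freq = counts / n_runs
    return np.flatnonzero(freq >= threshold), freq
```

Given an intensity matrix `X` (samples by peaks) and binary labels `y`, `consensus_peaks(X, y)` returns the indices of the peaks that survive the frequency threshold together with the per-peak selection frequencies. Aggregating over several runs, rather than trusting a single run, is what is meant here to reduce the instability caused by the small number of samples.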
