biosigner: A New Method for the Discovery of Significant Molecular Signatures from Omics Data

High-throughput technologies such as transcriptomics, proteomics, and metabolomics show great promise for the discovery of biomarkers for diagnosis and prognosis. Selecting the most promising candidates between the initial untargeted step and the subsequent validation phases is a critical stage of the pipeline leading to clinical tests. Several statistical and data mining methods have been described for feature selection: in particular, wrapper approaches iteratively assess the performance of a classifier on distinct subsets of variables. Current wrappers, however, do not estimate the significance of the selected features. We therefore developed a new methodology to find the smallest feature subset that significantly contributes to model performance, using a combination of resampling, ranking of variable importance, significance assessment by permutation of the feature values in the test subsets, and half-interval search. We wrapped our biosigner algorithm around three reference binary classifiers (Partial Least Squares-Discriminant Analysis, Random Forest, and Support Vector Machines) whose relative performance has been shown to depend on the structure of the dataset. On three real biological and clinical metabolomics and transcriptomics datasets (containing up to 7000 features), complementary signatures were obtained in a few minutes, generally with higher prediction accuracies than the initial full model. Comparison with alternative feature selection approaches further indicated that our method provides signatures of restricted size and high stability. Finally, when we applied our methodology to seek metabolites discriminating type 1 from type 2 diabetic patients, several features were selected, including a fragment from the taurochenodeoxycholic bile acid.
Our methodology, implemented in the biosigner R/Bioconductor package and Galaxy/Workflow4metabolomics module, should help both experimenters and statisticians identify robust molecular signatures from large omics datasets when developing new diagnostics.
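
The four ingredients named above (resampling, ranking of variable importance, permutation of feature values in the test subsets, and half-interval search) can be illustrated with a minimal Python sketch. It is not the biosigner implementation, which is an R package wrapped around PLS-DA, Random Forest, and SVM. Everything here is an assumption made for illustration: the nearest-centroid stand-in classifier, the class-mean gap used as importance, the bootstrap scheme, the decision rule (a feature counts as significant when permuting it fails to lower test accuracy in at most a fraction `alpha` of resamplings), and all function names and defaults.

```python
import random
from statistics import mean

def fit_centroids(X, y, feats):
    """Per-class means over the columns in `feats` (stand-in classifier)."""
    return {c: [mean(x[j] for x, l in zip(X, y) if l == c) for j in feats]
            for c in sorted(set(y))}

def accuracy(cents, X, y, feats):
    """Fraction of rows whose nearest centroid carries the true label."""
    def predict(x):
        return min(cents, key=lambda c: sum((x[j] - m) ** 2
                                            for j, m in zip(feats, cents[c])))
    return sum(predict(x) == l for x, l in zip(X, y)) / len(y)

def rank(feats, X, y):
    """Order features by decreasing absolute gap between the two class means."""
    a, b = sorted(set(y))
    def gap(j):
        return abs(mean(x[j] for x, l in zip(X, y) if l == a)
                   - mean(x[j] for x, l in zip(X, y) if l == b))
    return sorted(feats, key=gap, reverse=True)

def is_significant(j, feats, X, y, rng, n_boot=50, alpha=0.05):
    """Permute feature `j` in each out-of-bag test subset of the model on `feats`."""
    n, failures, trials = len(X), 0, 0
    for _ in range(n_boot):
        bag = [rng.randrange(n) for _ in range(n)]       # bootstrap training set
        in_bag = set(bag)
        oob = [i for i in range(n) if i not in in_bag]   # held-out test subset
        if not oob:
            continue
        cents = fit_centroids([X[i] for i in bag], [y[i] for i in bag], feats)
        X_te, y_te = [list(X[i]) for i in oob], [y[i] for i in oob]
        intact = accuracy(cents, X_te, y_te, feats)
        shuffled = [x[j] for x in X_te]
        rng.shuffle(shuffled)
        for x, v in zip(X_te, shuffled):
            x[j] = v                                     # permuted test values
        failures += accuracy(cents, X_te, y_te, feats) >= intact
        trials += 1
    return trials > 0 and failures / trials <= alpha

def signature(X, y, seed=0):
    """Shrink the feature set until every retained feature is significant."""
    rng = random.Random(seed)
    feats = list(range(len(X[0])))
    while feats:
        ranked = rank(feats, X, y)
        # Half-interval search, assuming significance decreases with rank:
        # ranked[:lo] are treated as significant, ranked[hi:] as not.
        lo, hi = 0, len(ranked)
        while lo < hi:
            mid = (lo + hi) // 2
            if is_significant(ranked[mid], ranked, X, y, rng):
                lo = mid + 1
            else:
                hi = mid
        if lo == len(ranked):        # nothing was discarded: stable signature
            return sorted(ranked)
        feats = ranked[:lo]
    return []

def toy_data(seed=1, n=40, n_noise=5):
    """Synthetic two-class data: column 0 is informative, the rest are noise."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(6.0 * l, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
         for l in y]
    return X, y

X, y = toy_data()
print(signature(X, y))
```

The half-interval search keeps the number of permutation tests logarithmic in the number of candidate features, at the cost of assuming that significance is monotone in the importance ranking. The outer loop re-ranks and re-tests within the reduced model, so a feature is retained only if it still contributes once the discarded features are gone.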
