BOSO: A novel feature selection algorithm for linear regression with high-dimensional data

Motivation: With the rapid growth of high-dimensional datasets across biomedical domains, there is an urgent need for predictive methods able to deal with this complexity. Feature selection is a relevant machine learning strategy for addressing this challenge.

Results: We introduce a novel feature selection algorithm for linear regression called BOSO (Bilevel Optimization Selector Operator). We benchmarked BOSO against key algorithms in the literature and found superior performance on high-dimensional datasets. We present a proof of concept of BOSO for predicting drug sensitivity in cancer, with a detailed analysis of methotrexate, a well-studied drug targeting cancer metabolism.

Availability: A MATLAB implementation of BOSO is available as Supplementary Material.

Contact: fplanes@tecnun.es

Supplementary Information: Supplementary data are available at Bioinformatics online.
