Scaling tree-based automated machine learning to biomedical big data with a dataset selector

Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, the Tree-based Pipeline Optimization Tool (TPOT) was developed using strongly typed genetic programming to recommend an optimized analysis pipeline for the data scientist’s prediction problem. However, TPOT may reach computational resource limits when working on big data such as whole-genome expression data. We introduce two new features implemented in TPOT that help increase the system’s scalability: Dataset selector and Template. Dataset selector (DS) provides the option to specify subsets of the features as separate datasets, assuming the signals come from one or more of these specific data subsets. Built in at the beginning of each pipeline structure, DS reduces TPOT’s computational expense because each pipeline is evaluated on only a smaller subset of the data rather than the entire dataset. Consequently, DS increases TPOT’s efficiency on big data by slicing the dataset into smaller sets of features and allowing genetic programming to select the best subset in the final pipeline. Template enforces type constraints with strongly typed genetic programming and enables the incorporation of DS at the beginning of each pipeline. We show that DS and Template help reduce TPOT computation time and may provide more interpretable results. Our simulations show that TPOT-DS significantly outperforms a tuned XGBoost model and the standard TPOT implementation. We apply TPOT-DS to real RNA-Seq data from a study of major depressive disorder. A previous study identified a significant association between depression severity and the enrichment scores of two modules; independently of that study and in an automated fashion, TPOT-DS corroborates that one of these modules is largely predictive of the clinical diagnosis of each individual.
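The dataset selector idea described above can be illustrated with a minimal sketch. This is not TPOT's implementation: it uses only the Python standard library, replaces genetic programming with an exhaustive loop over the candidate subsets, and scores each subset with a toy leave-one-out nearest-centroid classifier. All names (`select_columns`, `best_subset`, `module_A`, `g1`, and so on) and the data are hypothetical. The fixed order "select a subset, then classify" loosely mirrors what a template constraint would enforce.

```python
from statistics import mean


def select_columns(rows, feature_names, subset):
    """Dataset-selector step: keep only the columns named in `subset`."""
    keep = [i for i, name in enumerate(feature_names) if name in subset]
    return [[row[i] for i in keep] for row in rows]


def nearest_centroid_loo_accuracy(rows, labels):
    """Toy scorer: leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i, (row, label) in enumerate(zip(rows, labels)):
        centroids = {}
        for cls in sorted(set(labels)):
            # Class members, excluding the held-out sample i.
            members = [r for j, (r, l) in enumerate(zip(rows, labels))
                       if j != i and l == cls]
            if members:
                centroids[cls] = [mean(col) for col in zip(*members)]
        # Predict the class whose centroid is closest (squared distance).
        predicted = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
        )
        correct += predicted == label
    return correct / len(rows)


def best_subset(rows, labels, feature_names, subsets):
    """Score every predefined feature subset and return the best one.

    Assumption: an exhaustive loop stands in for the genetic-programming
    search, which would instead evolve the choice of subset.
    """
    scores = {
        name: nearest_centroid_loo_accuracy(
            select_columns(rows, feature_names, genes), labels)
        for name, genes in subsets.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy data: g1 and g2 carry the signal, g3 and g4 are noise.
feature_names = ["g1", "g2", "g3", "g4"]
rows = [
    [0.1, 0.2, 5.0, 1.0],
    [0.2, 0.1, 1.0, 5.0],
    [0.0, 0.3, 3.0, 3.0],
    [1.1, 1.2, 5.1, 0.9],
    [1.0, 0.9, 0.9, 5.1],
    [1.2, 1.0, 3.1, 2.9],
]
labels = [0, 0, 0, 1, 1, 1]
subsets = {"module_A": {"g1", "g2"}, "module_B": {"g3", "g4"}}

best_name, subset_scores = best_subset(rows, labels, feature_names, subsets)
print(best_name)
```

On this toy data the informative subset (`module_A`) is selected, and each candidate is scored on two columns rather than all four, which is the source of the computational saving the abstract describes.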
Author Summary

Big data have recently become prevalent in many fields including meteorology, complex physics simulations, large-scale imaging, genomics, biomedical research, environmental research and more. TPOT is a Python Automated Machine Learning (AutoML) tool that uses genetic programming to optimize machine learning pipelines for analyzing biomedical data. However, like other AutoML tools, the early implementations of TPOT face several challenges when analyzing big data: long runtime, high computational expense and complex pipelines with low interpretability. Here, we develop two novel features for TPOT, Dataset Selector and Template, that leverage domain knowledge, greatly reduce the computational expense and flexibly extend TPOT’s application to biomedical big data analysis.
