Planning bioinformatics workflows using an expert system

Motivation: Bioinformatic analyses are becoming considerably more complex, both because more steps are required to process the data and because the number of methods available for each step keeps growing. Pipelines are commonly employed to alleviate this difficulty. However, a pipeline is typically implemented to automate one specific analysis, which makes it hard to use for exploratory analyses that require systematic changes to the software or parameters.

Results: To automate the development of pipelines, we investigated expert systems. We created the Bioinformatics ExperT SYstem (BETSY), which includes a knowledge base where the capabilities of bioinformatics software are explicitly and formally encoded. BETSY is a backward-chaining, rule-based expert system composed of a data model that can capture the richness of biological data and an inference engine that reasons over the knowledge base to produce workflows. Currently, the knowledge base is populated with rules to analyze microarray and next-generation sequencing data. We evaluated BETSY and found that it could generate workflows that reproduce and go beyond previously published bioinformatics results. Finally, a meta-investigation of the workflows generated from the knowledge base produced a quantitative measure of the technical burden imposed by each step of a bioinformatics analysis, revealing the large number of steps devoted to pre-processing data. In sum, an expert system approach can facilitate exploratory bioinformatic analysis by automating the development of workflows, a task that requires significant domain expertise.

Availability and Implementation: https://github.com/jefftc/changlab

Contact: jeffrey.t.chang@uth.tmc.edu
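To illustrate the general idea of backward chaining over a knowledge base of software capabilities, the following is a minimal sketch in Python. It is not BETSY's implementation or API: the `Rule` class, the `plan` function, and the toy rule names and data types are all hypothetical, and each rule is assumed to take a single input, whereas a real data model would carry richer attributes. The planner starts from the requested output and works backward through rules until it reaches data the user already has.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    """One knowledge-base entry (hypothetical): a module and what it converts."""
    name: str      # illustrative module name, not a real BETSY module
    consumes: str  # data type the module needs as input
    produces: str  # data type the module emits

# Toy knowledge base; names and data types are assumptions for illustration.
RULES = [
    Rule("align_reads", "fastq", "bam"),
    Rule("call_variants", "bam", "vcf"),
    Rule("normalize_arrays", "cel_files", "expression_matrix"),
    Rule("cluster_samples", "expression_matrix", "heatmap"),
]

def plan(goal, available, rules, _seen=frozenset()):
    """Backward-chain from `goal` to data in `available`.

    Returns the rule names in execution order, or None if no chain exists.
    """
    if goal in available:
        return []      # goal is already on hand; nothing to run
    if goal in _seen:
        return None    # guard against cyclic rule chains
    for rule in rules:
        if rule.produces != goal:
            continue
        # Recursively satisfy this rule's input before running the rule.
        upstream = plan(rule.consumes, available, rules, _seen | {goal})
        if upstream is not None:
            return upstream + [rule.name]
    return None

if __name__ == "__main__":
    print(plan("vcf", {"fastq"}, RULES))
    print(plan("heatmap", {"fastq"}, RULES))
```

With the toy rules above, requesting `vcf` from `fastq` yields a two-step workflow, while requesting `heatmap` from `fastq` returns `None` because no rule chain connects them. Swapping in an alternative rule for one step changes the generated workflow without rewriting a pipeline by hand, which is the kind of exploratory flexibility the abstract describes.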
