SegMine workflows for semantic microarray data analysis in Orange4WS

Background: In experimental data analysis, bioinformatics researchers increasingly rely on tools that enable the composition and reuse of scientific workflows. The utility of current bioinformatics workflow environments can be significantly increased by offering advanced data mining services as workflow components. Such services can support, for instance, knowledge discovery from diverse distributed data and knowledge sources (such as GO, KEGG, PubMed, and experimental databases). Specifically, cutting-edge data analysis approaches, such as semantic data mining, link discovery, and visualization, have not yet been made available to researchers investigating complex biological datasets.

Results: We present a new methodology, SegMine, for semantic analysis of microarray data by exploiting general biological knowledge, and a new workflow environment, Orange4WS, with integrated support for web services in which the SegMine methodology is implemented. The SegMine methodology consists of two main steps. First, the semantic subgroup discovery algorithm is used to construct elaborate rules that identify enriched gene sets. Then, a link discovery service is used for the creation and visualization of new biological hypotheses. The utility of SegMine, implemented as a set of workflows in Orange4WS, is demonstrated in two microarray data analysis applications. In the analysis of senescence in human stem cells, the use of SegMine resulted in three novel research hypotheses that could improve understanding of the underlying mechanisms of senescence and identification of candidate marker genes.

Conclusions: Compared to the available data analysis systems, SegMine offers improved hypothesis generation and data interpretation for bioinformatics in an easy-to-use integrated workflow environment.
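To make the notion of an "enriched gene set" in the first SegMine step more concrete, the following is a minimal sketch of a plain over-representation test based on the hypergeometric distribution. It is not the semantic subgroup discovery algorithm used by SegMine, which constructs rules over ontology terms; it only illustrates the underlying idea of asking whether a gene set contains more differentially expressed genes than expected by chance. The function names, the gene identifiers, and the significance threshold `alpha` are all assumptions made for illustration.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Upper-tail probability P(X >= k) for a hypergeometric variable.

    N: size of the gene universe
    K: number of genes in the gene set
    n: number of differentially expressed (DE) genes
    k: number of DE genes that fall inside the gene set
    """
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


def enriched_gene_sets(de_genes, gene_sets, universe, alpha=0.05):
    """Return (name, overlap, p-value) for gene sets with p <= alpha.

    Results are sorted by ascending p-value. No multiple-testing
    correction is applied in this sketch.
    """
    universe = set(universe)
    de = set(de_genes) & universe
    results = []
    for name, members in gene_sets.items():
        members = set(members) & universe
        overlap = len(de & members)
        p = hypergeom_pvalue(overlap, len(de), len(members), len(universe))
        if p <= alpha:
            results.append((name, overlap, p))
    return sorted(results, key=lambda r: r[2])


# Hypothetical toy data: 20 genes, 5 of them differentially expressed.
universe = [f"g{i}" for i in range(20)]
de_genes = ["g0", "g1", "g2", "g3", "g4"]
gene_sets = {
    "set_A": ["g0", "g1", "g2", "g3"],      # fully contained in the DE list
    "set_B": ["g10", "g11", "g12", "g13"],  # no overlap with the DE list
}
results = enriched_gene_sets(de_genes, gene_sets, universe)
```

In this toy example only `set_A` passes the threshold. A real analysis would additionally correct for multiple testing and, as in SegMine, use ontology-derived gene set definitions rather than hand-written lists.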
