The BioDICE Taverna plugin for clustering and visualization of biological data: a workflow for molecular compounds exploration

Background: In many experimental pipelines, clustering of multidimensional biological datasets is used to detect hidden structures in unlabelled input data. Taverna is a popular workflow management system used to design and execute scientific workflows and to aid in silico experimentation. The availability of fast unsupervised methods for clustering and visualization in the Taverna platform is important to support data-driven scientific discovery in complex and exploratory bioinformatics applications.

Results: This work presents a Taverna plugin, the Biological Data Interactive Clustering Explorer (BioDICE), that performs clustering of high-dimensional biological data and provides a nonlinear, topology-preserving projection for visualizing the input data and their similarities. The core algorithm in the BioDICE plugin is the Fast Learning Self Organizing Map (FLSOM), an improved variant of the Self Organizing Map (SOM) algorithm. The plugin generates an interactive 2D map that allows the visual exploration of multidimensional data and the identification of groups of similar objects. The effectiveness of the plugin is demonstrated on a case study related to chemical compounds.

Conclusions: Taverna's extensibility and the number and variety of its available tools have made it a popular choice for the development of scientific data workflows. This work presents a novel plugin, BioDICE, which adds a data-driven knowledge discovery component to Taverna. BioDICE provides an effective and powerful clustering tool that can be adopted for the exploratory analysis of biological datasets.
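The abstract names the Self Organizing Map as the algorithm that FLSOM improves on. For orientation, the following is a minimal sketch of the standard online SOM training loop in plain Python. It is not FLSOM and not the BioDICE implementation, neither of which is specified in this abstract. The function names `train_som` and `best_matching_unit`, the linear decay schedules, and all default parameter values are assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the (row, col) of the unit whose weight vector is closest to x."""
    best, best_d = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(w, x))  # squared Euclidean distance
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a plain online SOM on a list of equal-length numeric vectors.

    Returns the map as weights[row][col] -> weight vector.
    The linear decay of learning rate and radius is an illustrative assumption.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # Initialise every unit with a copy of a randomly chosen input sample.
    weights = [[list(rng.choice(data)) for _ in range(cols)] for _ in range(rows)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate shrinks over time
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood radius shrinks too
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    # Gaussian neighbourhood measured on the 2D grid, not in data space.
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights
```

After training, each input vector can be placed on the 2D grid by calling `best_matching_unit(weights, x)`. Inputs that land on the same or adjacent units are similar in the original space, which is the property that a map-based visual exploration relies on.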
