[KD3] A Workflow-Based Application for Exploration of Biomedical Data Sets

Driven by the biotechnological revolution of recent years, molecular biology has become increasingly data-driven. Knowledge Discovery in Databases, a well-known process in the field of bioinformatics, supports the biological research process from data integration and knowledge mining to data interpretation. This work proposes a new software suite, termed Knowledge Discovery in Databases Designer (KD3), which covers the complete Knowledge Discovery in Databases process using a workflow-oriented architecture. KD3 implements three application-oriented modules. The Designer is used to design specific workflows. The Interpreter allows users to load and parameterize existing workflows. The Launcher encapsulates one dedicated workflow in an independent application that answers one specific biomedical question. KD3 offers a variety of implemented methods and can easily be extended with new customized components using functional objects. All components can be connected into workflows, which may also contain elements of other applications.
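The division of labor between the three modules can be illustrated with a minimal sketch. It is not KD3's actual API: every name in it (`Workflow`, `add`, `run`, `make_launcher`, and the three toy components) is hypothetical, and components are modeled as plain functions as one possible reading of "functional objects".

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

# All names here are hypothetical illustrations, not KD3's real interface.
# A component is a callable taking the data and a dict of parameters.
Component = Callable[[Any, Dict[str, Any]], Any]


@dataclass
class Workflow:
    """Designer role: an ordered chain of named components."""
    steps: List[Tuple[str, Component, Dict[str, Any]]] = field(default_factory=list)

    def add(self, name: str, component: Component, **defaults: Any) -> "Workflow":
        self.steps.append((name, component, defaults))
        return self

    def run(self, data: Any,
            overrides: Optional[Dict[str, Dict[str, Any]]] = None) -> Any:
        """Interpreter role: execute the chain, re-parameterizing steps by name."""
        overrides = overrides or {}
        for name, component, defaults in self.steps:
            params = {**defaults, **overrides.get(name, {})}
            data = component(data, params)
        return data


def drop_missing(rows, params):
    # Remove samples that contain a missing value.
    return [row for row in rows if None not in row]


def scale(rows, params):
    # Multiply every value by the configured factor.
    factor = params["factor"]
    return [[value * factor for value in row] for row in rows]


def column_means(rows, params):
    # Summarize each column by its arithmetic mean.
    return [sum(column) / len(rows) for column in zip(*rows)]


def make_launcher(workflow: Workflow, overrides: Dict[str, Dict[str, Any]]):
    """Launcher role: freeze one workflow and its parameters into one callable."""
    def launch(data):
        return workflow.run(data, overrides)
    return launch


workflow = (
    Workflow()
    .add("clean", drop_missing)
    .add("scale", scale, factor=1.0)
    .add("summarize", column_means)
)

samples = [[1.0, 2.0], [3.0, None], [5.0, 6.0]]
default_result = workflow.run(samples)
scaled_result = workflow.run(samples, {"scale": {"factor": 2.0}})
launched_result = make_launcher(workflow, {"scale": {"factor": 10.0}})(samples)
```

In this sketch a new customized component is simply another function with the same two-argument signature, so extending the chain requires no change to the `Workflow` class itself.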
