Parallel and Distributed Data Pipelining with KNIME

In recent years a new category of data analysis applications has evolved, known as data pipelining tools, which enable even non-experts to perform complex analysis tasks on potentially huge amounts of data. Because the analysis processes and methods involved are complex and compute-intensive, it is often neither sufficient nor possible to rely simply on performance gains in single processors. Parallel and distributed approaches that accelerate the analysis process are promising solutions to this problem. In this paper we discuss the potential for parallelization and distribution in pipelining tools by demonstrating several parallel and distributed implementations in the open-source pipelining platform KNIME. We verify their practical applicability in a number of real-world experiments.
