YALE: rapid prototyping for complex data mining tasks

KDD is a complex and demanding task. While a large number of methods have been established for numerous problems, many challenges remain to be solved. New tasks emerge, requiring the development of new methods or processing schemes. As in software development, the development of such solutions demands careful analysis, specification, implementation, and testing. Rapid prototyping is an approach which allows crucial design decisions to be made as early as possible. A rapid prototyping system should support maximal re-use and innovative combinations of existing methods, as well as simple and quick integration of new ones.

This paper describes Yale, a free open-source environment for KDD and machine learning. Yale provides a rich variety of methods which allows rapid prototyping for new applications and makes costly re-implementations unnecessary. Additionally, Yale offers extensive functionality for process evaluation and optimization, which is a crucial property for any KDD rapid prototyping tool. Following the paradigm of visual programming eases the design of processing schemes. While the graphical user interface supports interactive design, the underlying XML representation enables automated applications after the prototyping phase.

After a discussion of the key concepts of Yale, we illustrate the advantages of rapid prototyping for KDD on case studies ranging from data pre-processing to result visualization. These case studies cover tasks like feature engineering, text mining, data stream mining and tracking drifting concepts, ensemble methods, and distributed data mining. This variety of applications is also reflected in a broad user base: we counted more than 40,000 downloads during the last twelve months.
