Debellor: Open Source Modular Platform for Scalable Data Mining

This paper introduces Debellor (www.debellor.org), an open source, extensible data mining platform with a stream-oriented architecture in which all data transfers between elementary algorithms take the form of a stream of samples. Data streaming enables the implementation of scalable algorithms that can efficiently process volumes of data exceeding available memory. This is important for data mining research and applications, since the most challenging data mining tasks involve voluminous data, either produced by a data source or generated at some intermediate stage of a complex data processing network. The advantages of data streaming are illustrated by experiments with clustering time series. The experimental results show that even for moderate-size data sets streaming is indispensable for successful execution of algorithms; without it, the algorithms run hundreds of times slower or simply crash due to memory shortage. A stream architecture is particularly useful in application domains such as time series analysis, image recognition, and mining data streams. It is also the only efficient architecture for implementing online algorithms. Owing to its scalability and modularity, Debellor was chosen as the basis for the TunedTester application, one of the three pillars of the TunedIT system (tunedit.org) for automatic evaluation of machine learning algorithms. The current version of Debellor is 0.6.2.
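The stream-of-samples idea described in the abstract can be illustrated with a minimal sketch. This is not Debellor's API (Debellor is a separate platform with its own interfaces); it is a generic Python illustration using generators, and the names `source`, `center`, and `running_mean` are hypothetical. Each stage pulls one sample at a time from the previous stage, so the full data set is never held in memory, and the final stage is an online algorithm that updates its result incrementally.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Sample = List[float]  # one time series = one sample in the stream


def source(n: int, length: int) -> Iterator[Sample]:
    """Hypothetical data source: yields n synthetic time series, one at a time."""
    for i in range(n):
        yield [float((i + t) % 5) for t in range(length)]


def center(stream: Iterable[Sample]) -> Iterator[Sample]:
    """Intermediate stage: subtracts each series' own mean, sample by sample."""
    for s in stream:
        m = sum(s) / len(s)
        yield [x - m for x in s]


def running_mean(stream: Iterable[Sample]) -> Tuple[Optional[Sample], int]:
    """Online consumer: element-wise mean of all series, updated incrementally.

    Only the current mean and a counter are stored, so memory use does not
    grow with the number of samples in the stream.
    """
    mean: Optional[Sample] = None
    count = 0
    for s in stream:
        count += 1
        if mean is None:
            mean = list(s)
        else:
            mean = [m + (x - m) / count for m, x in zip(mean, s)]
    return mean, count


if __name__ == "__main__":
    # Stages are chained lazily; nothing is computed until the consumer pulls.
    mean, count = running_mean(center(source(100_000, 5)))
    print(count, mean)
```

Because the stages are connected lazily, increasing the number of series in `source` increases running time but not peak memory, which is the property the abstract attributes to stream-oriented processing.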
