Challenges in Learning from Streaming Data (Extended Abstract)

Machine learning studies automatic methods for acquiring domain knowledge, with the goal of improving system performance as the result of experience. In the past two decades, machine learning research and practice have focused on batch learning, usually with small data sets. The rationale behind this practice is that examples are generated at random according to some stationary probability distribution. Most learners use a greedy, hill-climbing search in the space of models. They are prone to overfitting, local maxima, etc. Data are scarce and statistical estimates have high variance. A paradigmatic example is the TDIDT (top-down induction of decision trees) algorithm for learning decision trees [14]. As the tree grows, fewer and fewer examples are available to compute the sufficient statistics, so the variance of the estimates increases, leading to model instability. Moreover, the growing process re-uses the same data, exacerbating the overfitting problem. Regularization and pruning mechanisms are therefore mandatory.
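The effect of shrinking node samples on variance can be illustrated with a minimal sketch. It assumes, purely for illustration, a batch of `n_total` examples that is split evenly at every level of a binary tree, and a class proportion `p` estimated at each node; the function names and parameters are hypothetical and not taken from the paper.

```python
import math


def examples_per_node(n_total: int, depth: int, branching: int = 2) -> float:
    """Expected number of examples reaching one node at `depth`.

    Assumption: every split is perfectly balanced, so each level divides
    the available examples by `branching`.
    """
    return n_total / (branching ** depth)


def proportion_std_error(p: float, n: float) -> float:
    """Standard error of a class-proportion estimate from n i.i.d. examples."""
    return math.sqrt(p * (1.0 - p) / n)


def std_error_by_depth(n_total: int, p: float, max_depth: int,
                       branching: int = 2) -> list[float]:
    """Standard error of the proportion estimate at depths 0..max_depth."""
    return [
        proportion_std_error(p, examples_per_node(n_total, d, branching))
        for d in range(max_depth + 1)
    ]


if __name__ == "__main__":
    # Hypothetical batch of 1024 examples with a true class proportion of 0.5.
    for depth, se in enumerate(std_error_by_depth(1024, 0.5, 8)):
        n = examples_per_node(1024, depth)
        print(f"depth={depth}  examples={n:7.1f}  std_error={se:.4f}")
```

Under these assumptions the standard error grows by a factor of sqrt(2) per level, which is one way to see why estimates deep in a tree grown from a small batch are unstable.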

[1] João Gama, et al. Issues in evaluation of stream learning algorithms, 2009, KDD.

[2] Hillol Kargupta, et al. Mining decision trees from data streams in a mobile environment, 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[3] Mohamed Medhat Gaber, et al. Knowledge discovery from data streams, 2009, IDA 2009.

[4] Peter A. Flach, et al. Evaluation Measures for Multi-class Subgroup Discovery, 2009, ECML/PKDD.

[5] Philip S. Yu, et al. A framework for resource-aware knowledge discovery in data streams: a holistic approach with its application to clustering, 2006, SAC '06.

[6] Haimonti Dutta, et al. Orthogonal decision trees, 2006, IEEE Transactions on Knowledge and Data Engineering.

[7] Ricard Gavaldà, et al. Adaptive XML Tree Classification on Evolving Data Streams, 2009, ECML/PKDD.

[8] Assaf Schuster, et al. A geometric approach to monitoring threshold functions over distributed data streams, 2007, ACM Trans. Database Syst.

[9] Shai Ben-David, et al. Detecting Change in Data Streams, 2004, VLDB.

[10] J. Ross Quinlan, et al. C4.5: Programs for Machine Learning, 1992.

[11] Gert Cauwenberghs, et al. Incremental and Decremental Support Vector Machine Learning, 2000, NIPS.

[12] Rong Chen, et al. Collective Mining of Bayesian Networks from Distributed Heterogeneous Data, 2004, Knowl. Inf. Syst.

[13] Mohamed Medhat Gaber, et al. Cost-Efficient Mining Techniques for Data Streams, 2004, ACSW.

[14] Yelena Yesha, et al. Data Mining: Next Generation Challenges and Future Directions, 2004.

[15] Geoff Hulten, et al. Catching up with the Data: Research Issues in Mining Data Streams, 2001, DMKD.

[16] Ricard Gavaldà, et al. Mining adaptively frequent closed unlabeled rooted trees in data streams, 2008, KDD.