Issues and Challenges in Learning from Data Streams (Extended Abstract)

In the last two decades, machine learning research and practice have focused on batch learning, usually with small datasets. In batch learning, the entire training set is available to the algorithm, which outputs a decision model after processing the data, often multiple times. The rationale behind this practice is that examples are generated at random according to some stationary probability distribution. Most learners use a greedy, hill-climbing search in the space of models. The development of information and communication technologies has dramatically changed data collection and processing methods. Advances in miniaturization and sensor technology have led to sensor networks that collect detailed spatio-temporal data about the environment.

An illustrative application is the problem of mining data produced by sensors distributed throughout electrical-power distribution networks. These sensors produce streams of data at high speed. From a data mining perspective, this problem is characterized by a large number of variables (sensors) producing a continuous flow of data in a dynamic, non-stationary environment. Companies analyze these data streams to make decisions on several problems. First, they are interested in identifying critical points in load evolution, e.g. peaks in demand. This task is related to the detection of anomalies, extreme values, failures, outliers, and abnormal activities. Another problem is change detection in the behavior (correlation) of sensors. Cluster analysis can be used to identify groups of highly correlated sensors, corresponding to common behaviors or profiles (e.g. urban, rural, industrial). Finally, decisions to buy or sell energy are based on predictions of the value measured by each sensor for different time horizons. All these problems illustrate some of the requirements and objectives usually inherent to ubiquitous computing.
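
The correlation between two sensors, which underlies both the change-detection and the clustering problems described above, can be maintained incrementally without storing past readings. The following Python sketch is illustrative only and is not taken from the paper: the class name `StreamingCorrelation` and its methods are hypothetical, and it assumes that the two sensors report synchronized readings.

```python
import math


class StreamingCorrelation:
    """Pearson correlation of two sensor streams in O(1) memory.

    Hypothetical helper for illustration; assumes paired, synchronized readings.
    """

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0  # running sum of squared deviations of x
        self.m2_y = 0.0  # running sum of squared deviations of y
        self.c_xy = 0.0  # running co-moment of x and y

    def update(self, x, y):
        """Fold one pair of readings into the running statistics."""
        self.n += 1
        dx = x - self.mean_x          # deviation from the previous mean of x
        dy = y - self.mean_y          # deviation from the previous mean of y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.c_xy += dx * (y - self.mean_y)

    def correlation(self):
        """Return the current correlation, or 0.0 if it is undefined."""
        denom = math.sqrt(self.m2_x * self.m2_y)
        return self.c_xy / denom if denom > 0 else 0.0
```

One possible use is to keep one such object per pair of sensors and feed the resulting correlations to a clustering procedure. Because the estimate covers the whole history of the stream, detecting changes in a non-stationary environment would additionally require a windowing or forgetting mechanism, which this sketch omits.
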
Sensors produce a continuous flow of data, are limited in resources such as memory and computational power, and communicate over links that are easily constrained by distance and hardware limitations. Moreover, given the limited resources and the fast production of data, information must be processed in real time, creating a scenario of multidimensional streaming analysis. In this article we discuss the issues and challenges in learning from data streams. We discuss the limitations of current learning systems and point out possible research lines for next-generation data mining systems. How can we learn from these distributed, continuous data streams? What are the main characteristics of a learning algorithm acting in sensor networks? What are the relevant issues, challenges, and research opportunities?
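
To make the resource constraints concrete, the sketch below processes a stream in a single pass with constant memory, flagging readings that deviate strongly from the running mean (for example, peaks in demand). It is a minimal illustration under stated assumptions, not a method proposed in this abstract: it uses Welford's online algorithm for the mean and variance, and the names `RunningStats` and `flag_peaks`, the `threshold` of three standard deviations, and the `warmup` length are all hypothetical choices.

```python
import math


class RunningStats:
    """One-pass mean and standard deviation (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; 0.0 until at least two readings are seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def flag_peaks(readings, threshold=3.0, warmup=10):
    """Yield (index, value) for readings far from the running mean.

    Each reading is examined once and then discarded, so memory use does
    not grow with the length of the stream. The threshold and warmup
    values are illustrative assumptions.
    """
    stats = RunningStats()
    for t, x in enumerate(readings):
        s = stats.std()
        if stats.n >= warmup and s > 0 and abs(x - stats.mean) > threshold * s:
            yield t, x
        stats.update(x)
```

For instance, `list(flag_peaks(sensor_readings))` would collect the flagged readings from any iterable of numbers, where `sensor_readings` is a placeholder name. The running statistics assume a roughly stationary stream; in the non-stationary setting discussed here, they would need to be combined with a sliding window or a forgetting factor.
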