Monitoring Incremental Histogram Distribution for Change Detection in Data Streams

Histograms are a common technique for density estimation and have been widely used as a tool in exploratory data analysis. Learning histograms from static and stationary data is a well-known topic. Nevertheless, very few works discuss this problem when the data arrive as a continuous flow generated by a dynamic environment. The scope of this paper is the detection of changes in high-speed, time-changing data streams. To address this problem, we construct histograms that process each example once, at the rate at which examples arrive. The main goal of this work is to continuously maintain a histogram consistent with the current state of nature. We study strategies to detect changes in the distribution generating the examples and to adapt the histogram to the most recent data by forgetting outdated data. We use the Partition Incremental Discretization algorithm, which was designed to learn histograms from high-speed data streams. We present a method to detect when a change occurs in the distribution generating the examples. The basic idea is to monitor the distributions in two different time windows: the reference window, which reflects the distribution observed in the past, and the current window, which receives the most recent data. The current window is cumulative and can have a fixed or an adaptive step, depending on the distance between the distributions. We compare the two distributions using the Kullback-Leibler divergence and define a threshold for the change detection decision based on the asymmetry of this measure. We evaluate our algorithm on controlled artificial data sets and compare the proposed approach with nonparametric tests. We also present results on real-world data sets from industrial and medical domains. These results suggest that an adaptive window step yields a high probability of change detection and faster detection, with few false positive alarms.
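The abstract does not state the exact decision rule, so the following is only a minimal sketch of the two-window comparison under stated assumptions. It assumes both windows are summarized as bin counts over the same intervals, that empty bins are smoothed with a small constant, and that a change is flagged when the absolute difference between the two directed Kullback-Leibler divergences exceeds a user-chosen threshold. The names `normalize`, `kl_divergence`, `kl_asymmetry`, `change_detected`, and the parameters `eps` and `threshold` are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def normalize(counts: Sequence[float], eps: float = 1e-9) -> list:
    """Turn bin counts into probabilities; eps (an assumption) avoids log(0)."""
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Directed Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def kl_asymmetry(ref_counts: Sequence[float], cur_counts: Sequence[float]) -> float:
    """|KL(ref || cur) - KL(cur || ref)| for two histograms over the same bins."""
    p = normalize(ref_counts)
    q = normalize(cur_counts)
    return abs(kl_divergence(p, q) - kl_divergence(q, p))


def change_detected(ref_counts: Sequence[float],
                    cur_counts: Sequence[float],
                    threshold: float) -> bool:
    """Assumed rule: signal a change when the asymmetry exceeds the threshold."""
    return kl_asymmetry(ref_counts, cur_counts) > threshold


# Illustrative bin counts only, not data from the paper.
reference = [40, 30, 20, 10]
stable = [40, 30, 20, 10]
shifted = [1, 1, 1, 97]

print(change_detected(reference, stable, threshold=0.1))
print(change_detected(reference, shifted, threshold=0.1))
```

In a streaming setting, the current-window counts would be refreshed every step (fixed or adaptive) and the check repeated; on detection, the reference window would be replaced by recent data. How the step adapts and how the threshold is derived are left open here because the abstract does not specify them.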
