Fading histograms in detecting distribution and concept changes

The growing number of real-world applications in dynamic scenarios is driving a new ability to generate and gather information. Massive amounts of information are now generated at high speed, in the form of data streams, and these data are collected in evolving environments. Because of memory restrictions, data must be processed promptly and then discarded. Dealing with evolving data streams therefore raises two main questions: (i) how to remember discarded data, and (ii) how to forget outdated data. To maintain an updated representation of time-evolving data, this paper proposes fading histograms. Changes in the data are detected through a windowing scheme, the adaptive cumulative windows model (ACWM), which compares data distributions computed by the fading histograms. The distance between data distributions is monitored online using a dissimilarity measure based on the asymmetry of the Kullback–Leibler divergence. The experimental results support the ability of fading histograms to provide an updated representation of the data. This property favors the detection of distribution changes, with a smaller detection delay than standard histograms. For the detection of concept changes, the ACWM is compared with three known algorithms from the literature, on artificial data and on public data sets, and presents better results. Furthermore, the proposed method was extended to multidimensional data, and the experiments performed show the ability of the ACWM to detect distribution changes in these settings.
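As a rough illustration of the two ideas named in the abstract, the sketch below shows one way a fading histogram and a Kullback–Leibler asymmetry score might look. The abstract gives no formulas, so everything here is an assumption made for illustration: the update rule (multiply all bin counts by a fading factor `alpha`, then add 1 to the bin of the new observation), the equal-width bins, the smoothing constant `eps`, and the score |KL(p||q) - KL(q||p)|. The names `FadingHistogram` and `kl_asymmetry` are hypothetical, and the ACWM windowing logic itself is not reproduced.

```python
import math


class FadingHistogram:
    """Equal-width histogram whose old counts decay (assumed update rule)."""

    def __init__(self, lo, hi, n_bins, alpha=0.99):
        self.lo, self.hi, self.n_bins, self.alpha = lo, hi, n_bins, alpha
        self.counts = [0.0] * n_bins

    def update(self, x):
        # Locate the bin of x, clamping values outside [lo, hi).
        width = (self.hi - self.lo) / self.n_bins
        idx = int((x - self.lo) / width)
        idx = max(0, min(self.n_bins - 1, idx))
        # Assumption: fade every count, then credit the new observation.
        self.counts = [self.alpha * c for c in self.counts]
        self.counts[idx] += 1.0

    def frequencies(self):
        total = sum(self.counts)
        if total == 0:
            return [1.0 / self.n_bins] * self.n_bins
        return [c / total for c in self.counts]


def kl(p, q, eps=1e-9):
    # Kullback-Leibler divergence with a small constant to avoid log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def kl_asymmetry(p, q):
    # Assumed dissimilarity: absolute difference of the two KL directions.
    return abs(kl(p, q) - kl(q, p))


# Usage: a stream whose distribution shifts from around 0.15 to around 0.95.
hist = FadingHistogram(lo=0.0, hi=1.0, n_bins=10, alpha=0.9)
for x in [0.15] * 200 + [0.95] * 200:
    hist.update(x)
recent = hist.frequencies()
```

With `alpha` below 1, the weight of the first 200 observations decays geometrically, so `recent` is dominated by the bin containing 0.95. With `alpha = 1`, the same code reduces to a standard histogram that weights all observations equally.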
