Learning from Data Streams: Synopsis and Change Detection

The aim of this PhD program is to study algorithms for learning histograms that can represent continuous high-speed flows of data and address the problem of change detection in data streams. In many modern applications, information is no longer gathered as finite stored data sets but takes the form of infinite data streams. Because a large volume of information is produced at a high rate, it is no longer possible to use algorithms that require the full historical data to be stored in main memory, so new ones are needed that process data online at the rate it becomes available. Moreover, the process generating the data is not strictly stationary and evolves over time, so algorithms should, while extracting knowledge from this continuously growing data, be able to adapt to changes and maintain a representation consistent with the most recent state of nature. In this work, we present a feasible approach that uses incremental histograms and monitors data distributions to detect concept drift in the data stream context.
