Statistical mining in data streams

Recent years have seen a steady rise of a new class of data management systems called Data Stream Management Systems (DSMS). These systems manage rapid, high-volume data streams with transient relations instead of static data with persistent relations. Data streams are common in applications such as network traffic and transaction monitoring systems, click-stream processors, industrial process control, and sensor networks. A DSMS operates on these continuous, time-varying data streams to facilitate on-the-fly query answering and to support data acquisition, monitoring, and analysis. In this dissertation, we present statistical stream mining solutions for effective online processing of streaming data. We focus on research issues related to adaptive stream resource conservation and online mining in a DSMS. We have developed statistical linear and non-linear filtering techniques based on the Kalman Filter to capture temporal correlations in the streaming data. Such correlations help in stream resource conservation. We also propose techniques that capture spatial correlations between the streaming sources, which further improve resource conservation and facilitate answering group queries efficiently. In addition to resource management and query processing, a DSMS needs to address issues related to online stream mining. Once the data stream arrives at a central server, effective mining techniques are necessary for stream analysis before the data can be discarded. Since a stream continuously evolves with time, stream mining techniques need to be adaptive and should operate under a given memory constraint. We propose adaptive clustering solutions that use the kernel trick to capture non-linear relations in the streaming data. We also present OCODDS, a change-detection approach that can track evolutionary changes in the stream in both linear and non-linear settings.
Finally, we present our techniques for effective acquisition and processing of data streams common to video sensor networks.
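
As a minimal illustration of how temporal correlation can conserve stream resources, the sketch below assumes a scalar random-walk model and a pair of identical Kalman filters, one at the source and one at the server. The source transmits a reading only when the shared prediction misses it by more than a tolerance. The class `ScalarKalman`, the function `transmitted_indices`, the noise variances, and the tolerance `delta` are all hypothetical choices made for this example; they are not taken from the dissertation.

```python
from dataclasses import dataclass


@dataclass
class ScalarKalman:
    """1-D Kalman filter under an assumed random-walk model."""
    x: float = 0.0   # current state estimate
    p: float = 1.0   # variance of the estimate
    q: float = 1e-3  # process-noise variance (assumed value)
    r: float = 1e-1  # measurement-noise variance (assumed value)

    def predict(self) -> float:
        # Random walk: the prediction equals the last estimate,
        # while the uncertainty grows by the process noise.
        self.p += self.q
        return self.x

    def update(self, z: float) -> None:
        # Standard scalar Kalman correction with measurement z.
        k = self.p / (self.p + self.r)
        self.x += k * (z - self.x)
        self.p *= 1.0 - k


def transmitted_indices(readings, delta):
    """Return the indices of readings the source would actually send.

    Source and server run mirror filters that are updated only with
    transmitted readings, so both always hold the same prediction.
    """
    source, server = ScalarKalman(), ScalarKalman()
    sent = []
    for i, z in enumerate(readings):
        prediction = source.predict()
        server.predict()
        if abs(z - prediction) > delta:
            # Prediction is too far off: transmit and correct both filters.
            source.update(z)
            server.update(z)
            sent.append(i)
    return sent


if __name__ == "__main__":
    stream = [0.0, 0.0, 5.0, 5.0, 5.0]
    print(transmitted_indices(stream, delta=1.0))
```

In this toy setting a larger `delta` trades answer precision for fewer transmissions; a real deployment would choose the model and tolerance per query.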
