A Monte Carlo Method to Data Stream Analysis

Data stream analysis is the process of computing various summaries and derived values from large amounts of data which are continuously generated at a rapid rate. The nature of a stream does not allow a revisit on each data element. Furthermore, data processing must be fast to produce timely analysis results. These requirements impose constraints on the design of the algorithms to balance correctness against timely responses. Several techniques have been proposed over the past few years to address these challenges. These techniques can be categorized as either data- oriented or task-oriented. The data-oriented approach analyzes a subset of data or a smaller transformed representation, whereas task- oriented scheme solves the problem directly via approximation techniques. We propose a hybrid approach to tackle the data stream analysis problem. The data stream has been both statistically transformed to a smaller size and computationally approximated its characteristics. We adopt a Monte Carlo method in the approxi- mation step. The data reduction has been performed horizontally and vertically through our EMR sampling method. The proposed method is analyzed by a series of experiments. We apply our algorithm on clustering and classification tasks to evaluate the utility of our approach.

[1]  S. Muthukrishnan,et al.  One-Pass Wavelet Decompositions of Data Streams , 2003, IEEE Trans. Knowl. Data Eng..

[2]  Graham Cormode,et al.  What's hot and what's not: tracking most frequent items dynamically , 2003, TODS.

[3]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[4]  Jeff A. Bilmes,et al.  A gentle tutorial of the em algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models , 1998 .

[5]  S. Muthukrishnan,et al.  Data streams: algorithms and applications , 2005, SODA '03.

[6]  Shonali Krishnaswamy,et al.  Mining data streams: a review , 2005, SGMD.

[7]  Jean-Michel Marin,et al.  Bayesian Modelling and Inference on Mixtures of Distributions , 2005 .

[8]  Anil K. Jain,et al.  Unsupervised Learning of Finite Mixture Models , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[9]  David J. C. Mackay,et al.  Introduction to Monte Carlo Methods , 1998, Learning in Graphical Models.

[10]  Geoff Hulten,et al.  A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering , 2001, ICML.

[11]  Ian Witten,et al.  Data Mining , 2000 .

[12]  Jennifer Widom,et al.  Models and issues in data stream systems , 2002, PODS.

[13]  David J. Hand,et al.  Intelligent Data Analysis: An Introduction , 2005 .

[14]  Philip S. Yu,et al.  A Framework for Clustering Evolving Data Streams , 2003, VLDB.

[15]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.