rEMM: Extensible Markov Model for Data Stream Clustering in R

Clustering streams of continuously arriving data has become an important application of data mining in recent years, and several researchers have proposed efficient algorithms. However, clustering alone neglects the fact that data in a data stream is characterized not only by the proximity of data points, which clustering exploits, but also by a temporal component. The extensible Markov model (EMM) adds this temporal component to data stream clustering by superimposing a dynamically adapting Markov chain on the clusters. In this paper we introduce the R extension package rEMM, which implements EMM, and discuss some examples and applications.
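To make the idea of a Markov chain superimposed on a stream clustering concrete, the following is a minimal illustrative sketch in Python. It is not the rEMM package or its API: the class name `TinyEMM`, the use of Euclidean distance, the fixed distance threshold, and the running-mean update of cluster centers are all assumptions chosen for brevity. Each arriving point is assigned to the nearest cluster (state) if it lies within the threshold, otherwise it creates a new state, and the transition from the previous state to the current one is counted.

```python
import math
from collections import defaultdict


class TinyEMM:
    """Toy EMM-style model (hypothetical): threshold clustering plus transition counts."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.centers = []                     # one center per state (cluster)
        self.sizes = []                       # number of points absorbed by each state
        self.transitions = defaultdict(int)   # (from_state, to_state) -> count
        self.current = None                   # state of the most recent point

    def _nearest(self, point):
        """Return (index, distance) of the closest center, or (None, inf) if none exist."""
        best, best_dist = None, math.inf
        for i, center in enumerate(self.centers):
            d = math.dist(point, center)
            if d < best_dist:
                best, best_dist = i, d
        return best, best_dist

    def update(self, point):
        """Assign one stream point to a state and record the transition into it."""
        state, dist = self._nearest(point)
        if state is None or dist > self.threshold:
            # No existing state is close enough: extend the model with a new state.
            self.centers.append(list(point))
            self.sizes.append(1)
            state = len(self.centers) - 1
        else:
            # Absorb the point: move the center to the running mean of its points.
            self.sizes[state] += 1
            n = self.sizes[state]
            self.centers[state] = [
                c + (x - c) / n for c, x in zip(self.centers[state], point)
            ]
        if self.current is not None:
            self.transitions[(self.current, state)] += 1
        self.current = state
        return state

    def transition_probability(self, src, dst):
        """Estimate P(dst | src) from the observed transition counts."""
        total = sum(n for (s, _), n in self.transitions.items() if s == src)
        if total == 0:
            return 0.0
        return self.transitions.get((src, dst), 0) / total


# Usage: two well-separated groups of points visited in alternation.
emm = TinyEMM(threshold=1.0)
stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.1), (5.1, 5.0)]
states = [emm.update(p) for p in stream]
print(states)
print(emm.transition_probability(0, 1))
```

The sketch keeps only the two ingredients named in the abstract, a clustering that grows as new regions of the data space appear and a transition count matrix over the clusters. Any further behavior of a full implementation is deliberately left out here.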
