Instant Selection of High Contrast Projections in Multi-Dimensional Data Streams

In many of today’s applications we have to cope with multi-dimensional data streams containing dimensions that are not relevant to a particular stream mining task. These irrelevant dimensions hinder knowledge discovery because they lead to noisy distributions in the full-dimensional space, while knowledge is hidden in some sets of dependent dimensions. This dependence between dimensions may change over time and poses a major open challenge to stream mining. In this work, we focus on dependent dimensions with high contrast, i.e., dimensions that show a clear separation between outliers and clustered objects. We present HCP-StreamMiner, a method for selecting high-contrast projections in multi-dimensional streams. The quality measure of each projection, its contrast, is determined statistically by comparing the data distribution in a set of dimensions to the marginal distributions of those dimensions. We propose a technique for computing this contrast score from stream data summaries and a procedure for progressively tracking interesting subspaces. Our method was tested on both synthetic and real-world data and proved effective in detecting and tracking high-contrast subspaces.
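
To make the idea of contrast more concrete, the following is a minimal sketch, not the HCP-StreamMiner score itself. It assumes values scaled to [0, 1), a fixed equal-width grid, and a static batch of points rather than stream summaries. As a stand-in for the statistical comparison, it uses the total variation distance between the joint histogram of a projection and the product of its marginal histograms. The function name `contrast` and the parameter `bins` are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def contrast(points, bins=4):
    """Illustrative contrast: total variation distance between the joint
    histogram of the points and the product of their marginal histograms.
    Assumes every coordinate lies in [0, 1). The result lies in [0, 1];
    values near 0 suggest independent dimensions."""
    n = len(points)
    d = len(points[0])
    # Map every point to its grid cell (one bin index per dimension).
    cells = [tuple(min(int(x * bins), bins - 1) for x in p) for p in points]
    joint = Counter(cells)
    marginals = [Counter(c[j] for c in cells) for j in range(d)]
    tv = 0.0
    for cell in product(range(bins), repeat=d):
        p_joint = joint[cell] / n
        # Probability the cell would have if the dimensions were independent.
        p_indep = 1.0
        for j, b in enumerate(cell):
            p_indep *= marginals[j][b] / n
        tv += abs(p_joint - p_indep)
    return tv / 2


rng = random.Random(0)
xs = [rng.random() for _ in range(2000)]
dependent = [(x, x) for x in xs]                 # second dimension copies the first
independent = [(x, rng.random()) for x in xs]    # second dimension is pure noise

print(contrast(dependent), contrast(independent))
```

Under these assumptions the dependent projection scores far higher than the independent one, which is the kind of ranking a contrast measure is meant to provide. A streaming variant would have to maintain the histogram counts incrementally instead of recomputing them from stored points.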
