Succinctly summarizing machine usage via multi-subspace clustering of multi-sensor data

Modern industrial equipment of all kinds is instrumented with a large number of sensors that continuously transmit their readings wirelessly, giving rise to what is often referred to as the 'industrial internet'. Engineers often explore such data to determine the different usage patterns and behavior of similar machines. In this paper we describe a technique to automatically summarize the usage and behavioral patterns of a collection of similar machines by a small set of rules that nevertheless cover a large fraction of the observed data. We characterize the usage and behavior of a machine over a day by a collection of single-sensor histograms; thus each day is a point in a high-dimensional space. We first cluster days according to each sensor separately and then combine the clusters using communities in a specially constructed graph that considers the days shared between clusters of different sensors. In the process, some clusters of a single sensor get merged. Finally, we discover rules, each comprising memberships in clusters of possibly different sensors; we use the term multi-subspace clustering to describe such a collection of cluster-based rules. Our aim is to cover a large fraction of the observed days with a small number of such rules. We present empirical results on voluminous (hundreds of gigabytes) real-life sensor data and compare our technique with related work in subspace clustering and histogram summarization.
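The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It rests on several assumptions: a greedy leader clustering on L1 distance stands in for the per-sensor clustering, connected components of a Jaccard-thresholded graph stand in for community detection, and all names (histogram, leader_cluster, communities, rules_from), thresholds, and toy readings are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def histogram(readings, lo, hi, bins):
    """Normalized histogram of one sensor's readings for one day."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for r in readings:
        idx = min(max(int((r - lo) / width), 0), bins - 1)
        counts[idx] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

def leader_cluster(hists, threshold):
    """Assumed stand-in clusterer: a day joins the first leader within
    L1 distance `threshold`, else it starts a new cluster."""
    leaders, clusters = [], defaultdict(set)
    for day, h in sorted(hists.items()):
        for cid, lead in enumerate(leaders):
            if sum(abs(a - b) for a, b in zip(h, lead)) <= threshold:
                clusters[cid].add(day)
                break
        else:
            leaders.append(h)
            clusters[len(leaders) - 1].add(day)
    return dict(clusters)

def communities(nodes, min_jaccard):
    """Assumed stand-in for community detection: connected components of a
    graph linking clusters of different sensors that share enough days."""
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n
    for a, b in combinations(nodes, 2):
        if a[0] == b[0]:          # only link clusters of different sensors
            continue
        union = len(nodes[a] | nodes[b])
        if union and len(nodes[a] & nodes[b]) / union >= min_jaccard:
            parent[find(a)] = find(b)
    groups = defaultdict(list)
    for n in nodes:
        groups[find(n)].append(n)
    return list(groups.values())

def rules_from(comms, nodes):
    """One rule per community: merge same-sensor clusters (union of days),
    then require membership for every sensor (intersection of days)."""
    rules = []
    for comm in comms:
        per_sensor = defaultdict(set)
        for sensor, cid in comm:
            per_sensor[sensor] |= nodes[(sensor, cid)]
        rules.append((sorted(comm), set.intersection(*per_sensor.values())))
    return sorted(rules, key=lambda r: -len(r[1]))

# Toy data (hypothetical): six days, two sensors; day 5 is atypical.
low, high = [10, 12, 11, 13], [80, 85, 90, 88]
raw = {
    "rpm":  {d: (low if d < 3 else high) for d in range(6)},
    "temp": {d: (low if d < 3 or d == 5 else high) for d in range(6)},
}
nodes = {}
for sensor, days in raw.items():
    hists = {d: histogram(r, 0, 100, 4) for d, r in days.items()}
    for cid, members in leader_cluster(hists, 0.5).items():
        nodes[(sensor, cid)] = members

rules = rules_from(communities(nodes, 0.5), nodes)
covered = set().union(*(days for _, days in rules))
for clusters, days in rules:
    print(clusters, "covers days", sorted(days))
print("coverage:", len(covered), "of 6 days")
```

In this toy run the two rules cover five of the six days, and the atypical day 5 stays uncovered. This mirrors the stated goal of covering a large fraction of days, rather than all of them, with few rules.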
