A Probabilistic Approach to Detect Local Dependencies in Streams

Given m source streams (X_1, X_2, ..., X_m) and one target data stream Y, we want to find, at any time window w, which source stream has the strongest dependency on the current target stream value. Existing solutions fail in several important dependency cases, such as not-similar-but-frequent patterns, signals with multiple lags, and single-point dependencies. To reveal these hard-to-detect local patterns in streams, we develop a framework based on a statistical model, together with an incremental update algorithm. Using the framework, we define a new scoring function based on conditional probability that effectively captures the local dependencies between any source stream and the target stream. Immediate real-life applications include quickly identifying the causal streams with respect to a Key Performance Indicator (KPI) in a complex production system, and detecting locally correlated stocks for an event of interest in the financial system. We apply this framework to two real data sets to demonstrate its advantages over the Principal Component Analysis (PCA) based method [16] and the naive local Pearson implementation.
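To make the problem setting concrete, the following is a minimal sketch of the naive local Pearson baseline mentioned above, not of the proposed conditional-probability scoring function. It assumes equal-length streams held in memory as Python lists and a window given by its end index and width; the names `pearson` and `strongest_source` are hypothetical and do not come from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant (an assumption made
    here so that the score is always defined).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def strongest_source(sources, target, end, width):
    """Score every source stream against the target on one window.

    The window covers indices [end - width, end). The score is the
    absolute Pearson correlation, so strong negative dependence also
    counts. Returns (index of best source, list of all scores).
    """
    start = end - width
    if start < 0 or end > len(target):
        raise ValueError("window falls outside the stream")
    y = target[start:end]
    scores = [abs(pearson(x[start:end], y)) for x in sources]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    sources = [
        [1, 2, 3, 4, 5, 6],
        [3, 1, 4, 1, 5, 9],
        [6, 5, 4, 3, 2, 2],
    ]
    target = [2, 4, 6, 8, 10, 12]
    best, scores = strongest_source(sources, target, end=6, width=4)
    print(best, scores)
```

Because this baseline compares values at the same time index only, it has no notion of lag and a single shared point contributes little to the score, which are two of the failure cases the abstract lists for existing approaches.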

[1] Philip S. Yu, et al. Finding generalized projected clusters in high dimensional spaces, 2000, SIGMOD '00.

[2] Vanish Talwar, et al. Statistical techniques for online anomaly detection in data centers, 2011, 12th IFIP/IEEE International Symposium on Integrated Network Management (IM 2011) and Workshops.

[3] Alexander Aiken, et al. Online detection of multi-component interactions in production systems, 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).

[4] Rajeev Gandhi, et al. Draco: Statistical diagnosis of chronic problems in large distributed systems, 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[5] Christos Faloutsos, et al. BRAID: stream mining through group lag correlations, 2005, SIGMOD '05.

[6] Sung-Hyuk Cha. Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions, 2007.

[7] Gerhard Widmer, et al. Learning in the Presence of Concept Drift and Hidden Contexts, 1996, Machine Learning.

[8] T. Warren Liao, et al. Clustering of time series data - a survey, 2005, Pattern Recognit.

[9] Christophe Diot, et al. Diagnosing network-wide traffic anomalies, 2004, SIGCOMM.

[10] Tao Jiang, et al. Monitoring correlative financial data streams by local pattern similarity, 2009.

[11] Dennis Shasha, et al. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time, 2002, VLDB.

[12] Zhen Guo, et al. Tracking Probabilistic Correlation of Monitoring Data for Fault Detection in Complex Systems, 2006, International Conference on Dependable Systems and Networks (DSN'06).

[13] Nasser M. Nasrabadi, et al. Pattern Recognition and Machine Learning, 2006, Technometrics.

[14] P. Embrechts, et al. Risk Management: Correlation and Dependence in Risk Management: Properties and Pitfalls, 2002.

[15] W. Härdle, et al. Applied Multivariate Statistical Analysis, 2003.

[16] Jimeng Sun, et al. Streaming Pattern Discovery in Multiple Time-Series, 2005, VLDB.

[17] Abdelhamid Bouchachia, et al. Incremental Learning Based on Growing Gaussian Mixture Models, 2011, 2011 10th International Conference on Machine Learning and Applications and Workshops.

[18] H. Sung. Gaussian Mixture Regression and Classification, 2004.

[19] Philip S. Yu, et al. Local Correlation Tracking in Time Series, 2006, Sixth International Conference on Data Mining (ICDM'06).