Monitoring Least Squares Models of Distributed Streams

Least squares regression is widely used to understand and predict data behavior in many fields. As data evolves, regression models must be recomputed, and much work has focused on fast, efficient, and accurate computation of linear regression models. In distributed streaming settings, however, periodically recomputing the global model is wasteful: nodes must communicate new observations or model updates even when the model is, in practice, unchanged. This cost is prohibitive in many settings, such as wireless sensor networks or systems with a very large number of nodes. The alternative, monitoring prediction accuracy, is not always sufficient: in some settings we are interested in the model's coefficients rather than its predictions. We propose the first monitoring algorithm for multivariate regression models of distributed data streams that guarantees a bounded model error. It maintains an accurate estimate using a fraction of the communication by recomputing only when the precomputed model is sufficiently far from the (hypothetical) current global model. When the global model is stable, no communication is needed. Experiments on real and synthetic datasets show that our approach reduces communication by up to two orders of magnitude while providing an accurate estimate of the current global model at all nodes.
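The condition being monitored can be illustrated with a minimal sketch. It is not the algorithm proposed here: it computes the current global model centrally, which is exactly the communication the monitoring algorithm is meant to avoid. The function names, the Euclidean distance, the threshold `epsilon`, and the noise-free synthetic data are all assumptions made for the example.

```python
import numpy as np

def local_stats(X, y):
    """Per-node sufficient statistics for least squares (hypothetical helper)."""
    return X.T @ X, X.T @ y

def global_model(stats):
    """Least squares coefficients from the summed per-node statistics."""
    A = sum(s[0] for s in stats)   # sum of X_i^T X_i over nodes
    c = sum(s[1] for s in stats)   # sum of X_i^T y_i over nodes
    return np.linalg.solve(A, c)

def needs_sync(beta_ref, beta_now, epsilon):
    """True when the reference model has drifted more than epsilon (assumed L2 norm)."""
    return bool(np.linalg.norm(beta_now - beta_ref) > epsilon)

rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])

def make_node_data(beta, n=50):
    # Noise-free synthetic observations, for illustration only.
    X = rng.normal(size=(n, beta.size))
    return X, X @ beta

# Reference model computed once from two nodes.
beta_ref = global_model([local_stats(*make_node_data(beta_true)) for _ in range(2)])

# New windows drawn from the same model: the reference model is still accurate.
beta_same = global_model([local_stats(*make_node_data(beta_true)) for _ in range(2)])
stable = needs_sync(beta_ref, beta_same, epsilon=0.1)

# New windows drawn from a shifted model: the reference model is now stale.
beta_shifted = beta_true + np.array([1.0, 0.0, 0.0])
beta_new = global_model([local_stats(*make_node_data(beta_shifted)) for _ in range(2)])
drifted = needs_sync(beta_ref, beta_new, epsilon=0.1)
```

In this sketch `stable` evaluates to `False` and `drifted` to `True`: a recomputation is triggered only in the second case. A monitoring algorithm has to reach the same decision from local conditions at each node, without assembling `beta_same` or `beta_new`.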
