Scalable Distributed Change Detection from Astronomy Data Streams Using Local, Asynchronous Eigen Monitoring Algorithms

This paper considers the problem of change detection using local distributed eigen monitoring algorithms for the next generation of petascale astronomy data pipelines, such as that of the Large Synoptic Survey Telescope (LSST). This telescope will take repeat images of the night sky every 20 seconds, generating 30 terabytes of calibrated imagery every night that will need to be co-analyzed with other astronomical data stored at different locations around the world. Change point detection and event classification in such data sets may provide useful insights into astronomical phenomena that display astrophysically significant variations: quasars, supernovae, variable stars, and potentially hazardous asteroids. However, performing such data mining tasks on high-throughput distributed data streams is challenging. In this paper we propose a highly scalable, distributed, asynchronous algorithm for monitoring the principal components (PCs) of such dynamic data streams. We demonstrate the algorithm on a large set of distributed astronomical data to accomplish well-known astronomy tasks such as measuring variations in the fundamental plane of galaxy parameters. The proposed algorithm is provably correct (i.e., it converges to the correct PCs without centralizing any data) and can seamlessly handle changes to the data or the network. Experiments on real Sloan Digital Sky Survey (SDSS) catalogue data show the effectiveness of the algorithm.
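As a rough illustration of what monitoring principal components can mean, the sketch below checks whether a previously computed eigenpair still fits the combined data of several nodes. It is not the paper's local asynchronous algorithm: for simplicity it merges per-node sufficient statistics in one place, and the function names, the tolerance `eps`, and the residual test ||Cv - λv|| ≤ eps are all assumptions made for this example.

```python
import math

def local_stats(rows):
    """Per-node sufficient statistics: count, column sums, sums of products."""
    d = len(rows[0])
    n = len(rows)
    s = [sum(r[j] for r in rows) for j in range(d)]
    ss = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    return n, s, ss

def merge(stats):
    """Combine statistics from several nodes by element-wise addition."""
    d = len(stats[0][1])
    n = sum(st[0] for st in stats)
    s = [sum(st[1][j] for st in stats) for j in range(d)]
    ss = [[sum(st[2][i][j] for st in stats) for j in range(d)] for i in range(d)]
    return n, s, ss

def covariance(n, s, ss):
    """Population covariance matrix recovered from merged statistics."""
    d = len(s)
    mean = [x / n for x in s]
    return [[ss[i][j] / n - mean[i] * mean[j] for j in range(d)] for i in range(d)]

def residual(cov, v, lam):
    """Euclidean norm of C v - lam v for a candidate eigenpair (v, lam)."""
    d = len(v)
    cv = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    return math.sqrt(sum((cv[i] - lam * v[i]) ** 2 for i in range(d)))

def still_valid(cov, v, lam, eps):
    """True while the candidate eigenpair fits the data within tolerance eps."""
    return residual(cov, v, lam) <= eps

if __name__ == "__main__":
    # Two hypothetical nodes whose data lie along the direction (1, 1).
    node_a = [(1.0, 1.0), (2.0, 2.0)]
    node_b = [(3.0, 3.0), (4.0, 4.0)]
    v = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # candidate first PC
    lam = 2.5                                 # its candidate eigenvalue
    cov = covariance(*merge([local_stats(node_a), local_stats(node_b)]))
    print(still_valid(cov, v, lam, eps=0.1))
    # Node b's stream changes; the old eigenpair no longer fits.
    node_b = [(3.0, -3.0), (4.0, -4.0)]
    cov = covariance(*merge([local_stats(node_a), local_stats(node_b)]))
    print(still_valid(cov, v, lam, eps=0.1))
```

The point of the sketch is only that the covariance of the combined data depends on each node's data through small additive summaries, so a change in one node's stream can be detected as a violated threshold condition without shipping raw observations.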
