Streaming Algorithms for Halo Finders

Cosmological N-body simulations are essential for studying the large-scale distribution of matter and galaxies in the Universe. Analyzing their output often involves finding clusters of particles and retrieving their properties. Detecting such "halos" among a very large set of particles is computationally intensive and is usually carried out on the same supercomputers that produced the simulations, requiring huge amounts of memory. Streaming algorithms, a recently emerged area of computer science, provide theoretical methods for computing data analytics scalably, using only a single pass over a data set and logarithmic memory. The main contribution of this paper is a novel connection between N-body simulations and streaming algorithms. In particular, we investigate a link between halo finders and the problem of finding frequent items (heavy hitters) in a data stream, which should greatly reduce the computational resource requirements, especially memory. Based on this connection, we can build a new halo finder by running efficient heavy hitter algorithms as a black box. We implement two representatives of the family of heavy hitter algorithms, the Count-Sketch algorithm (CS) and Pick-and-Drop sampling (PD), and evaluate their accuracy and memory usage. Comparison with other halo-finding algorithms from [1] shows that our halo finder can locate the largest halos using significantly less memory and with comparable running time. This streaming approach makes it possible to analyze extremely large data sets from N-body simulations on a smaller machine rather than on supercomputers. Our findings demonstrate that the connection between the halo search problem and streaming algorithms is a promising initial direction for further research.

[1] G. Efstathiou, et al. The evolution of large-scale structure in a universe dominated by cold dark matter, 1985.

[2] Y. Suto, et al. Probability Distribution Function of Cosmological Density Fluctuations from a Gaussian Initial Condition: Comparison of One-Point and Two-Point Lognormal Model Predictions with N-Body Simulations, 2001, astro-ph/0105218.

[3] Yannis Manolopoulos, et al. Continuous Trend-Based Clustering in Data Streams, 2008, DaWaK.

[4] Vicent Quilis, et al. ASOHF: a new adaptive spherical overdensity halo finder, 2010, 1006.3205.

[5] Ashwin Lall, et al. A data streaming algorithm for estimating entropies of OD flows, 2007, IMC '07.

[6] Nickolay Y. Gnedin, et al. VOBOZ: an almost-parameter-free halo-finding algorithm, 2004.

[7] David P. Woodruff, et al. Optimal approximations of the frequency moments of data streams, 2005, STOC '05.

[8] J. Peacock, et al. Stable clustering, the halo model and non-linear cosmological power spectra, 2002, astro-ph/0207664.

[9] A. Knebe, et al. AHF: Amiga's Halo Finder, 2009, 0904.3662.

[10] B. Jones, et al. A lognormal model for the cosmological mass distribution, 1991.

[11] Noureddine Zerhouni, et al. Evidential evolving Gustafson-Kessel algorithm for online data streams partitioning using belief function theory, 2012, Int. J. Approx. Reason.

[12] Alexander S. Szalay, et al. ORIGAMI: Delineating Halos Using Phase-Space Folds, 2012, 1201.2353.

[13] Florin Rusu, et al. Statistical analysis of sketch estimators, 2007, SIGMOD '07.

[14] Vyas Sekar, et al. Data streaming algorithms for estimating entropy of network traffic, 2006, SIGMETRICS '06/Performance '06.

[15] Neoklis Polyzotis, et al. Graph-based synopses for relational selectivity estimation, 2006, SIGMOD Conference.

[16] Edo Liberty, et al. Simple and deterministic matrix sketching, 2012, KDD.

[17] Moses Charikar, et al. Finding frequent items in data streams, 2002, Theor. Comput. Sci.

[18] Rafail Ostrovsky, et al. Approximating Large Frequency Moments with Pick-and-Drop Sampling, 2012, APPROX-RANDOM.

[19] Michal Maciejewski, et al. Structure finding in cosmological simulations: the state of affairs, 2013, 1304.0585.

[20] Stefan Gottloeber, et al. Shape, Spin, and Baryon Fraction of Clusters in the MareNostrum Universe, 2007, astro-ph/0703164.

[21] Michal Maciejewski, et al. Haloes gone MAD: The Halo-Finder Comparison Project, 2011, 1104.0949.

[22] Louiqa Raschid, et al. A Flexible and Extensible Contract Aggregation Framework (CAF) for Financial Data Stream Analytics, 2014, DSMM'14.

[23] Eyke Hüllermeier, et al. An Efficient Algorithm for Instance-Based Learning on Data Streams, 2007, ICDM.

[24] Noga Alon, et al. The Space Complexity of Approximating the Frequency Moments, 1999.

[25] Vladimir Braverman, et al. An Optimal Algorithm for Large Frequency Moments Using O(n^(1-2/k)) Bits, 2014, APPROX-RANDOM.

[26] Eyke Hüllermeier, et al. Efficient instance-based learning on data streams, 2007, Intell. Data Anal.

[27] Carsten Lund, et al. Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications, 2004, IMC '04.

[28] Anatoly Klypin, et al. Particle mesh code for cosmological simulations, 1997, astro-ph/9712217.

[29] Moses Charikar, et al. Finding frequent items in data streams, 2004, Theor. Comput. Sci.