Billion-particle SIMD-friendly two-point correlation on large-scale HPC cluster systems

The two-point correlation function (TPCF) is widely used in astronomy to characterize the distribution of matter and energy in the Universe, and it helps derive the physics that traces back to the creation of the Universe. However, computing it is prohibitively slow for current-sized datasets, and it will remain a critical bottleneck as dataset sizes grow to billions of particles and more, which makes TPCF a compelling benchmark application for future exascale architectures. State-of-the-art TPCF implementations do not map well to the underlying SIMD hardware, and they also suffer from load imbalance at large core counts. In this paper, we present a novel SIMD-friendly histogram update algorithm that exploits the spatial locality of histogram updates to achieve near-linear SIMD scaling. We also present a load-balancing scheme that combines a domain-specific initial static division of work with dynamic task migration across nodes to balance computation effectively. Using the Zin supercomputer at Lawrence Livermore National Laboratory (25,600 cores of Intel® Xeon® E5-2670, each with 256-bit SIMD), we achieve 90% parallel efficiency and 96% SIMD efficiency, and perform TPCF computation on a 1.7 billion particle dataset in 5.3 hours (at least 35× faster than previous approaches). In terms of performance per cost (measured in flops/$), we achieve at least an order of magnitude (11.1×) higher flops/$ than the best known results [1]. Consequently, we now have a line of sight to the processing power needed for correlation computation on telescopic datasets of a billion or more particles.
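
For readers unfamiliar with the kernel that the abstract refers to, the following is a minimal sketch of the pair-distance histogram at the core of a TPCF computation. It is a naive O(N²) baseline written for illustration only; it is not the SIMD-friendly update algorithm or the load-balancing scheme of the paper. The function name `pair_count_histogram`, the half-open bin convention, and the Euclidean distance metric are assumptions made for this example.

```python
import math
from bisect import bisect_right


def pair_count_histogram(points, bin_edges):
    """Count unordered point pairs per separation bin (naive O(N^2) baseline).

    points    : sequence of equal-length coordinate tuples (assumed Euclidean)
    bin_edges : sorted, increasing edges; bin k covers [edges[k], edges[k+1])
    Pairs whose separation falls outside the edges are ignored.
    """
    counts = [0] * (len(bin_edges) - 1)
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):  # visit each unordered pair once
            d = math.dist(points[i], points[j])
            k = bisect_right(bin_edges, d) - 1  # index of the bin containing d
            if 0 <= k < len(counts):
                counts[k] += 1  # the histogram update
    return counts


if __name__ == "__main__":
    pts = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3)]
    print(pair_count_histogram(pts, [0.0, 1.5, 2.5, 4.0]))
```

The inner loop performs one distance computation and one histogram increment per pair, which is why the cost grows quadratically with particle count and why the histogram update is the step the abstract singles out for SIMD treatment.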

[1] Robert H. Halstead, et al. Lazy task creation: a technique for increasing the granularity of parallel programs, 1990, LISP and Functional Programming.

[2] Robert J. Brunner, et al. Accelerating Cosmological Data Analysis with FPGAs, 2009, 2009 17th IEEE Symposium on Field Programmable Custom Computing Machines.

[3] A. Kashlinsky, et al. Large-scale structure in the Universe, 1991, Nature.

[4] David A. Bader, et al. Practical parallel algorithms for dynamic data redistribution, median finding, and selection, 1995, Proceedings of International Conference on Parallel Processing.

[5] Christopher J. Hughes, et al. Carbon: architectural support for fine-grained parallelism on chip multiprocessors, 2007, ISCA '07.

[6] J. Cordes. The Square Kilometer Array, 2006.

[7] John C. Hart, et al. Parallel SAH k-D tree construction, 2010, HPG '10.

[8] Kirk D. Borne, et al. Galaxy Evolution with LSST, 2010.

[9] Robert J. Brunner, et al. Accelerating cosmological data analysis with graphics processors, 2009, GPGPU-2.

[10] Pradeep Dubey, et al. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs, 2009, Proc. VLDB Endow.

[11] Piet Hut, et al. A hierarchical O(N log N) force-calculation algorithm, 1986, Nature.

[12] Amar Phanishayee, et al. FAWN: a fast array of wimpy nodes, 2011, Commun. ACM.

[13] Edward T. Grochowski, et al. Larrabee: A many-Core x86 architecture for visual computing, 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[14] Edward J. Wollack, et al. Seven-Year Wilkinson Microwave Anisotropy Probe (WMAP) Observations: Planets and Celestial Calibration Sources, 2010, 1001.4731.

[15] Tsuyoshi Hamada, et al. 190 TFlops Astrophysical N-body Simulation on a Cluster of GPUs, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[16] Amar Phanishayee, et al. FAWN: a fast array of wimpy nodes, 2009, SOSP '09.

[17] Robert J. Brunner, et al. Implementation of the two-point angular correlation function on a high-performance reconfigurable computer, 2009, Sci. Program.

[18] L. Wasserman, et al. Fast Algorithms and Efficient Statistics: N-Point Correlation Functions, 2000, astro-ph/0012333.

[19] J. Koomey. Worldwide electricity used in data centers, 2008.

[20] Andrew W. Moore, et al. 'N-Body' Problems in Statistical Learning, 2000, NIPS.

[21] Christopher J. Hughes, et al. Atomic Vector Operations on Chip Multiprocessors, 2008, 2008 International Symposium on Computer Architecture.

[22] Christopher J. Hughes, et al. Computer Vision on Multi-Core Processors: Articulated Body Tracking, 2007, 2007 IEEE International Conference on Multimedia and Expo.

[23] Joseph Lazio. The Square Kilometer Array, 2008.

[24] Ray P. Norris. Data Challenges for Next-generation Radio Telescopes, 2010, 2010 Sixth IEEE International Conference on e-Science Workshops.

[25] Jon Louis Bentley, et al. Multidimensional binary search trees used for associative searching, 1975, CACM.

[26] Pradeep Dubey, et al. Designing and dynamically load balancing hybrid LU for multi/many-core, 2011, Computer Science - Research and Development.

[27] A. Szalay, et al. Bias and variance of angular correlation functions, 1993.

[28] Eftychios Sifakis, et al. Physical simulation for animation and visual effects: parallelization and characterization for chip multiprocessors, 2007, ISCA '07.

[29] Robert J. Brunner, et al. Fast Two-Point Correlations of Extremely Large Data Sets, 2008.

[30] Pradeep Dubey, et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort, 2010, SIGMOD Conference.

[31] Sriram Krishnamoorthy, et al. Scalable work stealing, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[32] Laxmikant V. Kalé, et al. Enabling and scaling biomolecular simulations of 100 million atoms on petascale machines with a multicore-optimized message-driven runtime, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[33] P. Peebles, et al. The Cosmological Constant and Dark Energy, 2002, astro-ph/0207347.

[34] Albert-Jan Boonstra, et al. DOME: towards the ASTRON & IBM center for exascale technology, 2012, Astro-HPC '12.