Design and Performance of a Scalable, Parallel Statistics Toolkit

Most statistical software packages implement a broad range of techniques but do so in an ad hoc fashion, leaving users who lack a broad knowledge of statistics at a disadvantage: they may not understand all the implications of a given analysis or how to test the validity of its results. These packages are also largely serial in nature, target multicore architectures rather than distributed-memory systems, or provide only a small number of statistics in parallel. This paper surveys a collection of parallel implementations of statistics algorithms developed as part of a common framework over the past three years. The framework strategically groups modeling techniques with associated verification and validation techniques to make the underlying assumptions of the statistics clearer. Furthermore, it employs a design pattern specifically targeted at distributed-memory parallelism, where architectural advances in large-scale high-performance computing have been focused. Moment-based statistics (which include descriptive, correlative, and multicorrelative statistics, principal component analysis (PCA), and k-means statistics) scale nearly linearly with the data set size and number of processes. Entropy-based statistics (which include order and contingency statistics) do not scale well when the data in question is continuous or quasi-diffuse, but do scale well when the data is discrete and compact. We confirm and extend our earlier results by establishing near-optimal scalability with up to 10,000 processes.
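The distributed-memory design pattern described above for moment-based statistics can be illustrated with a small sketch. This is not the toolkit's actual implementation; it is a minimal Python illustration of the general single-pass/pairwise-update approach (each process computes partial moments over its local partition in one pass, then partials are merged with an associative combine suitable for a parallel reduction). All function names here are illustrative.

```python
from functools import reduce

def local_moments(values):
    """Single pass over one process's partition: returns (count, mean, M2),
    where M2 is the sum of squared deviations from the running mean."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2

def combine(a, b):
    """Associative merge of two partial results; this is the operator a
    distributed reduction (e.g., an MPI all-reduce) would apply."""
    na, ma, m2a = a
    nb, mb, m2b = b
    if na == 0:
        return b
    if nb == 0:
        return a
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2

# Stand-in for two processes, each holding a slice of the data.
partitions = [[1.0, 2.0], [3.0, 4.0, 5.0]]
partials = [local_moments(p) for p in partitions]
n, mean, m2 = reduce(combine, partials)
variance = m2 / (n - 1)  # sample variance over the full data set
```

Because `combine` is associative, the merge can be organized as a tree-structured reduction over any number of processes, which is what makes the near-linear scaling in the number of processes plausible: each process does O(local data) work plus O(log p) combine steps.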
