Design and Performance of a Scalable, Parallel Statistics Toolkit

Most statistical software packages implement a broad range of techniques but do so in an ad hoc fashion, leaving users who lack a broad knowledge of statistics at a disadvantage: they may not understand all the implications of a given analysis or how to test the validity of its results. These packages are also largely serial in nature, target multicore architectures rather than distributed-memory systems, or provide only a small number of statistics in parallel. This paper surveys a collection of parallel implementations of statistics algorithms developed as part of a common framework over the past three years. The framework strategically groups modeling techniques with associated verification and validation techniques to make the underlying assumptions of the statistics clearer. Furthermore, it employs a design pattern specifically targeted at distributed-memory parallelism, where architectural advances in large-scale high-performance computing have been focused. Moment-based statistics (which include descriptive, correlative, and multicorrelative statistics, principal component analysis (PCA), and k-means statistics) scale nearly linearly with the data set size and number of processes. Entropy-based statistics (which include order and contingency statistics) do not scale well when the data in question is continuous or quasi-diffuse, but do scale well when the data is discrete and compact. We confirm and extend our earlier results by establishing near-optimal scalability with up to 10,000 processes.
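The distributed-memory design pattern described above for moment-based statistics can be illustrated with a small sketch. This is not the toolkit's actual implementation; it is a minimal Python illustration of the general single-pass/pairwise-update approach (each process computes partial moments over its local partition in one pass, then partials are merged with an associative combine suitable for a parallel reduction). All function names here are illustrative.

```python
from functools import reduce

def local_moments(values):
    """Single pass over one process's partition: returns (count, mean, M2),
    where M2 is the sum of squared deviations from the running mean."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2

def combine(a, b):
    """Associative merge of two partial results; this is the operator a
    distributed reduction (e.g., an MPI all-reduce) would apply."""
    na, ma, m2a = a
    nb, mb, m2b = b
    if na == 0:
        return b
    if nb == 0:
        return a
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2

# Stand-in for two processes, each holding a slice of the data.
partitions = [[1.0, 2.0], [3.0, 4.0, 5.0]]
partials = [local_moments(p) for p in partitions]
n, mean, m2 = reduce(combine, partials)
variance = m2 / (n - 1)  # sample variance over the full data set
```

Because `combine` is associative, the merge can be organized as a tree-structured reduction over any number of processes, which is what makes the near-linear scaling in the number of processes plausible: each process does O(local data) work plus O(log p) combine steps.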
