Contributions to the efficient use of general purpose coprocessors: kernel density estimation as case study

The high performance computing landscape is shifting from assemblies of homogeneous nodes towards heterogeneous systems, in which nodes combine traditional out-of-order execution cores with accelerator devices. Accelerators provide greater theoretical performance than traditional multi-core CPUs, but exploiting their computing power remains a challenging task. This dissertation discusses the issues that arise when trying to use general purpose accelerators efficiently. As a contribution to aid in this task, we present a thorough survey of performance modeling techniques and tools for general purpose coprocessors. We then use the statistical technique Kernel Density Estimation (KDE) as a case study. KDE is a memory-bound application that poses several challenges for its adaptation to the accelerator-based model. We present a novel algorithm for the computation of KDE, called S-KDE, that considerably reduces its computational complexity. Furthermore, we have carried out two parallel implementations of S-KDE, one for multi-core and many-core processors, and another for accelerators. The latter has been implemented in OpenCL to make it portable across a wide range of devices. We have evaluated the performance of each implementation of S-KDE on a variety of architectures, highlighting the bottlenecks and the limits that the code reaches on each device. Finally, we present an application of our S-KDE algorithm in the field of climatology: a novel methodology for the evaluation of environmental models.
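
For readers unfamiliar with KDE, the baseline estimator can be sketched as below. This is the textbook one-dimensional Gaussian form, shown only to illustrate why the naive computation is costly: every evaluation point visits every sample. It is not the S-KDE algorithm described in the dissertation, and the function name, parameters, and sample data are illustrative assumptions.

```python
import math


def gaussian_kde(samples, queries, bandwidth):
    """Naive 1-D Gaussian KDE (illustrative baseline, not S-KDE).

    Evaluates f(x) = 1/(n*h) * sum_i K((x - x_i)/h) with a standard
    Gaussian kernel K. Cost is O(len(samples) * len(queries)).
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    if not samples:
        raise ValueError("at least one sample is required")

    n = len(samples)
    # Normalisation constant of the Gaussian kernel, divided by n*h.
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    densities = []
    for x in queries:
        total = 0.0
        # Every query point reads the full sample set once.
        for xi in samples:
            u = (x - xi) / bandwidth
            total += math.exp(-0.5 * u * u)
        densities.append(norm * total)
    return densities


# Hypothetical data: four samples, evaluated on a grid over [-10, 10].
samples = [-1.0, 0.0, 0.5, 2.0]
step = 0.01
grid = [i * step - 10.0 for i in range(2001)]
density = gaussian_kde(samples, grid, bandwidth=0.5)

# A density should integrate to roughly 1 (Riemann-sum check).
area = sum(density) * step
```

The inner loop performs very little arithmetic per sample read, which is consistent with the abstract's description of KDE as memory bound.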
