Exascale Computer System Design: The Square Kilometre Array

With each new generation, the performance of high-performance computing systems increases. In the past decade, supercomputers reached petascale performance: machines capable of processing more than 10^15 floating-point operations per second (FLOPS). Today, engineers are working to conquer the next barrier: building an exascale system capable of processing more than 10^18 FLOPS. A major challenge is to keep power consumption low. Petascale systems reached an energy efficiency of a few GFLOPS per watt, but it is estimated that exascale systems need to reach at least 50 GFLOPS per watt. System architects face a huge design space that is too expensive to simulate or prototype. New methodologies are needed to assess the architectural trade-offs involved in reaching the goal of building an energy-efficient exascale system in this decade.

A prime example of an exascale system is the computing system required to operate the future Square Kilometre Array (SKA) radio telescope. Hundreds of thousands of antennas and thousands of dishes will be constructed in two phases in the Australian and South African deserts. Two instruments are constructed in phase one: SKA1-Low and SKA1-Mid. The raw data from the receivers (nearly 150 TB/s in phase one alone) need to be processed in near real-time. Processing is performed in three steps: the station processor, the central signal processor (CSP), and the science data processor (SDP). The output is scientific data, such as sky images, for astronomers to use. The SKA is the use case for the exascale system design methodology we develop in this dissertation, with particular focus on the imaging pipeline.

The first contribution of this work is an application-specific model to derive the computing requirements on the processing platform from the instrumental parameters of radio telescopes. A first-order prediction of power consumption is based on extrapolations from the TOP500 supercomputer list.
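As a rough check on the efficiency target mentioned above, the power drawn by a system is its sustained throughput divided by its energy efficiency. The sketch below is illustrative only and is not the dissertation's model: the function name `power_megawatts` is hypothetical, and the 2 GFLOPS per watt figure is an assumed stand-in for "a few GFLOPS per watt".

```python
def power_megawatts(throughput_flops: float, efficiency_gflops_per_watt: float) -> float:
    """Return the power draw in MW for a sustained throughput at a given efficiency."""
    watts = throughput_flops / (efficiency_gflops_per_watt * 1e9)
    return watts / 1e6


EXAFLOPS = 1e18  # 10^18 FLOPS, the exascale threshold

# Target efficiency from the text: at least 50 GFLOPS per watt.
target_mw = power_megawatts(EXAFLOPS, 50.0)

# Assumed petascale-era efficiency (2 GFLOPS per watt is an illustrative guess).
petascale_era_mw = power_megawatts(EXAFLOPS, 2.0)

print(f"Exascale at 50 GFLOPS/W: {target_mw:.0f} MW")
print(f"Exascale at 2 GFLOPS/W (assumed): {petascale_era_mw:.0f} MW")
```

Under these assumptions, the 50 GFLOPS per watt target corresponds to a power budget of about 20 MW for one EFLOPS, whereas an exascale machine built at the assumed petascale-era efficiency would need hundreds of megawatts.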
An analysis of the original SKA phase-one baseline design, released by the SKA Organisation (SKAO), shows that the telescope requires a sustained computing throughput of nearly 1 EFLOPS for the SDP. We predict a power consumption of up to 120 MW in 2018. Partly based on results of this analysis, the SKAO released a revised design of the telescope to reduce the power consumption of the system. The i
