Trends in High Performance Computing: Exascale Systems and Facilities Beyond the First Wave

The demand for computing at extreme scale continues to push high-performance computing systems toward exascale and has recently resulted in plans worldwide to field such systems, including the planned acquisition by the United States of three such systems between 2021 and 2023. China, Japan, and Europe also have programs leading to the deployment of exascale systems in the next decade. Vendor response to this continuing demand for more computing power has led to unabated increases in performance, enabling exascale systems that combine traditional processors with, now prominently, co-processors such as graphics processing units. The first exascale systems are currently expected to require facilities that can provide massive resources: as much as 40 MW of electrical power; up to 13,000 tons of liquid cooling; 250,000 to 1,000,000 CFM of forced-air cooling; and 15,000 square feet of facility space. As the historical trend in HPC has long indicated, the advent of the first exascale systems in the early 2020s will be just the initial wave of systems at such scale, followed later in the decade by others in increasing numbers that will be progressively more efficient and compact, requiring less power, cooling, and space. This paper addresses the trends that will characterize systems and facilities beyond the first wave of exascale, enabling the deployment of leading-edge computer systems to the much larger community of organizations and sites that cannot provide the enormous facilities required for the first wave of exascale computers. These trends are now discernible from the data published in recent TOP500 and Green500 semiannual lists, as well as from developments evident in the processors, systems, and facilities slated to characterize the newest and highest-performing systems worldwide. Extending exascale beyond the first wave will require power and cooling demands, and thus facility operating costs, that can be sustained by a growing community of sites and organizations that have heretofore fielded the now-dominant staple of data center architecture: air-cooled designs based on commodity-processor rack clusters. Balancing custom hardware designs against the acquisition and operational costs of the compute, network, and storage racks requires an evolutionary approach in which the deployed solution leverages existing infrastructure and allows for enhancements with a total cost of ownership (TCO) mindset. As compute silicon power increases with no reduction in rack power in sight, the migration path from traditional air-cooled or liquid-assisted cooling to a more direct liquid-to-node approach requires planning, standardization of deployment, and validation of the solution. Infrastructure compatibility across vendors of cooling equipment and systems will likely be needed for new cooling approaches to be adopted in the data center. Highlighted in this paper are the challenges and opportunities that operators of these second-wave exascale facilities must be cognizant of while developing a data center strategy over the next decade.
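To put the facility operating cost in perspective, the following back-of-envelope sketch estimates the annual energy bill for a system drawing the 40 MW cited above. The 80% average utilization and $0.08/kWh electricity rate are illustrative assumptions, not figures from this paper.

```python
# Back-of-envelope estimate of annual facility energy cost for a first-wave
# exascale system drawing roughly 40 MW (figure cited in the text above).
# Utilization and electricity rate are assumed, illustrative values.

system_power_mw = 40.0      # peak facility power from the text (MW)
utilization = 0.80          # assumed average utilization (hypothetical)
rate_usd_per_kwh = 0.08     # assumed industrial electricity rate (hypothetical)
hours_per_year = 24 * 365

avg_power_kw = system_power_mw * 1000 * utilization
annual_energy_kwh = avg_power_kw * hours_per_year
annual_cost_usd = annual_energy_kwh * rate_usd_per_kwh

print(f"Average draw:       {avg_power_kw:,.0f} kW")
print(f"Annual energy:      {annual_energy_kwh:,.0f} kWh")
print(f"Annual energy cost: ${annual_cost_usd:,.0f}")
# Roughly $22M per year under these assumptions, before cooling overhead
# (PUE > 1), which is why power and cooling dominate facility operating cost.
```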