HPC Cloud for Scientific and Business Applications

High-performance computing (HPC) clouds are becoming an alternative to on-premise clusters for executing scientific applications and business analytics services. Most research efforts in HPC cloud aim to understand the cost-benefit of moving resource-intensive applications from on-premise environments to public cloud platforms. Industry trends show that hybrid environments are the natural path to getting the best of on-premise and cloud resources: steady (and sensitive) workloads can run on on-premise resources, while peak demand can leverage remote resources in a pay-as-you-go manner. Nevertheless, many questions remain open in HPC cloud, ranging from how to extract the best performance from an unknown underlying platform to which services are essential to make it easier to use. Moreover, the discussion of the right pricing and contractual models for small and large users is relevant to the sustainability of HPC clouds. This article presents a survey and taxonomy of efforts in HPC cloud and a vision of what we believe lies ahead, including a set of research challenges that, once tackled, can help advance businesses and scientific discoveries. This is particularly relevant given the rapidly growing wave of new HPC applications coming from big data and artificial intelligence.
