The OpenDC Vision: Towards Collaborative Datacenter Simulation and Exploration for Everybody

In the new Digital Economy, massive computer systems, often grouped in datacenters, serve as factories "producing" cloud services that are consumed at massive scale. However, to deliver cloud services affordably worldwide, we must address new research challenges in designing, operating, and using modern datacenters. We must also address challenges in educating and training the next generation of datacenter engineers. To address these challenges, in this work we present our vision of OpenDC: we envision exploring various datacenter concepts and technologies with existing and new scientific methods, enabling new educational practices and topics, and creating new software and data artifacts. We present the datacenter concepts and technologies we currently plan to explore with OpenDC. We identify the scientific methods we want to use and explain our vision of educational practices. We present the architecture and open-source code underlying the OpenDC software, as well as the format and open-access data we use for datacenter experiments. We conclude with an open invitation for the community to join our effort.
