SANgo: a storage infrastructure simulator with reinforcement learning support

We introduce SANgo (Storage Area Network in the Go language), a Go-based package for simulating the behavior of modern storage infrastructure. The software is based on the discrete-event modeling paradigm and captures the structure and dynamics of high-level storage system building blocks. The flexible structure of the package allows us to create a model of a real storage system with a configurable number of components. The granularity of the simulated system can be chosen according to the patterns of actual system behavior that need to be replicated. Accurate replication enables us to reach the primary goal of our simulator: to explore the stability boundaries of real storage systems. To meet this goal, SANgo offers a variety of interfaces for easy monitoring and tuning of the simulated model. These interfaces allow us to track a number of metrics of components such as storage controllers, network connections, and hard drives. Other interfaces allow altering the parameter values of the simulated system effectively in real time, thus making it possible to train a realistic digital twin using, for example, the reinforcement learning (RL) approach. One can train an RL model to reduce discrepancies between simulated and real SAN data: the external control algorithm adjusts the simulator parameters to make the difference as small as possible. SANgo supports the standard OpenAI Gym interface; thus, the software can serve as a benchmark for comparing different learning algorithms.

Subjects: Data Mining and Machine Learning, Scientific Computing and Simulation, Software Engineering
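
The tuning loop described in the abstract can be illustrated with a minimal Python sketch. It is not SANgo's API: `simulate_mean_latency`, `ToyTwinEnv`, and `tune` are hypothetical names, the single-server queue is a toy stand-in for a storage model, and a simple proportional controller takes the place of a trained RL agent. The environment follows the classic Gym convention of `reset()` returning an observation and `step(action)` returning `(observation, reward, done, info)`, with the reward defined as the negative discrepancy between simulated and "real" latency.

```python
import heapq


def simulate_mean_latency(service_time, n_requests=200, interarrival=1.0):
    """Toy discrete-event model of one FIFO server; returns mean request latency."""
    # Event queue ordered by arrival time: (arrival_time, request_id).
    events = [(i * interarrival, i) for i in range(n_requests)]
    heapq.heapify(events)
    server_free_at = 0.0
    total_latency = 0.0
    while events:
        arrival, _ = heapq.heappop(events)
        start = max(arrival, server_free_at)  # wait if the server is busy
        server_free_at = start + service_time
        total_latency += server_free_at - arrival
    return total_latency / n_requests


class ToyTwinEnv:
    """Gym-style environment (hypothetical): tune one simulator parameter."""

    def __init__(self, real_latency, initial_service_time=0.9, max_steps=50):
        self.real_latency = real_latency  # measurement from the "real" system
        self.initial_service_time = initial_service_time
        self.max_steps = max_steps
        self.service_time = initial_service_time
        self.steps = 0

    def _discrepancy(self):
        # Observation: simulated minus real mean latency.
        return simulate_mean_latency(self.service_time) - self.real_latency

    def reset(self):
        self.service_time = self.initial_service_time
        self.steps = 0
        return self._discrepancy()

    def step(self, action):
        # The action is an additive change to the tuned parameter.
        self.service_time = max(0.01, self.service_time + action)
        self.steps += 1
        discrepancy = self._discrepancy()
        reward = -abs(discrepancy)  # smaller discrepancy gives higher reward
        done = self.steps >= self.max_steps or abs(discrepancy) < 1e-3
        return discrepancy, reward, done, {"service_time": self.service_time}


def tune(env, gain=0.5):
    """Stand-in for an RL agent: a proportional controller on the discrepancy."""
    discrepancy = env.reset()
    done = False
    info = {"service_time": env.service_time}
    while not done:
        discrepancy, _reward, done, info = env.step(-gain * discrepancy)
    return info["service_time"], discrepancy
```

Calling `tune(ToyTwinEnv(real_latency=simulate_mean_latency(0.4)))` drives the toy parameter toward the value that produced the reference latency. A learning algorithm would replace `tune` while interacting with the same `reset`/`step` interface.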
