Paper information - Online detection of failures generated by storage simulator

Modern large-scale data farms consist of hundreds of thousands of storage devices spread across distributed infrastructure. Devices used in modern data centers (such as controllers, links, SSDs, and HDDs) can fail because of hardware as well as software problems. Such failures or anomalies can be detected by monitoring the activity of components using machine learning techniques. To use these techniques, researchers need plenty of historical data from devices in both normal and failure modes to train the algorithms. In this work, we address two problems: 1) the lack of storage data for the methods above, by creating a simulator, and 2) applying existing online algorithms that can detect a failure in one of the components faster.

We created a Go-based (golang) package for simulating the behavior of modern storage infrastructure. The software is based on the discrete-event modeling paradigm and captures the structure and dynamics of high-level storage system building blocks. The package's flexible structure allows us to create a model of a real-world storage system with a configurable number of components. The primary area of interest is exploring the storage machine's behavior under stress testing or operation over the medium or long term in order to observe failures of its components.

To discover failures in the time series generated by the simulator, we modified a change point detection algorithm that works in online mode. The goal of change point detection is to discover differences in the time series distribution. This work describes an approach for failure detection in time series data based on direct density ratio estimation via binary classifiers.

Introduction

Disk drives are among the crucial elements of any computer and IT infrastructure. Disk failures contribute heavily to outages of the overall computing system.
Over the last decades, storage system reliability and modeling have been an active area of research in industry and academia [1–3]. Nowadays, the total number of hard disk drives (HDD) and solid-state drives (SSD) deployed in data farms and cloud systems is roughly estimated to exceed tens of millions of units [4]. Consequently, early identification of defects that lead to future failures can bring significant benefits. Such failures or anomalies can be detected by monitoring components' activity using machine learning techniques known as change point detection [5–7]. To use these techniques, especially for anomaly detection, historical data from devices in normal and failure modes are needed to train the algorithms.

arXiv:2101.07100v1 [cs.LG] 18 Jan 2021

In this paper, for the reasons mentioned above, we address two problems: 1) the lack of storage data for the methods above, by creating a simulator, and 2) applying new online algorithms that can detect a failure in one of the components faster [8]. We created a Go-based (golang) package for simulating the behavior of modern storage infrastructure. The primary area of interest is exploring the storage machine's behavior under stress testing or operation over the medium or long term in order to observe failures of its components. The software is based on the discrete-event modeling paradigm and captures the structure and dynamics of high-level storage system building blocks. It represents a hybrid approach to modeling storage area networks [9, 10]. This method, described in [11], uses additional blocks with a neural network that tunes the internal model parameters while a simulation is running. The critical advantage of this approach is that it requires less detailed simulation and fewer modeled parameters of real-world system components and, as a result, significantly reduces the intellectual cost of development.
The package's modular structure allows us to create a model of a real-world storage system with a configurable number of components. Compared to other techniques, parameter tuning does not require heavy changes to the service under development [12]. To discover failures in the time series generated by the simulator, we modified a change point detection algorithm that works in online mode. The goal of change point detection is to discover differences in the time series distribution. This work uses an approach for failure detection in time series data based on direct density ratio estimation via binary classifiers [8].

Simulator Internals

The simulator uses the Discrete Event Simulation (DES) [13] paradigm for modeling storage infrastructure. In a broad sense, DES simulates a system as a discrete sequence of events in time. Each event happens at a specific moment in time and marks a change of state in the system. No change in the system is presumed to happen between two consecutive events; thus, the simulation time can jump directly to the occurrence time of the next event. The scheme of the process is shown in Figure 1.

Figure 1. The event handling loop is the central part responsible for time movement in the simulator. The Master process creates the necessary logical processes (Client1, IOBalancer, HDD Write, etc.) and populates a Priority Queue by collecting events from the modeling processes.

The last part of the implementation is running the event handling loop. It removes successive elements from the queue and performs the associated actions. This is correct because the queue is already sorted by time. The simulator's programming environment provides the functionality to set up a model for specific computing environments, especially storage area networks. The key area of interest is exploring the storage infrastructure's behavior under various stress tests or utilization over the medium or long term in order to monitor breakdowns of its components.

Table 1. Resource description

| Resource | Real-world entity | Parameter | Units | Anomaly type |
|----------|--------------------|-----------------|-------|--------------|
| CPU | Controller, server | Number of cores | Count | Performance degradation or total breakdown |
| | | Core speed | FLOPS | |
| Link | Networking cables | Bandwidth | MB/s | Performance degradation or total breakdown |
| | | Latency | s | |
| Storage | Cache, SSD, HDD | Size | GB | Performance degradation or total breakdown |
| | | Write speed | MB/s | |
| | | Read speed | MB/s | |

In the simulator, the load on the storage system is represented by two action types: reading a file from disk and writing a file to disk. Each file has corresponding attributes, such as name, block size, and total size. Together with the current load, these attributes determine the amount of time required to perform the corresponding action. Three basic types of resources are provided: CPU, network interface, and storage. Their representation is shown in Figure 3, and a description is given in Table 1. Real-world systems can be constructed from these basic blocks, as shown in Figure 2.

Figure 2. An example of a real storage system that can be modeled using basic blocks.

Figure 3. Basic resource entities in the simulator.

Comparison with the real data

Data from a real-world storage system were used to validate the behavior of the simulator. A similar write load scenario was generated on the model prototype, together with an intentional controller failure (turn-off). The comparison is shown in Figure 4. As we can see, the simulator's data qualitatively reflect the component breakdown.

Figure 4. Comparison of the CPU load metrics between simulated (A) and real data (B). The periods marked 'Failure' correspond to a storage processor being offline.

Change point detection

Consider a d-dimensional time series described by a vector of observations $x(t) \in \mathbb{R}^d$ at time $t$.
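A series of this kind is what the simulator emits. The following is a minimal sketch of the event handling loop described under Simulator Internals, written in Python for brevity even though the package itself is in Go. It is not the package's API: the function `run_simulation`, the event names, and all parameter values are hypothetical. A single storage resource with the write speed parameter from Table 1 serves periodic write requests, and one injected event degrades that speed, mimicking the performance degradation anomaly.

```python
import heapq
import itertools


def run_simulation(n_writes=20, interval=1.0, size_mb=100.0,
                   write_speed=200.0, fail_at=10.5, degraded_speed=50.0):
    """Toy discrete-event loop (hypothetical model, not the package's API).

    Returns a list of (completion_time, write_duration) observations.
    """
    queue = []                   # priority queue ordered by event time
    order = itertools.count()    # tie-breaker so payloads are never compared
    series = []                  # observed metric: duration of each write
    state = {"speed": write_speed, "disk_free_at": 0.0}

    def schedule(time, kind, payload=None):
        heapq.heappush(queue, (time, next(order), kind, payload))

    # Populate the queue: periodic client writes and one injected anomaly.
    for i in range(n_writes):
        schedule(i * interval, "write_request", size_mb)
    schedule(fail_at, "degrade", degraded_speed)

    clock = 0.0
    while queue:
        time, _, kind, payload = heapq.heappop(queue)
        clock = time             # jump straight to the next event's time
        if kind == "degrade":
            state["speed"] = payload
        elif kind == "write_request":
            # The request waits if the disk is still busy with earlier writes.
            start = max(clock, state["disk_free_at"])
            done = start + payload / state["speed"]
            state["disk_free_at"] = done
            schedule(done, "write_done", done - clock)
        elif kind == "write_done":
            series.append((clock, payload))
    return series
```

With the default values, every write issued before the injected degradation at t = 10.5 takes 0.5 s, and the duration grows afterwards as requests queue behind the slower disk. A shift of this kind in the observed series is what a change point detector is expected to flag.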
A sequence of observations of length $k$ at time $t$ is defined as:

$$X(t) = [x(t)^T, x(t-1)^T, \dots, x(t-k+1)^T]^T \in \mathbb{R}^{kd}$$

A sample of sequences of size $n$ is defined as:

$$\mathcal{X}(t) = \{X(t), X(t-1), \dots, X(t-n+1)\}$$

It is assumed that the distribution of observations changes at time $t^*$. The goal is to detect this change. The idea is to estimate a dissimilarity score between a reference sample $\mathcal{X}_{rf}(t-n)$ and a test sample $\mathcal{X}_{te}(t)$. The larger the dissimilarity, the more likely it is that a change point occurs at time $t-n$.

In this work, we apply a CPD algorithm based on direct density ratio estimation developed in [8]. The main idea is to estimate the density ratio $w(X)$ between two probability distributions $P_{te}(X)$ and $P_{rf}(X)$, which correspond to the test and reference sets, respectively. Different binary classifiers can be used to estimate $w(X)$, such as decision trees, random forests, SVM, etc. We use neural networks for this purpose. The network $f(X, \theta)$ is trained on mini-batches with the cross-entropy loss function $L(\mathcal{X}(t-l), \mathcal{X}(t), \theta)$:

$$L(\mathcal{X}(t-l), \mathcal{X}(t), \theta) = -\frac{1}{n} \sum_{X \in \mathcal{X}(t-l)} \log(1 - f(X, \theta)) - \frac{1}{n} \sum_{X \in \mathcal{X}(t)} \log f(X, \theta)$$

We use a dissimilarity score based on the Kullback-Leibler divergence, $D(\mathcal{X}(t-l), \mathcal{X}(t))$. Following [14], we define this score as:

D(X (t− l),X (t), θ) = 1
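The window construction, the cross-entropy loss, and the density ratio can be sketched in plain Python for a one-dimensional series. This is an illustration under stated assumptions, not the method of [8]: a logistic regression stands in for the neural network, full-batch gradient descent stands in for mini-batch training, and the names `windows`, `predict`, `cross_entropy`, `train`, and `density_ratio` are hypothetical. The ratio is recovered as f / (1 - f), which holds for a well-calibrated classifier when the reference and test samples have the same size. The dissimilarity score D is not implemented here.

```python
import math


def windows(series, k):
    """Build X(t) = [x(t), x(t-1), ..., x(t-k+1)] for every t with full history."""
    return [series[t - k + 1:t + 1][::-1] for t in range(k - 1, len(series))]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(theta, X):
    """f(X, theta): estimated probability that window X is from the test sample."""
    bias, weights = theta
    return sigmoid(bias + sum(w * x for w, x in zip(weights, X)))


def cross_entropy(theta, ref, test, eps=1e-12):
    """Cross-entropy loss: reference windows are class 0, test windows class 1."""
    loss_ref = -sum(math.log(1.0 - predict(theta, X) + eps) for X in ref) / len(ref)
    loss_test = -sum(math.log(predict(theta, X) + eps) for X in test) / len(test)
    return loss_ref + loss_test


def train(ref, test, steps=200, lr=0.1):
    """Fit a logistic regression by full-batch gradient descent (assumed stand-in)."""
    k = len(ref[0])
    bias, weights = 0.0, [0.0] * k
    labelled = [(X, 0.0, len(ref)) for X in ref] + [(X, 1.0, len(test)) for X in test]
    for _ in range(steps):
        grad_b, grad_w = 0.0, [0.0] * k
        for X, label, n in labelled:
            # Derivative of the loss with respect to the logit of this window.
            err = (predict((bias, weights), X) - label) / n
            grad_b += err
            for j, x in enumerate(X):
                grad_w[j] += err * x
        bias -= lr * grad_b
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
    return bias, weights


def density_ratio(theta, X):
    """w(X) = P_te(X) / P_rf(X), estimated as f / (1 - f)."""
    f = predict(theta, X)
    return f / max(1.0 - f, 1e-12)
```

For a series that jumps from 0 to 1, training on windows taken before and after the jump lowers the loss below its starting value of 2 log 2, and the estimated ratio is above 1 for windows that look like the test sample and below 1 for windows that look like the reference sample.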

[1] Shyam Diwakar, et al. Lecture Notes in Networks and Systems, 2018, REV.

[2] Andrey Ustyuzhanin, et al. Generalization of Change-Point Detection in Time Series Data Based on Direct Density Ratio Estimation, 2020, J. Comput. Sci.

[3] B.D. Strom, et al. Hard Disk Drive Reliability Modeling and Failure Prediction, 2006, Asia-Pacific Magnetic Recording Conference 2006.

[4] Diane J. Cook, et al. A survey of methods for time series change point detection, 2017, Knowledge and Information Systems.

[5] Feng-Bin Sun, et al. A comprehensive review of hard-disk drive reliability, 1999, Annual Reliability and Maintainability Symposium. 1999 Proceedings (Cat. No.99CH36283).

[6] E. Silerova, et al. Knowledge and information systems, 2018.

[7] Nigel Collier, et al. Change-Point Detection in Time-Series Data by Relative Density-Ratio Estimation, 2012, Neural Networks.

[8] J. G. Elerath. Specifying reliability in the disk drive industry: No more MTBF's, 2000, Annual Reliability and Maintainability Symposium. 2000 Proceedings. International Symposium on Product Quality and Integrity (Cat. No.00CH37055).

[9] Ali Khan, et al. A Practical Approach to Hard Disk Failure Prediction in Cloud Platforms: Big Data Model for Failure Management in Datacenters, 2016, 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService).

[10] Denis Derkach, et al. Online Neural Networks for Change-Point Detection, 2020, ArXiv.

[11] Andrey Sapronov, et al. SANgo: a storage infrastructure simulator with reinforcement learning support, 2020, PeerJ Comput. Sci.

[12] Andrey Sapronov, et al. Tuning hybrid distributed storage system digital twins by reinforcement learning, 2018.