Copula-HDP-HMM: Non-parametric Modeling of Temporal Multivariate Data for I/O Efficient Bulk Cache Preloading

Caching is an important determinant of storage system performance. Bulk cache preloading is the process of loading large batches of relevant data into the cache minutes or hours before the application actually requests them. We address bulk preloading by extracting high-level spatio-temporal motifs from raw, noisy I/O traces, which we aggregate into a temporal sequence of correlated count vectors. Such temporal multivariate data arise from a diverse set of workloads, leading to diverse data distributions with complex spatio-temporal dependencies. Motivated by this, we propose the Copula-HDP-HMM, a new Bayesian non-parametric modeling technique based on the Gaussian copula that is suitable for temporal multivariate data with arbitrary marginals and avoids limiting assumptions on the marginal distributions. We are not aware of prior work on copula-based extensions of Bayesian non-parametric modeling algorithms for discrete data. Inference with copulas is hard when the data are not continuous. We propose a semi-parametric inference technique based on the extended rank likelihood that circumvents specifying the marginals. This makes our inference suitable for count data, and even for data with a combination of discrete and continuous marginals, and enables Bayesian non-parametric modeling of several data types without assumptions on the marginals. Finally, we propose HULK, a strategy for I/O-efficient bulk cache preloading that uses our Copula-HDP-HMM model to leverage high-level spatio-temporal motifs in block I/O traces. In experiments on benchmark traces, HULK achieves a near-perfect hit rate of 0.95, a large improvement over a Multivariate Poisson baseline, with only a fourth of the I/O overhead.
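
As a rough illustration of how a Gaussian copula separates the dependence structure from the marginals, the sketch below draws correlated bivariate count vectors by pushing correlated Gaussian latents through the normal CDF and then through a marginal quantile function. It only shows forward sampling; it is not the Copula-HDP-HMM or its rank-likelihood inference. The Poisson marginals, the rates, the correlation `rho`, and the helper names `poisson_quantile` and `sample_copula_counts` are assumptions made for this example.

```python
import math
import random
from statistics import NormalDist


def poisson_quantile(u, rate):
    """Return the smallest k such that the Poisson(rate) CDF at k is >= u."""
    u = min(u, 1.0 - 1e-12)      # guard against u == 1.0 from CDF rounding
    k = 0
    pmf = math.exp(-rate)        # P(X = 0)
    cdf = pmf
    while cdf < u and k < 10_000:
        k += 1
        pmf *= rate / k          # P(X = k) from P(X = k - 1)
        cdf += pmf
    return k


def sample_copula_counts(rho, rates, n, seed=0):
    """Draw n bivariate count vectors from a Gaussian copula.

    rho   : correlation of the latent Gaussian pair (assumed, illustrative)
    rates : Poisson rates of the two marginals (assumed, illustrative)
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    samples = []
    for _ in range(n):
        # Correlated standard normal latents with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # The normal CDF maps each latent to a uniform; the marginal
        # quantile function then maps each uniform to a count.
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        samples.append((poisson_quantile(u1, rates[0]),
                        poisson_quantile(u2, rates[1])))
    return samples


if __name__ == "__main__":
    for x in sample_copula_counts(rho=0.8, rates=(3.0, 20.0), n=5):
        print(x)
```

Swapping `poisson_quantile` for any other quantile function changes the marginals without touching the dependence structure, which is the property the abstract relies on when it speaks of arbitrary marginals.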
