Mining hidden mixture context with ADIOS-P to improve predictive pre-fetcher accuracy

A predictive pre-fetcher predicts future data access events and loads the data before users request it. Such pre-fetchers have been widely studied, especially in file systems and web content servers, as a way to reduce data load latency. In scientific data visualization in particular, pre-fetching can reduce I/O waiting time. To increase prediction accuracy, we apply a data mining technique that discovers hidden contexts in data access patterns and then makes predictions based on the inferred context. Specifically, we use Probabilistic Latent Semantic Analysis (PLSA), a mixture-model-based algorithm popular in text mining, to mine hidden contexts from collected user access patterns, and we then run a predictor within the discovered context. We further improve PLSA by applying the Deterministic Annealing (DA) method to overcome the local optimum problem. In this paper, we demonstrate how PLSA with DA optimization can be applied to mine hidden contexts from users' data access patterns and improve predictive pre-fetcher performance.
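As a rough illustration of the idea, the sketch below fits a small PLSA model with a tempered (annealed) EM loop and then ranks data items for a session by their probability under the inferred contexts. It is a minimal sketch under stated assumptions, not the implementation used in the paper. It assumes the asymmetric PLSA form P(w|d) = sum_z P(z|d) P(w|z), treats each user session as a "document" and each data item as a "word", and anneals by raising the E-step posterior to a power beta that increases toward 1. The names `plsa_da`, `rank_items`, and `jitter`, the beta schedule, the perturbation scale, and the toy count matrix are all hypothetical.

```python
import random


def normalize(values):
    """Scale positive values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def jitter(rows, rng, scale=0.01):
    """Slightly perturb each distribution so contexts can separate (assumed heuristic)."""
    return [normalize([v * (1.0 + scale * rng.random()) for v in row]) for row in rows]


def plsa_da(counts, n_contexts, betas=(0.25, 0.5, 1.0), iters=30, seed=0):
    """Fit P(z|d) and P(w|z) with tempered EM.

    counts[d][w] is how often session d accessed data item w.
    betas is an assumed annealing schedule (inverse temperatures, ending at 1).
    """
    rng = random.Random(seed)
    n_docs, n_items = len(counts), len(counts[0])
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_contexts)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_items)])
             for _ in range(n_contexts)]
    for beta in betas:
        p_z_d, p_w_z = jitter(p_z_d, rng), jitter(p_w_z, rng)
        for _ in range(iters):
            # Small floor keeps every probability strictly positive.
            new_z_d = [[1e-12] * n_contexts for _ in range(n_docs)]
            new_w_z = [[1e-12] * n_items for _ in range(n_contexts)]
            for d in range(n_docs):
                for w in range(n_items):
                    if counts[d][w] == 0:
                        continue
                    # Tempered E-step: posterior over contexts raised to beta.
                    post = normalize([(p_z_d[d][z] * p_w_z[z][w]) ** beta
                                      for z in range(n_contexts)])
                    # Accumulate expected counts for the M-step.
                    for z in range(n_contexts):
                        new_z_d[d][z] += counts[d][w] * post[z]
                        new_w_z[z][w] += counts[d][w] * post[z]
            # M-step: renormalize expected counts into distributions.
            p_z_d = [normalize(row) for row in new_z_d]
            p_w_z = [normalize(row) for row in new_w_z]
    return p_z_d, p_w_z


def rank_items(p_z_given_d, p_w_z):
    """Rank items by P(w|d) = sum_z P(z|d) P(w|z), most likely first."""
    n_items = len(p_w_z[0])
    scores = [sum(p_z_given_d[z] * p_w_z[z][w] for z in range(len(p_w_z)))
              for w in range(n_items)]
    return sorted(range(n_items), key=lambda w: -scores[w])


# Toy data: four sessions, four data items, two apparent access contexts.
toy_counts = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 5, 4], [0, 0, 4, 5]]
p_z_d, p_w_z = plsa_da(toy_counts, n_contexts=2)
prefetch_order = rank_items(p_z_d[0], p_w_z)  # candidate pre-fetch order for session 0
```

Starting at a small beta flattens the posterior so that early iterations are less sensitive to the random initialization; raising beta to 1 recovers the standard EM update. A real pre-fetcher would also need to infer the context of a new, partially observed session, which this sketch does not cover.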
