Write Skew and Zipf Distribution: Evidence and Implications

Understanding workload characteristics is essential to storage systems design and performance optimization. With the emergence of flash memory as a new viable storage medium, the new design concern of flash endurance arises, necessitating a revisit of workload characteristics, in particular, of the write behavior. Inspired by Web caching studies where a Zipf-like access pattern is commonly found, we hypothesize that write count distribution at the block level may also follow Zipf’s Law. To validate this hypothesis, we study 48 block I/O traces collected from a wide variety of real and benchmark applications. Through extensive analysis, we demonstrate that the Zipf-like pattern indeed widely exists in write traffic provided its disguises are removed by statistical processing. This finding implies that write skew in a large class of applications could be analytically expressed and, thus, facilitates design tradeoff explorations adaptive to workload characteristics.

[1]  Songqing Chen,et al.  The stretched exponential distribution of internet media access patterns , 2008, PODC '08.

[2]  Pasquale Cirillo,et al.  Are your data really Pareto distributed , 2013, 1306.0100.

[3]  Qi Zhang,et al.  Characterization of storage workload traces from production Windows Servers , 2008, 2008 IEEE International Symposium on Workload Characterization.

[4]  Jin-Soo Kim,et al.  An empirical study of hot/cold data separation policies in solid state drives (SSDs) , 2013, SYSTOR '13.

[5]  Martin Arlitt,et al.  Web Workload Characterization: Ten Years Later , 2005 .

[6]  Pablo Rodriguez,et al.  I tube, you tube, everybody tubes: analyzing the world's largest user generated content video system , 2007, IMC '07.

[7]  M. E. J. Newman,et al.  Power laws, Pareto distributions and Zipf's law , 2005 .

[8]  David Hung-Chang Du,et al.  Hot data identification for flash-based storage systems using multiple bloom filters , 2011, 2011 IEEE 27th Symposium on Mass Storage Systems and Technologies (MSST).

[9]  Peter Desnoyers,et al.  Analytic Models of SSD Write Performance , 2014, TOS.

[10]  Yue Yang,et al.  Analytical modeling of garbage collection algorithms in hotness-aware flash-based solid state drives , 2014, 2014 30th Symposium on Mass Storage Systems and Technologies (MSST).

[11]  KimJin-Soo,et al.  A reconfigurable FTL (flash translation layer) architecture for NAND flash-based applications , 2008 .

[12]  Antony I. T. Rowstron,et al.  Write off-loading: Practical power management for enterprise storage , 2008, TOS.

[13]  Li-Pin Chang,et al.  Design and implementation of an efficient wear-leveling algorithm for solid-state-disk microcontrollers , 2009, TODE.

[14]  Werner Bux,et al.  Performance of greedy garbage collection in flash-based solid-state drives , 2010, Perform. Evaluation.

[15]  Benny Van Houdt,et al.  A mean field model for a class of garbage collection algorithms in flash-based solid state drives , 2013, Queueing Systems.

[16]  Bernd Blasius,et al.  Zipf's law in the popularity distribution of chess openings. , 2007, Physical review letters.

[17]  Anand Sivasubramaniam,et al.  Leveraging Value Locality in Optimizing NAND Flash-based SSDs , 2011, FAST.

[18]  Li Fan,et al.  Web caching and Zipf-like distributions: evidence and implications , 1999, IEEE INFOCOM '99. Conference on Computer Communications. Proceedings. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. The Future is Now (Cat. No.99CH36320).

[19]  Tian Luo,et al.  CAFTL: A Content-Aware Flash Translation Layer Enhancing the Lifespan of Flash Memory based Solid State Drives , 2011, FAST.

[20]  Mark E. J. Newman,et al.  Power-Law Distributions in Empirical Data , 2007, SIAM Rev..

[21]  Tei-Wei Kuo,et al.  Efficient identification of hot data for flash memory storage systems , 2006, TOS.

[22]  M. Newman Power laws, Pareto distributions and Zipf's law , 2005 .

[23]  Peter Desnoyers,et al.  Analytic modeling of SSD write performance , 2012, SYSTOR '12.

[24]  Rina Panigrahy,et al.  Design Tradeoffs for SSD Performance , 2008, USENIX ATC.

[25]  Zongpeng Li,et al.  Youtube traffic characterization: a view from the edge , 2007, IMC '07.

[26]  Alma Riska,et al.  Disk Drive Level Workload Characterization , 2006, USENIX Annual Technical Conference, General Track.

[27]  Alec Wolman,et al.  Measurement and Analysis of a Streaming Media Workload , 2001, USITS.

[28]  George Kingsley Zipf,et al.  Human behavior and the principle of least effort , 1949 .

[29]  Hiroaki Kobayashi,et al.  Modeling of cache access behavior based on Zipf's law , 2008, MEDEA '08.

[30]  Edward Chlebus,et al.  An approximate formula for a partial sum of the divergent p-series , 2009, Appl. Math. Lett..

[31]  Mark Crovella,et al.  Characteristics of WWW Client-based Traces , 1995 .

[32]  Carey L. Williamson,et al.  Internet Web servers: workload characterization and performance implications , 1997, TNET.

[33]  Amin Vahdat,et al.  MediSyn: a synthetic streaming media service workload generator , 2003, NOSSDAV '03.

[34]  Peter Nijkamp,et al.  Accessibility of Cities in the Digital Economy , 2004, cond-mat/0412004.

[35]  Ben Y. Zhao,et al.  Understanding user behavior in large-scale video-on-demand systems , 2006, EuroSys.