Storage characterization for unstructured data in online services applications

Mega datacenters hosting large-scale web services have unique workload attributes that must be taken into account for optimal service scalability. Provisioning compute and storage resources to provide a seamless user experience is challenging because customer traffic loads vary widely across time and geographies, and the servers hosting these applications must be rightsized to deliver performance both within a single server and across a scale-out cluster. Typical user-facing web services have a three-tiered hierarchy: front-end web servers, middle-tier application logic, and a back-end data storage and processing layer. In this paper, we address the challenge of disk subsystem design for back-end servers hosting large amounts of unstructured (also called blob) data. Typical content hosted on such servers is user-generated: photos, email messages, videos, and social networking updates. The server applications analyzed in this paper are the message store of a large-scale email application, image tile storage for a large-scale geo-mapping application, and user content storage for Web 2.0 type applications. We analyze the storage subsystems for these web services in a live production environment and provide an overview of the disk traffic patterns and access characteristics for each application. We then explore time-series characteristics and derive probabilistic models of the state transitions between locations on the data volumes for these applications. Next, we examine how these probabilistic models could be extended into a framework for synthetic benchmark generation for such applications. Finally, we discuss how this framework can be used to rightsize the storage subsystem for optimal scalability of such back-end storage clusters.
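
The abstract does not spell out the form of the probabilistic models, so the following is only a minimal sketch of one plausible reading: a first-order Markov chain over equal-sized regions of a data volume, estimated from a trace of byte offsets and then sampled to produce a synthetic sequence of region accesses. The function names, the equal-width partitioning, the uniform fallback for unseen regions, and the toy trace are all assumptions made for illustration, not the authors' method.

```python
import random


def build_transition_matrix(offsets, volume_size, num_regions):
    """Estimate first-order transition probabilities between volume regions.

    offsets: byte offsets of consecutive I/O requests (hypothetical trace format).
    Assumption: the volume is split into num_regions equal-width regions.
    """
    # Map each offset to a region index, clamping the last byte into range.
    states = [min(off * num_regions // volume_size, num_regions - 1)
              for off in offsets]

    # Count transitions between consecutive requests.
    counts = [[0] * num_regions for _ in range(num_regions)]
    for src, dst in zip(states, states[1:]):
        counts[src][dst] += 1

    # Normalize each row into probabilities.
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Assumption: a region never left in the trace gets a uniform row.
            matrix.append([1.0 / num_regions] * num_regions)
        else:
            matrix.append([c / total for c in row])
    return matrix


def generate_synthetic(matrix, start, length, seed=None):
    """Sample a synthetic sequence of region indices from the chain."""
    rng = random.Random(seed)
    regions = range(len(matrix))
    state = start
    sequence = [state]
    for _ in range(length - 1):
        state = rng.choices(regions, weights=matrix[state])[0]
        sequence.append(state)
    return sequence


if __name__ == "__main__":
    # Toy trace on a 100-byte "volume" split into 4 regions.
    trace = [0, 10, 60, 70, 5, 90]
    m = build_transition_matrix(trace, volume_size=100, num_regions=4)
    print(generate_synthetic(m, start=0, length=10, seed=1))
```

A sequence produced this way preserves only the region-to-region transition frequencies of the trace. Request sizes, read/write mix, and inter-arrival times would need separate models before such output could drive a benchmark.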
