BigDataBench: a Big Data Benchmark Suite from Web Search Engines

This paper presents our joint research efforts on big data benchmarking with several industrial partners. Considering the complexity, diversity, workload churn, and rapid evolution of big data systems, we take an incremental approach to big data benchmarking. As a first step, we focus on search engines, which constitute the most important domain of Internet services in terms of page views and daily visitors. However, search engine providers treat their data, applications, and Web access logs as business secrets, which prevents outsiders from building benchmarks. To overcome these difficulties, together with several industry partners we extensively investigated open-source search engine solutions and obtained permission to use anonymized Web access logs. Moreover, after two years of effort, we built a semantic search engine named ProfSearch (available from this http URL). These efforts pave the way for our big data benchmark suite derived from search engines, BigDataBench, which is publicly released (this http URL). We report a detailed analysis of search engine workloads and present our benchmarking methodology. We also propose an innovative data generation methodology and tool that scale a small seed of real data up to large volumes of big data while preserving the semantics and locality of the data. Finally, we report preliminary results from two case studies that use BigDataBench for system and architecture research.
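The abstract does not spell out the generation algorithm, so the following is only a minimal sketch of the general seed-based idea, not BigDataBench's actual tool: a bigram Markov model is trained on a small seed corpus and then sampled to produce arbitrarily large text volumes whose word distribution and local co-occurrence structure approximate those of the seed. All function names here (train_bigram_model, generate) are hypothetical.

```python
import random
from collections import defaultdict

def train_bigram_model(seed_tokens):
    """Learn word-to-successor frequencies from a small seed corpus."""
    model = defaultdict(list)
    for cur, nxt in zip(seed_tokens, seed_tokens[1:]):
        model[cur].append(nxt)  # duplicates encode successor frequency
    return model

def generate(model, n_tokens, seed=42):
    """Sample n_tokens words from the model. Successors are drawn in
    proportion to their frequency in the seed corpus, so the output
    roughly preserves the seed's word distribution and locality."""
    rng = random.Random(seed)
    words = list(model.keys())
    word = rng.choice(words)
    out = [word]
    for _ in range(n_tokens - 1):
        successors = model.get(word)
        if not successors:
            # Dead end (word only appeared at the end of the seed):
            # restart from a random word to keep generating.
            word = rng.choice(words)
        else:
            word = rng.choice(successors)
        out.append(word)
    return out

if __name__ == "__main__":
    seed_text = ("big data benchmarking needs scalable volumes of data "
                 "generated from a small seed of real data").split()
    model = train_bigram_model(seed_text)
    print(" ".join(generate(model, 30)))
```

In practice a real tool would model far richer structure (e.g., topic models over documents rather than a flat bigram chain), but the sketch shows why a small real seed suffices: the statistics are learned once and sampling scales to any target volume.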
