Sampling Based Range Partition Methods for Big Data Analytics

Big data analytics requires partitioning datasets into thousands of partitions according to a specific set of keys so that different machines can process different partitions in parallel. Range partitioning is one way to partition data and is needed whenever global ordering is required. It partitions the data according to a pre-defined set of mutually exclusive and contiguous ranges that cover the entire domain of the partition key. Producing high-quality (approximately equal-sized) partitions is a key problem for big data analytics because job latency is determined by the most loaded node. The problem is especially challenging because, at the start of a range partition, typically no statistics about the distribution of keys over machines are available for the input dataset. The system therefore needs a way to determine partition boundaries that is both cost-effective and accurate. This paper presents a weighted-sampling based approach, implemented in Cosmos, the cloud infrastructure for big data analytics used by the Microsoft Online Services Division. The approach has been used by many jobs daily and was found to be both efficient and to provide the desired partition quality.
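The core idea of sampling-based range partitioning can be illustrated with a minimal sketch. The paper's approach uses weighted sampling across machines; the sketch below simplifies this to plain uniform sampling on a single key list, picking sample quantiles as partition boundaries. All function names here are hypothetical, not from the paper or from Cosmos.

```python
import bisect
import random

def estimate_boundaries(keys, num_partitions, sample_size, seed=0):
    """Estimate range-partition boundaries from a uniform sample of keys.

    Hypothetical illustration: the paper's weighted-sampling scheme draws
    samples proportionally to per-machine data sizes; this sketch uses
    plain uniform sampling to show quantile-based boundary selection.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Take the i/num_partitions sample quantiles as the boundary keys,
    # yielding num_partitions - 1 boundaries.
    return [sample[(i * len(sample)) // num_partitions]
            for i in range(1, num_partitions)]

def assign_partition(key, boundaries):
    """Map a key to its partition index by binary search over boundaries."""
    return bisect.bisect_right(boundaries, key)
```

With enough samples, the resulting partitions are approximately equal-sized, which is the quality criterion the paper targets; the sample size controls the trade-off between sampling cost and balance.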
