Speed Up Big Data Analytics by Unveiling the Storage Distribution of Sub-Datasets

In this paper, we study the problem of sub-dataset analysis over distributed file systems such as the Hadoop Distributed File System (HDFS). Our experiments show that the distribution of sub-datasets over HDFS blocks, which HDFS hides, can often cause the corresponding analyses to suffer from seriously imbalanced or inefficient parallel execution. Specifically, the content clustering of sub-datasets causes some computational nodes to carry out much more workload than others; it also makes sampling of sub-datasets inefficient, because analysis programs often read large amounts of irrelevant data. We conduct a comprehensive analysis of how imbalanced computing patterns and inefficient sampling occur. We then propose DataNet, a storage-distribution-aware method for optimizing sub-dataset analysis over distributed storage systems. First, we propose an efficient algorithm to obtain the meta-data of sub-dataset distributions. Second, we design an elastic storage structure called ElasticMap, based on the HashMap and BloomFilter techniques, to store the meta-data. Third, we employ distribution-aware algorithms for sub-dataset applications to achieve balanced and efficient parallel execution. Our proposed method can benefit different sub-dataset analyses with various computational requirements. Experiments conducted on PRObE's 128-node Marmot cluster testbed show the performance benefits of DataNet.
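To make the second and third steps concrete, the following is a minimal Python sketch and not the DataNet implementation. It assumes that per-block meta-data keeps exact byte counts in a hash map for sub-datasets above a size threshold and records only membership in a Bloom filter for the rest; the abstract does not specify this split. The names `BloomFilter`, `ElasticMapSketch`, `balanced_assignment`, the `threshold` and `small_default` parameters, and the greedy largest-first scheduler are all hypothetical illustrations, and the scheduler ignores data locality.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Tiny Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        # May return a false positive, never a false negative.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class ElasticMapSketch:
    """Hypothetical per-block meta-data: exact sizes for large
    sub-datasets (hash map), membership only for small ones (Bloom filter)."""

    def __init__(self, records, threshold):
        # records: iterable of (sub_dataset_id, size_in_bytes) in one block
        sizes = Counter()
        for sub_id, size in records:
            sizes[sub_id] += size
        self.exact = {k: v for k, v in sizes.items() if v >= threshold}
        self.small = BloomFilter()
        for sub_id, size in sizes.items():
            if size < threshold:
                self.small.add(sub_id)

    def estimate(self, sub_id, small_default=0):
        """Exact size if tracked; small_default if only membership is known."""
        if sub_id in self.exact:
            return self.exact[sub_id]
        return small_default if sub_id in self.small else 0


def balanced_assignment(block_sizes, num_nodes):
    """Greedy largest-first assignment of blocks to the least-loaded node.

    block_sizes maps a block id to the estimated size of the target
    sub-dataset inside that block. Data locality is ignored here.
    """
    loads = [0] * num_nodes
    plan = {}
    for block, size in sorted(block_sizes.items(), key=lambda kv: -kv[1]):
        node = loads.index(min(loads))
        plan[block] = node
        loads[node] += size
    return plan, loads


# Usage: one block holding two sub-datasets, then a four-block schedule.
block0 = ElasticMapSketch([("movieA", 60), ("movieA", 40), ("movieB", 5)],
                          threshold=50)
plan, loads = balanced_assignment({"b0": 100, "b1": 60, "b2": 50, "b3": 10},
                                  num_nodes=2)
```

The exact map keeps the estimates that dominate scheduling decisions precise, while the Bloom filter bounds the memory spent on the long tail of small sub-datasets at the cost of occasional false positives.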
