Paper Information - Speed Up Big Data Analytics by Unveiling the Storage Distribution of Sub-Datasets

In this paper, we study the problem of sub-dataset analysis over distributed file systems, e.g., the Hadoop Distributed File System (HDFS). Our experiments show that the distribution of sub-datasets over HDFS blocks, which HDFS hides, can often cause the corresponding analyses to suffer from seriously imbalanced or inefficient parallel execution. Specifically, the content clustering of sub-datasets results in some computational nodes carrying a much heavier workload than others; furthermore, it leads to inefficient sampling of sub-datasets, as analysis programs will often read large amounts of irrelevant data. We conduct a comprehensive analysis of how imbalanced computing patterns and inefficient sampling occur. We then propose DataNet, a storage-distribution-aware method to optimize sub-dataset analysis over distributed storage systems. First, we propose an efficient algorithm to obtain the metadata of sub-dataset distributions. Second, we design an elastic storage structure called ElasticMap, based on the HashMap and BloomFilter techniques, to store the metadata. Third, we employ distribution-aware algorithms for sub-dataset applications to achieve balanced and efficient parallel execution. Our proposed method can benefit different sub-dataset analyses with various computational requirements. Experiments are conducted on PRObE's 128-node Marmot cluster testbed, and the results show the performance benefits of DataNet.
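The abstract names ElasticMap only as a structure built on HashMap and BloomFilter techniques and gives no further detail. The following is a minimal sketch of how such a combination might look, not the authors' implementation. It assumes that exact sizes are kept in a hash map for sub-datasets that occupy a large share of a block, and that only membership is recorded in a Bloom filter for the small ones. The names `BloomFilter`, `BlockMeta`, `threshold`, and `blocks_to_read` are hypothetical and chosen for illustration.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


class BlockMeta:
    """Hypothetical per-block metadata record (an assumption, not the paper's layout).

    Sub-datasets at or above `threshold` bytes get an exact size in a dict;
    smaller ones are only recorded as "possibly present" in a Bloom filter.
    """

    def __init__(self, sizes, threshold):
        self.exact = {}
        self.small = BloomFilter()
        for sub_id, size in sizes.items():
            if size >= threshold:
                self.exact[sub_id] = size
            else:
                self.small.add(sub_id)

    def lookup(self, sub_id):
        if sub_id in self.exact:
            return ("exact", self.exact[sub_id])
        if sub_id in self.small:
            return ("maybe_small", None)
        return ("absent", None)


def blocks_to_read(metas, sub_id):
    """Return the block ids that may contain sub_id, skipping the rest."""
    return [bid for bid, meta in metas.items() if meta.lookup(sub_id)[0] != "absent"]


# Illustrative data only: two blocks with made-up sub-dataset sizes in bytes.
metas = {
    "blk_1": BlockMeta({"movie_1": 900, "movie_2": 5}, threshold=100),
    "blk_2": BlockMeta({"movie_3": 400}, threshold=100),
}
```

Under these assumptions, a scheduler could call `blocks_to_read(metas, "movie_2")` to avoid scanning blocks that hold no relevant records, and use the exact sizes to balance work across nodes. The trade-off is that Bloom filter entries cost little memory but can occasionally flag a block that does not contain the sub-dataset.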
