Federated MapReduce to Transparently Run Applications on Multicluster Environment

In the Cloud era, data is generated everywhere, and efficiently analyzing such "Big Data", characterized by large volume, fast generation, and variety, is a critical issue. MapReduce is a simplified distributed parallel data processing model that has been widely applied in areas such as web indexing, clustering, and classification. However, when sensitive data, such as network logs or mail, is distributed among independent organizations, it must be kept private and cannot be aggregated for centralized analysis. We propose Federated MapReduce (Fed-MR), a framework for analyzing geographically distributed data among independent organizations while avoiding data movement. In contrast to previous work, Fed-MR retains the simplicity of the MapReduce programming model to provide a transparent way to run original MapReduce jobs across multiple clusters without any extra programming burden. Fed-MR also integrates multiple clusters in different locations to form hierarchical Top-Region relationships. In experiments, compared with a single cluster with the same number of worker nodes, computation time increased by an average of only 30% for WordCount and 10% for Grep. Fed-MR therefore incurs reasonable performance overhead for analyzing data across Internet-connected clusters, while requiring no additional Global Reduce function, unlike traditional hierarchical MapReduce frameworks.
