Paper information - Using Hadoop MapReduce in a multicluster environment


Hadoop MapReduce has become one of the most popular tools for data processing. Hadoop is normally installed on a cluster of computers. When the cluster becomes undersized, it can be scaled by adding new computers and storage devices, but it can also be extended with real or virtual resources from another computer cluster. We present an application of the MapReduce paradigm on a Hadoop installation extended across two clusters connected over the Internet. We measured the execution times of Map and Reduce tasks in the multicluster environment and compared them with the corresponding times obtained when only computers from a single cluster were used. The results show that MapReduce performance may decrease depending on the specific data analysis application, the ratio of local to remote computers, and the connection bandwidth to the remote computers. Additionally, the investigation suggests an upgrade to Apache Hadoop MapReduce that would make it better suited to the multicluster environment.

I. Tomasic | A. Rashkovska | M. Depolli
