Job scheduling for optimizing data locality in Hadoop clusters

We describe the use of non-dedicated clusters by a known group of local applications sharing the computational resources with additional bioinformatics MapReduce applications. We have studied how to effectively use the resources shared by both application types during their execution. In order to keep local application execution times unaffected we consider the configuration of a group of parameters of the Hadoop platform. One of the most relevant aspects to consider is the job scheduling policy. Our aim is to allow that tasks from different jobs that handle the same data blocks are grouped to be run on the same node where the blocks are allocated. Experimental results show that our approach outperforms traditional policies.

[1]  Ana Ripoll,et al.  Analysis and improvement of map-reduce data distribution in read mapping applications , 2012, The Journal of Supercomputing.

[2]  Geoffrey C. Fox,et al.  Investigation of Data Locality in MapReduce , 2012, 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012).

[3]  Scott Shenker,et al.  Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling , 2010, EuroSys '10.

[4]  M. DePristo,et al.  The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. , 2010, Genome research.

[5]  Peter Druschel,et al.  Resource containers: a new facility for resource management in server systems , 1999, OSDI '99.

[6]  Michael C. Schatz,et al.  CloudBurst: highly sensitive read mapping with MapReduce , 2009, Bioinform..

[7]  Ling Liu,et al.  Purlieus: Locality-aware resource allocation for MapReduce in a cloud , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[8]  Ricardo A. Baeza-Yates,et al.  Fast and Practical Approximate String Matching , 1992, Inf. Process. Lett..

[9]  Jin-Soo Kim,et al.  HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.

[10]  M. Schatz,et al.  Searching for SNPs with cloud computing , 2009, Genome Biology.

[11]  Michael Q. Zhang,et al.  Using quality scores and longer reads improves accuracy of Solexa read mapping , 2008, BMC Bioinformatics.

[12]  Tom White,et al.  Hadoop: The Definitive Guide , 2009 .

[13]  Suzanne J. Matthews,et al.  MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees , 2010, BMC Bioinformatics.

[14]  L. S. S. Reddy,et al.  Survey on Improved Scheduling in Hadoop MapReduce in Cloud Environments , 2012, ArXiv.

[15]  R. Durbin,et al.  Mapping Quality Scores Mapping Short Dna Sequencing Reads and Calling Variants Using P

, 2022 .

[16]  Xian-He Sun,et al.  ADAPT: Availability-Aware MapReduce Data Placement for Non-dedicated Distributed Computing , 2012, 2012 IEEE 32nd International Conference on Distributed Computing Systems.

[17]  Ruiqiang Li,et al.  SOAP: short oligonucleotide alignment program , 2008, Bioinform..

[18]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[19]  Wu-chun Feng,et al.  MOON: MapReduce On Opportunistic eNvironments , 2010, HPDC '10.