Analysis and improvement of map-reduce data distribution in read mapping applications

The map-reduce paradigm has proven to be a simple and feasible way of filtering and analyzing large data sets in cloud and cluster systems. Algorithms designed for the paradigm must implement regular data distribution patterns to ensure that resources are used appropriately. Good scalability and performance of map-reduce applications depend heavily on the design of regular patterns of intermediate data generation and consumption in the map and reduce phases. We describe the data distribution patterns found in current map-reduce read mapping applications in bioinformatics and present data decomposition principles that greatly improve their scalability and performance.
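To make the notion of an irregular intermediate data pattern concrete, the following is a minimal Python sketch, not the method of the paper. It assumes a seed-based read mapper whose map phase emits one record per k-mer of each read, and it counts how many intermediate records each reducer would receive. The function names (`map_seeds`, `shuffle`, `partition_loads`), the toy reads, and the CRC32-based partitioner are all hypothetical choices made for illustration.

```python
import zlib
from collections import defaultdict


def map_seeds(read_id, sequence, k):
    """Map phase (illustrative): emit one (seed, (read_id, offset)) pair per k-mer."""
    for offset in range(len(sequence) - k + 1):
        yield sequence[offset:offset + k], (read_id, offset)


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def partition_loads(groups, num_reducers):
    """Count intermediate records per reducer under a hash partitioner (assumed: CRC32)."""
    loads = [0] * num_reducers
    for key, values in groups.items():
        reducer = zlib.crc32(key.encode("ascii")) % num_reducers
        loads[reducer] += len(values)
    return loads


# Toy input (hypothetical): two of the three reads are low-complexity.
reads = {
    "r1": "ACGTACGTAC",
    "r2": "AAAAAAAAAA",
    "r3": "AAAAAAAAGT",
}

pairs = [pair for rid, seq in reads.items() for pair in map_seeds(rid, seq, k=4)]
groups = shuffle(pairs)
loads = partition_loads(groups, num_reducers=4)

for seed, values in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    print(seed, len(values))
print("records per reducer:", loads)
```

In this toy input the repetitive seed `AAAA` accounts for 12 of the 21 intermediate records. Because all values of one key go to a single reducer, that reducer receives at least 12 records regardless of how the remaining keys are spread. Skew of this kind is one example of the irregular generation-consumption patterns the abstract refers to.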
