Euro-Par 2013: Parallel Processing Workshops

As the explosion of data sizes continues to push the limits of our ability to store and process big data efficiently, next-generation big data systems face multiple challenges. One important challenge is the limited scalability of I/O, a determining factor in the overall performance of big data applications. Paradigms like MapReduce have long been used to take advantage of local disks and to avoid data movement over the network as much as possible. With increasing core counts per node, however, local storage itself comes under increasing I/O pressure, which prompts the need to equip nodes with multiple disks. At the same time, there is a rising need to virtualize large datacenters in order to provide more flexible allocation and consolidation of physical resources (transforming them into public or private/hybrid clouds). This raises the following questions: is it possible to take advantage of multiple local disks at the virtual machine (VM) level in order to speed up big data analytics? If so, what are the best practices to achieve a high aggregated I/O throughput in a virtualized setting? This paper aims to answer these questions in the context of I/O-intensive MapReduce workloads: it analyzes and characterizes their behavior under different virtualization scenarios in order to propose best practices for current approaches and to speculate on future areas of improvement.
