A distributed computing framework for All-to-All comparison problems

Distributed computation and storage are widely used for processing big data sets. For many big data problems, as the size of the data grows rapidly, the way computing tasks and their related data are distributed can greatly affect the performance of the computing system. This paper presents a distributed computing framework for high-performance computing of All-to-All Comparison Problems, in which every data item must be compared with every other item. A data distribution strategy is embedded in the framework to reduce storage space and balance the computing load across machines. Experiments demonstrate the effectiveness of the approach: about 88% of the ideal performance capacity is achieved across multiple machines.
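The problem setting can be illustrated with a minimal sketch. It is not the data distribution strategy of this paper, which the abstract does not specify; the function `assign_pairs` and its greedy rule are hypothetical, chosen only to show the two quantities the framework targets: the number of comparisons each node performs and the number of data items each node must store.

```python
from itertools import combinations


def assign_pairs(n_items, n_nodes):
    """Assign every unordered pair (i, j) of data items to exactly one node.

    Hypothetical greedy rule (an assumption, not the paper's method):
    pick the node with the fewest tasks so far, breaking ties by how
    few new data items that node would have to store.
    """
    tasks = [[] for _ in range(n_nodes)]      # comparison tasks per node
    stored = [set() for _ in range(n_nodes)]  # data items held per node

    for pair in combinations(range(n_items), 2):
        best = min(
            range(n_nodes),
            key=lambda k: (len(tasks[k]), len(set(pair) - stored[k]), k),
        )
        tasks[best].append(pair)
        stored[best].update(pair)
    return tasks, stored


if __name__ == "__main__":
    tasks, stored = assign_pairs(n_items=8, n_nodes=4)
    for k, (t, s) in enumerate(zip(tasks, stored)):
        print(f"node {k}: {len(t)} comparisons, {len(s)} items stored")
```

Because each pair always goes to a least-loaded node, the task counts differ by at most one between nodes. The storage per node is only reduced opportunistically by the tie-break, so a dedicated strategy such as the one the paper describes would be expected to do better on that axis.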
