Data-aware task scheduling for all-to-all comparison problems in heterogeneous distributed systems

Solving large-scale all-to-all comparison problems with distributed computing is increasingly important for a range of applications. Previous distributed all-to-all comparison frameworks have treated data distribution and comparison task scheduling as two separate phases. This leads to high storage demands and poor data locality for the comparison tasks, so data must be redistributed at runtime. Furthermore, most previous methods were developed for homogeneous computing environments, so their performance degrades further in heterogeneous distributed systems. To tackle these challenges, this paper presents a data-aware task scheduling approach for solving all-to-all comparison problems in heterogeneous distributed systems. The approach formulates the requirements for data distribution and comparison task scheduling jointly as a single constrained optimization problem. Metaheuristic data pre-scheduling and dynamic task scheduling strategies are then developed, along with an algorithmic implementation, to solve the problem. The approach provides perfect data locality for all comparison tasks, avoiding rearrangement of data at runtime. It balances load among heterogeneous computing nodes, thus reducing the overall computation time. It also reduces data storage requirements across the network. The effectiveness of the approach is demonstrated through experimental studies.
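
To make the idea of deciding data placement and task assignment together more concrete, the following is a minimal sketch, not the metaheuristic developed in the paper. It assumes unit-cost comparison tasks and a known relative speed per node, and it uses a simple greedy rule in place of the paper's optimization. The names `greedy_preschedule`, `num_files`, and `speeds` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def greedy_preschedule(num_files, speeds):
    """Assign every pair (i, j), i < j, to exactly one node.

    speeds[n] is an assumed relative processing speed for node n.
    Returns (tasks, files): per-node task lists and per-node file sets.
    """
    nodes = range(len(speeds))
    tasks = {n: [] for n in nodes}
    files = {n: set() for n in nodes}

    for pair in combinations(range(num_files), 2):
        def cost(n):
            # Estimated finish time if node n takes this task (unit-cost tasks).
            finish = (len(tasks[n]) + 1) / speeds[n]
            # Files node n would have to store in addition to what it holds.
            extra = len(set(pair) - files[n])
            return (finish, extra, n)

        best = min(nodes, key=cost)
        tasks[best].append(pair)
        files[best].update(pair)  # store both inputs locally: full data locality
    return tasks, files


# Illustrative run: 6 files, two nodes, the second assumed twice as fast.
tasks, files = greedy_preschedule(num_files=6, speeds=[1.0, 2.0])
for n in tasks:
    print(f"node {n}: {len(tasks[n])} tasks, stores files {sorted(files[n])}")
```

Because each task is placed only on a node that stores both of its input files, no data needs to move at runtime in this sketch. The total of the per-node file set sizes can be compared against full replication (`num_files` times the number of nodes) to gauge the storage cost of a given assignment.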
