A hybrid MPI/OpenMP parallel implementation of NSGA-II for finding patterns in protein sequences

Since the late 1970s, when the first DNA-based genome was sequenced, the field of biology has experienced significant growth in the amount of data that needs to be processed. It long ago became impractical to analyze all this information manually, creating a great need for new techniques, algorithms, and strategies to facilitate this work. Within the vast world of bioinformatics, we focus on proteomics and, more specifically, on the discovery of small, repeated, common patterns in sets of protein sequences that may represent some biological functionality. When a large number of sequences is analyzed, the problem exhibits non-deterministic polynomial time complexity, which suggests that it can benefit from combining high-performance computing with computational intelligence techniques. In this paper, we address the discovery of repeated common patterns as a multiobjective optimization problem by means of a hybrid MPI/OpenMP approach that parallelizes a well-known multiobjective metaheuristic, the fast non-dominated sorting genetic algorithm (NSGA-II). Our main objective is to combine the benefits of the shared-memory and distributed-memory programming paradigms to discover patterns accurately and efficiently. Experiments conducted on six different datasets, comparisons with other well-known biological tools, and the obtained speed-ups and efficiencies show that our approach performs well in terms of both parallel and biological results.
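
To make the multiobjective formulation more concrete, the following is a minimal, sequential Python sketch of the fast non-dominated sorting step that gives NSGA-II its name. It is an illustration only, not the hybrid MPI/OpenMP implementation described above: the function names, the choice of maximization, and the two-objective example scores are all assumptions, since the abstract does not state which objectives are optimized.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a Pareto-dominates b (all objectives maximized, by assumption)."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def fast_non_dominated_sort(scores: List[Sequence[float]]) -> List[List[int]]:
    """Group solution indices into Pareto fronts, best front first."""
    n = len(scores)
    dominated = [[] for _ in range(n)]  # dominated[i]: solutions that i dominates
    counts = [0] * n                    # counts[i]: how many solutions dominate i
    fronts: List[List[int]] = [[]]

    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(scores[i], scores[j]):
                dominated[i].append(j)
            elif dominates(scores[j], scores[i]):
                counts[i] += 1
        if counts[i] == 0:
            fronts[0].append(i)  # nothing dominates i: first front

    k = 0
    while fronts[k]:
        next_front = []
        for i in fronts[k]:
            for j in dominated[i]:
                counts[j] -= 1
                if counts[j] == 0:
                    next_front.append(j)  # all dominators already placed
        fronts.append(next_front)
        k += 1
    return fronts[:-1]  # drop the trailing empty front


# Hypothetical two-objective scores for five candidate patterns.
example_scores = [(3, 1), (1, 3), (2, 2), (1, 1), (0, 0)]
example_fronts = fast_non_dominated_sort(example_scores)
```

In this sketch the pairwise dominance comparisons form the quadratic part of the work; how the paper distributes the algorithm across MPI processes and OpenMP threads is not specified in the abstract and is not modeled here.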
