Using deep reinforcement learning approach for solving the multiple sequence alignment problem

In this paper, we use a deep reinforcement learning (DRL) approach to solve the multiple sequence alignment (MSA) problem, which is NP-complete. MSA refers to the process of arranging DNA, RNA, or protein sequences so as to maximize their regions of similarity. It is the first step in solving many bioinformatics problems, such as constructing phylogenetic trees. Our proposed approach models MSA as a DRL problem and uses long short-term memory (LSTM) networks for the estimation phase of the reinforcement learning algorithm. Furthermore, the actor-critic algorithm with experience replay is used to speed up convergence. Using deep Q-learning (an RL approach) and a Q-network overcomes the complexity of other approaches. The experimental evaluation is performed on 8 real-life datasets; on every dataset, our approach achieves a better alignment score than well-known approaches and tools such as MAFFT and ClustalW, as well as other heuristic approaches.
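To make "scoring" an alignment concrete, the sketch below computes a sum-of-pairs score, a commonly used MSA objective. The abstract does not state which scoring scheme the evaluation uses, so the choice of sum-of-pairs, the function name `sum_of_pairs`, and the match, mismatch, and gap values are all assumptions made for illustration only.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score an alignment by summing pairwise column scores.

    `alignment` is a list of equal-length strings in which '-' marks a gap.
    The default scores are illustrative assumptions, not values from the paper.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")

    total = 0
    # zip(*alignment) walks the alignment one column at a time.
    for column in zip(*alignment):
        # Every unordered pair of rows contributes once per column.
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # two gaps facing each other are not scored
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    # Three short DNA sequences, already padded with gaps to equal length.
    print(sum_of_pairs(["ACGT", "A-GT", "ACGA"]))
```

Under this kind of objective, a reinforcement learning agent could receive the score of the alignment it produces (or the change in score after each step) as its reward, although the abstract does not say how the reward is defined.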
