Evolutionary search techniques for the Lyndon factorization of biosequences

A non-empty string x over an ordered alphabet is said to be a Lyndon word if it is strictly smaller, in lexicographic order, than each of its other cyclic rotations. Any non-empty string can be uniquely factored into a lexicographically non-increasing sequence of Lyndon words, and efficient algorithms exist that perform this factorization in linear time and constant additional space. Lyndon words have wide-ranging applications, including string matching and pattern inference in bioinformatics. Here we investigate how permuting the alphabet ordering affects the resulting factorization, and we demonstrate significant variation in the number of factors obtained. We also propose an evolutionary algorithm that searches for alphabet orderings which improve the factorization, and we illustrate the impact of different operators. The flexibility of this approach is illustrated by our use of five fitness functions, which produce different factorizations suited to different downstream tasks.
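
To make the dependence on the alphabet ordering concrete, the following is a minimal sketch, not the authors' implementation. It uses Duval's linear-time factorization with an explicit `order` argument, followed by a very simple (1+1)-style search with swap mutation. The function names, the swap operator, and the choice of "number of factors, to be minimized" as fitness are all assumptions made for illustration; the abstract does not specify the operators or the five fitness functions actually used.

```python
import random


def lyndon_factorization(s, order):
    """Duval's algorithm: Lyndon factors of s under the alphabet ordering `order`.

    `order` lists the alphabet from smallest to largest letter, e.g. "ACGT".
    """
    rank = {ch: i for i, ch in enumerate(order)}
    r = [rank[ch] for ch in s]          # compare ranks instead of raw characters
    n = len(r)
    factors = []
    i = 0
    while i < n:
        j, k = i + 1, i
        # Extend the current run of repeated Lyndon-word prefixes.
        while j < n and r[k] <= r[j]:
            k = i if r[k] < r[j] else k + 1
            j += 1
        # Emit every complete copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def evolve_order(s, alphabet, generations=200, seed=0):
    """Hypothetical (1+1)-style search: swap two letters, keep the child if not worse.

    Fitness here is the number of Lyndon factors (lower is better); this is an
    illustrative assumption, not one of the paper's stated fitness functions.
    """
    rng = random.Random(seed)
    best = list(alphabet)
    rng.shuffle(best)
    best_fit = len(lyndon_factorization(s, best))
    for _ in range(generations):
        child = best[:]
        a, b = rng.sample(range(len(child)), 2)   # swap mutation on the permutation
        child[a], child[b] = child[b], child[a]
        fit = len(lyndon_factorization(s, child))
        if fit <= best_fit:
            best, best_fit = child, fit
    return "".join(best), best_fit


if __name__ == "__main__":
    seq = "GATTACA"
    print(lyndon_factorization(seq, "ACGT"))  # ['G', 'ATT', 'AC', 'A']
    print(lyndon_factorization(seq, "TGCA"))  # ['GA', 'TTACA']
    print(evolve_order(seq, "ACGT"))
```

On this toy sequence the ordering A < C < G < T yields four factors, while T < G < C < A yields two, which is the kind of variation the search exploits. For a four-letter DNA alphabet all 24 orderings could simply be enumerated; a heuristic search becomes relevant mainly for larger alphabets such as the 20 amino acids.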
