Solving the Longest Common Subsequence Problem Concerning Non-Uniform Distributions of Letters in Input Strings

The longest common subsequence (LCS) problem is a prominent NP-hard optimization problem where, given an arbitrary set of input strings, the aim is to find a longest subsequence that is common to all input strings. This problem has a variety of applications in bioinformatics, molecular biology and file plagiarism checking, among others. All previous approaches from the literature are dedicated to solving LCS instances sampled from uniform or near-to-uniform probability distributions of letters in the input strings. In this paper, we introduce an approach that is able to deal effectively with more general cases, where the occurrence of letters in the input strings follows a non-uniform distribution such as a multinomial distribution. The proposed approach makes use of a time-restricted beam search, guided by a novel heuristic named Gmpsum. This heuristic combines two complementary scoring functions in the form of a convex combination. Furthermore, apart from the close-to-uniform benchmark sets from the related literature, we introduce three new benchmark sets that differ in terms of their statistical properties. One of these sets concerns a case study in the context of text analysis. We provide a comprehensive empirical evaluation in two distinct settings: (1) short-time execution with a fixed beam size in order to evaluate the guidance abilities of the compared search heuristics; and (2) long-time execution with a fixed target duration in order to obtain high-quality solutions. In both settings, the newly proposed approach performs comparably to state-of-the-art techniques on close-to-uniform instances and outperforms state-of-the-art approaches on non-uniform instances.
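
To make the search scheme concrete, the following is a minimal Python sketch of a time-restricted beam search for the LCS problem whose guidance score is a convex combination of two scoring functions. It is an illustration under stated assumptions, not the Gmpsum heuristic itself: the abstract does not define the two scoring functions, so `score_min_remaining` and `score_letter_bound` are hypothetical stand-ins, and the names `beam_size`, `lam` and `time_limit` are likewise chosen here for illustration.

```python
import time


def is_subsequence(candidate, text):
    """Return True if `candidate` can be obtained from `text` by deleting letters."""
    it = iter(text)
    return all(ch in it for ch in candidate)


def extend(strings, pos, letter):
    """Advance every position pointer past the next `letter`, or return None."""
    nxt = []
    for s, p in zip(strings, pos):
        i = s.find(letter, p)
        if i < 0:
            return None  # letter no longer available in at least one string
        nxt.append(i + 1)
    return tuple(nxt)


def score_min_remaining(strings, pos):
    """Hypothetical score 1: length of the shortest remaining suffix."""
    return min(len(s) - p for s, p in zip(strings, pos))


def score_letter_bound(strings, pos, alphabet):
    """Hypothetical score 2: per-letter minimum remaining count, summed."""
    return sum(min(s.count(a, p) for s, p in zip(strings, pos)) for a in alphabet)


def beam_search_lcs(strings, beam_size=10, lam=0.5, time_limit=1.0):
    """Return a common subsequence found by beam search within `time_limit` seconds.

    Assumes `strings` is a non-empty list. `lam` in [0, 1] weights the two
    stand-in scores in a convex combination.
    """
    alphabet = sorted(set.intersection(*(set(s) for s in strings)))
    start = time.monotonic()
    beam = [(tuple(0 for _ in strings), "")]  # (position vector, partial solution)
    best = ""

    def guidance(item):
        pos = item[0]
        return (lam * score_min_remaining(strings, pos)
                + (1.0 - lam) * score_letter_bound(strings, pos, alphabet))

    while beam and time.monotonic() - start < time_limit:
        children = {}
        for pos, sol in beam:
            for a in alphabet:
                nxt = extend(strings, pos, a)
                # Keep one partial solution per position vector; all children
                # on the same level have the same length.
                if nxt is not None and nxt not in children:
                    children[nxt] = sol + a
        if not children:
            break
        ranked = sorted(children.items(), key=guidance, reverse=True)
        beam = ranked[:beam_size]
        best = beam[0][1]
    return best


if __name__ == "__main__":
    inputs = ["abcbdab", "bdcaba", "bcadab"]
    result = beam_search_lcs(inputs, beam_size=5, lam=0.5, time_limit=1.0)
    print(result, all(is_subsequence(result, s) for s in inputs))
```

Setting `lam` to 0 or 1 reduces the guidance to a single score, which is one simple way to compare the two components on a given instance; intermediate values trade them off. A practical implementation would also normalize the two scores to a common scale before combining them, which this sketch omits.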
