Using Minimum Path Cover to Boost Dynamic Programming on DAGs: Co-linear Chaining Extended

Aligning sequencing reads on graph representations of genomes is an important ingredient of pan-genomics. Such approaches typically find a set of local anchors that indicate plausible matches between substrings of a read and subpaths of the graph. These anchor matches are then combined to form a (semi-local) alignment of the complete read on a subpath. Co-linear chaining is an algorithmically rigorous approach to combining the anchors, and it is well known for the case where both inputs are sequences. Here we extend the approach so that one of the inputs can be a directed acyclic graph (DAG), e.g. a splicing graph in transcriptomics or a variant graph in pan-genomics.
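To make the chaining step concrete, the following is a minimal sketch of co-linear chaining for the classical two-sequence case only, not the DAG extension described here. It rests on assumptions not stated in the text: anchors are hypothetical tuples of inclusive intervals `(read_start, read_end, ref_start, ref_end)`, anchors in a chain may not overlap, the score is the number of read positions covered, and a simple quadratic dynamic program is used rather than any sub-quadratic technique. The function name `chain_anchors` is illustrative.

```python
from typing import List, Tuple

# Hypothetical anchor layout: (read_start, read_end, ref_start, ref_end), all inclusive.
Anchor = Tuple[int, int, int, int]


def chain_anchors(anchors: List[Anchor]) -> Tuple[int, List[Anchor]]:
    """Return (covered read length, chain) for a best non-overlapping chain.

    Anchor i may precede anchor j only if i ends strictly before j starts
    in BOTH the read and the reference (the co-linearity condition).
    """
    if not anchors:
        return 0, []

    # Sorting by read start guarantees every valid predecessor of an anchor
    # appears earlier in the list.
    order = sorted(anchors, key=lambda a: (a[0], a[2]))
    n = len(order)

    # best[j]: best score of a chain ending at order[j]; prev[j]: its predecessor.
    best = [a[1] - a[0] + 1 for a in order]
    prev = [-1] * n

    for j in range(n):
        length_j = order[j][1] - order[j][0] + 1
        for i in range(j):
            precedes = order[i][1] < order[j][0] and order[i][3] < order[j][2]
            if precedes and best[i] + length_j > best[j]:
                best[j] = best[i] + length_j
                prev[j] = i

    # Trace back from the best chain end.
    end = max(range(n), key=best.__getitem__)
    chain = []
    k = end
    while k != -1:
        chain.append(order[k])
        k = prev[k]
    chain.reverse()
    return best[end], chain


if __name__ == "__main__":
    example = [(0, 4, 10, 14), (5, 9, 0, 4), (6, 9, 20, 23), (12, 15, 30, 33)]
    score, chain = chain_anchors(example)
    print(score, chain)
```

In this example the anchor `(5, 9, 0, 4)` is left out of the best chain because it runs backwards in the reference relative to the first anchor. When one input is a DAG, the test `order[i][3] < order[j][2]` would have to become a reachability condition between graph nodes, which is where the difficulty addressed by this work arises.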
