A Graph-Theoretic Barcode Ordering Model for Linked-Reads

Given a set of intervals on the real line, an interval graph has one node per interval and an edge between every pair of intersecting intervals. Identifying (i.e., merging) pairs of nodes in an interval graph yields a multiple-interval graph. Given only the nodes and edges of a multiple-interval graph, without knowledge of the underlying intervals, we ask two questions. Can one determine how many intervals correspond to each node? Can one compute a walk over the nodes of the multiple-interval graph that reflects the ordering of the original intervals? These questions are closely related to linked-read DNA sequencing, in which barcodes are assigned to long molecules whose intersection graph is an interval graph. Each barcode may correspond to multiple molecules, which complicates downstream analysis; this corresponds to identifying nodes of the underlying interval graph. Resolving these graph-theoretic problems would facilitate the analysis of linked-read sequencing data by enabling the conceptual separation of barcodes into molecules and by providing, through the order of the molecules, a skeleton for accurately assembling the genome. Here, we propose a framework that takes as input an arbitrary intersection graph (such as an overlap graph of barcodes) and constructs a heuristic approximation of the ordering of the original intervals.
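
The two definitions above can be illustrated with a minimal sketch. It is not the proposed framework: it only builds an interval graph from known intervals and then merges nodes that share a label, which is the forward direction of the problem. The function names (`interval_graph`, `merge_nodes`), the toy coordinates, and the barcode labels are all hypothetical and chosen for illustration.

```python
from itertools import combinations


def interval_graph(intervals):
    """Adjacency sets over interval indices; an edge joins two overlapping intervals."""
    adj = {i: set() for i in range(len(intervals))}
    for i, j in combinations(range(len(intervals)), 2):
        (a1, b1), (a2, b2) = intervals[i], intervals[j]
        if a1 <= b2 and a2 <= b1:  # closed intervals intersect
            adj[i].add(j)
            adj[j].add(i)
    return adj


def merge_nodes(adj, label_of):
    """Identify nodes that share a label; return adjacency over labels, without self-loops."""
    merged = {label: set() for label in label_of.values()}
    for u, neighbours in adj.items():
        for v in neighbours:
            if label_of[u] != label_of[v]:
                merged[label_of[u]].add(label_of[v])
    return merged


# Hypothetical toy data: four molecules laid out along a line.
# The first and last molecule carry the same barcode "A".
molecules = [(0, 10), (8, 20), (18, 30), (28, 40)]
barcode_of = {0: "A", 1: "B", 2: "C", 3: "A"}

ig = interval_graph(molecules)      # a path 0 - 1 - 2 - 3
mig = merge_nodes(ig, barcode_of)   # a triangle on A, B, C

print(ig)
print(mig)
```

In this toy instance the interval graph is a simple path, but after merging the two "A" molecules the barcode graph becomes a triangle. The walk A, B, C, A over the merged graph recovers the molecule order, while the graph alone no longer shows that "A" covers two separate intervals.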
