20th International Workshop on Algorithms in Bioinformatics, WABI 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference)

We define a new problem in comparative genomics, denoted PQ-Tree Search, that takes as input a PQ-tree T representing the known gene orders of a gene cluster of interest, a gene-to-gene substitution scoring function h, integer parameters d_T and d_S, and a new genome S. The objective is to identify in S approximate new instances of the gene cluster that could vary from the known gene orders by genome rearrangements that are constrained by T, by gene substitutions that are governed by h, and by gene deletions and insertions that are bounded from above by d_T and d_S, respectively. We prove that the PQ-Tree Search problem is NP-hard and propose a parameterized algorithm that solves the optimization variant of PQ-Tree Search in O*(2^γ) time, where γ is the maximum degree of a node in T and O* is used to hide factors polynomial in the input size. The algorithm is implemented as a search tool, denoted PQFinder, and applied to search for instances of chromosomal gene clusters in plasmids, within a dataset of 1,487 prokaryotic genomes. We report on 29 chromosomal gene clusters that are rearranged in plasmids, where the rearrangements are guided by the corresponding PQ-tree. One of these results, coding for a heavy metal efflux pump, is further analysed to exemplify how PQFinder can be harnessed to reveal interesting new structural variants of known gene clusters.

2012 ACM Subject Classification: Applied computing → Bioinformatics
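
To make concrete what it means for rearrangements to be "constrained by T", the following Python sketch enumerates every gene order that a small PQ-tree represents, using the standard PQ-tree semantics: the children of a P-node may appear in any order, while the children of a Q-node may only appear in their given order or its reverse. The tuple encoding, the function name `frontiers`, and the example tree are assumptions made for illustration. This is a brute-force enumeration, not the parameterized algorithm described above, and it ignores substitutions, deletions, and insertions.

```python
from itertools import permutations, product

# Hypothetical encoding (not from the paper): a leaf is a gene name (str);
# an internal node is ("P", [children]) or ("Q", [children]).

def frontiers(node):
    """Return the set of gene orders (as tuples) the subtree can represent."""
    if isinstance(node, str):
        return {(node,)}
    kind, children = node
    if kind == "P":
        # P-node: children may be arranged in any order.
        orders = permutations(range(len(children)))
    else:
        # Q-node: children keep their order or are reversed as a whole.
        idx = tuple(range(len(children)))
        orders = (idx, idx[::-1])
    child_sets = [frontiers(c) for c in children]
    result = set()
    for order in orders:
        # Combine one frontier per child, following the chosen child order.
        for combo in product(*(child_sets[i] for i in order)):
            result.add(tuple(g for part in combo for g in part))
    return result

# Example tree: a Q-node whose middle child is a P-node over three genes.
tree = ("Q", ["a", ("P", ["b", "c", "d"]), "e"])
orders = frontiers(tree)
```

For this example tree, `("a", "b", "c", "d", "e")` and its reversal are both represented, whereas `("b", "a", "c", "d", "e")` is not, because gene `a` must stay at one end of the Q-node.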
