Approximate search for known gene clusters in new genomes using PQ-trees

We define a new problem in comparative genomics, denoted PQ-Tree Search, whose input is a PQ-tree $T$ representing the known gene orders of a gene cluster of interest, a gene-to-gene substitution scoring function $h$, integer parameters $d_T$ and $d_S$, and a new genome $S$. The objective is to identify in $S$ approximate new instances of the gene cluster. Such instances may differ from the known gene orders by genome rearrangements constrained by $T$, by gene substitutions governed by $h$, and by gene deletions and insertions bounded from above by $d_T$ and $d_S$, respectively. We prove that PQ-Tree Search is NP-hard and propose a parameterized algorithm that solves its optimization variant in $O^*(2^{\gamma})$ time, where $\gamma$ is the maximum degree of a node in $T$ and $O^*$ hides factors polynomial in the input size. The algorithm is implemented as a search tool, denoted PQFinder, and applied to search for instances of chromosomal gene clusters in plasmids within a dataset of 1,487 prokaryotic genomes. We report 29 chromosomal gene clusters that are rearranged in plasmids, where the rearrangements are guided by the corresponding PQ-tree. One of these results, coding for a heavy metal efflux pump, is further analysed to exemplify how PQFinder can be harnessed to reveal interesting new structural variants of known gene clusters. The code for the tool, as well as all the data needed to reconstruct the results, is publicly available on GitHub (this http URL).
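To make concrete what "rearrangements constrained by $T$" means, the following minimal Python sketch enumerates the gene orders that a small PQ-tree permits, using the standard PQ-tree semantics: the children of a P-node may appear in any order, while the children of a Q-node may appear only in their given order or its reversal. The tree encoding, the function name `frontiers`, and the gene labels `a` to `d` are illustrative assumptions, not part of PQFinder. The sketch ignores substitutions ($h$) and the deletion and insertion bounds ($d_T$, $d_S$), and its brute-force enumeration is not the parameterized algorithm described above.

```python
from itertools import permutations, product

# Hypothetical encoding (an assumption for illustration):
# a leaf is a gene name (str); an internal node is (kind, children)
# with kind "P" or "Q".

def frontiers(node):
    """Return the set of leaf orders (tuples) allowed by the PQ-tree rooted at node."""
    if isinstance(node, str):
        return {(node,)}
    kind, children = node
    child_sets = [frontiers(child) for child in children]
    indices = tuple(range(len(children)))
    if kind == "P":
        # P-node: children may be arranged in any order.
        orders = permutations(indices)
    elif kind == "Q":
        # Q-node: children keep their order, or the whole order is reversed.
        orders = [indices, indices[::-1]]
    else:
        raise ValueError(f"unknown node kind: {kind!r}")
    result = set()
    for order in orders:
        # Combine every allowed order of each child, in the chosen child order.
        for combo in product(*(child_sets[i] for i in order)):
            result.add(tuple(gene for part in combo for gene in part))
    return result

# Toy cluster: a Q-node over a, a P-node {b, c}, and d.
tree = ("Q", ["a", ("P", ["b", "c"]), "d"])
allowed = frontiers(tree)
for gene_order in sorted(allowed):
    print(" ".join(gene_order))
```

In this toy tree, `b` and `c` may swap freely and the whole cluster may be reversed, but `a` can never sit between `b` and `c`. Exhaustive enumeration like this grows exponentially with the node degrees, so it is useful only for inspecting small trees.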
