CSBFinder: discovery of colinear syntenic blocks across thousands of prokaryotic genomes

MOTIVATION: Identification of conserved syntenic blocks across microbial genomes is important for several problems in comparative genomics, such as gene annotation, the study of genome organization and evolution, and the prediction of gene interactions. Current tools for syntenic block discovery do not scale to the large number of prokaryotic genomes available today.

RESULTS: We present a novel methodology for the discovery, ranking and taxonomic distribution analysis of colinear syntenic blocks (CSBs): groups of genes that are consistently located close to each other, in the same order, across a wide range of taxa. We present an efficient algorithm that identifies CSBs in large genomic datasets. The algorithm is implemented in a novel tool with a graphical user interface, denoted CSBFinder, that ranks the discovered CSBs according to a probabilistic score and clusters them into families according to their gene content similarity. We apply CSBFinder to mine 1487 prokaryotic genomes, including chromosomes and plasmids. For post-processing analysis, we generate heatmaps that visualize the distribution of CSB family members across various taxa. We exemplify the utility of CSBFinder in operon prediction, in deciphering unknown gene function and in taxonomic analysis of colinear syntenic blocks.

AVAILABILITY AND IMPLEMENTATION: CSBFinder software and code are publicly available at https://github.com/dinasv/CSBFinder.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
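To make the CSB definition above concrete, the following is a minimal, naive sketch of the underlying idea: count how many genomes contain the same run of genes in the same order. It is not CSBFinder's algorithm or API. The function name `find_colinear_blocks`, the parameters `k` and `quorum`, and the toy gene labels are all hypothetical, and the sketch assumes exact contiguous matches on a single strand with genes already mapped to family identifiers.

```python
from collections import defaultdict


def find_colinear_blocks(genomes, k, quorum):
    """Return {gene tuple: genome count} for every run of k consecutive genes
    that occurs, in the same order, in at least `quorum` genomes.

    Illustrative assumption: `genomes` maps a genome ID to an ordered list of
    gene family identifiers; insertions and strand are ignored.
    """
    support = defaultdict(set)
    for genome_id, genes in genomes.items():
        # Slide a window of k genes along the genome.
        for i in range(len(genes) - k + 1):
            support[tuple(genes[i:i + k])].add(genome_id)
    # Keep only blocks seen in enough distinct genomes.
    return {block: len(ids) for block, ids in support.items() if len(ids) >= quorum}


# Toy input: three hypothetical genomes with made-up gene family labels.
genomes = {
    "g1": ["A", "B", "C", "D"],
    "g2": ["X", "A", "B", "C"],
    "g3": ["B", "C", "D", "Y"],
}
blocks = find_colinear_blocks(genomes, k=3, quorum=2)
print(blocks)
```

On this toy input, the blocks A-B-C and B-C-D are each found in two genomes. A window-counting approach like this one only illustrates the notion of a block conserved in order across genomes; it does not reflect the scoring, family clustering or scalability described in the abstract.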
