Coordinate systems for supergenomes

Background: Genome sequences and genome annotation data are becoming available at ever-increasing rates as sequencing technologies advance rapidly. As a consequence, the demand for methods that support comparative, evolutionary analysis is also growing. In particular, efficient tools to visualize -omics data simultaneously for multiple species are sorely lacking. A first and crucial step in this direction is the construction of a common coordinate system. Since genomes differ not only by rearrangements but also by large insertions, deletions, and duplications, a single reference genome is insufficient, in particular when the number of species becomes large.

Results: The computational problem then becomes to determine an order and orientation of optimal local alignments that are as co-linear as possible with all the genome sequences. We first review the most prominent approaches to modeling the problem formally and then show that it can be phrased as a particular variant of the Betweenness Problem, which is NP-hard in general. As exact solutions are beyond reach for the problem sizes of practical interest, we introduce a collection of heuristic simplifiers to resolve ordering conflicts.

Conclusion: Benchmarks on real-life data ranging from bacterial to fly genomes demonstrate the feasibility of computing good common coordinate systems.
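To make the Betweenness formulation concrete, the following is a minimal sketch, not the authors' method. It assumes each genome is given as a list of alignment-block labels and that every three consecutive blocks (a, b, c) in a genome contribute the constraint "b lies between a and c" in the common order. The function names and the toy input are hypothetical, and the exhaustive search only illustrates why exact solutions do not scale; orientation of blocks is ignored.

```python
from itertools import permutations


def betweenness_triples(genomes):
    """Collect (a, b, c) triples from consecutive blocks of each genome.

    Assumption for this sketch: b should lie between a and c in the
    common order, regardless of reading direction.
    """
    triples = set()
    for blocks in genomes:
        for a, b, c in zip(blocks, blocks[1:], blocks[2:]):
            triples.add((a, b, c))
    return triples


def count_satisfied(order, triples):
    """Count triples whose middle element lies between the outer two."""
    pos = {block: i for i, block in enumerate(order)}
    return sum(
        1
        for a, b, c in triples
        if pos[a] < pos[b] < pos[c] or pos[c] < pos[b] < pos[a]
    )


def best_order_bruteforce(genomes):
    """Try every permutation of blocks; feasible only for tiny inputs."""
    blocks = sorted({b for g in genomes for b in g})
    triples = betweenness_triples(genomes)
    best = max(permutations(blocks), key=lambda o: count_satisfied(o, triples))
    return list(best), count_satisfied(best, triples), len(triples)


if __name__ == "__main__":
    # Toy input: the third genome is the reverse of the first, which a
    # betweenness constraint tolerates because it is direction-free.
    genomes = [["A", "B", "C", "D"], ["A", "C", "D"], ["D", "C", "B", "A"]]
    order, satisfied, total = best_order_bruteforce(genomes)
    print(order, satisfied, total)
```

The search visits all n! permutations, so it is useful only as a correctness reference for a handful of blocks; at realistic sizes one would fall back on heuristics, as the abstract indicates.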
