Encoding Data Using Biological Principles: The Multisample Variant Format for Phylogenomics and Population Genomics

Rapid progress in phylogenomics and population genomics has driven increases in both the size of multi-genomic datasets and the number and complexity of genome-wide analyses. We present the Multisample Variant Format (MVF), designed specifically to store multiple sequence alignments for phylogenomic and population genomic analysis. The signature feature of MVF is a distinctive encoding of aligned sites according to their biological information content (e.g., invariant, low-coverage). This biological pattern-based encoding of sequence data allows rapid filtering and quality control and speeds up computation for many analyses. Like other modern formats, MVF has a simple data structure and a flexible header structure that accommodates project metadata, allowing it to serve also as an effective format for data publication and sharing. We also propose several variants of MVF to accommodate protein and codon alignments, quality scores, and a mix of de novo and reference-aligned data. The MVFtools package converts data from other common sequence formats to MVF. MVFtools completes tasks ranging from simple transformation and filtering operations to complex genome-wide visualizations in only a few minutes, even on large datasets. In addition to presenting MVF and MVFtools, we discuss how the broader concept of using biological principles and patterns to inform sequence data encoding applies both to MVF and to other existing data formats.
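To make the idea of pattern-based encoding concrete, the following is a minimal Python sketch of how aligned sites might be labeled by information content and how an invariant site might be collapsed. It is an illustration only and is not the MVF specification: the function names `classify_column` and `compact_column`, the category labels, the set of missing-data characters, and the 50% coverage threshold are all assumptions made for this example.

```python
# Toy illustration of biologically informed site encoding.
# NOT the MVF specification: names, labels, and thresholds are hypothetical.

MISSING_CHARS = {"-", "N", "?"}  # assumed symbols for gaps / missing data


def classify_column(column, min_coverage=0.5):
    """Label one aligned site (one character per sample) by its pattern."""
    called = [base.upper() for base in column if base.upper() not in MISSING_CHARS]
    if not called:
        return "empty"
    # Fraction of samples with a called base at this site.
    if len(called) / len(column) < min_coverage:
        return "low-coverage"
    if len(set(called)) == 1:
        return "invariant"
    return "variant"


def compact_column(column):
    """Collapse a site where every sample is identical to a single character."""
    if len(set(column)) == 1:
        return column[0]
    return column


if __name__ == "__main__":
    for site in ["AAAA", "AACA", "A---", "----"]:
        print(site, classify_column(site), compact_column(site))
```

Under these assumptions, a filter that discards low-coverage or invariant sites only needs to inspect a label or the length of the stored string rather than re-scan every sample at every site, which is the kind of shortcut the abstract attributes to pattern-based encoding.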
