NeXML: Rich, Extensible, and Verifiable Representation of Comparative Data and Metadata

Abstract: In scientific research, integration and synthesis require a common understanding of where data come from, how much they can be trusted, and what they may be used for. To make such an understanding computer-accessible requires standards for exchanging richly annotated data. The challenges of conveying reusable data are particularly acute in evolutionary comparative analysis, which comprises an ever-expanding list of data types, methods, research aims, and subdisciplines. To facilitate interoperability in evolutionary comparative analysis, we present NeXML, an XML standard (inspired by the current standard, NEXUS) that supports exchange of richly annotated comparative data. NeXML defines syntax for operational taxonomic units, character-state matrices, and phylogenetic trees and networks. Documents can be validated unambiguously. Importantly, any data element can be annotated, to an arbitrary degree of richness, using a system that is both flexible and rigorous. We describe how the use of NeXML by the TreeBASE and Phenoscape projects satisfies user needs that cannot be satisfied with other available file formats. By relying on XML Schema Definition, the design of NeXML facilitates the development and deployment of software for processing, transforming, and querying documents. The adoption of NeXML for practical use is facilitated by the availability of (1) an online manual with code samples and a reference to all defined elements and attributes, (2) programming toolkits in most of the languages used commonly in evolutionary informatics, and (3) input–output support in several widely used software applications. An active, open, community-based development process enables future revision and expansion of NeXML.
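
As a rough illustration of the kind of document the abstract describes, the sketch below embeds a small NeXML-style fragment (two operational taxonomic units and one tree) and reads it with Python's standard library. The element and attribute names (`otus`, `otu`, `trees`, `tree`, `node`, `edge`, `source`, `target`, `length`) and the namespace URI are assumptions based on the public NeXML schema as we recall it and are not taken from the text above. The helper `tip_lengths` is hypothetical, and the fragment is parsed but not validated against the schema.

```python
import xml.etree.ElementTree as ET

# Assumed NeXML namespace; treat as illustrative, not authoritative.
NS = {"nex": "http://www.nexml.org/2009"}

# Hypothetical minimal document: two OTUs and one tree whose tips refer to them.
DOC = """<nex:nexml version="0.9"
    xmlns:nex="http://www.nexml.org/2009"
    xmlns="http://www.nexml.org/2009"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <otus id="otus1">
    <otu id="t1" label="Homo sapiens"/>
    <otu id="t2" label="Pan troglodytes"/>
  </otus>
  <trees id="trees1" otus="otus1">
    <tree id="tree1" xsi:type="nex:FloatTree">
      <node id="n0" root="true"/>
      <node id="n1" otu="t1"/>
      <node id="n2" otu="t2"/>
      <edge id="e1" source="n0" target="n1" length="0.5"/>
      <edge id="e2" source="n0" target="n2" length="0.7"/>
    </tree>
  </trees>
</nex:nexml>"""


def tip_lengths(xml_text):
    """Map each OTU label to the length of the edge leading to its tip node."""
    root = ET.fromstring(xml_text)
    # OTU id -> human-readable label
    labels = {o.get("id"): o.get("label")
              for o in root.iterfind("nex:otus/nex:otu", NS)}
    result = {}
    for tree in root.iterfind("nex:trees/nex:tree", NS):
        # node id -> OTU id (None for internal nodes)
        node_otu = {n.get("id"): n.get("otu")
                    for n in tree.iterfind("nex:node", NS)}
        for edge in tree.iterfind("nex:edge", NS):
            otu = node_otu.get(edge.get("target"))
            if otu is not None:
                result[labels[otu]] = float(edge.get("length"))
    return result


if __name__ == "__main__":
    print(tip_lengths(DOC))
```

Because trees refer to taxa by identifier instead of repeating names, the same set of operational taxonomic units can be shared by trees and character matrices in one document. That indirection is what the lookup through `node_otu` and `labels` mirrors here.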
