OnionTree XML: A Format to Exchange Gene-Related Probabilities

Many studies in medical and biological genetics and in functional genomics include genome-wide analysis. Because cellular functions are coordinated, the behavior of groups of genes can be more informative in these studies than that of any single gene. Experimental and technical developments now allow genome-wide measurement of many molecular components of the cell, including mRNA transcripts (1), DNA sequence (2–5) and structure (6–10), DNA binding by transcriptional regulators (11), microRNAs, proteins, and metabolites (12). Analysis software has been developed for each type of data, much of it available within the R/Bioconductor framework (13). A major issue remains for data mining and statistical inference on high-throughput data: the “curse of dimensionality”, which arises because tens of thousands of molecular components are generally measured in only tens or hundreds of conditions. A logical approach to this problem is Bayesian statistics (14), in which prior information accumulated over many years of targeted biological studies is used to reduce the search space during model fitting.
Many analyses require several data-processing steps, from image acquisition and processing through normalization to data mining or statistical inference, so it is often necessary to build an analysis pipeline. The ideal pipeline would allow the integration of prior knowledge and, potentially, the use of measurements in one molecular domain to guide inference in another. For instance, genes known a priori to function in parallel redundant pathways may be more likely to show genetic interactions in a genome-wide association study (GWAS). Alternatively, genes that share transcription factor binding, as determined by ChIP-seq measurements, may be more likely to show correlated expression.
The Bayesian framework is quite natural for data exchange in this case, especially between programs that handle different forms of gene-related information and different representations of the data. XML (eXtensible Markup Language) was invented in the late 1990s (15) as a way to represent documents in a machine-readable hypertext form. The represented information is organized as a tree, and a predefined description of the tree allows the data to be validated. The tree nodes are XML elements, and elements can contain one another: if node A is a child of node B, the element corresponding to B contains the element corresponding to A. Each element belongs to a type, and the list of types and their permitted relations is the essence of the description (the XML schema) mentioned above. Each schema corresponds to a definite data type, e.g. a book, an image, or a worksheet. Over the past decade, XML has become the most common way to exchange data over the Internet.
Current bioinformatics practice uses a large variety of XML-based languages that describe different data types (for a review, see (15)). We mention a few that are most applicable to this domain. XEMBL (16) is an XML format for EMBL data. CisML (17) and SmallBisMark (18) represent sequence motif information such as transcription factor binding sites, while MAGE-ML (19, 20) is intended for the representation of microarray metadata. SBML (21) and CellML (22) capture biological network models, and MFAML (23) describes metabolic fluxes. In addition, there are XML formats (24, 25) that represent Bayesian information in a very general form. However, we require a format for Bayesian information that is suited to biological systems but is not too specialized, unlike the biological XML formats noted above. Our goal is an XML format that encodes relationships as probabilities of interactions for the purposes of genetics and bioinformatics, with the interpretation of the message depending on the context of the parser. This will permit the interchange of probabilistic information between bioinformatics frameworks that refer to different aspects of genomics knowledge.
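As a minimal sketch of the idea, the Python fragment below reads gene-related probabilities from a nested XML tree using only the standard library. The layout is hypothetical: the element names `probabilities`, `group`, and `gene`, the `probability` and `context` attributes, and the gene and group identifiers are illustrative assumptions and are not taken from the OnionTree XML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: all element and attribute names are illustrative
# assumptions, not the actual OnionTree XML schema.
EXAMPLE = """\
<probabilities context="transcription-factor-binding">
  <group id="TF_A">
    <gene id="GENE1" probability="0.9"/>
    <gene id="GENE2" probability="0.6"/>
  </group>
  <group id="TF_B">
    <gene id="GENE2" probability="0.7"/>
  </group>
</probabilities>
"""


def read_gene_probabilities(xml_text):
    """Return {(group_id, gene_id): probability} from the nested tree.

    Each <gene> element is contained in a <group> element, mirroring the
    parent/child containment of XML elements described in the text.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for group in root.findall("group"):
        for gene in group.findall("gene"):
            p = float(gene.get("probability"))
            # A probability outside [0, 1] indicates a malformed document.
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"probability out of range: {p}")
            result[(group.get("id"), gene.get("id"))] = p
    return result


priors = read_gene_probabilities(EXAMPLE)
context = ET.fromstring(EXAMPLE).get("context")
```

Here a receiving program could treat `priors` as prior probabilities for its own model, and the `context` attribute stands in for the notion that the meaning of the numbers depends on the parser that consumes them.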

[1]  Jean YH Yang,et al.  Bioconductor: open software development for computational biology and bioinformatics , 2004, Genome Biology.

[2]  P ? ? ? ? ? ? ? % ? ? ? ? , 1991 .

[3]  Stefano Morosetti,et al.  Prediction of Nucleosome Positioning in Genomes: Limits and Perspectives of Physical and Bioinformatic Approaches , 2010, Journal of biomolecular structure & dynamics.

[4]  Hanlee P. Ji,et al.  Next-generation DNA sequencing , 2008, Nature Biotechnology.

[5]  P. Bork,et al.  Human non-synonymous SNPs: server and survey. , 2002, Nucleic acids research.

[6]  Yehoshua Sagiv,et al.  Modeling and querying probabilistic XML data , 2009, SIGMOD Rec.

[7]  Sang Yup Lee,et al.  MFAML: a standard data structure for representing and exchanging metabolic flux models , 2005, Bioinform..

[8]  Bart De Moor,et al.  Importing MAGE-ML format microarray data into BioConductor , 2004, Bioinform..

[9]  Alexander V. Favorov,et al.  A Markov Chain Monte Carlo Technique for Identification of Combinations of Allelic Variants Underlying Complex Diseases in Humans , 2005, Genetics.

[10]  Hugo Naya,et al.  Composition Profile of the Human Genome at the Chromosome Level , 2009, Journal of biomolecular structure & dynamics.

[11]  Alan J. Robinson,et al.  XEMBL: distributing EMBL data in XML format , 2002, Bioinform..

[12]  Jason E. Stewart,et al.  Design and implementation of microarray gene expression markup language (MAGE-ML) , 2002, Genome Biology.

[13]  Michael F. Ochs,et al.  Knowledge-based data analysis comes of age , 2010, Briefings Bioinform..

[14]  Peter J. Hunter,et al.  An Overview of CellML 1.1, a Biological Model Description Language , 2003, Simul..

[15]  C. Burge,et al.  Most mammalian mRNAs are conserved targets of microRNAs. , 2008, Genome research.

[16]  Xiaoguang Liu,et al.  Genome-Wide Identification and Evolutionary Analysis of Arabidopsis Sm Genes Family , 2011, Journal of biomolecular structure & dynamics.

[17]  Z. Frenkel,et al.  Nucleosome Positioning Pattern Derived from Oligonucleotide Compositions of Genomic Sequences , 2011, Journal of biomolecular structure & dynamics.

[18]  B Rannala,et al.  Finding Genes Influencing Susceptibility to Complex Diseases in the Post-Genome Era , 2001, American journal of pharmacogenomics : genomics-related research in drug development and clinical practice.

[19]  Steven M. Johnson,et al.  Painting a Perspective on the Landscape of Nucleosome Positioning , 2010, Journal of biomolecular structure & dynamics.

[20]  John Quackenbush,et al.  Seeded Bayesian Networks: Constructing genetic networks from microarray data , 2008, BMC Systems Biology.

[21]  S. Batzoglou,et al.  Genome-Wide Analysis of Transcription Factor Binding Sites Based on ChIP-Seq Data , 2008, Nature Methods.

[22]  Peter M. Haverty,et al.  CisML: an XML-based format for sequence motif detection software , 2004, Bioinform..

[23]  Susumu Goto,et al.  The KEGG databases at GenomeNet , 2002, Nucleic Acids Res..

[24]  Ramanathan Sowdhamini,et al.  Phylogenetic Analysis and Selection Pressures of 5-HT Receptors in Human and Non-human Primates: Receptor of an Ancient Neurotransmitter , 2010, Journal of biomolecular structure & dynamics.

[25]  Patrick Lambrix,et al.  A review of standards for data exchange within systems biology , 2007, Proteomics.

[26]  Yong-Doo Park,et al.  High-Throughput Integrated Analyses for the Tyrosinase-Induced Melanogenesis: Microarray, Proteomics and Interactomics Studies , 2010, Journal of biomolecular structure & dynamics.

[27]  Hiroaki Kitano,et al.  The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models , 2003, Bioinform..

[28]  Leyla Isik,et al.  Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. , 2009, Cancer research.

[29]  D. Allison,et al.  Microarray data analysis: from disarray to consolidation and consensus , 2006, Nature Reviews Genetics.

[30]  Tapash Chandra Ghosh,et al.  Relationship between Gene Compactness and Base Composition in Rice and Human Genome , 2010, Journal of biomolecular structure & dynamics.

[31]  C. Mitra,et al.  Conserved Short Sequences in Promoter Regions of Human Genome , 2010, Journal of biomolecular structure & dynamics.