ThaleMine: A Warehouse for Arabidopsis Data Integration and Discovery

ThaleMine (https://apps.araport.org/thalemine/) is a comprehensive data warehouse that integrates a wide array of genomic information for the model plant Arabidopsis thaliana. The data collection currently includes the latest structural and functional annotation from the Araport11 update, the Col-0 genome sequence, RNA-seq and array expression, co-expression, protein interactions, homologs, pathways, publications, alleles, germplasm and phenotypes. The data are collected from a wide variety of public resources. Users can browse gene-specific data through Gene Report pages, identify and create gene lists based on experiments or indexed keywords, and run GO enrichment analysis to investigate the biological significance of selected gene sets. Developed by the Arabidopsis Information Portal project (Araport, https://www.araport.org/), ThaleMine uses the InterMine software framework, which organizes the integrated data into a well-structured form and provides powerful data query and analysis functionality. Users can access the warehoused data through graphical interfaces or programmatically via web services. Here we describe recent developments in ThaleMine, including new features and extensions, and discuss future improvements. InterMine has been broadly adopted by model organism research communities, including those for nematode, rat, mouse, zebrafish and budding yeast, as well as by the modENCODE project, and it is also used for human data. ThaleMine is the first InterMine developed for a plant model. As additional plant InterMines are developed by the legume and other plant research communities, the potential for cross-organism integrative data analysis will grow.
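As a rough illustration of the programmatic access mentioned above, the following Python sketch builds the URL for a single-gene query against an InterMine-style web service. It only constructs the URL and sends no request. The `/service` base path, the `query/results` endpoint, the `Gene.primaryIdentifier` and `Gene.symbol` paths, and the example identifier `AT1G01010` are assumptions based on common InterMine conventions, not details taken from this abstract; the helper name `build_gene_query_url` is hypothetical.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr

# Assumption: the web services live under "<mine root>/service", as is
# conventional for InterMine instances. Adjust if the deployment differs.
THALEMINE_SERVICE = "https://apps.araport.org/thalemine/service"


def build_gene_query_url(gene_id, service_root=THALEMINE_SERVICE):
    """Return a query-results URL for one gene; no request is sent."""
    # Path Query XML: pick the columns to return ("view") and constrain
    # the gene identifier. quoteattr() quotes and escapes the value.
    query_xml = (
        '<query model="genomic" view="Gene.primaryIdentifier Gene.symbol">'
        '<constraint path="Gene.primaryIdentifier" op="=" value=%s/>'
        "</query>" % quoteattr(gene_id)
    )
    # urlencode() percent-encodes the XML so it is safe in a query string.
    params = urlencode({"query": query_xml, "format": "json"})
    return f"{service_root}/query/results?{params}"


if __name__ == "__main__":
    # AT1G01010 is used here only as an example Arabidopsis locus identifier.
    print(build_gene_query_url("AT1G01010"))
```

The resulting URL could then be fetched with any HTTP client; whether the endpoint and field names match a given ThaleMine release should be checked against that instance's web service documentation.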
