Biological Data Management in a Dataspace Framework

Biological data reside in specialized databases that represent different data interpretation stages or different facets of biological phenomena. For example, microbial genomes are sequenced by organizations worldwide, undergo an annotation process (gene prediction, functional characterization) that is often specific to each sequencing center, and end up in one of the primary public archival sequence data repositories, such as GenBank. Genome sequence data include information on gene coordinates, locus identifiers, gene names, and protein functions. Analyzing microbial genomes, however, requires additional functional annotations, such as motifs, domains, and pathways, which are provided by diverse, usually heterogeneous, auxiliary annotation sources, such as Pfam, InterPro, COG, and KEGG. Secondary public resources such as EBI’s Genome Reviews and NCBI’s RefSeq integrate such additional functional annotations with the sequences from the primary sequence data sources, sometimes together with a review and curation of the associated annotations. Such secondary resources share common goals, but contain different collections of genomes, or data with different degrees of resolution for the same genomes. These differences result from the diverse annotation methods, curation techniques, and functional characterizations employed across microbial genome data sources. Tertiary resources such as the Integrated Microbial Genomes (IMG) system [4] aim to provide high levels of data diversity in terms of the number of genomes integrated in the system from public sources, data coherence in terms of the quality of the gene annotations, and data completeness in terms of the breadth of the functional annotations.
Such a data context is critical for the multi-genome comparative analysis used in the functional characterization of microbial genomes. The increasing number of biological databases, the emergence of new types of data that need to be captured, and evolving technologies, methods, and biological knowledge add to the complexity of the data management required to support biological data analysis. A typical biological data management system involves accessing or gathering data from multiple sources, followed by data correlation, classification, review, and curation using domain-specific tools (e.g., functional clusters, ontologies) and expertise. In practice, biological data management is less daunting when it is approached as an iterative strategy based on gradual data integration, with domain-specific knowledge accumulated throughout the integration process. The recently proposed dataspace abstraction [1] provides the framework for such a strategy, which has proved effective in devising systems such as IMG. For example, IMG’s dataspace includes several primary and secondary microbial genome data sources
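
The gradual integration strategy described above can be illustrated with a minimal Python sketch. It is not IMG's implementation: the `Gene` and `Dataspace` classes, the source labels ("domains", "pathways"), and all identifiers are hypothetical placeholders chosen for illustration. The sketch assumes that gene records from a primary source are keyed by locus identifier, that auxiliary annotations are layered on one source at a time, and that rows which cannot be matched are set aside for later review rather than discarded.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Gene:
    """A gene record as it might arrive from a primary sequence source."""
    locus_id: str
    name: str
    start: int
    end: int
    product: str
    # Annotation source name -> set of terms; filled in gradually.
    annotations: dict[str, set[str]] = field(default_factory=dict)


class Dataspace:
    """Toy dataspace: genes are loaded first, annotations are added later."""

    def __init__(self) -> None:
        self.genes: dict[str, Gene] = {}
        # (source, locus_id, term) rows with no matching gene, kept for curation.
        self.unmatched: list[tuple[str, str, str]] = []

    def add_genes(self, genes: list[Gene]) -> None:
        for gene in genes:
            self.genes[gene.locus_id] = gene

    def integrate(self, source: str, rows: list[tuple[str, str]]) -> int:
        """Attach (locus_id, term) rows from one source; return the match count."""
        matched = 0
        for locus_id, term in rows:
            gene = self.genes.get(locus_id)
            if gene is None:
                self.unmatched.append((source, locus_id, term))
                continue
            gene.annotations.setdefault(source, set()).add(term)
            matched += 1
        return matched

    def coverage(self, source: str) -> float:
        """Fraction of genes with at least one annotation from `source`."""
        if not self.genes:
            return 0.0
        hits = sum(1 for g in self.genes.values() if g.annotations.get(source))
        return hits / len(self.genes)


# Placeholder records only; none of these identifiers refer to real entries.
space = Dataspace()
space.add_genes([
    Gene("LOCUS_0001", "geneA", 100, 1300, "hypothetical protein"),
    Gene("LOCUS_0002", "geneB", 1400, 2600, "hypothetical protein"),
])
space.integrate("domains", [("LOCUS_0001", "DOMAIN_X"), ("LOCUS_9999", "DOMAIN_Y")])
space.integrate("pathways", [("LOCUS_0001", "PATHWAY_P"), ("LOCUS_0002", "PATHWAY_P")])
```

Calling `space.coverage("domains")` after each integration step gives a rough measure of data completeness for that source, and `space.unmatched` collects the rows a curator would need to review. Keeping unmatched rows instead of rejecting them reflects the iterative nature of the approach: a later load of additional genomes may resolve them.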