Whole-genome sequences are becoming available in increasing numbers, but interpreting them fully requires more than a list of all genes. Genome databases face the challenges of integrating heterogeneous data and enabling data mining. Compared with a data warehousing approach, in which integration is achieved by replicating all relevant data in a unified schema, distributed approaches provide greater flexibility and maintainability. These properties are important in a field where new data are generated rapidly and our understanding of the data changes. Interoperability between distributed data sources allows data maintenance to be separated from integration and analysis. Simple ways of accessing the data can facilitate the development of new data mining tools and the transition from model genome analysis to comparative genomics. With the MIPS Arabidopsis thaliana genome database (MAtDB, http://mips.gsf.de/proj/thal/db), our aim is to go beyond a data repository towards an integrated knowledge resource. To this end, the Arabidopsis genome has served as a backbone against which heterogeneous data are structured and integrated. The challenges to be met are: continuous updating of data; the design of flexible data models that can evolve with new data; the integration of heterogeneous data, e.g. through the use of ontologies; comprehensive views and visualization of complex information; simple interfaces for application access, locally or via the Internet; and knowledge transfer across species.