Paper information - The federated database – a basis for biobank-based post-genome studies, integrating phenome and genome data from 600 000 twin pairs in Europe

Integrating and managing complex data are major challenges in large-scale, biobank-based research projects of the post-genome era. GenomEUtwin, an international collaboration between eight twin registries, is one such project: it combines extensive genotype and phenotype data from sources located in different countries. The challenge lies not only in harmonizing the data and continually updating clinical details at the various sites, but also in the heterogeneity of data storage and the confidentiality of sensitive health-related and genetic data. A solid infrastructure is needed that provides secure yet easily accessible and standardized data exchange and that also facilitates statistical analysis of the stored data. Data collection sites want to retain full control over the accumulation of their data, while the integration should still make it easy to slice and dice the data for different types of data pooling and study designs. Here we describe how we constructed a federated database infrastructure for genotype and phenotype information collected in seven European countries and Australia, and how we connected the databases via a network called TwinNET to guarantee effortless data exchange and pooled analyses. This federated database system offers a powerful facility for combining different types of information from multiple data sources. The system is transparent to end users and application developers because it makes the set of federated data sources look like a single system. Users need not be aware of the format of the data or the site where they are stored, the language or programming interface of the data source, how the data are physically stored, whether they are partitioned and/or replicated, or which networking protocols are used. They see a single standardized interface that exposes the desired data elements for pooled analyses.
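The idea of a single standardized interface over heterogeneous sources can be illustrated with a minimal Python sketch. It is not the TwinNET implementation, which the abstract does not specify at this level. Two in-memory SQLite databases stand in for remote registry databases, and every name here (`site_a`, `site_b`, `WRAPPERS`, `pooled_heights`, the table and column names, the sample rows) is a hypothetical chosen for illustration. Each site keeps its own schema and units, and a per-site wrapper query maps local data onto the standardized columns.

```python
import sqlite3


def make_site(ddl, insert_sql, rows):
    """Create an in-memory database standing in for one registry's local store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Hypothetical local schemas: each site uses its own table, column names, and units.
site_a = make_site(
    "CREATE TABLE twins (pair_id TEXT, height_cm REAL)",
    "INSERT INTO twins VALUES (?, ?)",
    [("A-001", 172.0), ("A-002", 165.5)],
)
site_b = make_site(
    "CREATE TABLE subjects (pair TEXT, height_m REAL)",
    "INSERT INTO subjects VALUES (?, ?)",
    [("B-001", 1.80)],
)

# One wrapper per source: a query that maps the local schema onto the
# standardized columns (pair_id, height_cm). The sites stay in control of
# their own data; only the mapping lives in the federation layer.
WRAPPERS = {
    "site_a": (site_a, "SELECT pair_id, height_cm FROM twins"),
    "site_b": (site_b, "SELECT pair, height_m * 100.0 FROM subjects"),
}


def pooled_heights():
    """Return (site, pair_id, height_cm) rows pooled from every federated source."""
    pooled = []
    for site, (conn, query) in WRAPPERS.items():
        for pair_id, height_cm in conn.execute(query):
            pooled.append((site, pair_id, round(height_cm, 1)))
    return pooled


if __name__ == "__main__":
    for row in pooled_heights():
        print(row)
```

A caller of `pooled_heights()` never sees which table, column name, or unit a site uses, which is the transparency property described above. A real federation would additionally need secure transport, access control, and query pushdown, none of which this sketch attempts.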
