Bridging Data Management and Knowledge Discovery in the Life Sciences

In this work we present an application for integrating and analyzing life science data, built on a biomedical data warehouse system and in-house tools that support knowledge discovery tasks. Knowledge discovery is a process in which several steps must be coupled to answer a specified question. To create such a combination of steps, a data miner uses our in-house knowledge discovery tool KD3 to assemble functional objects into a data mining workflow. The resulting workflows can easily be reused for further purposes simply by adding new data and parameterizing the functional objects in the process. Workflows guide the execution of data integration and aggregation tasks, which were defined and implemented using a publicly available open source tool. As a proof of concept, intelligent query models were designed and tested for the identification of genotype-phenotype correlations in Marfan syndrome. We show that, using our application, a data miner can easily develop new knowledge discovery algorithms that clinical researchers may later use to retrieve medically relevant information.
