Sync your data: update propagation for heterogeneous protein databases

The traditional model of bench (wet) chemistry in many life sciences domains is today actively complemented by computer-based discovery that draws on the growing number of online data sources. A typical computer-based discovery scenario for many life scientists includes creating local caches of pertinent information from multiple online resources, such as Swissprot [Nucleic Acids Res. 28(1), 45–48 (2000)], PIR [Nucleic Acids Res. 28(1), 41–44 (2000)], and PDB [The Protein DataBank. Wiley, New York (2003)], to enable efficient data analysis. This local caching, however, exposes their research and eventual results to the problem of data staleness: cached data may quickly become obsolete or incorrect, depending on the updates made to the source data. This is a significant challenge for the scientific community, because it forces scientists to stay continuously aware of the frequent changes made to public data sources and, more importantly, of the potential effects of those changes on their own derived data sets during the course of their research. To address this challenge, we present an approach for handling update propagation between heterogeneous databases that guarantees data freshness for scientists irrespective of their choice of data source and its underlying data model or interface. We propose a solution based on a middle layer in which, first, the change in the online data source is translated into a sequence of changes in the middle layer; next, each middle-layer change is propagated through an algebraic representation of the translation between the source and the target; and finally, the net change is translated into a set of changes that are applied to the local cache.
We describe the algebraic model that represents the mapping from the online resource to the local cache, as well as our adaptive propagation algorithm, which incrementally propagates both schema and data changes from the source to the cache in a manner independent of the data model. We illustrate the approach with a case study based on an ongoing joint project with our collaborators in the Chemistry Department at UMass-Lowell.
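
The three-stage flow described above (source change to middle-layer changes, propagation through the mapping, net change applied to the cache) can be illustrated with a minimal Python sketch. It is not the authors' algorithm or algebra: the `Change` record, the `rename_field` operator, the dictionary-based cache, and the shape of the source change are all hypothetical names chosen for illustration, and the sketch covers data changes only, not schema changes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Change:
    """Hypothetical middle-layer change: an insert or a delete of one record."""
    op: str                       # "insert" or "delete"
    key: str                      # record identifier, e.g. a protein accession
    value: Optional[dict] = None  # record contents for inserts


# An operator of the (assumed) mapping: rewrites a change, or drops it by returning None.
Operator = Callable[[Change], Optional[Change]]


def to_middle_layer(source_change: dict) -> List[Change]:
    """Stage 1: translate one source-level change into middle-layer changes."""
    kind, key = source_change["kind"], source_change["id"]
    if kind == "update":  # modelled here as delete followed by insert
        return [Change("delete", key), Change("insert", key, source_change["record"])]
    if kind == "insert":
        return [Change("insert", key, source_change["record"])]
    return [Change("delete", key)]


def rename_field(old: str, new: str) -> Operator:
    """Example mapping operator: the cache stores field `old` under the name `new`."""
    def apply(change: Change) -> Optional[Change]:
        if change.value is None or old not in change.value:
            return change
        value = {(new if k == old else k): v for k, v in change.value.items()}
        return Change(change.op, change.key, value)
    return apply


def propagate(change: Change, mapping: List[Operator]) -> Optional[Change]:
    """Stage 2: push one change through every operator of the mapping, in order."""
    current: Optional[Change] = change
    for operator in mapping:
        current = operator(current)
        if current is None:  # an operator decided the change is irrelevant
            return None
    return current


def net_change(changes: List[Change]) -> List[Change]:
    """Collapse a change sequence to the last change seen for each key."""
    latest: Dict[str, Change] = {}
    for change in changes:
        latest[change.key] = change
    return list(latest.values())


def sync(cache: Dict[str, dict], source_change: dict, mapping: List[Operator]) -> None:
    """Stage 3: apply the net propagated change to the local cache."""
    propagated = []
    for change in to_middle_layer(source_change):
        result = propagate(change, mapping)
        if result is not None:
            propagated.append(result)
    for change in net_change(propagated):
        if change.op == "insert":
            cache[change.key] = change.value
        else:
            cache.pop(change.key, None)


# Usage: the source updates a record whose sequence field is named "seq".
cache = {"P12345": {"sequence": "MKA"}}
mapping = [rename_field("seq", "sequence")]
sync(cache, {"kind": "update", "id": "P12345", "record": {"seq": "MKT"}}, mapping)
# cache is now {"P12345": {"sequence": "MKT"}}
```

Keeping the mapping as an ordered list of small operators means only the changed record is reprocessed, which is the incremental behaviour the abstract describes; a full recomputation of the cache is never needed in this simplified setting.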
