Using grouping strategy and pattern discovery for delta extraction in a limited collaborative environment

This work considers delta extraction in a distributed environment where collaboration from highly autonomous operational database management systems is limited to granting read-only access to a set of selected relational tables. Because a data warehouse system inherently holds a huge volume of data, it is critical to minimise communication costs. Based on the observation that two consecutive snapshots usually do not differ much, a statistics-based group hash method is developed to minimise the volume of data required to complete the data extraction. In addition, to relax the assumption that changes to remote data are caused only by random events, we define a progression pattern to describe data changes with temporal regularities and propose a method for progression pattern discovery.
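The general idea behind comparing snapshots by group hashes can be illustrated with a minimal sketch. It is not the method developed in this work: the statistical selection of group sizes is not modelled, and the names `bucket_of`, `group_digests`, `extract_delta` and the fixed parameter `num_groups` are hypothetical. The sketch assumes each snapshot is a mapping from primary key to row, and that in a real deployment only the per-group digests, followed by the rows of mismatching groups, would cross the network.

```python
import hashlib
from collections import defaultdict


def bucket_of(key, num_groups):
    """Assign a primary key to a group by hashing it (assumed grouping rule)."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_groups


def group_digests(table, num_groups):
    """Return {group id: digest} for a snapshot given as {key: row}."""
    groups = defaultdict(list)
    for key, row in table.items():
        groups[bucket_of(key, num_groups)].append((key, row))
    digests = {}
    for group_id, members in groups.items():
        h = hashlib.sha256()
        # Sort by key so the digest does not depend on iteration order.
        for key, row in sorted(members, key=lambda kr: repr(kr[0])):
            h.update(repr((key, row)).encode("utf-8"))
        digests[group_id] = h.hexdigest()
    return digests


def extract_delta(old, new, num_groups=16):
    """Compare two snapshots group by group and diff only mismatching groups."""
    old_d = group_digests(old, num_groups)
    new_d = group_digests(new, num_groups)
    changed = {g for g in set(old_d) | set(new_d) if old_d.get(g) != new_d.get(g)}

    inserted, deleted, updated = {}, {}, {}
    for key in set(old) | set(new):
        if bucket_of(key, num_groups) not in changed:
            continue  # rows in unchanged groups would never be transferred
        if key not in old:
            inserted[key] = new[key]
        elif key not in new:
            deleted[key] = old[key]
        elif old[key] != new[key]:
            updated[key] = new[key]
    return inserted, deleted, updated, changed


old_snapshot = {1: ("a",), 2: ("b",), 3: ("c",)}
new_snapshot = {1: ("a",), 2: ("B",), 4: ("d",)}
inserted, deleted, updated, changed = extract_delta(old_snapshot, new_snapshot)
```

When consecutive snapshots differ little, most group digests match and the corresponding rows need not be read from the remote source. The number of groups trades digest traffic against row traffic, which is the quantity a statistics-based choice of grouping would aim to balance.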
