Efficient maintenance of common keys in archives of continuous query results from deep websites

In many real-world applications, it is important to create a local archive containing versions of structured results of continuous queries (queries that are evaluated periodically) submitted to autonomous database-driven Web sites (e.g., the deep Web). Such a history of digital information is a potential gold mine for all kinds of scientific, media and business analysts. An important task in this context is to maintain the set of common keys of the underlying archived results, as they play a pivotal role in data modeling and analysis, query processing, and entity tracking. A set of attributes in a structured data set is a common key iff it is a key for all versions of the data in the archive. Because keys are discovered from the archived data rather than declared, common keys, unlike traditional keys, are not temporally invariant: keys identified in one version may differ from those in another version. Hence, in this paper, we propose a novel technique to maintain common keys in an archive containing a sequence of versions of evolutionary continuous query results. Given the current common key set of the existing versions and a new snapshot, we propose an algorithm called COKE (COmmon KEy maintenancE) which incrementally maintains the common key set without undertaking an expensive computation of minimal keys from the new snapshot. Furthermore, it exploits certain interesting evolutionary features of real-world data to further reduce the computation cost. Our exhaustive empirical study demonstrates that COKE has excellent performance and is orders of magnitude faster than a baseline approach for maintenance of common keys.
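
To make the definition of a common key concrete, the following is a minimal Python sketch of a naive maintenance step. It is not the COKE algorithm, whose details are not given here. It relies only on the observation that any set of attributes that is a key for all earlier versions must contain one of the current minimal common keys, so candidates need to be checked against the new snapshot alone. The function names, the representation of a snapshot as a list of dictionaries, and the sample data are all assumptions made for illustration.

```python
from itertools import combinations

def is_key(rows, attrs):
    """Return True if no two rows agree on every attribute in attrs."""
    ordered = sorted(attrs)
    seen = set()
    for row in rows:
        value = tuple(row[a] for a in ordered)
        if value in seen:
            return False
        seen.add(value)
    return True

def maintain_common_keys(common_keys, snapshot, all_attrs):
    """Naive update of the minimal common keys for one new snapshot.

    common_keys: set of frozensets, the minimal common keys of the
    versions archived so far (hypothetical representation).
    Only the new snapshot is scanned; supersets of an old common key
    are already keys for every earlier version.
    """
    found = set()
    for key in common_keys:
        rest = sorted(set(all_attrs) - key)
        # Try the old key first, then ever larger extensions of it.
        for size in range(len(rest) + 1):
            for extra in combinations(rest, size):
                cand = key | frozenset(extra)
                if any(k <= cand for k in found):
                    continue  # contains a known key, so not minimal
                if is_key(snapshot, cand):
                    found.add(cand)
    # Keep only keys that are minimal with respect to set inclusion.
    return {k for k in found if not any(o < k for o in found)}

# Hypothetical example: "name" stops being a key in the new snapshot.
old_keys = {frozenset({"id"}), frozenset({"name"})}
new_snapshot = [
    {"id": 1, "name": "a", "city": "x"},
    {"id": 2, "name": "a", "city": "y"},
]
new_keys = maintain_common_keys(old_keys, new_snapshot, ["id", "name", "city"])
```

This sketch enumerates extensions exhaustively and is exponential in the number of attributes, so it is best read as an illustration of the problem statement rather than as an efficient method.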
