Materialization and Decomposition of Dataspaces for Efficient Search

Dataspaces consist of large-scale heterogeneous data. A practical dataspace system should provide a query interface for accessing tuples as a fundamental facility. An efficient index has previously been proposed for queries with keyword neighborhood over dataspaces. In this paper, we study the materialization and decomposition of dataspaces in order to improve query efficiency. First, we study views of items, which are materialized so that they can be reused by queries. When a set of views is materialized, the problem becomes selecting some of them as the optimal plan with the minimum query cost. We develop efficient algorithms for query planning and view generation. Second, we study partitions of tuples for answering top-k queries. Given a query, we can evaluate the score bounds of the tuples in each partition and prune those partitions whose bounds are lower than the scores of the top-k answers. We also provide a theoretical analysis of query cost and prove that query efficiency cannot always be improved by increasing the number of partitions. Finally, we conduct an extensive experimental evaluation to illustrate the superior performance of the proposed techniques.
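
The partition-pruning idea for top-k queries can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes that a tuple is a set of items, that a tuple's score is simply the number of query items it contains, and that each partition is summarized by the union of its items, so that the overlap between the query and this union is an upper bound on the score of any tuple in the partition. The function name `top_k_with_pruning` and the sample data are hypothetical.

```python
import heapq


def top_k_with_pruning(partitions, query, k):
    """Return (top-k answers as (score, tuple_id), number of partitions scanned).

    partitions: list of dicts mapping tuple_id -> set of items.
    Assumption: score(tuple) = number of query items the tuple contains.
    """
    query = set(query)

    # Upper bound per partition: no tuple can match more query items
    # than appear in the union of all items stored in that partition.
    bounded = []
    for part in partitions:
        summary = set().union(*part.values())
        bounded.append((len(query & summary), part))

    # Visit the most promising partitions first.
    bounded.sort(key=lambda entry: entry[0], reverse=True)

    heap = []  # min-heap holding the current top-k as (score, tuple_id)
    scanned = 0
    for bound, part in bounded:
        # Once k answers are known, a partition whose bound is lower than
        # the k-th best score cannot contribute; neither can later ones,
        # because bounds are visited in descending order.
        if len(heap) == k and bound < heap[0][0]:
            break
        scanned += 1
        for tuple_id, items in part.items():
            score = len(query & items)
            if len(heap) < k:
                heapq.heappush(heap, (score, tuple_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, tuple_id))

    return sorted(heap, reverse=True), scanned


# Hypothetical data: three partitions of two tuples each.
example_partitions = [
    {"t1": {"a", "b", "c"}, "t2": {"a", "b"}},
    {"t3": {"a", "d"}, "t4": {"d", "e"}},
    {"t5": {"x", "y"}, "t6": {"z"}},
]
answers, scanned = top_k_with_pruning(example_partitions, {"a", "b", "c"}, k=2)
print(answers, scanned)
```

In this toy run only the first partition is scanned, since the bounds of the other two fall below the score of the second-best answer. How tight such bounds are depends on how tuples are grouped, which is one reason the number of partitions alone does not determine query cost.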
