iVA-File: Efficiently Indexing Sparse Wide Tables in Community Systems

In community web management systems (CWMS), storage structures inspired by universal tables are increasingly used to manage sparse datasets. Such a sparse wide table (SWT) typically has thousands of attributes, many of which are undefined in any given tuple, and low-dimensional structured similarity search over a combination of numerical and text attributes is a common operation. However, the properties of such wide tables and of their associated Web 2.0 services make most multi-dimensional indexing structures unsuitable. Recent studies in this area have focused mainly on improving storage efficiency and on deploying inverted indices efficiently; so far, no new index has been proposed for SWTs. The inverted index is fast to scan, but because it captures little information about the content of attribute values, it is not efficient at reducing random accesses to the data file. In this paper, we propose the iVA-file, which works on approximations of the content and keeps scanning efficiency within a bounded range. We introduce the nG-signature to represent data strings approximately, and we improve the existing approximate vectors for numerical values. We also propose an efficient query processing strategy for the iVA-file that differs from the strategies used for existing scan-based indices. Because the distance metric between a query and a tuple may vary from application to application, the iVA-file is designed to be metric-oblivious and to provide efficient filter-and-refine search based on any rational metric. Extensive experiments on real datasets show that the iVA-file significantly outperforms existing proposals in query efficiency while maintaining good update speed.
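The filter-and-refine idea can be illustrated with a minimal sketch. This is not the nG-signature or the iVA-file layout defined in the paper; it assumes a generic hashed bigram bit signature, edit distance as the metric, and hypothetical names (`signature`, `lower_bound`, `search`) and parameters (`N`, `SIG_BITS`) chosen only for illustration. A sequential scan over compact signatures yields a lower bound on the distance to each string, and the full string is fetched and compared only when that bound does not already rule it out.

```python
import math
import zlib

N = 2          # gram length (assumed for illustration)
SIG_BITS = 64  # signature width in bits (assumed for illustration)


def ngrams(s, n=N):
    """All contiguous substrings of length n."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def bit_of(gram, bits=SIG_BITS):
    """Deterministic bit position for a gram."""
    return zlib.crc32(gram.encode("utf-8")) % bits


def signature(s, n=N, bits=SIG_BITS):
    """Bit signature with one bit set per hashed n-gram of s."""
    sig = 0
    for g in ngrams(s, n):
        sig |= 1 << bit_of(g, bits)
    return sig


def lower_bound(query, sig, n=N, bits=SIG_BITS):
    """Lower bound on edit distance between query and the signed string.

    One edit operation can destroy at most n of the query's n-grams, so if
    `missing` query grams are absent from the string, at least
    ceil(missing / n) edits are needed. Hash collisions can only hide
    missing grams, which keeps the bound safe (never too large).
    """
    missing = sum(1 for g in ngrams(query, n)
                  if ((sig >> bit_of(g, bits)) & 1) == 0)
    return math.ceil(missing / n)


def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def search(query, rows, max_dist):
    """Return (row id, distance) for rows within max_dist of query."""
    sigs = [signature(r) for r in rows]  # stands in for the index file
    hits = []
    for rid, sig in enumerate(sigs):
        if lower_bound(query, sig) > max_dist:
            continue                         # filter: no data-file access
        d = edit_distance(query, rows[rid])  # refine: fetch the real value
        if d <= max_dist:
            hits.append((rid, d))
    return hits
```

Because the bound never exceeds the true distance, the filter step cannot discard a qualifying row; it only saves accesses to the data file for rows that are provably too far from the query.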
