Flexible and efficient IR using array databases

The Matrix Framework is a recent proposal by Information Retrieval (IR) researchers to flexibly represent information retrieval models and concepts in a single multi-dimensional array framework. We provide computational support for exactly this framework with the array database system SRAM (Sparse Relational Array Mapping), which works on top of a DBMS. Information retrieval models can be specified in its comprehension-based array query language in a way that corresponds directly to the underlying mathematical formulas. SRAM stores sparse arrays efficiently in (compressed) relational tables, and it translates array queries into relational queries and optimizes them. In this work, we describe a number of array query optimization rules. To demonstrate their effect on text retrieval, we apply them in the TREC TeraByte track (TREC-TB) efficiency task, using the Okapi BM25 model as our example. These optimization rules turn out to enable SRAM to automatically translate the BM25 array queries into the relational equivalent of inverted list processing, including compression, score materialization and quantization, as employed by custom-built IR systems. The use of the high-performance MonetDB/X100 relational backend, which provides transparent database compression, allows the system to achieve very fast response times with good precision and low resource usage.
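To make the idea of storing a sparse array relationally more concrete, the following is a minimal Python sketch, not SRAM's actual query language or storage layout. It assumes a term-frequency array is kept as (doc, term, tf) tuples with zero cells absent, and scores documents with a common textbook variant of BM25. The function name `bm25_scores`, the toy data, the parameter defaults `k1=1.2` and `b=0.75`, and the smoothed IDF form are all illustrative assumptions, not taken from the paper.

```python
import math
from collections import defaultdict

# Hypothetical toy data: a sparse array TF[doc, term] stored as relational
# tuples (doc, term, tf). Cells with tf == 0 are simply not stored.
TF = [
    (0, "array", 2), (0, "database", 1),
    (1, "database", 3),
    (2, "array", 1), (2, "retrieval", 2),
]


def bm25_scores(tf_rows, query_terms, k1=1.2, b=0.75):
    """Score documents with a textbook BM25 variant over (doc, term, tf) rows.

    Assumption: IDF uses the smoothed form log((N - df + 0.5)/(df + 0.5) + 1);
    the paper's exact formula may differ.
    """
    doc_len = defaultdict(int)   # aggregate: sum of tf grouped by doc
    df = defaultdict(int)        # aggregate: number of rows grouped by term
    for doc, term, tf in tf_rows:
        doc_len[doc] += tf
        df[term] += 1

    n_docs = len(doc_len)
    avg_len = sum(doc_len.values()) / n_docs

    wanted = set(query_terms)
    scores = defaultdict(float)
    # Selection on query terms followed by a grouped sum per document,
    # which mirrors scanning the inverted lists of the query terms.
    for doc, term, tf in tf_rows:
        if term not in wanted:
            continue
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
        norm = tf + k1 * (1.0 - b + b * doc_len[doc] / avg_len)
        scores[doc] += idf * tf * (k1 + 1.0) / norm
    return dict(scores)


if __name__ == "__main__":
    ranked = sorted(bm25_scores(TF, ["array"]).items(),
                    key=lambda kv: kv[1], reverse=True)
    for doc, score in ranked:
        print(doc, round(score, 4))
```

Because only non-zero cells are stored, the selection on query terms touches just the rows that an inverted list for those terms would contain, which is the correspondence the abstract alludes to.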
