Toward DB-IR Integration: Per-Document Basis Transactional Index Maintenance

Information retrieval (IR) and databases (DB) have developed independently, but information systems such as health care systems, bulletin boards, XML data management systems, and digital libraries increasingly require both data management and efficient text retrieval at the same time. DB-IR integration has recently emerged as a research topic. The divide between DB and IR has led to different approaches to index maintenance for newly arriving documents. DB systems have extended the SQL layer to handle text fields because they lack a native mechanism for building IR-like indexes, whereas IR systems usually treat a block of new documents as the logical unit of index maintenance because they have no concept of integrity constraints. For DB-IR integration, however, a transaction that adds or updates a document should include maintenance of the postings lists associated with that document; hence per-document basis transactional index maintenance. This paper evaluates the performance of three strategies for per-document transactional insertion: direct index update, stand-alone auxiliary index, and pulsing auxiliary index. Results from tests on KRISTAL-IRMS show that the pulsing auxiliary strategy, in which long postings lists in the auxiliary index are updated in place in the main index while short lists are updated directly in the auxiliary index, is a competitive candidate for text field indexing in DB-IR integration.
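The pulsing auxiliary strategy described above can be illustrated with a minimal in-memory sketch. This is not the KRISTAL-IRMS implementation: the class name `PulsingAuxIndex`, the `pulse_threshold` parameter, and the use of list length as the criterion for a "long" list are all assumptions made for illustration, and the sketch omits logging, locking, and durability, which a real transactional system would need.

```python
from collections import defaultdict


class PulsingAuxIndex:
    """Toy model of a main index plus a pulsing auxiliary index (illustrative only)."""

    def __init__(self, pulse_threshold=3):
        # Hypothetical parameter: an auxiliary list of this length counts as "long".
        self.pulse_threshold = pulse_threshold
        self.main = defaultdict(list)  # term -> postings in the main index
        self.aux = defaultdict(list)   # term -> postings in the auxiliary index

    def insert_document(self, doc_id, text):
        """Treat one document as one unit of index maintenance."""
        for term in sorted(set(text.lower().split())):
            # Short lists are updated directly in the auxiliary index.
            self.aux[term].append(doc_id)
            # A list that has grown long is moved ("pulsed") into the main index.
            if len(self.aux[term]) >= self.pulse_threshold:
                self.main[term].extend(self.aux.pop(term))

    def postings(self, term):
        """Answer a lookup by concatenating main and auxiliary postings."""
        return self.main.get(term, []) + self.aux.get(term, [])


idx = PulsingAuxIndex(pulse_threshold=2)
idx.insert_document(1, "db ir")
idx.insert_document(2, "db index")
idx.insert_document(3, "db")
print(idx.postings("db"))
```

In this sketch a query must consult both structures, which is the cost paid for keeping most per-document updates confined to the small auxiliary index.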
