DB-IR integration using tight-coupling in the Odysseus DBMS

As many recent applications require the integration of structured data and text data, unifying database (DB) and information retrieval (IR) technologies has become one of the major challenges in our field. There have been active discussions on the system architecture for DB-IR integration, but no clear agreement has been reached yet. In this direction, we have advocated the tight-coupling architecture and developed a novel IR index structure as well as tightly-coupled query processing algorithms. In tight-coupling, the text data type is supported by the storage system just like a built-in data type, so that the query processor can efficiently handle queries involving both structured data and text data. In this paper, for archival purposes, we consolidate our achievements reported in non-regular publications over the last ten years or so, extending them with greater detail on the IR index and the query processing algorithms. All the features in this paper are fully implemented in the Odysseus DBMS, which has been under development at KAIST for over 23 years. We show that Odysseus significantly outperforms two open-source DBMSs and one open-source search engine (with a few exceptions) in processing DB-IR integration queries. These results demonstrate the superiority of the tight-coupling architecture for DB-IR integration.
