Integrating DB and IR Technologies: What is the Sound of One Hand Clapping?

Databases (DB) and information retrieval (IR) have evolved as separate fields. However, modern applications such as customer support, health care, and digital libraries require capabilities for both data and text management. In such settings, traditional DB queries, in SQL or XQuery, are not flexible enough to handle application-specific scoring and ranking. IR systems, on the other hand, lack efficient support for handling structured parts of the data and metadata, and do not give the application developer adequate control over the ranking function. This paper analyzes the requirements of advanced text- and data-rich applications for an integrated platform. The core functionality must be manageable, and the API should be easy to program against. A particularly important issue that we highlight is how to reconcile flexibility in scoring and ranking models with optimizability, in order to accommodate a wide variety of target applications efficiently. We discuss whether such a system needs to be designed from scratch, or can be incrementally built on top of existing architectures. The results of our analyses are cast into a series of challenges to the DB and IR communities.
