Modelling retrieval models in a probabilistic relational algebra with a new operator: the relational Bayes

This paper presents a probabilistic relational modelling (implementation) of the major probabilistic retrieval models. Such a high-level implementation is useful because it supports the ranking of any object, allows reasoning across structured and unstructured data, and gives the software (knowledge) engineer control over ranking, thereby supporting customisation. The contributions of this paper include the specification of probabilistic SQL (PSQL) and probabilistic relational algebra (PRA); a new relational operator for probability estimation (the relational Bayes); the probabilistic relational modelling of retrieval models; a comparison of modelling retrieval in traditional SQL versus PSQL; and a comparison of the performance of probability estimation in traditional SQL versus PSQL. The main findings are that the PSQL/PRA paradigm allows the description of advanced retrieval models, is suitable for solving large-scale retrieval tasks, and outperforms traditional SQL both in abstraction and in the performance of probability estimation.
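The abstract does not define the relational Bayes, so the following is only a minimal sketch of what a probability-estimation operator over a relation might look like. It assumes a sum-based normalisation: each tuple's weight is divided by the total weight of all tuples that share the same evidence key. The function names (`bayes_sum`, `project_disjoint`), the `term_doc` relation, and its contents are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from collections import defaultdict

def bayes_sum(relation, evidence_cols):
    """Hypothetical estimation operator (an assumption, not the paper's definition).

    `relation` is a list of (weight, row) pairs. Each weight is divided by the
    summed weight of all rows that agree on the columns in `evidence_cols`.
    """
    totals = defaultdict(float)
    for weight, row in relation:
        key = tuple(row[i] for i in evidence_cols)
        totals[key] += weight
    return [
        (weight / totals[tuple(row[i] for i in evidence_cols)], row)
        for weight, row in relation
    ]

def project_disjoint(relation):
    """Merge identical rows by adding their weights (rows treated as disjoint events)."""
    merged = defaultdict(float)
    for weight, row in relation:
        merged[row] += weight
    return dict(merged)

# A non-probabilistic term-document relation: one row per term occurrence.
term_doc = [
    (1.0, ("sailing", "doc1")),
    (1.0, ("sailing", "doc1")),
    (1.0, ("boats", "doc1")),
    (1.0, ("sailing", "doc2")),
]

# Estimate P(term | doc): the evidence key is the document column (index 1).
p_term_given_doc = project_disjoint(bayes_sum(term_doc, evidence_cols=[1]))
```

Under these assumptions, `p_term_given_doc[("sailing", "doc1")]` is the within-document relative frequency of "sailing" in doc1 (two of three occurrences). In plain SQL, the same estimate typically needs an aggregate subquery joined back to the base relation, which is the kind of formulation the abstract contrasts with PSQL.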
