A database approach to information retrieval: The remarkable relationship between language models and region models

In this report, we unify two distinct approaches to information retrieval: region models and language models. Region models were developed for structured document retrieval. They offer well-defined behaviour and a simple query language that lets application developers build applications rapidly. Language models are particularly useful for reasoning about the ranking of search results and for developing new ranking approaches. The unified model allows application developers to define complex language modeling approaches as logical queries on a textual database. We show a remarkable one-to-one relationship between region queries and the language models they represent for a wide variety of applications: simple ad-hoc search, cross-language retrieval, video retrieval, and web search.
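
As a rough illustration of what it can mean to express a language model as a query over regions, the sketch below treats documents and term occurrences as regions (start and end token positions), counts term regions contained in a document region, and plugs those counts into a query-likelihood score with linear interpolation smoothing. This is not the report's own formalism or query language: the names `Region`, `contained_in`, `term_regions`, and `score`, the toy text, and the smoothing weight `lam` are all assumptions chosen for the example.

```python
from collections import Counter
from typing import List, NamedTuple


class Region(NamedTuple):
    """A contiguous span of token positions (hypothetical representation)."""
    name: str
    start: int  # inclusive
    end: int    # inclusive


def contained_in(inner: List[Region], outer: List[Region]) -> List[Region]:
    """Return the regions of `inner` that lie fully inside some region of `outer`."""
    return [r for r in inner
            if any(o.start <= r.start and r.end <= o.end for o in outer)]


def term_regions(tokens: List[str], term: str) -> List[Region]:
    """Every occurrence of `term` as a one-token region."""
    return [Region(term, i, i) for i, t in enumerate(tokens) if t == term]


def score(tokens: List[str], doc: Region, query: List[str], lam: float = 0.5) -> float:
    """Query likelihood with linear interpolation smoothing (assumed model):
    P(q|d) = prod_t [ lam * tf(t,d)/|d| + (1 - lam) * cf(t)/|C| ]."""
    collection = Counter(tokens)
    doc_len = doc.end - doc.start + 1
    p = 1.0
    for term in query:
        # Term frequency is the number of term regions contained in the document region.
        tf = len(contained_in(term_regions(tokens, term), [doc]))
        p *= lam * tf / doc_len + (1 - lam) * collection[term] / len(tokens)
    return p


# Toy collection of two four-token documents, stored as one token stream.
tokens = "language models rank results region models query text".split()
d1 = Region("doc", 0, 3)
d2 = Region("doc", 4, 7)
query = ["language", "models"]
ranking = sorted([d1, d2], key=lambda d: score(tokens, d, query), reverse=True)
```

Here the ranking places `d1` first, because only `d1` contains the term "language"; `d2` still receives a non-zero score through the collection-wide smoothing term.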
