Score region algebra: a flexible framework for structured information retrieval

Approximately three decades ago, researchers realized that they would have to structure data in order to store and access the large volumes of data produced each day. As a result, database management systems were designed and developed to keep data in one place and to find relevant information in it. At the same time, a large number of textual documents were still stored and accessed in unstructured form. Retrieval of such textual documents containing information relevant to a user query has been an open research question, studied in the information retrieval area for half a century. Information retrieval studies have resulted in numerous retrieval models and retrieval systems whose goal is to rank documents by their estimated relevance to a user query. Despite having similar goals, the research areas of databases and information retrieval developed mostly independently of each other. Recently, a new 'wave of documents' is 'threatening' to bring these two areas closer to each other.
