Semistructured Data Search

This paper presents a selection of methods for searching in heterogeneous data collections where some amount of structure is available. We start with a general retrieval framework, based on generative probabilistic modeling, for ranking unstructured document representations. Then, we consider structure at two different levels: documents and queries. For documents, the internal structure is captured through the use of multiple document fields, and various approaches to setting field weights are discussed. For queries, the focus is on effectively utilizing additional input data that the user might provide along with the keyword query, such as target categories or example documents. We place a particular emphasis on methods that are robust with respect to the availability of structured data and are able to deal with inconsistent or incomplete information.
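
To make the idea of field weights concrete, the following is a minimal sketch, not the paper's own formulation. It assumes that each field is modeled as a smoothed unigram language model and that a document is scored by the log-likelihood of the query under a weighted mixture of those field models. The function names (`field_lm`, `score`), the Jelinek-Mercer smoothing parameter `lam`, the fallback probability `1e-6`, and the renormalization of weights over the fields actually present are all illustrative assumptions.

```python
import math
from collections import Counter

UNSEEN = 1e-6  # assumed fallback probability for terms missing from the collection model


def field_lm(tokens, collection_lm, lam=0.1):
    """Return a Jelinek-Mercer smoothed unigram model for one field (assumed smoothing)."""
    counts = Counter(tokens)
    total = len(tokens)

    def prob(term):
        p_ml = counts[term] / total if total else 0.0
        return (1 - lam) * p_ml + lam * collection_lm.get(term, UNSEEN)

    return prob


def score(query, doc_fields, field_weights, collection_lm):
    """Log query likelihood under a weighted mixture of field language models.

    Fields absent from the document (or empty) are skipped and the remaining
    weights are renormalized, one simple way to cope with incomplete structure.
    """
    present = [f for f, toks in doc_fields.items() if toks and f in field_weights]
    weight_sum = sum(field_weights[f] for f in present)
    models = {f: field_lm(doc_fields[f], collection_lm) for f in present}

    log_p = 0.0
    for term in query:
        if weight_sum > 0:
            p = sum(field_weights[f] / weight_sum * models[f](term) for f in present)
        else:
            # No usable field: fall back to the collection model alone.
            p = collection_lm.get(term, UNSEEN)
        log_p += math.log(p)
    return log_p


weights = {"title": 0.6, "body": 0.4}
doc_a = {"title": ["semistructured", "search"], "body": ["ranking", "with", "fields"]}
doc_b = {"body": ["cooking", "recipes"]}  # no title field available
query = ["search"]

score_a = score(query, doc_a, weights, collection_lm={})
score_b = score(query, doc_b, weights, collection_lm={})
```

In this toy setup `doc_a` scores higher than `doc_b` because the query term occurs in its title, while `doc_b` is still scored from its body alone even though it has no title field.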
