Toward Entity Retrieval over Structured and Text Data

Real-world applications increasingly involve both structured data and text. Hence, managing both in an efficient and integrated manner has received much attention from both the IR and database communities. To date, however, little research has been devoted to semantic issues in the integration of text and data. In this paper, we introduce a problem in this realm: entity retrieval. Given data fragments that describe various aspects of a real-world entity, find all other data fragments, as well as text documents, that describe that same entity. As such, entity retrieval is a novel retrieval problem that differs from both regular text retrieval and database search in that it explicitly requires matching information at the semantic level; matching syntactically, as done in current search engines and relational databases, would be inherently non-optimal. We define entity retrieval and conduct a case study of retrieving information about a researcher from both the Web and a bibliographic database (DBLP). We propose several methods for exploiting the structured information in the database to improve entity retrieval over the text collection. Specifically, we present a query expansion mechanism based on information extracted from structured data. Experimental results show that selectively using more structured information to expand the text query improves entity retrieval performance on text. We conclude the paper with future research directions for entity retrieval.
