Searching for Entities: When Retrieval Meets Extraction

Retrieving entities from within documents, rather than retrieving the documents or web pages themselves, has become an active topic in both commercial search systems and academic information retrieval research. Given an information need expressed as a description together with a target answer entity type, an entity search task returns a ranked list of answer entities drawn from unstructured text such as news articles or web pages. Although it operates in the same environment as document retrieval, entity retrieval requires finer-grained answers, namely entities, and therefore more syntactic and semantic analysis of germane documents than document retrieval does. This work proposes a two-layer probability model for the task that integrates germane document identification and answer entity extraction. Germane document identification retrieves the highly related documents that contain answer entities, while answer entity extraction finds the answer entities by using syntactic or linguistic information from those documents. The work shows theoretically how the probability model integrates these two components for the entity retrieval task. Moreover, this probabilistic approach helps reduce overall retrieval complexity while maintaining high accuracy in locating answer entities. A series of studies is conducted in this dissertation on both germane document identification and answer entity extraction. A learning-to-rank method is investigated for germane document identification. The method first trains a model on the training data set using query features, document features, similarity features, and rank features. The learned model then estimates, for each document in the test data sets, the probability that it is germane.
The experiments indicate that the learning-to-rank method performs significantly better than the baseline systems, which treat germane document identification as a conventional document retrieval problem. Answer entity extraction aims to correctly extract the answer entities from the germane documents. Two groups of methods are investigated: extraction without context (using named entity recognition tools or a knowledge base) and extraction with context (using tables and lists, or subject-verb-object structures, as context). Individually, however, each of these methods extracts only part of the answer entities. Treating answer entity extraction as a classification problem, with features drawn from the above extraction methods, performs significantly better than any individual extraction method.
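
As a rough illustration of how a two-layer probability model might combine the two components, the sketch below scores each candidate entity by summing, over germane documents, the product of a document-level probability and an entity-level probability. The factorization P(e|q) ≈ Σ_d P(d|q) · P(e|d,q), the function name `rank_entities`, and all the numbers are assumptions made for illustration; the abstract does not state the exact form of the model.

```python
from collections import defaultdict


def rank_entities(doc_probs, entity_probs):
    """Rank entities by an assumed two-layer score.

    doc_probs:    {doc_id: P(d | q)}, e.g. from a learning-to-rank model.
    entity_probs: {doc_id: {entity: P(e | d, q)}}, e.g. from an
                  extraction classifier run on each germane document.
    """
    scores = defaultdict(float)
    for doc_id, p_doc in doc_probs.items():
        # Layer 2: entities extracted from this document, weighted by
        # layer 1: how likely the document is to be germane.
        for entity, p_ent in entity_probs.get(doc_id, {}).items():
            scores[entity] += p_doc * p_ent
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical probabilities for two documents and three candidate entities.
doc_probs = {"d1": 0.7, "d2": 0.3}
entity_probs = {
    "d1": {"Acme Corp": 0.6, "Globex": 0.4},
    "d2": {"Globex": 0.9, "Initech": 0.1},
}

ranking = rank_entities(doc_probs, entity_probs)
for entity, score in ranking:
    print(f"{entity}: {score:.2f}")
```

In this toy example an entity supported by several germane documents can outrank one that is strongly supported by a single document, which is one reason to combine the two layers rather than run extraction on the top document alone.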
