Named entity mining from click-through data using weakly supervised latent Dirichlet allocation

This paper addresses Named Entity Mining (NEM), in which we mine knowledge about named entities such as movies, games, and books from a huge amount of data. NEM is potentially useful in many applications, including web search, online advertising, and recommender systems. The task poses three challenges: finding a suitable data source, coping with the ambiguity of named entity classes, and incorporating the necessary human supervision into the mining process. This paper proposes conducting NEM by using click-through data collected at a web search engine, employing a topic model that generates the click-through data, and learning the topic model with weak supervision from humans. Specifically, it characterizes each named entity by its associated queries and URLs in the click-through data. It uses the topic model to resolve the ambiguity of named entity classes by representing the classes as topics. It employs a method, referred to as Weakly Supervised Latent Dirichlet Allocation (WS-LDA), to accurately learn the topic model from partially labeled named entities. Experiments on large-scale click-through data containing over 1.5 billion query-URL pairs show that the proposed approach can conduct very accurate NEM and significantly outperforms the baseline.
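As a rough illustration of how a named entity might be characterized by its associated queries and URLs, the sketch below groups click-through records into one pseudo-document per entity. The record format `(query, url, clicks)`, the seed entity list, the substring matching, and the `#` placeholder for the entity mention are all assumptions made for this example; they are not details taken from the paper, and the WS-LDA learning step itself is not shown.

```python
from collections import Counter, defaultdict


def build_entity_documents(click_log, entities):
    """Group click-through records into one pseudo-document per entity.

    click_log: iterable of (query, url, clicks) tuples (hypothetical format).
    entities:  iterable of entity names (hypothetical seed list).
    """
    docs = defaultdict(lambda: {"contexts": Counter(), "urls": Counter()})
    for query, url, clicks in click_log:
        q = query.lower()
        for entity in entities:
            name = entity.lower()
            # Naive substring match; a real system would need proper matching.
            if name in q:
                # Replace the mention with "#" so only the query context remains.
                context = " ".join(q.replace(name, "#").split())
                docs[entity]["contexts"][context] += clicks
                docs[entity]["urls"][url] += clicks
    return dict(docs)


if __name__ == "__main__":
    log = [
        ("harry potter trailer", "imdb.com", 3),
        ("harry potter book", "amazon.com", 2),
        ("buy harry potter book", "amazon.com", 1),
        ("weather", "weather.com", 5),
    ]
    for entity, doc in build_entity_documents(log, ["Harry Potter"]).items():
        print(entity, dict(doc["contexts"]), dict(doc["urls"]))
```

Each pseudo-document could then serve as input to a topic model in which topics stand for entity classes, with the partially labeled entities supplying the weak supervision.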
