Paper information - Named entity mining from click-through data using weakly supervised latent dirichlet allocation

This paper addresses Named Entity Mining (NEM), in which we mine knowledge about named entities such as movies, games, and books from a huge amount of data. NEM is potentially useful in many applications, including web search, online advertising, and recommender systems. The task poses three challenges: finding a suitable data source, coping with the ambiguities of named entity classes, and incorporating the necessary human supervision into the mining process. This paper proposes conducting NEM by using click-through data collected at a web search engine, employing a topic model that generates the click-through data, and learning the topic model with weak supervision from humans. Specifically, it characterizes each named entity by its associated queries and URLs in the click-through data. It uses the topic model to resolve ambiguities of named entity classes by representing the classes as topics. It employs a method, referred to as Weakly Supervised Latent Dirichlet Allocation (WS-LDA), to accurately learn the topic model from partially labeled named entities. Experiments on a large-scale click-through dataset containing over 1.5 billion query-URL pairs show that the proposed approach can conduct very accurate NEM and significantly outperforms the baseline.
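The step of characterizing each named entity by its associated queries and URLs can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the `(query, url, click_count)` record format, the helper name `build_entity_documents`, the `#` placeholder for the entity inside a query, and the example.com URLs are all hypothetical. The sketch only builds the per-entity "documents" that a topic model such as WS-LDA could then be trained on; it does not implement the topic model itself.

```python
from collections import Counter, defaultdict


def build_entity_documents(click_log, seed_entities):
    """Group click-through records into one pseudo-document per entity.

    click_log: iterable of (query, url, click_count) tuples (assumed format).
    seed_entities: lowercase entity strings to look for inside queries.
    Returns {entity: {"contexts": Counter, "urls": Counter}}.
    """
    docs = defaultdict(lambda: {"contexts": Counter(), "urls": Counter()})
    for query, url, clicks in click_log:
        # Normalize case and whitespace, then pad so we match whole tokens only.
        padded = " " + " ".join(query.lower().split()) + " "
        for entity in seed_entities:
            needle = f" {entity} "
            if needle not in padded:
                continue
            # Replace the entity with a placeholder to obtain the query context.
            context = padded.replace(needle, " # ", 1).strip()
            docs[entity]["contexts"][context] += clicks
            docs[entity]["urls"][url] += clicks
    return dict(docs)


# Hypothetical toy log; real click-through data would be far larger.
click_log = [
    ("harry potter trailer", "movies.example.com/hp", 3),
    ("harry potter book", "books.example.com/hp", 2),
    ("buy harry potter", "books.example.com/hp", 1),
    ("halo cheats", "games.example.com/halo", 4),
]
docs = build_entity_documents(click_log, ["harry potter", "halo"])
```

In this toy log, "harry potter" collects contexts and URLs that point to both a movie and a book, which is the kind of class ambiguity the abstract says the topic model is meant to resolve by treating classes as topics.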

Hang Li | Shuang-Hong Yang | Gu Xu
