Personal name classification in web queries

Personal names are an important class of Web search queries, and they are special in many ways. Strategies for retrieving information about personal names should therefore differ from the strategies used for other types of queries. To improve search quality for personal names, a first step is to detect whether a query is a personal name. Despite the importance of this problem, relatively little research has been done on it. Because Web queries are usually short, conventional supervised machine-learning algorithms cannot be applied directly. An alternative is to apply heuristic rules coupled with name-term dictionaries. However, when the dictionaries are small, this method tends to produce false negatives; when the dictionaries are large, it tends to produce false positives. A more serious problem is that this method offers no good way to trade off precision against recall. To solve these problems, we propose an approach based on probabilistic name-term dictionaries and personal name grammars, and use it to predict the probability that a query is a personal name. In this paper, we develop four different methods for building probabilistic name-term dictionaries, in which each term is assigned a probability of being a name term. We compared our approach with baseline algorithms, including dictionary-based look-up methods and supervised classification algorithms (logistic regression and SVM), on manually labeled test sets. The results validate the effectiveness of our approach: its F1 value is more than 79.8%, which outperforms the best baseline by more than 11.3%.
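
To make the idea concrete, the following is a minimal sketch of how a probabilistic name-term dictionary and a small name grammar could be combined to score a query. It is not the implementation evaluated in the paper: the dictionary entries, their probability values, the `GRAMMARS` patterns, the `DEFAULT_PROB` fallback, and the product-of-probabilities scoring rule are all assumptions chosen for illustration.

```python
# Hypothetical probabilistic name-term dictionaries: each term maps to an
# assumed probability of being a first name or a last name. The values are
# invented for illustration and are not taken from the paper.
FIRST_NAME_PROB = {"john": 0.95, "mary": 0.93, "paris": 0.30, "apple": 0.02}
LAST_NAME_PROB = {"smith": 0.90, "hilton": 0.60, "pie": 0.01}

# Assumed fallback probability for terms missing from a dictionary.
DEFAULT_PROB = 0.001

# Hypothetical name grammars: each pattern is a sequence of term roles.
GRAMMARS = [("first", "last"), ("first",), ("last",)]


def term_prob(term: str, role: str) -> float:
    """Look up the assumed probability that `term` fills the given role."""
    table = FIRST_NAME_PROB if role == "first" else LAST_NAME_PROB
    return table.get(term, DEFAULT_PROB)


def name_probability(query: str) -> float:
    """Score a query as the best product of term probabilities over all
    grammar patterns whose length matches the number of query terms."""
    terms = query.lower().split()
    best = 0.0
    for pattern in GRAMMARS:
        if len(pattern) != len(terms):
            continue
        score = 1.0
        for term, role in zip(terms, pattern):
            score *= term_prob(term, role)
        best = max(best, score)
    return best


def is_personal_name(query: str, threshold: float = 0.5) -> bool:
    """Classify a query by thresholding its score."""
    return name_probability(query) >= threshold
```

Because the output is a score rather than a yes/no dictionary hit, the threshold can be moved to favor precision or recall. In this sketch, an ambiguous query such as "paris hilton" is rejected at the default threshold of 0.5 but accepted at a threshold of 0.1.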
