Mobile phone name extraction from internet forums: a semi-supervised approach

Collecting user feedback on products from Internet forums is challenging because users often refer to a product by informal abbreviations or nicknames. In this paper, we propose a method named Gren to recognize and normalize mobile phone names in domain-specific Internet forums. Instead of recognizing phone names directly from sentences, as most named entity recognition approaches do, Gren first generates candidate names. The candidate names capture short forms, spelling variations, and nicknames of products, but are not noise-free. To predict whether a candidate name mentioned in a sentence indeed refers to a specific phone model, we develop a name recognizer based on Conditional Random Fields (CRF). The CRF model is trained on a large set of sentences obtained semi-automatically with minimal manual labeling effort. Finally, a rule-based name normalization component maps each recognized name to its formal form. Evaluated on more than 4000 manually labeled sentences containing about 1000 phone name mentions, Gren outperforms all baseline methods. Specifically, with the best feature setting it achieves a precision of 0.918 and a recall of 0.875. We also provide a detailed analysis of the intermediate results produced by each of the three components of Gren.
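The three-stage flow described in the abstract (candidate generation, mention recognition, normalization) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `CANDIDATES` lexicon, the `NON_PHONE_CONTEXT` word list, and the function names are hypothetical, and the recognition stage is a trivial context rule standing in for the CRF recognizer, not the model used in the paper.

```python
import re

# Hypothetical candidate lexicon: informal surface form -> formal name.
# In the paper the candidates are generated automatically; this is hand-written.
CANDIDATES = {
    "s3": "Samsung Galaxy S III",
    "galaxy s3": "Samsung Galaxy S III",
    "sgs3": "Samsung Galaxy S III",
    "ip5": "Apple iPhone 5",
}
MAX_LEN = max(len(key.split()) for key in CANDIDATES)

# Hypothetical context words that suggest a candidate is NOT a phone mention.
NON_PHONE_CONTEXT = {"bucket", "storage"}


def generate_candidates(tokens):
    """Stage 1: greedy longest-match lookup of candidate names in the tokens."""
    spans = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + n])
            if surface in CANDIDATES:
                spans.append((i, i + n, surface))
                i += n
                break
        else:
            i += 1  # no candidate starts at this token
    return spans


def is_phone_mention(tokens, start, end):
    """Stage 2 placeholder: a context rule standing in for the CRF recognizer."""
    next_token = tokens[end] if end < len(tokens) else ""
    return next_token not in NON_PHONE_CONTEXT


def extract(sentence):
    """Run all three stages; stage 3 maps each accepted name to its formal form."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [
        CANDIDATES[surface]
        for start, end, surface in generate_candidates(tokens)
        if is_phone_mention(tokens, start, end)
    ]


print(extract("My galaxy s3 battery dies fast"))
print(extract("I uploaded it to an s3 bucket"))
```

The second call shows why candidate matching alone is not enough: "s3" matches the lexicon but does not refer to a phone there, which is the kind of noise the recognizer is meant to filter out.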
