Named entity recognition for tweets

Two main challenges of Named Entity Recognition (NER) for tweets are the insufficient information in a tweet and the lack of training data. We propose a novel method consisting of three core elements: (1) normalization of tweets; (2) combination of a K-Nearest Neighbors (KNN) classifier with a linear Conditional Random Fields (CRF) model; and (3) semisupervised learning framework. The tweet normalization preprocessing corrects common ill-formed words using a global linear model. The KNN-based classifier conducts prelabeling to collect global coarse evidence across tweets while the CRF model conducts sequential labeling to capture fine-grained information encoded in a tweet. The semisupervised learning plus the gazetteers alleviate the lack of training data. Extensive experiments show the advantages of our method over the baselines as well as the effectiveness of normalization, KNN, and semisupervised learning.

[1]  Wei Li,et al.  Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons , 2003, CoNLL.

[2]  George R. Krupka,et al.  IsoQuest Inc.: Description of the NetOwl™ Extractor System as Used for MUC-7 , 1998, MUC.

[3]  Mark Dredze,et al.  Annotating Named Entities in Twitter Data with Crowdsourcing , 2010, Mturk@HLT-NAACL.

[4]  Eric Brill,et al.  A Simple Rule-Based Part of Speech Tagger , 1992, HLT.

[5]  Satoshi Sekine,et al.  A survey of named entity recognition and classification , 2007 .

[6]  Dan Klein,et al.  Accurate Unlexicalized Parsing , 2003, ACL.

[7]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[8]  Doug Downey,et al.  Unsupervised named-entity extraction from the Web: An experimental study , 2005, Artif. Intell..

[9]  Christopher D. Manning,et al.  An Effective Two-Stage Model for Exploiting Non-Local Dependencies in Named Entity Recognition , 2006, ACL.

[10]  Christopher D. Manning,et al.  Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling , 2005, ACL.

[11]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .

[12]  Eric Brill,et al.  An Improved Error Model for Noisy Channel Spelling Correction , 2000, ACL.

[13]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[14]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition , 2003, CoNLL.

[15]  Dan Klein,et al.  Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network , 2003, NAACL.

[16]  Jun Suzuki,et al.  Semi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data , 2008, ACL.

[17]  Jian Su,et al.  Named Entity Recognition using an HMM-based Chunk Tagger , 2002, ACL.

[18]  Dan Roth,et al.  Design Challenges and Misconceptions in Named Entity Recognition , 2009, CoNLL.

[19]  Michael Collins,et al.  Ranking Algorithms for Named Entity Extraction: Boosting and the VotedPerceptron , 2002, ACL.

[20]  Hai Yang,et al.  ACM Transactions on Intelligent Systems and Technology - Special Section on Urban Computing , 2014 .

[21]  Scott Miller,et al.  Name Tagging with Word Clusters and Discriminative Training , 2004, NAACL.

[22]  Tong Zhang,et al.  A Robust Risk Minimization based Named Entity Recognition System , 2003, CoNLL.

[23]  Sameer Singh,et al.  Minimally-Supervised Extraction of Entities from Text Advertisements , 2010, NAACL.

[24]  Jun'ichi Tsujii,et al.  Reranking for Biomedical Named-Entity Recognition , 2007, BioNLP@ACL.

[25]  William W. Cohen,et al.  Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text , 2005, HLT.

[26]  Yefeng Wang,et al.  Annotating and Recognising Named Entities in Clinical Notes , 2009, ACL.

[27]  Martin Jansche,et al.  Information Extraction from Voicemail Transcripts , 2002, EMNLP.

[28]  Frederick Reiss,et al.  Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks , 2010, EMNLP.

[29]  Nan Ye,et al.  Domain adaptive bootstrapping for named entity recognition , 2009, EMNLP.

[30]  Michael Collins,et al.  Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms , 2002, EMNLP.

[31]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[32]  Timothy Baldwin,et al.  Lexical Normalisation of Short Text Messages: Makn Sens a #twitter , 2011, ACL.

[33]  ChengXiang Zhai,et al.  Instance Weighting for Domain Adaptation in NLP , 2007, ACL.

[34]  Doug Downey,et al.  Locating Complex Named Entities in Web Text , 2007, IJCAI.

[35]  Suzanne Stevenson,et al.  An Unsupervised Model for Text Message Normalization , 2009 .

[36]  Christopher D. Manning,et al.  Nested Named Entity Recognition , 2009, EMNLP.

[37]  C. E. SHANNON,et al.  A mathematical theory of communication , 1948, MOCO.

[38]  Xian Wu,et al.  Domain Adaptation with Latent Semantic Association for Named Entity Recognition , 2009, NAACL.