Boosted Web Named Entity Recognition via Tri-Training

Named entity extraction is a fundamental task for many natural language processing applications on the web. Existing studies rely on annotated training data, which is expensive to obtain at scale, and this limits recognition effectiveness. In this research, we propose a semisupervised learning approach to constructing web named entity recognition (NER) models via automatic labeling and tri-training. The former uses structured resources containing known named entities to label training data automatically, while the latter exploits unlabeled examples to improve extraction performance. Because the automatically labeled training data may contain noise, a follow-up self-testing procedure removes low-confidence annotations to produce higher-quality training data. Furthermore, we modify tri-training for sequence labeling and derive a proper initialization for training on large datasets to improve entity recognition. Finally, we apply this semisupervised learning framework to person name, business organization name, and location name recognition. For Chinese NER, the framework achieves F-measures of 0.911, 0.849, and 0.845 for person, business organization, and location names, respectively. The same framework is also applied to English and Japanese business organization name recognition and yields models with F-measures of 0.832 and 0.803.
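The two ingredients named in the abstract, automatic labeling from a list of known entities and agreement between taggers on unlabeled data, can be illustrated with a minimal Python sketch. It is not the method of the paper: the function names `auto_label` and `agreed_sequences`, the BIO tag scheme, longest-match lookup, and whole-sequence agreement as the selection rule are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Sequence, Set, Tuple


def auto_label(tokens: Sequence[str],
               known_entities: Set[Tuple[str, ...]],
               tag: str = "ORG") -> List[str]:
    """Assign BIO labels by longest match against a set of known entities.

    Hypothetical helper: each known entity is a tuple of tokens.
    """
    labels = ["O"] * len(tokens)
    max_len = max((len(e) for e in known_entities), default=0)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in known_entities:
                matched = n
                break
        if matched:
            labels[i] = "B-" + tag
            for j in range(i + 1, i + matched):
                labels[j] = "I-" + tag
            i += matched
        else:
            i += 1
    return labels


def agreed_sequences(pred_a: Sequence[Sequence[str]],
                     pred_b: Sequence[Sequence[str]]) -> List[int]:
    """Return indices of unlabeled sentences on which two taggers agree.

    Assumption: agreement means identical labels for the whole sentence.
    In tri-training, sentences selected this way would be added to the
    training set of the third tagger.
    """
    return [k for k, (a, b) in enumerate(zip(pred_a, pred_b))
            if list(a) == list(b)]


if __name__ == "__main__":
    sentence = ["Acme", "Corp", "opened", "a", "store", "in", "Taipei"]
    known = {("Acme", "Corp"), ("Acme",)}
    print(auto_label(sentence, known))
    print(agreed_sequences([["O", "B-ORG"], ["O", "O"]],
                           [["O", "B-ORG"], ["B-ORG", "O"]]))
```

Dictionary matching of this kind mislabels ambiguous or partial mentions, which is the noise the self-testing step in the abstract is meant to filter out.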
