Semantic Entity Identification in Large Scale Data via Statistical Features and DT-SVM

Semantic entities carry the most important semantics of text data. However, traditional approaches such as named entity recognition and new word identification detect only some specific types of entities. In addition, they generally adopt sequence annotation algorithms such as the Hidden Markov Model (HMM) and Conditional Random Field (CRF), which can exploit only limited context information. As a result, they are ineffective at extracting semantic entities that never appeared in the training data. In this paper we propose a strategy to extract unknown semantic entities from text by integrating statistical features with Decision Tree (DT) and Support Vector Machine (SVM) algorithms. With the proposed statistical features and novel classification approach, our strategy can detect more semantic entities than traditional approaches such as CRF and Bootstrapping-SVM methods. It is particularly sensitive to new entities that first appear in fresh data. Our experimental results show that the precision, recall and F1 score of our strategy are on average about 23.6%, 21.5% and 25.8% higher, respectively, than those of the representative approaches.
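
For readers unfamiliar with the evaluation measures named above, the following is a minimal sketch of how precision, recall and F1 are conventionally computed for an entity extraction task. It is not taken from the paper: the function name `precision_recall_f1` and the representation of entities as plain strings in sets are assumptions made only for illustration, and the paper's own matching criteria may differ.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of extracted entities against a gold-standard set.

    Hypothetical helper for illustration: entities are compared by exact
    string match, which is an assumption, not the paper's stated protocol.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    # Precision: fraction of extracted entities that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold entities that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Under this convention, an extractor that finds more previously unseen entities raises recall, while precision falls only if the additional extractions are wrong.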
