Person-specific named entity recognition using SVM with rich feature sets

Purpose: This study explores the potential of natural language processing (NLP) and machine learning (ML) techniques, and aims to find a feasible strategy and an effective approach for the named entity recognition (NER) task in Web-oriented, person-specific information extraction.
Design/methodology/approach: An SVM-based multi-class classification approach, combined with a rich set of features derived from state-of-the-art NLP techniques, is proposed for the NER task. A group of experiments was designed to investigate how various NLP-based features, especially semantic features, influence system performance. Optimal parameter settings for the SVM models, including the kernel function, the margin parameter, and the context window size, were also explored experimentally.
Findings: The SVM-based multi-class classification approach proved effective for the NER task. The work shows that NLP-based features, particularly semantic features, are of great importance in data-driven NE recognition. The study indicates that a higher-order kernel function may not be desirable for this specific classification problem in practical applications; the simple linear-kernel SVM model performed better in this case. Moreover, the modified SVM models with an uneven margin parameter are more general and flexible, and were shown to handle the imbalanced data problem better.
Research limitations/implications: The SVM-based approach to the NER problem has been shown to be effective only on limited experimental data. Further research needs to be conducted on large batches of real Web data. In addition, the performance of the NER system needs to be tested when it is incorporated into a complete information extraction (IE) framework.
Originality/value: The specially designed experiments make it feasible to fully explore the characteristics of the data and to obtain optimal parameter settings for the NER task, leading to favorable recall, precision, and F1 measures. The overall system performance (F1 value) across all types of named entities reaches above 88.6%, which can meet the requirements of practical application.
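
The three quantities the abstract tunes and reports (context window size, the uneven margin parameter, and the F1 measure) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The feature names, the `PER`/`ORG`/`O` labels, the default `tau=0.4`, and all function names are assumptions made for illustration. The loss follows one common formulation of the uneven-margin SVM, in which positive examples must score at least 1 and negative examples at most `-tau`, so that `tau=1` recovers the standard hinge loss.

```python
def window_features(tokens, i, size=2):
    """Sparse features for token i from a +/- `size` context window (hypothetical feature names)."""
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        # Positions outside the sentence are mapped to a padding symbol.
        word = tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"
        feats[f"w[{offset}]={word}"] = 1.0
    feats["is_capitalized"] = 1.0 if tokens[i][:1].isupper() else 0.0
    return feats


def uneven_margin_hinge(score, label, tau=0.4):
    """Hinge loss with margin 1 for positives (label == 1) and margin tau for negatives."""
    if label == 1:
        return max(0.0, 1.0 - score)
    return max(0.0, tau + score)


def precision_recall_f1(gold, predicted, target):
    """Precision, recall and F1 for one entity class, computed per token."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == p == target)
    fp = sum(1 for g, p in zip(gold, predicted) if p == target and g != target)
    fn = sum(1 for g, p in zip(gold, predicted) if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    sentence = ["Dr", "Jane", "Smith", "teaches"]
    print(window_features(sentence, 1, size=1))
    print(uneven_margin_hinge(-0.5, -1, tau=1.0), uneven_margin_hinge(-0.5, -1, tau=0.4))
    print(precision_recall_f1(["PER", "O", "PER", "ORG"], ["PER", "PER", "O", "ORG"], "PER"))
```

With `tau` below 1, a negative example that scores only slightly below zero incurs no loss. This shifts the decision boundary toward the abundant negative class and is one way to favor recall on rare entity types. In a multi-class setting, one such binary classifier would typically be trained per entity type.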
