Addressing Scalability Issues of Named Entity Recognition Using Multi-Class Support Vector Machines

This paper explores the scalability issues associated with solving the Named Entity Recognition (NER) problem using Support Vector Machines (SVM) and high-dimensional features. It examines the performance results of a set of experiments conducted using binary and multi-class SVM with increasing training data sizes. The NER domain chosen for these experiments is biomedical publications, selected for its importance and inherent challenges. A simple machine learning approach is used that eliminates prior language knowledge such as part-of-speech or noun phrase tagging, thereby allowing for its applicability across languages. No domain-specific knowledge is included. The accuracy measures achieved are comparable to those obtained using more complex approaches, which motivates investigating ways to improve the scalability of multi-class SVM in order to make the solution more practical and usable. Improving the training time of multi-class SVM would make support vector machines a more viable and practical machine learning solution for real-world problems with large datasets. An initial prototype greatly improves training time at the expense of memory requirements.

In previous work (11), a simple architecture that eliminates language- and domain-specific knowledge from the named entity recognition process was applied to the English biomedical entity recognition task, as a baseline for other languages and domains. NER in the biomedical field remains a challenging task because of growing nomenclature, ambiguity in the left boundary of entities caused by descriptive naming, the difficulty of manually annotating large sets of training data, and strong overlap among different entities, to cite a few of the challenges in this domain. The approach used reduces the pre- and post-processing of the textual data to a minimum and capitalizes on SVM's strong generalization ability to classify the named entities.
The accuracy measures achieved are comparable to those obtained using more complex techniques, which encourages us to explore ways to improve the scalability of multi-class support vector machines. In this paper, the results of a set of scalability experiments are reported. These experiments use binary and multi-class SVM with a large set of real-world data from the biomedical literature. In Section II, the theory of binary and multi-class support vector machines is briefly introduced. Section III describes the design of the experiments and summarizes the results of a baseline experiment conducted during the previous work (11) to assess the feasibility of our language- and domain-independent machine learning NER approach using SVM and high-dimensional features. The baseline experiment design reduces pre-processing to feature extraction and eliminates the use of prior language or domain knowledge. The results of the baseline experiment motivate exploring ways to address the scalability issues of the All-Together multi-class SVM approach. Improving the scalability of multi-class SVM would provide the research community with a practical and powerful machine learning solution for named entity recognition. Such a solution promotes the use of high-dimensional features in place of more complex, labor-intensive, and time-consuming pre- and post-processing tasks, and simplifies the NER process while achieving good accuracy and performance measures. In Section IV, the results of several sets of single-class and multi-class scalability tests using SVM and increasing training data size are reported, and their impact on training time is examined. A sample of preliminary results using a prototype multi-class implementation based on SVM-Perf is also presented.
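To illustrate what reducing pre-processing to feature extraction might look like in practice, the following minimal sketch builds sparse binary features for one token from surface information only, with no part-of-speech tags, noun phrase chunks, or domain lexicons. The specific feature templates (a word window, short prefixes and suffixes, and a coarse orthographic shape) and the names `token_shape`, `extract_features`, and `build_index` are assumptions made for this example; they are not the feature set reported in the paper.

```python
def token_shape(token):
    """Collapse a token into a coarse orthographic pattern, e.g. 'IL-2' -> 'A-0'."""
    shape = []
    for ch in token:
        if ch.isupper():
            code = "A"
        elif ch.islower():
            code = "a"
        elif ch.isdigit():
            code = "0"
        else:
            code = ch  # keep punctuation such as '-' or '/' unchanged
        # Merge runs of the same character class into a single symbol.
        if not shape or shape[-1] != code:
            shape.append(code)
    return "".join(shape)


def extract_features(tokens, i, window=2, affix_len=3):
    """Return a sparse {feature_name: 1} dict for tokens[i].

    Hypothetical templates (assumed for illustration): lowercased words in a
    +/- `window` context, prefixes and suffixes up to `affix_len` characters,
    and the orthographic shape of the current token.
    """
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        word = tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"
        feats[f"w[{offset}]={word}"] = 1
    tok = tokens[i]
    for n in range(1, min(affix_len, len(tok)) + 1):
        feats[f"prefix={tok[:n]}"] = 1
        feats[f"suffix={tok[-n:]}"] = 1
    feats[f"shape={token_shape(tok)}"] = 1
    return feats


def build_index(feature_dicts):
    """Map every distinct feature name to a column id of the sparse input space."""
    index = {}
    for feats in feature_dicts:
        for name in feats:
            index.setdefault(name, len(index))
    return index


if __name__ == "__main__":
    sentence = ["IL-2", "gene", "expression"]
    rows = [extract_features(sentence, i) for i in range(len(sentence))]
    index = build_index(rows)
    print(len(index), "distinct features for", len(sentence), "tokens")
```

Because every distinct word, affix, and shape receives its own dimension, the feature space grows quickly with the corpus, which is the high-dimensional setting in which training time becomes the limiting factor.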

[1] S. Abe, et al. Spatially chunking support vector clustering algorithm, 2004, 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541).

[2] Jason Weston, et al. Trading convexity for scalability, 2006, ICML.

[3] Thomas Hofmann, et al. Large Margin Methods for Structured and Interdependent Output Variables, 2005, J. Mach. Learn. Res.

[4] Hae-Chang Rim, et al. Biomedical named entity recognition using two-phase model based on SVMs, 2004, J. Biomed. Informatics.

[5] Thomas Hofmann, et al. Support vector machine learning for interdependent and structured output spaces, 2004, ICML.

[6] Su Jian, et al. Exploring Deep Knowledge Resources in Biomedical Name Recognition, 2004, NLPBA/BioNLP.

[7] Gary Geunbae Lee, et al. POSBIOTM-NER in the Shared Task of BioNLP/NLPBA2004, 2004, NLPBA/BioNLP.

[8] Ulrich H.-G. Kreßel, et al. Pairwise classification and support vector machines, 1999.

[9] Hwee Tou Ng, et al. One Class per Named Entity: Exploiting Unlabeled Text for Named Entity Recognition, 2007, IJCAI.

[10] Antônio de Pádua Braga, et al. SVM-KM: speeding SVMs learning with a priori cluster selection and k-means, 2000, Proceedings. Vol.1. Sixth Brazilian Symposium on Neural Networks.

[11] Nello Cristianini, et al. Large Margin DAGs for Multiclass Classification, 1999, NIPS.

[12] Hae-Chang Rim, et al. Incorporating Lexical Knowledge into Biomedical NE Recognition, 2004, NLPBA/BioNLP.

[13] Daniel Boley, et al. Training Support Vector Machines Using Adaptive Clustering, 2004, SDM.

[14] Marc Rössler, et al. Adapting an NER-System for German to the Biomedical Domain, 2004, NLPBA/BioNLP.

[15] Thorsten Joachims, et al. Making large scale SVM learning practical, 1998.

[16] Nigel Collier, et al. Introduction to the Bio-entity Recognition Task at JNLPBA, 2004, NLPBA/BioNLP.

[17] Venansius Baryamureeba, et al. PROCEEDINGS OF WORLD ACADEMY OF SCIENCE, ENGINEERING AND TECHNOLOGY, VOL 8, 2005.

[18] Shigeo Abe. Support Vector Machines for Pattern Classification, 2010, Advances in Pattern Recognition.

[19] Thorsten Joachims, et al. Learning to classify text using support vector machines - methods, theory and algorithms, 2002, The Kluwer international series in engineering and computer science.

[20] Ethem Alpaydin, et al. Introduction to machine learning, 2004, Adaptive computation and machine learning.

[21] Gunnar Rätsch, et al. An introduction to kernel-based learning algorithms, 2001, IEEE Trans. Neural Networks.

[22] Venu Govindaraju, et al. Half-Against-Half Multi-class Support Vector Machines, 2005, Multiple Classifier Systems.

[23] Vladimir Vapnik, et al. Statistical learning theory, 1998.

[25] Claudio Giuliano, et al. Simple Information Extraction (SIE), 2005.

[26] Luca Zanni, et al. On the working set selection in gradient projection-based decomposition techniques for support vector machines, 2005, Optim. Methods Softw.

[27] Thorsten Joachims, et al. Training linear SVMs in linear time, 2006, KDD '06.

[28] Kristin P. Bennett, et al. Support vector machines: hype or hallelujah?, 2000, SKDD.

[29] Thorsten Joachims, et al. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998, ECML.

[30] J. Kalita, et al. Language and Domain-Independent Named Entity Recognition: Experiment using SVM and High-Dimensional Features, 2007.

[32] Jason Weston, et al. Large Scale Transductive SVMs, 2006, J. Mach. Learn. Res.

[33] Thorsten Joachims, et al. A support vector method for multivariate performance measures, 2005, ICML.

[34] Jun'ichi Tsujii, et al. GENIA corpus - a semantically annotated corpus for bio-textmining, 2003, ISMB.

[35] Koby Crammer, et al. On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, 2002, J. Mach. Learn. Res.