Classifying Wikipedia Articles into NE's Using SVM's with Threshold Adjustment

This paper presents a method for recognizing named entity articles in multilingual Wikipedia. The method classifies articles using a variety of structured and unstructured features, aided by Wikipedia's cross-language links and features. Adding multilingual features boosts classification accuracy and is shown to classify multilingual pages effectively in a language-independent way. Classification is first performed with a Support Vector Machine (SVM) classifier; the SVM threshold is then adjusted to improve recall. Threshold adjustment uses the beta-gamma threshold adjustment algorithm, a post-learning step that shifts the SVM hyperplane. This approach boosted recall with minimal effect on precision.
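As a rough illustration of the threshold adjustment step, the sketch below interpolates between a utility-optimal threshold and a lower zero-utility threshold, in the form beta-gamma adjustment is commonly described. It is not the authors' implementation: the linear utility (+1 per true positive, -1 per false positive), the values of `beta` and `gamma`, the helper names, and the toy scores and labels are all assumptions made for this example.

```python
import math

def utility(scores, labels, threshold):
    """Assumed linear utility: +1 per true positive, -1 per false positive."""
    total = 0
    for s, y in zip(scores, labels):
        if s >= threshold:
            total += 1 if y == 1 else -1
    return total

def beta_gamma_threshold(scores, labels, beta=0.1, gamma=0.05):
    """Interpolate between the utility-optimal and zero-utility thresholds."""
    # Candidate thresholds are the observed decision scores, highest first.
    candidates = sorted(set(scores), reverse=True)
    utils = [utility(scores, labels, t) for t in candidates]
    best = max(range(len(candidates)), key=lambda i: utils[i])
    theta_opt = candidates[best]
    # theta_zero: first candidate below theta_opt whose utility is <= 0;
    # fall back to the lowest score if utility never drops that far.
    theta_zero = candidates[-1]
    for i in range(best + 1, len(candidates)):
        if utils[i] <= 0:
            theta_zero = candidates[i]
            break
    # alpha shrinks toward beta as the number of examples grows.
    alpha = beta + (1 - beta) * math.exp(-len(scores) * gamma)
    return alpha * theta_zero + (1 - alpha) * theta_opt

def recall(scores, labels, threshold):
    """Fraction of positive examples scored at or above the threshold."""
    positives = sum(1 for y in labels if y == 1)
    hits = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    return hits / positives if positives else 0.0

# Toy SVM decision values and gold labels (hypothetical data).
scores = [2.0, 1.5, 1.1, 0.6, 0.2, -0.3, -0.5, -0.8, -1.1, -1.4, -1.8, -2.2]
labels = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0]

theta = beta_gamma_threshold(scores, labels)
recall_default = recall(scores, labels, 0.0)    # default SVM decision boundary
recall_adjusted = recall(scores, labels, theta)  # relaxed boundary
print(f"adjusted threshold: {theta:.3f}")
print(f"recall at 0: {recall_default:.3f}, recall after adjustment: {recall_adjusted:.3f}")
```

On this toy data the adjusted threshold falls below the default boundary of 0, so more positive articles are accepted and recall rises. How much precision is lost depends on how many negatives sit between the two thresholds.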
