Text Classification of Technical Papers Based on Text Segmentation

The goal of this research is to design a multi-label classification model that determines the research topics of a given technical paper. Based on the idea that papers are well organized and that some parts of a paper are more important than others for text classification, segments such as the title, abstract, introduction and conclusion are used intensively in the text representation. In addition, two new features, Title Bi-Gram and Title SigNoun, are used to improve performance. The experimental results indicate that feature selection based on text segmentation and these two features are both effective. Furthermore, we propose a new text classification model based on the structure of papers, called the Back-off model, which achieves an Exact Match Ratio of 60.45% and an F-measure of 68.75%. The Back-off model also outperforms two existing methods, ML-kNN and Binary Approach.
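To make the two reported metrics concrete, the following is a minimal sketch of how Exact Match Ratio and an F-measure can be computed for multi-label predictions. The abstract does not say which F-measure variant was used, so the example-based (per-paper) average shown here is an assumption. The function names and the toy topic labels are hypothetical and are not taken from the paper.

```python
from typing import List, Set


def exact_match_ratio(true_sets: List[Set[str]], pred_sets: List[Set[str]]) -> float:
    """Fraction of papers whose predicted topic set equals the true topic set exactly."""
    if len(true_sets) != len(pred_sets):
        raise ValueError("true_sets and pred_sets must have the same length")
    if not true_sets:
        return 0.0
    matches = sum(1 for t, p in zip(true_sets, pred_sets) if t == p)
    return matches / len(true_sets)


def example_based_f_measure(true_sets: List[Set[str]], pred_sets: List[Set[str]]) -> float:
    """Mean per-paper F1 (assumed variant; the paper may define F-measure differently)."""
    if len(true_sets) != len(pred_sets):
        raise ValueError("true_sets and pred_sets must have the same length")
    if not true_sets:
        return 0.0
    total = 0.0
    for t, p in zip(true_sets, pred_sets):
        if not t and not p:
            # Both sets empty: treat as a perfect prediction by convention.
            total += 1.0
            continue
        # Per-paper F1 = 2 * |true AND predicted| / (|true| + |predicted|)
        total += 2 * len(t & p) / (len(t) + len(p))
    return total / len(true_sets)


# Toy data: three papers with hypothetical topic labels.
true_topics = [{"ml", "nlp"}, {"ir"}, {"nlp"}]
pred_topics = [{"ml", "nlp"}, {"ir", "nlp"}, {"ml"}]

emr = exact_match_ratio(true_topics, pred_topics)
f1 = example_based_f_measure(true_topics, pred_topics)
print(f"Exact Match Ratio: {emr:.4f}")
print(f"F-measure (example-based): {f1:.4f}")
```

Exact Match Ratio gives no credit for partially correct topic sets, whereas the F-measure does. This is consistent with the reported F-measure (68.75%) being higher than the reported Exact Match Ratio (60.45%).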
