Text Representations for Text Categorization: A Case Study in Biomedical Domain

In the vector space model (VSM), textual documents are represented as vectors in the term space. This representation raises two issues: (1) what a term should be and (2) how a term should be weighted. This paper examined ways to represent text along these two dimensions to improve the performance of text categorization. Different representations were evaluated using SVM on three biomedical corpora. The controlled experiments showed that the straightforward use of named entities as terms in VSM does not improve performance over the bag-of-words representation. The term weighting method, on the other hand, slightly improved performance. To improve text categorization further, however, more advanced techniques and more effective uses of natural language processing for text representation appear to be needed.
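To make the two questions concrete, the following is a minimal sketch of a bag-of-words VSM in which each term is a whitespace-delimited word and each weight is a plain tf-idf value. Both choices are assumptions made for illustration: the abstract does not say which weighting scheme was studied, and the function names and toy documents here are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: a "term" is a lowercased, whitespace-delimited word
    # (the bag-of-words choice; named entities would be an alternative).
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of raw document strings."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        # Weight = raw term frequency * log inverse document frequency.
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


# Hypothetical toy documents, not taken from the paper's corpora.
docs = [
    "protein binds receptor",
    "protein kinase activity",
    "receptor signaling pathway",
]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes one vector over the shared vocabulary, which could then be passed to a classifier such as an SVM. Swapping `tokenize` changes what counts as a term, and swapping the weight expression changes how a term is weighted, which mirrors the two aspects the abstract separates.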
