Text Categorization for Multi-label Documents and Many Categories

In this paper, we propose a new classification method for assigning textual documents to multiple categories. We call it Matrix Regression (MR) because it resembles regression in a high-dimensional space. Experiments on a medical corpus of hospital records to be classified by ICD (International Classification of Diseases) code demonstrate the validity of the MR approach. We compared MR with three algorithms frequently used in text categorization: k-Nearest Neighbors, Centroid, and Support Vector Machine. The experimental results show that our method outperforms them in both precision and classification time.
