Towards language independent automated learning of text categorization models

We describe the results of extensive machine learning experiments on large collections of Reuters’ English and German newswires. The goal of these experiments was to automatically discover classification patterns that can be used to assign topics to individual newswires. Our results with the English newswire collection show a very large gain in performance compared with published benchmarks, while our initial results with the German newswires appear very promising. We present our methodology, which seems to be insensitive to the language of the document collections, and discuss issues related to the differences between the results obtained for the two collections.
