N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus

In this paper we compare n-grams and morphological normalization, two inherently different text preprocessing methods, as applied to text classification on a Croatian-English parallel corpus. We compare the two techniques by measuring both computational performance (execution time and memory consumption) and classification performance. We show that although n-grams achieve classification performance comparable to traditional word-based feature extraction and can act as a substitute for morphological normalization, they are computationally much more demanding.
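The kind of comparison described above can be illustrated with a minimal sketch. It assumes character-level n-grams with n = 3 and a plain whitespace tokenizer for the word-based baseline. The function names (`word_features`, `char_ngram_features`, `profile`) and the two sample sentences are hypothetical and are not taken from the paper. It uses only the Python standard library and measures time and peak memory for feature extraction alone, not for a full classification experiment.

```python
import time
import tracemalloc
from collections import Counter


def word_features(text):
    """Bag-of-words counts over lowercased whitespace tokens (assumed baseline)."""
    return Counter(text.lower().split())


def char_ngram_features(text, n=3):
    """Counts of overlapping character n-grams (n = 3 is an assumption)."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def profile(extractor, documents):
    """Return (seconds, peak_bytes, vocabulary_size) for one extraction pass."""
    tracemalloc.start()
    start = time.perf_counter()
    vocabulary = set()
    for doc in documents:
        vocabulary.update(extractor(doc))
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak, len(vocabulary)


if __name__ == "__main__":
    # Illustrative sentences only; not drawn from the corpus used in the paper.
    docs = ["Vlada je donijela odluku", "The government made a decision"]
    extractors = [("words", word_features), ("3-grams", char_ngram_features)]
    for name, extractor in extractors:
        seconds, peak_bytes, vocab = profile(extractor, docs)
        print(f"{name}: {seconds:.6f} s, peak {peak_bytes} B, {vocab} features")
```

On a realistic corpus the n-gram extractor would be expected to produce a much larger feature space than the word-based one, which is the source of the extra time and memory cost the abstract refers to. The actual figures depend on the corpus and on the chosen n.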
