Language morphology offset: Text classification on a Croatian-English parallel corpus

We investigate how, and to what extent, the morphological complexity of a language influences text classification using support vector machines (SVM). The Croatian-English parallel corpus provides the basis for a direct comparison of two languages of radically different morphological complexity. We quantified, compared, and statistically tested the effects of morphological normalisation on SVM classifier performance in a series of parallel experiments on both languages, carried out over a wide range of feature subset sizes obtained by different feature selection methods and at different levels of morphological normalisation. We also quantified the trade-off between feature space size and performance for different levels of morphological normalisation, and compared the results for both languages. Our experiments show that the improvements in SVM classifier performance are statistically significant; they are greater for small and medium numbers of features, especially for Croatian, whereas for large numbers of features the improvements are rather small and may be negligible in practice for both languages.
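To illustrate why morphological normalisation affects feature space size differently in the two languages, the following minimal sketch counts distinct bag-of-words features before and after a lemma lookup. The lemma dictionaries, token lists, and function names are hypothetical toy examples chosen for illustration; they are not the resources or the procedure used in the experiments.

```python
from collections import Counter

# Hypothetical toy lemma dictionaries (illustrative assumptions only).
# Croatian nouns inflect for case, so one lemma has many surface forms.
CROATIAN_LEMMAS = {
    "kuća": "kuća", "kuće": "kuća", "kući": "kuća", "kuću": "kuća", "kućom": "kuća",
    "grad": "grad", "grada": "grad", "gradu": "grad", "gradom": "grad",
}
# English nouns mostly vary only in number.
ENGLISH_LEMMAS = {
    "house": "house", "houses": "house",
    "city": "city", "cities": "city",
}


def feature_space(tokens, lemmas=None):
    """Return bag-of-words counts, optionally after lemma lookup."""
    if lemmas is not None:
        # Unknown tokens are kept unchanged.
        tokens = [lemmas.get(t, t) for t in tokens]
    return Counter(tokens)


def reduction(tokens, lemmas):
    """Return (distinct features before, distinct features after) normalisation."""
    raw = len(feature_space(tokens))
    normalised = len(feature_space(tokens, lemmas))
    return raw, normalised


hr_tokens = "kuća kuće kući kuću kućom grad grada gradu gradom".split()
en_tokens = "house houses house house house city city city cities".split()

print("Croatian:", reduction(hr_tokens, CROATIAN_LEMMAS))
print("English: ", reduction(en_tokens, ENGLISH_LEMMAS))
```

In this toy setting, nine Croatian surface forms collapse to two features, while the English tokens go from four features to two. This mirrors, in miniature, why normalisation can be expected to matter more for the morphologically richer language.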
