Can Type-Token Ratio be Used to Show Morphological Complexity of Languages?

ABSTRACT Type-token ratio (TTR), also expressed as vocabulary size divided by text length (V/N), is a simple measure of lexical diversity. It has been used in literary studies, studies of child language and even psychiatry. The basic problem with TTR is that it is affected by the length of the text sample. Several remedies have been suggested, including standardizing the length of text samples and using logarithms in the basic formula. We show in this paper that simple TTR and its more elaborate variant, the moving-average type-token ratio (MATTR), can be used to approximate the morphological complexity of languages. This use of TTR was noted earlier by Juola in an analysis of six languages. We analyse text material with TTR and MATTR from two differing sources: first, the text of the EU constitution in 21 languages, and second, non-parallel random data from the Leipzig corpus, available for 16 of the same languages. We compare the results of the automatic analysis to two independent linguistic measures of morphological complexity. First, we use the number of non-homographic noun forms in a language's inflectional paradigms, i.e. the paradigm size. Second, we use the available figures for inflectional synthesis of the verb produced by the AUTOTYP project. We enrich our corpus findings with data from information retrieval (IR) results. It has been suggested that the improvement in IR effectiveness gained by managing word form variation depends on the morphological complexity of the language. This IR gain data can thus serve as independent evidence in evaluating morphological complexity. Our results show that the earlier Juola complexity figures correlate moderately with the TTR and MATTR calculations in the EU constitution data. The figures given by TTR and MATTR correlate highly with each other in both corpora, and they also correlate highly with the number of non-homographic noun forms in a language. Correlation with the inflectional synthesis of the verb was weakly positive in most cases, but the data were scarce. All three computed measures order the languages quite meaningfully by morphological complexity: the ordering groups most languages with languages of the same kind, and the most and least complex languages are clearly separated. TTR and MATTR also appear to order the languages quite consistently across both corpora. In the conclusion we discuss how the complexity figures can be utilized.
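
As a rough illustration of the two measures named in the abstract, the following minimal sketch computes plain TTR (V/N) and a moving-average TTR over a list of tokens. The function names `ttr` and `mattr`, the default window of 500 tokens, and the whitespace tokenization in the usage note are assumptions made for this example only; the abstract does not state the window size or the preprocessing used in the paper.

```python
from collections import Counter


def ttr(tokens):
    """Plain type-token ratio: number of distinct tokens (V) over text length (N)."""
    if not tokens:
        raise ValueError("token list must not be empty")
    return len(set(tokens)) / len(tokens)


def mattr(tokens, window=500):
    """Moving-average TTR: mean TTR over every window of `window` consecutive tokens.

    The window size is an assumed default, not a value taken from the abstract.
    Texts no longer than the window fall back to plain TTR.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    n = len(tokens)
    if n == 0:
        raise ValueError("token list must not be empty")
    if n <= window:
        return ttr(tokens)

    # Token counts for the current window, updated incrementally as it slides.
    counts = Counter(tokens[:window])
    types = len(counts)
    total = types / window

    for i in range(window, n):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
            types -= 1

        entering = tokens[i]
        if counts[entering] == 0:  # Counter returns 0 for absent keys
            types += 1
        counts[entering] += 1

        total += types / window

    return total / (n - window + 1)
```

For example, `ttr("the cat sat on the mat".split())` gives 5/6, since the six tokens contain five distinct types. Because every window has the same length, the moving average does not shrink mechanically as the text grows, which is the length sensitivity of plain TTR that the abstract describes.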
