Word unit based multilingual comparative analysis of text corpora

A parallel study of three very different languages (Hungarian, German and English) using text corpora of similar size makes it possible to explore both similarities and differences. Corpora built from publicly available Internet sources were used. The corpus size was the same (approx. 20 MB, 2.5-3.5 million word forms) for all languages. Besides traditional corpus coverage, word length and occurrence statistics, some new features related to prosodic boundaries (sentence-initial and sentence-final positions, positions preceding and following a comma) were also computed. Among other results, it was found that the coverage of the corpora by the most frequent words follows a parallel logarithmic rule for all languages in the 40-85% coverage range. The coverage functions for English and German are much closer to each other than to that of Hungarian. The results can be applied in such diverse domains as predictive text input, word hyphenation, language modeling in speech recognition, corpus-based speech synthesis, etc.
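
To illustrate the coverage statistic referred to in the abstract, the following minimal Python sketch computes what fraction of a corpus is covered by its k most frequent word forms. The tokenization rule (lowercased runs of letters) and the function names coverage_curve and words_needed are assumptions made for this example only; they are not the procedure used in the study.

```python
import re
from collections import Counter


def coverage_curve(text):
    """Return cumulative coverage (0..1) by the k most frequent word forms.

    Element k-1 of the result is the share of all tokens accounted for
    by the k most frequent word forms.
    """
    # Assumption: a word form is a maximal run of letters, lowercased.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())

    curve = []
    covered = 0
    # most_common() yields (word, count) pairs in descending count order.
    for _, count in counts.most_common():
        covered += count
        curve.append(covered / total)
    return curve


def words_needed(curve, target):
    """Smallest number of top-ranked word forms whose coverage reaches target."""
    for k, cov in enumerate(curve, start=1):
        if cov >= target:
            return k
    return None  # target not reachable (e.g. empty corpus)


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    curve = coverage_curve(sample)
    print(curve)
    print(words_needed(curve, 0.6))
```

Plotting coverage against the logarithm of k for each language would show whether the curves run in parallel over a given range, which is the kind of comparison the abstract describes.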