Statistical Corpus and Language Comparison on Comparable Corpora

With the wide availability of textual data in various languages, domains and registers it is easy to create text corpora for a variety of applications. These include, among many others, the field of Natural Language Processing. The Leipzig Corpora Collection creates and uses such corpora for more than fifteen years. However, the work on preprocessing distributed resources to ensure homogeneity and thus comparability is a steady process. As a result created corpora in identical formats allow the use of different statistical methods to generate various data for manual or automatic analysis. These are basis for applications in intra- and inter-language comparison or quality assurance of text stocks.