Paper information - Statistical Corpus and Language Comparison on Comparable Corpora

With the wide availability of textual data in various languages, domains, and registers, it is easy to create text corpora for a variety of applications, among them many in the field of Natural Language Processing. The Leipzig Corpora Collection has been creating and using such corpora for more than fifteen years. However, preprocessing distributed resources to ensure homogeneity, and thus comparability, is an ongoing process. The resulting corpora, provided in identical formats, allow different statistical methods to be applied to generate data for manual or automatic analysis. These data form the basis for applications in intra- and inter-language comparison or quality assurance of text collections.

Thomas Eckart | Uwe Quasthoff
