Unsupervised Learning of Period Disambiguation for Tokenisation

A language-independent period disambiguation method is presented which achieves high accuracy (> 99.5 %) and requires no information other than the corpus which is to be tokenised. The method automatically extracts statistical information about likely abbreviations, about sentence-initial words, and about words which precede or follow numbers. This information is used to disambiguate periods and to recognise ordinal numbers and abbreviations. The recognition of abbreviations in languages with long compound nouns, such as German, is enhanced by suffix analysis.
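
To make the idea of extracting abbreviation statistics from the corpus itself more concrete, the following is a minimal sketch and not the method's actual scoring. It assumes whitespace tokenisation and a simple heuristic: a word type that occurs several times and (almost) always with a trailing period is treated as a likely abbreviation. The function name `abbreviation_candidates` and the thresholds `min_count` and `min_ratio` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def abbreviation_candidates(tokens, min_count=2, min_ratio=0.99):
    """Return lower-cased word types that look like abbreviations.

    Assumed heuristic (illustrative only): a type is a candidate if it is
    seen with a trailing period at least `min_count` times and the share of
    its occurrences that carry the period is at least `min_ratio`.
    """
    with_period = Counter()
    without_period = Counter()
    for tok in tokens:
        if tok.endswith(".") and len(tok) > 1:
            # Strip the final period and count the bare word type.
            with_period[tok[:-1].lower()] += 1
        else:
            without_period[tok.lower()] += 1

    candidates = set()
    for word, n_dot in with_period.items():
        total = n_dot + without_period[word]
        if n_dot >= min_count and n_dot / total >= min_ratio:
            candidates.add(word)
    return candidates


text = "Dr. Smith saw the dog. Dr. Jones fed the dog today. The dog slept."
tokens = text.split()
candidates = abbreviation_candidates(tokens)
print(candidates)
```

In this toy corpus "Dr." always carries a period, whereas "dog" also occurs without one, so only the former is kept. A real corpus would need far more data and the additional statistics on sentence-initial words and number contexts mentioned above.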