Paper Information - Language Identification With Confidence Limits

Language Identification With Confidence Limits

A statistical classification algorithm and its application to language identification from noisy input are described. The main innovation is to compute confidence limits on the classification, so that the algorithm terminates once enough evidence has been gathered to make a clear decision, thereby avoiding problems with categories that have similar characteristics. A second application, to genre identification, is briefly examined. The results show that some of the problems of other language identification techniques can be avoided, and illustrate a more important point: that a statistical language process can be used to provide feedback about its own success rate.
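The abstract does not say which features or which statistical test the paper uses, so the following is only a minimal sketch of the general idea of stopping once confidence limits separate the candidates. Everything in it is an assumption made for illustration: the toy word-probability tables in `MODELS`, the function name `classify_with_confidence`, the normal-approximation interval controlled by `z`, the `floor` probability for unseen words, and the `min_tokens` warm-up.

```python
import math

# Hypothetical toy models: word -> probability for each language (illustrative only).
MODELS = {
    "english": {"the": 0.3, "and": 0.2, "of": 0.2, "is": 0.1},
    "french": {"le": 0.3, "et": 0.2, "de": 0.2, "est": 0.1},
}


def classify_with_confidence(tokens, models, z=2.0, floor=1e-4, min_tokens=3):
    """Read tokens one at a time and stop as soon as one language is clearly ahead.

    Returns (label, tokens_used, decided). `decided` is False when the input
    ran out before the confidence intervals separated.
    """
    # Per language: running sum and sum of squares of per-token log-probabilities.
    stats = {lang: [0.0, 0.0] for lang in models}
    n = 0
    for tok in tokens:
        n += 1
        for lang, probs in models.items():
            score = math.log(probs.get(tok, floor))  # unseen words get the floor
            stats[lang][0] += score
            stats[lang][1] += score * score
        if n < min_tokens:
            continue  # too little evidence for a meaningful interval

        # Assumed interval: mean +/- z * standard error of the per-token score.
        intervals = {}
        for lang, (total, total_sq) in stats.items():
            mean = total / n
            var = max(total_sq / n - mean * mean, 0.0)
            half = z * math.sqrt(var / n)
            intervals[lang] = (mean - half, mean + half)

        best = max(stats, key=lambda lang: stats[lang][0])
        best_lower = intervals[best][0]
        # Decide only if the leader's lower limit beats every rival's upper limit.
        if all(best_lower > hi for lang, (_, hi) in intervals.items() if lang != best):
            return best, n, True

    if n == 0:
        return None, 0, False
    # Input exhausted: report the current leader, flagged as undecided.
    return max(stats, key=lambda lang: stats[lang][0]), n, False
```

Calling `classify_with_confidence("the cat is on the mat".split(), MODELS)` returns a label together with the number of tokens consumed and a flag saying whether the intervals actually separated. Returning that flag, rather than always forcing a decision, is one simple way for a classifier to report on its own reliability when the input is too short or the categories are too similar.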

David Elworthy

[1] Ted E. Dunning, et al. Statistical Identification of Language, 1994.

[2] Virginia P. Collier, et al. Two Languages Are Better Than One, 1998.

[3] Penelope Sibun, et al. Language Determination: Natural Language Processing from Scanned Document Images, 1994, ANLP.

[4] Philip Resnik, et al. A Language Identification Application Built on the Java Client/Server Platform, 1997.

[5] W. B. Cavnar, et al. N-gram-based text categorization, 1994.