Paper Information - A Maximum Entropy Approach to Identifying Sentence Boundaries

A Maximum Entropy Approach to Identifying Sentence Boundaries

We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries, our model learns to classify each occurrence of ., ?, and ! as either a valid or invalid sentence boundary. The training procedure requires no hand-crafted rules, lexica, part-of-speech tags, or domain-specific information. The model can therefore be trained easily on any genre of English, and should be trainable on any other Roman-alphabet language. Performance is comparable to or better than the performance of similar systems, but we emphasize the simplicity of retraining for new domains.

Adwait Ratnaparkhi | Jeffrey C. Reynar

[1] Geoffrey Nunberg, et al. The linguistics of punctuation, 1990.

[2] Michael White. Presenting Punctuation, 1995, ArXiv.

[3] Marti A. Hearst, et al. Adaptive Multilingual Sentence Boundary Disambiguation, 1997, CL.

[4] Michael Collins, et al. A New Statistical Parser Based on Bigram Lexical Dependencies, 1996, ACL.

[5] J. Darroch, et al. Generalized Iterative Scaling for Log-Linear Models, 1972.

[6] Adwait Ratnaparkhi, et al. A Maximum Entropy Model for Part-Of-Speech Tagging, 1996, EMNLP.

[7] Marti A. Hearst, et al. Adaptive Sentence Boundary Disambiguation, 1994, ANLP.

[8] Penelope Sibun, et al. A Practical Part-of-Speech Tagger, 1992, ANLP.

[9] Michael Riley, et al. Some Applications of Tree-based Modelling to Speech and Language, 1989, HLT.

[10] Eric Brill, et al. Some Advances in Transformation-Based Part of Speech Tagging, 1994, AAAI.

[11] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.