Experiments on sentence segmentation in Old Swedish editions

We present experiments on automatic segmentation of electronic Old Swedish editions into sentence-like units. Our target material shows great variation in the types of boundaries that are marked orthographically, in the extent of boundary marking, and in the means of boundary marking. We begin with an exploration of boundary marking in a large, unannotated corpus of Old Swedish texts. We then show that a conditional random field model trained on a manually annotated corpus improves upon a simple but effective segmentation baseline. The more valuable lesson from the modelling, however, is that the variation in boundary marking needs to be addressed explicitly.
