Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection

In this paper, we present three general-purpose neural network models for sentence boundary detection. We report on a series of experiments with long short-term memory (LSTM), bidirectional long short-term memory (BiLSTM), and convolutional neural network (CNN) architectures. We show that these neural network architectures outperform the popular OpenNLP framework, which is based on a maximum entropy model. We achieve state-of-the-art results both on multilingual benchmarks covering 12 languages and in a zero-shot scenario, and conclude that our trained models can be used to build a robust, language-independent sentence boundary detection system.
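
The abstract does not specify how text is presented to the models. As an illustration only, sentence boundary detection is often framed as binary classification over candidate characters, each described by a small window of surrounding characters. The sketch below shows that framing under stated assumptions: the candidate set, the window size, and the function name `context_windows` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical set of characters treated as potential sentence terminators.
CANDIDATES = {".", "!", "?"}


def context_windows(
    text: str, size: int = 5, pad: str = " "
) -> List[Tuple[int, str, str]]:
    """Return (position, left_context, right_context) for each candidate.

    Each context is exactly `size` characters long; contexts that run past
    the start or end of the text are padded with `pad`.
    """
    windows = []
    for i, ch in enumerate(text):
        if ch in CANDIDATES:
            left = text[max(0, i - size):i].rjust(size, pad)
            right = text[i + 1:i + 1 + size].ljust(size, pad)
            windows.append((i, left, right))
    return windows


sample = "Dr. Smith arrived. He left."
for pos, left, right in context_windows(sample):
    # A trained classifier (LSTM, BiLSTM, or CNN) would label each window
    # as boundary or non-boundary; here we only print the windows.
    print(pos, repr(left), repr(right))
```

In this framing, the period after "Dr" and the period after "arrived" produce different windows, and a character-level classifier would have to learn to tell them apart from that context alone.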
