iSentenizer-μ: Multilingual Sentence Boundary Detection Model

Sentence boundary detection (SBD) systems are normally quite sensitive to the genre of the data on which they are trained. Genre here refers to shifts in text topic and to new language domains. Although detection models can be retrained for different languages or new text genres, the previous model has to be discarded and the creation process restarted from scratch. In this paper, we present a multilingual sentence boundary detection system (iSentenizer-μ) for Danish, German, English, Spanish, Dutch, French, Italian, Portuguese, Greek, Finnish, and Swedish. The proposed system is able to detect sentence boundaries in a mixture of different text genres and languages with high accuracy. We employ the i+Learning algorithm, an incremental tree learning architecture, to construct the system. Under this incremental learning framework, iSentenizer-μ adapts to text of different topics and Roman-alphabet languages by merging new data into the existing model, learning new knowledge incrementally by revision instead of retraining. The system has been extensively evaluated on different languages and text genres and compared against two state-of-the-art SBD systems, Punkt and MaxEnt. The experimental results show that the proposed system outperforms the other systems on all datasets.
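
To make the contrast between incremental revision and retraining concrete, the following is a minimal sketch in Python. It is not the i+Learning algorithm, whose details are not given in this abstract; it swaps in a simple count-based classifier so that the update step is easy to follow. The class name `IncrementalBoundaryModel`, the two features, and the default for unseen contexts are all hypothetical choices made for illustration.

```python
from collections import defaultdict


class IncrementalBoundaryModel:
    """Toy count-based stand-in for an incrementally updated SBD classifier.

    Assumption: this is NOT i+Learning; it only illustrates merging new
    labelled data into an existing model without rebuilding it.
    """

    def __init__(self):
        # counts[feature_tuple] = [non_boundary_count, boundary_count]
        self.counts = defaultdict(lambda: [0, 0])

    @staticmethod
    def features(prev_token, next_token):
        # Hypothetical features for a period between prev_token and next_token:
        # is the preceding token short, and is the following token capitalized?
        return (len(prev_token) <= 3, next_token[:1].isupper())

    def update(self, examples):
        # Merge new examples into the existing counts; earlier data is kept.
        for prev_token, next_token, is_boundary in examples:
            key = self.features(prev_token, next_token)
            self.counts[key][int(is_boundary)] += 1

    def predict(self, prev_token, next_token):
        key = self.features(prev_token, next_token)
        non_boundary, boundary = self.counts.get(key, (0, 0))
        # Assumption: ties and unseen contexts default to "boundary".
        return boundary >= non_boundary


model = IncrementalBoundaryModel()
# First batch: English examples (token before the period, token after, label).
model.update([("Dr", "Smith", False), ("home", "The", True)])
# Later batch: German examples are merged in; the model is not rebuilt.
model.update([("usw", "und", False), ("Haus", "Er", True)])

print(model.predict("Mr", "Jones"))
print(model.predict("work", "She"))
```

The second call to `update` extends the same table of counts that the first call built, which mirrors, in a much simpler form, the idea of revising an existing model with data from a new language or genre.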
