Integrating Language Knowledge Resources to Extend the English Inclusion Classifier to a New Language

This paper presents an unsupervised system that classifies English inclusions in written text. It demonstrates that this English inclusion classifier, originally designed for German, can be adapted to a new language, in this case French, with minimal time and effort. Several evaluation experiments on French and German data show that the system performs well for both languages and on unseen data from the same domain and language.
