From Non Word to New Word: Automatically Identifying Neologisms in French Newspapers

In this paper we present a statistical machine learning approach to neologism detection that goes some way beyond the use of exclusion lists. We explore the impact of three groups of features: form-related, morpho-lexical and thematic. Thematic features have not previously been used in this kind of application; they offer a way to access the semantic context of new words. The results suggest that form-related features help with the overall classification task, while morpho-lexical and thematic features are better at singling out true neologisms.
