Using data-mining to identify and study patterns in lexical innovation on the web

Abstract: This paper presents the NeoCrawler, a tailor-made web crawler that identifies and retrieves neologisms from the Internet and systematically monitors their use on the web by means of weekly searches. It enables researchers to use the web as a corpus in order to investigate the dynamics of lexical innovation systematically and on a large scale. As a web-mining tool, the NeoCrawler opens up new opportunities for linguists to tackle a number of unresolved and under-researched issues in the field of lexical innovation. The paper describes the design and the most important characteristics of two modules, the Discoverer and the Observer, with regard to the usage-based study of lexical innovation and diffusion.
