In this paper, we propose several methods for the diachronic analysis of the Italian language. We build several models by exploiting Temporal Random Indexing and the Google Ngram dataset for the Italian language. Each proposed method is evaluated on its ability to automatically identify meaning shifts over time. To this end, we introduce a new dataset built by exploiting the etymological information reported in several dictionaries.

1 Motivation and Background

Languages can be studied from two different and complementary viewpoints: the diachronic perspective considers the evolution of a language over time, while the synchronic perspective describes the language rules at a specific point in time without taking its history into account (De Saussure, 1983). In this work, we focus on the diachronic approach, since language is unquestionably immersed in the temporal dimension: it undergoes constant evolution driven by the need to reflect the continuous changes of the world.

The evolution of word meanings has been studied for several centuries, but this kind of investigation has long been limited by the small amount of data available for the analysis. Moreover, in order to reveal structural changes in word meanings, the analysis has to cover long periods of time. Nowadays, the large amount of digital content opens new perspectives for the diachronic analysis of language, but it also calls for efficient computational approaches. In this scenario, Distributional Semantic Models (DSMs) represent a promising solution. DSMs represent words as points in a geometric space, generally called a WordSpace (Schütze, 1993; Sahlgren, 2006), simply by analysing how words are used in a corpus. However, a WordSpace is a snapshot of a specific corpus and does not take temporal information into account.

Since its first release, the Google Ngram dataset (Michel et al., 2011) has inspired a large body of work on the analysis of cultural trends and linguistic variation. Moving away from merely frequentist approaches, Distributional Semantic Models have proved to be quite effective in measuring meaning shift through the analysis of variations in word co-occurrences. One of the earliest attempts is that of Gulordava and Baroni (2011), where a co-occurrence matrix is used to model the semantics of terms. In this model, similarly to ours, the cosine similarity between the vectors representing a term in two different periods is exploited as a predictor of meaning shift: low values suggest a change in the words that co-occur with the target. The co-occurrence matrix is weighted with local mutual information scores and the context elements are fixed across the different time periods, hence the spaces are directly comparable. However, this kind of direct comparison does not hold when the vector representation is manipulated, as in dimensionality reduction methods (e.g., SVD) or learning approaches (e.g., word2vec): in these cases, each space has its own coordinate axes, and some kind of alignment between the spaces is required. To this end, Hamilton et al. (2016) use orthogonal Procrustes, while Kulkarni et al. (2015a) learn a transformation matrix; a minimal sketch of the alignment idea is given below.
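To make the alignment step concrete, here is a minimal sketch of orthogonal Procrustes alignment between two embedding spaces, in the spirit of Hamilton et al. (2016) but not their implementation. The matrices `A` and `B` are hypothetical placeholders for the embeddings of two periods, with rows indexed by a shared vocabulary; all names and sizes are illustrative assumptions.

```python
import numpy as np

def procrustes_align(A, B):
    """Orthogonal matrix W minimising ||A @ W - B||_F.

    Closed-form solution: W = U V^T, where U S V^T is the SVD of A^T B.
    """
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical |V| x d embedding matrices for two time periods;
# rows are indexed by a shared vocabulary.
rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 100))
B = rng.normal(size=(1000, 100))

W = procrustes_align(A, B)
# Low cross-period similarity for a word suggests a meaning shift.
shift_score = 1.0 - cosine(A[42] @ W, B[42])
```

The orthogonality constraint preserves distances and angles within each space, so the alignment only removes the arbitrary rotation between the two coordinate systems without distorting either space.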
In this paper, we propose an evolution of our previous work (Basile et al., 2014; Basile et al., 2015) for analysing word meanings over time. Differently from the models of Hamilton et al. (2016) and Kulkarni et al. (2015a), ours builds a separate WordSpace for each time period in terms of the same shared random vectors; the resulting word vectors are therefore directly comparable with one another without any alignment step (a sketch of this idea follows the roadmap below). In particular, we propose an efficient method for building a DSM that takes temporal information into account and relies on a very large corpus: the Google Ngram dataset for the Italian language. Moreover, for the first time, we provide a dataset for the evaluation of word meaning change-point detection specifically built for the Italian language.

The paper is structured as follows: Section 2 provides details about our methodology, while Section 3 describes the dataset we have developed and the results of a preliminary evaluation. Section 4 reports final remarks and future work.
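To make the shared-random-vector idea concrete, here is a minimal sketch of Random Indexing applied to two time periods. This is our own simplification, not the system described in Section 2: the corpus is an invented toy example (the real input would be slices of the Google Ngram data), and all names and parameters (`DIM`, `SEEDS`, `build_space`, the window size) are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

DIM, SEEDS = 500, 10  # vector length and number of non-zero components

def random_vector(rng):
    """Sparse ternary random vector, as used in Random Indexing."""
    v = np.zeros(DIM)
    idx = rng.choice(DIM, size=SEEDS, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=SEEDS)
    return v

def build_space(sentences, random_vectors, window=2):
    """Accumulate, for each target word, the random vectors of its neighbours."""
    space = defaultdict(lambda: np.zeros(DIM))
    for tokens in sentences:
        for i, target in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    space[target] += random_vectors[tokens[j]]
    return space

# Toy corpora for two periods ("rete" as fishing net vs. computer network).
period1 = [["la", "rete", "da", "pesca"]]
period2 = [["la", "rete", "di", "computer"]]

rng = np.random.default_rng(42)
vocab = {w for s in period1 + period2 for w in s}
rvecs = {w: random_vector(rng) for w in vocab}  # shared across all periods

s1, s2 = build_space(period1, rvecs), build_space(period2, rvecs)
u, v = s1["rete"], s2["rete"]
sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)  # low -> shift
```

Because the random vectors are generated once and reused for every period, each period's WordSpace lives in the same coordinate system, so the cosine similarity between `u` and `v` is meaningful and no post-hoc alignment is needed.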
References

[1] Jean-Baptiste Michel, et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 2011.
[2] Magnus Sahlgren. The Word-Space Model: Using Distributional Analysis to Represent Syntagmatic and Paradigmatic Relations between Words in High-Dimensional Vector Spaces. Ph.D. thesis, Stockholm University, 2006.
[3] Kristina Gulordava and Marco Baroni. A Distributional Similarity Approach to the Detection of Semantic Change in the Google Books Ngram Corpus. In Proceedings of the GEMS 2011 Workshop, 2011.
[4] W. A. Taylor. Change-Point Analysis: A Powerful New Tool for Detecting Changes. 2000.
[5] Michael I. Jordan, et al. Advances in Neural Information Processing Systems 30. 1995.
[6] Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.
[7] William L. Hamilton, Jure Leskovec, and Dan Jurafsky. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In Proceedings of ACL, 2016.
[8] Hinrich Schütze. Word Space. In Advances in Neural Information Processing Systems (NIPS), 1993.
[9] Ferdinand de Saussure. Course in General Linguistics. 1916 (English translation, 1983).
[10] Trevor Cohen, Roger Schvaneveldt, and Dominic Widdows. Reflective Random Indexing and Indirect Inference: A Scalable Method for Discovery of Implicit Connections. Journal of Biomedical Informatics, 2010.
[11] Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. Statistically Significant Detection of Linguistic Change. In Proceedings of WWW, 2015.