Modeling Language Change in Historical Corpora: The Case of Portuguese

This paper presents a series of experiments on modeling language change in a historical Portuguese corpus of literary texts for the purpose of temporal text classification. Algorithms were trained to classify texts by publication date, taking into account lexical variation, represented as word n-grams, and morphosyntactic variation, represented by part-of-speech (POS) distributions. We report 99.8% accuracy using word unigram features with a Support Vector Machine (SVM) classifier to predict the publication date of documents in time intervals of both one century and half a century. A feature analysis is performed to identify the most informative features for this task and to examine how they are linked to language change.
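As a rough illustration of the setup described above, the following minimal sketch shows how a publication year could be mapped to a century or half-century class label and how word unigram counts could be extracted from a text. The function names, the tokenization rule, and the two toy sentences are assumptions made for illustration; they are not the paper's code or corpus, and the SVM training step itself is omitted.

```python
import re
from collections import Counter


def interval_label(year: int, width: int = 100) -> str:
    """Map a year to a fixed-width interval label, e.g. 1650 -> '1600-1699'."""
    start = (year // width) * width
    return f"{start}-{start + width - 1}"


def unigram_counts(text: str) -> Counter:
    """Lowercase the text and count word unigrams (runs of Unicode letters)."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)


# Invented toy documents (year, text); assumptions, not taken from any corpus.
docs = [
    (1650, "o rei mandou huma carta"),
    (1920, "o presidente mandou uma carta"),
]

# Build (label, features) pairs for both interval widths named in the abstract.
for width in (100, 50):
    dataset = [(interval_label(year, width), unigram_counts(text))
               for year, text in docs]
    for label, features in dataset:
        print(width, label, dict(features))
```

The resulting label and count pairs could then be vectorized and passed to any linear SVM implementation; which toolkit and settings were actually used is not stated in this section.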
