Profiling-UD: a Tool for Linguistic Profiling of Texts

In this paper, we introduce Profiling-UD, a new text analysis tool inspired by the principles of linguistic profiling that can support language variation research from different perspectives. It allows the extraction of more than 130 features spanning different levels of linguistic description. Beyond the large number of features that can be monitored, a main novelty of Profiling-UD is that it has been specifically devised to be multilingual, since it is based on the Universal Dependencies framework. In the second part of the paper, we demonstrate the effectiveness of these features in a number of theoretical and applied studies in which they were successfully used for text and author profiling.
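To make the idea of feature extraction over Universal Dependencies annotation concrete, the following is a minimal sketch in Python. It is not the Profiling-UD implementation: the function names (`read_sentences`, `depth`, `profile`), the sample sentences, and the four features shown are assumptions chosen for illustration. It only assumes the standard 10-column CoNLL-U layout used by Universal Dependencies treebanks.

```python
from collections import Counter

# Hypothetical toy input in CoNLL-U format (10 tab-separated columns per token).
SAMPLE = """\
# text = The cat sleeps.
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tcat\tcat\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\t_
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_

# text = Dogs bark loudly.
1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_
3\tloudly\tloudly\tADV\t_\t_\t2\tadvmod\t_\t_
4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""


def read_sentences(conllu):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in conllu.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            # Keep plain tokens only; skip multiword ranges (1-2) and empty nodes (1.1).
            if cols[0].isdigit():
                current.append({
                    "id": int(cols[0]),
                    "upos": cols[3],
                    "head": int(cols[6]),
                    "deprel": cols[7],
                })
    if current:
        sentences.append(current)
    return sentences


def depth(token, by_id):
    """Number of dependency arcs between a token and the root of its sentence."""
    d = 0
    while token["head"] != 0:
        token = by_id[token["head"]]
        d += 1
    return d


def profile(sentences):
    """Compute a few illustrative document-level features (an assumed selection)."""
    tokens = [t for s in sentences for t in s]
    upos_counts = Counter(t["upos"] for t in tokens)
    # Linear distance between each dependent and its head, ignoring root and punctuation.
    links = [abs(t["id"] - t["head"]) for t in tokens
             if t["head"] != 0 and t["deprel"] != "punct"]
    max_depths = []
    for s in sentences:
        by_id = {t["id"]: t for t in s}
        max_depths.append(max(depth(t, by_id) for t in s))
    return {
        "n_sentences": len(sentences),
        "tokens_per_sentence": len(tokens) / len(sentences),
        "upos_dist": {tag: n / len(tokens) for tag, n in upos_counts.items()},
        "avg_link_length": sum(links) / len(links),
        "avg_max_depth": sum(max_depths) / len(max_depths),
    }


if __name__ == "__main__":
    print(profile(read_sentences(SAMPLE)))
```

Because the sketch reads only the universal part-of-speech, head, and dependency relation columns, the same code applies unchanged to any language annotated under Universal Dependencies, which is the property the multilingual design relies on.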
