Quantifying the Semantic Core of Gender Systems

Many of the world’s languages mark grammatical gender on the lexeme. For instance, in Spanish, the word for house, “casa”, is feminine, whereas the word for paper, “papel”, is masculine. To a speaker of a genderless language, this categorization seems to have neither rhyme nor reason. But is the assignment of nouns to gender classes truly arbitrary? In this work, we present the first large-scale investigation of the arbitrariness of gender assignment, using canonical correlation analysis to correlate the gender of inanimate nouns with their lexical semantics. We find that the gender systems of 18 languages exhibit a significant correlation with an externally grounded definition of lexical semantics.
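
To make the idea of correlating gender with lexical semantics concrete, the following is a minimal sketch of canonical correlation analysis on synthetic data. It is not the pipeline used in this work: the embeddings, the gender labels, and the helper names `orthonormal_basis` and `first_canonical_correlation` are all assumptions made for illustration. Each noun is represented by a semantic vector and a one-hot gender vector, and the first canonical correlation is read off as the largest singular value between orthonormal bases of the two centered views.

```python
import numpy as np

def orthonormal_basis(M, tol=1e-10):
    """Orthonormal basis for the column space of the mean-centered matrix M."""
    Mc = M - M.mean(axis=0)
    U, s, _ = np.linalg.svd(Mc, full_matrices=False)
    # Drop directions with (numerically) zero variance, e.g. the redundant
    # column of a centered one-hot encoding.
    return U[:, s > tol * s.max()]

def first_canonical_correlation(X, Y):
    """Largest canonical correlation between row-aligned views X and Y."""
    Qx, Qy = orthonormal_basis(X), orthonormal_basis(Y)
    # Singular values of Qx^T Qy are the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(min(s[0], 1.0))

# Synthetic stand-ins (assumptions, not real data): 500 "nouns" with
# 10-dimensional "semantic" vectors, and a binary gender that depends
# noisily on the first semantic dimension.
rng = np.random.default_rng(0)
n_nouns, dim = 500, 10
embeddings = rng.standard_normal((n_nouns, dim))
gender = (embeddings[:, 0] + 0.5 * rng.standard_normal(n_nouns) > 0).astype(int)
one_hot = np.eye(2)[gender]

rho_real = first_canonical_correlation(embeddings, one_hot)

# Baseline: shuffling the gender labels breaks any link to the semantics.
shuffled = rng.permutation(one_hot)
rho_shuffled = first_canonical_correlation(embeddings, shuffled)

print(f"real: {rho_real:.3f}  shuffled: {rho_shuffled:.3f}")
```

The shuffled baseline matters because a canonical correlation computed and evaluated on the same nouns is never exactly zero, even for unrelated views. A claim of significance therefore needs a comparison against such a baseline, or an evaluation on held-out nouns.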
