Geography of social ontologies: Testing a variant of the Sapir-Whorf Hypothesis in the context of Wikipedia

In this article, we test a variant of the Sapir-Whorf Hypothesis using methods from complex network theory. To this end, we analyze social ontologies as a new resource for automatic language classification. Our method explores only the structural features of social ontologies in order to predict family resemblances among the languages that the corresponding communities use to build these ontologies. This approach is based on a reformulation of the Sapir-Whorf Hypothesis in terms of distributed cognition. Starting from a corpus of 160 Wikipedia-based social ontologies, we test our variant of the hypothesis in several experiments and find that our approach outperforms the corresponding baselines. All in all, the article develops an approach to classifying linguistic networks of tens of thousands of vertices by exploring a small range of mathematically well-established topological indices.
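The general idea of classifying networks purely by topological indices can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four indices (density, mean out-degree, out-degree entropy, share of leaves), the nearest-neighbour rule, the toy graphs, and the labels "family A" and "family B" are hypothetical and are not taken from the article, whose actual indices and classifiers are not specified in this abstract.

```python
import math
from collections import Counter


def topological_indices(adj):
    """Return a small vector of structural indices for a directed graph.

    adj maps each vertex to the set of its successors
    (for example, a category to its subcategories).
    """
    n = len(adj)
    out_degrees = [len(successors) for successors in adj.values()]
    edges = sum(out_degrees)
    density = edges / (n * (n - 1)) if n > 1 else 0.0
    mean_out = edges / n if n else 0.0
    # Shannon entropy (in bits) of the out-degree distribution.
    counts = Counter(out_degrees)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
    leaf_share = counts.get(0, 0) / n if n else 0.0
    return (density, mean_out, entropy, leaf_share)


def nearest_label(query, labeled):
    """Assign the label of the closest index vector (Euclidean 1-nearest neighbour)."""
    return min(labeled, key=lambda item: math.dist(query, item[0]))[1]


# Toy graphs standing in for ontologies; the family labels are hypothetical.
star = {"a": {"b", "c", "d"}, "b": set(), "c": set(), "d": set()}
chain = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set()}
labeled = [
    (topological_indices(star), "family A"),
    (topological_indices(chain), "family B"),
]

# An unseen graph is classified from its index vector alone.
unseen = {"r": {"w", "x", "y", "z"}, "w": set(), "x": set(), "y": set(), "z": set()}
predicted = nearest_label(topological_indices(unseen), labeled)
```

The classifier never inspects vertex names or textual content, only the shape of the graph, which mirrors the purely structural setting described above. In this toy case the unseen graph is star-shaped, so its index vector lies closest to the labeled star.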
