Topic marking in a Shanghainese corpus: From observation to prediction

Shanghainese is a highly topic-prominent language in which many topic markers compete with one another, often with no obvious basis for selecting one marker over another. We explore the influence of five variables on the five most frequent topic markers in a corpus of spoken Shanghainese: topic length, syntactic category of the topic, function of the topic, comment type, and genre. We carry out a multivariate statistical analysis of the data, relying on a polytomous logistic regression model. Our approach yields a satisfactory quantification of the role of each factor in influencing the choice of topic marker, as well as estimates of the probabilities associated with combinations of factors. This study serves simultaneously as an introduction to the polytomous package (Arppe 2013) in the statistical
