Linguistic Characteristics of Censorable Language on SinaWeibo

This paper investigates censorship from a linguistic perspective. We collect a corpus of censored and uncensored posts on a number of topics and build a classifier that predicts censorship decisions independently of discussion topic. Our investigation reveals that the strongest linguistic indicator of censored content in our corpus is its readability.
