Sina Mandarin Alphabetical Words: A Web-driven Code-mixing Lexical Resource

Mandarin Alphabetical Words (MAWs) are an indispensable component of Modern Chinese and show unique code-mixing idiosyncrasies influenced by language exchange. Yet this phenomenon has not been properly addressed, and MAWs are mostly excluded from the Chinese language system. This paper addresses the core problem of MAW identification and proposes to construct a large collection of MAWs from Sina Weibo (SMAW) using an automatic web-based technique that combines rule-based identification, informatics-based extraction, and validation against the Baidu search engine. With this technique, a collection of 16,207 qualified SMAWs is obtained, along with an annotated corpus of more than 200,000 sentences for linguistic research and applied inquiries.
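To make the rule-based identification step concrete, the following is a minimal sketch of how candidate strings might be collected from a sentence. It is not the authors' implementation: the function name `candidate_maws`, the restriction to ASCII letters, the single-character context window, and the use of the basic CJK Unified Ideographs range are all assumptions made for illustration. The sketch only proposes candidates; deciding which of them are real words would be left to the later extraction and validation steps.

```python
import re

# Assumption: Han characters are approximated by the basic CJK Unified
# Ideographs block; extension blocks and full-width letters are ignored.
HAN = "\u4e00-\u9fff"

# A run of Latin letters with an optional single Han neighbour on each side.
PATTERN = re.compile(rf"([{HAN}]?)([A-Za-z]+)([{HAN}]?)")


def candidate_maws(sentence):
    """Return candidate strings that mix Latin letters with Han characters.

    Hypothetical helper: for each Latin-letter run it proposes the run joined
    with its left neighbour, its right neighbour, and both neighbours.
    Runs with no adjacent Han character yield no candidates.
    """
    candidates = []
    for left, letters, right in PATTERN.findall(sentence):
        if left:
            candidates.append(left + letters)
        if right:
            candidates.append(letters + right)
        if left and right:
            candidates.append(left + letters + right)
    return candidates


if __name__ == "__main__":
    for text in ["T恤很好看", "照了X光片", "I love NLP"]:
        print(text, candidate_maws(text))
```

A one-character window deliberately over-generates (for example, it proposes both a true word and a spurious left-context string for the same letter run), which is why a statistical filtering step and an external validation step are natural follow-ups.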
