Unsupervised Clustering of Morphologically Related Chinese Words

Chia-Ling Lee (r00922072@ntu.edu.tw)
Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan

Ya-Ning Chang (yaningchang@gate.sinica.edu.tw)
Institute of Linguistics, Academia Sinica, Taipei, Taiwan

Chao-Lin Liu (chaolin@nccu.edu.tw)
Department of Computer Science, National Chengchi University, Taipei, Taiwan

Chia-Ying Lee (chiaying@gate.sinica.edu.tw)
Institute of Linguistics, Academia Sinica, Taipei, Taiwan

Jane Yung-jen Hsu (yjhsu@csie.ntu.edu.tw)
Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan

Abstract

Many linguists consider morphological awareness a major factor that affects children’s reading development. A Chinese character embedded in different compound words may carry related but different meanings. For example, “商店(store)”, “商品(commodity)”, “商代(Shang Dynasty)”, and “商朝(Shang Dynasty)” can form two clusters: {“商店”, “商品”} and {“商代”, “商朝”}. In this paper, we aim at unsupervised clustering of a given family of morphologically related Chinese words. Successfully differentiating these words can contribute to both computer assisted Chinese learning and natural language understanding. In Experiment 1, we employed linguistic factors at the word, syntactic, semantic, and contextual levels in aggregated computational linguistics methods to handle the clustering task. In Experiment 2, we recruited adults and children to perform the clustering task. Experimental results indicate that our computational model achieved the same level of performance as children.
Keywords: morphological awareness; human cognition; computational linguistics; Chinese character meaning

Introduction

Morphological awareness, defined as “children’s conscious awareness of the morphemic structure of words and their ability to reflect on and manipulate that structure”, is associated with children’s reading ability and comprehension (Liu & McBride-Chang, 2010; Kirby et al., 2012; Ku & Anderson, 2003). Many linguists think it strongly affects reading development in children (Liu & McBride-Chang, 2010).

A Chinese character embedded in different compound words may carry related but different meanings. For example, the meaning of the character “商/shang1/” in the words “商店(store)” and “商品(commodity)” is commerce. In contrast, in “商代(Shang Dynasty)”, “商” refers to a Chinese dynasty. Successful clustering of related Chinese words would contribute to Chinese learning. In addition, differentiating the character’s meanings in such morphologically related words can facilitate Chinese word sense disambiguation and help improve Chinese word segmentation (Navigli, 2009).

In this research, we employ natural language processing and computational linguistics techniques to differentiate the meanings of a particular character that is embedded in different Chinese words. We apply different methods that take diverse factors into account, such as grammar, syntax, semantics, and context. We also aggregate all methods to build a better ensemble model. Furthermore, we conduct another experiment in which we ask adults and children to do the same clustering task. Experimental results indicate that our model can achieve the same level of performance as children in the clustering task.

There is previous work related to morphological awareness. Wang, Hsu, Tien, and Pomplun (2012) predicted raters’ transparency judgments of Chinese morphological character based on latent semantic analysis (LSA) (Landauer, Foltz, & Laham, 1998).
If a word is more similar to the primary meaning, it is more likely to be judged as semantically transparent, and opaque otherwise.

Galmar and Chen (2010) tried to identify different meanings of a Chinese character using LSA and semantic pattern matching in augmented minimum spanning trees. Galmar (2011) built a term-by-document matrix and used the batch version of self-organizing maps (Kohonen, 2001) to visualize the interplay between morphology and semantics in Chinese words.

To discriminate Chinese character meanings, we consider, in addition to LSA techniques, information from a broader range of linguistic aspects. Many word-to-word semantic similarity or relatedness measures have been proposed in the past. In knowledge-based approaches, WordNet¹ was

¹ http://wordnet.princeton.edu
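To make the clustering task described in this introduction concrete, the following is a minimal sketch, not the authors' model. It assumes each word in the “商” family is represented by a vector of context co-occurrence counts, compares words by cosine similarity, and merges the most similar groups with average-link agglomerative clustering. The counts in `CONTEXT_COUNTS`, the four context features, the function names, and the choice of two clusters are all hypothetical and were invented for illustration only; they do not come from any corpus or from the paper.

```python
import math
from itertools import combinations

# Hypothetical co-occurrence counts (invented for illustration, not corpus data).
# Each vector counts how often a word appears near four assumed context
# features: (trade, goods, history, emperor).
CONTEXT_COUNTS = {
    "商店": [8, 5, 0, 0],
    "商品": [6, 9, 1, 0],
    "商代": [0, 1, 7, 5],
    "商朝": [1, 0, 6, 8],
}


def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_link(c1, c2, vectors):
    """Mean pairwise cosine similarity between the members of two clusters."""
    sims = [cosine(vectors[a], vectors[b]) for a in c1 for b in c2]
    return sum(sims) / len(sims)


def cluster_family(vectors, n_clusters):
    """Greedy agglomerative clustering: repeatedly merge the two most similar clusters."""
    clusters = [[word] for word in vectors]
    while len(clusters) > n_clusters:
        # combinations() yields index pairs with i < j, so deleting j is safe.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_link(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# The number of clusters is assumed to be known here (an assumption of this sketch).
clusters = cluster_family(CONTEXT_COUNTS, n_clusters=2)
print(clusters)
```

With these toy counts, the commerce-sense words “商店” and “商品” end up in one cluster and the dynasty-sense words “商代” and “商朝” in the other. Real inputs would replace the hand-made vectors with similarity scores derived from corpora or lexical resources.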

[1] Mirella Lapata, et al. Dependency-Based Construction of Semantic Space Models, 2007, CL.

[2] Teuvo Kohonen, et al. Self-Organizing Maps, 2010.

[3] Bin Liu, et al. Measuring Semantic Similarity between Words Using HowNet, 2008, 2008 International Conference on Computer Science and Information Technology.

[4] Richard C. Anderson, et al. Development of morphological awareness in Chinese and English, 2003.

[5] Donald A. Jackson, et al. Similarity Coefficients: Measures of Co-Occurrence and Association or Simply Measures of Occurrence?, 1989, The American Naturalist.

[6] Lesly Wade-Woolley, et al. Children’s morphological awareness and reading ability, 2012.

[7] Philip Resnik, et al. Using Information Content to Evaluate Semantic Similarity in a Taxonomy, 1995, IJCAI.

[8] Christiane Fellbaum, et al. Combining Local Context and Wordnet Similarity for Word Sense Identification, 1998.

[9] Peter D. Turney. Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL, 2001, ECML.

[10] Ted Pedersen, et al. Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts, 2006.

[11] Zellig S. Harris, et al. Distributional Structure, 1954.

[12] Martin Chodorow, et al. Combining local context and wordnet similarity for word sense identification, 1998.

[13] Ted Pedersen, et al. WordNet::Similarity - Measuring the Relatedness of Concepts, 2004, NAACL.

[14] Roberto Navigli, et al. Word sense disambiguation: A survey, 2009, CSUR.

[15] Jenn-Yeu Chen, et al. Identifying Different Meanings of a Chinese Morpheme through Semantic Pattern Matching in Augmented Minimum Spanning Trees, 2010, Prague Bull. Math. Linguistics.

[16] Hinrich Schütze, et al. Book Reviews: Foundations of Statistical Natural Language Processing, 1999, CL.

[17] Bruno Galmar, et al. Using Kohonen Maps of Chinese Morphological Families to Visualize the Interplay of Morphology and Semantics in Chinese, 2011, ROCLING/IJCLCLP.

[18] Hai Zhao, et al. Character-Level Dependencies in Chinese: Usefulness and Learning, 2009, EACL.

[19] Hsueh-Cheng Wang, et al. Estimating Semantic Transparency of Constituents of English Compounds and Two-Character Chinese Words using Latent Semantic Analysis, 2012, CogSci.

[20] Gene H. Golub, et al. Singular value decomposition and least squares solutions, 1970, Milestones in Matrix Computation.

[21] Patrick F. Reidy. An Introduction to Latent Semantic Analysis, 2009.

[22] Dekang Lin, et al. Using Syntactic Dependency as Local Context to Resolve Word Sense Ambiguity, 1997, ACL.

[23] C. McBride-Chang, et al. What Is Morphological Awareness? Tapping Lexical Compounding Awareness in Chinese Third Graders, 2010.

[24] Carlo Strapparava, et al. Corpus-based and Knowledge-based Measures of Text Semantic Similarity, 2006, AAAI.

[25] Yue Zhang, et al. Chinese Parsing Exploiting Characters, 2013, ACL.

[26] Chu-Ren Huang, et al. Segmentation Standard for Chinese Natural Language Processing, 1996, COLING.