Multi-Fusion Chinese WordNet (MCW): Compound of Machine Learning and Manual Correction

Princeton WordNet (PWN) is a lexical-semantic network grounded in cognitive linguistics that has promoted the development of natural language processing. Five Chinese wordnets have been built on PWN to address problems of syntax and semantics: Northeastern University Chinese WordNet (NEW), Sinica Bilingual Ontological WordNet (BOW), Southeast University Chinese WordNet (SEW), Taiwan University Chinese WordNet (CWN), and Chinese Open WordNet (COW). In using them, we found that these wordnets have low accuracy and coverage and cannot fully reproduce the semantic network of PWN. We therefore built a new Chinese wordnet, Multi-Fusion Chinese WordNet (MCW), to address these shortcomings. The key idea is to extend SEW with the help of the Oxford and Xinhua bilingual dictionaries and then correct the result. Specifically, the correction combines machine learning with manual adjustment, guided by two standards that we formulated. To compare lemma accuracy, we conducted experiments on three tasks: relatedness calculation, word similarity, and word sense disambiguation. We also compared coverage. The results indicate that our method improves both the coverage and the accuracy of MCW. However, there is still room for improvement, especially for lemmas. In future work, we will continue to improve the accuracy of MCW and expand the set of concepts it contains.
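To make the two evaluation notions concrete, the following is a minimal sketch of how coverage and lemma accuracy could be computed for a Chinese wordnet aligned to PWN synsets. It is not the evaluation procedure described above, which measures accuracy through three downstream tasks; it substitutes a direct comparison against a small hand-checked gold sample. The function names, the synset identifiers, and the toy data are all hypothetical.

```python
from typing import Dict, Set

# Assumed representation: PWN synset ID -> set of Chinese lemmas.
Wordnet = Dict[str, Set[str]]


def coverage(wordnet: Wordnet, pwn_synsets: Set[str]) -> float:
    """Fraction of PWN synsets that have at least one Chinese lemma."""
    if not pwn_synsets:
        return 0.0
    covered = sum(1 for synset in pwn_synsets if wordnet.get(synset))
    return covered / len(pwn_synsets)


def lemma_accuracy(wordnet: Wordnet, gold: Wordnet) -> float:
    """Fraction of the wordnet's lemmas, over the synsets in the gold
    sample, that the gold sample also lists for the same synset."""
    total = 0
    correct = 0
    for synset, gold_lemmas in gold.items():
        lemmas = wordnet.get(synset, set())
        total += len(lemmas)
        correct += len(lemmas & gold_lemmas)
    return correct / total if total else 0.0


# Hypothetical toy data, for illustration only.
pwn_synsets = {"dog.n.01", "cat.n.01", "tree.n.01", "run.v.01"}
toy_wordnet: Wordnet = {
    "dog.n.01": {"狗", "犬"},
    "cat.n.01": {"猫", "毛"},   # "毛" is a deliberately wrong lemma
    "tree.n.01": set(),          # synset present but without lemmas
}
gold_sample: Wordnet = {
    "dog.n.01": {"狗", "犬"},
    "cat.n.01": {"猫"},
}

print(coverage(toy_wordnet, pwn_synsets))        # 0.5
print(lemma_accuracy(toy_wordnet, gold_sample))  # 0.75
```

In this toy setup, two of the four PWN synsets carry at least one lemma, and three of the four lemmas in the sampled synsets agree with the gold sample.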
