Chinese Language Processing Based on Stroke Representation and Multidimensional Representation

With the development of deep learning and artificial intelligence, deep neural networks are increasingly being applied to natural language processing tasks. However, most research on natural language processing focuses on alphabetic languages; few studies have paid attention to the characteristics of ideographic languages such as Chinese. In addition, existing Chinese processing algorithms typically treat Chinese words or characters as the basic units, ignoring the information contained in the deeper structure of Chinese characters. In the Chinese language, each character can be decomposed into several components, or further into strokes. Strokes are thus the basic units of a Chinese character, much as letters are the basic units of an English word. Inspired by the success of character-level neural networks, we delve deeper into Chinese writing at the stroke level for Chinese language processing. We extract basic stroke features by considering similar Chinese characters in order to learn a continuous representation of Chinese characters. Furthermore, word embeddings trained at different granularities are not identical. In this paper, we propose an algorithm that combines different representations of Chinese words within a single neural network to obtain a better word representation. We develop a Chinese word representation service for several natural language processing tasks, and we introduce cloud computing to handle preprocessing and the training of basic representations at different granularities.
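The abstract describes two ideas: treating strokes as the subword units of Chinese characters, and fusing word representations trained at different granularities inside one network. A minimal sketch of both follows; the stroke table, the stroke n-gram window, and the softmax-weighted fusion are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

# Hypothetical stroke-decomposition table. A real system would use a full
# dictionary mapping every Chinese character to its ordered stroke sequence.
STROKE_TABLE = {
    "大": ["horizontal", "left-falling", "right-falling"],
    "人": ["left-falling", "right-falling"],
}

def stroke_ngrams(word, n_min=2, n_max=3):
    """Flatten a word's characters into one stroke sequence, then take
    stroke n-grams, analogous to character n-grams over English letters."""
    strokes = [s for ch in word for s in STROKE_TABLE.get(ch, [])]
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [tuple(strokes[i:i + n]) for i in range(len(strokes) - n + 1)]
    return grams

def combine(views, weights):
    """Fuse granularity-specific embeddings of the same word by a
    softmax-weighted sum (one simple way to merge multiple views)."""
    w = np.exp(weights - np.max(weights))
    w = w / w.sum()
    return sum(wi * v for wi, v in zip(w, views))

# Toy embeddings standing in for pre-trained word-, character-, and
# stroke-level vectors of the same word.
rng = np.random.default_rng(0)
dim = 8
word_emb, char_emb, stroke_emb = (rng.normal(size=dim) for _ in range(3))

fused = combine([word_emb, char_emb, stroke_emb], np.array([0.2, 0.5, 0.3]))
grams = stroke_ngrams("大人")
```

In this sketch the fused vector keeps the dimensionality of its inputs, so it can drop into any downstream model that expected a single-granularity embedding.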
