Chinese semantic document classification based on strategies of semantic similarity computation and correlation analysis

Abstract: Document classification has become an indispensable technology for intelligent information services. It is often applied to tasks such as document organization, analysis, and archiving, or implemented as a submodule that supports higher-level applications. Semantic analysis has been shown to improve the performance of document classification. Although earlier automatic document classification methods already incorporated semantic analysis, the growing number of documents stored online has drawn greater attention to the use of semantic information, because it can greatly reduce human effort. In this paper, we propose two semantic document classification strategies for two types of semantic problems: (1) a novel semantic similarity computation (SSC) method to solve the polysemy problem and (2) a strong correlation analysis method (SCM) to solve the synonym problem. Experimental results indicate that, compared with traditional machine learning, n-gram, and contextualized word embedding methods, the efficient semantic similarity and correlation analysis eliminate word ambiguity and extract useful features, improving the accuracy of semantic document classification for Chinese texts.
