Hybrid Chinese text classification approach using general knowledge from Baidu Baike

Most previous studies address the text classification (TC) task by enriching the text representation. However, conventional approaches to Chinese TC based on the vector space model (VSM) study intensively only the words and their relationships within a specific corpus or dataset; they ignore the basic concepts of the categories and the general knowledge behind the words, which people learn and use to recognize entities. This paper also focuses on enriching text representation and proposes a novel approach that supplements Chinese TC with information from the online Chinese encyclopedia Baidu Baike. The similarities between each text and both the concept of each category and the most related words from Baidu Baike are added to the feature space. The performance of the proposed approach is measured on the Fudan University TC corpus, an imbalanced Chinese dataset. In the experiments, the proposed Baidu Baike-based concept similarity approach obtains promising results compared with a previous study and the conventional method: macro-precision of 90.31%, recall of 75.45%, and F1 score of 80.32%, which are about 0.02%, 0.15%, and 0.12% higher, respectively, than the conventional method. The approach clearly improves recall for some small categories while keeping precision at a high level, and thereby improves the macro F1 score. Moreover, the proposed approach is readily extensible: many other knowledge bases could be integrated, and many other concepts could be referenced, to improve effectiveness. © 2016 Institute of Electrical Engineers of Japan. Published by John Wiley & Sons, Inc.
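
The idea of adding concept similarities to the feature space can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the similarity measure, the term weighting, or how Baidu Baike entries are retrieved and segmented. The sketch assumes cosine similarity over raw term counts, pre-segmented tokens, and a hypothetical `concept_texts` mapping from each category name to tokens taken from its encyclopedia entry; the function names `cosine` and `augment` are likewise illustrative.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def augment(doc_tokens, vocabulary, concept_texts):
    """Return the VSM term-count vector followed by one similarity per category.

    doc_tokens    -- tokens of one document (word segmentation assumed done)
    vocabulary    -- ordered list of terms defining the base VSM dimensions
    concept_texts -- hypothetical mapping: category name -> tokens of the
                     encyclopedia entry describing that category's concept
    """
    doc_vec = Counter(doc_tokens)
    base = [float(doc_vec[t]) for t in vocabulary]
    # One extra feature per category, in sorted category order for stability.
    sims = [cosine(doc_vec, Counter(concept_texts[c])) for c in sorted(concept_texts)]
    return base + sims


# Toy data, invented purely for illustration.
vocabulary = ["比赛", "球队", "市场", "胜利"]
concept_texts = {
    "sports": ["比赛", "运动员", "球队"],
    "economy": ["市场", "经济", "投资"],
}
features = augment(["球队", "比赛", "胜利"], vocabulary, concept_texts)
print(features)
```

The first four entries of `features` are the ordinary term counts, and the last two are the similarities to the "economy" and "sports" concepts. Any classifier that accepts the original VSM vector can consume the extended vector unchanged, which is one way to read the extensibility claim: a further knowledge base would simply contribute more similarity columns.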
