Wikipedia-based hybrid document representation for textual news classification

The sheer number of news items published every day makes it worthwhile to automate their classification. The common approach represents each news item by the frequency of the words it contains and uses supervised learning algorithms to train a classifier. This bag-of-words (BoW) approach is oblivious to three aspects of natural language: synonymy, polysemy, and multiword terms. More sophisticated representations based on concepts (units of meaning) have been proposed, following the intuition that document representations that better capture the semantics of text will lead to higher performance in automatic classification tasks. In practice, however, the BoW representation has proven remarkably strong for news classification, with several studies reporting that it outperforms different 'flavours' of the bag-of-concepts (BoC) representation. In this paper, we propose a hybrid classifier that enriches the traditional BoW representation with concepts extracted from text, leveraging Wikipedia as background knowledge for the semantic analysis of text (WikiBoC). We benchmarked the proposed classifier against BoW and several BoC approaches: Latent Dirichlet Allocation (LDA), Explicit Semantic Analysis, and word embeddings (doc2vec). We used two corpora: the well-known Reuters-21578, composed of newswire items, and a new corpus created specifically for this study, the Reuters-27000. Results show that (1) the performance of concept-based classifiers is very sensitive to the corpus used, being higher on the more "concept-friendly" Reuters-27000; (2) the proposed Hybrid-WikiBoC approach offers performance increases over BoW of up to 4.12% and 49.35% when classifying the Reuters-21578 and Reuters-27000 corpora, respectively; and (3) in terms of average performance, the proposed Hybrid-WikiBoC outperforms all the other classifiers, achieving a performance increase of 15.56% over the best state-of-the-art approach (LDA) for the largest training sequence. These results indicate that concepts extracted with the help of Wikipedia add useful information that improves classification performance for news items.
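The idea of enriching a BoW representation with concepts can be illustrated with a minimal sketch. It is not the implementation used in the paper: the concept lexicon `TOY_CONCEPTS`, the function names, the `w:`/`c:` feature prefixes, and the choice to merge the two feature sets into a single dictionary are all assumptions made for illustration. In particular, the hand-written lookup table only stands in for a Wikipedia-based semantic annotator.

```python
from collections import Counter
import re

# Hypothetical toy lexicon mapping surface forms to concept identifiers.
# It is an illustrative stand-in for a Wikipedia-based annotator.
TOY_CONCEPTS = {
    "central bank": "Central_bank",
    "interest rate": "Interest_rate",
    "interest rates": "Interest_rate",
    "crude oil": "Petroleum",
    "petroleum": "Petroleum",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Count how often each word occurs in the text."""
    return Counter(tokenize(text))


def bag_of_concepts(text, lexicon=TOY_CONCEPTS, max_len=2):
    """Count concepts, preferring the longest matching phrase at each position."""
    tokens = tokenize(text)
    concepts = Counter()
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                concepts[lexicon[phrase]] += 1
                i += n  # skip the tokens consumed by the matched phrase
                break
        else:
            i += 1  # no concept starts at this token
    return concepts


def hybrid_features(text):
    """Merge word and concept counts into one feature dictionary (assumed scheme)."""
    features = {f"w:{word}": count for word, count in bag_of_words(text).items()}
    features.update(
        {f"c:{concept}": count for concept, count in bag_of_concepts(text).items()}
    )
    return features


if __name__ == "__main__":
    doc = "The central bank raised interest rates as crude oil and petroleum prices rose."
    for name, value in sorted(hybrid_features(doc).items()):
        print(name, value)
```

In this toy example, "crude oil" and "petroleum" map to the same concept (synonymy), and "central bank" is treated as a single unit (multiword term), while the word-level features are kept alongside the concept features. Polysemy is not handled here, since a plain lookup table cannot disambiguate word senses from context.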
