News Classification from Social Media Using Twitter-based Doc2Vec Model and Automatic Query Expansion

News classification is an essential need for people to organize, better understand, and make use of information from the Internet. This motivates us to propose a novel method for classifying news from social media. First, we vectorize an article with TD2V, our pre-trained Twitter-based universal document representation built following the Doc2Vec approach. We then define a Modified Distance to better measure the semantic distance between two document vectors. Finally, we apply retrieval and automatic query expansion to find the most relevant labeled documents in a corpus and use them to determine the category of a new article. Because TD2V is built from 297 million words in 420,351 news articles collected from more than one million tweets on Twitter between 2010 and 2017, it can serve as an efficient pre-trained model for English document representation in various applications. Experiments on datasets from different online sources show that our method achieves higher classification accuracy than existing methods, specifically 98.4±0.3% (BBC dataset), 98.9±0.7% (BBC Sport dataset), 94.1±0.2% (Amazon4 dataset), and 78.6% (20NewsGroup dataset). Furthermore, training consists only of encoding all articles in the training set with TD2V; no dedicated classification model has to be trained for each dataset.
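To make the pipeline concrete, the following is a minimal sketch of the retrieval-based classification step. It assumes the pre-trained TD2V model is available as a gensim Doc2Vec file (the file name td2v.model is hypothetical), uses cosine distance as a stand-in for the paper's Modified Distance, and implements automatic query expansion as a simple averaging of the query vector with its top-ranked neighbours; none of these implementation details are given in the abstract, so this is an illustration rather than the authors' exact method.

```python
# Hedged sketch of the classification pipeline described above.
# Assumptions (not specified in the abstract): TD2V is loadable as a gensim
# Doc2Vec model, cosine distance stands in for the Modified Distance, and
# query expansion averages the query vector with its nearest neighbours
# before a second retrieval pass.
from collections import Counter

import numpy as np
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("td2v.model")  # hypothetical path to the TD2V model


def embed(tokens):
    """Encode a tokenized article with the pre-trained TD2V model."""
    return model.infer_vector(tokens)


def cosine_distance(u, v):
    """Placeholder for the Modified Distance between two document vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def classify(query_tokens, corpus_vecs, corpus_labels, k=10, n_expand=3):
    """Retrieve labeled documents, expand the query, and vote on a category."""
    q = embed(query_tokens)

    # First retrieval pass: rank labeled documents by distance to the query.
    dists = [cosine_distance(q, v) for v in corpus_vecs]
    ranked = np.argsort(dists)

    # Automatic query expansion: blend the query with its closest neighbours.
    expanded = np.mean([q] + [corpus_vecs[i] for i in ranked[:n_expand]], axis=0)

    # Second retrieval pass with the expanded query, then majority vote.
    dists = [cosine_distance(expanded, v) for v in corpus_vecs]
    top_k = np.argsort(dists)[:k]
    return Counter(corpus_labels[i] for i in top_k).most_common(1)[0][0]
```

In this sketch, corpus_vecs would be produced by calling embed on every tokenized article in the labeled training set, which mirrors the abstract's point that training amounts to encoding the corpus with TD2V rather than fitting a dedicated classifier per dataset.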
