Tens-Embedding: A tensor-based document embedding method

Abstract A human can understand and classify a text directly, but a computer can grasp the underlying semantics of a text only when it is represented in a machine-comprehensible form. Text representation is therefore a fundamental stage in natural language processing (NLP). A main drawback of existing text representation approaches is that they exploit only one aspect, or view, of a text: for example, they consider only the words of a text, although topic information can be extracted from it as well. The term-document matrix and the document-topic matrix are two views of a text collection that carry complementary information, and we exploit the strengths of both to extract a richer representation. In this paper, we propose three text representation methods that combine these two matrices through tensor factorization. The proposed approach (Tens-Embedding) was evaluated on text classification, sentence-level and document-level sentiment analysis, and text clustering; experiments on the 20newsgroups, R52, R8, MR and IMDB datasets indicate its superiority over other document embedding techniques.
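The core idea of combining two document views through a tensor factorization can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the paper's exact algorithm: it assumes the two views are stacked into a 3-way tensor (documents × features × views) and factorized with a rank-R CP decomposition via alternating least squares (ALS), after which the rows of the document-mode factor serve as embeddings. The function names (`khatri_rao`, `cp_als`) and the synthetic data are hypothetical.

```python
import numpy as np

def khatri_rao(X, Y):
    """Column-wise Kronecker product of X (m, R) and Y (n, R) -> (m*n, R)."""
    m, R = X.shape
    n, _ = Y.shape
    return (X[:, None, :] * Y[None, :, :]).reshape(m * n, R)

def cp_als(T, rank, n_iter=300, seed=0):
    """Rank-R CP decomposition of a 3-way tensor T via alternating least squares.

    Returns factor matrices A (I, R), B (J, R), C (K, R) such that
    T[i, j, k] ~= sum_r A[i, r] * B[j, r] * C[k, r].
    """
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    # Mode-n unfoldings (C-order): T0[i, j*K + k] = T1[j, i*K + k] = T2[k, i*J + j] = T[i, j, k]
    T0 = T.reshape(I, J * K)
    T1 = np.moveaxis(T, 1, 0).reshape(J, I * K)
    T2 = np.moveaxis(T, 2, 0).reshape(K, I * J)
    for _ in range(n_iter):
        A = T0 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = T1 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = T2 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Toy stand-in for a "two-view" tensor: 20 documents, 8 shared latent features,
# 2 views (e.g. a term-based view and a topic-based view projected to a common size).
rng = np.random.default_rng(1)
I, J, K, R = 20, 8, 2, 3
A_true = rng.standard_normal((I, R))
B_true = rng.standard_normal((J, R))
C_true = rng.standard_normal((K, R))
T = np.einsum('ir,jr,kr->ijk', A_true, B_true, C_true)

A, B, C = cp_als(T, rank=R)
T_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
rel_err = np.linalg.norm(T - T_hat) / np.linalg.norm(T)
# Rows of A are the per-document embeddings; B and C weight features and views.
```

In this sketch the view-mode factor C acts as a learned per-view weighting, which is one way a tensor model can let the term and topic views inform a single shared document representation.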
