Knowledge-enhanced document embeddings for text classification

Abstract Accurate semantic representation models are essential in text mining applications. For the text mining process to succeed, the adopted text representation must preserve the patterns to be discovered. Although the traditional bag-of-words model can achieve competitive results in automatic text classification, it cannot provide satisfactory classification performance in hard settings where richer text representations are required. In this paper, we present an approach to representing document collections based on embedded representations of words and word senses. We bring together the power of word sense disambiguation and the semantic richness of word and word-sense embedded vectors to construct embedded representations of document collections. Our approach results in semantically enhanced, low-dimensional representations. We overcome the lack of interpretability of embedded vectors, a drawback of this kind of representation, by using word-sense embedded vectors. Moreover, the experimental evaluation indicates that the proposed representations yield stable classifiers with strong quantitative results, especially in semantically complex classification scenarios.
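
As a rough illustration of what a document embedding built from word or word-sense vectors can look like, the sketch below averages the vectors of the tokens found in an embedding table. This is only one simple way to combine vectors and is an assumption here, since the abstract does not state the aggregation used. The function name `document_embedding`, the toy table `TOY_EMBEDDINGS`, its values, and the sense identifier `bank#finance` are all hypothetical.

```python
from typing import Dict, List, Sequence

# Toy embedding table with made-up 2-dimensional values (hypothetical).
# Keys may be plain words or sense identifiers produced by a
# word sense disambiguation step (the "#finance" suffix is illustrative).
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "bank": [1.0, 0.0],
    "river": [0.0, 1.0],
    "bank#finance": [2.0, 0.0],
}


def document_embedding(
    tokens: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Average the vectors of all tokens that appear in the embedding table.

    Tokens without a vector are skipped; a document with no known
    tokens maps to the zero vector.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Word-level document: the unknown token is ignored.
word_doc = document_embedding(["bank", "river", "unknown"], TOY_EMBEDDINGS, dim=2)

# Sense-level document: the ambiguous word has been replaced by a sense identifier.
sense_doc = document_embedding(["bank#finance", "river"], TOY_EMBEDDINGS, dim=2)
```

The resulting fixed-length vectors can be passed to any standard classifier. Using sense identifiers as keys keeps each vector tied to a named concept, which is the kind of interpretability the abstract refers to.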
