Latent Dirichlet Allocation complement in the vector space model for Multi-Label Text Classification

In text classification, one of the main problems is choosing which features give the best results. Various features can be used, such as words, n-grams, and syntactic n-grams of various types (POS tags, dependency relations, mixed, etc.), or combinations of these features. Dimensionality reduction algorithms, such as Latent Dirichlet Allocation (LDA), can also be applied to these feature sets. In this paper, we consider the multi-label text classification task and apply various feature sets to a subset of multi-labeled files from the Reuters-21578 corpus. We use traditional tf-idf values of the features and try both keeping and ignoring stop words. We also try several combinations of features, such as bigrams and unigrams, and experiment with adding LDA results to the vector space model as new features. These last experiments obtained the best results.
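The idea of adding LDA results to a vector space model as new features can be illustrated with a minimal sketch. It is not the authors' implementation: the toy documents, the function names `tfidf_vectors` and `lda_topic_proportions`, the number of topics, and the hyperparameters `alpha` and `beta` are all assumptions chosen for illustration, and the collapsed Gibbs sampler stands in for whatever LDA implementation was actually used. Each document's tf-idf vector is extended with that document's inferred topic proportions.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs, vocab):
    """Return one tf-idf vector per document (raw tf times log(N / df))."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vectors


def lda_topic_proportions(docs, vocab, n_topics, n_iter=200,
                          alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic proportions.

    alpha, beta, n_iter, and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    word_id = {t: i for i, t in enumerate(vocab)}
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_vocab for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for term in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[term]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, term in enumerate(doc):
                w = word_id[term]
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed topic proportions; each row sums to 1.
    return [
        [(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]


# Toy tokenized documents (hypothetical, not taken from Reuters-21578).
docs = [
    "wheat grain export tonnes".split(),
    "grain wheat harvest crop".split(),
    "oil crude barrel price".split(),
    "crude oil export price".split(),
]
vocab = sorted({t for doc in docs for t in doc})

tfidf = tfidf_vectors(docs, vocab)
topics = lda_topic_proportions(docs, vocab, n_topics=2)

# Append the topic proportions to each tf-idf vector as extra features.
features = [tv + tp for tv, tp in zip(tfidf, topics)]
```

The resulting `features` rows have one column per vocabulary term plus one per topic and would then be passed to a multi-label classifier. That step is omitted here because the abstract does not name the classifier used.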
