A Thorough Evaluation of Distance-Based Meta-Features for Automated Text Classification

We address the problem of automatically learning to classify texts by exploiting meta-features, i.e., features derived from the original bag-of-words representation. Specifically, we provide an in-depth analysis of the recently proposed distance-based meta-features, a data engineering technique that relies on the distances between documents to transform the original feature space into a new one that is potentially smaller and more informative. Despite its potential, the meta-feature space may be unnecessarily complex and high-dimensional, which raises the risk of overfitting, limits the application of meta-features in different contexts, and increases computational costs. In this work, we propose the use of multi-objective strategies to reduce the number of meta-features while maximizing classification effectiveness, taking into account how well the selected meta-features suit a particular dataset or classification method. We present effective and efficient proposals for meta-feature selection that reduce the number of meta-features by up to 89 percent while maintaining or improving classification effectiveness, a result that none of the evaluated baselines achieves. We also use our selection strategies as evaluation tools to analyze different combinations of meta-features. We found very compact combinations of meta-features that achieve high classification effectiveness in most datasets, despite their peculiarities.
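To make the two ideas above concrete, the following is a minimal sketch and not the evaluated implementation. It assumes cosine distance, a hypothetical per-class layout of k nearest-neighbor distances plus one centroid distance, and a plain Pareto-dominance filter over hypothetical (number of meta-features, error) pairs in place of any particular multi-objective search algorithm. All function names and parameters are illustrative.

```python
import numpy as np

def cosine_distances(x, M):
    """Cosine distance between vector x and each row of matrix M."""
    denom = np.maximum(np.linalg.norm(x) * np.linalg.norm(M, axis=1), 1e-12)
    return 1.0 - (M @ x) / denom  # the guard above avoids division by zero

def distance_meta_features(x, X_train, y_train, k=2):
    """Map x to, per class, its k smallest neighbor distances and a centroid distance.

    Assumed layout (illustrative only): the output has n_classes * (k + 1) entries.
    """
    feats = []
    for c in np.unique(y_train):
        docs = X_train[y_train == c]
        d = np.sort(cosine_distances(x, docs))[:k]
        # Pad with the value 1.0 if the class has fewer than k documents.
        d = np.pad(d, (0, k - len(d)), constant_values=1.0)
        centroid = docs.mean(axis=0, keepdims=True)
        feats.extend(d)
        feats.append(cosine_distances(x, centroid)[0])
    return np.array(feats)

def pareto_front(candidates):
    """Return the (n_features, error) pairs not dominated by any other pair.

    Both objectives are minimized; minimizing error stands in for
    maximizing classification effectiveness.
    """
    front = []
    for i, (n_i, e_i) in enumerate(candidates):
        dominated = any(
            n_j <= n_i and e_j <= e_i and (n_j < n_i or e_j < e_i)
            for j, (n_j, e_j) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append((n_i, e_i))
    return front

# Toy bag-of-words data: 4 training documents, 3 terms, 2 classes.
X_train = np.array([[1.0, 0.0, 0.0],
                    [1.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0]])
y_train = np.array([0, 0, 1, 1])
x = np.array([1.0, 0.0, 0.0])

meta = distance_meta_features(x, X_train, y_train, k=2)
front = pareto_front([(10, 0.2), (5, 0.2), (5, 0.3), (2, 0.4)])
```

In this toy setting, the three-term document is re-expressed as six distance values, and the dominance filter keeps only those candidate subsets for which no other candidate is at least as good on both size and error. A selection strategy of the kind described above would search over subsets of the meta-features and report such a front.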
