NetClass: A network-based relational model for document classification

Abstract

To handle the complexity inherent in human textual communication, Automatic Document Classification (ADC) methods often adopt several simplifications. One such simplification is to treat the terms that compose a document as independent, which may hide important relationships between them. These relationships can encapsulate non-trivial patterns that improve classification effectiveness. In this work, we propose NetClass, a new network-based model for documents that explicitly considers term relationships, and introduce a family of relational algorithms for ADC, such as the LRN-WRN classifier, a lazy relational ADC algorithm that exploits not only relationships between terms but also neighborhood information. As our extensive experimental evaluation shows, the proposed LRN-WRN achieves competitive performance when compared to the state of the art in ADC, including SVM, across seven distinct domains. More specifically, LRN-WRN outperforms state-of-the-art classifiers in 5 out of 7 domains and ranks among the top two classifiers in all assessed domains. Our evaluation highlights the high effectiveness of our proposal, as well as its efficiency in terms of runtime. Besides effectiveness and efficiency, the simplicity of our proposal and the absence of complex parameter tuning are key characteristics that make our algorithms interesting alternatives for ADC. In particular, as highlighted by our experimental evaluation, LRN-WRN is a promising alternative for dynamic domains with a huge volume of short texts (e.g., social media content) or with several classes.
