Efficient classification of multi-labeled text streams by clashing

We present a method for classifying multi-labeled text documents that is explicitly designed for data stream applications, which must process a virtually infinite sequence of data using constant memory and constant processing time. The method has two components: an online procedure that efficiently maps text into a low-dimensional feature space, and a partition of this space into a set of regions, for each of which the system extracts and keeps statistics used to predict multi-label text annotations. Documents are fed into the system as a sequence of words, mapped to a region of the partition, and annotated using the statistics computed from the labeled instances that collide in the same region. We refer to this approach as clashing. We evaluate the method on real-world text data and compare the results with those obtained using other text classifiers. We also analyze the effect of the dimensionality of the representation space on the predictive performance of the system. Our results show that the online embedding approximates the geometry of the full corpus-wise TF and TF-IDF space. The model obtains F-measures competitive with those of the most accurate methods while using significantly fewer computational resources, and it achieves a higher macro-averaged F-measure than methods with similar running time. Furthermore, the system learns faster than the other methods from partially labeled streams.
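
The pipeline described above (online embedding, region lookup, per-region label statistics) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the embedding, the partition, or the prediction rule, so the sketch assumes signed feature hashing for the embedding, the sign pattern of the hashed vector as the region, and a relative-frequency threshold for prediction. The names `embed`, `region`, `ClashingSketch`, `DIM`, and `threshold` are hypothetical.

```python
import hashlib
from collections import defaultdict

DIM = 16  # assumed dimensionality of the hashed feature space


def _hash(token, salt):
    """Deterministic 64-bit hash of a token (salted so index and sign differ)."""
    digest = hashlib.md5((salt + token).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def embed(words, dim=DIM):
    """Assumed embedding: signed feature hashing, one pass over the words."""
    vec = [0.0] * dim
    for w in words:
        idx = _hash(w, "idx") % dim
        sign = 1.0 if _hash(w, "sign") % 2 == 0 else -1.0
        vec[idx] += sign
    return vec


def region(vec):
    """Assumed partition: the sign pattern (orthant) of the embedded vector."""
    return tuple(v >= 0 for v in vec)


class ClashingSketch:
    """Keeps label counts per region; memory is bounded by the number of regions."""

    def __init__(self, dim=DIM, threshold=0.5):
        self.dim = dim
        self.threshold = threshold  # assumed rule: minimum relative label frequency
        self.doc_counts = defaultdict(int)
        self.label_counts = defaultdict(lambda: defaultdict(int))

    def learn(self, words, labels):
        """Update the statistics of the region in which this labeled document lands."""
        key = region(embed(words, self.dim))
        self.doc_counts[key] += 1
        for label in labels:
            self.label_counts[key][label] += 1

    def predict(self, words):
        """Return labels whose relative frequency in the colliding region meets the threshold."""
        key = region(embed(words, self.dim))
        n = self.doc_counts.get(key, 0)
        if n == 0:
            return set()  # no labeled document has collided in this region yet
        return {
            label
            for label, count in self.label_counts[key].items()
            if count / n >= self.threshold
        }


model = ClashingSketch()
model.learn(["stock", "market", "falls"], {"finance"})
model.learn(["stock", "market", "falls"], {"finance", "news"})
predicted = model.predict(["stock", "market", "falls"])
```

In this sketch each document costs one pass over its words plus one dictionary lookup, and the stored statistics grow with the number of occupied regions rather than with the length of the stream, which is the property the abstract emphasizes.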
