MCut: A Thresholding Strategy for Multi-label Classification

Multi-label classification is a frequent task in machine learning, notably in text categorization. When binary classifiers are not suited, an alternative is to use a multiclass classifier that provides a score per category for each document, and then to apply a thresholding strategy to select the set of categories to assign to the document. Common thresholding strategies, such as the RCut, PCut, and SCut methods, need a training step to determine the threshold value. To overcome this limitation, we propose a new strategy, called MCut, which automatically estimates a value for the threshold. This method requires no training and no parameter tuning. Experiments on two text corpora, the XML Mining 2009 and RCV1 collections, show that MCut's results are on par with the state of the art, while MCut remains easy to implement and parameter-free.
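The abstract does not state how a threshold can be estimated without training, so the following is only an illustrative sketch. It contrasts a rank-based cut, which needs a tuned parameter k, with a parameter-free cut that assumes a maximum-gap rule: sort the scores in decreasing order and place the threshold at the midpoint of the largest gap between consecutive scores. The function names (`rank_cut`, `max_gap_cut`), the maximum-gap assumption, and the example scores are hypothetical and are not taken from the text above.

```python
from typing import Dict, List, Tuple


def rank_cut(scores: Dict[str, float], k: int) -> List[str]:
    """Rank-based cut: keep the k top-scoring categories.

    k is a parameter that would normally be tuned on training data.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def max_gap_cut(scores: Dict[str, float]) -> Tuple[List[str], float]:
    """Parameter-free cut (assumed maximum-gap rule).

    Sorts categories by decreasing score, finds the largest gap between
    consecutive scores, and returns the categories above that gap together
    with the threshold placed at the gap's midpoint.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        # With fewer than two scores there is no gap to measure.
        labels = [cat for cat, _ in ranked]
        threshold = ranked[0][1] if ranked else 0.0
        return labels, threshold

    values = [value for _, value in ranked]
    gaps = [values[i] - values[i + 1] for i in range(len(values) - 1)]
    # Index of the largest gap; ties resolve to the earliest position.
    i_max = max(range(len(gaps)), key=gaps.__getitem__)
    threshold = (values[i_max] + values[i_max + 1]) / 2
    labels = [cat for cat, _ in ranked[: i_max + 1]]
    return labels, threshold


# Hypothetical classifier scores for one document.
doc_scores = {
    "sports": 0.9,
    "politics": 0.8,
    "economy": 0.3,
    "science": 0.2,
    "arts": 0.1,
}

top_one = rank_cut(doc_scores, k=1)
selected, threshold = max_gap_cut(doc_scores)
print(top_one)
print(selected, threshold)
```

In this example the largest gap lies between 0.8 and 0.3, so the maximum-gap cut keeps the two categories above it without any tuned parameter, whereas the rank-based cut returns whatever number of categories k dictates.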
