Multilabel Text Categorization Based on Fuzzy Relevance Clustering

We propose a fuzzy-based method for multilabel text classification, in which a document can belong to one or more categories. In text categorization, the number of features involved is usually huge, which leads to the curse of dimensionality. In addition, a category can be a nonconvex region formed by the union of several overlapping or disjoint subregions. An automatic classification system may therefore suffer from large memory requirements or poor performance. By incorporating fuzzy techniques, the proposed method overcomes these issues. A fuzzy relevance measure transforms high-dimensional documents into low-dimensional fuzzy relevance vectors, which avoids the curse of dimensionality. A clustering technique then divides the relevance space into a collection of subregions, which are combined to make up individual categories. This allows complex and nonconvex regions to be created. A number of experiments are presented to show the effectiveness of the proposed method in terms of both performance and speed.
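
The abstract does not state the exact form of the fuzzy relevance measure, so the following is only a minimal sketch of the dimensionality-reduction step under stated assumptions. It assumes that the relevance of a term to a category is the share of that term's occurrences found in documents labeled with the category, and that a document's relevance vector is the count-weighted average of its terms' relevances. The function names `term_relevance` and `relevance_vector`, and the toy data in the usage note, are hypothetical and not taken from the paper. The clustering step that partitions the relevance space is not shown.

```python
from collections import defaultdict


def term_relevance(docs, labels, num_categories):
    """Assumed measure: relevance of term t to category c is the share of
    t's total count that occurs in documents labeled with c."""
    in_cat = defaultdict(lambda: [0.0] * num_categories)
    total = defaultdict(float)
    for counts, cats in zip(docs, labels):
        for term, n in counts.items():
            total[term] += n
            for c in cats:  # a multilabel document counts toward each of its labels
                in_cat[term][c] += n
    return {t: [v / total[t] for v in in_cat[t]] for t in total}


def relevance_vector(counts, relevance, num_categories):
    """Map a bag-of-words document to a vector with one entry per category:
    the count-weighted average of the relevances of its known terms."""
    vec = [0.0] * num_categories
    weight = 0.0
    for term, n in counts.items():
        if term in relevance:  # terms unseen in training are skipped
            weight += n
            for c in range(num_categories):
                vec[c] += n * relevance[term][c]
    return [v / weight for v in vec] if weight else vec
```

With a toy training set such as `docs = [{"goal": 2, "match": 1}, {"vote": 3}, {"match": 1, "vote": 1}]` and `labels = [{0}, {1}, {0, 1}]`, calling `relevance_vector({"goal": 1, "vote": 1}, term_relevance(docs, labels, 2), 2)` yields a 2-dimensional vector regardless of vocabulary size, which is the point of the transformation: the dimension of the new space equals the number of categories rather than the number of terms.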
