A Transformation Approach Towards Big Data Multilabel Decision Trees

A large amount of the data processed nowadays is multilabel in nature, meaning that each pattern usually belongs to several categories at once. Multilabel data are abundant, and most multilabel datasets are quite large, so many multilabel classification methods struggle to process them. Tackling this task with big data methods seems a logical choice, yet this approach has scarcely been explored so far. The present work introduces several big data multilabel classifiers, all of them based on decision trees. Their design is detailed, and their predictive performance and execution time are analyzed.
