Supervised Learning: Classification

Supervised learning is the task of building a model from labeled observations so that it fits the available data. Within supervised learning, classification is one of the most studied problems. Given a set of two or more predefined class labels and a set of available observations, the aim is to build a model, based on the features of the observations, that assigns each observation to its corresponding class. In bioinformatics, several problems can be formulated as classification tasks. This article introduces several supervised learning techniques that are commonly used to address classification problems and presents the most widely used measures for evaluating the performance of a classification model.
