Active Learning with Support Vector Machine Applied to Gene Expression Data for Cancer Classification

There is growing interest in the application of machine learning techniques in bioinformatics. The supervised machine learning approach has been widely applied to bioinformatics and gained a lot of success in this research area. With this learning approach researchers first develop a large training set, which is a time-consuming and costly process. Moreover, the proportion of the positive examples and negative examples in the training set may not represent the real-world data distribution, which causes concept drift. Active learning avoids these problems. Unlike most conventional learning methods where the training set used to derive the model remains static, the classifier can actively choose the training data and the size of training set increases. We introduced an algorithm for performing active learning with support vector machine and applied the algorithm to gene expression profiles of colon cancer, lung cancer, and prostate cancer samples. We compared the classification performance of active learning with that of passive learning. The results showed that employing the active learning method can achieve high accuracy and significantly reduce the need for labeled training instances. For lung cancer classification, to achieve 96% of the total positives, only 31 labeled examples were needed in active learning whereas in passive learning 174 labeled examples were required. That meant over 82% reduction was realized by active learning. In active learning the areas under the receiver operating characteristic (ROC) curves were over 0.81, while in passive learning the areas under the ROC curves were below 0.50.

[1]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[2]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[3]  Thomas L. Madden,et al.  Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. , 2001, Nucleic acids research.

[4]  Nello Cristianini,et al.  Support vector machine classification and validation of cancer tissue samples using microarray expression data , 2000, Bioinform..

[5]  William A. Gale,et al.  A sequential algorithm for training text classifiers , 1994, SIGIR '94.

[6]  David R. Gilbert,et al.  An Empirical Comparison of Supervised Machine Learning Techniques in Bioinformatics , 2003, APBC.

[7]  Michael Gribskov,et al.  Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching , 1996, Comput. Chem..

[8]  George Forman Incremental Machine Learning to Reduce Biochemistry Lab Costs in the Search for Drug Discovery , 2002, BIOKDD.

[9]  Ron Kohavi,et al.  The Case against Accuracy Estimation for Comparing Induction Algorithms , 1998, ICML.

[10]  J. Goo,et al.  Receiver Operating Characteristic (ROC) Curve: Practical Review for Radiologists , 2004, Korean journal of radiology.

[11]  S. Ramaswamy,et al.  Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. , 2002, Cancer research.

[12]  W. J. Studden,et al.  Theory Of Optimal Experiments , 1972 .

[13]  Huiqing Liu,et al.  A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns. , 2002, Genome informatics. International Conference on Genome Informatics.

[14]  Daphne Koller,et al.  Support Vector Machine Active Learning with Applications to Text Classification , 2000, J. Mach. Learn. Res..

[15]  Niall M. Adams,et al.  Comparing classifiers when the misallocation costs are uncertain , 1999, Pattern Recognit..

[16]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[17]  Aik Choon Tan,et al.  Ensemble machine learning on gene expression data for cancer classification. , 2003, Applied bioinformatics.

[18]  Yoonkyung Lee,et al.  Classification of Multiple Cancer Types by Multicategory Support Vector Machines Using Gene Expression Data , 2003, Bioinform..

[19]  A. Fasoli [Clinical decision analysis]. , 1986, Annali italiani di medicina interna : organo ufficiale della Societa italiana di medicina interna.

[20]  M. Zweig,et al.  Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. , 1993, Clinical chemistry.

[21]  David G. Stork,et al.  Pattern Classification , 1973 .

[22]  R. Chang,et al.  Support vector machines for diagnosis of breast tumors on US images. , 2003, Academic radiology.

[23]  Ying Liu,et al.  A Comparative Study on Feature Selection Methods for Drug Discovery , 2004, J. Chem. Inf. Model..

[24]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[25]  Peter A. Flach,et al.  Learning Decision Trees Using the Area Under the ROC Curve , 2002, ICML.

[26]  E. Lander,et al.  Gene expression correlates of clinical prostate cancer behavior. , 2002, Cancer cell.

[27]  Thorsten Joachims,et al.  Transductive Inference for Text Classification using Support Vector Machines , 1999, ICML.

[28]  Geoffrey I. Webb,et al.  On the Application of ROC Analysis to Predict Classification Performance Under Varying Class Distributions , 2005, Machine Learning.

[29]  Manabu Sassano,et al.  An Empirical Study of Active Learning with Support Vector Machines for Japanese Word Segmentation , 2002, ACL.

[30]  Feng Luan,et al.  Diagnosing Breast Cancer Based on Support Vector Machines , 2003, J. Chem. Inf. Comput. Sci..

[31]  Jean-Claude Chevalier 1. Que et quelles sont les grammaires scientifiques du français au xxe siècle , 2000 .