Optimal Aggregation of Binary Classifiers for Multiclass Cancer Diagnosis Using Gene Expression Profiles

Multiclass classification is a fundamental task in bioinformatics and typically arises in cancer diagnosis studies based on gene expression profiling. Many studies have aggregated binary classifiers to construct a multiclass classifier using one-versus-the-rest (1R), one-versus-one (11), or other coding strategies, and several have compared these strategies. However, these comparisons found that the best coding depends on the situation. This raises a new problem, which we call the “optimal coding problem”: how can we determine which coding is optimal in each situation? To approach this problem, we propose a novel framework for constructing a multiclass classifier in which each binary classifier to be aggregated has a weight that is tuned optimally based on the observed data. Although there is no a priori answer to the optimal coding problem, our weight tuning method can serve as a consistent answer to it. We apply this method to various classification problems, including a synthesized data set and several cancer diagnosis data sets from gene expression profiling. The results demonstrate that, in most situations, our method improves classification accuracy over simple voting heuristics and is better than or comparable to state-of-the-art multiclass predictors.
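To make the idea of weighted aggregation concrete, the following is a minimal sketch, not the method of this paper. It assumes a code matrix that stacks 1R and 11 columns, real-valued outputs from already-trained binary classifiers, and a weight per binary classifier that is simply given rather than tuned. The function names `build_code_matrix` and `weighted_decode` and the scoring rule (a weighted sum of code entry times classifier output) are illustrative assumptions.

```python
from itertools import combinations


def build_code_matrix(n_classes):
    """Return a code matrix with one row per class (hypothetical helper).

    Columns are the one-versus-the-rest codes followed by the
    one-versus-one codes. Entries: +1 (positive class), -1 (negative
    class), 0 (class not used by that binary classifier).
    """
    columns = []
    # One-versus-the-rest: class k against all other classes.
    for k in range(n_classes):
        columns.append([1 if c == k else -1 for c in range(n_classes)])
    # One-versus-one: class a against class b, other classes ignored.
    for a, b in combinations(range(n_classes), 2):
        columns.append(
            [1 if c == a else -1 if c == b else 0 for c in range(n_classes)]
        )
    # Transpose so that each row corresponds to one class.
    return [[col[c] for col in columns] for c in range(n_classes)]


def weighted_decode(code_matrix, outputs, weights):
    """Aggregate binary classifier outputs into a multiclass decision.

    Assumed scoring rule: the score of a class is the weighted sum of
    (code entry * binary output). Returns (predicted class, scores).
    """
    scores = [
        sum(w * m * f for w, m, f in zip(weights, row, outputs))
        for row in code_matrix
    ]
    predicted = max(range(len(scores)), key=scores.__getitem__)
    return predicted, scores


if __name__ == "__main__":
    M = build_code_matrix(3)
    # Made-up outputs of the six binary classifiers for one sample.
    f = [0.2, 0.5, -0.9, -0.1, 0.8, 0.7]
    print(weighted_decode(M, f, [1.0] * 6))        # uniform weights
    print(weighted_decode(M, f, [0, 0, 0, 0, 1, 0]))  # trust one classifier
```

With uniform weights this reduces to a simple voting-style heuristic over all binary classifiers. Changing the weights can change the predicted class, which is why choosing them from the observed data matters.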
