Kernelized Information-Theoretic Metric Learning for Cancer Diagnosis Using High-Dimensional Molecular Profiling Data

With the advancement of genome-wide monitoring technologies, molecular expression data have become widely used for diagnosing cancer from tumor or blood samples. When mining molecular signature data, comparing samples through an adaptive distance function is fundamental but difficult, because such datasets are typically heterogeneous and high dimensional. In this article, we present kernelized information-theoretic metric learning (KITML) algorithms that optimize a distance function for the cancer diagnosis problem and scale to high dimensionality. By implicitly learning a nonlinear transformation of the input space through kernelization, KITML permits efficient optimization, low storage requirements, and improved learning of the distance metric. We propose two novel applications of KITML for diagnosing cancer using high-dimensional molecular profiling data: (1) for sample-level cancer diagnosis, the learned metric is used to improve the performance of k-nearest neighbor classification; and (2) for estimating the severity level or stage of a group of samples, we propose a novel set-based ranking approach that extends KITML. For the sample-level cancer classification task, we evaluated KITML on 14 cancer gene microarray datasets and compared it with eight other state-of-the-art approaches. The results show that our approach achieves the best overall performance on the task of molecular-expression-driven cancer sample diagnosis. For group-level cancer stage estimation, we tested the proposed set-KITML approach on three multi-stage cancer microarray datasets, and it correctly estimated the stages of the sample groups in all three studies.
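To make the idea of learning a distance through a kernel matrix concrete, the following is a minimal sketch of the standard kernel-form update from information-theoretic metric learning (Bregman projections onto pairwise constraints). It is not the authors' KITML implementation: the function names, the thresholds `u` and `l`, the slack weight `gamma`, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np


def pair_distance(K, i, j):
    """Squared distance between samples i and j induced by kernel matrix K."""
    return K[i, i] + K[j, j] - 2.0 * K[i, j]


def kernel_itml(K0, similar, dissimilar, u=1.0, l=1.0, gamma=1.0, sweeps=50):
    """Hypothetical helper: adjust kernel matrix K0 so that similar pairs
    move toward distance <= u and dissimilar pairs toward distance >= l.
    gamma trades constraint satisfaction against staying close to K0."""
    K = np.array(K0, dtype=float)
    # Each constraint: (i, j, delta, target); delta = +1 similar, -1 dissimilar.
    constraints = [(i, j, 1.0, u) for i, j in similar]
    constraints += [(i, j, -1.0, l) for i, j in dissimilar]
    lam = np.zeros(len(constraints))                      # dual variables
    xi = np.array([c[3] for c in constraints], dtype=float)  # slack targets
    for _ in range(sweeps):
        for c, (i, j, delta, _) in enumerate(constraints):
            p = pair_distance(K, i, j)
            if p <= 1e-12:
                continue  # coincident points: no projection possible
            alpha = min(lam[c], 0.5 * delta * (1.0 / p - gamma / xi[c]))
            beta = delta * alpha / (1.0 - delta * alpha * p)
            xi[c] = gamma * xi[c] / (gamma + delta * alpha * xi[c])
            lam[c] -= alpha
            w = K[:, i] - K[:, j]        # K (e_i - e_j); K is symmetric
            K += beta * np.outer(w, w)   # rank-one update of the kernel
    return K


# Toy data (assumed): three samples, linear kernel as the starting point.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 0.5]])
K0 = X @ X.T
K = kernel_itml(K0, similar=[(0, 1)], dissimilar=[(0, 2)])
print(pair_distance(K0, 0, 1), "->", pair_distance(K, 0, 1))
print(pair_distance(K0, 0, 2), "->", pair_distance(K, 0, 2))
```

In this toy case the similar pair (0, 1) is pulled closer and the dissimilar pair (0, 2) is pushed apart. Because every update touches only the n-by-n kernel matrix, the cost depends on the number of samples rather than the number of genes, which is the property that makes a kernelized formulation attractive for high-dimensional expression data. The resulting pairwise distances could then be passed to a k-nearest neighbor classifier that accepts precomputed distances.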
