Kernel based gene expression pattern discovery and its application on cancer classification

Association rules have been widely used in gene expression data analysis. However, there is no systematical way to select interesting rules from the millions of rules generated from high dimensional gene expression data. In this study, a kernel density estimation based measurement is proposed to evaluate the interestingness of the association rules. Several pruning strategies are also devised to efficiently discover the approximate top-k interesting patterns. Finally, over-fitting problem of the classification model is addressed by using conditional independence test to eliminate redundant rules. Experimental results show the effectiveness of the proposed interestingness measure and classification model.

[1]  P ? ? ? ? ? ? ? % ? ? ? ? , 1991 .

[2]  Anthony K. H. Tung,et al.  Mining top-K covering rule groups for gene expression data , 2005, SIGMOD '05.

[3]  Ron Kohavi,et al.  A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection , 1995, IJCAI.

[4]  M. Degroot Optimal Statistical Decisions , 1970 .

[5]  P. Grünwald The Minimum Description Length Principle (Adaptive Computation and Machine Learning) , 2007 .

[6]  Anthony K. H. Tung,et al.  Mining frequent closed patterns in microarray data , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[7]  Ah-Hwee Tan,et al.  Predictive neural networks for gene expression data analysis , 2005, Neural Networks.

[8]  Stephen T. C. Wong,et al.  Cancer classification and prediction using logistic regression with Bayesian gene selection , 2004, J. Biomed. Informatics.

[9]  Stefan Kramer,et al.  Analyzing microarray data using quantitative association rules , 2005, ECCB/JBI.

[10]  Daniel Q. Naiman,et al.  Simple decision rules for classifying human cancers from gene expression profiles , 2005, Bioinform..

[11]  Davy Janssens,et al.  Evaluating the performance of cost-based discretization versus entropy- and error-based discretization , 2006, Comput. Oper. Res..

[12]  P. Spirtes,et al.  Causation, prediction, and search , 1993 .

[13]  Jenq-Neng Hwang,et al.  Nonparametric multivariate density estimation: a comparative study , 1994, IEEE Trans. Signal Process..

[14]  Guy Perrière,et al.  Between-group analysis of microarray data , 2002, Bioinform..

[15]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[16]  P. Grünwald The Minimum Description Length Principle (Adaptive Computation and Machine Learning) , 2007 .

[17]  D Haussler,et al.  Knowledge-based analysis of microarray gene expression data by using support vector machines. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[18]  William H. Press,et al.  Book-Review - Numerical Recipes in Pascal - the Art of Scientific Computing , 1989 .

[19]  S. Ramaswamy,et al.  Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. , 2002, Cancer research.

[20]  Jinyan Li,et al.  Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns. , 2002 .

[21]  William H. Press,et al.  The Art of Scientific Computing Second Edition , 1998 .

[22]  Todd,et al.  Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning , 2002, Nature Medicine.

[23]  Anthony K. H. Tung,et al.  FARMER: finding interesting rule groups in microarray datasets , 2004, SIGMOD '04.

[24]  Jorma Rissanen,et al.  The Minimum Description Length Principle in Coding and Modeling , 1998, IEEE Trans. Inf. Theory.

[25]  T. Poggio,et al.  Multiclass cancer diagnosis using tumor gene expression signatures , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[26]  Jason Weston,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2002, Machine Learning.

[27]  Chad Creighton,et al.  Mining gene expression databases for association rules , 2003, Bioinform..

[28]  Hans-Peter Kriegel,et al.  Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering , 2009, TKDD.

[29]  F. A. Seiler,et al.  Numerical Recipes in C: The Art of Scientific Computing , 1989 .

[30]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[31]  Michael I. Jordan,et al.  Feature selection for high-dimensional genomic microarray data , 2001, ICML.

[32]  J. Keene Why is Hu where? Shuttling of early-response-gene messenger RNA subsets. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[33]  Thomas A. Darden,et al.  Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method , 2001, Bioinform..

[34]  Franz von Kutschera,et al.  Causation , 1993, J. Philos. Log..

[35]  Wynne Hsu,et al.  Integrating Classification and Association Rule Mining , 1998, KDD.