Identifying simple discriminatory gene vectors with an information theory approach

In feature selection for cancer classification problems, many existing methods consider genes individually, choosing the top-ranked genes with the most significant signal-to-noise statistic or correlation coefficient. However, the information about the class distinction provided by such genes may overlap heavily, since their gene expression patterns are similar. The redundancy of including many genes with similar expression patterns results in highly complex classifiers. According to the principle of Occam's razor, simple models are preferable to complex ones if they produce comparable prediction performance. In this paper, we introduce a new method to learn accurate, low-complexity classifiers from gene expression profiles. In our method, we use mutual information to measure the relation between a set of genes, called a gene vector, and the class attribute of the samples. Gene vectors lie in higher-dimensional spaces than individual genes; they are therefore more diverse and can contain more information than individual genes. Hence, gene vectors are preferable to individual genes for describing the class distinctions between samples, since they contain more information about the class attribute. We validate our method on three gene expression profiles. Compared with results from the literature and with other well-known classification methods, our method achieves better or comparable prediction performance while using lower-complexity models.
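As a rough illustration of why a gene vector can carry more class information than its individual genes, the sketch below computes the mutual information I(X; Y) = H(X) + H(Y) - H(X, Y) between a tuple of genes X and the class attribute Y. It is not the method of the paper: it assumes expression values have already been discretized, and the toy data and the names `entropy` and `mutual_information` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def mutual_information(rows, classes, gene_idx):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), where X is the tuple of selected genes."""
    x = [tuple(row[i] for i in gene_idx) for row in rows]
    y = list(classes)
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


# Hypothetical toy data: 4 samples, 2 discretized genes.
# The class label is the XOR of the two genes, so neither gene
# is informative alone, but the pair determines the class.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
classes = [0, 1, 1, 0]

mi_gene0 = mutual_information(rows, classes, [0])
mi_gene1 = mutual_information(rows, classes, [1])
mi_pair = mutual_information(rows, classes, [0, 1])

print(mi_gene0, mi_gene1, mi_pair)
```

In this toy case each gene alone shares no information with the class, while the two-gene vector shares one full bit, which equals the entropy of the class attribute. A ranking of individual genes would miss the pair entirely.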
