Selecting maximally informative genes

Microarray experiments are emerging as one of the main driving forces in modern biology. By allowing the simultaneous monitoring of expression across the entire genome of a given organism, array experiments provide tremendous insight into the fundamental biological processes that translate genetic information. One of the major challenges is to identify computationally efficient and biologically meaningful analysis approaches that extract the most informative and unbiased components of the microarray data. This process is complicated by the number of uncertainties associated with array experiments. The assumption that a unique descriptive computational model exists therefore needs to be challenged. In this paper, we introduce a framework that integrates machine learning and optimization techniques for the selection of maximally informative genes in microarray expression experiments. The fundamental premise of the approach is that maximally informative genes are the ones that lead to the least complex descriptive and predictive models. We propose a methodology, based on decision trees, which identifies ensembles of groups of maximally informative genes. We raise a number of computational issues that need to be comprehensively addressed and illustrate the approach by analyzing recently published microarray experimental data.
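The premise that informative genes yield the least complex models can be illustrated with a minimal sketch. It is not the method of the paper: it assumes a greedy threshold tree with Gini impurity, takes the number of leaves as the complexity measure, and searches gene subsets exhaustively, which is only feasible for toy data. The function names (`gini`, `count_leaves`, `rank_gene_subsets`) and the expression values are hypothetical.

```python
from itertools import combinations

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def count_leaves(rows, labels, genes):
    """Grow a greedy threshold tree on the chosen gene columns until every
    leaf is pure, and return the number of leaves (assumed complexity measure).
    Returns infinity if the chosen genes cannot separate the samples."""
    if len(set(labels)) <= 1:
        return 1
    best = None  # (weighted impurity, gene index, threshold)
    for g in genes:
        values = sorted(set(r[g] for r in rows))
        # Candidate thresholds: every distinct value except the largest,
        # so both sides of the split are always non-empty.
        for t in values[:-1]:
            left = [y for r, y in zip(rows, labels) if r[g] <= t]
            right = [y for r, y in zip(rows, labels) if r[g] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, g, t)
    if best is None:
        return float("inf")  # mixed labels but no usable split
    _, g, t = best
    left_idx = [i for i, r in enumerate(rows) if r[g] <= t]
    right_idx = [i for i, r in enumerate(rows) if r[g] > t]
    return (count_leaves([rows[i] for i in left_idx], [labels[i] for i in left_idx], genes)
            + count_leaves([rows[i] for i in right_idx], [labels[i] for i in right_idx], genes))

def rank_gene_subsets(rows, labels, n_genes, subset_size, keep=3):
    """Score every gene subset of the given size by tree size and return the
    `keep` simplest ones as (leaf count, gene indices) pairs."""
    scored = [(count_leaves(rows, labels, genes), genes)
              for genes in combinations(range(n_genes), subset_size)]
    scored.sort()
    return scored[:keep]

# Made-up expression values: 6 samples (rows) by 4 genes (columns).
rows = [
    [0.1, 0.9, 0.3, 0.5],
    [0.2, 0.1, 0.8, 0.4],
    [0.3, 0.5, 0.2, 0.9],
    [0.8, 0.8, 0.7, 0.1],
    [0.9, 0.2, 0.1, 0.6],
    [0.7, 0.4, 0.9, 0.3],
]
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

ranking = rank_gene_subsets(rows, labels, n_genes=4, subset_size=1)
print(ranking)
```

Returning several low-complexity subsets rather than a single winner loosely mirrors the idea of an ensemble of gene groups. At realistic gene counts, exhaustive enumeration would have to give way to a heuristic search.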
