Binary analysis and optimization-based normalization of gene expression data

MOTIVATION: Most approaches to gene expression analysis use real-valued expression data produced by high-throughput screening technologies such as microarrays. Often, some measure of similarity must be computed in order to extract meaningful information from the observed data. The choice of this similarity measure frequently has a profound effect on the results of the analysis, yet no standards exist to guide the researcher.

RESULTS: To address this issue, we propose to analyse gene expression data entirely in the binary domain. There, the natural measure of similarity is the Hamming distance, which reflects the notion of similarity used by biologists. We also develop a novel data-dependent, optimization-based method, built on Genetic Algorithms (GAs), for normalizing gene expression data. Normalization is a necessary step before quantizing gene expression data into the binary domain and, more generally, before comparing data between different arrays. We then present an algorithm for binarizing gene expression data and illustrate the use of these methods on two different data sets. Using Multidimensional Scaling, we show that a reasonable degree of separation between the different tumor types in each data set can be achieved by working solely in the binary domain. The binary approach offers several advantages, such as noise resilience and computational efficiency, making it a viable way to extract meaningful biological information from gene expression data.
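As a rough illustration of working in the binary domain, the sketch below binarizes a small expression matrix and computes pairwise Hamming distances between samples. The abstract does not specify the binarization algorithm, so the per-gene median threshold used here is an assumption made purely for illustration, and the function names (`binarize_by_gene_median`, `hamming`, `pairwise_hamming`) are hypothetical. The GA-based normalization step is not shown; the input is assumed to be already normalized.

```python
from statistics import median


def binarize_by_gene_median(expr):
    """Quantize a samples-by-genes matrix to 0/1.

    ASSUMPTION: a gene is set to 1 in a sample when its value is strictly
    above that gene's median across all samples. This threshold rule is
    illustrative only; it is not the algorithm from the paper.
    """
    n_genes = len(expr[0])
    thresholds = [median(sample[g] for sample in expr) for g in range(n_genes)]
    return [
        [1 if sample[g] > thresholds[g] else 0 for g in range(n_genes)]
        for sample in expr
    ]


def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hamming(binary):
    """Symmetric matrix of Hamming distances between all pairs of samples."""
    return [[hamming(a, b) for b in binary] for a in binary]


if __name__ == "__main__":
    # Made-up toy data: 4 samples (rows) by 3 genes (columns).
    expr = [
        [2.0, 8.0, 1.0],
        [2.5, 7.5, 1.2],
        [9.0, 1.0, 6.0],
        [8.5, 1.5, 5.5],
    ]
    binary = binarize_by_gene_median(expr)
    for row in pairwise_hamming(binary):
        print(row)
```

A distance matrix of this kind is the sort of input a Multidimensional Scaling routine would take to produce a low-dimensional layout of the samples.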
