Pattern Recognition Techniques in Microarray Data Analysis

Abstract: Recent technologies such as microarrays can produce massive amounts of genetic data, which has highlighted the need for new pattern recognition techniques that can mine large data sets for biologically meaningful knowledge. Many researchers have begun to devise such data-mining techniques, and survey articles are needed to periodically review and summarize the work done in the area. This article presents one such survey. The first portion of the paper provides the basic biology, mostly for non-biologists, that such a project requires. It is meant only as a starting point for experts in technical fields who wish to enter this new area of bioinformatics. The second portion surveys various data-mining techniques that have been used to mine microarray data for biological knowledge and information (such as sequence information). The survey is not meant to be complete in any form, since the area is currently one of the most active and the body of research is very large. Furthermore, the applications mentioned here are not meant to be taken as the most significant applications of these techniques, but simply as examples among many.
