Bioinformatics: Organisms from Venus, Technology from Jupiter, Algorithms from Mars

In this paper, we discuss data sets generated by microarray technology, which makes it possible to measure the activity, or expression, of thousands of genes in parallel. We discuss the basics of the technology, how to preprocess the data, and how classical and newly developed algorithms can be used to gain insight into the biological processes that generated the data. The algorithms we discuss are principal component analysis; clustering techniques such as hierarchical clustering and adaptive quality-based clustering; and statistical sampling methods such as Markov chain Monte Carlo and Gibbs sampling. We illustrate these algorithms with several real-life cases: diagnostics and class discovery in leukemia, functional genomics research on the mitotic cell cycle of yeast, and motif detection in Arabidopsis thaliana using DNA background models. We also discuss some bioinformatics software platforms. In the final part of the manuscript, we present future perspectives on the development of bioinformatics, including some visionary discussions on technology, algorithms, systems biology, and computational biomedicine.
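As a rough illustration of the first algorithm named above, principal component analysis can be sketched as a singular value decomposition of a mean-centered expression matrix. The sketch below is not the authors' implementation: the function name `pca_svd`, the samples-by-genes layout, and the synthetic two-group data are all assumptions made purely for illustration.

```python
import numpy as np


def pca_svd(expression, n_components=2):
    """Project samples onto the leading principal components.

    `expression` is assumed to have shape (n_samples, n_genes);
    this layout is a choice made for this sketch.
    """
    X = np.asarray(expression, dtype=float)
    # Center each gene (column) so the SVD describes variance around the mean.
    centered = X - X.mean(axis=0)
    # Thin SVD: rows of Vt are the principal directions in gene space.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    # Sample coordinates along each principal direction.
    scores = U[:, :n_components] * s[:n_components]
    # Fraction of total variance carried by each component.
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, Vt[:n_components], explained[:n_components]


# Synthetic toy data (not real measurements): two groups of five samples,
# where the second group has ten genes shifted upward.
rng = np.random.default_rng(0)
n_genes = 50
group_a = rng.normal(0.0, 1.0, size=(5, n_genes))
group_b = rng.normal(0.0, 1.0, size=(5, n_genes))
group_b[:, :10] += 4.0
toy = np.vstack([group_a, group_b])

scores, components, explained = pca_svd(toy)
```

With data of this kind, the first column of `scores` would be expected to separate the two sample groups, which is the sense in which such a projection can support class discovery.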
