Computer Sciences Department

Novel Uses for Machine Learning and Other Computational Methods for the Design and Interpretation of Genetic Microarrays

ACKNOWLEDGMENTS

First, I would like to acknowledge my advisor, Jude Shavlik, for his patient guidance throughout. His careful attention and rigorous approach have not only helped to develop my work, but have also set the ideal example of how to perform meaningful professional research. Next, I would like to thank the other members of my thesis committee: Fred Blattner, David Page, Mark Craven and Chuck Dyer. They all provided important suggestions and insights that improved the thesis dramatically. They provided a community and an environment in which I could thrive.

Since 2001, NimbleGen Systems Inc. (now Roche NimbleGen) has provided me with a series of interesting problems to work on and, through my consulting, the funding to work on them, as well as superb collaborators too numerous to mention. I would like to specifically acknowledge several who were of special help with regard to my thesis work: Kitzman and Matt Rosesch have all been very helpful along the way. I would like to thank Peter Andrae for joining and substantially improving our text mining project, and John Tobler for his part in our joint work on microarray design. I would also like to thank my close friend Eric Haag for allowing me to join his work in compensatory evolution.

I also received three years of support from the Biotechnology Training Program (BTP, NIH Grant 5T32GM08349) and three months of support from the Computation and Informatics in Biology and Medicine (CIBM) Training Program (NLM Grant 5T15LM007359). I would also like to acknowledge Timothy Donohue and Beth Holden of the BTP and Louise Pape of the CIBM program for their kind assistance during and since my participation in these training programs.

Finally, I would like to acknowledge my family. My wife, Rebecca, and my children, Celia and Eva, have been constant sources of inspiration and support. My parents' love and generosity have helped to make this possible. I would also like to thank Rebecca's parents for their gracious encouragement and advice throughout.
