Construction of a global map of human gene expression: the process, tools and analysis

This thesis studies the human gene expression space using high-throughput gene expression data from DNA microarrays. In molecular biology, high-throughput techniques allow the expression of tens of thousands of genes to be measured numerically and simultaneously. Within a single study, such data are traditionally obtained from a limited number of sample types with a small number of replicates. Data suitable for organism-wide analysis have been largely unavailable, and the global structure of the human transcriptome has remained unknown. This thesis introduces a human transcriptome map of different biological entities and an analysis of its general structure. The map is constructed from gene expression data from the two largest public microarray data repositories, GEO and ArrayExpress. The creation of this map contributed to the development of ArrayExpress by identifying and retrofitting previously unusable and missing data and by improving access to its data. It also contributed to the creation of several new tools for microarray data manipulation and to the establishment of data exchange between GEO and ArrayExpress. The data integration for the global map required the creation of a new, large ontology of human cell types, disease states, organism parts and cell lines. The ontology was used in a new text-mining and decision-tree-based method for automatic conversion of human-readable free-text microarray data anno-
