Precision annotation of digital samples in NCBI’s Gene Expression Omnibus

The Gene Expression Omnibus (GEO) contains more than two million digital samples from functional genomics experiments amassed over almost two decades. However, individual sample metadata remain poorly described by unstructured free-text attributes, which prevents large-scale reanalysis. We introduce the Search Tag Analyze Resource for GEO, a web application (http://STARGEO.org), to curate better annotations of sample phenotypes uniformly across different studies, and to use these sample annotations to define robust genomic signatures of disease pathology by meta-analysis. In this paper, we engage a small group of biomedical graduate students to show that precise sample annotations can be crowd-curated rapidly across all phenotypes, and we demonstrate the biological validity of these crowd-curated annotations for breast cancer. STARGEO.org makes GEO data findable, accessible, interoperable and reusable (i.e., FAIR) to ultimately facilitate knowledge discovery. Our work demonstrates the utility of crowd-curation and interpretation of open ‘big data’ under FAIR principles as a first step towards realizing an ideal paradigm of precision medicine.
