Data Wisdom in Computational Genomics Research

All fields of science are now inundated with massive amounts of data, which have the potential to answer fundamental questions. Genomics is one example, exploring questions such as: How does the human genome work? Which genome variants make us more prone to disease? To answer these questions, it is crucial to develop statistical and machine learning methods that scale, particularly through efficient data storage and communication. Equally crucial, but less emphasized, is the possession of data wisdom, a rebranding of the best elements of applied statistics described in a recent note at ODBMS.org (http://www.odbms.org/2015/04/data-wisdom-for-data-science/). That note contains ten sets of questions a practitioner can ask to cultivate data wisdom. Although there has been much recent excitement about big data, having enough data relevant to the problem is the key to gaining meaningful answers in genomics. Data wisdom gives us insight into what such data would look like, how much information a dataset really contains, and how to extract it. In this paper, we expand on the ten sets of questions and illustrate where and how data wisdom can be integrated into computational genomics research.
