Viewpoint Paper: Large Datasets in Biomedicine: A Discussion of Salient Analytic Issues

Advances in high-throughput and mass-storage technologies have led to an information explosion in both biology and medicine, presenting novel challenges for analysis and modeling. With regard to multivariate analysis techniques such as clustering, classification, and regression, large datasets present unique and often misunderstood challenges. The authors' goal is to discuss the salient problems encountered in the analysis of large datasets as they relate to modeling and inference, both to inform principled and generalizable analysis and to highlight the interdisciplinary nature of these challenges. The authors present a detailed study of germane issues, including high dimensionality, multiple testing, scientific significance, dependence, information measurement, and information management, with a focus on the methodologies available to address these concerns. A firm understanding of the challenges and the statistical technology involved ultimately contributes to better science. The authors further suggest that the community facilitate discussion through interdisciplinary panels, invited papers, and curriculum enhancement to establish guidelines for analysis and reporting.
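
To make the multiple-testing issue named above concrete, the following is a minimal sketch of one widely used correction, the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. It is offered as an illustration only and is not taken from the paper; the function name `benjamini_hochberg`, the `alpha` default, and the example p-values are all assumptions chosen for this sketch, and the procedure as written assumes independent (or positively dependent) tests.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Illustrative sketch of the Benjamini-Hochberg step-up procedure.
    `p_values` may be in any order; the result follows the input order.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= alpha * k / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank

    # Reject the k hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from ten simultaneous tests.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
naive_hits = sum(p < 0.05 for p in example_p)
bh_hits = sum(benjamini_hochberg(example_p, alpha=0.05))
print(naive_hits, bh_hits)
```

In this made-up example, thresholding each test at 0.05 separately flags five results, whereas the step-up procedure retains only two. With thousands of features the gap between the two counts is typically far larger, which is why uncorrected testing on large datasets tends to overstate findings.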
