A preliminary review of influential works in data-driven discovery

In 2014, the Gordon and Betty Moore Foundation ran an Investigator Competition as part of its Data-Driven Discovery Initiative. We received about 1,100 applications, and each applicant could list up to five influential works in the general field of “Big Data” for scientific discovery. We collected nearly 5,000 references, among which 53 works were cited at least six times. This paper presents our preliminary findings.
