Big data and its epistemology

The article considers whether Big Data, in the form of data‐driven science, will enable the discovery or appraisal of universal scientific theories, instrumentalist tools, or inductive inferences. It first points out that such aspirations resemble the now‐discredited inductivist approach to science. On the positive side, Big Data may permit larger sample sizes, cheaper and more extensive testing of theories, and the continuous assessment of theories. On the negative side, data‐driven science encourages passive data collection, as opposed to experimentation and testing, and hornswoggling (“unsound statistical fiddling”). The roles of theory and data in inductive algorithms, statistical modeling, and scientific discoveries are analyzed, and it is argued that theory is needed at every turn. Data‐driven science is a chimera.
