New approaches to using scientific data - statistics, data mining and related technologies in research and research training

This paper surveys technological changes that affect the collection, organization, analysis and presentation of data. It considers changes or improvements that ought to influence the research process and direct the use of technology, and it explores the implications for graduate research training. The insights of Evidence-Based Medicine are relevant across many research areas, and they provide a helpful context within which to discuss the use of technological change to improve the research process. Systematic data-based overview has to date received inadequate attention, both in research and in research training. Sharing of research data once results are published would both assist systematic overview and allow further scrutiny where published analyses seem deficient. Deficiencies in data collection and published data analysis are surprisingly common. Technologies that offer new perspectives on data collection and analysis include data warehousing, data mining, new approaches to data visualization and a variety of computing technologies that are in the tradition of knowledge engineering and machine learning. There is a large overlap of interest with statistics. Statistics is itself changing dramatically as a result of the interplay between theoretical development and the power of new computational tools. I comment briefly on other developing application areas for the mathematical sciences, notably molecular biology. The internet offers new possibilities for cooperation across institutional boundaries, for exchange of information between researchers, and for dissemination of research results. Research training ought to equip students both to use their research skills in areas different from those in which they have been immediately trained, and to respond to the challenge of steadily more demanding standards. There should be an increased emphasis on training to work cooperatively.
“At bottom my critique is pretty simple-minded: Nobody pays much attention to the assumptions, and technology tends to overwhelm common sense.” [Freedman 1987.]

“I personally look forward to the proper balance that will emerge from the mixing of computational algorithm-oriented approaches characterizing the database and computer science communities with the powerful mathematical theories and methods for estimation developed in statistics.” [Fayyad 1998.]

“Statistics has been the most successful information science. Those who ignore statistics are condemned to re-invent it.” [Efron, quoted in Friedman 1997.]

“Some members of the profession are trying hard to make changes, by teaching courses in which substantive questions come first and technique is introduced to find answers. Of course, all too often, technique comes first; data come in as purely decorative illustrations – a practice not confined to statistics departments.” [Freedman 1991.]

[1]  J. Diamond Guns, Germs, and Steel: The Fates of Human Societies , 1999 .

[2]  T. Kuhn,et al.  The Structure of Scientific Revolutions. , 1964 .

[3]  Rebecca A. Maynard,et al.  The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs , 1987 .

[4]  D. M. Titterington,et al.  Neural Networks: A Review from a Statistical Perspective , 1994 .

[5]  A. Feinstein,et al.  The role of observational studies in the evaluation of therapy. , 1984, Statistics in medicine.

[6]  P. Kincade,et al.  Excerpts from Unlocking Our Future: Toward a New National Science Policy , 1999 .

[7]  Nicholas I. Fisher,et al.  Bump hunting in high-dimensional data , 1999, Stat. Comput..

[8]  J. Snow On the Mode of Communication of Cholera , 1856, Edinburgh medical journal.

[9]  D G Altman,et al.  The scandal of poor medical research , 1994, BMJ.

[10]  Robert F. Service Chemical Industry Rushes Toward Greener Pastures , 1998, Science.

[11]  D. Freedman As Others See Us: A Case Study in Path Analysis , 1987 .

[12]  Yoshua Bengio,et al.  Pattern Recognition and Neural Networks , 1995 .

[13]  William E. Odom Report of the Senior Assessment Panel for the International Assessment of the U , 1998 .

[14]  Neville Nicholls,et al.  Recent apparent changes in relationships between the El Niño-Southern Oscillation and Australian rainfall and temperature , 1996 .

[15]  Padhraic Smyth,et al.  From Data Mining to Knowledge Discovery: An Overview , 1996, Advances in Knowledge Discovery and Data Mining.

[16]  Frank R. Hampel,et al.  Is statistics too difficult? , 1998 .

[17]  G Taubes,et al.  The (Political) Science of Salt , 1998, Science.

[18]  I T Higgins,et al.  "Asbestos in drinking water and cancer incidence in the San Francisco Bay area". , 1981, American journal of epidemiology.

[19]  J. Maindonald,et al.  What is a correct plant density for transplanted green asparagus , 1997 .

[20]  K. McPherson Why Do Variations Occur , 1990 .

[21]  D. Freedman Statistical models and shoe leather , 1989 .

[22]  Evangelos Simoudis,et al.  Reality Check for Data Mining , 1996, IEEE Expert.

[23]  Jerome H. Friedman,et al.  DATA MINING AND STATISTICS: WHAT'S THE CONNECTION , 1997 .

[24]  C. Sagan Broca's Brain , 1979 .

[25]  Edward J. Wegman,et al.  Huge Data Sets and the Frontiers of Computational Feasibility , 1995 .

[26]  Cidambi Srinivasan,et al.  Combining Information: Statistical Issues and Opportunities for Research , 1993 .

[27]  D P Byar,et al.  Using observational data from registries to compare treatments: the fallacy of omnimetrics. , 1984, Statistics in medicine.

[28]  N. R. Cox,et al.  Use of statistical evidence in some recent issues of DSIR agricultural journals , 1984 .

[29]  T. Lipman Netting the Evidence: A ScHARR Introduction to Evidence-Based Practice on the Internet. , 2000 .

[30]  R. Mangan,et al.  Modeling Thermal Death in the Mexican Fruit Fly (Diptera: Tephritidae) , 1997 .

[31]  I McCance,et al.  Assessment of statistical procedures used in papers in the Australian Veterinary Journal. , 1995, Australian veterinary journal.

[32]  Gavin Mooney,et al.  The Challenges of Medical Practice Variations , 1990 .

[33]  Adrian F. M. Smith Mad cows and ecstasy : Chance and choice in an evidence-based society , 1996 .

[34]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[35]  I. Olkin,et al.  Improving the quality of reporting of randomized controlled trials. The CONSORT statement. , 1996, JAMA.

[36]  I. Chalmers,et al.  Salutary lessons from the Collaborative Eclampsia Trial , 1996, Evidence Based Medicine.

[37]  D. Pregibon,et al.  REX: an Expert System for Regression Analysis , 1984 .

[38]  D. Sackett,et al.  Evidence based medicine: what it is and what it isn't , 1996, BMJ.

[39]  M. Rosenberg,et al.  The Logic of Survey Analysis. , 1968 .

[40]  Peter L. Brooks,et al.  Visualizing data , 1997 .

[41]  Bjorn Andersen,et al.  Methodological Errors in Medical Research: An Incomplete Catalogue , 1990 .

[42]  G. Guyatt,et al.  The Science of Reviewing Research a , 1993, Annals of the New York Academy of Sciences.

[43]  A. Siebes,et al.  Data Mining and Statistics , 2000, Computational Intelligence in Data Mining.

[44]  Cochrane Injuries,et al.  Human albumin administration in critically ill patients: systematic review of randomised controlled trials. , 1998, BMJ.

[45]  H. Hricak,et al.  Evidence-based medicine. , 1997, Singapore medical journal.

[46]  J. Maindonald Statistical design, analysis, and presentation issues , 1992 .

[47]  H C Sox,et al.  Setting the optimal erythrocyte protoporphyrin screening decision threshold for lead poisoning: a decision analytic approach. , 1991, Pediatrics.

[48]  E Marshall Hot Property: Biologists Who Compute , 1996, Science.

[49]  J J Gartland,et al.  Orthopaedic clinical research. Deficiencies in experimental design and determinations of outcome. , 1988, The Journal of bone and joint surgery. American volume.

[50]  Diane McGuinness Why Our Children Can't Read and What We Can Do About It: A Scientific Revolution in Reading , 1997 .

[51]  김삼묘,et al.  Introducing the “Bioinformatics” special issue , 2000 .

[52]  Daryl Pregibon,et al.  A Statistical Perspective on Knowledge Discovery in Databases , 1996, Advances in Knowledge Discovery and Data Mining.

[53]  G. Lip How to Read a Paper: The Basics of Evidence Based Medicine , 1998, Journal of Human Hypertension.

[54]  Karl Rihaczek,et al.  1. WHAT IS DATA MINING? , 2019, Data Mining for the Social Sciences.

[55]  L. Curtin Thriving on chaos. , 1993, Nursing management.

[56]  Peter R. Scholtes,et al.  The Team Handbook , 2020 .

[57]  Patrick Murphy,et al.  What Is Statistics , 2014 .