Nonhypothesis-Driven Research: Data Mining and Knowledge Discovery

Clinical information accumulated over time is a potentially rich source of data for clinical research. Knowledge discovery in databases (KDD), commonly known as data mining, is a process for pattern discovery and predictive modeling in large databases. KDD relies heavily on data mining methods: automated processes and algorithms that enable pattern recognition. These typically include machine learning methods developed in the field of artificial intelligence, which have been applied to healthcare and biomedical data for a variety of purposes, with good success and with potential or realized clinical translation. Here, the Fayyad model of knowledge discovery in databases is introduced, and the steps of the process, which range from initial data selection to interpretation and evaluation, are described with selected examples from clinical research informatics. Commonly used data mining methods are surveyed: artificial neural networks, decision tree induction, support vector machines (kernel methods), association rule induction, and k-nearest neighbor. Methods for evaluating the models that result from the KDD process are closely related to those used in diagnostic medicine, including measures derived from a confusion matrix and receiver operating characteristic (ROC) curve analysis. Data partitioning and model validation are critical aspects of evaluation. The potential of these methods to generate new knowledge depends critically on international efforts to develop and refine clinical data repositories.
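As a rough illustration of the evaluation measures named above, the following Python sketch computes sensitivity, specificity, and predictive values from a confusion matrix, and estimates the area under the ROC curve as the proportion of positive/negative pairs that the model ranks correctly. The function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for this example; they are not taken from the text and do not represent any real clinical data.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
    }


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_measures(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Derive the usual diagnostic-test measures from the confusion matrix."""
    c = confusion_counts(y_true, y_pred)
    total = c["tp"] + c["fp"] + c["tn"] + c["fn"]
    return {
        "sensitivity": _ratio(c["tp"], c["tp"] + c["fn"]),
        "specificity": _ratio(c["tn"], c["tn"] + c["fp"]),
        "ppv": _ratio(c["tp"], c["tp"] + c["fp"]),
        "npv": _ratio(c["tn"], c["tn"] + c["fn"]),
        "accuracy": _ratio(c["tp"] + c["tn"], total),
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Estimate the ROC area as the fraction of (positive, negative) pairs in
    which the positive case receives the higher score; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented purely for illustration (not from any study).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
THRESHOLD = 0.5  # assumed cut-off for turning scores into class predictions
y_pred = [1 if s >= THRESHOLD else 0 for s in scores]

measures = diagnostic_measures(y_true, y_pred)
area = auc(y_true, scores)
print(measures)
print(area)
```

In practice these measures would be computed on a held-out partition of the data rather than on the cases used to fit the model, which is why data partitioning and validation matter for evaluation. The pairwise estimate of the ROC area is quadratic in the number of cases, so it suits small examples better than large repositories.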
