An Introduction to Data Mining

Data mining aims at the automated discovery of knowledge from typically large repositories of data. In science, this knowledge is most often integrated into a model describing a particular process or natural phenomenon. Requirements on the predictive power and generality of the resulting models are usually significantly higher than in other application domains. Therefore, when data mining is used in the sciences, and in crystallography in particular, methods from machine learning and statistics play a significantly larger role than in other application areas. In crystallography, data collection, cleaning, and warehousing are the aspects of standard data mining that play an important role, whereas the analysis of the data relies mostly on techniques from machine learning and statistical analysis. The purpose of this chapter is to introduce the reader to the concepts behind this latter part of the knowledge discovery process and to provide a general intuition for the methods and possibilities of the different tools for learning from databases.
