Knowledge discovery by accuracy maximization

Significance

We propose a method to extract new knowledge from noisy, high-dimensional data. Our approach differs from previous methods in that validation of the results is built into the procedure itself, through maximization of cross-validated accuracy. In many cases this method outperforms existing feature extraction methods, and it offers a general framework for analyzing complex data across a broad range of sciences. Examples ranging from genomics and metabolomics to astronomy and linguistics show the versatility of the method.

Abstract

Here we describe KODAMA (knowledge discovery by accuracy maximization), an unsupervised and semisupervised learning algorithm that performs feature extraction from noisy, high-dimensional data. Unlike other data mining methods, KODAMA is driven by an integrated procedure of cross-validation of the results: the discovery of a local manifold's topology is guided by a classifier through a Monte Carlo procedure that maximizes cross-validated predictive accuracy. Because validation is integrated into the method itself, the solution it returns is highly robust. This robustness is demonstrated on experimental gene expression and metabolomic datasets, on which KODAMA compares favorably with existing feature extraction methods. KODAMA is then applied to an astronomical dataset, revealing unexpected features. Interesting and not easily predictable features also emerge from an analysis of the State of the Union addresses of American presidents: KODAMA reveals an abrupt linguistic transition that sharply separates all post-Reagan from all pre-Reagan speeches. The transition occurs during Reagan's presidency, not at its beginning.
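To make the driving idea concrete, the following is a minimal sketch of accuracy maximization in Python. It is not the authors' implementation (KODAMA is distributed as an R package); the 1-nearest-neighbour classifier, the fold and restart counts, and the toy two-cluster data are illustrative assumptions. Starting from random class labels, each Monte Carlo round runs k-fold cross-validation in which the held-out labels are overwritten by the classifier's predictions, so the labelling drifts toward one that the classifier can predict with high cross-validated accuracy; samples that repeatedly end up in the same class across restarts are treated as proximate, and the resulting dissimilarity matrix can be passed to any embedding or clustering method.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def refine_labels(X, labels, n_iter=10, n_folds=5):
    # One Monte Carlo round: k-fold cross-validation in which the held-out
    # labels are overwritten with the classifier's predictions, so the
    # labelling drifts toward one the classifier predicts accurately.
    labels = labels.copy()
    for _ in range(n_iter):
        folds = rng.permutation(len(labels)) % n_folds  # balanced random folds
        new = labels.copy()
        for f in range(n_folds):
            train, test = folds != f, folds == f
            clf = KNeighborsClassifier(n_neighbors=1)
            new[test] = clf.fit(X[train], labels[train]).predict(X[test])
        accuracy = np.mean(new == labels)  # cross-validated accuracy this round
        labels = new
        if accuracy == 1.0:  # labelling is self-consistent; stop early
            break
    return labels

def dissimilarity_matrix(X, n_restarts=50, n_classes=5):
    # Average over Monte Carlo restarts: samples that keep ending up in the
    # same class are proximate. 1 - proximity can feed MDS, t-SNE, etc.
    n = len(X)
    proximity = np.zeros((n, n))
    for _ in range(n_restarts):
        labels = refine_labels(X, rng.integers(n_classes, size=n))
        proximity += labels[:, None] == labels[None, :]
    return 1.0 - proximity / n_restarts

# Toy usage: two noisy clusters in 5 dimensions.
X = np.vstack([rng.normal(0, 1, (15, 5)), rng.normal(4, 1, (15, 5))])
D = dissimilarity_matrix(X)
print(np.round(D[:3, :3], 2))   # small: same cluster
print(np.round(D[:3, -3:], 2))  # larger: opposite clusters
```

The published method is considerably more refined, both in the choice of classifier and in how the accumulated proximity information is embedded in low dimensions, but the loop above captures the principle named in the abstract: a labelling of the data is accepted only insofar as it survives cross-validation.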
