Random KNN Modeling and Variable Selection for High-Dimensional Data

High-dimensional data are ubiquitous in bioinformatics, chemometrics, and other fields. In gene expression experiments, for example, tens of thousands of genes are probed, while the phenotype may be clinical (such as tumor type) or a quantity measuring a biological characteristic of a subject. Although such high-dimensional data can be generated readily, analyzing and modeling them successfully remains highly challenging. Random KNN, as proposed in this dissertation, is a novel generalization of traditional nearest-neighbor modeling: an ensemble of base k-nearest-neighbor models, each built on a random subset of the input variables. The performance of Random KNN is analyzed both theoretically and empirically. Building on Random KNN, a new feature selection method is devised: a criterion named support is defined and computed within the Random KNN framework to rank the importance of variables, and a two-stage backward model selection method is developed using these supports. The present study shows that Random KNN is a more effective and more efficient model for high-dimensional data than existing approaches. The Random KNN approach applies to both qualitative and quantitative responses, i.e., classification and regression problems, and has applications in statistics, machine learning, pattern recognition, and bioinformatics.

Keywords: classification, regression, feature selection, bioinformatics, gene expression analysis
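To make the scheme concrete, the sketch below illustrates Random KNN classification in Python. It is a minimal illustration under stated assumptions, not the dissertation's implementation: the parameter names (r base models, m features per model, k neighbors), the sqrt(p) default for m, and the reading of "support" as the mean holdout accuracy of the base models that include a variable are choices made here for illustration and may differ from the dissertation's definitions.

```python
# Minimal sketch of Random KNN (assumptions noted in the text;
# not the dissertation's implementation).
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def knn_predict(X_train, y_train, X_test, k):
    """Plain k-nearest-neighbor classification by majority vote
    (Euclidean distance)."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]            # k nearest training rows
    return np.array([Counter(y_train[row]).most_common(1)[0][0]
                     for row in nearest])

def random_knn_predict(X_train, y_train, X_test, r=100, m=None, k=3):
    """Ensemble of r base kNN models, each on m randomly drawn features;
    the final class is the majority vote across the base models."""
    p = X_train.shape[1]
    m = m or max(1, int(np.sqrt(p)))                  # assumed default: sqrt(p)
    votes = []
    for _ in range(r):
        feats = rng.choice(p, size=m, replace=False)  # random feature subset
        votes.append(knn_predict(X_train[:, feats], y_train,
                                 X_test[:, feats], k))
    votes = np.asarray(votes)                         # shape (r, n_test)
    return np.array([Counter(votes[:, j]).most_common(1)[0][0]
                     for j in range(votes.shape[1])])

def feature_supports(X, y, r=500, m=None, k=3):
    """Hypothetical 'support' score: mean holdout accuracy of the base
    models that include each feature (one plausible reading; the
    dissertation's exact definition may differ)."""
    n, p = X.shape
    m = m or max(1, int(np.sqrt(p)))
    acc_sum, count = np.zeros(p), np.zeros(p)
    for _ in range(r):
        feats = rng.choice(p, size=m, replace=False)
        test = rng.choice(n, size=max(1, n // 3), replace=False)
        train = np.setdiff1d(np.arange(n), test)
        pred = knn_predict(X[train][:, feats], y[train],
                           X[test][:, feats], k)
        acc = np.mean(pred == y[test])
        acc_sum[feats] += acc                         # credit each feature used
        count[feats] += 1
    return acc_sum / np.maximum(count, 1)             # higher = more important
```

Given such support scores, the two-stage backward selection described above could proceed by repeatedly discarding the variables with the lowest supports and recomputing supports on the reduced variable set.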
