High-dimensional data analysis: optimal metrics and feature selection

High-dimensional data are everywhere: texts, sounds, spectra, images, etc. are described by thousands of attributes. However, many of the data analysis tools at our disposal (coming from statistics, artificial intelligence, etc.) were designed for low-dimensional data, and many of the explicit or implicit assumptions made while developing them do not carry over to high-dimensional data. For instance, many tools rely on the Euclidean distance to compare data elements. But the Euclidean distance concentrates in high-dimensional spaces: all distances between data elements appear nearly identical. The Euclidean distance is furthermore incapable of distinguishing important attributes from irrelevant ones. This thesis therefore focuses on the choice of a relevant distance function to compare high-dimensional data and on the selection of relevant attributes.

In Part One of the thesis, the phenomenon of the concentration of distances is considered, and its consequences on data analysis tools are studied. It is shown that for nearest-neighbour search, the Euclidean distance and the Gaussian kernel, both heavily used, may not be appropriate; fractional metrics and generalised Gaussian kernels are proposed instead.

Part Two of the thesis focuses on the problem of feature selection when the number of initial features is large. Two methods are proposed to (1) reduce the computational burden of the feature selection process and (2) cope with the instability induced by the high correlations between features that often appear in high-dimensional data.

Most of the concepts studied and presented in this thesis are illustrated on chemometric data, and more particularly on spectral data, with the objective of inferring a physical or chemical property of a material by analysing the spectrum of the light it reflects.
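The concentration effect described above is easy to observe numerically. The sketch below (an illustrative demo, not code from the thesis) draws uniform points in [0,1]^d and measures the relative contrast (d_max − d_min)/d_min of the distances from a random query to the data, for the Euclidean distance (p = 2) and a fractional Minkowski-type metric (p = 0.5, an assumed choice for illustration). As the dimension d grows, the contrast collapses towards zero, and it does so more slowly for the fractional metric:

```python
import numpy as np

def relative_contrast(points, query, p):
    """Relative contrast (d_max - d_min) / d_min of Minkowski-type
    distances from a query to a set of points. For p < 1 this is a
    'fractional' pre-metric: the triangle inequality no longer holds,
    but it can still rank neighbours."""
    dists = np.sum(np.abs(points - query) ** p, axis=1) ** (1.0 / p)
    return (dists.max() - dists.min()) / dists.min()

rng = np.random.default_rng(0)
n = 1000
for d in (2, 10, 100, 1000):
    points = rng.random((n, d))          # uniform points in [0,1]^d
    query = rng.random(d)
    c_eucl = relative_contrast(points, query, p=2.0)
    c_frac = relative_contrast(points, query, p=0.5)
    print(f"d={d:5d}  contrast p=2: {c_eucl:7.3f}   p=0.5: {c_frac:7.3f}")
```

When the contrast is near zero, the nearest and farthest neighbours are almost equidistant from the query, which is why nearest-neighbour search with the plain Euclidean distance becomes unreliable.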
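The instability addressed in Part Two can also be illustrated with a minimal sketch (again a hypothetical demo, not the thesis's method): when two features are strongly correlated and equally relevant, a greedy selector that ranks features by correlation with the target picks one or the other almost at random across bootstrap resamples, so the selected subset is unstable:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_noise, n_boot = 200, 3, 100

# Two strongly correlated features that are equally relevant to y,
# plus a few irrelevant noise features.
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)
X = np.column_stack([x1, x2, rng.normal(size=(n, n_noise))])
y = x1 + x2 + 0.5 * rng.normal(size=n)

first_pick = []
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)      # bootstrap resample
    Xb, yb = X[idx], y[idx]
    # Greedy first step: pick the feature most correlated with y.
    scores = [abs(np.corrcoef(Xb[:, j], yb)[0, 1]) for j in range(X.shape[1])]
    first_pick.append(int(np.argmax(scores)))

# Features 0 and 1 split the first pick across resamples.
print("first-pick counts per feature:",
      np.bincount(first_pick, minlength=X.shape[1]))
```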
