A performance comparison of dimension reduction methods for molecular structure classification

Mass spectrometry is a powerful tool in chemistry research. A primary aim of data mining in chemistry is to extract useful information from chemical databases and then classify compounds based on informative sample features. Because mass spectrometry data are high-dimensional with small sample sizes, building models first requires extracting informative features that can be used to analyze the data, build mining models, and tune the best parameters. We focus on dimension reduction methods and their application to the analysis of mass spectra. In this paper, we apply several methods, including Principal Component Analysis (PCA), Multidimensional Scaling (MDS), Isometric Mapping (Isomap), Laplacian Eigenmaps, t-Distributed Stochastic Neighbor Embedding (t-SNE), and Large Margin Nearest Neighbor (LMNN), to reduce the dimension of mass spectra. Finally, the AdaBoost algorithm combined with Classification and Regression Trees (AdaBoost-CART) is used to train a classifier that predicts 11 substructures from the reduced mass spectral feature set. The results demonstrate that LMNN yields a more informative low-dimensional representation and improves classification accuracy on mass spectral data.
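To make the two-stage pipeline concrete, the sketch below shows one way it could be wired up in Python with scikit-learn. This is a minimal illustration, not the authors' implementation: the mass spectral data set is not available here, so synthetic arrays stand in, PCA is used as a representative reducer (MDS, Isomap, Laplacian Eigenmaps, and t-SNE are available in sklearn.manifold, and LMNN in the separate metric-learn package), and all dimensions and parameters are assumed for illustration.

```python
# Illustrative sketch of the dimension-reduction + AdaBoost-CART pipeline.
# Synthetic data stand in for the mass spectra; all sizes are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-in for a high-dimensional, small-sample spectral matrix:
# 300 spectra x 1000 m/z bins, binary label = presence of one substructure.
X = rng.random((300, 1000))
y = rng.integers(0, 2, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Step 1: dimension reduction (PCA shown as one representative method).
reducer = PCA(n_components=20)
X_train_low = reducer.fit_transform(X_train)
X_test_low = reducer.transform(X_test)

# Step 2: AdaBoost with CART base learners (scikit-learn's default base
# estimator is a depth-1 decision tree; deeper CARTs can be supplied).
clf = AdaBoostClassifier(n_estimators=200, random_state=0)
clf.fit(X_train_low, y_train)

print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test_low)))
```

In practice, one such binary classifier would presumably be trained per substructure, giving 11 models over the same reduced feature set.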
