Structure feature selection for graph classification

With the development of highly efficient graph data collection technology in many application fields, classification of graph data has emerged as an important topic in the data mining and machine learning community. To build highly accurate classification models for graph data, we present an efficient graph feature selection method that uses frequent subgraphs as features for graph classification. Unlike existing methods, we consider the spatial distribution of the subgraph features in the graph data and select those whose spatial locations are consistent. We have applied our feature selection method to several cheminformatics benchmarks, where it demonstrates a significant improvement in prediction compared with state-of-the-art methods.
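
To make the idea of frequent subgraphs as classification features concrete, the following is a minimal sketch, not the method of this paper. It assumes the smallest possible subgraph patterns (single labeled edges), a hypothetical toy graph encoding, and a hypothetical support threshold `min_support`; all function names are illustrative. The spatial-consistency criterion is not sketched, since the abstract does not define it.

```python
from collections import Counter

def edge_patterns(graph):
    """Return the set of labeled-edge patterns occurring in one graph.

    graph is a pair (node_labels, edges): node_labels maps node id -> label,
    edges is a list of (u, v, edge_label). This encoding is an assumption.
    """
    node_labels, edges = graph
    patterns = set()
    for u, v, edge_label in edges:
        # Sort endpoint labels so the pattern ignores edge direction.
        a, b = sorted((node_labels[u], node_labels[v]))
        patterns.add((a, edge_label, b))
    return patterns

def frequent_patterns(graphs, min_support):
    """Keep patterns that occur in at least min_support (a fraction) of graphs."""
    counts = Counter()
    for g in graphs:
        counts.update(edge_patterns(g))  # each pattern counted once per graph
    n = len(graphs)
    return sorted(p for p, c in counts.items() if c / n >= min_support)

def to_feature_matrix(graphs, patterns):
    """One binary row per graph: 1 if the pattern occurs in it, else 0."""
    return [[1 if p in edge_patterns(g) else 0 for p in patterns] for g in graphs]

# Toy molecule-like graphs (hypothetical data, for illustration only).
graphs = [
    ({0: "C", 1: "O", 2: "C"}, [(0, 1, "single"), (0, 2, "single")]),
    ({0: "C", 1: "O"}, [(0, 1, "double")]),
    ({0: "C", 1: "C", 2: "O"}, [(0, 1, "single"), (1, 2, "double")]),
]
patterns = frequent_patterns(graphs, min_support=0.5)
features = to_feature_matrix(graphs, patterns)
print(patterns)
print(features)
```

The resulting binary matrix could be passed to any standard classifier. A real frequent subgraph miner would enumerate larger connected patterns and handle subgraph isomorphism, which this sketch deliberately avoids.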
