Feature selection focused within error clusters

We propose a feature selection method that constructs each new feature by analyzing tight error clusters. It is a greedy, time-efficient forward-selection algorithm that constructs one feature at a time until a desired error rate is reached. At each step, the algorithm finds error clusters in the current feature space; a tight error cluster indicates that the current features cannot discriminate those samples. It then projects one tight cluster into the null space of the feature mapping, where a new feature that helps classify these errors can be discovered. The approach is strongly data-driven and restricted to linear features, but otherwise general. Large-scale experiments show a monotonically decreasing error rate on the feature-discovery set and a generally decreasing error rate on a distinct test set.
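
To make the procedure concrete, here is a minimal sketch in Python of one plausible realization of the loop described above. The helper names (null_space_projector, discover_feature, select_features), the choice of logistic regression as the working classifier, k-means for locating error clusters, and the single-class fallback are all illustrative assumptions, not the paper's implementation; binary labels are assumed for simplicity.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression


def null_space_projector(W):
    """Projector onto the orthogonal complement of the current features."""
    if W.shape[1] == 0:
        return np.eye(W.shape[0])
    Q, _ = np.linalg.qr(W)                      # orthonormal basis for span(W)
    return np.eye(W.shape[0]) - Q @ Q.T


def discover_feature(X, y, W, n_clusters=5):
    """One forward step: locate the tightest error cluster, project it into
    the null space of the current feature mapping, and return a new linear
    feature (a unit vector in input space) plus the current error rate."""
    n = X.shape[0]
    Z = X @ W if W.shape[1] else np.zeros((n, 1))
    clf = LogisticRegression(max_iter=1000).fit(Z, y)
    errors = clf.predict(Z) != y
    if not errors.any():
        return None, 0.0
    # Cluster the misclassified samples in the current feature space and
    # keep the tightest cluster (smallest mean distance to its centroid).
    idx = np.flatnonzero(errors)
    k = min(n_clusters, len(idx))
    km = KMeans(n_clusters=k, n_init=10).fit(Z[idx])
    spread = [np.linalg.norm(Z[idx][km.labels_ == c] - km.cluster_centers_[c],
                             axis=1).mean() for c in range(k)]
    members = idx[km.labels_ == int(np.argmin(spread))]
    # Project the cluster into the null space, where the existing features
    # carry no information about these samples.
    Xp = X[members] @ null_space_projector(W)
    if np.unique(y[members]).size < 2:
        # Single-class cluster: fall back to the dominant variance direction
        # of the projected samples (an assumption, not from the paper).
        w = np.linalg.svd(Xp - Xp.mean(0), full_matrices=False)[2][0]
    else:
        # A linear discriminant fit on the projected cluster; its weight
        # vector becomes the new feature direction.
        w = LogisticRegression(max_iter=1000).fit(Xp, y[members]).coef_[0]
    return w / np.linalg.norm(w), errors.mean()


def select_features(X, y, max_features=10, target_error=0.05):
    """Greedy loop: add one feature per step until the target error rate
    is reached or no errors remain."""
    W = np.empty((X.shape[1], 0))               # columns are found features
    for _ in range(max_features):
        w, err = discover_feature(X, y, W)
        if w is None or err <= target_error:
            break
        W = np.column_stack([W, w])
    return W
```

Because each new feature is constructed in the null space of the mapping found so far, it is orthogonal to the existing features and can only add discriminative information, which is consistent with the monotone error decrease reported on the feature-discovery set.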
