Parcel: Feature Subset Selection in Variable Cost Domains

The vast majority of classification systems are designed with a single set of features and optimised for a single specified cost. However, in domains such as medical and financial risk modelling, costs are known to vary after the system has been designed. In this paper, we present a design method for feature selection in the presence of varying costs. Starting from the Wilcoxon nonparametric statistic for the performance of a classification system, we introduce a concept called the maximum realisable receiver operating characteristic (MRROC) and prove a related theorem. We then introduce a novel criterion for feature selection based on the area under this curve. This leads to a framework, which we call Parcel, that has the flexibility to use different combinations of features at different operating points on the resulting curve. Empirical support for each stage of our approach is provided by experiments on real-world problems, with Parcel achieving superior results.
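As a rough illustration of the starting point named above, the Wilcoxon (Mann-Whitney) statistic can be read as the probability that a randomly chosen positive example receives a higher classifier score than a randomly chosen negative one, which is numerically equal to the area under the empirical ROC curve. The sketch below is not taken from the paper: the function name `wilcoxon_auc` and the example scores are hypothetical, and ties are assumed to count as one half.

```python
def wilcoxon_auc(pos_scores, neg_scores):
    """Wilcoxon-Mann-Whitney estimate of P(score_pos > score_neg).

    Hypothetical helper for illustration only. Ties contribute 1/2,
    so the result equals the area under the empirical ROC curve.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie: half credit
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up classifier scores, purely for demonstration.
pos = [0.9, 0.8, 0.6, 0.55]
neg = [0.7, 0.5, 0.4, 0.3]
print(wilcoxon_auc(pos, neg))
```

Because the statistic depends only on how the scores rank the two classes, it summarises performance across all decision thresholds rather than at one fixed cost, which is why it is a natural basis for a selection criterion when costs vary.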
