Improving Supervised Learning by Feature Decomposition

This paper presents the Feature Decomposition Approach for improving supervised learning tasks. Whereas Feature Selection aims to identify a representative set of features from which to construct a classification model, Feature Decomposition aims to decompose the original set of features into several subsets. A classification model is built for each subset, and all of the generated models are then combined. This paper presents theoretical and practical aspects of the Feature Decomposition Approach. A greedy procedure, called DOT (Decomposed Oblivious Trees), is developed to decompose the input feature set into subsets and to build a classification model for each subset separately. The results of an empirical comparison with well-known learning algorithms (such as C4.5) indicate the superiority of the Feature Decomposition Approach in learning tasks that contain a high number of features and a moderate number of tuples.
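To make the decomposition idea concrete, the following is a minimal sketch in Python. It assumes a round-robin split of the feature indices in place of DOT's greedy grouping, ordinary decision trees in place of oblivious trees, and averaged class probabilities as the combination scheme; these choices, and the scikit-learn helpers used, are illustrative assumptions rather than the paper's actual procedure.

# Minimal sketch of the feature-decomposition idea (not the DOT algorithm itself).
# Assumptions: round-robin feature partition, standard decision trees per subset,
# and averaged class probabilities as the combiner.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def decompose_features(n_features, n_subsets):
    """Round-robin partition of feature indices into disjoint subsets."""
    return [list(range(i, n_features, n_subsets)) for i in range(n_subsets)]

def fit_decomposed(X, y, n_subsets=3):
    """Train one classifier per feature subset."""
    subsets = decompose_features(X.shape[1], n_subsets)
    models = [DecisionTreeClassifier(max_depth=4).fit(X[:, s], y) for s in subsets]
    return subsets, models

def predict_decomposed(X, subsets, models):
    """Combine the subset models by averaging their class-probability estimates."""
    probas = [m.predict_proba(X[:, s]) for s, m in zip(subsets, models)]
    return np.mean(probas, axis=0).argmax(axis=1)

if __name__ == "__main__":
    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    subsets, models = fit_decomposed(X_tr, y_tr, n_subsets=3)
    acc = (predict_decomposed(X_te, subsets, models) == y_te).mean()
    print(f"decomposed-ensemble accuracy: {acc:.3f}")

Any grouping heuristic or combiner could be substituted for the placeholders above; the essential point is that each component model is trained on only its own disjoint subset of the features.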
