A fuzzy-based data transformation for feature extraction to increase classification performance with small medical data sets

OBJECTIVE Medical data sets are usually small and have very high dimensionality. Too many attributes will make the analysis less efficient and will not necessarily increase accuracy, while too few data will decrease the modeling stability. Consequently, the main objective of this study is to extract the optimal subset of features to increase analytical performance when the data set is small. METHODS This paper proposes a fuzzy-based non-linear transformation method to extend classification related information from the original data attribute values for a small data set. Based on the new transformed data set, this study applies principal component analysis (PCA) to extract the optimal subset of features. Finally, we use the transformed data with these optimal features as the input data for a learning tool, a support vector machine (SVM). Six medical data sets: Pima Indians' diabetes, Wisconsin diagnostic breast cancer, Parkinson disease, echocardiogram, BUPA liver disorders dataset, and bladder cancer cases in Taiwan, are employed to illustrate the approach presented in this paper. RESULTS This research uses the t-test to evaluate the classification accuracy for a single data set; and uses the Friedman test to show the proposed method is better than other methods over the multiple data sets. The experiment results indicate that the proposed method has better classification performance than either PCA or kernel principal component analysis (KPCA) when the data set is small, and suggest creating new purpose-related information to improve the analysis performance. CONCLUSION This paper has shown that feature extraction is important as a function of feature selection for efficient data analysis. When the data set is small, using the fuzzy-based transformation method presented in this work to increase the information available produces better results than the PCA and KPCA approaches.

[1]  Kari Torkkola,et al.  Feature Extraction by Non-Parametric Mutual Information Maximization , 2003, J. Mach. Learn. Res..

[2]  Lior Rokach,et al.  Feature set decomposition for decision trees , 2005, Intell. Data Anal..

[3]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[4]  Ingo Steinwart,et al.  Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds , 2003, NIPS.

[5]  Der-Chiang Li,et al.  A new method to help diagnose cancers for small sample size , 2007, Expert Syst. Appl..

[6]  Anil K. Jain,et al.  Feature Selection: Evaluation, Application, and Small Sample Performance , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[7]  Dursun Delen,et al.  Predicting breast cancer survivability: a comparison of three data mining methods , 2005, Artif. Intell. Medicine.

[8]  Der-Chiang Li,et al.  Using mega-trend-diffusion and artificial samples in small data set learning for early flexible manufacturing system scheduling knowledge , 2007, Comput. Oper. Res..

[9]  T. Warren Liao,et al.  Medical data mining by fuzzy modeling with selected features , 2008, Artif. Intell. Medicine.

[10]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[11]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[12]  Asoke K. Nandi,et al.  Breast Cancer Diagnosis Using Genetic Programming Generated Feature , 2005, 2005 IEEE Workshop on Machine Learning for Signal Processing.

[13]  Huan Liu,et al.  Toward integrating feature selection algorithms for classification and clustering , 2005, IEEE Transactions on Knowledge and Data Engineering.

[14]  Michael I. Jordan,et al.  Kernel independent component analysis , 2003 .

[15]  G. Muir,et al.  The genetic alterations in the oncogenic pathway of transitional cell carcinoma of the bladder and its prognostic value , 2001, Urological Research.

[16]  Graziano Pesole,et al.  Selection of relevant genes in cancer diagnosis based on their prediction accuracy , 2007, Artif. Intell. Medicine.

[17]  Pasi Luukka,et al.  Similarity classifier in diagnosis of bladder cancer , 2008, Comput. Methods Programs Biomed..

[18]  I K Fodor,et al.  A Survey of Dimension Reduction Techniques , 2002 .

[19]  I. Jolliffe Principal Component Analysis , 2002 .

[20]  Joaquín A. Pacheco,et al.  A variable selection method based on Tabu search for logistic regression models , 2009, Eur. J. Oper. Res..

[21]  Roman Rosipal,et al.  Overview and Recent Advances in Partial Least Squares , 2005, SLSFS.

[22]  John Shawe-Taylor,et al.  Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.

[23]  Alex Alves Freitas,et al.  A constrained-syntax genetic programming system for discovering classification rules: application to medical data sets , 2004, Artif. Intell. Medicine.

[24]  W. Wolberg,et al.  Psychosexual adaptation to breast cancer surgery , 1989, Cancer.

[25]  T. Warren Liao,et al.  II, A fuzzy c-means variant for the generation of fuzzy term sets , 2003, Fuzzy Sets Syst..

[26]  Bernhard Schölkopf,et al.  Nonlinear Component Analysis as a Kernel Eigenvalue Problem , 1998, Neural Computation.

[27]  Witold Pedrycz,et al.  Data Mining Methods for Knowledge Discovery , 1998, IEEE Trans. Neural Networks.

[28]  Tingting Mu,et al.  Classification of breast masses via nonlinear transformation of features based on a kernel matrix , 2007, Medical & Biological Engineering & Computing.

[29]  Kate Smith-Miles,et al.  A meta-learning approach to automatic kernel selection for support vector machines , 2006, Neurocomputing.

[30]  Rudy Setiono,et al.  Generating concise and accurate classification rules for breast cancer diagnosis , 2000, Artif. Intell. Medicine.

[31]  Shigeo Abe,et al.  A novel approach to feature selection based on analysis of class regions , 1997, IEEE Trans. Syst. Man Cybern. Part B.

[32]  Lior Rokach,et al.  Genetic algorithm-based feature set partitioning for classification problems , 2008, Pattern Recognit..

[33]  Gene H. Golub,et al.  Matrix computations , 1983 .