Pre-processing for noise detection in gene expression classification data

Because biological experiments are inherently imprecise, biological data often contain redundant and noisy instances. Such noise may stem from errors during data collection, for example contamination of laboratory samples. Gene expression data are a case in point: the equipment and tools currently used to measure expression frequently produce noisy data. Machine Learning algorithms have been successfully used in gene expression data analysis. Although many Machine Learning algorithms can tolerate noise, detecting and removing noisy instances from the training data set can help the induction of the target hypothesis. This paper evaluates distance-based pre-processing techniques for noise detection in gene expression classification problems. The evaluation measures how effectively each technique removes noisy data, using the accuracy obtained by different Machine Learning classifiers on the pre-processed data.
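
The abstract does not name the specific distance-based techniques evaluated, so the following is only an illustrative sketch of what such a filter can look like, not the paper's method. It assumes Wilson's edited nearest neighbor rule as a representative example: an instance is flagged as noisy when its class label disagrees with the majority label of its k nearest neighbors. The function names (`enn_noise_indices`, `remove_noise`), the Euclidean distance, the value of k, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def enn_noise_indices(X, y, k=3):
    """Return indices of instances whose label disagrees with the
    majority label of their k nearest neighbors (Euclidean distance).

    Hypothetical helper illustrating an edited-nearest-neighbor filter.
    """
    noisy = []
    for i, xi in enumerate(X):
        # Distances from instance i to every other instance, nearest first.
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbor_labels = [y[j] for _, j in dists[:k]]
        majority_label, _ = Counter(neighbor_labels).most_common(1)[0]
        if majority_label != y[i]:
            noisy.append(i)
    return noisy


def remove_noise(X, y, k=3):
    """Return copies of X and y without the instances flagged as noisy."""
    noisy = set(enn_noise_indices(X, y, k))
    keep = [i for i in range(len(X)) if i not in noisy]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (made up): two well-separated groups of 2-D "expression profiles".
# Instance 3 sits in the first group but carries the second group's label.
X = [
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.15, 0.15],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1], [5.15, 5.15],
]
y = ["normal", "normal", "normal", "tumor",
     "tumor", "tumor", "tumor", "tumor"]

flagged = enn_noise_indices(X, y, k=3)
X_clean, y_clean = remove_noise(X, y, k=3)
print(flagged)
```

On this toy set the filter flags only the mislabeled instance at index 3, and a classifier would then be trained on `X_clean` and `y_clean`. Real gene expression data have thousands of attributes per sample, where distance measures can behave differently than in this two-dimensional example, so the choice of distance and of k would need to be validated on the data at hand.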
