Evaluation of noise reduction techniques in the splice junction recognition problem

The Human Genome Project has generated a large amount of sequence data, and many works are currently concerned with analyzing these data. One such analysis is the identification of gene structures in the sequences obtained. To this end, one can search for particular signals associated with gene expression. Splice junctions are one type of signal present in eukaryotic genes, and many studies have applied Machine Learning techniques to the recognition of such regions. However, most genetic databases contain noisy data, which can degrade the performance of the learning techniques. This paper evaluates the effectiveness of five data pre-processing algorithms in eliminating noisy instances from two splice junction recognition datasets. After the pre-processing phase, two learning techniques, Decision Trees and Support Vector Machines, are employed in the recognition process.
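To make the idea of eliminating noisy instances concrete, the following is a minimal sketch of one classic noise filter, Wilson's edited nearest-neighbor rule, applied to short DNA windows. The abstract does not name the five algorithms evaluated, so this technique is shown only as an illustration and is not claimed to be the paper's method. The Hamming distance, the value k=3, the function names, and the toy sequences and labels are all assumptions made for the example.

```python
from collections import Counter


def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def edited_nearest_neighbor(sequences, labels, k=3):
    """Return the indices of instances kept by an edited nearest-neighbor filter.

    An instance is kept only if its label matches the majority label
    among its k nearest neighbors (illustrative assumption: Hamming distance).
    """
    kept = []
    for i, (seq, label) in enumerate(zip(sequences, labels)):
        # Rank every other instance by distance to the current one.
        neighbors = sorted(
            (hamming(seq, other), j)
            for j, other in enumerate(sequences)
            if j != i
        )[:k]
        votes = Counter(labels[j] for _, j in neighbors)
        majority_label, _ = votes.most_common(1)[0]
        if majority_label == label:
            kept.append(i)
    return kept


# Toy data (hypothetical): "EI" marks an exon/intron junction window,
# "N" marks a window with no junction. Index 3 is deliberately mislabeled.
sequences = [
    "CAGGTAAG", "CAGGTGAG", "AAGGTAAG",
    "CAGGTAAG",  # same window as index 0, but labeled "N" (simulated noise)
    "TTTCCCTT", "TTTCCCTA", "TTACCCTT",
]
labels = ["EI", "EI", "EI", "N", "N", "N", "N"]

kept = edited_nearest_neighbor(sequences, labels, k=3)
print(kept)  # indices of the instances that survive the filter
```

On this toy set the filter discards only the mislabeled instance at index 3, because its three nearest neighbors all carry the other label. A filtered training set of this kind would then be passed to the classifier of choice.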
