Feature selection for splice site prediction: A new method using EDA-based feature ranking

Background: The identification of relevant biological features in large and complex datasets is an important step towards gaining insight into the processes underlying the data. Feature selection also allows a classification system to attain equally good or even better solutions using a restricted subset of features, and it speeds up classification. Thus, robust methods for fast feature selection are of key importance in extracting knowledge from complex biological data.

Results: In this paper we present a novel method for feature subset selection based on estimation of distribution algorithms (EDAs), a framework that generalizes genetic algorithms. A feature ranking is derived from the distribution estimated by the algorithm, and this ranking is then used to iteratively discard features. We apply this technique to the problem of splice site prediction, and show how it can be used to gain insight into the underlying biological process of splicing.

Conclusion: We show that this technique is more robust than the traditional use of estimation of distribution algorithms for feature selection: instead of returning a single best subset of features (as EDAs normally do), this method provides a dynamic view of the feature selection process, like the traditional sequential wrapper methods. However, the method is faster than the traditional techniques, and scales better to datasets described by a large number of features.
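The ranking-and-discarding procedure summarized in the abstract can be illustrated with a minimal sketch. The abstract does not say which EDA, classifier, or parameter settings were used, so everything concrete here is an assumption made for illustration: a univariate EDA (UMDA-style, one independent inclusion probability per feature), a nearest-centroid classifier as a stand-in wrapper fitness, synthetic data, and the hypothetical function names `umda_feature_probabilities` and `ranked_elimination`.

```python
import numpy as np


def nearest_centroid_accuracy(X_tr, y_tr, X_va, y_va, mask):
    """Stand-in wrapper fitness (assumption): validation accuracy of a
    nearest-centroid classifier using only the features selected by mask."""
    if not mask.any():
        return 0.0
    A, B = X_tr[:, mask], X_va[:, mask]
    c0 = A[y_tr == 0].mean(axis=0)
    c1 = A[y_tr == 1].mean(axis=0)
    d0 = ((B - c0) ** 2).sum(axis=1)
    d1 = ((B - c1) ** 2).sum(axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == y_va).mean())


def umda_feature_probabilities(fitness, n_features, pop_size=60,
                               n_select=20, n_iter=15, rng=None):
    """Univariate EDA (assumed variant): estimate, per feature, the
    probability of appearing in a high-fitness subset."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.full(n_features, 0.5)
    for _ in range(n_iter):
        # Sample a population of binary feature masks from the current model.
        pop = rng.random((pop_size, n_features)) < p
        scores = np.array([fitness(ind) for ind in pop])
        # Re-estimate the marginals from the best individuals.
        best = pop[np.argsort(scores)[-n_select:]]
        # Clip so that no feature is fixed permanently in or out.
        p = np.clip(best.mean(axis=0), 0.05, 0.95)
    return p


def ranked_elimination(fitness, p):
    """Discard features one at a time, least probable first, recording
    (number of features kept, fitness) after each step."""
    order = np.argsort(p)
    mask = np.ones(len(p), dtype=bool)
    trace = [(int(mask.sum()), fitness(mask))]
    for j in order[:-1]:  # keep the top-ranked feature
        mask[j] = False
        trace.append((int(mask.sum()), fitness(mask)))
    return order, trace


# Synthetic data (assumption): 10 features, only the first 3 are informative.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 10))
X[:, :3] += 1.5 * y[:, None]
X_tr, y_tr, X_va, y_va = X[:200], y[:200], X[200:], y[200:]


def fitness(mask):
    return nearest_centroid_accuracy(X_tr, y_tr, X_va, y_va, mask)


p = umda_feature_probabilities(fitness, n_features=10, rng=rng)
order, trace = ranked_elimination(fitness, p)
for n_kept, acc in trace:
    print(n_kept, round(acc, 3))
```

The printed trace gives one accuracy value per subset size, which is the kind of step-by-step view that sequential wrapper methods provide, while the EDA itself is run only once to obtain the ranking.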
