A novel approach for predicting DNA splice junctions using hybrid machine learning algorithms

Accurate identification of splice junctions in DNA sequences is an active area of research. Knowing where splice junctions occur provides valuable information about the internal genomic structure of a sequence and aids its deeper analysis and interpretation. The major difficulties in gene analysis are the diversity, complexity and uncertain nature of DNA sequences. The application of machine learning algorithms to this problem has attracted enormous attention in the last few decades. This study presents hybrid machine learning ensemble approaches that address the splice junction problem more effectively. Multiple classifier systems consisting of random subspace, rotation forest and boosting methods are implemented and validated on a real genome sequence dataset. A novel feature selection technique based on attribute correlation estimation with a best-first search strategy is proposed. The average prediction accuracy achieved in identifying splice junctions is more than 98 %. All computations are performed with a 95 % confidence interval. The results presented in this study are superior to those of state-of-the-art approaches in the literature. This work strengthens the case for extending and applying machine learning models to similar problems.
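The abstract does not spell out the feature selection procedure, so the following is only a minimal sketch of the general idea of correlation-based subset scoring. It assumes the standard correlation-based feature selection (CFS) merit, k * r_cf / sqrt(k + k(k-1) * r_ff), with Pearson correlation as the estimator, and it swaps in a plain greedy forward search for best-first search (best-first additionally keeps a queue of candidate subsets and can backtrack). The function names and the toy data are hypothetical and are not taken from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def cfs_merit(subset, columns, labels):
    """CFS merit of a feature subset: high class correlation, low redundancy."""
    k = len(subset)
    # Mean absolute feature-class correlation.
    r_cf = sum(abs(pearson(columns[j], labels)) for j in subset) / k
    if k == 1:
        return r_cf
    # Mean absolute feature-feature correlation over all pairs in the subset.
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def forward_select(columns, labels):
    """Greedy forward search (a simplification of best-first, no backtracking)."""
    selected, best = [], 0.0
    remaining = list(range(len(columns)))
    while remaining:
        # Candidate whose addition gives the highest merit; ties go to the lowest index.
        j = max(remaining, key=lambda c: cfs_merit(selected + [c], columns, labels))
        score = cfs_merit(selected + [j], columns, labels)
        if score <= best:
            break  # no candidate improves the merit
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best


# Hypothetical toy data: feature 0 predicts the label, feature 1 duplicates
# feature 0 (redundant), feature 2 is uncorrelated with the label.
labels = [0, 1, 0, 1, 0, 1, 0, 1]
columns = [
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 1, 1, 0, 0],
]
selected, merit = forward_select(columns, labels)
print(selected, merit)
```

On this toy input the redundant copy and the uncorrelated feature do not raise the merit, so the search stops after one feature. For real splice junction data the nucleotide attributes are categorical, so a measure such as symmetrical uncertainty would be a more natural correlation estimate than Pearson correlation.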
