SFM: A novel sequence-based fusion method for disease genes identification and prioritization.

Identifying disease genes in the human genome is of great importance for improving the diagnosis and treatment of disease. Several machine learning methods have been introduced to identify disease genes. These methods differ mainly in three respects: the prior knowledge used to construct the feature vector for each instance (gene); the way negative data (non-disease genes) are selected, given that no experimental approach exists for finding them; and the classification method used to make the final decision. In this work, a novel sequence-based fusion method (SFM) is proposed to identify disease genes. Unlike existing methods, which rely on noisy and incomplete prior knowledge, SFM uses the amino acid sequences of proteins, which are universally available data, to represent each gene (protein) as four different feature vectors. To select more likely negative data from the candidate genes, the intersection of four negative sets, each generated using a distance approach, is taken. A decision tree (C4.5) is then applied as a fusion method to combine the results of four independent state-of-the-art predictors based on the support vector machine (SVM) algorithm and to make the final decision. The proposed method was evaluated using standard measures. The results show a precision, recall and F-measure of 82.6%, 85.6% and 84%, respectively. These results confirm the efficiency and validity of the proposed method.
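
The negative-selection step can be illustrated with a minimal sketch. The abstract states only that four negative sets are generated "using distance approach" and then intersected. The specifics below are assumptions made for illustration and are not taken from the paper: the Euclidean distance to the centroid of the known disease genes, the fixed cut-off `keep`, the function names and the toy data.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def likely_negatives(positives, unlabeled, keep):
    """Ids of the `keep` unlabeled genes farthest from the positive centroid.

    Assumption: "distance approach" is read here as distance to the centroid
    of the known disease genes; the paper may define it differently.
    """
    center = centroid(positives)
    ranked = sorted(
        unlabeled,
        key=lambda gene: euclidean(unlabeled[gene], center),
        reverse=True,
    )
    return set(ranked[:keep])


def reliable_negatives(views, keep):
    """Intersect the negative sets obtained from each feature representation.

    `views` is a list of (positives, unlabeled) pairs, one per representation.
    """
    negative_sets = [likely_negatives(pos, unl, keep) for pos, unl in views]
    return set.intersection(*negative_sets)


# Toy data (hypothetical): four feature representations of the same four genes.
views = [
    ([[1.0, 1.0], [1.2, 0.8]],
     {"g1": [1.1, 1.0], "g2": [5.0, 5.0], "g3": [4.0, 4.0], "g4": [1.0, 2.0]}),
    ([[0.0], [0.2]],
     {"g1": [0.1], "g2": [3.0], "g3": [0.3], "g4": [2.0]}),
    ([[2.0, 2.0, 2.0]],
     {"g1": [2.0, 2.0, 2.5], "g2": [9.0, 9.0, 9.0],
      "g3": [8.0, 8.0, 8.0], "g4": [2.0, 2.0, 2.0]}),
    ([[0.5, 0.5]],
     {"g1": [0.5, 0.6], "g2": [4.0, 4.0], "g3": [3.0, 0.5], "g4": [0.5, 3.2]}),
]

print(reliable_negatives(views, keep=2))
```

In this toy example only `g2` is far from the disease genes in all four representations, so it is the only gene kept as a likely negative. Intersecting the sets trades the size of the negative set for agreement across representations.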
