A novel method based on physicochemical properties of amino acids and one class classification algorithm for disease gene identification

Identifying the genes that cause disease is one of the most challenging steps toward establishing diagnosis and treatment quickly. Several interesting methods for disease gene identification have been introduced over the past decade. In general, the main differences between these methods are the type of data used as prior knowledge and the machine learning (ML) method used for identification. ML methods have commonly treated disease gene identification as a binary classification problem (whether a gene is a disease gene or not). However, no negative data are available for training the learners, which creates a major problem that affects the results. In this paper, a sequence-based, one-class classification method is introduced to decide whether a gene belongs to the disease class (yes or no). First, to generate the feature vector, the protein sequence of each gene is transformed into a numerical vector using physicochemical properties of amino acids. Second, as there is no definite approach for defining non-disease genes (negative data), we model solely the disease genes (positive data) and make predictions with the Support Vector Data Description algorithm. The experimental results confirm the efficiency of the method, with precision, recall and F-measure of 79.3%, 82.6% and 80.9%, respectively.
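The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the abstract does not name the descriptor or the properties, so the sketch uses an auto covariance encoding over a single property (the Kyte-Doolittle hydropathy index); and instead of Support Vector Data Description it fits a much simpler sphere around the centroid of the positive examples, which only mimics the idea of enclosing the disease genes in a boundary. The names `auto_covariance`, `CentroidSphere`, `max_lag` and `keep` are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy index. Assumed property: the abstract does not
# say which physicochemical properties were used.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def auto_covariance(seq, max_lag=3):
    """Encode a protein sequence as a fixed-length numerical vector.

    Entry `lag - 1` is the covariance of the property between residues
    that are `lag` positions apart. Unknown residue codes are skipped.
    """
    values = [HYDROPATHY[aa] for aa in seq if aa in HYDROPATHY]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence too short for the chosen max_lag")
    mean = sum(values) / n
    return [
        sum((values[i] - mean) * (values[i + lag] - mean)
            for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


class CentroidSphere:
    """Simplified stand-in for SVDD (NOT the SVDD optimisation itself).

    The model is trained on positive examples only: the sphere is centred
    on their mean, and its radius covers the closest `keep` fraction.
    """

    def fit(self, vectors, keep=0.9):
        dim = len(vectors[0])
        self.center = [sum(v[d] for v in vectors) / len(vectors)
                       for d in range(dim)]
        dists = sorted(math.dist(v, self.center) for v in vectors)
        idx = max(0, math.ceil(keep * len(dists)) - 1)
        self.radius = dists[idx]
        return self

    def predict(self, vector):
        """Return True when the vector falls inside the sphere."""
        return math.dist(vector, self.center) <= self.radius


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

In use, each known disease gene's protein sequence would be encoded with `auto_covariance`, the resulting vectors passed to `CentroidSphere().fit(...)`, and a candidate gene accepted when `predict` returns True. The `f_measure` helper also shows that the reported figures are mutually consistent: `f_measure(0.793, 0.826)` is approximately 0.809.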
