C-PUGP: A cluster-based positive unlabeled learning method for disease gene prediction and prioritization

Disease gene detection is an important stage in understanding disease processes and treatment. Many machine learning methods have been used to identify candidate disease genes. Although these methods differ in the feature vectors used for genes, the way reliable negative data (non-disease genes) are selected, and the classification method, the lack of negative data is the most significant challenge they share. Recently, candidate disease genes have been identified by semi-supervised learning methods based on positive and unlabeled data. These methods are reasonably accurate and achieve better results than earlier methods. In this article, we propose a novel Positive Unlabeled (PU) learning technique based on clustering and a One-Class classification algorithm. Unlike existing methods, we build a more Reliable Negative (RN) set in three steps: (1) clustering the positive data, (2) learning One-Class classifier models from the clusters, and (3) selecting the intersection set of negative data as the Reliable Negative set. Next, we identify and rank the candidate disease genes using a binary classifier based on the support vector machine (SVM) algorithm. Experimental results indicate that the proposed method achieves the best results, namely 92.8, 93.6, and 93.1 in terms of precision, recall, and F-measure, respectively. Compared to existing methods, the proposed method improves F-measure by 11.7 percent over the best of them. The prioritization results also improve by about 6%.
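The three-step construction of the Reliable Negative set and the subsequent SVM ranking can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, the one-class classifier, or any hyperparameters, so everything below is an assumption made for illustration: k-means for clustering, scikit-learn's `OneClassSVM` as the one-class model, the values of `n_clusters` and `nu`, the helper names `select_reliable_negatives` and `rank_candidates`, and the synthetic two-dimensional "gene features". The intersection step is read here as "unlabeled genes rejected by every per-cluster model", which is one plausible interpretation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC, OneClassSVM


def select_reliable_negatives(X_pos, X_unl, n_clusters=3, nu=0.1, seed=0):
    """Return indices into X_unl rejected by every per-cluster one-class model."""
    # Step 1: cluster the positive (known disease gene) feature vectors.
    # k-means is an assumed choice; the abstract does not name the algorithm.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(X_pos)

    rejected_by_all = np.ones(len(X_unl), dtype=bool)
    for c in range(n_clusters):
        # Step 2: learn one one-class model per positive cluster.
        occ = OneClassSVM(kernel="rbf", gamma="scale", nu=nu)
        occ.fit(X_pos[labels == c])
        # predict() returns -1 for points outside the learned support.
        rejected_by_all &= occ.predict(X_unl) == -1

    # Step 3: keep only the genes that every model marks as negative.
    return np.flatnonzero(rejected_by_all)


def rank_candidates(X_pos, X_unl, rn_idx):
    """Train a binary SVM on positives vs. RN and rank all unlabeled genes."""
    X_train = np.vstack([X_pos, X_unl[rn_idx]])
    y_train = np.r_[np.ones(len(X_pos)), np.zeros(len(rn_idx))]
    clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
    # Larger decision values mean "more like a disease gene".
    scores = clf.decision_function(X_unl)
    return np.argsort(-scores), scores


# Synthetic stand-in data (hypothetical, not from the paper).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_pos = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in centers])
hidden_pos = rng.normal(centers[0], 0.5, size=(20, 2))  # unlabeled positives
true_neg = rng.normal([-10.0, -10.0], 0.5, size=(40, 2))  # unlabeled negatives
X_unl = np.vstack([hidden_pos, true_neg])

rn_idx = select_reliable_negatives(X_pos, X_unl)
order, scores = rank_candidates(X_pos, X_unl, rn_idx)
print("reliable negatives:", len(rn_idx))
print("top-ranked unlabeled indices:", order[:5])
```

Taking the intersection is the conservative choice: a gene enters the RN set only if it looks unlike every cluster of known disease genes, which lowers the chance of labeling a hidden positive as negative at the cost of a smaller RN set.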
