Two-phase prediction of protein functions from biological literature based on Gini-Index

This paper presents a two-phase model, based on the Gini Index, for predicting proteins and protein functions from the biological literature. As the volume and diversity of biological resources grow, computational protein function prediction becomes increasingly important. We consider automatic annotation with the Gene Ontology (GO) through computational function prediction, combining a Gini Index-based feature selection method with a protein function prediction model. The Gini Index has long been used as a split measure for choosing the most appropriate splitting attribute in decision trees. More recently, a Gini Index algorithm for feature selection in text categorization was introduced and shown to perform well. Building on this, we present a novel model that predicts both multi-label proteins from PubMed literature and their functions from the protein-function associations in GO annotations. First, we introduce a feature selection algorithm that uses Gini Index expressions to predict proteins from PubMed and to obtain protein-text subsets. Second, we propose a novel two-phase prediction method that uses those subsets to predict proteins and their functions. We evaluated the prediction results for both the proteins and their functions using the proposed methods. Overall, the methods performed well in predicting both proteins and protein functions from the biological literature.
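To make the two uses of the Gini Index mentioned above concrete, the following is a minimal sketch rather than the paper's algorithm. The impurity function is the standard decision-tree form, 1 - sum_k p_k^2. The term score, sum_k P(c_k | t)^2, is an assumed simplification of Gini-style feature selection for text; the exact expressions used in the paper are not given in the abstract. The function names, the toy documents, and the labels `P1` and `P2` are hypothetical.

```python
from collections import Counter, defaultdict


def gini_impurity(labels):
    """Standard Gini impurity of a label collection: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_term_scores(docs):
    """Score each term by sum_k P(label_k | term)^2 (assumed purity score).

    docs is a list of (tokens, label) pairs. A score of 1.0 means the term
    occurs in documents of a single label only; lower scores mean the term
    is spread across labels and is less discriminative.
    """
    term_label_counts = defaultdict(Counter)
    for tokens, label in docs:
        for term in set(tokens):  # count document frequency, not term frequency
            term_label_counts[term][label] += 1
    scores = {}
    for term, counts in term_label_counts.items():
        total = sum(counts.values())
        scores[term] = sum((c / total) ** 2 for c in counts.values())
    return scores


def select_top_terms(docs, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    scores = gini_term_scores(docs)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: tokenized abstracts labelled with a protein.
toy_docs = [
    (["kinase", "binds", "atp"], "P1"),
    (["kinase", "phosphorylates"], "P1"),
    (["receptor", "binds", "ligand"], "P2"),
]
toy_scores = gini_term_scores(toy_docs)
```

In this toy corpus, `binds` appears under both labels and receives a lower score than `kinase`, which appears only under `P1`. The sketch covers only the feature-scoring idea, not the two-phase prediction model itself.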
