Towards a semi-automatic functional annotation tool based on decision-tree techniques

Background

Due to continuous improvements in high-throughput technologies and experimental procedures, the number of sequenced genomes is increasing exponentially. Ultimately, the task of annotating these data relies on the expertise of biologists. This need for supervision by human experts is the rate-limiting step of data analysis. To cope with the deluge of new genomic data, automating the annotation process as much as possible has become critical.

Results

We consider the annotation of a protein with terms of the functional hierarchy that has been used to annotate Bacillus subtilis, and propose a set of rules that predict classes in terms of elements of this hierarchy, i.e., a class is a node or a leaf of the hierarchy tree. The rules are obtained through two decision-tree techniques, first-order decision trees and multilabel attribute-value decision trees, using as training data the proteins from two lactic acid bacteria: Lactobacillus sakei and Lactobacillus bulgaricus. We tested the two methods, first independently and then in a combined approach, and evaluated the results using hierarchical evaluation measures. The results obtained with the two approaches on both genomes are comparable and show good precision together with a high prediction rate. Combining the approaches increases the recall and the prediction rate.

Conclusion

The combination of the two approaches is very encouraging, and we will further refine these combinations in order to obtain rules that are even more useful for annotators. This first study is a crucial step towards designing a semi-automatic functional annotation tool.
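The abstract does not spell out which hierarchical evaluation measures were used. As an illustration only, the sketch below shows one common formulation, in which the predicted and true class sets are each extended with all their ancestors in the hierarchy before precision and recall are computed, so that a prediction at the right branch but the wrong depth earns partial credit. The toy hierarchy `PARENT` and the function names are hypothetical and are not taken from the study.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical toy hierarchy (child -> parent); None marks a top-level class.
# These codes are invented for illustration, not the Bacillus subtilis hierarchy.
PARENT: Dict[str, Optional[str]] = {
    "1": None,
    "1.1": "1",
    "1.2": "1",
    "1.2.1": "1.2",
    "2": None,
    "2.1": "2",
}


def with_ancestors(labels: Iterable[str],
                   parent: Dict[str, Optional[str]]) -> Set[str]:
    """Return the labels together with all of their ancestors."""
    extended: Set[str] = set()
    for label in labels:
        node: Optional[str] = label
        while node is not None:
            extended.add(node)
            node = parent[node]
    return extended


def hierarchical_scores(predicted: Iterable[str],
                        true: Iterable[str],
                        parent: Dict[str, Optional[str]]) -> Tuple[float, float]:
    """Precision and recall computed on ancestor-extended class sets."""
    p = with_ancestors(predicted, parent)
    t = with_ancestors(true, parent)
    overlap = len(p & t)
    precision = overlap / len(p) if p else 0.0
    recall = overlap / len(t) if t else 0.0
    return precision, recall


# A prediction that stops one level above the true leaf is fully precise
# but recovers only two of the three classes on the true path.
precision, recall = hierarchical_scores({"1.2"}, {"1.2.1"}, PARENT)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this formulation, predicting an inner node instead of the correct leaf lowers recall but not precision, which is one way a rule set can show good precision while still leaving room for a combined approach to improve recall.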
