Modelling knowledge strategy for solving the DNA sequence annotation problem through CommonKADS methodology

Finding the genes within a DNA sequence and assigning them biological features and functions is one of the biggest challenges in genomics. This task, called annotation, must be as accurate and reliable as possible, because its results feed into further research. Ideally, each sequence would be annotated and validated by a human expert, who has the knowledge to infer the most appropriate annotation. However, the huge volume of genomic data produced by new sequencing technologies makes this impractical. Developing expert systems that annotate sequences automatically and emulate the expert's involvement at certain key points of the process would enhance annotation quality. In this work, the CommonKADS methodology is applied to this purpose in a novel way. It is used to structure and model the knowledge required to build an expert system that handles the functional part of sequence annotation, i.e. establishing the biological purpose of the sequence. This approach provides the first general framework for this problem, and it can be easily extended to related issues.
