Knowledge-based expert systems and a proof-of-concept case study for multiple sequence alignment construction and analysis

The traditional approach to bioinformatics analyses relies on independent, task-specific services and applications that use different, often idiosyncratic, input and output formats and are frequently not designed to interoperate. Typically, such analyses were performed by experts who manually verified the results obtained at each step in the process. Today, bioinformatics data are produced continuously and in such volume that handling the various applications used to study them presents a major data management and analysis challenge to researchers. It is no longer possible to analyse all this information manually, and new approaches are needed that can process large-scale heterogeneous data and extract the pertinent information. We review the recent use of integrated expert systems aimed at providing more efficient knowledge extraction for bioinformatics research. We describe a general methodology for building knowledge-based expert systems, focusing on the Unstructured Information Management Architecture (UIMA), which provides facilities for both data and process management. We also present a case study involving a multiple alignment expert system prototype called AlexSys.
