MACSIMS : multiple alignment of complete sequences information management system

Background: In the post-genomic era, systems-level studies seek to explain complex biological systems by integrating diverse resources from fields such as genomics, proteomics and transcriptomics. New information management systems are needed to collect, validate and analyse the vast amount of heterogeneous data now available. Multiple alignments of complete sequences provide an ideal environment for integrating this information in the context of the protein family.

Results: MACSIMS is a multiple alignment-based information management program that combines the advantages of knowledge-based and ab initio sequence analysis methods. Structural and functional information is retrieved automatically from the public databases. Homologous regions are identified in the multiple alignment, and within these reliable regions the retrieved data are evaluated and propagated from known to unknown sequences. In a large-scale evaluation, the specificity of the propagated sequence features is estimated to be >99%, i.e. very few false positive predictions are made. MACSIMS is then used to characterise mutations in a test set of 100 proteins known to be involved in human genetic diseases. The number of sequence features associated with these proteins increased by 60% compared to the features available in the public databases. An XML output file allows automatic parsing of the MACSIMS results, while a graphical display using the Jalview program allows manual analysis.

Conclusion: MACSIMS is a new information management system that incorporates detailed analyses of protein families at the structural, functional and evolutionary levels. It thus provides a unique environment that facilitates knowledge extraction and the presentation of the most pertinent information to the biologist. A web server and the source code are available at http://bips.u-strasbg.fr/MACSIMS/.
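The propagation step described in the Results can be illustrated with a minimal Python sketch. It is not the MACSIMS implementation: the function names, the 0-based inclusive coordinates, the `reliable_cols` set and the toy sequences are all assumptions made for illustration. The idea shown is only that a feature on an annotated sequence is mapped through alignment columns onto an unannotated sequence, and is transferred only if every column it covers lies in a region judged reliable and the target has no gaps there.

```python
def residue_columns(aligned):
    """Alignment column of each residue, in order, for a gapped sequence."""
    return [col for col, ch in enumerate(aligned) if ch != "-"]


def propagate_feature(source, target, start, end, reliable_cols):
    """Map a feature from `source` onto `target` through the alignment.

    `start` and `end` are 0-based, inclusive residue positions in the
    ungapped source sequence (a convention assumed for this sketch).
    Returns (start, end) in ungapped target coordinates, or None if the
    feature is not transferred.
    """
    cols = residue_columns(source)[start:end + 1]

    # Only propagate inside columns flagged as reliable (hypothetical input,
    # standing in for the homologous regions identified in the alignment).
    if not cols or any(col not in reliable_cols for col in cols):
        return None

    # Build column -> residue index for the target sequence.
    col_to_res = {col: i for i, col in enumerate(residue_columns(target))}

    # Require the target to have a residue in every feature column.
    if any(col not in col_to_res for col in cols):
        return None

    return col_to_res[cols[0]], col_to_res[cols[-1]]


# Toy two-sequence alignment (invented data).
annotated = "MK-TAYIAKQR"
unannotated = "MKLTA-IAKQR"
reliable = set(range(6, 11))  # columns 6..10 assumed reliable

print(propagate_feature(annotated, unannotated, 5, 8, reliable))
print(propagate_feature(annotated, unannotated, 3, 5, reliable))
```

In this toy case the first feature falls entirely within the reliable columns and is transferred, while the second overlaps an unreliable, gapped stretch and is rejected. Being conservative in this way is one plausible route to the high specificity reported above, although the actual criteria used by MACSIMS are not given in this abstract.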
