A context evaluation approach for structural comparison of proteins using cross entropy over n-gram modelling

The structural comparison of proteins is a vital step in structural biology, used to predict and analyse the function of newly identified proteins. Although a number of different techniques have been explored, the development of alternative methods remains an active research area. The present paper introduces a text modelling-based technique for the structural comparison of proteins. The method encodes the secondary and tertiary structure of each protein as two linear sequences and then uses these sequences to compare two structures. The pairwise sequence comparison is adopted from computational linguistics, which offers well-established techniques for analysing and quantifying textual sequences. Specifically, an n-gram model is used to capture regularities in the sequences, and cross-entropy is then used to measure their similarity. Several experiments are conducted to evaluate the performance of the method and compare it with other commonly used programs. Information retrieval evaluations show that the technique runs at a speed similar to that of other linear encoding methods, such as 3D-BLAST, SARST, and TS-AMIR, while its accuracy is comparable to that of CE and TM-align, which are high-accuracy comparison tools. Accordingly, the results indicate that the algorithm is highly efficient compared with other state-of-the-art methods.
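The core idea of the abstract, scoring one linear structure string against an n-gram model built from another by cross-entropy, can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything specific here is an assumption made for illustration: the three-letter alphabet `"HEC"` (helix, strand, coil), the trigram order, the add-one smoothing, and the function names `cross_entropy` and `symmetric_distance`. The paper's actual encoding and estimator may differ.

```python
import math
from collections import Counter


def cross_entropy(reference, query, n=3, alphabet="HEC"):
    """Average bits per n-gram of `query` under an n-gram model of `reference`.

    Assumption: add-one (Laplace) smoothing over a fixed alphabet, so that
    n-grams unseen in `reference` still receive a non-zero probability.
    """
    if len(query) < n or len(reference) < n:
        raise ValueError("both sequences must contain at least n symbols")

    # Count every n-gram in the reference sequence.
    grams = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))

    # Count each (n-1)-symbol context as the number of n-grams that start with it.
    contexts = Counter()
    for gram, count in grams.items():
        contexts[gram[:-1]] += count

    total_bits = 0.0
    num_grams = len(query) - n + 1
    for i in range(num_grams):
        gram = query[i:i + n]
        # Smoothed conditional probability of the last symbol given its context.
        p = (grams[gram] + 1) / (contexts[gram[:-1]] + len(alphabet))
        total_bits -= math.log2(p)
    return total_bits / num_grams


def symmetric_distance(a, b, n=3, alphabet="HEC"):
    """Average of the two directed cross-entropies (an illustrative choice)."""
    return 0.5 * (cross_entropy(a, b, n, alphabet) + cross_entropy(b, a, n, alphabet))


if __name__ == "__main__":
    protein_a = "HHHHCCEEEECCHHHH"   # hypothetical secondary-structure strings
    protein_b = "CECECECECECECECE"
    print(cross_entropy(protein_a, protein_a))
    print(cross_entropy(protein_a, protein_b))
    print(symmetric_distance(protein_a, protein_b))
```

A lower value means the query string is better predicted by the reference model, that is, the two sequences share more local regularities. Averaging the two directions is only one way to obtain a symmetric score; it is used here because cross-entropy itself is not symmetric.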
