Graph Theory-Based Sequence Descriptors as Remote Homology Predictors

Alignment-free (AF) methodologies have grown in popularity over the last decades as alternatives to alignment-based (AB) algorithms for comparative sequence analysis. They have been especially useful for detecting remote homologs within the twilight zone of highly diverse gene/protein families and superfamilies. The most popular AF methodologies, and their applications to classification problems, have been described in previous reviews. However, a newer set of graph theory-derived sequence/structural descriptors, which has been gaining relevance in remote homology detection, is usually omitted when AF predictors are reviewed. Here, we first go over the most popular AF approaches for detecting homology signals within the twilight zone, and then present the state-of-the-art tools that encode graph theory-derived sequence/structure descriptors and their success in identifying remote homologs. We also highlight the trend of integrating AF features/measures with AB ones, either within the same prediction model or by combining the predictions of different algorithms through voting/weighting strategies, to improve the detection of remote signals. Lastly, we briefly discuss the efforts made to scale up AB and AF features/measures for the comparison of multiple genomes and proteomes. Alongside the experience gained in remote homology detection with the most popular AF tools and with less-known ones, we report our own experience with the graphical–numerical methodologies MARCH-INSIDE, TI2BioP, and ProtDCal. We also present a new Python-based tool (SeqDivA) with a friendly graphical user interface (GUI) for delimiting the twilight zone using several similarity criteria.
