Alignment-free clustering of large data sets of unannotated protein conserved regions using minhashing

Background: Clustering of protein sequences is of key importance in predicting the structure and function of newly sequenced proteins and is also useful for their annotation. With the advent of multiple high-throughput sequencing technologies, new protein sequences are becoming available at an extraordinary rate. This rapid growth has impeded the deployment of existing protein clustering and annotation tools, which depend largely on pairwise sequence alignment.

Results: In this paper, we propose an alignment-free clustering approach, coreClust, for annotating protein sequences using detected conserved regions. The proposed algorithm uses Min-Wise Independent Hashing to identify similar conserved regions. Min-Wise Independent Hashing works by generating a (w,c)-sketch for each document and comparing these sketches. Our algorithm fits well within the MapReduce framework, permitting scalability. We show that coreClust generates results comparable to those of existing methods. In particular, we show that the clusters generated by our algorithm capture the subfamilies of the Pfam domain families for which the sequences in a cluster have a similar domain architecture. For a data set of 90,000 sequences (about 250,000 domain regions), the clusters generated by our algorithm achieve a 75% average weighted F1 score, our accuracy metric, when compared to the clusters generated by a semi-exhaustive pairwise alignment algorithm.

Conclusions: The new clustering algorithm can be used to generate meaningful clusters of conserved regions. It is a scalable method that, when paired with our prior work NADDA for detecting conserved regions, provides a complete end-to-end pipeline for annotating protein sequences.
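
The sketch-and-compare idea mentioned in the Results can be illustrated with a minimal Python example. It is not the coreClust implementation. It assumes that w is the shingle (k-mer) length and c is the number of hash functions in a (w,c)-sketch, and it stands in seeded BLAKE2b hashes for a min-wise independent family. The function names and the example sequences are hypothetical.

```python
import hashlib


def shingles(seq, w):
    """Return the set of all overlapping length-w substrings of seq."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def seeded_hash(shingle, seed):
    """64-bit hash of a shingle under one of c seeded hash functions.

    Assumption: seeded BLAKE2b approximates a min-wise independent family.
    """
    data = f"{seed}:{shingle}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(seq, w=3, c=64):
    """Build a sketch: for each of c hash functions, keep the minimum hash
    value over all w-shingles of the sequence."""
    sh = shingles(seq, w)
    if not sh:
        raise ValueError("sequence is shorter than the shingle length w")
    return [min(seeded_hash(s, seed) for s in sh) for seed in range(c)]


def estimated_resemblance(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; this estimates the Jaccard
    similarity of the two underlying shingle sets."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Hypothetical conserved regions, used only for illustration.
region_a = "MKVLAAGIVGLLLAQPSFAADKEELKKL"
region_b = "MKVLAAGIVGLLLAQPAFAADKEDLKKL"
similarity = estimated_resemblance(sketch(region_a), sketch(region_b))
print(f"estimated resemblance: {similarity:.2f}")
```

Because each sketch is computed from a single region independently of all others, sketch construction can be distributed across workers, which is consistent with the abstract's remark that the approach fits the MapReduce framework. Only the fixed-size sketches need to be compared afterwards.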
