Sequence Similarity Networks for the Protein Universe

As of November 2014, over 86 million protein sequences had been deposited in the TrEMBL database, of which only 0.5 million had experimental support for an enzymatic function. Protein databases currently depend heavily on homology-based predictions of enzyme function, yet it is estimated that only 50% of the predicted functions in UniProtKB are correct. The annotation process would benefit greatly from added expertise, while still maintaining a balance between careful curation and throughput. For sequence comparison and visualization, the sequence similarity network (SSN) is a computationally efficient alternative to the standard dendrogram. Making SSNs easily accessible to non-bioinformaticians allows enzymologists, microbiologists, and chemists to observe the sequence identity landscape for a protein family of interest and to select better-informed identity boundaries for the appropriate transfer of function via homology. This talk describes the efforts of the Enzyme Function Initiative to provide precomputed SSNs for e...