On identifying statistical redundancy at the level of amino acid subsequences

This paper presents a framework to characterize and identify local protein sequences that are statistically redundant, as measured by Shannon information content, while accounting for variation in their occurrences arising from evolutionary insertions, deletions, and substitutions of amino acids. Identifying such local sequences provides insights for downstream studies of proteins. We apply our methods to amino acid sequence data sets derived from a database corresponding to 935,552 substructural regions of varying sizes, covering 113,724 proteins from the Protein Data Bank. Among other results, we identify a surjective mapping from 110,598 local sequences (with an average length of 82 amino acids per sequence) onto 1,493 topological shapes. The C++ source code and supporting material are available from https://lcb.infotech.monash.edu.au/bibm2021.
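
To make the notion of redundancy under Shannon information content concrete, the following is a minimal sketch and not the method of this paper. It assumes a uniform null model in which each of the 20 standard amino acids costs log2(20) bits, and compares it against a zeroth-order model whose residue probabilities are estimated from the sequence itself. The function names nullModelBits, empiricalBits, and bitsSaved are hypothetical. The sketch ignores insertions, deletions, and substitutions, and it ignores the cost of stating the model, so it overstates the saving.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <string>

// Message length in bits under a uniform null model (assumption):
// every residue costs log2(20) bits, one of 20 standard amino acids.
double nullModelBits(const std::string& seq) {
    return static_cast<double>(seq.size()) * std::log2(20.0);
}

// Message length in bits under an empirical zeroth-order model:
// the sum over residues of -log2 p(residue), with p estimated from
// the residue counts of seq itself. Hypothetical simplification:
// the cost of transmitting the estimated probabilities is ignored.
double empiricalBits(const std::string& seq) {
    std::array<std::size_t, 256> counts{};  // zero-initialized counts per byte
    for (unsigned char c : seq) ++counts[c];

    const double n = static_cast<double>(seq.size());
    double bits = 0.0;
    for (std::size_t k : counts) {
        if (k > 0) {
            const double p = static_cast<double>(k) / n;
            bits -= static_cast<double>(k) * std::log2(p);
        }
    }
    return bits;
}

// Bits saved relative to the null model; a larger saving indicates
// a more compressible (statistically redundant) sequence in this toy setting.
double bitsSaved(const std::string& seq) {
    return nullModelBits(seq) - empiricalBits(seq);
}
```

Under these assumptions, a sequence such as "ACAC" costs 4 bits under the empirical model against 4 log2(20) bits under the null model, while a sequence that uses all 20 amino acids equally often yields no saving.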