A Visual Analytics approach to identifying protein structural constraints

Predicting protein structures has long been a grand-challenge problem. Fine-grained computational simulation of folding events from a protein's synthesis to its final stable structure remains computationally intractable. Therefore, methods that derive constraints from other sources are attractive. To date, constraints derived from known structures have proven highly successful. However, these cannot be applied to molecules with no identifiable neighbors having already-determined structures. For such molecules, structural constraints must be derived in other ways. One popular approach has been the statistical analysis of large families of proteins, in the hope that residues that “change together” (co-evolve) are in contact with one another. Unfortunately, despite repeated attempts to use these data to deduce structural constraints, this approach has met with minimal success. The consensus of the current literature is that there is simply too little information contained within the correlated mutations of many protein families to reliably and generally predict structural constraints. Recent work in my laboratory challenges this conclusion. For some time we have been developing methods (MAVL/StickWRLD) to visualize the pattern of co-evolved mutations within sequence families. While our analysis of individual correlations agrees with the literature consensus, we have recently discovered that the visualized pattern of correlations is highly suggestive of structural relationships. In our preliminary test cases, human researchers can unambiguously determine many positive structural constraints by visual analysis of statistical sequence information alone, often with no training in interpreting the visualization results. Herein we report the visualization design that supports this Visual Analytics approach to identifying high-confidence hypotheses about protein folding from protein sequence, and illustrate preliminary results from this research.
Our approach entails a higher-dimensional extension of parallel coordinates that illuminates distant shared sub-tuples of the vectors representing each protein sequence when these sub-tuples occur with an overabundance compared to expectations. It simultaneously eliminates all representations of tuples that occur with a frequency near the expected norm. The result is a minimally occluded representation of outlier, and only outlier, co-occurrences within the sequence families.
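
The filtering idea described above (keep only over-abundant co-occurrences, discard everything near expectation) can be sketched for the simplest case of residue pairs. This is a minimal illustration and not the MAVL/StickWRLD implementation: it assumes an independence null model (expected joint frequency is the product of the two per-column frequencies), and the function name `outlier_pairs`, the `threshold` cutoff, and the toy alignment are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def outlier_pairs(sequences, threshold=0.2):
    """Return (pos_i, res_a, pos_j, res_b, residual) for residue pairs whose
    observed joint frequency exceeds the independence expectation by more
    than `threshold` (an assumed, illustrative cutoff)."""
    n = len(sequences)
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to equal length")

    # Per-column residue frequencies: the "expected" building blocks.
    column_freqs = [
        {res: c / n for res, c in Counter(s[i] for s in sequences).items()}
        for i in range(length)
    ]

    hits = []
    for i, j in combinations(range(length), 2):
        # Observed co-occurrence counts for this pair of alignment columns.
        joint = Counter((s[i], s[j]) for s in sequences)
        for (a, b), count in joint.items():
            observed = count / n
            expected = column_freqs[i][a] * column_freqs[j][b]
            residual = observed - expected
            # Keep only over-abundant pairs; near-expected pairs are dropped.
            if residual > threshold:
                hits.append((i, a, j, b, residual))
    return hits


# Toy alignment (hypothetical data): columns 0 and 2 vary together.
toy_alignment = ["ACDA", "ACDC", "GCEA", "GCEC"]
for hit in outlier_pairs(toy_alignment):
    print(hit)
```

On the toy alignment only the pairings between columns 0 and 2 survive the filter; the fully conserved column 1 and the independently varying column 3 produce residuals of zero and are removed. A real analysis would likely need a proper significance test and a larger alphabet, and the visualization itself extends beyond pairs to longer sub-tuples.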