Choosing non‐redundant representative subsets of protein sequence data sets using submodular optimization

Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or the selection of "operational taxonomic units" from metagenomics data. A representative subset is a subset of sequences from the original data set that (1) minimizes the redundancy among the selected sequences, and (2) maximizes the representativeness of the subset; that is, every sequence in the full data set has at least one representative that is similar to it. The selected representative subset is then used in downstream analysis in place of the full data set. Previous methods for this task, such as CD-HIT, PISCES and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. These sequence selection methods are very widely used (for example, the CD-HIT papers have been cited a total of >3,000 times according to Google Scholar) and are a standard preprocessing step applied to data sets of protein sequences, cDNA sequences and microbial DNA. In this work, we propose a principled framework, Repset, for representative protein sequence subset selection using submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. Our approach involves defining a submodular objective function that quantifies the desirable properties of a given subset of sequences, and then applying a submodular optimization algorithm to choose a representative subset that maximizes this function. Framing this task as an optimization problem has two benefits. First, it allows us to leverage a large existing literature on submodular optimization. This led to the development of a method that is computationally efficient, empirically outperforms other methods, and, in contrast to all existing solutions to this problem, is backed by theoretical guarantees of its performance. In particular, Repset outperforms threshold-based methods on two measures: (1) representative subsets produced by Repset have lower redundancy, as measured by the pairwise similarity of sequences in the set, and (2) these subsets have greater structural diversity, as measured using the SCOPe library of protein domain structures. Second, the optimization-based framework gives the method great flexibility. The user can select one of a variety of objective functions to optimize according to their needs. For example, the user can minimize the redundancy of sequences in the subset, maximize how well the subset represents the full set, or optimize some combination of the two. The user can also choose to prefer some sequences over others, such as preferring long sequences over shorter ones. More broadly, this paper demonstrates the utility of submodular optimization for computational biology. Applying submodular optimization to a new problem takes two simple steps: (1) devise a submodular objective function, and (2) apply a standard optimization algorithm to this objective. Therefore, we believe that the strategy we employ here will have analogous applications to many other problems in computational biology.
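The two-step recipe described above (devise a submodular objective, then apply a standard optimizer) can be illustrated with a minimal sketch. It is not the Repset implementation. It assumes a facility-location objective, a standard monotone submodular function that scores representativeness as the sum, over all sequences, of each sequence's best similarity to a selected one, and it uses the classic greedy algorithm, which is guaranteed to reach at least a (1 - 1/e) fraction of the optimal value for monotone submodular functions under a cardinality constraint. The function names and the toy similarity matrix are hypothetical, and similarities are assumed to be non-negative.

```python
def facility_location(sim, subset):
    """Representativeness of `subset`: for every item, take its best
    similarity to any selected item, then sum over all items."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim, k):
    """Greedy maximization: repeatedly add the item with the largest
    marginal gain in the facility-location objective, up to k items."""
    n = len(sim)
    chosen = []
    best = [0.0] * n  # best similarity of each item to the current subset
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in chosen:
                continue
            # Marginal gain of adding j: how much each item's best match improves.
            gain = sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        chosen.append(best_j)
        best = [max(best[i], sim[i][best_j]) for i in range(n)]
    return chosen


# Hypothetical toy data: items 0-2 form one similar group, items 3-4 another.
sim = [
    [1.0, 0.8, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.6, 0.0, 0.0],
    [0.8, 0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 0.5, 1.0],
]
chosen = greedy_select(sim, 2)
print(chosen, facility_location(sim, chosen))
```

On this toy input the greedy procedure picks one item from each group instead of two near-duplicates from the larger group, which is the behavior a representative subset should show. A redundancy penalty or per-sequence preference weights, as mentioned in the abstract, would be added by changing the objective; a non-monotone objective would call for a different optimizer than this simple greedy loop.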
