This paper describes an approach to data-driven discovery of decision trees or rules for assigning protein sequences to functional families using sequence motifs. This method is able to capture regularities that can be described in terms of presence or absence of arbitrary combinations of motifs. A training set of peptidase sequences labeled with the corresponding MEROPS functional families or clans is used to automatically construct decision trees that capture regularities that are sufficient to assign the sequences to their respective functional families. The performance of the resulting decision tree classifiers is then evaluated on an independent test set. We compared the rules constructed using motifs generated by a multiple sequence alignment based motif discovery tool (MEME) with rules constructed using expert annotated PROSITE motifs (patterns and profiles). Our results indicate that the former provide a potentially powerful high throughput technique for constructing protein function classifiers when adequate training data are available. Examination of the generated rules in the case of a Caspase (C14) family suggests that the proposed technique may be able to identify combinations of sequence motifs that characterize functionally significant 3-dimensional structural features of proteins.

1. BACKGROUND AND INTRODUCTION

Assigning putative functions to protein sequences remains one of the most challenging problems in functional genomics. The function of a protein depends to a large extent on its 3-dimensional structure; the shape of the protein both constrains and facilitates the ways in which the protein can interact with other proteins. Proteins with similar 3-dimensional structural features very often, but not always, have similar functions. However, experimental determination of protein structures using NMR or X-ray crystallography techniques is time consuming and expensive.
While there are 254,293 protein records in the PIR-PSD database [Release 70.01, Oct-2001] [Baker et al., 2001], there are only 14,339 experimentally determined 3-dimensional protein structures in the Protein Data Bank (PDB) [version 23-Oct-2001] [Berman et al., 2000], corresponding to approximately 3,000 different proteins. Hence, protein function prediction often relies on protein structure prediction using computational approaches. Ab initio methods that predict the conformation of a protein from its amino acid sequence are computationally very demanding and are currently limited to relatively short proteins or peptides [Samudrala et al., 1999]. Early work on protein pattern recognition [Dayhoff et al., 1983] suggested that short sequences of amino acids (motifs) may be conserved in a protein family. Currently, motif composition is often used to assign putative functions to novel protein sequences based on the known functions of other proteins that share one or more motifs with the novel protein. Several databases have been developed that contain motifs, e.g., PROSITE [Hofmann et al., 1999]; groups of motifs referred to as fingerprints or blocks, e.g., PRINTS [Attwood et al., 2000]; sequence patterns called profiles, often based on weight matrices or hidden Markov models generated from multiple sequence alignments, e.g., PROSITE [Hofmann et al., 1999]; or domains, e.g., Pfam [Bateman et al., 2000]. Such motif databases, or resources that integrate them, e.g., InterPro [Apweiler et al., 2001] and MetaFam [Silverstein et al., 2001], can be queried using a protein sequence to obtain a list of motifs that are found in the sequence as well as the functions or structures associated with these motifs. Motif-based techniques for protein function prediction focus similarity searches on parts of the protein that are likely to be functionally or structurally significant, and hence more likely to be conserved.
Current motif-based approaches to protein function prediction are not without drawbacks. Many proteins contain several motifs, and the same motif may be found in proteins belonging to several different functional families. More generally, it may be necessary to identify combinations of motifs that must be present, or perhaps even absent, in a sequence in order to reliably assign it to a functional family. Indeed, in the PRINTS database [Attwood et al., 2000], the fingerprints used to assign proteins to functional families can be simple motifs or combinations of motifs. However, the process of identifying a fingerprint for each protein family of interest can be labor intensive and requires considerable domain knowledge. Thus, there is a need for sophisticated tools that automate the discovery of sequence regularities predictive of protein function and allow efficient updating of databases.

In this paper, we test the feasibility of a fully automated approach for protein function classification. We present a data-driven approach to the discovery of rules for assigning protein sequences to functional families on the basis of the presence or absence of specific motifs or combinations of motifs. (For simplicity, we will use the term motif to include short conserved sequence patterns as well as profiles.) Machine learning algorithms [Mitchell, 1997] offer one of the most cost-effective approaches to automated discovery of a priori unknown predictive relationships from large data sets. Decision tree induction algorithms are relatively fast and produce rules that are easy for humans to interpret.

¹ This research was supported in part by grants from the National Science Foundation (9982341, 9972653), the Carver Foundation, and Pioneer Hi-Bred, Inc. This research has benefited from interactions with Dr. Dake Wang, Zhong Gao, Changhui Yan, and Carson Andorf of the Iowa State University Artificial Intelligence Research Laboratory.
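As a rough illustration of how a decision tree learner can recover a rule that combines the presence of one motif with the absence of another, the sketch below builds a small tree by repeatedly splitting on the motif with the highest information gain. This is a simplified ID3-style procedure written for illustration; it is not the C4.5 implementation (which, among other differences, uses gain ratio and pruning). The motif names `m1` to `m3`, the family labels, and the six training examples are hypothetical toy data, not taken from MEROPS, PROSITE, or MEME.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, motif):
    """Entropy reduction from splitting on one motif (absent=0 vs. present=1)."""
    remainder = 0.0
    for value in (0, 1):
        subset = [lab for row, lab in zip(rows, labels) if row[motif] == value]
        if subset:
            remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, motifs):
    """Return a family label (leaf) or a (motif, {0: subtree, 1: subtree}) node."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not motifs:
        return majority
    best = max(motifs, key=lambda m: information_gain(rows, labels, m))
    remaining = [m for m in motifs if m != best]
    branches = {}
    for value in (0, 1):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        if keep:
            branches[value] = build_tree([rows[i] for i in keep],
                                         [labels[i] for i in keep], remaining)
        else:
            branches[value] = majority  # no examples: fall back to majority class
    return (best, branches)

def rules(tree, conditions=()):
    """Yield (conditions, family) pairs, one per root-to-leaf path."""
    if not isinstance(tree, tuple):
        yield conditions, tree
        return
    motif, branches = tree
    for value in (0, 1):
        state = "present" if value else "absent"
        yield from rules(branches[value], conditions + (f"{motif} {state}",))

# Hypothetical toy data: motif m1 occurs in both families, so it cannot
# separate them alone; family_A is "m1 present AND m2 absent".
MOTIFS = ["m1", "m2", "m3"]
ROWS = [
    {"m1": 1, "m2": 0, "m3": 0},
    {"m1": 1, "m2": 0, "m3": 1},
    {"m1": 1, "m2": 1, "m3": 0},
    {"m1": 1, "m2": 1, "m3": 1},
    {"m1": 0, "m2": 0, "m3": 1},
    {"m1": 0, "m2": 1, "m3": 0},
]
LABELS = ["family_A", "family_A", "family_B", "family_B", "family_B", "family_B"]

tree = build_tree(ROWS, LABELS, MOTIFS)
for conditions, family in rules(tree):
    print(" AND ".join(conditions), "->", family)
```

On this toy data the learner splits first on `m2` and then on `m1`, giving three readable rules: `m2 absent AND m1 absent -> family_B`, `m2 absent AND m1 present -> family_A`, and `m2 present -> family_B`. The uninformative motif `m3` never appears in a rule.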
Machine learning approaches have previously been used for protein function classification. For example, King et al. [2001] investigated an inductive logic programming approach to the construction of protein function classifiers using alternative representations of protein sequences (amino acid residue frequencies, phylogeny, and predicted structure). In a previous study, we used the C4.5 family of decision tree induction algorithms [Quinlan, 1992] to discover rules for protein classification on the basis of the presence or absence of combinations of PROSITE motifs, with encouraging results [Wang et al., 2001]. That study demonstrated, for several protein families, that decision tree classifiers generated using PROSITE patterns and motifs can provide more accurate protein family classification than the use of a single characteristic motif.

PROSITE patterns are usually fairly short (fewer than 20 amino acids) and typically correspond to biologically significant sites experimentally identified in PROSITE functional families. PROSITE profiles, on the other hand, correspond to hidden Markov models that usually match longer sequence fragments (often over 100 amino acids). These longer profiles are useful as "signatures" for protein families, but they make it difficult to identify underlying sequence regularities that are predictive of protein function or that may correspond to biologically significant structural features. Here we explore whether it is possible to use relatively short, automatically generated motifs to discover rules for protein classification. A variety of automated approaches have been developed for the identification of motifs (see [Hudak and McClure, 1999] for a comparison of several such motif detection methods).
In this study, we used MEME (Multiple Expectation Maximization for Motif Elicitation) [Bailey et al., 1999], a multiple sequence alignment based motif discovery program that can be used to automate the construction of motif databases from any given set of sequences. For our data set, we chose a well-characterized subset of protein families from the MEROPS protease database [Release 5.4, 23-Mar-2000] [Rawlings et al., 2000]. We compared rules discovered using motifs automatically generated by MEME with those discovered using PROSITE patterns and profiles [Hofmann et al., 1999]. Further, we investigated the ability of decision trees to discover functionally significant structural features of proteins, using the caspase protease family as a test case.

2. DATA DRIVEN DISCOVERY OF RULES FOR PROTEIN FUNCTION CLASSIFICATION USING SEQUENCE MOTIFS

The basic computational problem is the following: given a database or training set of amino acid sequences corresponding to proteins with known (i.e., experimentally determined) function, induce a classifier that can assign novel protein sequences to one of the protein families represented in the training set. The general approach is illustrated in Figure 1.

Data Representation

The first step in this process is the preparation of a data set. Most algorithms for data-driven induction of pattern classifiers represent the instances to be classified using a fixed set of attributes. Hence, we first map each protein sequence into a corresponding attribute-based representation [Wang et al., 2001]. The choice of attributes plays a critical role in the data mining process. We represent protein sequences using a suitable vocabulary of sequence motifs. The set of motifs can be taken from an existing motif database (e.g., PROSITE) or identified by running a suitable motif-finding program (e.g., MEME) on the set of protein sequences.
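The attribute-based representation described above can be sketched as follows: each sequence is mapped to a fixed-length vector that records, for every motif in the vocabulary, whether the motif occurs in the sequence (1) or not (0). This is a minimal sketch under stated assumptions. The three motifs are hypothetical regular expressions chosen for illustration, not actual PROSITE entries or MEME output, and the sequences and the `family_X` and `family_Y` labels are made up. Real profile or MEME motifs would need a scoring-based matcher rather than a regular expression search.

```python
import re

# Hypothetical toy vocabulary: motif name -> regular expression.
# Illustrative only; not taken from PROSITE or generated by MEME.
MOTIFS = {
    "motif_A": re.compile(r"QAC[RQ]G"),
    "motif_B": re.compile(r"H.{2}H"),
    "motif_C": re.compile(r"G.S.G"),
}

def encode(sequence, motifs=MOTIFS):
    """Return one 0/1 flag per motif, in vocabulary order (1 = motif present)."""
    return [1 if pattern.search(sequence) else 0 for pattern in motifs.values()]

def build_training_set(labeled_sequences, motifs=MOTIFS):
    """Map (sequence, family) pairs to (bit_vector, family) pairs."""
    return [(encode(seq, motifs), family) for seq, family in labeled_sequences]

# Made-up sequences and family labels, for illustration only.
examples = [
    ("MKQACRGTTLHAAHV", "family_X"),
    ("MGDSGGPLHEEHW", "family_Y"),
]
training_set = build_training_set(examples)
```

Here the first sequence contains `motif_A` and `motif_B` but not `motif_C`, so it is encoded as `[1, 1, 0]`; the second is encoded as `[0, 1, 1]`. A classifier is then trained on these vectors and their labels rather than on the raw sequences.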
Suppose the vocabulary contains N motifs. Any given sequence typically contains only a few of these motifs. We encode each sequence as an N-bit binary pattern in which the i-th bit is 1 if the corresponding motif is present in the sequence and 0 otherwise. Each N-bit pattern is associated with a label that identifies the functional family of the sequence (if known). A training set is simply a collection of N-bit binary patterns, each of which has associa
[1] Peter B. McGarvey, et al. Protein Information Resource: a community resource for expert annotation of protein data. Nucleic Acids Res., 2001.
[2] S H Kaufmann, et al. Mammalian caspases: structure, activation, substrates, and functions during apoptosis. Annual review of biochemistry, 1999.
[3] M. O. Dayhoff, et al. Establishing homologies in protein sequences. Methods in enzymology, 1983.
[4] Amanda Clare, et al. The utility of different representations of protein sequence for predicting functional class. Bioinform., 2001.
[5] Neil D. Rawlings, et al. Handbook of proteolytic enzymes. 1998.
[6] James E. Johnson, et al. MetaFam: a unified classification of protein families. I. Overview and statistics. Bioinform., 2001.
[7] Rolf Apweiler, et al. The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucleic Acids Res., 1997.
[8] Vasant Honavar, et al. Data-Driven Generation of Decision Trees for Motif-Based Assignment of Protein Sequences to Functional Families. 2000.
[9] Ram Samudrala, et al. A Combined Approach for Ab Initio Construction of Low Resolution Protein Tertiary Structures from Sequence. Pacific Symposium on Biocomputing, 1999.
[10] Xia Wang, et al. Data-Driven Discovery of Rules for Protein Function Classification Based on Sequence Motifs. 2003.
[11] William Noble Grundy, et al. MEME, MAST, and Meta-MEME: New Tools for Motif Discovery in Protein Sequences. Pattern Discovery in Biomolecular Data, 1999.
[12] Thomas G. Dietterich. What is machine learning? Archives of Disease in Childhood, 2020.
[13] R. Beynon, et al. The astacin family of metalloendopeptidases. The Journal of biological chemistry, 1991.
[14] Mark A. Murcko, et al. Structure and mechanism of interleukin-1β converting enzyme. Nature, 1994.
[15] Amos Bairoch, et al. The PROSITE database, its status in 2002. Nucleic Acids Res., 2002.
[16] Amos Bairoch, et al. The PROSITE database, its status in 1999. Nucleic Acids Res., 1999.
[17] R A Bradshaw, et al. Eukaryotic methionyl aminopeptidases: two classes of cobalt-dependent enzymes. Proceedings of the National Academy of Sciences of the United States of America, 1995.
[18] Vasant Honavar, et al. Discovering Protein Function Classification Rules from Reduced Alphabet Representations of Protein Sequences. JCIS, 2002.
[19] T. N. Bhat, et al. The Protein Data Bank. Nucleic Acids Res., 2000.
[20] R A Sayle, et al. RASMOL: biomolecular graphics for all. Trends in biochemical sciences, 1995.
[21] J. Ross Quinlan, et al. C4.5: Programs for Machine Learning. 1992.
[22] Amos Bairoch, et al. The PROSITE database, its status in 1997. Nucleic Acids Res., 1997.
[23] Rolf Apweiler, et al. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 2000.
[24] James E. Johnson, et al. MetaFam: a unified classification of protein families. II. Schema and query capabilities. Bioinform., 2001.
[25] Terri K. Attwood, et al. PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res., 2000.
[26] Charles Elkan, et al. Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer. ISMB, 1994.
[27] M. A. McClure, et al. A Comparative Analysis of Computational Motif-Detection Methods. Pacific Symposium on Biocomputing, 1998.
[28] Alex Bateman, et al. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res., 2001.
[29] Robert D. Finn, et al. The Pfam protein families database. Nucleic Acids Res., 2004.