Voting Fuzzy k-NN to Predict Protein Subcellular Localization from Normalized Amino Acid Pair Compositions

Databanks contain a huge number of protein sequences whose functions are unknown. Because the biological functions of these proteins are closely correlated with their subcellular localization, it is important to develop a system that automatically predicts subcellular localization from sequence for large-scale genome analysis. In this paper, we first propose a new formula to estimate the composition of amino acid pairs for feature extraction, and then present a voting scheme that combines a set of fuzzy k-nearest-neighbor (k-NN) classifiers to predict subcellular locations. To detect sequence-order features, the individual classifiers are constructed using different types of features, including amino acid and amino acid pair compositions. We apply our method to several datasets and achieve significant improvements.
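
To make the pipeline concrete, the following is a minimal Python sketch of the general idea, not the implementation evaluated in this paper. It rests on several assumptions: the pair composition is a plain relative frequency and stands in for the normalized formula proposed here, which the abstract does not state; the fuzzy k-NN memberships use inverse-distance weighting with crisp training labels and a fuzzifier m = 2; and the classifiers are combined by a simple majority vote over their top-ranked classes. The function names (aa_composition, pair_composition, fuzzy_knn, vote) and the gap parameter are hypothetical.

```python
from collections import Counter
from itertools import product
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Fraction of each of the 20 standard amino acids in seq (20 values)."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def pair_composition(seq, gap=0):
    """Relative frequency of ordered residue pairs with `gap` residues between them.

    Assumption: plain frequencies, used as a stand-in for the paper's
    normalized estimate. Requires len(seq) >= gap + 2.
    """
    pairs = Counter((seq[i], seq[i + gap + 1]) for i in range(len(seq) - gap - 1))
    total = sum(pairs.values())
    return [pairs[(a, b)] / total for a, b in product(AMINO_ACIDS, repeat=2)]


def fuzzy_knn(train, query, k=3, m=2.0):
    """Return {label: membership} for query from its k nearest training vectors.

    train is a list of (vector, label) pairs. Each neighbor contributes a
    weight of 1 / distance**(2 / (m - 1)); memberships are normalized to sum to 1.
    """
    nearest = sorted((math.dist(vec, query), label) for vec, label in train)[:k]
    # Clamp the distance so an exact match does not divide by zero.
    weights = [(1.0 / max(d, 1e-12) ** (2.0 / (m - 1.0)), label) for d, label in nearest]
    total = sum(w for w, _ in weights)
    memberships = {}
    for w, label in weights:
        memberships[label] = memberships.get(label, 0.0) + w / total
    return memberships


def vote(train_seqs, query_seq, extractors, k=3):
    """Majority vote over one fuzzy k-NN classifier per feature extractor."""
    votes = Counter()
    for extract in extractors:
        train = [(extract(seq), label) for seq, label in train_seqs]
        memberships = fuzzy_knn(train, extract(query_seq), k=k)
        votes[max(memberships, key=memberships.get)] += 1
    return votes.most_common(1)[0][0]
```

Each extractor yields one classifier, so passing aa_composition together with pair_composition at several gap values gives an ensemble whose members see different sequence-order information. Other combination rules, such as summing the membership values across classifiers, would fit the same interface.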