Comparative Study of Various Genomic Data Sets for Protein Function Prediction and Enhancements Using Association Analysis

The prediction of protein function is a key task in bioinformatics, and a variety of techniques and data sets have been employed for that purpose. Using the popular keyword recovery measure, which is based on the standard keyword annotations of the SwissProt database, this paper presents a comparative study of the information that different types of data sets provide for protein function prediction: phylogenetic profiles, protein interaction networks, and gene expression data. The technique employed is to evaluate the average keyword recovery achieved when the top (most strongly connected or most similar) pairs of proteins are taken from each data set. The results show that protein interaction data contain the most information, followed by gene expression data and, finally, phylogenetic profiles. In addition, the average keyword recovery is computed for the top pairs derived from the raw protein interaction data using h-confidence, a measure from the data mining area of association analysis. This approach gives improved results over raw protein interaction data and even better results when applied to protein complexes that were computationally generated using the raw protein complex data. The paper also briefly discusses the fact that the different data types appear to be