Using Google's PageRank Algorithm to Identify Important Attributes of Genes

In our research, we applied the PageRank algorithm to identify important attributes of genes. The Google search engine uses PageRank to assign a numerical weight to each web page and to rank search results. We found that PageRank is a very effective method for selecting significant attributes in high-dimensional data, especially gene expression data. Clustering high-dimensional data requires substantial resources in both time and memory, and some attributes are generally more important than others. Restricting the clustering of gene expression data to the important genes can therefore yield better clusters in less time and space. We use yeast gene expression data from the Stanford Microarray Database: four datasets, each with approximately 4905 genes and 5 to 14 expression levels. We calculate the correlation matrix between these genes from their expression levels across the experiments. We used Weka (data mining software written in Java) to generate and validate a naïve Bayes classifier, and found that data ranked with the PageRank algorithm produces better classification than the raw data. A classifier built on the raw data classifies attributes with 48-50% accuracy, whereas one built on the PageRanked data reaches 62-64% accuracy.
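The ranking step can be illustrated with a minimal sketch. The abstract does not say how the correlation matrix is turned into a graph, so the following rests on assumptions: genes are nodes, the absolute Pearson correlation between two expression profiles is the edge weight, the damping factor is 0.85, and the scores are found by power iteration. The function names (pearson, correlation_matrix, pagerank) and the toy profiles are hypothetical and chosen only for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def correlation_matrix(profiles):
    """Gene-by-gene correlation matrix from a list of expression profiles."""
    n = len(profiles)
    return [[pearson(profiles[i], profiles[j]) for j in range(n)] for i in range(n)]


def pagerank(weights, damping=0.85, tol=1e-10, max_iter=1000):
    """Weighted PageRank by power iteration.

    Assumption: edge weight = |correlation|, self-loops removed.
    Nodes with no outgoing weight spread their rank uniformly.
    """
    n = len(weights)
    w = [[0.0 if i == j else abs(weights[i][j]) for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        dangling = sum(rank[i] for i in range(n) if out[i] == 0)
        new = []
        for j in range(n):
            incoming = sum(rank[i] * w[i][j] / out[i] for i in range(n) if out[i] > 0)
            new.append((1 - damping) / n + damping * (incoming + dangling / n))
        converged = sum(abs(a - b) for a, b in zip(new, rank)) < tol
        rank = new
        if converged:
            break
    return rank


# Toy data (hypothetical): three genes measured in four experiments.
profiles = [
    [1.0, 2.0, 3.0, 4.0],  # gene 0
    [2.0, 4.0, 6.0, 8.0],  # gene 1, perfectly correlated with gene 0
    [5.0, 5.0, 5.0, 5.0],  # gene 2, constant, so uncorrelated with the others
]
ranks = pagerank(correlation_matrix(profiles))
# Genes sorted from highest to lowest score.
order = sorted(range(len(ranks)), key=lambda i: ranks[i], reverse=True)
```

In this toy case the two correlated genes receive equal scores and the constant gene scores lowest. Selecting the top-scoring genes before classification or clustering is one way such scores might be used; the threshold and the graph construction would need to be tuned to the actual data.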
