ReMark: an automatic program for clustering orthologs flexibly combining a Recursive and a Markov clustering algorithms

SUMMARY ReMark is a fully automatic tool that clusters orthologs by combining a recursive clustering algorithm with the Markov clustering (MCL) algorithm. In the first step, ReMark runs RecursiveClustering.java to detect ortholog pairs as reciprocal best BLAST hits between multiple genomes and to cluster them recursively. In the second step, it runs MarkovClustering.java, which applies MCL to the score matrices generated in the first step and refines the clusters by adjusting an inflation factor. The method has two key features. First, it uses the diagonal scores in the matrix of each initial ortholog cluster to obtain more reliable results. Second, it clusters orthologs flexibly, because MCL controls cluster granularity through the selected inflation factor. Users can therefore choose the state of the orthologous protein clusters that best fits their research interests by tuning the inflation factor.

AVAILABILITY AND IMPLEMENTATION The source code of the orthologous protein clustering software is freely available for non-commercial use at http://dasan.sejong.ac.kr/~wikim/notice.html. It is implemented in Java 1.6 and runs on Windows and Linux.
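As a rough illustration of the MCL step described in the summary, the following minimal sketch alternates expansion (matrix squaring) and inflation (entry-wise power followed by column normalization) on a small score matrix. It is not the code of MarkovClustering.java: the class name `MclSketch`, its method names, the convergence thresholds, and the example scores are all hypothetical, and the way ReMark itself uses the diagonal scores is not reproduced here. The sketch avoids newer language features so that it should also compile on older Java versions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of Markov clustering; not ReMark's own implementation. */
final class MclSketch {

    /** Rescales each column in place so that it sums to 1. */
    static void normalizeColumns(double[][] m) {
        int n = m.length;
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                sum += m[i][j];
            }
            if (sum > 0.0) {
                for (int i = 0; i < n; i++) {
                    m[i][j] /= sum;
                }
            }
        }
    }

    /** Expansion step: returns the matrix product m * m. */
    static double[][] expand(double[][] m) {
        int n = m.length;
        double[][] out = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double v = m[i][k];
                if (v == 0.0) {
                    continue;
                }
                for (int j = 0; j < n; j++) {
                    out[i][j] += v * m[k][j];
                }
            }
        }
        return out;
    }

    /** Inflation step: raises every entry to the power r, then renormalizes the columns. */
    static void inflate(double[][] m, double r) {
        for (double[] row : m) {
            for (int j = 0; j < row.length; j++) {
                row[j] = Math.pow(row[j], r);
            }
        }
        normalizeColumns(m);
    }

    /** Clusters the items of a square, non-negative score matrix; returns lists of item indices. */
    static List<List<Integer>> cluster(double[][] scores, double inflation, int maxIter) {
        int n = scores.length;
        double[][] m = new double[n][];
        for (int i = 0; i < n; i++) {
            m[i] = scores[i].clone(); // leave the caller's matrix untouched
        }
        normalizeColumns(m);
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] next = expand(m);
            inflate(next, inflation);
            double change = 0.0;
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    change = Math.max(change, Math.abs(next[i][j] - m[i][j]));
                }
            }
            m = next;
            if (change < 1e-9) { // assumed convergence threshold
                break;
            }
        }
        // Each row that still carries weight acts as an attractor; the columns it
        // reaches form one cluster. Items are assigned to the first such row only.
        List<List<Integer>> clusters = new ArrayList<List<Integer>>();
        boolean[] assigned = new boolean[n];
        for (int i = 0; i < n; i++) {
            List<Integer> members = new ArrayList<Integer>();
            for (int j = 0; j < n; j++) {
                if (m[i][j] > 1e-6 && !assigned[j]) {
                    members.add(j);
                    assigned[j] = true;
                }
            }
            if (!members.isEmpty()) {
                clusters.add(members);
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Made-up similarity scores for four proteins; the diagonal holds self-scores.
        double[][] scores = {
            {100, 80, 0, 0},
            {80, 100, 5, 0},
            {0, 5, 100, 90},
            {0, 0, 90, 100}
        };
        System.out.println(cluster(scores, 2.0, 100));
    }
}
```

With an inflation factor of 2.0, the weak link between proteins 1 and 2 is suppressed and the example prints `[[0, 1], [2, 3]]`. In MCL generally, larger inflation values tend to produce finer clusters and smaller values coarser ones, which is the knob the summary refers to.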
