CUBIC: search for binding sites

Gene transcription is regulated through specific interactions between transcription factors and their binding sites in the upstream region of the regulated gene. Correctly identifying these binding sites is a key challenge in computational biology. Our approach is to find a "clear" cluster in the space of all k-mers drawn from the upstream regulatory regions of a set of genes that potentially share similar binding sites. The cluster is identified with a minimum spanning tree (MST) technique, using a special distance between k-mers that is based on the chosen profile. We show that the widely used per-position "conservation" characteristic follows from a "common sense" requirement for "conservation". We prove the local convergence of the algorithm that maximizes the "conservation" of a profile, and we present a method for evaluating the statistical significance of the results. All of these ideas have been implemented in the software CUBIC.
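
The k-mer clustering idea can be illustrated with a minimal sketch. It is not the CUBIC implementation: the profile-based distance is not specified here, so plain Hamming distance is used as a stand-in, and the rule of cutting MST edges above a fixed threshold (`max_edge`) is an assumption made for illustration only. All function names are hypothetical.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Count mismatching positions (assumed stand-in for a profile-based distance)."""
    return sum(x != y for x, y in zip(a, b))


def minimum_spanning_tree(points, dist):
    """Prim's algorithm on the complete graph; returns edges as (i, j, weight)."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known connection of each point to the tree
    parent = [-1] * n
    best[0] = 0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def clusters_from_mst(n, edges, max_edge):
    """Drop MST edges heavier than max_edge and return the connected components."""
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for i, j, w in edges:
        if w <= max_edge:
            root[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


if __name__ == "__main__":
    # Toy upstream regions (made-up sequences, for illustration only).
    upstream = ["TTACGTACGG", "CCACGTAATT", "GGACGAACAA"]
    points = [km for seq in upstream for km in kmers(seq, 6)]
    tree = minimum_spanning_tree(points, hamming)
    for cluster in clusters_from_mst(len(points), tree, max_edge=1):
        if len(cluster) > 1:
            print([points[i] for i in cluster])
```

In this sketch a tight group of k-mers joined by short MST edges plays the role of the "clear" cluster. A real run would replace `hamming` with the profile-based distance and use a statistically motivated criterion in place of the fixed threshold.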