Computational analysis of transcriptional regulatory elements: a field in flux

Sequence analytic methods have played a role in the understanding of transcriptional regulation for many years; for example, an alignment of E. coli promoter regions showing conserved upstream regions was reported by Pribnow in 1975. In recent years there has been a tremendous increase in experimental work aimed at understanding the fundamental biochemistry of transcription initiation, as well as the mechanisms that regulate gene expression at the level of transcription. There is currently great interest in developing new computational methods as well. This interest is driven partly by the new biological understanding (to which quantitative models will, we hope, contribute in turn), and partly by the need to analyse newly determined genomic sequences efficiently.
Computational biologists interested in transcription have made good progress in the last few years. For example, an improved ability to describe the DNA-binding specificity of proteins involved in transcription lies at the foundation of much of the mathematical modelling in the field. It is generally recognized that consensus sequences are usually inadequate to describe DNA-binding specificity, and it is now most common to describe the binding sites of a particular protein as the set of sequences scoring above a particular threshold with a Positional Weight Matrix (PWM). Considerable theoretical work, and some experimental effort, have gone into developing algorithms to derive a PWM from known binding sites, establishing to what extent the description of specificity by means of a PWM is valid, and recording PWMs for particular proteins.
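The PWM description above can be made concrete with a minimal sketch. It assumes a log-odds matrix built from pre-aligned sites of equal length, a uniform background base frequency of 0.25, and a pseudocount of 1; these choices, the function names, and the toy sites are illustrative assumptions, not a method or data taken from the text.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from aligned, equal-length sites.

    Assumes a uniform background frequency; real genomes rarely have one.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        # One dict per position: base -> log2(observed frequency / background)
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score(pwm, window):
    """Sum the per-position weights for a window as wide as the matrix."""
    return sum(column[base] for column, base in zip(pwm, window))


def find_sites(pwm, sequence, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        window_score = score(pwm, window)
        if window_score >= threshold:
            hits.append((start, window, window_score))
    return hits


# Toy, made-up aligned sites resembling a -10 box; NOT experimental data.
toy_sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT", "GATAAT"]
toy_pwm = build_pwm(toy_sites)
for start, window, window_score in find_sites(toy_pwm, "GGCTATAATGC", threshold=5.0):
    print(start, window, round(window_score, 2))
```

The threshold is the free parameter that turns a scoring matrix into a set of predicted binding sites: raising it trades sensitivity for specificity. In practice it would have to be calibrated against known sites, which this sketch does not attempt.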
Other important areas of progress include computer programs for the recognition of eukaryotic promoters that, for the first time, have error rates low enough to make them of practical interest, and a number of recently developed data collections that are either more complete or more consistent than what is available in the primary sequence databases. It may also be counted as progress that some early errors of the field have now been corrected: (1) it is now widely recognized that transcriptional regulation is exceedingly complex, and that algorithms must take into account alternative pathways and the synergism of multiple transcription factors, and (2) cross-validation techniques are now commonly employed in the benchmarking of new algorithms for functional prediction.
We feel that one of the main needs in the field is simply for better communication. Experimentalists often do not take advantage of the best computational techniques; algorithm developers often base their methods on an overly simplified view of the biology; computer scientists do not use the best data collections; and mathematicians sometimes show an aversion to learning about powerful machine learning techniques. In order to promote communication and collaboration, and to assess the state of the art, we organized the first International Workshop on Computational Analysis of Eukaryotic Transcriptional Regulatory Elements, at the Deutsches Krebsforschungszentrum in Heidelberg, in January 1996. We were very pleased to have a highly interdisciplinary meeting of about 70 people, with participation from sequence analysts, pure experimentalists, computer scientists, researchers working on the nucleosome positioning problem, microscopists using computers for image processing, structural biologists interested in the 3D structure of promoters and gene regulatory proteins, and experts from the neighbouring field of prokaryotic gene transcription.
It was particularly encouraging that there were several presentations at the meeting by groups comprising both experimental and computational biologists, and that communication between the experimental and the computational side seemed to be excellent. The interdisciplinary nature of the meeting also helped to focus attention on the primary scientific goal that all participants share: to understand, by modelling and model-testing, the transcription initiation event and its use in the regulation of gene expression. It remains controversial whether function can be determined from the DNA sequence, at the level of symbol manipulation, without reference to 3D structure or other more biological representations of the data. (Most sequence analysis developers tacitly assume that this is possible, although it may be unwise to do so.) What is clear is that the work of a person in any one discipline will be much more effective if he or she is willing to understand and make use of the results of related disciplines. Experience to date suggests that in the analysis of transcriptional regulatory elements, more than in other domains of

[1] R. Knüppel, et al. TRANSFAC Retrieval Program: A Network Model Database of Eukaryotic Transcription Regulating Sequences and Proteins. J. Comput. Biol., 1994.

[2] R. Tjian, et al. Molecular machines that control genes. Scientific American, 1995.

[3] P. Bucher. Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. Journal of Molecular Biology, 1990.

[4] R. Staden, et al. Graphic methods to determine the function of nucleic acid sequences. Nucleic Acids Res., 1984.

[5] J. Fickett. Coordinate positioning of MEF2 and myogenin binding sites. Gene, 1996.

[6] R. Staden, et al. Methods to define and locate patterns of motifs in sequences. Comput. Appl. Biosci., 1988.

[7] M. Waterman, et al. Pattern recognition in several sequences: consensus and alignment. Bulletin of Mathematical Biology, 1984.

[8] R. Harr, et al. Search algorithm for pattern match analysis of nucleic acid sequences. Nucleic Acids Research, 1983.

[9] D. S. Prestridge. Predicting Pol II promoter sequences using transcription factor binding sites. Journal of Molecular Biology, 1995.

[10] A. A. Reilly, et al. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, 1990.

[11] R. Staden. Searching for patterns in protein and nucleic acid sequences. Methods in Enzymology, 1990.

[12] D. Pribnow. Nucleotide sequence of an RNA polymerase binding site at an early T7 promoter. Proceedings of the National Academy of Sciences of the United States of America, 1975.

[13] G. Stormo, et al. Identifying protein-binding sites from unaligned DNA fragments. Proceedings of the National Academy of Sciences of the United States of America, 1989.

[14] R. Entriken, et al. Escherichia coli promoter sequences predict in vitro RNA polymerase selectivity. Nucleic Acids Res., 1984.

[15] P. V. von Hippel, et al. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. Journal of Molecular Biology, 1987.

[16] G. Stormo. Computer methods for analyzing sequence recognition of nucleic acids. Annual Review of Biophysics and Biophysical Chemistry, 1988.

[17] D. Ghosh, et al. A relational database of transcription factors. Nucleic Acids Research, 1990.