Dotplot: a program for exploring self-similarity in millions of lines of text and code

Abstract. An interactive program, dotplot, has been developed for browsing millions of lines of text and source code, using an approach borrowed from biology for studying homology (self-similarity) in DNA sequences. With conventional browsing tools such as a screen editor, it is difficult to identify structures that are too big to fit on the screen. In contrast, with dotplots we find that many of these structures show up as diagonals, squares, textures, and other visually recognizable features, as illustrated by examples selected from biology and two new application domains: text (AP news, Canadian Hansards) and source code (5ESS®). In an attempt to isolate the mechanisms that produce these features, we have synthesized similar features in dotplots of artificial sequences. We also introduce an approximation that makes the calculation of dotplots practical for use in an interactive browser.
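
To make the basic idea concrete, here is a minimal sketch of an exact dotplot over a short token sequence. It is not the program described in the abstract: the function names `dotplot` and `render`, the whitespace tokenization, and the exact-equality match rule are all assumptions made for illustration, and the approximation that the abstract mentions is not reproduced. The sketch fills every cell, so it costs quadratic time and space and suits only tiny inputs.

```python
def dotplot(tokens):
    """Return an n x n matrix of 0/1 values.

    Cell (i, j) is 1 when tokens[i] == tokens[j] (assumed match rule).
    """
    n = len(tokens)
    return [[1 if tokens[i] == tokens[j] else 0 for j in range(n)]
            for i in range(n)]


def render(matrix):
    """Draw the matrix as text: '#' for a match, '.' for a mismatch."""
    return "\n".join(
        "".join("#" if cell else "." for cell in row) for row in matrix
    )


# Hypothetical input: the run "a b c" appears twice, separated by "x".
tokens = "a b c x a b c".split()
plot = dotplot(tokens)

if __name__ == "__main__":
    print(render(plot))
```

In the printed grid, the repeated run `a b c` shows up as two short off-diagonal lines parallel to the main diagonal, which is the kind of diagonal feature the abstract refers to.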
