Substring matching for clone detection and change tracking

Legacy systems pose problems to maintainers that can be solved partially with effective tools. A prototype tool for determining collections of files sharing a large amount of text has been developed and applied to a 40 megabyte source tree containing two releases of the gcc compiler. Similarities in source code and documentation corresponding to software cloning, movement and inertia between releases, as well as the effects of preprocessing easily stand out in a way that immediately conveys nonobvious structural information to a maintainer taking responsibility for such a system.<<ETX>>