Text pattern visualization for analysis of biology full text and captions

Large textbanks comprising thousands of full-text biology papers are rapidly becoming available. We describe an approach that characterizes all major language patterns in biology text in terms of frameworks. A framework is a "container" made up of the common phrases that surround specific informational items, such as gene and protein names. We have developed a framework viewer that shows similar text frameworks aligned on the screen, much as biosequence visualization tools show aligned sequences. The viewer makes it evident that frameworks can find the types of structures needed to develop useful information retrieval systems. As a simple example, one framework concisely selected 45,000 nouns from a corpus of 5 million words without error. This work points the way to highly automated systems that can extract and index information in biology textbanks. Work in progress includes extensions to characterize recursive structures in text, subsystems to retrieve figures in papers, and the discovery of semantic relations to aid concept-based retrieval.
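
To make the idea of a framework concrete, the following minimal Python sketch groups words by the phrases that surround them and prints the matches in aligned rows, loosely in the spirit of a sequence alignment view. It is an illustration under simplifying assumptions, not the system described here: the function names `collect_frameworks` and `align`, the fixed two-word context on each side, the whitespace tokenization, and the toy sentences are all hypothetical.

```python
from collections import defaultdict


def collect_frameworks(sentences, width=2):
    """Group each word (the slot filler) by the `width` words on either side.

    Assumption: a framework is approximated as a (left context, right context)
    pair of fixed-length word tuples. Real frameworks may be more flexible.
    """
    frameworks = defaultdict(list)
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenization
        for i in range(width, len(tokens) - width):
            left = tuple(tokens[i - width:i])
            right = tuple(tokens[i + 1:i + 1 + width])
            frameworks[(left, right)].append(tokens[i])
    return frameworks


def align(left, right, fillers):
    """Render one framework as rows, padding fillers so the contexts line up."""
    pad = max(len(f) for f in fillers)
    return [f"{' '.join(left)} [{f:<{pad}}] {' '.join(right)}" for f in fillers]


# Toy sentences invented for illustration only.
sentences = [
    "expression of the brca1 gene was reduced",
    "expression of the tp53 gene was elevated",
    "deletion of the myc gene was lethal",
]

frameworks = collect_frameworks(sentences)
key = (("of", "the"), ("gene", "was"))
rows = align(key[0], key[1], frameworks[key])
print("\n".join(rows))
```

In this toy setting, the framework "of the [ ] gene was" collects the three gene names as its slot fillers. Ranking frameworks by how many distinct fillers they capture would be one simple way to surface the most productive patterns in a larger corpus.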