Learning to extract and classify names from text

Virtually all analytic tools, such as timeline and spatial analysis, require structured data; however, much data exists as text, an unstructured form. This article presents a new technology to bridge the gap between data buried in text and the structured data that analysis requires. The outcome should be an easy-to-maintain information technology component to support DoD and law enforcement applications. Our new approach uses statistical pattern recognition to learn to find data that is locally identifiable, i.e., data whose recognition does not depend heavily on the surrounding context. Examples are person names, organization names, locations, dates, times, monetary amounts, phone numbers, addresses, and Social Security numbers. The paper describes the statistical model employed, compares and contrasts the approach with previous approaches, numerically evaluates the adequacy of the technology on Government-supplied data, and illustrates the kinds of examples in documents that the system needs in order to learn to recognize the desired data.