Accurate Unsupervised Learning of Field Structure Models for Information Extraction

The applicability of current information extraction techniques is severely limited by the need for supervised training data. We demonstrate that for certain field structured extraction tasks, small amounts of prior knowledge can be used to effectively learn models in a primarily unsupervised fashion. Many text information sources exhibit a latent field structure: such documents can be viewed as dense sequences of semantically coherent fields. Examples include classified advertisements and bibliographic citations, which we investigate here. Although hidden Markov models (HMMs) provide a suitable generative model for field structured text, general unsupervised HMM learning fails to learn useful structure in either of our domains. However, we show that one can dramatically improve the quality of the learned structure by exploiting simple prior knowledge of the desired solutions. In both domains, unsupervised methods can attain accuracies comparable to simple supervised methods trained on the same data, using a combination of structural model constraints and targeted initializations.