A Language Modeling Approach to Metadata for Cross-Database Linkage and Search

This research demonstrates that language models are a sound and effective foundation on which to build large-scale, distributed information systems for government applications. It offers an alternative to human-generated metadata for locating information resources. Manual indexing is expensive, and studies show that human indexers are inconsistent and inaccurate, which leads to poor retrieval effectiveness. Generating content descriptions automatically from the markup and structure of documents is less expensive and, when coupled with good search techniques, can be used to locate relevant information more consistently. The evaluation testbeds for our research have been government databases such as those found in FedStats and GPO.
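
As a rough illustration of the idea, the sketch below describes each database by a unigram language model built from its own text, then ranks databases by how likely each model is to generate a query. The abstract does not specify the model, so the details here are assumptions: the function names (`describe`, `rank_databases`), the linear smoothing with weight `lam`, and the two toy databases are all hypothetical and stand in for whatever the actual system uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def describe(documents):
    """Build a content description for one database: unigram term counts."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


def rank_databases(query, descriptions, lam=0.5):
    """Rank database names by the smoothed log-likelihood of the query.

    Each term probability mixes the database model with a background
    model pooled over all databases (an assumed smoothing choice).
    """
    background = Counter()
    for counts in descriptions.values():
        background.update(counts)
    bg_total = sum(background.values())

    scores = {}
    for name, counts in descriptions.items():
        total = sum(counts.values())
        score = 0.0
        for term in tokenize(query):
            p_db = counts[term] / total if total else 0.0
            p_bg = background[term] / bg_total if bg_total else 0.0
            p = lam * p_db + (1 - lam) * p_bg
            if p == 0.0:
                # Term occurs in no database, so it cannot change the ranking.
                continue
            score += math.log(p)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy databases, for illustration only.
descriptions = {
    "labor_stats": describe([
        "Monthly unemployment rate by state",
        "Employment and wages survey",
    ]),
    "federal_register": describe([
        "Notice of proposed rule on emissions",
        "Final rule on pipeline safety",
    ]),
}

ranking = rank_databases("unemployment rate", descriptions)
```

With these toy inputs, `ranking` lists `labor_stats` ahead of `federal_register`, because only the first description contains the query terms. No human-assigned metadata is involved: the description is derived entirely from the database's own text.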