Temporal Language Models for the Disclosure of Historical Text

Historical and heritage collections consist largely of text and may incorporate diverse text types such as journals, archival documents, and catalogue descriptions. Because of the historical distance, access to this content is not straightforward. Historical variants of text are often harder to identify and retrieve than modern variants, owing to less standardized spelling, the effect of ongoing language change, and different word (de)compounding principles. Moreover, more words are ambiguous because one or more meaning shifts may have occurred. Common full-text search tools can only be applied successfully by users who are able to formulate queries with (a) knowledge of historical language and (b) insight into the relevant time span over which the words have evolved. This paper explores techniques which may compensate for these linguistic obstacles: linking contemporary search terms to their historical equivalents, and 'dating' texts. We aim to restore the diachronic relationship between terms, which may be obscured by language evolution and usage, by applying statistical language models. These models may support the automatic detection of semantic similarities between words and of word ambiguities, and they also make it possible to classify a text according to the time span from which it originates. This approach involves building temporal profiles of words as longitudinal sections of a reference corpus, and temporal language models as cross sections. Section 2 presents some detailed examples of the added value of this approach, both for the accessibility of historical content and for the detection of language change in relatively recent corpora from the news domain. Section 3 gives an overview of related work, plus some technical background on statistical language models. Section 4 describes the proposed methodology in more detail, and section 5 describes some experiments with it in the news domain.
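To make the distinction between cross sections and longitudinal sections concrete, the following is a minimal sketch and not the method of the paper itself. It assumes unigram models, one per time span, interpolated with a background model pooled over the whole reference corpus; the interpolation weight `lam`, the whitespace tokenizer, the function names, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a toy example.
    return text.lower().split()


def build_models(corpus_by_period):
    """Count unigrams per time span (cross sections) and pooled over all spans."""
    period_counts = {
        period: Counter(tok for doc in docs for tok in tokenize(doc))
        for period, docs in corpus_by_period.items()
    }
    background = Counter()
    for counts in period_counts.values():
        background.update(counts)
    return period_counts, background


def log_likelihood(tokens, counts, background, lam=0.5):
    """Log-probability of tokens under one period model, interpolated with the
    background model. lam is a hypothetical weight, not a tuned value."""
    total = sum(counts.values())
    bg_total = sum(background.values())
    vocab = len(background) + 1
    score = 0.0
    for tok in tokens:
        p_period = counts[tok] / total if total else 0.0
        # Add-one smoothing keeps the background probability above zero.
        p_background = (background[tok] + 1) / (bg_total + vocab)
        score += math.log(lam * p_period + (1 - lam) * p_background)
    return score


def date_text(text, period_counts, background, lam=0.5):
    """Assign the text to the time span whose model gives it the highest likelihood."""
    tokens = tokenize(text)
    return max(
        period_counts,
        key=lambda p: log_likelihood(tokens, period_counts[p], background, lam),
    )


def temporal_profile(word, period_counts):
    """Relative frequency of one word in each time span (a longitudinal section)."""
    return {
        period: (counts[word] / sum(counts.values()) if counts else 0.0)
        for period, counts in period_counts.items()
    }


# Hypothetical two-period reference corpus, invented for this sketch.
corpus = {
    "1900s": ["the wireless telegraph carried the news",
              "the motor carriage passed the omnibus"],
    "2000s": ["the website carried the news",
              "the car passed the bus on the motorway"],
}
period_counts, background = build_models(corpus)
guess_old = date_text("a telegraph and an omnibus", period_counts, background)
guess_new = date_text("the website and the bus", period_counts, background)
profile = temporal_profile("omnibus", period_counts)
```

In this toy setting `guess_old` comes out as the earlier span and `guess_new` as the later one, while `profile` shows "omnibus" occurring only in the earlier span. A profile that drops to zero in this way is the kind of signal that could prompt a search for a contemporary equivalent of a historical term.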
