TELLTALE: Experiments in a Dynamic Hypertext Environment for Degraded and Multilingual Data

Methods and tools for finding documents relevant to a user's needs in document corpora can be found in the information retrieval, library science, and hypertext communities. Typically, these systems provide retrieval capabilities for fairly static corpora. Their algorithms depend on the language for which they were written (e.g., English), and they perform poorly when presented with misspelled words or with text that has been degraded by OCR (optical character recognition). In this article, we present experimental results for the TELLTALE system. TELLTALE is a dynamic hypertext environment that provides full-text search through a hypertext-style user interface for text corpora that may be garbled by OCR or transmission errors and that may contain languages other than English. TELLTALE uses several techniques based on n-grams (sequences of n consecutive characters of text). These results show that the dynamic linkage mechanisms in TELLTALE tolerate garbles in up to 30% of the characters in the body of the text.
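To illustrate why character n-grams can tolerate garbled text, the following is a minimal sketch, not TELLTALE's actual algorithm. It assumes a simple scheme in which each text is reduced to a profile of overlapping character n-gram counts and two profiles are compared by cosine similarity. The function names, the normalization step, and the choice of n = 5 are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=5):
    """Count overlapping character n-grams (n=5 is an illustrative choice)."""
    # Lowercase and collapse runs of whitespace so layout differences
    # do not create spurious n-grams.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two n-gram count vectors."""
    # Counter returns 0 for n-grams missing from b, so only shared
    # n-grams contribute to the dot product.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example strings: a clean phrase, an OCR-style garble of it,
# and an unrelated phrase.
clean = "information retrieval from hypertext"
garbled = "inforrnation retrievel from hypertxt"
unrelated = "the quick brown fox jumps"

sim_garbled = cosine_similarity(ngram_profile(clean), ngram_profile(garbled))
sim_unrelated = cosine_similarity(ngram_profile(clean), ngram_profile(unrelated))
print(sim_garbled > sim_unrelated)
```

A single wrong character damages only the few n-grams that overlap it, so a garbled copy still shares most of its profile with the clean text, whereas a word-based match would lose every misspelled word entirely. Because the profile is built from raw characters, nothing in the sketch depends on English vocabulary or stemming rules.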
