Scalable browsing for large collections: a case study

Phrase browsing techniques use phrases extracted automatically from a large information collection as a basis for browsing and accessing it. This paper describes a case study that uses an automatically constructed phrase hierarchy to facilitate browsing of an ordinary large Web site. Phrases are extracted from the full text using a novel combination of rudimentary syntactic processing and sequential grammar induction techniques. The interface is simple, robust and easy to use. To convey a feeling for the quality of the phrases that are generated automatically, a thesaurus used by the organization responsible for the Web site is studied and its degree of overlap with the phrases in the hierarchy is analyzed. Our ultimate goal is to amalgamate hierarchical phrase browsing and hierarchical thesaurus browsing: the latter provides an authoritative domain vocabulary and the former augments coverage in areas the thesaurus does not reach.
