Visualizing document classification: a search aid for the digital library

The rapid growth of the Internet and the World Wide Web has made digital libraries popular. Commercially available Web browsers give easy access to a digital library through a user-friendly interface. To retrieve documents of interest, the user is given a search interface that may consist of only one input field and one push button. Most users type in a single keyword, click the button, and hope for the best. The result of a query through this kind of interface may be a large unordered set of documents or a list of documents ranked by the frequency of the keywords. Either result can contain articles unrelated to the user's inquiry unless the user performs a sophisticated search and knows exactly what to look for. More sophisticated algorithms that rank search results by how well they meet the user's needs, as expressed in the search input, may help. What is most needed, however, are software tools that can analyze the search results and manipulate large hierarchies of data graphically. In this article we describe the design of a language-independent document classification system being developed to help users of the Florida Center for Library Automation analyze search query results. The system is easily accessible through the Web and offers a graphical user interface for displaying the classification results. We also describe the use of this system to retrieve and analyze sets of documents from public Web sites.
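As a rough illustration of the keyword-frequency ranking mentioned above, and of why it can return unrelated articles, the following minimal sketch counts occurrences of the query terms in each document and sorts by that count. It is not the system described in this article; the function name `rank_by_keyword_frequency` and the sample documents are hypothetical and chosen only for the example.

```python
import re
from collections import Counter


def rank_by_keyword_frequency(documents, query):
    """Return (doc_id, score) pairs sorted by descending keyword count.

    `documents` maps a document id to its text. This is an illustrative
    baseline only, not the classification system the article describes.
    """
    keywords = [word.lower() for word in query.split()]
    scored = []
    for doc_id, text in documents.items():
        # Count every word in the document, ignoring case.
        counts = Counter(re.findall(r"\w+", text.lower()))
        score = sum(counts[k] for k in keywords)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties broken by document id for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample data: the single keyword "java" is ambiguous.
docs = {
    "a": "Java is an island in Indonesia. Java has many volcanoes.",
    "b": "The Java programming language runs on a virtual machine.",
    "c": "Coffee is grown in many countries.",
}
ranking = rank_by_keyword_frequency(docs, "java")
print(ranking)
```

In this sample, a user looking for the programming language would find the article about the island ranked first, because frequency alone says nothing about topic. Grouping results by classification is meant to make that kind of mismatch visible.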
