Implications of the Recursive Representation Problem for Automatic Concept Identification in On-line Governmental Information

This paper describes ongoing research into the application of unsupervised learning techniques for improving access to governmental information on the Web. Under the auspices of the GovStat Project (http://www.ils.unc.edu/govstat), our goal is to identify a small number of semantically valid and mutually exclusive "concepts" that adequately span the intellectual domain of a web site. While this is a classic instance of the clustering problem [14], the task is complicated by the dual-representational nature of term-document relationships. Since documents are defined in term space and terms are defined in document space, we may approach this as either a document-clustering or a term-clustering problem. The current study explores the implications of pursuing both term- and document-centered representations. Based on initial work, we argue for a document clustering-based approach. Drawing on completed research, we suggest that term clustering yields semantically valid categories, but that these categories are not suitably broad. To improve the coverage of the clustering, we describe a process based on document clustering.
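The dual representation mentioned above can be illustrated with a minimal sketch. The toy corpus, the raw term counts, and the cosine measure are all assumptions made for illustration; they are not taken from the GovStat data or from the method used in the paper. The point is only that one matrix supports both views: its rows describe documents in term space, and the rows of its transpose describe terms in document space.

```python
from math import sqrt

# Hypothetical toy corpus, invented for illustration only.
docs = {
    "d1": "unemployment rate labor force",
    "d2": "labor force employment statistics",
    "d3": "consumer price index inflation",
}

def build_matrix(docs):
    """Return (doc_ids, vocab, matrix); matrix[i][j] counts term j in document i."""
    doc_ids = sorted(docs)
    vocab = sorted({t for text in docs.values() for t in text.split()})
    index = {t: j for j, t in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in doc_ids]
    for i, d in enumerate(doc_ids):
        for t in docs[d].split():
            matrix[i][index[t]] += 1
    return doc_ids, vocab, matrix

def transpose(m):
    """Swap rows and columns, turning document vectors into term vectors."""
    return [list(col) for col in zip(*m)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

doc_ids, vocab, doc_term = build_matrix(docs)   # rows: documents in term space
term_doc = transpose(doc_term)                  # rows: terms in document space

# Document-centered view: compare two documents by the terms they share.
doc_sim = cosine(doc_term[0], doc_term[1])

# Term-centered view: compare two terms by the documents they occur in.
term_sim = cosine(term_doc[vocab.index("labor")], term_doc[vocab.index("force")])

print("doc similarity (d1, d2):", doc_sim)
print("term similarity (labor, force):", term_sim)
```

Any clustering procedure could be applied to either `doc_term` or `term_doc`; which of the two serves as input is exactly the choice between a document-centered and a term-centered representation.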

[1] Matthew Richardson, et al. The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank, 2001, NIPS.

[2] Peter J. Rousseeuw, et al. Finding Groups in Data: An Introduction to Cluster Analysis, 1990.

[3] Vincent Kanade, et al. Clustering Algorithms, 2021, Wireless RF Energy Transfer in the Massive IoT Era.

[4] Krishna Bharat, et al. Improved algorithms for topic distillation in a hyperlinked environment, 1998, SIGIR '98.

[5] Donald A. Jackson. Stopping Rules in Principal Components Analysis: A Comparison of Heuristical and Statistical Approaches, 1993.

[6] Anil K. Jain, et al. Data clustering: a review, 1999, CSUR.

[7] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[8] Thomas G. Dietterich. What is machine learning?, 2020, Archives of Disease in Childhood.

[9] James Pustejovsky, et al. Automatic construction of faceted terminological feedback for context-based information retrieval, 1999.

[10] Prabhakar Raghavan, et al. Mining the Link Structure of the World Wide Web, 1998.

[11] Chris H. Q. Ding, et al. Automatic topic identification using webpage clustering, 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[12] Audra E. Kosh, et al. Linear Algebra and its Applications, 1992.

[13] Erkki Oja, et al. Independent Component Analysis, 2001.

[14] Christiane Fellbaum, et al. Book Reviews: WordNet: An Electronic Lexical Database, 1999, CL.

[15] Ali S. Hadi, et al. Finding Groups in Data: An Introduction to Cluster Analysis, 1991.

[16] C. Ding. A similarity-based probability model for latent semantic indexing, 1999, SIGIR '99.

[17] Martin F. Porter, et al. An algorithm for suffix stripping, 1997, Program.

[18] Gary Marchionini, et al. Toward a Statistical Knowledge Network, 2003, DG.O.