Clustering Scientific Literature Using Sparse Citation Graph Analysis

It is well known that connectivity analysis of linked documents provides significant information about the structure of the document space for unsupervised learning tasks. However, the ability to identify distinct clusters of documents through link graph analysis is proportional to the density of the graph and depends on the availability of the linking and/or linked documents in the collection. In this paper, we present an information-theoretic approach to measuring the significance of individual words based on the underlying link structure of the document collection. This enables us to generate a non-uniform weight distribution over the feature space, which is used to augment the original corpus-based document similarities. Experimental results on a collection of scientific literature show that our method achieves better separation of distinct groups of documents, yielding improved clustering solutions.
