Document clustering using small world communities

Words in natural language documents exhibit a small world network structure, so the extensive supply of algorithms that the physics community has developed for extracting community structure can be applied to them. We present a novel method for semantically clustering a large collection of documents using small world communities. The method combines modified physics algorithms with traditional information retrieval techniques and proceeds in three steps: a term network is generated from the document collection, the terms are clustered into small world communities, and the resulting semantic term clusters are used to generate overlapping document clusters. The algorithm combines the speed of single-link clustering with the quality of complete-link clustering, and so occupies a middle ground between speed and quality of document clustering. Clustering takes place in nearly real time, and the results are judged to be coherent by expert users.
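The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, the `min_cooccur` and `min_overlap` thresholds, and the toy documents are all assumptions made for illustration. In particular, the community step is replaced here by connected components of a thresholded co-occurrence graph, a much cruder stand-in for the modified physics community-detection algorithms the abstract refers to.

```python
from collections import defaultdict
from itertools import combinations


def build_term_network(docs, min_cooccur=2):
    """Step 1: link two terms if they co-occur in at least min_cooccur documents."""
    weights = defaultdict(int)
    for doc in docs:
        terms = sorted(set(doc.lower().split()))  # naive tokenizer (assumption)
        for a, b in combinations(terms, 2):
            weights[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), w in weights.items():
        if w >= min_cooccur:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def term_communities(graph):
    """Step 2 (stand-in): connected components instead of true community detection."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        communities.append(component)
    return communities


def cluster_documents(docs, communities, min_overlap=2):
    """Step 3: a document joins every community it shares >= min_overlap terms with."""
    clusters = defaultdict(list)
    for i, doc in enumerate(docs):
        terms = set(doc.lower().split())
        for c, community in enumerate(communities):
            if len(terms & community) >= min_overlap:
                clusters[c].append(i)
    return dict(clusters)


# Toy collection (hypothetical data, for illustration only).
docs = [
    "graph network community",
    "network community structure",
    "yeast drosophila host",
    "drosophila host plant",
    "graph network yeast drosophila",
]
communities = term_communities(build_term_network(docs))
clusters = cluster_documents(docs, communities)
```

Because a document is assigned to every term community it overlaps sufficiently, the last toy document lands in both clusters, which is the overlapping behaviour the abstract describes. A faithful version would swap the connected-components step for a modularity-based community algorithm.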
