Document clustering algorithms, representations and evaluation for information retrieval

This thesis presents new methods for classifying and thematically grouping billions of web pages, at a scale not previously achievable. This process is known as document clustering: similar documents are automatically assigned to clusters that each represent a distinct topic. The automatically discovered topics are in turn used to improve search engine performance by searching only the topics deemed relevant to a given user query.
