Introduction to Text Clustering

Text data is ubiquitous. As its volume grows, managing and analyzing it becomes increasingly important. Text mining is an emerging technology for handling this growing body of text, and text clustering is one of its fundamental functions. Text clustering divides a collection of text documents into groups so that documents in the same group describe the same topic, such as classical music or Chinese history.
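
The definition above can be illustrated with a small sketch. It is not a method taken from this text: the bag-of-words representation, the cosine similarity measure, the greedy single-pass grouping, the threshold of 0.2, and the four toy documents are all assumptions chosen only to show what "grouping documents by topic" might look like in practice.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector for one document (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.2):
    """Greedy single-pass grouping (illustrative only).

    Each document joins the first group whose first member is at least
    `threshold` similar to it; otherwise it starts a new group.
    Returns a list of groups, each a list of document indices.
    """
    vectors = [vectorize(d) for d in docs]
    groups = []
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Hypothetical toy collection covering the two example topics.
docs = [
    "mozart composed classical music symphonies",
    "the tang dynasty in chinese history",
    "beethoven wrote classical music sonatas",
    "chinese history of the ming dynasty",
]
groups = cluster(docs)
print(groups)
```

On this toy collection the two music documents end up in one group and the two history documents in another. Practical systems typically use stronger term weighting and clustering algorithms; the sketch only makes the input and output of the task concrete.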
