Unsupervised Non-topical Classification of Documents

Abstract: We describe the problem of non-topical clustering of documents, whose purpose is to divide a set of documents into clusters that share some aspect other than topic. We present experiments on the British National Corpus that cluster documents by genre. We show that words are superior to part-of-speech information for genre clustering, but that better results can be obtained by using both. We also demonstrate that the new multi-way distributional clustering approach is highly effective for this task because it requires less feature crafting than other techniques.
