An investigation of linguistic features and clustering algorithms for topical document clustering

We investigate four hierarchical clustering methods (single-link, complete-link, groupwise-average, and single-pass) and two linguistically motivated text features (noun phrase heads and proper names) in the context of document clustering. A statistical model for combining similarity information from multiple sources is described and applied to DARPA's Topic Detection and Tracking phase 2 (TDT2) data. This model, based on log-linear regression, alleviates the need for extensive search in order to determine optimal weights for combining input features. Through an extensive series of experiments with more than 40,000 documents from multiple news sources and modalities, we establish that both the choice of clustering algorithm and the introduction of the additional features have an impact on clustering performance. We apply our optimal combination of features to the TDT2 test data, obtaining partitions of the documents that compare favorably with the results obtained by participants in the official TDT2 competition.
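The two ideas in the abstract, combining per-feature similarities through a regression model and then clustering on the combined score, can be illustrated with a minimal sketch. It is not the paper's implementation. The logistic link, the gradient-ascent fitting, the cross-cluster definition of average linkage, and all names and toy numbers (`combined_similarity`, `fit_weights`, `agglomerate`, `TRAIN_PAIRS`, `TOY_SIM`) are assumptions made for illustration. Single-pass clustering is not covered here.

```python
import math

def combined_similarity(sims, weights, bias):
    """Map per-feature similarities (e.g. words, noun phrase heads,
    proper names) to one score in (0, 1) through a logistic link."""
    z = bias + sum(w * s for w, s in zip(weights, sims))
    return 1.0 / (1.0 + math.exp(-z))

def fit_weights(pairs, labels, lr=0.5, epochs=2000):
    """Fit weights by batch gradient ascent on the log-likelihood,
    so no grid search over feature weights is needed."""
    n_feat = len(pairs[0])
    weights, bias = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for sims, y in zip(pairs, labels):
            err = y - combined_similarity(sims, weights, bias)
            grad_b += err
            for i, s in enumerate(sims):
                grad_w[i] += err * s
        bias += lr * grad_b / len(pairs)
        weights = [w + lr * g / len(pairs) for w, g in zip(weights, grad_w)]
    return weights, bias

def cluster_similarity(sim, a, b, method):
    """Similarity between clusters a and b under a linkage rule."""
    scores = [sim[i][j] for i in a for j in b]
    if method == "single":      # most similar cross-cluster pair
        return max(scores)
    if method == "complete":    # least similar cross-cluster pair
        return min(scores)
    if method == "average":     # mean over cross-cluster pairs (simplified)
        return sum(scores) / len(scores)
    raise ValueError(f"unknown method: {method}")

def agglomerate(sim, threshold, method="average"):
    """Merge the most similar clusters until none reaches the threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cluster_similarity(sim, clusters[x], clusters[y], method)
                if s > best:
                    best, best_pair = s, (x, y)
        if best < threshold:
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical training data: one tuple of feature similarities per
# document pair, labeled 1 if the two documents share a topic.
TRAIN_PAIRS = [(0.9, 0.8, 0.7), (0.8, 0.6, 0.9), (0.7, 0.9, 0.6),
               (0.2, 0.1, 0.0), (0.1, 0.3, 0.1), (0.3, 0.2, 0.2)]
TRAIN_LABELS = [1, 1, 1, 0, 0, 0]

# Hypothetical combined-similarity matrix for four documents.
TOY_SIM = [[1.0, 0.9, 0.6, 0.1],
           [0.9, 1.0, 0.2, 0.1],
           [0.6, 0.2, 1.0, 0.1],
           [0.1, 0.1, 0.1, 1.0]]

if __name__ == "__main__":
    w, b = fit_weights(TRAIN_PAIRS, TRAIN_LABELS)
    print("weights:", w, "bias:", b)
    for method in ("single", "complete", "average"):
        print(method, agglomerate(TOY_SIM, 0.5, method))
```

On the toy matrix, single-link chains document 2 onto the first cluster through its one strong link to document 0, while complete-link keeps it separate at the same threshold. This is one way the choice of algorithm can change the resulting partition.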
