Improving text categorization methods for event tracking

Automated tracking of events from chronologically ordered document streams is a new challenge for statistical text classification. Existing learning techniques must be adapted or improved in order to effectively handle difficult situations where the number of positive training instances per event is extremely small, the majority of training documents are unlabelled, and most of the events are short-lived. We adapted several supervised text categorization methods, specifically several new variants of the k-Nearest Neighbor (kNN) algorithm and a Rocchio approach, to track events. All of these methods showed significant improvement (up to 71% reduction in weighted error rates) over the performance of the original kNN algorithm on TDT benchmark collections, placing kNN among the top-performing systems in the recent TDT3 official evaluation. Furthermore, by combining these methods, we significantly reduced the variance in performance of our event tracking system over different data collections, suggesting a robust solution for parameter optimization.
