On-line new event detection, clustering, and tracking (information retrieval, internet)

In this work, we discuss and evaluate solutions to text classification problems associated with the events that are reported in on-line sources of news. We present solutions to three related classification problems: new event detection, event clustering, and event tracking.

The primary focus of this thesis is new event detection, where the goal is to identify news stories that have not previously been reported in a stream of broadcast news comprising radio, television, and newswire. We present an algorithm for new event detection, and analyze the effects of incorporating domain properties into the classification algorithm. We explore a solution that models the temporal relationship between news stories, and investigate the use of proper noun phrase extraction to capture the who, what, when, and where contained in news. Our results for new event detection suggest that previous approaches to document clustering provide a good basis for an approach to new event detection, and that further improvements to classification accuracy are obtained when the domain properties of broadcast news are modeled.

New event detection is related to the problem of event clustering, where the goal is to group stories that discuss the same event. We investigate on-line clustering as an approach to new event detection, and re-evaluate existing cluster comparison strategies previously used for document retrieval. Our results suggest that these strategies produce different groupings of events, and that the on-line single-link strategy extended with a model for domain properties is faster and more effective than other approaches.

We also explore several text representation issues in the context of event tracking, where a classifier for an event is formulated from one or more sample stories. The classifier is used to monitor the subsequent news stream for documents related to the event. We discuss different approaches to classifier formulation, and compare feature selection and weight-learning steps as extensions to a baseline process used for new event detection. In addition, we evaluate an unsupervised adaptive approach to event tracking that captures the property of event evolution in broadcast news.

The implementations of our approaches to on-line new event detection, clustering, and tracking have been evaluated in comparison to other systems, and we present cross-system comparisons for all three classification problems. In general, the results using our approaches compared favorably to other approaches for each problem.
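To make the idea of on-line single-link new event detection concrete, the following is a minimal illustrative sketch and not the thesis's implementation. It assumes whitespace tokenization, raw term-frequency vectors, and cosine similarity; the function names, the `threshold` and `decay` parameters, and the specific form of the time penalty (dividing similarity by a factor that grows with the number of stories elapsed) are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_new_events(stories, threshold=0.3, decay=0.05):
    """Process stories in arrival order (single pass).

    Returns one (is_new_event, cluster_id) pair per story.
    Single-link: a story is compared with every earlier story and
    joins the cluster of its most similar predecessor.
    The time penalty below is an assumed, hypothetical form.
    """
    seen = []          # (vector, cluster_id) for every earlier story
    labels = []
    next_cluster = 0
    for text in stories:
        vec = Counter(text.lower().split())
        best_score, best_cluster = 0.0, None
        # age = 1 for the most recent earlier story, 2 for the one before, ...
        for age, (old_vec, cluster) in enumerate(reversed(seen), start=1):
            score = cosine(vec, old_vec) / (1.0 + decay * age)
            if score > best_score:
                best_score, best_cluster = score, cluster
        is_new = best_score < threshold
        if is_new:
            cluster_id = next_cluster
            next_cluster += 1
        else:
            cluster_id = best_cluster
        seen.append((vec, cluster_id))
        labels.append((is_new, cluster_id))
    return labels
```

Under these assumptions, a story is flagged as a new event when no earlier story is similar enough after the time penalty; otherwise it joins the cluster of its nearest earlier story, which is the single-link behavior. A real system would likely use idf-style term weights and tuned thresholds rather than the defaults shown here.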
