Time-dependent event hierarchy construction

In this paper, we propose an algorithm called Time Driven Documents-partition (TDD) to construct an event hierarchy in a text corpus based on a given query. For example, assume that a query contains only one feature, Election. Election is directly related to events such as the 2006 US Midterm Elections Campaign, the 2004 US Presidential Election Campaign and the 2004 Taiwan Presidential Election Campaign, and each of these events may be further divided into several smaller events (e.g., the 2006 US Midterm Elections Campaign can be broken down into events such as the campaign for votes, the election results and the resignation of Donald H. Rumsfeld). The result is an event hierarchy. TDD tackles the problem in three major steps: (1) identify the features that are related to the query according to both the timestamps and the contents of the documents, and regard these features as bursty features; (2) extract the documents that are highly related to the bursty features based on time; (3) partition the extracted documents to form events and organize them in a hierarchical structure. To the best of our knowledge, little work has targeted the construction of a feature-based event hierarchy for a text corpus. In practice, event hierarchies help us locate target information in a text corpus efficiently. Again, assume that Election is used as a query. Without an event hierarchy, it is very difficult to identify the major events related to it, when these events happened, and the features and news articles related to each of these events. We archived two years of news articles to evaluate the feasibility of TDD. The encouraging results indicate that TDD is practically sound and highly effective.
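To make the three steps concrete, the following is a minimal Python sketch of a pipeline with the same overall shape. It is not the TDD algorithm itself: the abstract does not specify the burst test, the document-relevance measure or the partitioning criterion, so the rules used here (a count-versus-mean ratio per time window, membership in a bursty window, and a time-gap split) are stand-in assumptions, as are the toy corpus, the function names and all thresholds.

```python
from collections import defaultdict
from statistics import mean

# Toy corpus of (day, tokens). Data, names and thresholds are hypothetical.
DOCS = [
    (1, {"election", "campaign", "vote"}),
    (2, {"election", "campaign"}),
    (3, {"election", "campaign", "vote"}),
    (10, {"weather", "storm"}),
    (20, {"election", "results"}),
    (21, {"election", "results", "resignation"}),
    (22, {"election", "resignation"}),
    (30, {"sports", "final"}),
]


def bursty_features(docs, query, window=5, ratio=2.0):
    """Step 1 (assumed rule): a feature that co-occurs with the query is bursty
    in a time window if its count there is >= `ratio` times its mean count."""
    n_windows = max(day for day, _ in docs) // window + 1
    counts = defaultdict(lambda: [0] * n_windows)
    for day, tokens in docs:
        if query in tokens:
            for tok in tokens - {query}:
                counts[tok][day // window] += 1
    bursts = {}
    for tok, per_window in counts.items():
        avg = mean(per_window)
        hot = [w for w, c in enumerate(per_window) if c >= ratio * avg]
        if hot:
            bursts[tok] = hot
    return bursts


def extract_documents(docs, bursts, window=5):
    """Step 2 (assumed rule): keep documents that contain a bursty feature
    inside one of that feature's bursty windows."""
    return [
        idx
        for idx, (day, tokens) in enumerate(docs)
        if any(day // window in bursts.get(tok, ()) for tok in tokens)
    ]


def build_hierarchy(docs, selected, bursts, max_gap=7):
    """Step 3 (assumed rule): a gap of more than `max_gap` days starts a new
    event; inside an event, documents are grouped by bursty feature.
    In this toy version a document may appear in several sub-events."""
    events = []
    last_day = None
    for idx in sorted(selected, key=lambda i: docs[i][0]):
        day, tokens = docs[idx]
        if last_day is None or day - last_day > max_gap:
            events.append(defaultdict(list))
        for tok in sorted(tokens & set(bursts)):
            events[-1][tok].append(idx)
        last_day = day
    return [dict(event) for event in events]


bursts = bursty_features(DOCS, "election")
selected = extract_documents(DOCS, bursts)
hierarchy = build_hierarchy(DOCS, selected, bursts)
for number, event in enumerate(hierarchy, start=1):
    print(f"Event {number}:", event)
```

On this toy corpus the query Election yields two top-level events separated in time, each with sub-events keyed by the bursty features found in it. A realistic implementation would replace each assumed rule with the statistical tests and similarity measures defined in the body of the paper.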
