Unweaving a web of documents

We develop an algorithmic framework to decompose a collection of time-stamped text documents into semantically coherent threads. Our formulation leads to a graph decomposition problem on directed acyclic graphs, for which we obtain three algorithms: an exact algorithm based on minimum cost flow, and two more efficient algorithms, based on maximum matching and dynamic programming, that solve specific versions of the graph decomposition problem. Applications of our algorithms include better summarization of news search results, improved browsing paradigms for large text-intensive corpora, and integration of time-stamped documents from a variety of sources. Experimental results on over 250,000 news articles published by a major newspaper over a period of four years demonstrate that our algorithms efficiently identify robust threads of varying lengths and time spans.
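To make the matching-based idea concrete, here is a minimal sketch and not the paper's actual algorithm. It assumes an unweighted simplification: related documents are linked from the earlier to the later one, and the resulting directed acyclic graph is split into the fewest vertex-disjoint paths using maximum bipartite matching, a standard reduction for minimum path cover. The function names `build_dag` and `path_cover`, the input format, and the tie-handling rule are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_dag(docs, related):
    """Orient each relatedness pair from the earlier to the later document.

    docs:    list of (doc_id, timestamp) pairs (assumed input format)
    related: iterable of (doc_id, doc_id) pairs judged semantically related
    """
    ts = dict(docs)
    succ = defaultdict(list)
    for a, b in related:
        if ts[a] == ts[b]:
            continue  # assumption: drop ties so the graph stays acyclic
        u, v = (a, b) if ts[a] < ts[b] else (b, a)
        succ[u].append(v)
    return succ


def path_cover(nodes, succ):
    """Split a DAG into vertex-disjoint paths via maximum bipartite matching.

    Each matched edge u -> v means "v follows u in the same thread".
    Uses simple augmenting-path matching (Kuhn's algorithm).
    """
    match_prev = {}  # node -> its predecessor in a thread
    match_next = {}  # node -> its successor in a thread

    def try_augment(u, seen):
        for v in succ.get(u, ()):
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current predecessor can
            # be re-matched to some other successor.
            if v not in match_prev or try_augment(match_prev[v], seen):
                match_prev[v] = u
                match_next[u] = v
                return True
        return False

    for u in nodes:
        try_augment(u, set())

    # A thread starts at every node that has no matched predecessor.
    threads = []
    for head in nodes:
        if head in match_prev:
            continue
        path = [head]
        while path[-1] in match_next:
            path.append(match_next[path[-1]])
        threads.append(path)
    return threads


# Toy example: four documents with hypothetical ids and integer timestamps.
docs = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
related = [("a", "b"), ("b", "c"), ("a", "d")]
threads = path_cover([d for d, _ in docs], build_dag(docs, related))
```

Each returned list is one candidate thread in time order. A weighted or length-constrained variant, as the minimum cost flow formulation suggests, would need edge costs that this sketch deliberately leaves out.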

[1]  Richard M. Karp,et al.  A n^5/2 Algorithm for Maximum Matchings in Bipartite Graphs , 1971, SWAT.

[2]  Richard M. Karp,et al.  Theoretical Improvements in Algorithmic Efficiency for Network Flow Problems , 1972, Combinatorial Optimization.

[3]  Richard M. Karp,et al.  A n^5/2 Algorithm for Maximum Matchings in Bipartite Graphs , 1971, SWAT.

[4]  David S. Johnson,et al.  Computers and In stractability: A Guide to the Theory of NP-Completeness. W. H Freeman, San Fran , 1979 .

[5]  Éva Tardos,et al.  A strongly polynomial minimum cost circulation algorithm , 1985, Comb..

[6]  Andrew V. Goldberg,et al.  Solving minimum-cost flow problems by successive approximation , 1987, STOC.

[7]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[8]  Alan F. Smeaton,et al.  Experiments on the automatic construction of hypertexts from texts , 1995, New Rev. Hypermedia Multim..

[9]  James Allan,et al.  Automatic Hypertext Construction , 1995 .

[10]  Mark D. Dunlop,et al.  Automatic Construction of News Hypertext , 1997, HIM.

[11]  Stephen J. Green,et al.  Automatically generating hypertext in newspaper articles by computing semantic relatedness , 1998, CoNLL.

[12]  Naohiko Uramoto,et al.  A Method for Relating Multiple Newspaper Articles by Using Graphs, and Its Application to Webcasting , 1998, COLING-ACL.

[13]  Yiming Yang,et al.  Topic Detection and Tracking Pilot Study Final Report , 1998 .

[14]  Theodore Dalamagas,et al.  NHS: A Tool for the Automatic Construction of News Hypertext , 1998, BCS-IRSG Annual Colloquium on IR Research.

[15]  Yiming Yang,et al.  A study of retrospective and on-line event detection , 1998, SIGIR '98.

[16]  Sylvia L. Osborn,et al.  Hypertext versions of journal articles: computer-aided linking and realistic human-based evaluation , 1999 .

[17]  Ravi Kumar,et al.  Trawling the Web for Emerging Cyber-Communities , 1999, Comput. Networks.

[18]  James Allan,et al.  Automatic generation of overview timelines , 2000, SIGIR '00.

[19]  C. Lee Giles,et al.  Efficient identification of Web communities , 2000, KDD '00.

[20]  James Allan,et al.  Temporal summaries of new topics , 2001, SIGIR '01.

[21]  J. Kleinberg Bursty and Hierarchical Structure in Streams , 2002, Data mining and knowledge discovery.

[22]  David A. Smith Detecting events with date and place information in unstructured text , 2002, JCDL '02.

[23]  Ramakrishnan Srikant,et al.  Mining newsgroups using networks arising from social behavior , 2003, WWW '03.

[24]  Monika Henzinger,et al.  Query-Free News Search , 2003, WWW '03.

[25]  Ravi Kumar,et al.  A graph-theoretic approach to extract storylines from search results , 2004, KDD.

[26]  Mayur Datar,et al.  On the streaming model augmented with a sorting primitive , 2004, 45th Annual IEEE Symposium on Foundations of Computer Science.