Tracking Information Flow in Financial Text

Information is fundamental to finance, and understanding how it flows from official sources to news agencies is a central problem. To gain a competitive advantage, readers need to digest information rapidly from high-volume news feeds, which often contain duplicate and irrelevant stories. We propose a text categorisation task over pairs of official announcements and news stories to identify whether the story repeats announcement information and/or adds value. Using features based on the intersection of the texts and their relative timing, our system identifies information flow at 89.5% F-score and three types of journalistic contribution at 73.4% to 85.7% F-score. Evaluated against the majority annotator decision, the system performs 13% better than a bag-of-words baseline.
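To make the two feature families concrete, the following is a minimal sketch of how text-intersection and relative-timing features might be computed for one announcement-story pair. It is an illustration under stated assumptions, not the system's actual feature set: the function names, the simple regular-expression tokenizer, and the three features (`jaccard`, `story_coverage`, `lag_minutes`) are all hypothetical choices made here.

```python
import re
from datetime import datetime


def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def pair_features(announcement, story, announced_at, published_at):
    """Hypothetical features for one (announcement, story) pair.

    The feature names are assumptions for illustration only.
    """
    ann_tokens = set(tokenize(announcement))
    story_tokens = set(tokenize(story))
    shared = ann_tokens & story_tokens   # intersection of the two vocabularies
    union = ann_tokens | story_tokens
    return {
        # Overlap relative to the combined vocabulary of both texts.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Fraction of the story's vocabulary that also appears in the announcement.
        "story_coverage": len(shared) / len(story_tokens) if story_tokens else 0.0,
        # Relative timing: minutes from the announcement to the story.
        "lag_minutes": (published_at - announced_at).total_seconds() / 60.0,
    }


features = pair_features(
    "Profit rose 10 percent",
    "Profit rose sharply",
    datetime(2010, 1, 4, 9, 0),
    datetime(2010, 1, 4, 9, 30),
)
```

A feature dictionary of this kind could then be passed to any standard classifier. High overlap with a short lag would plausibly suggest that a story repeats the announcement, but that interpretation is an assumption of this sketch rather than a reported finding.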
