Software Newsroom – an approach to automation of news search and editing

We have developed tools and applied methods for the automated identification of potential news in textual data, for use in an automated news search system called Software Newsroom. The purpose of the tools is to analyze data collected from the internet and to identify content that has a high probability of containing new information. The identified content is summarized to help editors understand the semantic contents of the data and to assist the news editing process. It has been demonstrated that words with a certain set of syntactic and semantic properties are effective when building topic models for English. We demonstrate that words with the same properties in Finnish are useful as well. Extracting such words requires knowledge of the special characteristics of the Finnish language, which are taken into account in our analysis. Two different methodological approaches have been applied to the news search. The first method is based on topic analysis and applies Multinomial Principal Component Analysis (MPCA) for topic model creation and data profiling. The second method is based on word association analysis and applies the log-likelihood ratio (LLR). For the topic mining, we have created English and Finnish language corpora from Wikipedia and Finnish corpora from several Finnish news archives, and we have used bag-of-words representations of these corpora as training data for the topic model. We have performed topic analysis experiments both with the training data itself and with arbitrary text parsed from internet sources. The results suggest that the effectiveness of news search depends strongly on the quality of the training data and its linguistic analysis. In the association analysis, we use a combined methodology for detecting novel word associations in the text. To detect novel associations, we use a background corpus from which we extract common word associations. In parallel, we collect word co-occurrence statistics from the documents of interest and search for associations that have a larger likelihood in these documents than in the background. We have demonstrated the applicability of these methods for Software Newsroom. The results indicate that the background-foreground model has significant potential in news search. The experiments also indicate great promise in employing background-foreground word associations for other applications. A combined application of the two methods is planned, as well as the application of the methods to social media using a pre-translator of social media language.
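
The background-foreground idea can be illustrated with a minimal sketch. It is a simplification for illustration, not the implementation used in Software Newsroom: documents are assumed to be already tokenized into lists of words, co-occurrence is counted at document level, and the helper names (`llr_2x2`, `pair_counts`, `novel_associations`) are hypothetical. The score is the log-likelihood ratio statistic for a 2x2 contingency table, applied here to the question of whether a word pair co-occurs more often in the foreground documents than in the background corpus.

```python
import math
from collections import Counter
from itertools import combinations


def llr_2x2(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) statistic for a 2x2 contingency table."""
    def h(*counts):
        # Sum of k * ln(k / total) over the non-zero cells.
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)

    return 2.0 * (h(k11, k12, k21, k22)
                  - h(k11 + k12, k21 + k22)    # row sums
                  - h(k11 + k21, k12 + k22))   # column sums


def pair_counts(docs):
    """Count, for each unordered word pair, the documents containing both words."""
    counts = Counter()
    for doc in docs:
        counts.update(combinations(sorted(set(doc)), 2))
    return counts


def novel_associations(foreground, background):
    """Rank word pairs that co-occur more often in the foreground than in the background."""
    fg, bg = pair_counts(foreground), pair_counts(background)
    n_fg, n_bg = len(foreground), len(background)
    scored = []
    for pair, k_fg in fg.items():
        k_bg = bg.get(pair, 0)
        # Keep only pairs that are relatively more frequent in the foreground.
        if k_fg / n_fg > k_bg / n_bg:
            score = llr_2x2(k_fg, n_fg - k_fg, k_bg, n_bg - k_bg)
            scored.append((pair, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (invented for illustration only).
foreground = [["nokia", "layoffs"], ["nokia", "layoffs"], ["nokia", "phone"]]
background = [["nokia", "phone"]] * 8 + [["weather", "rain"]] * 2

for pair, score in novel_associations(foreground, background):
    print(pair, round(score, 2))
```

In this toy example the pair ("nokia", "phone") is common in the background and is therefore filtered out, while ("layoffs", "nokia") appears only in the foreground and is reported as a candidate novel association. A practical system would additionally need the linguistic preprocessing described above, such as lemmatization for Finnish, before counting co-occurrences.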
