News Representation with Multi-Word Features

Information is commonly conveyed through news articles. However, their text is unstructured and therefore difficult to analyze automatically. To identify and capture the facts in a news story, we propose a novel approach based on natural language engineering. A combination of selected linguistic and statistical criteria enables the identification of grammatical units such as noun, verb, adjective, and adverb phrases. In the literature, these units are presumed to carry the meaning and the information expressed in English texts. In our study, we focus on determining multi-word features in articles about the monetary policy conducted by the U.S. central bank, the Federal Reserve (Fed). The features are composed as attribute-value pairs, in which the attributes are grammatical units that quantify the major event characteristics and the corresponding values are conditional expressions that vary over time as facts evolve. The final feature set is aggregated over the corpus by applying heuristic and syntax-based rules. Financial experts contributed to the project by providing expertise for interpreting the documents.
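
To make the idea of phrase-based, attribute-value features more concrete, the following is a minimal sketch and not the method used in the study. It assumes the tokens are already part-of-speech tagged (the Penn Treebank style tags are hand-assigned here), and the names `chunk`, `pair_phrases`, the tag sets, and the pairing rule (a noun phrase followed directly by a verb phrase) are all hypothetical choices made for illustration.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]          # (word, part-of-speech tag)
Phrase = Tuple[int, int, str]    # (start index, end index, phrase text)

# Assumed tag sets for this toy example; real criteria would be richer.
NP_MODIFIERS: Set[str] = {"DT", "JJ"}
NP_HEADS: Set[str] = {"NN", "NNS", "NNP"}
VP_MODIFIERS: Set[str] = {"MD", "RB"}
VP_HEADS: Set[str] = {"VB", "VBD", "VBZ", "VBN", "VBG"}


def chunk(tokens: List[Token], modifiers: Set[str], heads: Set[str]) -> List[Phrase]:
    """Collect runs of optional modifiers followed by one or more heads."""
    phrases: List[Phrase] = []
    i, n = 0, len(tokens)
    while i < n:
        j = i
        while j < n and tokens[j][1] in modifiers:
            j += 1
        k = j
        while k < n and tokens[k][1] in heads:
            k += 1
        if k > j:  # at least one head word was found
            text = " ".join(word for word, _ in tokens[i:k])
            phrases.append((i, k, text))
            i = k
        else:
            i += 1
    return phrases


def pair_phrases(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Toy rule: pair a noun phrase with the verb phrase that directly follows it."""
    verb_by_start = {start: text for start, _, text in chunk(tokens, VP_MODIFIERS, VP_HEADS)}
    pairs = []
    for _, end, noun_text in chunk(tokens, NP_MODIFIERS, NP_HEADS):
        if end in verb_by_start:
            pairs.append((noun_text, verb_by_start[end]))
    return pairs


# Hand-tagged example sentence (an assumption, not taken from the corpus).
sentence: List[Token] = [
    ("The", "DT"), ("Fed", "NNP"), ("raised", "VBD"),
    ("interest", "NN"), ("rates", "NNS"),
]
noun_phrases = [text for _, _, text in chunk(sentence, NP_MODIFIERS, NP_HEADS)]
pairs = pair_phrases(sentence)
```

In this sketch the noun phrase plays the role of the attribute and the adjacent verb phrase the role of its value; the conditional expressions and the corpus-level aggregation rules described above are not modeled.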
