An Extraction Method of Time-Series Numerical Data from Press Releases Using Co-Occurrence Conditions of Numbers and Time Stamps Related to Target Business Keyword

Businesses today operate in an unpredictable environment. When designing a business strategy in such an environment, it is essential to collect and analyze time-series numerical data, that is, pairs of a time stamp and a numerical value related to business keywords. We propose a method for extracting time-series numerical data from enterprise press releases, as shown in Fig. 1. Each company uses a specific keyword for a given matter in its press releases, so the necessary data can be found in press releases with a few specific keywords. However, when a sentence contains several such keywords and numerical values, it is difficult to judge which pair of keyword and numerical value should be extracted. In addition, time stamps are expressed in various ways (e.g., “this year”, “in 2006”). Because of these differences, a chart cannot be made directly from the data as extracted. To solve these problems, this research addresses the following issues:

• Extract correct pairs of time stamps and numerical data
• Unify the expression of time stamps

Press releases consist of plain sentences, itemizations, tables, and figures. Tables are well formatted, and figures include few target data. Therefore, plain sentences and itemizations are treated as the target portions for extraction. In plain sentences, the numerical data and time stamps required by an analyst often appear near the input keywords. Hence, numerical data near the keywords and time stamps near the numerical data are extracted, and the pair with the shortest word-distance is selected, as shown in Fig. 2. In itemizations, time stamps often do not appear near numerical data. Therefore, we extract using not only the word-