A methodology to improve the performance of extracting information from financial documents

Information Extraction (IE) technology retrieves the most relevant, context-sensitive, and specific pieces of information from unstructured documents and presents them in a structured format. The IE problem is difficult for several reasons. First, the items to be retrieved have no clear boundaries. Second, information retrieval techniques that rely on a bag of words and word statistics may fail to retrieve much of the relevant information because they discard context. Third, statistical techniques such as the Naive Bayes classifier or Average Mutual Information perform well on document retrieval tasks but are not directly applicable to IE tasks. This study proposes an IE methodology that aims to extract financial information about various NASDAQ-listed companies with high precision and recall. The performance is improved in part by a rule-based symbolic learning model, whose rules are learned by the simplest form of the Tabu search algorithm. The results show that applying the Tabu search algorithm with part-of-speech tags improves precision and recall over the other methods and resources considered. The output of the learned model is further analyzed by a statistical method called "Max-Strength" to improve the precision of the items extracted by the symbolic learning model. The strength of the methodology is evidenced by its performance on the "Seminar Announcement" corpus, which has been used by several well-known systems.
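As an illustration of the rule-learning step, the following is a minimal sketch of a basic Tabu search over a set of candidate extraction rules. It is not the model described in this study: the rule representation (each rule reduced to the set of item identifiers it extracts), the single-flip neighborhood, the fixed tabu tenure, and the use of F1 as the objective are all assumptions made for the example, and the names `f1_score`, `extracted_by`, and `tabu_search` are hypothetical.

```python
from collections import deque


def f1_score(extracted, gold):
    """Harmonic mean of precision and recall for two sets of items."""
    true_pos = len(extracted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(extracted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def extracted_by(selected, rule_outputs):
    """Union of the items extracted by the selected rules."""
    items = set()
    for i in selected:
        items |= rule_outputs[i]
    return items


def tabu_search(rule_outputs, gold, iterations=50, tenure=2):
    """Basic Tabu search (assumed form): short-term memory only.

    A solution is a set of rule indices. A move adds or removes one rule.
    A rule that was just moved is tabu for `tenure` iterations.
    """
    current = frozenset()
    best, best_score = current, 0.0
    tabu = deque(maxlen=tenure)  # oldest tabu move is dropped automatically

    for _ in range(iterations):
        candidates = []
        for i in range(len(rule_outputs)):
            if i in tabu:
                continue
            neighbor = current ^ {i}  # flip rule i in or out
            score = f1_score(extracted_by(neighbor, rule_outputs), gold)
            candidates.append((score, i, neighbor))
        if not candidates:
            break
        # Take the best non-tabu neighbor, even if it is worse than the
        # current solution; ties go to the lowest rule index.
        score, move, current = max(candidates, key=lambda c: (c[0], -c[1]))
        tabu.append(move)
        if score > best_score:
            best, best_score = current, score

    return best, best_score


# Toy data (hypothetical): four candidate rules and a gold set of items.
rule_outputs = [{1, 2}, {3, 4, 5}, {6, 7}, {3, 4}]
gold = {1, 2, 3, 4}
best_rules, best_f1 = tabu_search(rule_outputs, gold)
print(sorted(best_rules), best_f1)
```

Accepting the best non-tabu move even when it lowers the score is what lets the search leave a local optimum, while the tabu list keeps it from immediately undoing the move it just made.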