Abstract: In the application domain of stock portfolio management, software agents that evaluate the risks associated with the individual companies in a portfolio should be able to read electronic news articles written to give investors an indication of a company's financial outlook. There is a positive correlation between news reports on a company's financial outlook and the company's attractiveness as an investment. However, because of the volume of such reports, it is impossible for financial analysts or investors to track and read each one. It would therefore be very helpful to have a system that automatically classifies news reports that reflect positively or negatively on a company's financial outlook. To accomplish this task, we treat the analysis of news articles as a text classification problem. We developed a text classification algorithm that classifies financial news articles using a combination of a reduced but highly informative word feature set and a variant of the weighted majority algorithm. By clustering words, represented in the latent semantic vector space produced by LSA, into groups with similar concepts, we are able to find semantically coherent word groups. To handle the problem of expensive data labeling, we propose Self-Confident sampling, a learning method that exploits unlabeled data; vote entropy is the information-theoretic criterion used to assign a label to an unlabeled document. In comparison with naive Bayes classification boosted by Expectation Maximization (EM), the proposed method showed better accuracy. Two criteria are used to evaluate the methods: (1) how well they improve their performance with unlabeled data after being initially trained on a small number of human-labeled articles, and (2) how well they classify the latest financial news articles, most of which are not seen during training.
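The abstract does not spell out the Self-Confident sampling procedure, but a minimal sketch of the vote-entropy idea it mentions might look like the following. The function names, the committee size, and the entropy threshold are illustrative assumptions for this sketch, not the authors' implementation: a committee of classifiers votes on an unlabeled document, and the document is self-labeled with the majority class only when the committee's vote entropy is low enough.

```python
import numpy as np

def vote_entropy(votes, num_classes):
    """Entropy (in bits) of a committee's label votes for one unlabeled document.

    votes: list of class indices, one per committee member.
    Higher entropy means more disagreement, i.e. less confidence.
    """
    counts = np.bincount(votes, minlength=num_classes).astype(float)
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]
    return -np.sum(nonzero * np.log2(nonzero))

def self_label(votes, num_classes, max_entropy=0.9):
    """Assign a label to an unlabeled document only when the committee
    agrees strongly enough (low vote entropy); otherwise skip it.
    The 0.9-bit threshold is an arbitrary illustrative choice."""
    if vote_entropy(votes, num_classes) <= max_entropy:
        counts = np.bincount(votes, minlength=num_classes)
        return int(np.argmax(counts))  # majority label
    return None                        # too uncertain to self-label

# Example: 3 of 4 committee members vote "positive" (class 1), entropy ~0.81 bits
print(self_label([1, 1, 1, 0], num_classes=2))  # -> 1
# Example: an even 2-2 split, entropy 1.0 bit, so the document is not self-labeled
print(self_label([1, 0, 1, 0], num_classes=2))  # -> None
```

Under this reading, documents that clear the entropy threshold are added to the training pool with their self-assigned labels, which is how the unlabeled data would augment the small initial set of human-labeled articles.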