On building predictive models with company annual reports

Text mining and machine learning methods have been applied in the biomedical and business domains to discover new relationships and knowledge. Company annual reports (10-K filings), among the most important mandatory information disclosures, have remained untapped by the text mining and machine learning community. Previous research indicates that the narrative disclosures in annual reports can be used to assess a company's short-term financial prospects. In this study, we apply text classification methods to 10-K filings to systematically assess the predictive potential of company annual reports. We specify our research problem along five dimensions: financial performance indicators, choice of predictions, evaluation criteria, document representation, and experiment design. Different combinations of choices along these five dimensions give us different perspectives on, and insights into, the feasibility of using annual reports to predict a company's future performance. Our results confirm that predictive models can be built successfully from the textual content of annual reports. Mock portfolios constructed from firms selected by the text-based model produce a positive average stock return. Sub-sample experiments and post-hoc analysis further confirm that the text-based model captures the textual differences among firms with different financial characteristics. We see a rich set of research questions that promise further insight in this research area.
