Using Python for Text Analysis in Accounting Research

Textual data has become far more prominent in accounting research. To help researchers understand and use it, this monograph defines and describes common measures of textual data and then demonstrates how to collect and process textual data using the Python programming language. Sample code throughout replicates textual analysis tasks from recent research papers. The first part of the monograph provides guidance on getting started in Python. We first describe Anaconda, a distribution of Python that provides the requisite libraries for textual analysis, and explain how to install it. We then introduce the Jupyter notebook, a programming environment that improves research workflows and promotes replicable research. Next, we teach the basics of Python programming and of working with tabular data in the Pandas package. The second part of the monograph focuses on specific textual analysis methods and techniques commonly used in accounting research. We first introduce regular expressions, a sophisticated language for finding patterns in text, and show how to use them to extract specific parts of a text. Next, we introduce the idea of transforming text (unstructured data) into numerical measures that represent variables of interest (structured data). Specifically, we introduce dictionary-based methods for (1) measuring document sentiment, (2) computing text complexity, (3) identifying forward-looking sentences and risk disclosures, (4) collecting informative numbers in text, and (5) computing the similarity of different pieces of text. For each of these tasks, we cite relevant papers and provide code snippets that implement the metrics used in those papers. Finally, the third part of the monograph focuses on automating the collection of textual data. We introduce web scraping and provide code for downloading filings from EDGAR.
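To give a flavor of the text-to-number transformations described above, the following is a minimal sketch rather than code from the monograph. It assumes a tiny, made-up pair of word lists (POSITIVE and NEGATIVE) in place of a published dictionary, and the function names tokenize, tone, cosine_similarity, and extract_dollar_amounts are hypothetical. It uses only the Python standard library.

```python
import math
import re
from collections import Counter

# Hypothetical mini word lists for illustration only; these are NOT a
# published sentiment dictionary.
POSITIVE = {"growth", "improved", "strong"}
NEGATIVE = {"loss", "decline", "weak"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tone(text):
    """Net tone: (positive count - negative count) / total word count."""
    words = tokenize(text)
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(tokenize(text_a))
    b = Counter(tokenize(text_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def extract_dollar_amounts(text):
    """Use a regular expression to pull dollar amounts such as $1,250.5."""
    return re.findall(r"\$\d[\d,]*(?:\.\d+)?", text)
```

A research-grade version would swap in an established word list and handle details such as negation, stop words, and term weighting; the sketch only shows the general shape of a dictionary-based measure, a similarity measure, and a regular-expression extraction.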
