Learning Company Embeddings from Annual Reports for Fine-grained Industry Characterization

Organizing companies by industry segment (e.g. artificial intelligence, healthcare, or fintech) is useful for analyzing stock market performance and for designing theme-based investment funds, among other applications. Current practice is to manually assign companies to sectors or industries from a small predefined list, which has two key limitations. First, due to the manual effort involved, this strategy is only feasible for relatively mainstream industry segments, and can thus not easily be used for niche or emerging topics. Second, the use of hard label assignments ignores the fact that different companies will be more or less exposed to a particular segment. To address these limitations, we propose to learn vector representations of companies based on their annual reports. The key challenge is to distill the relevant information from these reports for characterizing their industries, since annual reports also contain a lot of information which is not relevant for our purpose. To this end, we introduce a multi-task learning strategy, which is based on fine-tuning the BERT language model on (i) existing sector labels and (ii) stock market performance. Experiments in both English and Japanese demonstrate the usefulness of this strategy.
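The multi-task setup described above can be sketched as a shared encoder feeding two heads: a classifier trained on existing sector labels and a regressor trained on stock market performance, with a joint loss driving the shared company embeddings. The following minimal numpy sketch illustrates the idea under stated assumptions: in the paper the shared encoder is BERT applied to annual-report text, while here a single `tanh` layer over synthetic features stands in, and the task-weighting factor `0.5` is an assumption, not a value from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_companies, d_in, d_emb, n_sectors = 32, 64, 16, 4

# Synthetic stand-ins for report features and the two supervision signals.
X = rng.normal(size=(n_companies, d_in))           # report-derived features
sectors = rng.integers(0, n_sectors, n_companies)  # (i) existing sector labels
returns = rng.normal(size=n_companies)             # (ii) stock market performance

# Parameters: one shared encoder, one head per task.
W_enc = rng.normal(scale=0.1, size=(d_in, d_emb))
W_cls = rng.normal(scale=0.1, size=(d_emb, n_sectors))
w_reg = rng.normal(scale=0.1, size=d_emb)

def forward(W_enc, W_cls, w_reg):
    H = np.tanh(X @ W_enc)                         # shared company embeddings
    logits = H @ W_cls
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    ce = -np.log(P[np.arange(n_companies), sectors]).mean()   # sector loss
    pred = H @ w_reg
    mse = ((pred - returns) ** 2).mean()                      # performance loss
    return H, P, pred, ce + 0.5 * mse              # 0.5: assumed task weight

H, P, pred, loss0 = forward(W_enc, W_cls, w_reg)

# One manual gradient step on the joint loss (backprop through tanh).
Y = np.eye(n_sectors)[sectors]
dlogits = (P - Y) / n_companies                    # grad of mean cross-entropy
dpred = 0.5 * 2 * (pred - returns) / n_companies   # grad of weighted mean MSE
dH = dlogits @ W_cls.T + np.outer(dpred, w_reg)    # both heads update the encoder
dpre = dH * (1 - H ** 2)                           # tanh derivative

lr = 0.1
W_cls -= lr * (H.T @ dlogits)
w_reg -= lr * (H.T @ dpred)
W_enc -= lr * (X.T @ dpre)

_, _, _, loss1 = forward(W_enc, W_cls, w_reg)
# After one step on the joint objective, loss1 should be below loss0.
```

The design point the sketch makes concrete is that both supervision signals flow back into the same encoder parameters (`dH` sums the two head gradients), which is how the shared company embeddings end up distilling only the report content relevant to industry exposure.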
