A survey of the applications of text mining in financial domain

Text mining has found applications in diverse domains. Recently, a large body of work has applied text mining techniques to problems in the financial domain. The objective of this paper is to provide a state-of-the-art survey of the applications of text mining to finance. These applications are categorized broadly into FOREX rate prediction, stock market prediction, customer relationship management (CRM) and cyber security. Since finance is a service industry, these problems are paramount to its operations and customer growth. We reviewed 89 research papers that appeared during the period 2000-2016, highlighted issues, gaps and key challenges in this area, and proposed some future research directions. Because many open problems are highlighted, this review should be useful to researchers entering the area.
