The impact of deep learning on document classification using semantically rich representations

Abstract. This paper presents a semantically rich document representation model for automatically classifying financial documents into predefined categories using deep learning. The model architecture consists of two main modules: document representation and document classification. In the first module, a document is enriched with semantics using background knowledge provided by an ontology and through the acquisition of its relevant terminology. Integrating the acquired terminology into the ontology extends semantically rich document representations with in-depth coverage of concepts, thereby capturing the whole conceptualization involved in documents. The semantically rich representations obtained from the first module serve as input to the document classification module, which uses deep learning to find the most appropriate category for each document. We use three different deep learning networks, each belonging to a different category of machine learning techniques, for ontological document classification with a real-life ontology. Multiple simulations are carried out with various deep neural network configurations, and our findings reveal that a feedforward network with three hidden layers and 1024 neurons obtains the highest document classification performance on the INFUSE dataset. For the same network configuration, the F1 score increases further by almost five percentage points, to 78.10%, when the relevant terminology integrated into the ontology is applied to enrich the document representation. Furthermore, we conduct a comparative performance evaluation against various state-of-the-art document representation approaches and classification techniques, including shallow and conventional machine learning classifiers.
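The enrichment step in the first module can be pictured with a small sketch. This is not the authors' implementation. It assumes a simple count-based concept vector, and the toy ontology, its concepts, the acquired terms, and the helper names `tokenize` and `concept_vector` are all hypothetical, chosen only for illustration. It shows how adding acquired terminology to each concept can make more of a document's content visible to the classifier.

```python
import re
from collections import Counter

# Hypothetical toy ontology: concept label -> terms acquired for that concept.
# Concepts and terms are illustrative assumptions, not taken from INFUSE.
ONTOLOGY_TERMS = {
    "funding": {"funding", "grant", "subsidy"},
    "loan": {"loan", "credit", "mortgage"},
    "tax": {"tax", "vat", "levy"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def concept_vector(text, ontology_terms, use_terminology=True):
    """Return one count per concept (concepts in alphabetical order).

    With use_terminology=False only the concept label itself is matched;
    with use_terminology=True every acquired term counts toward its concept.
    """
    counts = Counter(tokenize(text))
    vector = []
    for concept in sorted(ontology_terms):
        terms = ontology_terms[concept] if use_terminology else {concept}
        vector.append(sum(counts[term] for term in terms))
    return vector


doc = "The grant covers the loan interest; the subsidy is exempt from VAT."
labels_only = concept_vector(doc, ONTOLOGY_TERMS, use_terminology=False)
enriched = concept_vector(doc, ONTOLOGY_TERMS, use_terminology=True)
print(labels_only)  # [0, 1, 0]
print(enriched)     # [2, 1, 1]
```

In this toy setting the label-only vector misses the funding and tax content entirely, while the enriched vector picks it up. A vector of this kind would then be the input to the classification module.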
