An Efficient Semantic Graph-Based Approach for Text Representation

Text document representation is one of the main issues in text analysis tasks such as topic extraction and text similarity. The standard Bag-of-Words representation does not capture relationships between words. To overcome this limitation, we introduce a new approach based on the joint use of a co-occurrence graph and WordNet, a semantic network of the English language. A word sense disambiguation algorithm is used to establish semantic links between terms given their surrounding context. Experiments on standard datasets show that the proposed approach performs well.

KEYWORDS: Text representation, WordNet, graph, word sense disambiguation, semantics.
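The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the disambiguation algorithm, so a simplified Lesk-style gloss overlap is swapped in here, and the `TOY_SENSES` dictionary is a hypothetical, hand-written stand-in for WordNet. The function names `cooccurrence_graph`, `disambiguate`, and `semantic_edges` and the `window` parameter are likewise assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy sense inventory standing in for WordNet (an assumption):
# each word maps to candidate senses, each sense to a small set of gloss words.
TOY_SENSES = {
    "bank": {
        "bank.finance": {"money", "deposit", "loan", "account"},
        "bank.river": {"river", "water", "shore", "slope"},
    },
    "deposit": {
        "deposit.finance": {"money", "bank", "account", "payment"},
        "deposit.geology": {"sediment", "mineral", "layer", "river"},
    },
}


def cooccurrence_graph(tokens, window=2):
    """Undirected weighted graph; an edge weight counts how often two
    distinct words appear within `window` positions of each other."""
    graph = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word][other] += 1
                graph[other][word] += 1
    return graph


def disambiguate(word, context, senses=TOY_SENSES):
    """Simplified Lesk: pick the sense whose gloss shares the most words
    with the surrounding context. Returns None for unknown words."""
    candidates = senses.get(word)
    if not candidates:
        return None
    context_words = set(context) - {word}
    # sorted() makes tie-breaking deterministic.
    return max(sorted(candidates), key=lambda s: len(candidates[s] & context_words))


def semantic_edges(tokens, senses=TOY_SENSES):
    """Link two words when their chosen senses share at least one gloss word."""
    chosen = {w: disambiguate(w, tokens, senses) for w in set(tokens)}
    known = sorted(w for w, s in chosen.items() if s is not None)
    edges = set()
    for i, a in enumerate(known):
        for b in known[i + 1 :]:
            if senses[a][chosen[a]] & senses[b][chosen[b]]:
                edges.add((a, b))
    return chosen, edges


tokens = "the bank approved the loan after the deposit of money".split()
graph = cooccurrence_graph(tokens, window=2)
chosen, edges = semantic_edges(tokens)
```

In this toy sentence the context words "loan", "deposit", and "money" select the financial sense of "bank", and the resulting semantic edge between "bank" and "deposit" links two words that never fall inside the same co-occurrence window. A real pipeline would replace the toy inventory with WordNet synsets and could store both edge types in a single graph.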
