An Unsupervised Approach for Precise Context Identification from Unstructured Text Documents

Most documents produced and exchanged through media and social networks are unstructured. Given the volume of such documents on the Web, exploiting them is a tedious or even impossible task for humans without the assistance of dedicated algorithms and specialized systems for document classification or information extraction. To be efficient and relevant, such systems have to understand the content of these unstructured documents. The context (or topic) of a document is one of the basic pieces of information essential to understanding its content, and the more precisely that context is identified, the more relevant the understanding will be. This paper presents a precise context identification approach that is evaluated quantitatively and qualitatively on several reference corpora and compared to other context identification systems. The contexts identified by our model are much more precise than those identified by these other systems.
